1 Introduction

Railway Maintenance Control Centres are responsible for ensuring the continuous availability of infrastructure assets. Technological advances, including Remote Condition Monitoring (RCM) systems, are commonly used within these centres to provide reliable, real-time data regarding the status of assets and, consequently, to enhance operators’ situation awareness and decision-making. Increasingly, these control environments are embracing predictive maintenance and ‘Intelligent Infrastructure’ (Pedregal et al. 2004; Ollier 2006; Dadashi et al. 2014). In Intelligent Infrastructure, diverse sources of data regarding the current and historical state of the asset, or fleet of similar assets, are combined with contextual data (e.g. usage patterns or weather) to predict the future state of the asset, likely performance, and potential failure (Ollier 2006).

Despite much work on the technical aspects of Intelligent Infrastructure in rail and other contexts (e.g. Ollier 2006; Khan 2007; Márquez et al. 2003; Durazo-Cardenas et al. 2018; Vileiniskis et al. 2016), there are significant gaps in effective user-centred design and organisational deployment (Dadashi et al. 2014; Ciocoiu et al. 2017). The question remains as to whether introducing these technological advances will change the way operators make decisions when carrying out the cognitive processing associated with fault finding. The vital challenge for future technology is to achieve optimum presentation of information while complementing operators’ existing expectations. Technology adoption and efficient utilisation of Intelligent Infrastructure are, therefore, dependent on careful alignment with operators’ expertise, decision-making and coping strategies. Not only does this help the acceptance process, but many of these coping strategies also indicate important constraints, such as workload or gaps in information, that shape the way operators work.

Design requires a combination of sequential and contextual approaches to understand socio-technical performance (Bainbridge 1997). The present paper explores the work of railway maintenance operators in Great Britain (GB), using an approach that combines unstructured observation and structured cognitive task analysis to understand the general nature and constraints of maintenance decision-making (contextual) and to model the decision-making environments through the theoretical lens of the Decision Ladder (Rasmussen 1986) (sequential). This approach is used to determine the relationship between technology and decision-making, particularly with regard to coping strategies (Hollnagel and Woods 2005).

The scope of this paper is the immediate response to a fault, given that this can be amongst the most challenging situations in rail operations (Golightly and Dadashi 2017). The initial steps associated with identifying and managing a fault are crucial (Belmonte et al. 2011) and failure to rapidly identify and diagnose a problem can lead to ‘out-of-control’ situations and significant rail disruption (Dekker 2018). Also, false alarms and multiple alarms are known problems for workload and genuine fault detection (Wilkinson and Lucas 2002; Seagull and Sanderson 2001).

The primary contribution of the paper is to give a detailed analysis of a key function of the railway, maintenance control, that has received almost no human factors attention yet is critical to safe, high-performance operations. Second, the current paper contributes to the small, but growing, body of knowledge around human factors for predictive maintenance and Intelligent Infrastructure in rail by complementing Human–Machine Interface (HMI) design work (Houghton and Patel 2015), more organisational work (Dadashi et al. 2014; Ciocoiu et al. 2015, 2017; Kefalidou et al. 2015) and work in the related domain of electrical control (Dadashi et al. 2016) with cognitive analyses of maintenance operator decision-making. As such, this paper constitutes the first stage of ISO 9241-210 (understand and specify the context of use) for the development of advanced rail maintenance systems. Third, the paper demonstrates the practical application of some of the key methods (facilitation, observation, cognitive task analysis) and representational formats (the Decision Ladder) to structure our understanding of cognition and work in maintenance environments. These methods and analyses are crucial to systemic approaches such as Cognitive Work Analysis (Vicente 1999) or Event Analysis of Systemic Teamwork (EAST) (Walker et al. 2006).

2 Railway maintenance and cognition

2.1 The rail context

Rail infrastructure assets are safety critical. Failures of assets can lead to catastrophic accidents such as the derailments at Potters Bar (Butcher 2012) and Grayrigg (Rail Accident Investigation Branch 2011), both of which involved fatalities and acted as catalysts for significant changes in Great Britain’s rail sector. In addition, the need to go out into the field to inspect or maintain assets can be dangerous, requiring rigorous protection regimes to ensure the safety of trackworkers (Golightly et al. 2013). Finally, asset availability is a performance issue, with asset failure being a source of significant delay and customer dissatisfaction (Pant et al. 2016; Transportfocus 2017). The pattern seen in Great Britain is replicated globally (e.g. Belmonte et al. 2011), as evinced by reliable rail infrastructure being a core pillar of the EU Shift2Rail programme (Shift2Rail 2015).

Maintenance control for rail is responsible for the safe and timely maintenance of the rail infrastructure. In this way, maintenance control supports the real-time operation of the railways by communicating the availability of assets to front-line roles such as signallers. Maintenance control also supports the mid-term goals of the railways by setting out plans for renewal work and maintenance programmes. Finally, maintenance control contributes to the strategic goals of the railway by informing large-scale renewal and replacement of assets (van Amstel-van Saane 2007; Kefalidou et al. 2015).

While maintenance may be planned in accordance with pre-defined schedules, fault detection is also a key process. A fault can be reported by an individual such as a track worker, a member of the public (e.g. a social media contact regarding a rough ride), a signaller or a driver, or sensed through a wide range of remote condition monitoring equipment. Fault reports are fed into a control environment where the severity and time-sensitivity of the fault are investigated, and a maintenance action is planned and instructed to track workers. Therefore, one of the key activities of maintenance control is responding to fault reports in the form of alarms.

Predictive technology such as Intelligent Infrastructure is dependent on some degree of automation and decision support for the operator. This automation is in part due to the analytical overhead of sensing multiple data streams from multiple assets, and in part due to the algorithms for calculating the prediction of risk and failure associated with an asset. The volume of sensed data, and the complexity of calculation, necessitate automation. With the shift to prognostic systems, alarms will move from informing the operator of a current or recent event (e.g. failure of a piece of infrastructure) to include anticipatory alarms that warn the operator of an emerging risk (e.g. potential or future failure of a piece of infrastructure) (Dadashi et al. 2014). Concomitantly, alarms will shift from simple prompts for the operator to carry out further actions, including making diagnoses, to semantically rich messages carrying verbal, textual or pictorial information about the source or cause of the abnormality.
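To make this distinction concrete, the minimal Python sketch below contrasts a conventional reactive alarm with an anticipatory alarm driven by a predicted failure risk. The field names, threshold and risk limit are illustrative assumptions for this sketch, not parameters of any deployed railway system.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AssetReading:
    """A single sensed value from one asset (all fields are illustrative)."""
    asset_id: str
    value: float                    # e.g. a point machine swing current, in amps
    predicted_failure_risk: float   # 0..1, output of an assumed prognostic model


def reactive_alarm(reading: AssetReading, threshold: float = 10.0) -> Optional[str]:
    """Conventional alarm: fires only after the measured value breaches a limit."""
    if reading.value > threshold:
        return f"ALARM: {reading.asset_id} exceeded {threshold} (value={reading.value})"
    return None


def anticipatory_alarm(reading: AssetReading, risk_limit: float = 0.7) -> Optional[str]:
    """Prognostic alarm: warns of an emerging risk before any limit is breached."""
    if reading.predicted_failure_risk > risk_limit:
        return (f"WARNING: {reading.asset_id} predicted failure risk "
                f"{reading.predicted_failure_risk:.2f} exceeds {risk_limit}")
    return None


if __name__ == "__main__":
    reading = AssetReading("POINT-2041", value=8.5, predicted_failure_risk=0.82)
    print(reactive_alarm(reading))      # None: no limit has been breached yet
    print(anticipatory_alarm(reading))  # fires on predicted risk alone
```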

While there is a range of technology developments in the field of asset monitoring and predictive maintenance (Durazo-Cardenas et al. 2018; Vileiniskis et al. 2016; Shafiee et al. 2019), at some stage these technologies have to be integrated within the maintenance control environment. Infrastructure sensing and analysis requires the understanding of heterogeneous data sources and technical components (Ranjan et al. 2017); organisations adopting advanced maintenance analytics therefore need greater degrees of technical knowledge to interpret the data (Aboelmaged 2014), which often spans organisational boundaries. The human factors implications of such innovation are substantial, encompassing the user-centred design of technology, knowledge and change management, and training (Dadashi et al. 2014; Golightly et al. 2018).

In a context such as the railways, change is nearly always incremental, so there are always questions of technical integration of legacy systems (Kefalidou et al. 2015), along with the challenge of integrating configurations of new and old technology within pre-existing processes and structures (Ciocoiu et al. 2015). The volume of legacy in the railways, and the need to provide continuity of service to the railways wherever possible, even during major infrastructural change, mean that there is rarely a ‘big bang’ that allows radical overhaul of the control environment. Often, the local aspects of legacy (e.g. is there AC or DC traction? are there a number of critical, high demand points, signals, etc. as one might expect near a major junction or terminus?) are reflected in different asset management systems, procured at different times by different suppliers (Ollier 2006). This presents a challenge for new systems as each control environment has its own idiosyncrasies. It is critical to characterise this variation to predict the impact of a new technology and, where possible, configure it to local demands. There is, however, an opportunity in looking across control settings, as it allows us to pull out regularities in how operators handle maintenance control—strategies, decisions, and requirements—irrespective of local conditions.

The rest of this paper will focus primarily on maintenance, and specifically fault analysis decision-making, as the major topic of interest, with a view to understanding how it is influenced by different maintenance environments including current legacy technology. The key research questions are:

  1. What is the context of maintenance control and decision-making?

  2. What is the nature of fault decision-making, and how are strategies applied to manage time pressure, information overload or information gaps?

  3. What do models of these decisions look like, and what are the differences that emerge due to the differing contexts of maintenance control?

2.2 Theoretical background and approach

Control environments are moving towards greater integration and centralisation. The diversity of activities in control rooms, and the highly cognitive nature of the work, pose a severe challenge for understanding the work of operators and for designing effective systems. Operators often have a sequential, rule-based approach towards certain sources of information (Johannsen 1997). Presenting this information in a cohesive way that matches operators’ mental models and cognitive processing is essential for designing effective decision aids. This necessitates appropriate methods for data collection, in combination with an appropriate theoretical framework to analyse and model the results.

Rasmussen and Lind (1982) specified that the route to in-depth understanding of control settings is through investigating an operator’s activities rather than reviewing system requirements. A combination of sequential and contextual data collection and analysis should be conducted to ensure that diverse nuances of human behaviour working within control environments are understood and reflected in design (Bainbridge 1997).

The current study of maintenance control fault-finding response used a series of observational and field studies to develop an in-depth understanding of maintenance control rooms. Three different control environments were studied to reflect different local conditions and legacy both in terms of control equipment and rail infrastructure.

The approach used a combination of observation, informal interview and field study, coupled with a more structured knowledge elicitation activity. This knowledge elicitation was informed by the Critical Decision Method (CDM) (Klein et al. 1989; O’Hare et al. 1998). Participants were asked to recall a recent, challenging fault-finding incident, and then to consider each incident in terms of four stages. These four stages, derived from a study of railway electrical control (Dadashi et al. 2016) and based on models of alarm handling (Stanton 2006), were:

  1. Receiving notification of the fault (Notification)

  2. Checking if it is genuine (Acceptance)

  3. Diagnosing the fault (Analysis)

  4. Developing a course of corrective action (Clearance)

Given that one of the stated aims of the study was to compare different working contexts, a structure was needed to describe decision-making in each environment and compare it across settings. The lens for this work was Rasmussen’s Decision Ladder (1986). While this is a component of Cognitive Work Analysis (Vicente 1999), it predates CWA and can be used as a representational form in its own right (e.g. Banks et al. 2020). The decision ladder aims to identify various information processing types, which can be categorised into two groups: (1) information processing activities, and (2) the states of knowledge resulting from information processing. The general information flow forms a template, with each information type shown with a different symbol: typically, a box to represent processing and a circle to represent the resulting state of knowledge. The Decision Ladder can also represent shortcuts through cognitive processing, known as shunts and leaps, therefore supporting the expression of the kind of cognitive activity that is typical of expert performance. These shortcuts can also arise from automation that replaces the need for knowledge-based analysis on the part of the operator. By mapping a decision-making process onto a Decision Ladder template, it is possible to map out the stages of cognitive processing, where there may be shortcuts (due to expertise or automation), and how the ladder may vary due to different constraints such as the availability of technology. In this study, the aim was to specify a decision ladder that fitted all maintenance fault-finding situations, with a secondary outcome of identifying differences in the ladders across the different control rooms.
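As an illustration of how these elements of the ladder might be captured for comparison across control rooms, the short Python sketch below encodes processing activities, resulting states of knowledge, and shortcuts (shunts and leaps) as a simple annotated structure. This is a minimal representation assumed for illustration, not an established Cognitive Work Analysis tool, and the node names are abbreviated rather than Rasmussen’s full set.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Tuple


class NodeType(Enum):
    ACTIVITY = "information-processing activity"  # conventionally drawn as a box
    STATE = "state of knowledge"                  # conventionally drawn as a circle


@dataclass
class LadderNode:
    name: str
    kind: NodeType


@dataclass
class DecisionLadder:
    """Minimal representation: an ordered template plus expert/automation shortcuts."""
    nodes: List[LadderNode]
    # Shortcuts (shunts and leaps) link one node directly to a later one,
    # bypassing the knowledge-based processing in between.
    shortcuts: List[Tuple[str, str]] = field(default_factory=list)

    def add_shortcut(self, src: str, dst: str) -> None:
        names = {n.name for n in self.nodes}
        if src not in names or dst not in names:
            raise ValueError("shortcut must link existing nodes")
        self.shortcuts.append((src, dst))


# A stripped-down template; node names are illustrative only.
ladder = DecisionLadder(nodes=[
    LadderNode("alert", NodeType.STATE),
    LadderNode("observe information", NodeType.ACTIVITY),
    LadderNode("system state", NodeType.STATE),
    LadderNode("diagnose", NodeType.ACTIVITY),
    LadderNode("target state", NodeType.STATE),
    LadderNode("plan task", NodeType.ACTIVITY),
    LadderNode("procedure", NodeType.STATE),
])

# An expert (or automated) shortcut: recognising a familiar alarm pattern and
# jumping straight to a known procedure without knowledge-based diagnosis.
ladder.add_shortcut("system state", "procedure")
```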

Finally, factors such as time pressure or gaps in information can lead to the operator applying coping strategies. This is an expression of the need to balance thoroughness of analysis with efficiency (Hollnagel 2011). The trade-off is exacerbated when information is incomplete or insufficient, or when the operator is overwhelmed by information. Also, operators are likely to be human (for the foreseeable future), with cognitive biases, for example in their ability to interpret cumulative probabilities or in their assessment of risk (Costello and Watts 2014; Sundh and Juslin 2018). These factors lead to a set of coping strategies. Hollnagel and Woods (2005) proposed a taxonomy of typical coping strategies (see Table 1). These strategies allow the operator to deal with differing volumes and quality of information, and express one aspect of the bounded rationality of cognitive systems. Therefore, a step in the analysis was to apply these strategies to the observed and described processes of fault finding.

Table 1 Coping strategy taxonomy from Hollnagel and Woods (2005)

3 Methods

A series of data collection and analysis activities were conducted to tackle the research questions explored in this study, capturing both contextual and sequential aspects of maintenance fault finding. Figure 1 shows a diagram that summarises these research activities and their outputs.

Fig. 1 Research framework and methods

3.1 Domain familiarisation

A series of field observations, open structured interviews and workshops were conducted to facilitate familiarisation with various types of maintenance control centres. This included understanding existing remote condition monitoring technologies that are currently in use within railways. Details were collected of the main responsibilities, work settings and a brief description of fault analysis processes in these control rooms.

To start, a 1-h interview was conducted with a senior railway operator to facilitate the identification of various types of railway maintenance control centres and to categorise them in terms of geographical coverage and types of equipment. Three different types of control room were selected on the basis of these recommendations. These control rooms, although similar in terms of their job specifications and responsibilities, had different technologies distributed in different configurations. The control rooms were selected based on the amount and type of RCM equipment they had, and are referred to as locations A, B and C throughout this paper.

Document review was also conducted. This covered specifications of the RCM equipment, procedural manuals, and the roles and responsibilities of maintenance technicians. The initial input from the senior railway operator was followed by three field visits of 4 h each (a total of 12 h) conducted in the three maintenance control centres. These involved general observation of activities, and unstructured discussions with operational staff regarding tasks, priorities and the nature of maintenance work.

3.2 Critical decision method

3.2.1 Participants

Maintenance technicians at each location (A, B and C) were approached with the proposed study aims and invited to participate. Two maintenance technicians from each of the selected Maintenance Control Centres (MCC) participated in this study (n = 6). Participants were all male with an average age of 43 years, an average of 22 years of experience in various sectors of the railway, and they were all experienced at the task under observation. Interviewing six participants from the three maintenance control rooms took approximately 12 h.

3.2.2 Procedure

Ethical guidelines of the University of Nottingham were followed with approval from the University of Nottingham Faculty of Engineering Ethics committee. Participants were assured about data confidentiality and their anonymity.

Participants were asked to think of the most recent challenging fault situations they had gone through. These incidents were selected by participants as critical or challenging ones. Therefore, as well as informing the steps of decision-making, the choices of participants also provided an insight into their perception of what constitutes a challenging fault analysis situation.

The incident was then reviewed using a set of probes based on the CDM (O’Hare et al. 1998). Each of the four stages for fault handling (notification, acceptance, analysis, clearance) was discussed, using any of the following probes, as appropriate:

  1. How did you become aware of the fault? What was the cue in identification of the problem?

  2. What was the most important piece of information that helped you in making your decision?

  3. How certain were you regarding the information provided to you?

  4. How did you integrate all different sources of information to come to a conclusion?

  5. What artefacts did you use?

  6. In what order did you attend to various pieces of information?

  7. How aware were you regarding your surroundings as well as the fault’s context?
Once a given fault episode was completed, the process was repeated until the time available with each participant ended. Typically, this resulted in four faults per participant. A total of 24 fault episodes were recorded.

Due to the interviews being conducted in a live operational environment, data were not audio recorded but contemporaneous notes were taken using an analysis spreadsheet, discussed below.

3.2.3 Analysis

It is appreciated that obtaining an in-depth understanding of the strategies used for problem solving requires far more detailed and extended data collection than merely finding a pattern through a number of questions. However, these data are useful in developing a general view of operators’ potential approaches to overcoming complications while attending to a fault.

A decision analysis spreadsheet was developed to assist with grouping and structuring the functions of fault analysis using factors adopted from the CDM (O’Hare et al. 1998). Table 2 below shows an example of a completed spreadsheet for one of the fault analysis cases. The four stages of fault analysis are presented in the ‘goals/activities’ column. Additional notes and comments on design recommendations were also recorded for each alarm handling stage and further reviewed.

Table 2 Example of a completed decision analysis spreadsheet
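For illustration, one row of such a spreadsheet could be represented as in the Python sketch below. The stage names follow the four-stage model used in this study; the probe labels, example answers and recommendation text are hypothetical and do not reproduce the content of Table 2.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class FaultStage(Enum):
    NOTIFICATION = "Receiving notification of the fault"
    ACCEPTANCE = "Checking if it is genuine"
    ANALYSIS = "Diagnosing the fault"
    CLEARANCE = "Developing a course of corrective action"


@dataclass
class StageRecord:
    """One row of the decision analysis spreadsheet for a single fault episode."""
    stage: FaultStage                      # the 'goals/activities' column
    probe_answers: Dict[str, str]          # CDM probe -> participant's answer
    notes: str = ""                        # additional observations
    design_recommendations: List[str] = field(default_factory=list)


# Hypothetical, abbreviated example for the 'Acceptance' stage of one episode
row = StageRecord(
    stage=FaultStage.ACCEPTANCE,
    probe_answers={
        "most important information": "history of alarms at this location",
        "certainty about the information": "moderate; the sensor had failed before",
    },
    notes="Technician cross-checked the asset history before acting.",
    design_recommendations=["Surface asset history alongside the active alarm."],
)
```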

Participants’ comments regarding what the most important piece of information was, how certain they were about the information provided to them, and how they integrated all sources of information to reach a conclusion, provided cues as to the strategies they used to overcome information deficiencies. These were then mapped to the list of coping strategies adopted from Hollnagel and Woods (2005) presented in Table 1. A separate 1-h meeting with one of the maintenance technicians at location A was used to verify and confirm the identified strategies.
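The sketch below illustrates, in Python, the general shape of this mapping step. The strategy names are the subset of the Hollnagel and Woods (2005) taxonomy referred to in this paper, and the cue phrases are invented examples; in the study itself the mapping was an analyst judgement verified with a maintenance technician, not an automatic keyword match.

```python
from enum import Enum
from typing import List


class CopingStrategy(Enum):
    """Subset of the Hollnagel and Woods (2005) strategies referred to in this paper."""
    CATEGORISING = "categorising"
    FILTERING = "filtering"
    QUEUING = "queuing"
    SIMILARITY_MATCHING = "similarity matching"
    EXTRAPOLATION = "extrapolation"
    FREQUENCY_GAMBLING = "frequency gambling"


# Illustrative cue phrases only; the real mapping was a manual analyst judgement.
CUE_PHRASES = {
    CopingStrategy.FILTERING: ["ignored", "only looked at"],
    CopingStrategy.SIMILARITY_MATCHING: ["seen this before", "same as last time"],
    CopingStrategy.FREQUENCY_GAMBLING: ["usually turns out to be", "nine times out of ten"],
}


def suggest_strategies(comment: str) -> List[CopingStrategy]:
    """Return candidate strategies whose cue phrases appear in a participant comment."""
    text = comment.lower()
    return [s for s, cues in CUE_PHRASES.items() if any(c in text for c in cues)]


print(suggest_strategies("It usually turns out to be a false alarm, same as last time."))
# [CopingStrategy.SIMILARITY_MATCHING, CopingStrategy.FREQUENCY_GAMBLING]
```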

Further analysis examined differences in activities and strategies in relation to the type of artefacts and system distribution available in each control room. Decision ladders developed for each of the control rooms provided a means of comparing activities and strategies across the control rooms. Activities and strategies were first compared in terms of the artefacts available in each control room and then in terms of the distribution of the maintenance workstation within its larger control setting.

4 Results

4.1 Maintenance control centres

4.1.1 Functional overview

Observation and discussion with staff confirmed the basic principles of Maintenance Control Centres (MCCs). These are facilities with responsibility for maintaining the railway infrastructure, ranging from signalling and telecommunications facilities, through electrical systems and buildings, to track-borne infrastructure (e.g. point machines, track circuits). In GB, MCCs are widespread across the country and are equipped with various legacy systems. This variation is partially rooted in regional investments and the traffic-related needs of various locations. In addition, there are various control centres (referred to as National Control Centres) which monitor the performance of a wider region. Figure 2 shows an example of an operator workstation in a National Control Centre.

Fig. 2 Maintenance workstation in Maintenance Control Centre at location C (National Control Centre)

Three types of MCC are typical of the GB rail network.

  • The first is focused on the performance of the railway service infrastructures (i.e. signals and point machines). This is relatively local, and the maintenance control comprises a workstation located within a signal box responsible for regulating rail traffic (Maintenance location A).

  • The second type refers to maintenance control systems that are integrated and focused on a larger area of coverage, covering both service-related infrastructure and railway assets such as buildings and power boxes (Maintenance location B).

  • The third type of control centre is focused on a region within the railway network and monitors both service-related infrastructure and assets, as well as weather-related conditions that impact the state of the assets (e.g. wind gusts and ice). This type of centre controls and maintains the route and includes a number of roles, such as train operating company representatives, regulators, and maintenance technicians (Maintenance location C).

Examples of all three types of control room were visited as part of the study—see Table 3.

Table 3 The three maintenance control rooms of the present study

4.1.2 Role of the maintenance technician

The maintenance technician is responsible for detecting and dealing with operational failures, attending to fault logs, monitoring equipment to facilitate predictive maintenance and planning periodic and long-term maintenance checks. They support the railway service and provide aids to operational staff. In doing so, there are situations where maintenance technicians need to go to the site of a specific asset and locate asset-related information from the adjacent loggers and sensors. None of the control centres are staffed 24/7.

4.1.3 Fault management processes

When a fault is reported, various types of information are presented to the operator: location, equipment type and a brief indication of the fault. These may also appear as alarms within the maintenance control room to notify the operator of an infrastructure malfunction or abnormality. Logbooks are also used to record information: the date, the technician who attended to the fault, the Fault Management System (FMS) number, the equipment type (e.g. point machine, main signal, position light signal) and equipment ID, the controller unit, the field unit, an indication of a common fault (e.g. lamp failure, lost reverse detection, earth alarm) and common fix (e.g. filter unit replaced), as well as the current status of that fault (fixed, active, unknown or cleared on own). Finally, a more detailed description of each fault can be found in the automatically generated report.
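As a concrete illustration, a single logbook entry of the kind described above could be modelled as follows; the field names follow the description in the text, while the types and example values are assumptions made for this sketch.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum


class FaultStatus(Enum):
    FIXED = "fixed"
    ACTIVE = "active"
    UNKNOWN = "unknown"
    CLEARED_ON_OWN = "cleared on own"


@dataclass
class LogbookEntry:
    """One row of the maintenance logbook, as described in the text."""
    entry_date: date
    technician: str
    fms_number: str          # Fault Management System reference
    equipment_type: str      # e.g. point machine, main signal, position light signal
    equipment_id: str
    controller_unit: str
    field_unit: str
    common_fault: str        # e.g. lamp failure, lost reverse detection, earth alarm
    common_fix: str          # e.g. filter unit replaced
    status: FaultStatus


# Hypothetical example entry
entry = LogbookEntry(
    entry_date=date(2019, 3, 14),
    technician="Technician 1",
    fms_number="FMS-00123",
    equipment_type="point machine",
    equipment_id="P-2041",
    controller_unit="CU-7",
    field_unit="FU-12",
    common_fault="lost reverse detection",
    common_fix="filter unit replaced",
    status=FaultStatus.ACTIVE,
)
```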

The operators then assess the fault through their asset monitoring equipment, re-playing the asset behaviour leading up to the moment of failure in order to diagnose the fault. Often, operators require further information to build a mental image of the situation that led to the failure. This is then followed by sending a specialised track team to the field to rectify the failure and resume normal service. During this process, maintenance controllers are in communication with signallers, route managers and other operational staff to develop and share a good understanding of the impact of the failure on the service.

4.1.4 Maintenance control contexts

These maintenance control rooms had varying ranges of condition monitoring equipment. The comparison of the three provided insights into how operators cope with and adapt to the technological innovations that are being added to their existing control environments. It is interesting to note that the scope, high-level activities and roles of these maintenance operators were very similar. The main differences were due to the geographical location, the area of coverage and, more importantly, the technological capabilities available to operators in each of these control rooms.

Workstation at location A: The workstation had seven information displays. Artefacts available to the maintenance technicians included equipment linked to various fault monitoring and remote condition monitoring systems for monitoring the state of point machines and track circuits based on data from on-track sensors and loggers. Some of these interfaces were web-based, while others comprised stand-alone software applications. The systems available to the technicians had different interfaces that were not always consistent in their basic presentation. Apart from the use of similar colour coding (e.g. red for alarms and green for cleared), the format for information presentation differed between interfaces. In addition to the condition monitoring facilities, signalling displays of the area under coverage and Control Centre of the Future (CCF) (a wide area view of the regional network) were also available to the signalling technicians.

The control room at location A had only logging facilities, equipped with alarms to notify the maintenance operator when a logged value exceeded a certain threshold. Additionally, since the technicians were located in the same signal box as the signaller, they could overhear relevant information (e.g. signallers commenting that a point was not behaving as normal) and this, in turn, formed another source of information when it came to identifying the occurrence of a failure.

Workstation at location B: This workstation consisted of six information displays: five integrated information displays and one display used for web-based applications, as well as for the administrative tasks that the maintenance technician needed to fulfil as part of their duties. The information displays on the workstation provided information regarding signalling workstations, power supply, monitoring facilities for the office equipment, modems and other communication links. Location B had some predictive monitoring capability, but the system covered only local assets. It provided operators with detailed trends and graphs associated with a fault, assisting them in diagnosis.

Workstation at location C: This workstation consisted of nine displays. These covered various asset types, point monitoring, and wheel monitoring, but also a rich range of contextual information including weather monitoring and train schedules. A display was also dedicated to e-mail and other information resources. Location C was a national control centre and not only had many predictive monitoring solutions, but also covered a large geographical area. This technology provided diagnostic support, assisting operators in a more confident acceptance of the fault. The wide range of RCM equipment in the control room provided operators with duplicated information which could be beneficial in supporting diagnoses, though in some cases generated excessive information.

Looking across the range of control facilities and information displays, a number of features are common. First, operators have access to a number of legacy systems. Second, the information presentation formats differ from one system to another, even within the same workstation. Third, the tasks and activities of the operators remain ostensibly the same. While this was the outcome of the observation and familiarisation, the CDM aimed to ascertain whether this held true for operators’ cognitive processing and strategies.

4.2 Fault analysis through critical decision method

A total of 24 fault analysis episodes were recorded and analysed through the Critical Decision Method technique—nine of these fault analysis episodes were recorded in location A, nine in location B, and seven in location C.

From the 24 cases of fault finding, 13 different types of fault were selected by maintenance technicians. These faults were perceived by the operators to be the most recurring and challenging cases. False alarms, point failures and signal failures were selected more often than other cases. The distribution of fault types is shown in Fig. 3.

Fig. 3 Number of the faults reviewed in the study

These faults were selected by technicians due to both their frequency (i.e. false alarms are a constant occurrence) and their severity (i.e. point and signal failures can seriously impact the service; Golightly and Dadashi 2017). The fault process can be summarised across the four stages as follows:

Notification When a fault is reported, the operator is made aware of it. As well as being alerted by another controller or through audible and visual channels, the operator has to identify the location from which the fault originated and start analysing the situation on the basis of their local knowledge and experience.

Acceptance The second stage is to identify whether the fault is genuine or not. This is to assess the credibility of the data presented. If the fault is not genuine and the operator imposes an unnecessary speed restriction or even stops a train to send an investigation team to the track, this can lead to unnecessary delays and a waste of time and resources, as well as excess costs in terms of delay attribution fines.

Analysis The third stage of fault analysis is to assess the fault, seek potential causes of the fault and diagnose it.

Clearance Finally, the fourth stage refers to the development and evaluation of the optimum corrective action.

Nineteen of the fault cases followed this basic process. In the remaining five cases, where the technician was not completely certain whether the fault was genuine, a test of authenticity was performed; in two of these cases, where there was a false alarm, the technician assessed the causes associated with the generation of the false alarm. In these five cases, once the authenticity of the fault episode was established, the cause was diagnosed and a corrective course of action was selected.

4.3 Decision ladders

A canonical Decision Ladder was developed representing the basic process of fault identification and analysis in location A. This is shown in Fig. 4, showing the transition from incoming notification of a fault, through acceptance and analysis, to planning a course of action. These four areas are circled on the diagram.

Fig. 4 Decision ladder for fault analysis in control room at location ‘A’

One of the research questions in this study was whether changes in the artefacts and equipment available to operators would affect the process of fault analysis. Therefore, further decision ladders were developed, based on the data from the CDM interviews. The decision ladders of fault analysis in locations ‘B’ and ‘C’ are shown in Figs. 5 and 6 respectively.

Fig. 5 Decision ladder for fault analysis in control room at location ‘B’

Fig. 6 Decision ladder for fault analysis in control room at location ‘C’

The shaded areas in Figs. 5 and 6 refer to the activities that are assisted by the artefacts available in those control rooms. Although the workstation at location ‘A’ had no noticeable support from advanced equipment in the room, those at ‘B’ and ‘C’ used various technologies to diagnose faults and assist the investigation process. The second stage (confirmation) and the third stage (diagnosis) benefitted from increased analytical support. Most notably, at workstations ‘A’ and ‘B’, when operators wanted to check whether a fault was genuine, they applied their knowledge of the fault location and the history of that asset. In the control room at location ‘C’, operators had more trust in the system, potentially because the equipment had been maintained more regularly and alarm thresholds had been updated fairly recently. The sophisticated nature of the fault management systems, and the strategic nature of the operators’ role in this control room, contributed to this difference.

4.4 Strategies

Both the familiarisation studies and the CDM interview findings identified regular strategies and tactics applied by maintenance operators. Comments recorded during the CDM study were assessed against Hollnagel and Woods’ (2005) coping strategies (Table 1). Table 4 shows participants’ responses to the questions for a selection of fault analysis episodes, with probes around cues and information seeking being particularly relevant to uncovering strategies.

Table 4 Examples of technicians’ responses to the three questions about a selection of faults

Deficiencies in information presentation were one of the main challenges facing participants endeavouring to deal optimally with faults. There were at least six information displays on a technician’s workstation. While some duplication of information is inevitable given the safety-critical nature of the role, technicians also identified difficulties with unnecessarily redundant information and misleading data. Additionally, the temporal pressures associated with handling alarms often meant that technicians did not have sufficient time to search exhaustively for the information needed to handle a fault effectively. Hence, faults involving such information deficiencies were selected as representative cases for the CDM.

The seven faults listed in Table 4 were identified by participants as challenging due to some form of information deficiency and were, therefore, appropriate candidates for exploring operators’ strategies when dealing with information deficiencies. Strategies are presented in brackets. The strategies adopted by maintenance technicians to analyse the faults included categorising, filtering, queuing, similarity matching and extrapolation. Participants also tended to use the frequency of occurrence of events in the past as a basis for recognition (frequency gambling). Many of these strategies (categorising, filtering, queuing) were a response to the high number of alarms that were generated, sometimes by the same fault, and a means of managing tasks. Similarity matching, extrapolation and frequency gambling were more relevant to the interpretation of events: making sense of the genuineness of the alarm, its causes and, therefore, the restorative action, given factors such as previous occurrences at that location and similar events elsewhere.

When comparing the three control environments, locations ‘A’ and ‘B’ showed similar strategies, though participants at location ‘B’ did not need to do as much ‘filtering’ and ‘categorising’ since their more advanced condition monitoring systems helped with searching and grouping faults. However, they still used ‘extrapolation’ and ‘similarity matching’ when identifying and assessing whether a fault was genuine at the ‘acceptance’ stage. At location ‘C’, operations were more centralised, and operators had access both to more information and to more advanced analytics, removing the need for ‘filtering’ and ‘categorising’: the key aspects of a fault were clear and unambiguous. Also, because information was better integrated, operators at location ‘C’ needed to resort less to ‘extrapolation’ to fill gaps in their interpretation of a fault. They did, however, still engage in ‘similarity matching’, recalling a similar scenario to diagnose the fault and select an appropriate course of action.

5 Discussion

The study reported in this paper had three aims—to shed light on the cognitive work of rail maintenance controllers, which is a critical role, but has received little attention; to determine design recommendations for the development of future maintenance automation in the form of ‘Intelligent Infrastructure’; and to understand the value of the combination of methods used.

In terms of the nature of the maintenance role, the results suggest that maintenance is a cognitive task, adhering to conventional models of alarm/fault handling, and reliant on different levels of automation which play an increasing role. What is less expected is the variation in the role depending on location and scope of functions, though this matches similar experience in rail signalling (Pickup et al. 2013). The analysis clearly suggests that local conditions and needs, both in the maintenance control box and for the infrastructure covered, are an important factor when reflecting on the nature of the work. It is also interesting that ‘active overhearing’ and the ability to work with others is an advantage in some of these environments (location A). This suggests a team and distributed, rather than purely individual, orientation to the work.

One interesting aspect of the analysis is that the faults chosen for the knowledge elicitation (Fig. 3) were perceived by the operators to be either the most recurring or the most challenging cases. False alarms, point failures and signal failures were selected more often than other cases. These faults affect the immediate operation of the railways and operators found them more challenging, possibly due to the time pressure felt while analysing these fault situations. It is worth noting that a study of signallers and controllers in rail disruption (Golightly and Dadashi 2017) also identified point and signal failures as amongst the most challenging events, due to their wide-ranging causes and the need for extended diagnosis. It seems that, in most instances, operators did not have a clear view of the fault (e.g. due to lost communication between the sensor and the logger in a ‘lost data link’ fault) while, in other instances, they had too much information to analyse. Support for these cases would appear to be an area where there could be a significant gain for operations. These events are, therefore, worthy of special attention, both in terms of high-quality sensing and algorithms and in terms of Human–Machine Interaction (HMI).

Data collected about fault analysis episodes suggest that the second stage (confirmation) and the third stage (diagnosis) benefit from advanced technologies which can take on cognitive load. In both locations ‘A’ and ‘B’, when operators wanted to check whether a fault was genuine, they relied on their knowledge of the fault location and the history of that asset. In location ‘C’, operators had more trust in the system, potentially because of the sophistication of the fault management systems and the strategic nature of the operators’ role in this control room. This highlights that maintenance fault finding is a complex set of activities, and that human judgement and machine intelligence are tightly connected, rather than independent (Hollnagel and Woods 2005). Being able to reflect this complex process in the form of a decision ladder which includes both human and automation as a single cognitive system will allow designers to consider this process more holistically in future.

Review of the strategies adopted by operators during fault-finding episodes revealed that ‘filtering’, ‘similarity matching’, and ‘categorising’ are, respectively, the most utilised coping strategies when facing information deficiencies, particularly in those scenarios where responding to a fault is time critical and where thoroughness must be traded off against efficiency (Hollnagel 2011). Those points where coping strategies are applied indicate where automation may offer significant benefits. It also suggests that the design of HMI should support these functions and, similar to Golightly et al. (2018), rather than a black box of ‘red’, ‘amber’, ‘green’, the automation should support exploration of the reasoning behind decisions so that both the cause, and potential rectifying action, can be understood.

In terms of the second aim of highlighting design considerations emerging from the study, these are presented in Table 5. The design recommendations are primarily derived from the strategies, but the points at which they apply in the fault-finding process can be mapped to the four stages, as captured in the decision ladders. The design considerations noted in this paper are not particularly surprising and echo design principles and guidelines common to many other user interfaces. However, mapping these considerations to different cognitive processes and operator strategies will allow designers to target them at specific design components.

Table 5 Design considerations

Finally, in terms of the third aim of applying methods that combine contextual and sequential approaches, this paper showcases the possibility of understanding cognitive capabilities and strategies using relatively simple knowledge elicitation techniques, triangulated together. One of the key strengths of the method was gaining input from a senior member of staff early in the process. This not only identified the right (and varied) locations in which to perform the work, but also led to significant buy-in from the staff involved in the observation and the CDM.

There are limitations to the work. The participant numbers are somewhat small and, while this is a specialist community, it would be useful to validate the work, especially as new developments are coming online all the time in the Intelligent Infrastructure space. Another limitation concerns the level of participants’ understanding of the strategies. While attempts were made to ensure they understood the strategies, and while validation took place with a subject matter expert after the study, it would be useful to link the strategies back to more detailed and structured observation for confirmation.

6 Conclusions

Maintenance control fault finding is critical to rail performance and safety. It is also a function under change through the increasing use of automation and ‘Intelligent Infrastructure’. This study has shed light on the nature of this work, and how it varies by location and depending on the level of automation. Fault finding is a process of identifying the alarm, dealing with its veracity (as there are many false alarms) and coming up with diagnosis and mitigating actions. While operators in control environments with wider areas of control and more forms of automated support are likely to value this technology, particularly in the diagnostic phase, all roles use coping strategies to deal with gaps in information and, often, duplication and a surfeit of information.

Fundamentally, this work is a first step to describing the combination of human problem solving and automation in maintenance fault finding as an integrated unit of analysis as defined in cognitive systems engineering (Hollnagel and Woods 2005). This is vital if we are to avoid the pitfalls of data-driven, bottom-up design of automation, and instead move to a top-down decision-led model of design. The work has also uncovered false alarms, point failures and signal failures as key scenarios to support.

Future work includes the detailed design and validation of the design principles shown in Table 5. An additional and useful avenue of work would be to continue to understand the nature of maintenance control and how it interacts with, and supports, other rail functions. Walker et al. (2006) have used the Event Analysis of Systemic Teamwork (EAST) approach to study the role of maintenance in trackwork. EAST can also embody automation as an actor in a team. We advise applying this method to fault finding to fully map out the actors (including automation), tasks and information in this distributed activity, to determine key dependencies, fragilities and points of potential support. Eventually, this could be linked into existing analyses of other functions involved in the disruption management process as described in Golightly and Dadashi (2017).