1 Introduction

The phrase ‘no news is good news’ is particularly true for train operating companies: when the railways do make the headlines of the daily news, the item is usually filled with images of stranded passengers, overcrowded trains and blank information screens. These situations are typically caused by large incidents, such as severe weather conditions or power shutdowns. In extreme cases, the impact of such incidents is so large that they result in an out-of-control situation. With this term, we refer to situations where dispatchers no longer have an overview of the system, and therefore decide to terminate all railway traffic in a large part of the railway network, even though the required resources (infrastructure, rolling stock and crew) may actually be available.

Despite occurring infrequently, out-of-control situations have a huge impact. The termination of all traffic in a part of the network results in large numbers of passengers having to find different means of transport to complete their journey. These negative experiences may accumulate into serious reputation damage for the involved train operating company. Moreover, as a significant part of the rolling stock is in the wrong place (with respect to the schedule) once the out-of-control situation is over, it can take multiple days to recover the planned rolling stock schedule, potentially causing further cancellations and overcrowded trains.

A key factor that causes out-of-control situations is the inability of dispatchers to make effective rescheduling decisions when faced with extreme disruptions. Hence, one might hope that computerized support for generating modified timetables, rolling stock and crew schedules after disruptions will help to avoid these situations. However, existing disruption management techniques often require assumptions that severely limit their applicability to large-scale disruptions. In particular, the current state of the art in railway disruption management is only able to deal with isolated, well-defined disruptions (see Cacchiani et al. (2014) for a broad range of examples). It is usually assumed that there is only a single disruption, such as a partial or complete track blockage, that the duration is known, that all information about the resources is correct, and that all stakeholders in the operations act as expected. In practice, these assumptions are not always met. Real-time management information systems for the timetable, rolling stock and crew may lag behind, especially when disruptions cause many deviations from the regular schedules. In addition, train drivers and conductors may not be aware of rescheduling decisions made by dispatchers, or may even ignore them. Furthermore, the duration of a disruption often depends on the time needed for repairing malfunctioning or broken infrastructure, which can take longer or shorter than expected.

In order to develop effective approaches for dealing with out-of-control situations, it is necessary to better understand how multiple primary disruptions evolve into large-scale problems in the first place. We propose to do so by considering the various elements of the railway system (infrastructure, timetable, rolling stock and crew schedule, dispatchers and information systems) as parts of a complex system and analyzing it by using tools from complex systems science. There have already been attempts to capture dynamics of interacting trains from data (Monechi et al. 2018), but such concepts have not yet been applied to large-scale disruptions. Macro-scale dynamics in other network-based systems have been analyzed more in depth, for example, in epidemiology (Liljeros et al. 2003), vegetation systems (Tirabassi et al. 2014; Yin et al. 2016) and the energy grid (Buldyrev et al. 2010). We exploit similarities between these systems and the railway system. The generated insights are used to develop new disruption management techniques aiming to reduce the impact of out-of-control situations and, if possible, avoid them.

The contribution of this paper lies in reporting on a multidisciplinary framework for dealing with out-of-control situations, comprising two main parts. The first part involves the detection and prediction of large disruptions using complex systems science, with the aim of giving dispatchers sufficient time to respond to the situation and to understand in which region the situation is most critical. This allows us to study the evolution towards out-of-control situations and, ultimately, to predict them. The second part involves a number of countermeasures that can be applied in (near) out-of-control situations, based on techniques from operations research. The core idea is to completely decouple the operations in the disrupted region from the rest of the railway network. In addition, we propose the use of self-organizing, decentralized scheduling principles for rolling stock and crew, which are robust to the features of out-of-control situations and reduce the dependence on dispatchers.

The remainder of this paper is structured as follows. In Sect. 2, we give a detailed description of out-of-control situations, how they arise and what is currently done to prevent them. Sections 3 and 4 provide an introduction and overview of relevant railway disruption management and complexity science literature, respectively. In Sect. 5, we describe the framework for dealing with out-of-control situations. We conclude the paper in Sect. 6.

2 Out-of-control situations

Out-of-control situations typically arise after large incidents (e.g. a power shutdown in a crucial part of the network) or combinations of large disruptions. These disruptions can accumulate and easily spread over the network when the infrastructure is highly utilized and there are strong links between resource schedules. In such situations, decision making (both by dispatchers and local personnel) becomes slower and less effective due to the uncertainty in the disruption duration and the availability of resources. On top of that, the decision making process may lack updated information or human ability to adapt adequately to the situation. In these situations, the railway system can get into a state of out-of-control, which we qualitatively define as a situation ‘where dispatchers cease to have an overview of the system and consequently decide to terminate all railway traffic in the affected region, even though the required resources (infrastructure, rolling stock and crew) might be available.’

2.1 Out-of-control situations in the Netherlands

To illustrate the severity of these events, we discuss examples from the Netherlands throughout this section, after a short elaboration on the Dutch railway system in general. The railway system in the Netherlands consists of about 7000 km of tracks and has a large number of timetabled trips per kilometer of track, making it an interesting example to study out-of-control situations. The maintenance and management of the infrastructure is the responsibility of ProRail, the Dutch infrastructure manager. ProRail is also responsible for the timetable during real-time operations. The largest train operating company is Netherlands Railways (NS), handling approximately 1.3 million passenger trips each day. In the real-time operations, NS reschedules the rolling stock and crew and is responsible for providing the correct information to the passengers. The decision making takes place at nineteen different locations: five regional centers of NS, thirteen traffic control centers of ProRail and one national control center.

Multiple out-of-control situations in the Dutch railway network during the harsh winter of January and February 2012 led to one of the most extensive analyses of these events—a report of the Dutch Ministry of Infrastructure, NS and ProRail (Nederlandse Spoorwegen 2012). The authors found three main causes of out-of-control situations in the Dutch railway system:

  • The local nature of decision making. Because dispatchers have a locally restricted area of authority, the global picture is not always available. For example, to reduce workload, dispatchers might directly coordinate a route for a train through their area without registering this train in the system; this leads to so-called ‘ghost trains’, catching dispatchers in other areas by surprise.

  • The fragmented decision making process. In the Dutch railway system, the decision making is not only fragmented in terms of (spatial) area, but also spread across different organizations and coordination levels.

  • The loss of routine through the usage of all kinds of additional measures on such days. In the anticipation of extreme weather, timetables are often adapted prior to these events. However, it is argued that this might have a negative impact in these situations, because dispatchers normally rely strongly on their routine and experience with the timetable.

These reported causes of out-of-control situations are found not only in the Dutch railway system, but are also features of many railway systems around the world. For example, Schipper and Gerrits (2018) compare disruption management practices in different European countries. They find that the Belgian and Austrian railways have a level of decentralized disruption management similar to that of the Dutch railway system, and that the German railways have an even higher level. These systems are therefore susceptible to the same problems related to the local and fragmented nature of decision making.

Over the years, many changes have been made in the Dutch railway operations to reduce the probability of the development of these events, acting on the mentioned report of the Dutch ministry. The rescheduling procedures have been reshaped in order to accelerate the decision making process. NS also refined the reduced timetable that is used on days where extreme weather is expected. Using operations research tooling, NS is now able to completely reschedule the timetable, rolling stock schedule and crew schedule 16 h in advance (Fioole and Huisman 2018). While this certainly improves the controllability of the system, the downside of the reduced timetable is that about 20% of all trains are canceled (even 50% in the densely populated area in the west of the Netherlands called ‘Randstad’), strongly reducing the transport capacity (Trap et al. 2017; Fioole et al. 2019). Furthermore, as the decision to operate the reduced timetable is based on weather forecasts, in some cases it turns out that the measure was not necessary after all. Finally, as illustrated in the remainder of this section, not all out-of-control situations are caused by extreme weather conditions, once more highlighting the inadequacy of the current approach.

In the following, three examples of out-of-control situations in the Netherlands are discussed, illustrating various causes and development of these events.

2.1.1 3 February 2012: winter weather

Extreme weather is a major factor in triggering out-of-control situations, since it often causes multiple large disruptions around the same time (e.g. related to trees or other obstacles falling on tracks). It is estimated that out-of-control situations with causes related to extreme weather happened about ten times during the period 2009–2012.

The case of 3 February 2012 is one of these events and is analyzed in the earlier mentioned report of the Dutch Ministry of Infrastructure (Nederlandse Spoorwegen 2012). On this day, the extreme weather conditions led to 305 infrastructure disruptions, of which 20 involved problems with switches that lasted more than half an hour each. Furthermore, there were 250 problems with rolling stock, including six broken trains (the daily average is between one and two trains). The number of trains delayed because of missing personnel was 89, twice the usual number. As is typical for out-of-control situations, in many cases during this day passengers as well as crew members were uninformed about when or if trains would be running. The accumulation of these problems led to an increasing number of schedule alterations (by dispatchers) throughout the day. Despite the use of an adapted timetable, the problems ultimately resulted in the loss of overview and a subsequent shutdown of a large part of the network: an out-of-control situation.

The evolution of the delay on the day is visualized in Fig. 1. Initially, the disrupted area was confined between Amsterdam and Utrecht, but later spread towards Rotterdam and Roosendaal (for locations of mentioned cities, see Fig. 1). At the beginning of the evening, the delay even reached the far east of the Netherlands (Enschede).

Fig. 1

Average delay on the Dutch railway network, at four different times on 3 February 2012, a day with harsh winter weather. Abbreviations indicate passenger stations mentioned in the text: Amsterdam (Asd), Rotterdam (Rtd), Roosendaal (Rsd), Utrecht (Ut), Enschede (Es), Zwolle (Zl) and Heerenveen (Hr)

2.1.2 17 January 2017: electric outage

An example of an out-of-control situation that was not caused by extreme weather occurred on 17 January 2017 and involved a power outage in large parts of Amsterdam. The power outage started in the early morning and power was already restored at 07:15. Still, this disruption had a significant impact on the railway traffic around Amsterdam during the morning, with incorrect data in the railway’s information systems hindering all traffic to and from Amsterdam until after 10:00. Moreover, when the systems were up and running again, dispatchers were faced with a very large workload since the resource schedules were heavily disrupted. As a result, trains were running irregularly for the majority of the day. Ultimately, the regular service was not fully restored until 21:00. This example shows the typical long-lasting effect of disruptions, even ones that are quickly resolved.

2.1.3 18 January 2018: storm

A third example of a Dutch out-of-control situation was caused by a severe storm crossing the Netherlands and parts of Germany on 18 January 2018. This day started with a collision with a person at Heerenveen, which resulted in some problems in the morning that mainly affected the area around Zwolle. Soon after this, the storm set in and, because of fallen trees and damaged overhead lines, the fire department ordered the closing of several stations. Subsequently, the decision was made to cancel all train activity up to 14:00. This was extended to 16:00, and ultimately no trains ran until 17:00.

Around 17:00, the storm had settled and dispatchers tried to restart operations. However, the lack of an overview of the whereabouts of rolling stock and crew, in combination with many disruptions caused by trees having fallen on tracks, strongly limited the possibilities of dispatchers. For this reason, it was decided to broadcast a negative travel advice for the rest of the day, even though the storm had already passed.

2.2 Comparison and takeaways

The three cases reflect different evolutions of out-of-control situations. During the first (3 February 2012), many trains were still running and the delay had a lot of time to spread across the country. The second (17 January 2017) and third (18 January 2018) are cases where a standstill of a large part of the system occurred. To put the three case studies in perspective, we compare the total (summed) delay in Fig. 2. Although canceled trains technically do not contribute to delay, we use the delay as an approximation of how disrupted the system is. As a reference, the grey line and band show the delay evolution averaged over 365 days. It is visible that the railway system on 17 January 2017 (red) already returns to a normal state in the early afternoon, while the system on 3 February 2012 (black) remained disrupted up to the end of the day. The sudden decrease of delay on 18 January 2018 (orange) around 11:00 reflects the large-scale cancellations of trains. Also note that the positions of the total delay maxima vary: some occur in the early morning (red, orange), while others build up gradually (black).

Fig. 2

Total delay summed over the whole country (in hours) for different dates with (a) a regular and (b) a logarithmic vertical axis. Colors indicate different dates. As a reference, the average delay evolution is plotted over all days between 1 July 2017 and 30 June 2018 (shading indicates one standard deviation offset from the average)
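The reference band described in the caption of Fig. 2 can be illustrated with a minimal sketch. It assumes that the total delay of each day is available as a list of values per time slot; the function names, the data layout and the choice of the sample standard deviation are illustrative assumptions and are not taken from the paper.

```python
from statistics import mean, stdev

def reference_band(daily_totals):
    """Per time slot, return (average, average - sd, average + sd).

    daily_totals: list of days, each a list with the total delay (hours)
    per time slot. Hypothetical layout, assumed for this sketch only.
    """
    band = []
    for slot_values in zip(*daily_totals):
        avg = mean(slot_values)
        # Sample standard deviation; needs at least two days of data.
        sd = stdev(slot_values) if len(slot_values) > 1 else 0.0
        band.append((avg, avg - sd, avg + sd))
    return band

def slots_above_band(day, band):
    """Indices of time slots where a day exceeds average + one sd."""
    return [i for i, (value, (_, _, upper)) in enumerate(zip(day, band))
            if value > upper]

# Toy data: three reference days with two time slots each.
reference_days = [[1.0, 2.0], [3.0, 2.0], [2.0, 2.0]]
band = reference_band(reference_days)
print(slots_above_band([5.0, 2.0], band))
```

Comparing a disrupted day against such a band gives a simple, purely descriptive indication of when the system leaves its regular state.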

In all three case studies, dispatchers were unable to respond adequately to the disruptions. Furthermore, after a temporary standstill of the railway traffic, returning to the regular timetable is seen to be very difficult, leading to inefficient use of available resources. All in all, the case studies illustrate the severity of these events and clearly demonstrate the need for new, more flexible strategies for dealing with out-of-control situations.

3 Literature review on railway disruption management

When a disruption occurs, the timetable, rolling stock circulation and crew schedule need to be adjusted to obtain a new feasible schedule. Since solving the problem in an integrated manner leads to unacceptably long computation times, both in theory and in practice, the problem is usually decomposed and solved sequentially. First, the timetable is adjusted. The modified timetable then serves as input for the rolling stock rescheduling problem. Finally, both the adjusted timetable and rolling stock schedule are input for the crew rescheduling problem. It must be noted that such a sequential approach can lead to the situation where no feasible solution exists for one of the later stages due to a decision made in an earlier stage. Hence, it is sometimes necessary to re-solve the timetabling or rolling stock rescheduling problem until an overall feasible solution is found (Dollevoet et al. 2017). Recent surveys of proposed methods and algorithms for the different steps are presented in Cacchiani et al. (2014) and Ghaemi et al. (2017b).
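The sequential decomposition with feedback can be sketched as a simple loop. This is a minimal illustration, not the procedure of Dollevoet et al. (2017): the three solver callables, their return values and the feedback rule (trips that a later stage cannot cover are excluded from the next timetable) are all assumptions made for the example.

```python
def reschedule_sequentially(solve_timetable, solve_rolling_stock,
                            solve_crew, max_rounds=5):
    """Solve timetable, rolling stock and crew in sequence.

    All three solvers are hypothetical callables:
      solve_timetable(excluded)      -> set of trips to operate
      solve_rolling_stock(timetable) -> (schedule, uncovered trips)
      solve_crew(timetable, stock)   -> (schedule, uncovered trips)
    Returns the three schedules, or None if no overall feasible
    solution is found within max_rounds.
    """
    excluded = set()  # trips that a later stage could not cover
    for _ in range(max_rounds):
        timetable = solve_timetable(excluded)
        stock, uncovered = solve_rolling_stock(timetable)
        if uncovered:
            excluded |= uncovered  # feed back and re-solve the timetable
            continue
        crew, uncovered = solve_crew(timetable, stock)
        if uncovered:
            excluded |= uncovered
            continue
        return timetable, stock, crew
    return None

# Toy solvers: no rolling stock is available for trip "t3".
all_trips = {"t1", "t2", "t3"}

def toy_timetable(excluded):
    return all_trips - excluded

def toy_rolling_stock(timetable):
    return {t: "unit" for t in timetable}, {"t3"} & timetable

def toy_crew(timetable, stock):
    return {t: "driver" for t in timetable}, set()

result = reschedule_sequentially(toy_timetable, toy_rolling_stock, toy_crew)
print(sorted(result[0]))  # prints ['t1', 't2']
```

In the toy run, the first round fails at the rolling stock stage, and the second round succeeds after the uncovered trip has been removed from the timetable.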

3.1 Timetable rescheduling

Timetable rescheduling deals with finding a new feasible timetable by canceling, retiming, rerouting or reordering train services. Of the three rescheduling phases, timetable rescheduling has received the most attention in the literature. Approaches differ in the type of incident that has occurred (either a small disturbance in the timetable or a more serious disruption, such as a track blockage), in the level of detail at which the railway infrastructure is considered (either macroscopic or microscopic) and in the extent to which the inconvenience of passengers is taken into account. Objectives are usually to stay close to the regular timetable and to minimize the total or maximum delay.

Many microscopic approaches formulate timetable rescheduling problems as job scheduling problems, in which a number of operations (the passing of trains) with certain operation times (running times) have to be scheduled on machines (block sections), see e.g. D’Ariano et al. (2007). In case of small delays, such models can be solved within a reasonable amount of time. Macroscopic approaches use a higher level representation of the railway network, which has the advantage that additional aspects can be incorporated. For example, Schöbel (2007) introduces the problem of delay management, where one decides whether trains depart on time or should wait for delayed feeder trains. The objective in delay management is usually to minimize the total delay of all passengers combined. More recently, this problem has been extended with the routing of passengers (Dollevoet et al. 2012) and the capacities of stations (Dollevoet et al. 2014).
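The wait-or-depart trade-off at the heart of delay management can be illustrated for a single connection. The sketch below is a strong simplification and not the model of Schöbel (2007), which optimizes such decisions over the whole network: it assumes that a held train does not recover its delay, that missed passengers take the next train one headway later, and that there are no knock-on effects. All names and parameters are hypothetical.

```python
def wait_or_depart(feeder_delay, transfer_slack, n_transfer, n_onboard,
                   headway):
    """Compare total passenger delay (passenger-minutes) of both options.

    feeder_delay:   arrival delay of the feeder train (minutes)
    transfer_slack: buffer in the planned transfer time (minutes)
    n_transfer:     passengers transferring from the feeder train
    n_onboard:      passengers already on the connecting train
    headway:        time until the next connecting train (minutes)
    """
    # Minutes the connecting train must be held to keep the connection.
    hold = max(0, feeder_delay - transfer_slack)
    # Waiting delays everyone on the connecting train by the holding time.
    cost_wait = hold * (n_onboard + n_transfer)
    # Departing on time makes transferring passengers miss the connection.
    cost_depart = n_transfer * headway if hold > 0 else 0
    decision = "wait" if cost_wait <= cost_depart else "depart"
    return decision, cost_wait, cost_depart

# Holding for 5 minutes: 5 * 130 = 650 versus 30 * 30 = 900.
print(wait_or_depart(8, 3, 30, 100, 30))  # prints ('wait', 650, 900)
# With fewer transferring passengers: 5 * 110 = 550 versus 10 * 30 = 300.
print(wait_or_depart(8, 3, 10, 100, 30))  # prints ('depart', 550, 300)
```

The example shows why the decision depends on passenger numbers and not only on the size of the delay.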

Only a few contributions consider timetable rescheduling after larger disruptions. Louwerse and Huisman (2014) introduce the problem of finding a new timetable in case of partial or complete blockades. Additional constraints are added to increase the probability that a feasible rolling stock schedule exists for the modified timetable. Veelenturf et al. (2015) extend this model by considering a larger part of the network, allowing rerouting of trains and incorporating the transition from the regular timetable to the modified timetable and back. Ghaemi et al. (2017a) propose a different mixed-integer programming formulation for the same problem, incorporating railway infrastructure on a microscopic level. In a follow-up paper, Ghaemi et al. (2018) study the impact of uncertain disruption duration estimations on the rescheduling strategy and passenger delays by combining the rescheduling model with a passenger assignment model and a probabilistic disruption time prediction model. Zhu and Goverde (2019a) study dynamic passenger assignment for major railway disruptions, taking information interventions into account. Zhu and Goverde (2019b) propose a mixed-integer linear programming model for railway timetable rescheduling with flexible stopping and flexible short-turning during disruptions that also optimizes the short-turning locations depending on the available capacity and generated train delays. Zhu and Goverde (2020a) extend this model into a rolling horizon two-stage stochastic programming problem to deal with uncertainties of disruption durations. Zhu and Goverde (2020b) propose an integrated timetable rescheduling and passenger reassignment model during railway disruptions that extends the models in Zhu and Goverde (2019a, b) towards passenger-oriented timetable rescheduling.

3.2 Rolling stock rescheduling

The rescheduling of rolling stock calls for adapting the rolling stock circulation to the modified timetable by changing the compositions of certain trains. Sometimes, this implies that shunting movements are canceled or that new shunting movements are introduced. In case no train units are available, train services must be canceled. Hence, the goal is usually to minimize a combination of the number of canceled trains, the number of changed shunting movements and the difference with the planned end-of-day inventory at the stations.

Nielsen et al. (2012) present a rolling horizon approach for rescheduling rolling stock. In this approach, the rolling stock is rescheduled periodically, as information about the disruption is updated. The model used is based on a mixed-integer programming formulation of the rolling stock scheduling problem proposed in Fioole et al. (2006). Kroon et al. (2014) use the same model but also take passenger flows into account when rescheduling the rolling stock. Since disruptions can cause passengers to take different paths, their model tries to facilitate this change in demand by adapting the rolling stock schedule. To solve the problem, the authors iteratively compute a rolling stock schedule and simulate the corresponding passenger flows, until a satisfactory overall solution is found. In Van der Hurk et al. (2018) this model is extended with the possibility to steer passengers by providing travel advice, which is shown to improve the overall service quality even if only a part of the passengers follow the advice. Lusby et al. (2017) propose a path-based model to reschedule rolling stock, which they solve using column generation. Haahr et al. (2016) compare this approach with the composition model used by Nielsen et al. (2012) and Kroon et al. (2014) and conclude that both models are fast enough to be used in rescheduling contexts. Borndörfer et al. (2017) develop yet another rolling stock rescheduling approach based on a hypergraph model.

3.3 Crew rescheduling

When the timetable and rolling stock schedule are updated, it is known which tasks need to be executed by the train drivers and conductors. Crew rescheduling involves assigning these tasks to the crew members. Often, many changes to the crew schedules are necessary, as disruptions cause many duties to become infeasible. For example, a train driver on a delayed train might arrive too late for the next scheduled service, meaning that this service must be carried out by a different train driver. Many (labor) restrictions need to be respected when reassigning tasks, the most important one being that a crew duty should always end at the planned crew base. If a task cannot be assigned to any crew member, it must be canceled. This is especially undesirable for driving tasks, as this requires the rolling stock schedule to be updated once more. Therefore, the objective in crew rescheduling is usually to minimize the number of canceled tasks and changes to duties.

Huisman (2007) addresses crew rescheduling in the context of scheduled maintenance operations. As the number of possible duties is very large, the problem is solved using a combination of column generation and Lagrangian relaxation. Potthoff et al. (2010) consider the crew rescheduling problem when a disruption has occurred causing a blockage of a route. To keep the problem size tractable, first a core problem with a limited number of tasks is solved. In case the solution contains canceled tasks, tasks that are in some sense close to canceled tasks are added to the core problem. This process is repeated until all tasks are covered or a time limit is exceeded. Rezanova and Ryan (2010) develop a similar dynamic approach for the case where an entire train line is cancelled for some period of time. Veelenturf et al. (2012) extend the crew rescheduling problem by also allowing the retiming of trips. This increases scheduling flexibility, such that more tasks can be covered. In Veelenturf et al. (2014), uncertainty with respect to the length of the disruption is taken into account by requiring that duties have feasible completions in a number of different scenarios. A completely different approach to crew rescheduling is taken by Abbink et al. (2010). In this paper, train drivers are represented by driver-agents. In case the duties of some drivers have become infeasible, the driver-agents try to solve this by swapping tasks amongst themselves.
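The iterative core-problem scheme described above for Potthoff et al. (2010) can be sketched as follows. This is only an outline of the control flow: `solve_core` and `close_tasks` are hypothetical callables standing in for the actual optimization model and for the notion of tasks being "close" to a canceled task, neither of which is specified here.

```python
import time

def solve_with_growing_core(initial_core, solve_core, close_tasks,
                            time_limit_s=1.0):
    """Enlarge the core problem until no task is canceled or time is up.

    solve_core(core) -> (duties, canceled tasks)   hypothetical solver
    close_tasks(task) -> set of neighboring tasks  hypothetical rule
    """
    core = set(initial_core)
    deadline = time.monotonic() + time_limit_s
    while True:
        duties, canceled = solve_core(core)
        if not canceled or time.monotonic() >= deadline:
            return duties, canceled
        # Add tasks that are close to the canceled ones.
        extra = set()
        for task in canceled:
            extra |= close_tasks(task)
        if extra <= core:
            return duties, canceled  # core cannot grow any further
        core |= extra

# Toy instance: task "a" can only be covered once task "b" is in the core.
def toy_solve(core):
    canceled = set() if "b" in core else {"a"}
    return {task: "driver" for task in core - canceled}, canceled

def toy_close(task):
    return {"b"} if task == "a" else set()

duties, canceled = solve_with_growing_core({"a"}, toy_solve, toy_close)
print(sorted(duties), canceled)  # prints ['a', 'b'] set()
```

Starting from a small core keeps each subproblem small, and the core only grows where the current solution still contains canceled tasks.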

3.4 Human factors

Now that decision support systems are becoming more prevalent, it is important to recognize that the design of such systems requires careful consideration of human factors. This is mainly due to the heterogeneous nature of disruptions (Golightly and Dadashi 2017) and differences in the competences of dispatchers and the strategies that they apply (Belmonte et al. 2011). Moreover, communication and social interaction is a key feature of disruption management, as dispatchers operate in a dynamic, complex and distributed environment (Farrington-Darby et al. 2006). Finally, also the way the railways are organized in terms of institutional arrangements plays a role in how disruptions are managed (Schipper and Gerrits 2018; Steenhuisen et al. 2009). The extent to which disruptions are handled decentrally versus centrally differs strongly between railway systems. Furthermore, in many countries, infrastructure managers and train operating companies are separated, which can be a source of conflict in disruption management as both parties have different service targets or different views on how to realize those targets.

3.5 Takeaways

There is a vast amount of literature on disruption management for railway systems. However, only a few contributions (Ghaemi et al. 2018; Nielsen et al. 2012; Van der Hurk et al. 2018; Veelenturf et al. 2014) take the uncertainty that comes with major disruptions into account, at least to some extent. Furthermore, the largest disruptions that are considered in the literature are complete blockages of one route for a number of hours. For combinations of larger disruptions, the performance of current models is unknown. On top of that, the effectiveness of the proposed methods is completely dependent on the data accuracy in information systems and the willingness of stakeholders to cooperate, two assumptions that are often violated in case of larger disruptions. These observations lead us to the conclusion that the current state-of-the-art of railway disruption management is unable to cope with out-of-control situations.

4 Perspectives from complexity science

To better understand and cope with out-of-control situations such as those described above, we propose to combine operations research with techniques from ‘complexity science’. In this section, we provide a brief introduction to this field of study. While generally not treated as a separate scientific field, complexity science refers to a vast collection of methods involving data analysis and modelling of systems consisting of a multitude of interactions. These methods are usually based on principles from statistical physics and mathematics, in particular graph (network) theory and dynamical systems theory. Typically studied behavior involves nonlinear dynamics, critical transitions or emergent phenomena, with examples found in the Earth’s climate system (Runge et al. 2019), urban systems (Ouyang et al. 2012), power grids (Buldyrev et al. 2010), biological systems like epidemics (Liljeros et al. 2003; Scarpino and Petri 2019) and social systems (Sobkowicz et al. 2012). In fact, various concepts in complexity science coincide, albeit under different names, with concepts in specialized fields, but complexity science generalizes these concepts so that they are applicable to other fields as well.

Understanding out-of-control situations, like the other examples mentioned above, requires analyzing the system as a whole. In other words, while there are many studies deriving detailed statistics on spatially confined areas (e.g., particular lines or stations), the sum of these statistics may provide insights into regular railway dynamics, but presumably not into out-of-control situations, in which the interactions between these individual elements are important. Complexity science, both theoretical and applied, focuses on such ‘systems thinking’: investigating the interactions of individual elements (e.g., trains) and how they can give rise to emergent behavior.

In particular, this leads to the study of macro-states: system-wide scenarios corresponding to a certain characteristic that are often in strong contrast to each other. Transitions between macro-states may take the form of critical transitions or tipping points (Scheffer et al. 2009): sudden changes driven by background conditions, which may even set in motion other macro-state transitions. Examples of such tipping points in nature are found in large ocean circulations (Stommel 1961; Dekker et al. 2018), meltdown of the Greenland ice sheet (Lenton 2012; Scheffer et al. 2009) and vegetation patterns (Hirota et al. 2011). In socio-technical systems like transportation systems, it is argued that such macro-states and associated transitions are also found. This happens in particular in the context of disruptions, where the term ‘resilience’ refers to the ability to (quickly) recover from a perturbed state back to the regular state, sometimes through an intermediary state of decreased, but controlled, efficiency (Bešinović 2020).

Complexity methods and the analysis of system-wide dynamics in transportation systems in general are not new. For example, air transport typically requires system-wide analysis (Pagani et al. 2019; Lordan et al. 2015; Guo et al. 2019; Monechi et al. 2015). Also in the railway literature, examples of studies capturing complex interactions can be found. For example, Bhatia et al. (2015) study the effect of station (node) failure in the Indian railway network and relate this to network-based recovery techniques. Monechi et al. (2018) aim to identify universal laws in delayed train interaction in the Italian and German railways. However, to the knowledge of the authors, complexity methods have rarely been applied to the question of how delay evolves in severe railway disruptions and how these findings should connect to the existing literature on disruption management. The framework presented in this paper connects these two.

5 Framework for dealing with out-of-control situations

As we have seen in the previous section, existing disruption management techniques are ineffective when it comes to preventing or reducing the impact of out-of-control situations. Therefore, in this section we propose a new framework for dealing with such situations.

The framework is based on three key building blocks: (i) early warnings, (ii) isolation and (iii) decentralized decision making. In case of a situation that might become out-of-control, early warning signals are essential in order to buy dispatchers time to respond and take the necessary precautionary measures. In case the disruption cannot be handled using conventional approaches, we propose to isolate the disruption: completely decoupling part of the network, denoted the disrupted region, such that no trains or crews are allowed to cross the borders of this region. Although unconventional, this measure prevents the disruption from propagating further through the network and may thus be appropriate under severely delayed circumstances. Moreover, by decoupling the relevant region, the rest of the country can be assumed to be under control, which in particular means that complete information is available. This way, the decoupling allows for tailored disruption management strategies for both parts. Inside the disrupted region, we propose the use of decentralized decision making to dispatch rolling stock and crew, in order to reduce the dependence on central dispatchers and quickly restore a reasonable service.

The entire framework we propose is shown in Fig. 3. It contains two parts, subdivided into six steps. The first part comprises generating early warning signals and localizing the disrupted region, utilizing existing methods in complexity science. In the second part, techniques from operations research are used to find appropriate rescheduling measures, with the aim to minimize the impact of the disruption and maintain a high quality service. The majority of the steps cannot be solved using existing approaches but require the development of new methodologies.

Fig. 3 The proposed framework for dealing with out-of-control situations

A possible seventh step of the framework is to re-couple the isolated region with the rest of the network, and to transition back to the regular timetable once the disruption is over. However, such an operation is highly complex and could easily lead to repeated loss of control. Hence, the safest option is to keep the two parts separate for the rest of the day: during the night, there is sufficient time to set up the resources again in order to start the regular timetable the next day.

In the remainder of this section, we elaborate on each step in more detail and indicate which techniques can be used to support the decisions that are required to be made per step.

Step 1 Anticipate delay evolution using early warning metrics

In order to reduce the impact of out-of-control situations or prevent them completely, one needs to anticipate these events as early as possible. In our framework, we view the development of an out-of-control situation as a state transition, as is often done in physics: a transition from one state (system at rest) towards another (system disrupted). The early warning signals are then found by defining these states and investigating how this transition works. In short, Step 1 of our framework requires (a) a definition of an early warning procedure, and (b) a definition of states (‘that are to be warned about’).

As mentioned in Sect. 4, (macro-)states are an important topic in complexity science. Mathematically, they are often treated as (stationary or non-stationary) equilibria, and transitions from one to another are referred to as bifurcations, or tipping points. A common data-based approach is to look at statistical metrics like increased auto-correlation and variance (Scheffer et al. 2009; Thompson and Sieber 2011), using historical data. These metrics are well established in physical systems, but cannot directly be applied to the railway system due to its high degree of heterogeneity and discontinuity of processes. The absence of delay is actually associated with strong auto-correlation, and high delay variance in a station does not point towards a ‘critical slowing down’ (i.e., the natural increased vulnerability to perturbations prior to a state transition), but may simply imply that a lot of trains pass by. A related (data-based) methodology for railway systems is presented in Dekker et al. (2019), where an early warning procedure is based on one year of statistical data from the Dutch railways.
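For reference, the two classical indicators mentioned above can be computed over a sliding window as in the following minimal Python sketch. The function names, the window length and the convention of returning 0.0 for a constant window are illustrative assumptions and not part of the cited works. As noted above, the raw indicators would need adaptation before they are meaningful for railway delay data.

```python
from statistics import fmean


def variance(xs):
    """Population variance of a non-empty sequence."""
    mean = fmean(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)


def lag1_autocorrelation(xs):
    """Lag-1 autocorrelation estimate of a sequence.

    Assumption: for sequences that are too short or constant the statistic
    is undefined, and 0.0 is returned by convention.
    """
    n = len(xs)
    if n < 2:
        return 0.0
    mean = fmean(xs)
    denom = sum((x - mean) ** 2 for x in xs)
    if denom == 0:
        return 0.0
    num = sum((xs[i] - mean) * (xs[i + 1] - mean) for i in range(n - 1))
    return num / denom


def rolling_indicators(series, window):
    """Return one (variance, lag-1 autocorrelation) pair per full window."""
    out = []
    for end in range(window, len(series) + 1):
        chunk = series[end - window:end]
        out.append((variance(chunk), lag1_autocorrelation(chunk)))
    return out


# Hypothetical delay series (minutes) for one location, one value per time step.
example_delays = [0, 1, 0, 2, 1, 3, 2, 5, 4, 8, 7, 12]
indicators = rolling_indicators(example_delays, window=6)
```

A rising trend in both indicators over successive windows is what the physical-systems literature would read as a warning sign; whether that interpretation carries over to a given railway data set has to be checked separately.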

Another way to identify state transitions is a model-based approach: capturing the dynamics in a model and examining its properties to find situations in which such transitions happen. In some models, these transition (bifurcation) points can be derived mathematically, while in other models this is more subtle. However, in the absence of models that simulate large-scale disruptions well (as shown in Sect. 2), these concepts cannot directly be applied to railway systems. Still, there are related attempts in the literature to find structural behavior of railway systems. For example, Monechi et al. (2018) analyzed railway logistics from Germany and Italy, finding a set of ‘rules’ by which some delay is propagated. Kecman and Goverde (2015) used Dutch railway data and focused on quantifying parameters of running and dwell times, which are important uncertainties in microscopic models. Goverde (2010) took an analytical approach to describing the system, using the timetable and a parametrization of quantities like dwell times to build a forward integration model. Furthermore, Ball et al. (2016) showed the equilibrium diagram of a simple model when connecting the rolling stock layer with a crew layer, illustrating the effect of interdependent networks. These papers illustrate different approaches to defining structural railway dynamics, but there is no overall consensus on a macroscopic approach, making it hard to make accurate predictions for large disruptions from a (purely) modelling perspective (Monechi et al. 2018). The problem of heterogeneity and the absence of deterministic physical equations is not unique to railways, and methods used in other systems can be of use in this step. For example, Sebille et al. (2012) used a transition matrix method to predict the movement of plastics in the ocean. Another example is the interaction between forest and savanna systems, where Hirota et al. (2011) showed various types of large-scale pattern formation.

As an illustration of a model-based approach, consider a transition matrix (as in Sebille et al. 2012) to predict the evolution of delay in a railway system. From data, one can derive a statistical relation between delay in one region and delay in another region. This principle is used to create an \(N\times N\) transition matrix \(T\), where \(T_{ij}\) is the contribution of delay at location \(i\) to the future delay at location \(j\). Although such a simple procedure would involve many assumptions (e.g., the Markovian character of delay propagation), it may shed light on delay correlations and cause-and-effect on a larger scale than would be seen in existing agent-based models. Then, a quantitative definition of what is ‘out-of-control’, or of any other undesired state (like the ‘disrupted state’ defined in Dekker et al. (2019)), is necessary to pinpoint which states need to be anticipated and which states the early warnings should warn about. Early warnings are then found by propagating the model using Monte Carlo simulation techniques and finding moments in time at which there is at least a certain predefined likelihood of entering such an undesired state. Those moments in time would set off an alarm, providing an early warning signal.
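To make this procedure concrete, the following Python sketch propagates a delay vector \(d\) with the update \(d_j(t+1) = \sum_i T_{ij}\, d_i(t)\) and estimates, by Monte Carlo simulation, the likelihood of entering an undesired state within a given horizon. Several ingredients are assumptions made purely for illustration: the undesired state is defined as the total delay exceeding a threshold, new primary delays are drawn from an exponential distribution, and all function and parameter names are hypothetical. The matrix \(T\) is taken as given; estimating it from data is not shown.

```python
import random


def propagate(delay, T):
    """One step of delay propagation: new_delay[j] = sum_i delay[i] * T[i][j]."""
    n = len(delay)
    return [sum(delay[i] * T[i][j] for i in range(n)) for j in range(n)]


def probability_of_undesired_state(delay0, T, horizon, threshold,
                                   noise_mean, runs, seed=0):
    """Estimate the probability that total delay exceeds `threshold`
    within `horizon` steps.

    Assumptions (not from the paper): the undesired state is 'sum of delays
    above threshold', and each location receives an independent exponential
    primary delay with mean `noise_mean` per step (none if noise_mean == 0).
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(runs):
        delay = list(delay0)
        for _ in range(horizon):
            delay = propagate(delay, T)
            if noise_mean > 0:
                delay = [d + rng.expovariate(1.0 / noise_mean) for d in delay]
            if sum(delay) > threshold:
                hits += 1
                break
    return hits / runs


def early_warning(delay0, T, horizon, threshold, noise_mean,
                  alarm_level, runs=1000, seed=0):
    """Raise an alarm if the estimated likelihood reaches `alarm_level`."""
    p = probability_of_undesired_state(delay0, T, horizon, threshold,
                                       noise_mean, runs, seed)
    return p >= alarm_level


# Hypothetical two-location example: half of the delay persists locally,
# a quarter of the delay at location 0 spills over to location 1.
T_example = [[0.5, 0.25],
             [0.0, 0.5]]
alarm = early_warning([10.0, 0.0], T_example, horizon=5, threshold=30.0,
                      noise_mean=2.0, alarm_level=0.2)
```

Running the check at every time step with the currently observed delays as `delay0` would yield the moments at which the alarm is set off.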

Step 2 Identifying and isolating the disrupted region

After an early warning signal of an imminent out-of-control situation has been issued and conventional control measures have proven not to be effective, we propose that a specified region, referred to as the ‘disrupted region’, be isolated. Determining the size and boundaries of this region is not trivial, as they depend not only on the prediction given in Step 1, but also on which part is optimally decoupled in terms of logistics. Here, we mention a few important considerations in the definition of a disrupted region.

First and foremost, one needs to consider whether it is necessary to decouple a region at all. Even if early warning indicators (Step 1) anticipate a large disrupted system, there are many alternative countermeasures to consider and the system might remain controllable (although disrupted). Second, in some situations (e.g., when a station is completely disrupted), several stations or tracks may be forced to be at the boundary of the disrupted region. Third, one needs to identify tracks that have a large impact on the propagation of delay throughout the country: isolating parts of these tracks reduces the propagation of delay more strongly. These tracks can be identified as propagation corridors using the statistical models used to create the early warning signals. Fourth, the amount of rolling stock within and outside the disrupted region needs to be considered. Locking a large disrupted region when there are very few trains in the area reduces the efficiency of the logistics. Fifth and finally, the disrupted region should not be too large, as the service level within the region is likely to be lower than in the rest of the network, since decentralized dispatching strategies will be used to schedule the resources within the disrupted region. However, it should not be too small either, because the robustness of the decentralized dispatching strategies may drop if there is not enough room for adaptation.

Because of the above considerations, a sensible approach would be to create a set of predefined regions that can be isolated in case of emergency. This would be comparable to the concept of contingency plans that are used in, for example, the Dutch railway system. Such plans are predefined protocols that prescribe how a disruption at a specific location should be handled. While the exact moment in time would affect the number of resources (rolling stock and crew) in the region, which is dealt with in the next steps, limiting the decision space to a finite set of regions suited for isolation already accounts for considerations related to schemes and infrastructure and allows for quick decision making.

Step 3 Rescheduling the non-disrupted region

Outside the disrupted region complete information is available, so conventional disruption management techniques can be applied to reschedule the railway traffic in this part of the network. The rescheduling of the crew is the most complicated, as crew duties must end at their fixed base and it is likely that crew members outside the disrupted region have their base inside the disrupted region (and vice versa). This problem can be addressed by, for example, imposing that the duties of such crew members should end at the boundary between the two regions and taking into account the expected time it takes for them to travel back to their base.

Even though this step can be approached as a traditional disruption management problem, this type of disruption, a combination of (possibly many) track blockages, is of greater size than what is typically considered in the existing literature. Since computation times are likely to increase with the size of the disruption, dedicated (possibly heuristic) algorithms need to be developed in order to find good solutions for this problem in a reasonable amount of time.

Step 4 Determining a modified line system for the disrupted region

When the disrupted region is decoupled from the rest of the network, it is unlikely that the original line system, specifying which lines are operated at which frequencies, can be maintained. There are two main reasons for this. First, as the platforms at the boundary stations are divided among the disrupted and the non-disrupted region, and turning a train takes more time than simply continuing in the same direction, the railway infrastructure is unlikely to allow for the same number of trains as in the regular line system. Second, as there is only a limited amount of rolling stock available within the disrupted region at the time of decoupling, and trains are not allowed to transfer between the regions, it is possible that there is insufficient rolling stock available to operate the regular line plan. As such, the line system for the disrupted region will generally need to be modified.

The infrastructural and rolling stock considerations described above can be included in a mixed-integer programming model for modifying the line plan, effectively moving line planning from the strategic to the operational setting. As few existing line planning models take the available infrastructure and rolling stock into account, and such integration is known to be computationally challenging (Schöbel 2012, 2017), this problem asks for novel solution approaches that (partially) integrate timetabling and rolling stock scheduling into the line planning problem without leading to long computation times.

A first attempt towards solving this problem has been made by Van Lieshout et al. (2020). In this paper, the authors take infrastructure and rolling stock into account in a Benders-like fashion. The master problem corresponds to the line planning problem and suggests a line plan that minimizes some measure of passenger inconvenience. The sub-problem then evaluates and checks whether the line plan results in a feasible timetable, and adds one or more cuts to the master problem if this is not the case. The authors show that this method finds workable and passenger-friendly line plans in a short amount of time.
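The interplay between master problem and sub-problem can be illustrated with a toy Python sketch. It is not the model of Van Lieshout et al. (2020): the master problem is solved by brute-force enumeration instead of mixed-integer programming, the feasibility check is a hypothetical capacity limit on the number of trains per hour at a station (standing in for the timetable feasibility check), and all names and numbers are invented for illustration.

```python
from itertools import product


def solve_master(lines, freq_options, inconvenience, cuts):
    """Pick the cheapest frequency assignment that satisfies all cuts.

    A cut (cut_lines, limit) requires sum of frequencies over cut_lines <= limit.
    Brute-force enumeration stands in for a MIP solver (assumption).
    """
    best_plan, best_cost = None, float("inf")
    for freqs in product(freq_options, repeat=len(lines)):
        plan = dict(zip(lines, freqs))
        if any(sum(plan[l] for l in cut_lines) > limit
               for cut_lines, limit in cuts):
            continue
        cost = inconvenience(plan)
        if cost < best_cost:
            best_plan, best_cost = plan, cost
    return best_plan, best_cost


def check_feasibility(plan, station_lines, station_capacity):
    """Hypothetical sub-problem: return one cut per overloaded station."""
    cuts = []
    for station, lines_here in station_lines.items():
        limit = station_capacity[station]
        if sum(plan[l] for l in lines_here) > limit:
            cuts.append((tuple(lines_here), limit))
    return cuts


def benders_like(lines, freq_options, inconvenience,
                 station_lines, station_capacity, max_iter=50):
    """Alternate between master and sub-problem until the plan is feasible."""
    cuts = []
    for _ in range(max_iter):
        plan, cost = solve_master(lines, freq_options, inconvenience, cuts)
        if plan is None:
            return None, None  # no plan satisfies the accumulated cuts
        new_cuts = check_feasibility(plan, station_lines, station_capacity)
        if not new_cuts:
            return plan, cost
        cuts.extend(new_cuts)
    return None, None


# Hypothetical instance: two lines sharing boundary station X, which can
# handle at most three trains per hour after decoupling.
demand = {"A": 100, "B": 60}


def lost_service(plan):
    """Illustrative inconvenience: demand-weighted loss relative to frequency 2."""
    return sum(demand[l] * (2 - f) for l, f in plan.items())


plan, cost = benders_like(["A", "B"], (0, 1, 2), lost_service,
                          {"X": ["A", "B"]}, {"X": 3})
```

In this instance the first master solution runs both lines at frequency 2, the sub-problem rejects it because station X is overloaded, and the resulting cut steers the master problem towards reducing the frequency of the line with the lower demand.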

Step 5 Scheduling rolling stock and crew in the disrupted region

Since out-of-control situations are characterized by great uncertainty regarding the exact whereabouts of the rolling stock and crew, it is not possible to communicate detailed instructions to the crew. Instead, the idea is to provide a simple strategy dictating what task to do next and at what time. This way, we reduce the dependence on central traffic controllers and avoid having to wait for clearance from dispatchers lacking complete information. Such a decentralized approach also allows operations to be started up very quickly after a temporary interruption of train services, which was seen to take a very long time in the considered case studies.

To the best of our knowledge, decentralized dispatching strategies for railway systems have not yet been considered in the literature. However, given that in the previous step of the framework a workable line plan is generated, it should be possible to develop appropriate strategies that restore a stable service in the disrupted region at short notice. Simple principles could be used to determine when trains should depart after arriving at a station, and which rolling stock units are used to operate the different lines. For the scheduling of the crew, more intricate strategies are required, as some crew members eventually need to exit the disrupted region in order to end at their base, and the other way around. The performance of strategies can be evaluated using simulation.
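As an indication of what evaluating such a strategy by simulation could look like, the Python sketch below simulates one simple, purely hypothetical rule for rolling stock: after a fixed turnaround time at a station, a unit departs on the line that has gone longest without a departure from that station. The rule, the event-driven setup, the data layout and all names are assumptions for illustration; crew and infrastructure conflicts are ignored.

```python
import heapq


def simulate(lines, units, turnaround, end_time):
    """Simulate the 'longest-waiting line first' rule.

    lines: dict line_name -> (station_a, station_b, running_time)
    units: list with the starting station of each rolling stock unit
    Returns dict (line_name, departure_station) -> list of departure times.
    """
    # For each station, list the lines that can be operated from it.
    from_station = {}
    for name, (a, b, run) in lines.items():
        from_station.setdefault(a, []).append((name, b, run))
        from_station.setdefault(b, []).append((name, a, run))

    last_departure = {}  # (line, station) -> time of most recent departure
    departures = {}
    # Events are (time the unit is ready to depart, unit index, station).
    events = [(0, idx, station) for idx, station in enumerate(units)]
    heapq.heapify(events)

    while events:
        time, unit, station = heapq.heappop(events)
        if time >= end_time:
            break  # events are time-ordered, so all remaining ones are later
        options = from_station.get(station, [])
        if not options:
            continue
        # Local rule: serve the line with the oldest last departure here.
        name, dest, run = min(
            options,
            key=lambda o: last_departure.get((o[0], station), float("-inf")),
        )
        last_departure[(name, station)] = time
        departures.setdefault((name, station), []).append(time)
        heapq.heappush(events, (time + run + turnaround, unit, dest))
    return departures


def max_headway(times):
    """Largest gap between consecutive departures, or None if fewer than two."""
    if len(times) < 2:
        return None
    return max(b - a for a, b in zip(times, times[1:]))


# Hypothetical disrupted region: two lines sharing station B, two units.
example_lines = {"L1": ("A", "B", 10), "L2": ("B", "C", 20)}
result = simulate(example_lines, ["A", "B"], turnaround=5, end_time=60)
```

Statistics such as the maximum headway per line and direction can then be compared across candidate rules to judge how stable the resulting service is.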

Step 6 Managing the passenger flows

In the sixth and final step of the framework, the passenger flows are managed. Since the line plan in the disrupted region is adjusted, passengers also have to be routed differently through the network. Furthermore, as transport capacity might be strongly reduced on some corridors, travel advice can be used to steer passengers in order to avoid overcrowded trains and platforms. To do so, currently existing methods for providing travel advice should be modified to take into account that the disrupted region is operated using decentralized scheduling principles, instead of a fixed timetable. Effectively, this comes down to quantifying the uncertainty of the travel times in the disrupted region, followed by shortest path computations with the uncertain travel times. For the stochastic shortest path problem, there are already solution methods developed in the literature, see e.g. Nie and Wu (2009).
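A minimal Python sketch of reliability-oriented travel advice under uncertain travel times is given below. It is not the algorithm of Nie and Wu (2009): it simply enumerates all simple paths in a small network and estimates, by Monte Carlo sampling, the probability of arriving within a time budget. The uniform travel time distributions, the network and all names are hypothetical.

```python
import random


def simple_paths(graph, origin, destination, path=None):
    """Yield all simple paths from origin to destination (depth-first)."""
    path = (path or []) + [origin]
    if origin == destination:
        yield path
        return
    for nxt in graph.get(origin, {}):
        if nxt not in path:
            yield from simple_paths(graph, nxt, destination, path)


def on_time_probability(graph, path, budget, runs, rng):
    """Monte Carlo estimate of P(total travel time <= budget) along a path.

    Assumption: the travel time on link (a, b) is uniform on the interval
    graph[a][b] = (low, high), independently of all other links.
    """
    hits = 0
    for _ in range(runs):
        total = sum(rng.uniform(*graph[a][b]) for a, b in zip(path, path[1:]))
        if total <= budget:
            hits += 1
    return hits / runs


def most_reliable_path(graph, origin, destination, budget, runs=2000, seed=0):
    """Return (path, estimated on-time probability) with the highest estimate."""
    rng = random.Random(seed)
    best_path, best_prob = None, 0.0
    for path in simple_paths(graph, origin, destination):
        prob = on_time_probability(graph, path, budget, runs, rng)
        if best_path is None or prob > best_prob:
            best_path, best_prob = path, prob
    return best_path, best_prob


# Hypothetical network: the route via B crosses the disrupted region and is
# faster on average but uncertain; the route via C is slower but predictable.
network = {
    "A": {"B": (10, 40), "C": (28, 32)},
    "B": {"D": (10, 40)},
    "C": {"D": (28, 32)},
}
advice, reliability = most_reliable_path(network, "A", "D", budget=65)
```

With a budget of 65 min, the predictable route via C always arrives on time, whereas the route via B is faster on average but sometimes exceeds the budget, so the advice favours the detour. Enumeration only works for very small networks; larger instances would require a dedicated stochastic shortest path method.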

6 Conclusion

Many methods have been proposed for rescheduling railway systems after disruptions. However, in out-of-control situations, characterized by a large number of affected resources and a high degree of uncertainty, existing methods cannot be applied. In this paper, we therefore presented a new framework for dealing with such situations, using three key concepts: predicting these events using early warning signals, isolating the disrupted region and making use of decentralized decision making. The framework consists of six steps. The first two steps translate concepts from complexity theory to the field of railway systems, in order to develop models or statistics that apply to disrupted situations and on country-wide scales. Steps 3–6 are based on the idea of isolating a region and allowing self-organisation principles to take over in information-deprived circumstances. The individual steps of the framework give rise to interesting new problems that cannot be easily solved using traditional approaches, highlighting the potential of multidisciplinary collaborations for tackling complex real-life problems.

Some first attempts to solve the steps of the framework have been made, but more research is required. In the future, we plan to work on further developing the methodology to solve all steps. Moreover, we are setting up a microscopic simulation to thoroughly test the framework’s performance. We also encourage other researchers to explore alternative approaches to effectively mitigate the impact of out-of-control situations.

Many other questions remain. For example, under which circumstances is the method of isolation appropriate? Another related problem is to investigate the role of information in out-of-control situations. Qualitatively, we know from case reports that de-synchronisation of information is an important factor for railway practitioners to determine a situation to be out-of-control, but how can we quantify this, and what are early warning signals for information loss?