Monitoring Multimodal Travel Environment Using Automated Fare Collection Data: Data Processing and Reliability Analysis

  • Laiyun Wu
  • Jee Eun Kang
  • Younshik Chung
  • Alexander Nikolaev
Original Paper


Monitoring transit system “health” by extracting and tracking such quantities as travel time, transfer time and number of passengers is critical to travelers, planners and operators within a transit system. Most of the data typically available to analysts are generated by tracking vehicles rather than individual passengers; such data are useful, albeit within certain limits. This paper presents methods for obtaining system-level transit information from a new type of data, namely that coming from an Automated Fare Collection (AFC) system, which provides hour-to-hour and day-to-day transit information, such as the value and reliability of both travel time and traveler count, and the location of congested road clusters in a city. The AFC data of the public transit system in Seoul, South Korea, are used as an example to illustrate the proposed data extraction methods and analysis. This paper is structured and detailed so as to provide both methodological and practical guidance for researchers and data-handling analysts.


Keywords: Automated fare collection data; Transit information; Transit environment; Transit reliability; Data processing


Monitoring and extracting insights from the public transit operation information in the form of travel times and passenger count, etc., offers potential benefits to passengers, transit planners and operators. Travelers depend on public transit system conditions to get to their destinations on time; both travel times and throughput/crowdedness of a transit system, and also, the variability of these quantities are of travelers’ concern. Likewise, monitoring and understanding transit information can help planners and operators deal with emerging issues in real-time before the issues become serious, improve the network performance at the strategic planning and daily operational levels, and inform further refinement/redesign solutions en route to building a more reliable transit network system. In general, the more efficient and stable a transit system is, the higher quality of service it would normally provide and the higher profit it would generate.

For the purpose of monitoring a transit travel environment, much effort has been invested into using various technologies and data collection methods, including manual on-board recording, surveys, automated passenger counting (APC), and automated vehicle location (AVL) systems tracked by global positioning system (GPS) technology. The typically available AVL and APC data usually capture travel times and waiting times of buses (Dueker et al. 2004; Feng and Figliozzi 2011; Furth et al. 2006). Even the manually collected data, often used in the past [i.e., prior to the proliferation of AVL/APC systems (Levinson 1983)], have found much use in applications. It is also worth noting that although AVL/APC system data processing is more cost-effective, the data extracted from these systems tend to mislead analysts into overestimating waiting times, in particular for buses (Grisé and El-Geneidy 2017).

Monitoring technologies, such as APC/AVL systems, manual recording and surveys, all focus on tracking vehicle (passenger carrier) movements. However, passenger trajectories (routes) in this case remain hidden. By tracking and synthesizing travelers’ individual data, not only can travelers’ needs be studied and modeled in greater detail, but also system-level transit service information can be extracted and evaluated more accurately.

Automated Fare Collection (AFC) systems offer a means of obtaining a primary data source for analysis, or perhaps a supplementary source (on top of any existing data); among other things, AFC allows for tracking individual passenger trajectories. While the primary function of AFC is fare collection and user class validation for different fare rates, AFC also stores travel information of each user, recorded as time-stamped transactions. These data provide detailed travel information about transit system users that can potentially be informative for operators and planners. Due to their size, these use-history-based data must be efficiently processed to generate system-level information about the travel environment in reasonable time. In particular, by using AFC data, one can investigate, per route and/or in aggregate, how travel time, transfer time and number of passengers vary over time: within a day, day to day, or season to season. APC/AVL systems can be supplemented or even replaced by an AFC system for the purpose of collecting data about vehicle/passenger travel time, vehicle idle time, and passenger count per transit system segment.

The information obtained from an AFC system is primed to help us better assess transit network reliability and improve operating performance (Tribone et al. 2014). Transit network reliability is a concept that captures the extent of unexpected increase or decrease in travel environment measures. It is particularly valuable, considering the common, incorrect belief that public transportation is always very reliable (Van Vugt et al. 1996). Indeed, unexpected events can affect the operation of transit network quite severely. For example, traffic accidents, road maintenance and weather are known to limit the road network throughput. Passenger load variation and staff operation affect both ground and underground public transportation. A lack of reliability can undermine the attractiveness of public transit and lead to revenue losses (Chen et al. 2002). This reliability/unreliability associated with transit network use is attributed to the innate stochasticity present within transit environments and traveler agendas.

Traditionally, transit network reliability has multiple aspects, including system connectivity reliability, travel time reliability, demand reliability and traveler behavior reliability. The connectivity reliability (Bell and Iida 1997; Bell et al. 1999) refers to situations that arise when transit links go out of service, and is determined by the infrastructure conditions (road, track, signal control, etc.) and weather (Veiseth et al. 2007). The second aspect is travel time reliability (Asakura and Kashiwadani 1991; Asakura 1999; Bell et al. 1999): it is quantified as the probability that a trip between a given node pair can be completed within a given time window. The capacity reliability of a transport network (Chen et al. 1999, 2002) is defined as the probability that the transit network can accommodate a certain level of passenger demand. The last aspect is behavioral reliability (Clark and Watling 2005): this measurement deals with the travelers’ (e.g., drivers’) attitudes and responses to unexpected events.

Transit network reliability is often assessed based on the on-time performance and headway adherence. The on-time performance refers to the percentage of public transit trips that can be finished within the scheduled times. The headway adherence deals with the regularity of transit vehicle arrivals, compared to the scheduled headway. Much prior research has addressed the evaluation of these two measures and their applications. Strathman and Hopper (1993) presented a model to empirically assess the factors affecting the on-time performance in a bus system in Portland, Oregon. Nakanishi (1997) also assessed a bus system performance by using New York City Transit’s definition of on-time performance and service regularity. Chen et al. (2009) analyzed bus system reliability in Chinese cities from three perspectives: punctuality, deviation of headways and evenness of headways. On-time performance and headway adherence indicators were also reported as examples in the Transit Capacity and Quality of Service Manual (Kittelson et al. 2003), to illustrate the relationship between the actual degree of reliability of a system and the perceptions of passengers and operators. Furth and Muller (2007) introduced the concept of “reliability buffer time”, defined as the difference between the nth percentile and the median of journey times. To put our work in perspective, note that in what follows, we will assess this variability by calculating the inter-quantile range of a journey time distribution.
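The two variability measures mentioned above can be illustrated with a minimal sketch. The journey times below are made-up values in minutes, the percentile helper uses plain linear interpolation, and both the 95th percentile (as one possible choice of n) and the 25th/75th percentiles (as one possible inter-quantile range) are assumptions made for illustration only.

```python
import math

def percentile(values, q):
    """Return the q-th percentile (0 to 100) using linear interpolation."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    pos = (q / 100.0) * (len(ordered) - 1)
    lo = math.floor(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])

def reliability_buffer_time(journey_times, n=95):
    """Difference between the n-th percentile and the median (n=95 is an assumed choice)."""
    return percentile(journey_times, n) - percentile(journey_times, 50)

def inter_quantile_range(journey_times, lower=25, upper=75):
    """Spread between two quantiles of the journey time distribution (assumed bounds)."""
    return percentile(journey_times, upper) - percentile(journey_times, lower)

# Hypothetical journey times (minutes) observed for one station pair and time window.
journey_times = [10, 12, 14, 16, 18]
buffer_time = reliability_buffer_time(journey_times)
iqr = inter_quantile_range(journey_times)
```

A larger value of either quantity indicates a wider spread of journey times, i.e., a less predictable service for the station pair and time window considered.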

To recap, the reliability/unreliability of a transit network depends on multiple factors (El-Geneidy et al. 2011): driver behavior, schedule (in)flexibility, transit signal priority, route design, etc. Understanding those factors not only helps improve transit service, but also informs transit reliability models. We will show that AFC data make it possible to investigate and understand how travel times, transfer times and number of passengers vary.

In this paper, we present the data processing methods to generate stochastic transit travel environment from AFC data, wherein user-based information is converted to system-level information. In Sect. 2, we review the literature related to AFC data use in public transit systems. Sections 3 and 4 introduce the objectives and methods of AFC data processing, in application to a case study of the transit system in Seoul, South Korea. Section 5 presents examples of reliability analyses and congestion detection analyses from the real-world data we processed.

The contribution of this paper is thus twofold. First, the paper presents a series of comprehensive methods to generate public transit travel environment based on AFC data. Second, with the extracted travel environment information, it shows how one can assess how reliable the transit network is, at different time scales, and how one can search for unusual patterns in these data.

AFC Data Use for Understanding Public Transit Systems

Automated Fare Collection (AFC) systems, often called smart transit card systems, have found use in public transportation systems worldwide. Not only do AFC systems enable a secure and fast way of fare collection, but also, they offer a cost-effective way of collecting and monitoring transit user data.

The information most commonly extracted from AFC data is station-to-station origin–destination (OD) travel demand. An extracted OD demand matrix contains the information of the transit demand levels, which allows operators and planners to respond to the system’s needs and provide travelers with more efficient services.

Prior studies developed algorithms to extract and complete the OD information based on AFC data. The research objectives and methodologies varied over different reported studies, based on the data availability. Given AFC data with entry-only information, the researchers have typically focused on inferring the destination stations by using rule-based approaches (Barry et al. 2002; Gordon et al. 2013; Nassir et al. 2011; Trépanier et al. 2007; Zhao 2004). Barry et al. (2002) came up with a model to synthesize AFC data based on two simplifying assumptions: (1) the destination station of a previous transaction is the same as the origin of the next transaction, and (2) the last destination of the day is the same as the origin of the first transaction of the same day. Zhao (2004) inferred the alighting stations with Chicago AFC data, by adding a third important assumption to deal with the multi-modal nature of the city’s transit environment: there is no private transportation mode trip mixed in with the public transportation trips. Moreover, in the latter work, the destination estimation process also relied on other data sources, namely automated vehicle location (AVL), automated passenger count (APC) and geographic information system (GIS) data. Trépanier et al. (2007) pointed out that individual trip destinations could also be estimated by looking at similar trips made by the same card holders, found in the trip history database. Another modified algorithm, which explicitly considers schedule delay for each transaction (Nassir et al. 2011), was also helpful for researchers to accurately infer transaction destinations within public transit systems.
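As an illustration of the rule-based idea, the following minimal sketch applies only the two simplifying assumptions attributed above to Barry et al. (2002) to the boarding stations of one card on one day. The function name, the input layout and the station names are hypothetical, and the sketch is not a reproduction of any of the cited algorithms.

```python
def infer_destinations(boardings):
    """Infer a destination for each entry-only transaction of one card on one day.

    boardings: boarding stations in chronological order (hypothetical layout).
    Returns a list of (origin, inferred_destination) pairs.
    """
    trips = []
    for i, origin in enumerate(boardings):
        if i + 1 < len(boardings):
            # Assumption (1): destination equals the origin of the next transaction.
            destination = boardings[i + 1]
        elif len(boardings) > 1:
            # Assumption (2): the last destination equals the first origin of the day.
            destination = boardings[0]
        else:
            # A single boarding gives no basis for inference in this sketch.
            destination = None
        trips.append((origin, destination))
    return trips

# Hypothetical station names for one card on one day.
day = ["Home St", "Office St", "Gym St"]
chained = infer_destinations(day)
```

In this toy case the last trip of the day is assumed to return to "Home St"; the refinements cited above (additional data sources, trip history, schedule delay) are deliberately left out.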

Once a destination location is obtained (estimated), one can generate a passenger trip OD matrix (Gordillo 2006). As such, Cui (2006) estimated the bus passenger origin–destination flows using AFC, APC and AVL data. Further, Sun et al. (2015) pointed out the stochastic nature of public transit environment and proposed an integrated Bayesian statistical inference framework to characterize passenger flow in a metro system using the appropriately defined random variables.

Besides destination inference and origin–destination flow estimation, one can employ user-level information captured as AFC data to evaluate system-level transit network operation, namely to distill users’ travel patterns (Chakirov and Erath 2011; Ma et al. 2013; Sun et al. 2012), perform route choice estimation analysis (Kusakabe et al. 2010; McMullan and Majumdar 2012; Sun and Xu 2012), trip purpose inference (Lee and Hickman 2014), travel time analysis and overall transit system reliability assessment (Sun et al. 2016; Sun and Xu 2012). However, few studies utilized AFC data to reconstruct a complete transit environment, which would include the exact consideration of the segment and hour-to-hour travel times, transfer times and per-segment crowdedness (passenger count). In this paper, we perform such analyses, and also come up with an index to measure the transit network reliability for various times and segments.

To summarize, while AFC data have a high potential to feed into analytical work, there has been limited success among the research and practitioner communities in using such data for measuring and monitoring transit environments from a system-level perspective. This calls for further efforts that can facilitate the adoption of the AFC data processing methods as part of a transit system manager’s toolbox. To this end, this paper presents algorithms to estimate detailed system-level transit environment information such as travel time, transfer time and crowdedness from AFC data. The resulting information can be used to analyze transit service reliability as well as to understand users’ route choice behavior, either taken alone or (preferably) combined with user trajectory information.

Data and Objectives: The Seoul Case

The objectives of our task of reconstructing and analyzing a stochastic multi-modal travel environment include deriving empirical distributions for system-level travel times, transfer times, as well as inferring trip costs and crowdedness information between every two stations in the transportation network, given a traveler’s boarding time.

We analyze the AFC data from the transit system of Seoul, South Korea. In South Korea, smart cards have been predominantly used since 2005, with use rates of 70% or higher in large cities (Park et al. 2008). In particular, smart card usage is about 80% in the Seoul Metropolitan Area. Since around 2010, more than 90% of public transit passengers have chosen to use smart cards, and the number of subscribers is still growing (Jang 2010).

For our study in this paper, the AFC data of all the metro and bus routes within the Seoul Metropolitan Area are utilized. The stations and routes within the study area are shown in Figs. 1 and 2. These AFC data consist of 12 weeks’ worth of transaction records: 1 week in each month of the year 2013. During those 12 weeks, the city of Seoul operated 18 metro lines and 11,637 buses over 936 routes (see Figs. 1 and 2). A total of 1,101,544,931 transactions and 35,978,530 unique smart card IDs are recorded in our dataset. Among those records (over 1 billion of them), each transaction corresponds to one traveler; each record contains such information as card ID number, boarding and alighting stations and times, and route information, among other data (see Table 1 for more details).
Fig. 1 Metro stations and lines in Seoul

Fig. 2 Bus stations and lines in Seoul

Table 1 Transaction data description

Transaction data
  • Passenger card ID number
  • Bus run departure date and time
  • Transportation operators ID
  • Transaction ID
  • Transportation method code
  • Bus routes ID
  • Vehicle ID
  • Passenger card user class code
  • Number of passengers
  • Boarding date and time
  • Boarding payment amount
  • Boarding station ID
  • Alighting date and time
  • Alighting payment amount
  • Alighting station ID
  • Distance (in meters) card user traveled
  • Time (in seconds) card user spent

The output of our data extraction task is system-level information, which comes in two forms: we either extract transit vehicle operation information at each link from the AFC data (via what is henceforth termed the link-based method), or aggregate the information of travelers’ paths between transit node pairs (via what is henceforth termed the path-based method). Since the travel times, transfer times and crowdedness change rapidly over time, they are extracted and aggregated in reference to the trip start time t.

Before elaborating on the output of our data extraction task, we first present the terms and notation used throughout the paper:
  • Transaction: a transaction is one record in the AFC database. The information contained in each transaction is listed in Table 1.

  • Trip: a trip comprises a series of transactions along the trip segments that connect the origin and destination of a traveler. During the time between two transactions, the traveler has to transfer to another bus or metro route; these transfers are also part of the trip. Figure 3 illustrates a trip covering eight stations [\(s_a, s_b, s_{b1}, s_{b2}, s_c, s_{c1}, s_{c2}, s_d\)]. Every two consecutive stations are adjacent. However, there are only three transactions in this trip: (\(s_a, s_b\)), (\(s_b, s_c\)) and (\(s_c, s_d\)). \(s_{b1}, s_{b2}, s_{c1}\), and \(s_{c2}\) are called “pass-through” stations, which refer to the stations between the boarding and alighting stations corresponding to a transaction. Transaction travel time and between-transaction transfer time are shown in the graph as well. The travelers can transfer from metro to bus, bus to metro, or bus to bus up to 5 times for no additional basic fare as long as the transfer time is within 30 min. Therefore, the travelers tend to transfer within 30 min when they are on a trip. We assume that a longer-than-30-min time span between transactions of the same traveler signals the end of one trip and the beginning of the next one. The trip travel time is taken to be the sum of all transaction travel times and transfer times that are part of the trip.

  • Origin (O) and destination (D): we use the first boarding station and the last alighting station of a trip to represent the origin and destination of this trip.

Fig. 3 Illustration: components of a trip

  • Segment/link: a trip segment is the road or railway track between two directly connected bus or metro stations. The terms “link” and “segment” are used interchangeably throughout the manuscript.

  • Path: a path is a sequence of segments connecting the origin of a trip with its destination. It is worth mentioning that there are usually multiple paths connecting the same O and D. The relationship between paths and segments is shown in Fig. 4.

  • Route: a route is a path connected via a fixed sequence of stations, between which the bus line or metro (bus/metro) operates. The beginning of a route is the station where the bus/metro starts to provide its service and the end of a route is the station where the service stops (with respect to a given trip).

  • Bus or metro run: a bus/metro run is a scheduled bus/metro car sequentially traveling all stations on its route. Usually, the times between two runs of bus/metro are scheduled beforehand. Three different bus runs are illustrated in Fig. 5.

Fig. 4 Illustration: routes and segments

Fig. 5 Illustration: bus runs

The travel time values, denoted by (\(T_{s_1 \rightarrow s_2}(t)\), \(T_{s_1 \rightarrow s_2}^{\text {bus}}(t)\) and \(T_{s_1 \rightarrow s_2}^{\text {metro}}(t)\)), are the times to travel between a pair of stations \(s_1\) and \(s_2\) along a specific path; \(T_{s_1 \rightarrow s_2}(t)\) is the total travel time, and t is the start time of the trip. For a multi-modal trip, \(T_{s_1 \rightarrow s_2}^{\text {bus}}(t)\) is the time spent on bus (along the respective path), and \(T_{s_1 \rightarrow s_2}^{\text {metro}}(t)\) is the time spent on metro (again, along this same path); the two are calculated separately. Note that since travel times are link additive, the travel time along a path can be derived by adding up the average travel times for each link along the path (this is called the link-based method). \(T_{s_1 \rightarrow s_2}(t)\) could also be generated by averaging the travel times of all trips connecting stations \(s_1\) and \(s_2\) (this is called the path-based method).
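The difference between the two ways of obtaining \(T_{s_1 \rightarrow s_2}(t)\) can be sketched as follows. The dictionaries, station labels and travel times are hypothetical and assumed to refer to a single start-time window t; the sketch only contrasts summing per-link averages with averaging observed end-to-end trips.

```python
from statistics import mean

# Hypothetical observed link travel times (seconds) within one start-time window t.
link_times = {
    ("s1", "s2"): [120, 130, 110],
    ("s2", "s3"): [200, 220],
}

# Hypothetical end-to-end trip travel times (seconds) observed between s1 and s3.
trip_times = {("s1", "s3"): [330, 350]}

def link_based_time(path, link_times):
    """Link-based estimate: sum of the average travel times of consecutive links."""
    return sum(mean(link_times[(a, b)]) for a, b in zip(path, path[1:]))

def path_based_time(origin, destination, trip_times):
    """Path-based estimate: average over observed trips; None if no trip was observed."""
    observed = trip_times.get((origin, destination))
    return mean(observed) if observed else None

link_estimate = link_based_time(["s1", "s2", "s3"], link_times)
path_estimate = path_based_time("s1", "s3", trip_times)
```

The path-based value may include time that is not captured on individual links (for instance, a transfer), and it is unavailable for any station pair that no recorded trip connects.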

Transfer time \(W_{s_1 \rightarrow s_2}(t)\) is the total time taken to transfer between bus/metro stations \(s_1\) and \(s_2\) along a path, for trip start time t. Recall that a transfer is assumed to take less than 30 min—indeed, it is often the case in practice.

Crowdedness is denoted by \(L_{s_1 \rightarrow s_2}(t)\). Specifically, the two quantities \(L_{s_1 \rightarrow s_2}^{\text {bus}}(t)\) and \(L_{s_1 \rightarrow s_2}^{\text {metro}}(t)\) are for bus and metro, respectively, computed as the number of passengers averaged over all the (multi-modal) segments that are part of the transit path. Again, recall that a segment (or link) is defined to provide a direct connection between two adjacent stations.

Travel cost \(C_{s_1 \rightarrow s_2}(t)\) is the amount of money that a passenger needs to spend to travel between two transit nodes \(s_1\) and \(s_2\), beginning the trip at time t. Note that the number of transfers \(N_{s_1 \rightarrow s_2}(t)\) becomes fixed, once a specific path has been chosen by the traveler.

Number of transfers \(N_{s_1 \rightarrow s_2}\) is the number of interruptions, or events where the passenger switches between travel modes or buses, to travel from \(s_1\) to \(s_2\) along the chosen path.

With the key pieces of notation and terminology set, we are now ready to describe the specific data processing methods and algorithms.


We present two methods to extract system-level travel information. The first one is referred to as the link-based method. It requires us to extract link-level information, and then add the numbers for link-additive variables. The advantage of the link-based method is that the times for all possible paths can thereafter be calculated via simple summation operations. The other method we describe is called the path-based method, which is helpful for exploring more transit information, including transfer time and number of transfers. However, using the latter method, no information can be generated (extracted) for a path never traveled by anyone per the AFC records; this is a typical case for non-peak time windows and suburban area trips.

Link-Based Method

The link-based data processing method focuses on assessing the aggregate-level travel time and crowdedness for any pair of directly connected stations; recall that such a direct connection is called “link” or “segment”. The link-based method relies on schedules and length of travel of vehicles, namely buses and metro cars. Based on the observations of recorded trips, segment-level travel time and crowdedness are extracted and stored; this method, however, cannot be used to calculate transfer time or travel cost.

Two algorithms are presented here for bus and metro AFC data, respectively. Algorithm 1 extracts bus departure/arrival time and number of boarding/alighting passengers at each station. The travel time and crowdedness for travel between a given pair of nodes can then be calculated by combining the segment-level information accordingly. Algorithm 2 is designed for metro transactions. The boarding and alighting times for metro transactions are the times that the users (travelers) enter and leave the gates of metro stations. Note that travelers are free to change routes without leaving metro stations. Therefore, in Algorithm 2, in order to track the exact vehicles that passengers take, we need an extra pre-processing step to estimate the vehicle arrival/departure times by disaggregating metro transactions into route level.
Table 2 Notation for AFC data and Algorithm 1

  • \(R_r\): AFC record with unique identifier r
  • N: card ID
  • r: record unique identifier
  • \(s^b\): boarding station
  • \(s^a\): alighting station
  • \(t^b\): boarding time
  • \(t^a\): alighting time
  • D: bus run identifier
  • P: number of passengers using the card
  • \(B_{d,s}\): set of boarding times at station s for bus d
  • \(A_{d,s}\): set of alighting times at station s for bus d
  • \(P^b_{d,s}\): set of numbers of boarding passengers at station s for bus d
  • \(P^a_{d,s}\): set of numbers of alighting passengers at station s for bus d
  • \(b_{d,s}\): boarding time at station s for bus d
  • \(a_{d,s}\): alighting time at station s for bus d
  • \(T_{d,s\rightarrow s+1}\): travel time for bus d from station s to s+1
  • Idle time for bus d at station s
  • \(P_{d,s\rightarrow s+1}\): number of passengers for bus d from station s to s+1

In Algorithm 1, each record (\(R_r\)) from bus AFC data (\(R_{\text {bus}}\)) contains the following information: card ID (N), record ID (r), boarding station (\(s^b\)), alighting station (\(s^a\)), boarding time (\(t^b\)), alighting time (\(t^a\)), bus run identifier (D), and number of passengers boarding/alighting (P). It is worth mentioning that the bus run identifier is generated by combining the bus vehicle ID (\(VEHC\_ID\)), route ID (\(BUS\_ROUTE\_ID\)) and the departure time (\(RUN\_DEPART\_DTIME\)), as found in the AFC data. For clarity, all the notations used in Algorithm 1 are listed in Table 2. The algorithm’s steps are summarized as follows.

Step 1: generate boarding/alighting sets. Go through the bus AFC records one by one. For every record whose bus run D matches a specific bus run d, put the boarding time \(t^b\) into set \(B_{d,s}\) and the number of boarding passengers P into set \(P^b_{d,s}\) for the boarding station \(s = s^b\); likewise, put the alighting time \(t^a\) into set \(A_{d,s}\) and the number of alighting passengers P into set \(P^a_{d,s}\) for the alighting station \(s = s^a\).

Step 2: estimate departure and arrival times for each bus run at each station. The aggregated travelers’ alighting and boarding times can serve as an approximation of the actual bus arrival and departure times. An estimator for the departure time (\(b_{d,s}\)) of a bus run at a given station can be taken as the average of all the boarding times (Avg(\(B_{d,s}\))), the last recorded boarding time (Max(\(B_{d,s}\))), or, e.g., the 80th percentile of the recorded boarding times in \(B_{d,s}\). Similarly, we can find estimators for the arrival time (\(a_{d,s}\)) from the alighting times in \(A_{d,s}\).

Step 3: calculate bus travel times and idle times. Link-level bus travel times are generated in this step. The travel time for bus run d on the segment connecting stations s and \(s+1\) is \(T_{d,s \rightarrow s + 1}\). This travel time for bus run d is the difference between the departure time \(b_{d,s}\) at station s and arrival time \(a_{d,s+1}\) at station \(s+1\). Then we can use the link-additive approach to calculating the travel time between two stations s and \(s+i\). It is worth mentioning that, because the instances of travel along some of the combined links (\(s \rightarrow s + i\)) may be directly observed in the AFC dataset, the calculation of \(T_{d,s \rightarrow s + i}\) can be conducted directly for a number of paths in the travel environment: we refer to this approach as the “Modified Link-Based” method in later sections.

The idle time for bus run d (time spent loading/unloading at station s) is found as the difference between the departure time (\(b_{d,s}\)) and arrival time (\(a_{d,s}\)) at s.
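Steps 1 to 3 can be sketched as follows. The record layout, the run identifier and the times are hypothetical; stations are assumed to be indexed by their order along the route, times are in seconds, and the average is used as the estimator in Step 2 (the maximum or a percentile could be passed instead). Boarding times are grouped by boarding station and alighting times by alighting station.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical bus records:
# (run_id, boarding_station, alighting_station, boarding_time, alighting_time, passengers)
records = [
    ("run1", 0, 1, 100, 400, 1),
    ("run1", 0, 2, 110, 700, 2),
    ("run1", 1, 2, 420, 710, 1),
    ("run1", 1, 2, 440, 690, 1),
]

def build_time_sets(records):
    """Step 1: collect boarding times B[d, s] and alighting times A[d, s]."""
    boarding_times = defaultdict(list)
    alighting_times = defaultdict(list)
    for run, s_b, s_a, t_b, t_a, _passengers in records:
        boarding_times[(run, s_b)].append(t_b)
        alighting_times[(run, s_a)].append(t_a)
    return boarding_times, alighting_times

def estimate_times(time_sets, estimator=mean):
    """Step 2: reduce each set of times to a single estimate (mean by default)."""
    return {key: estimator(times) for key, times in time_sets.items()}

B, A = build_time_sets(records)
departure = estimate_times(B)  # b[d, s]
arrival = estimate_times(A)    # a[d, s]

def link_travel_time(run, s):
    """Step 3: T[d, s -> s+1] = a[d, s+1] - b[d, s]."""
    return arrival[(run, s + 1)] - departure[(run, s)]

def idle_time(run, s):
    """Step 3: idle time at station s = b[d, s] - a[d, s]."""
    return departure[(run, s)] - arrival[(run, s)]
```

In this sketch a station with no recorded boarding (or alighting) has no estimate and raises a `KeyError`; how such gaps are handled in practice is not specified here.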

Step 4: evaluate crowdedness. This step finds the number of passengers (\(P_{d,s\rightarrow s+1}\)) traveling on each segment \(s \rightarrow s + 1\) for bus run d. This passenger count is also called “crowdedness”. For bus run d at station s, the number of boarding passengers \(P^B_{d,s}\) is the sum of all values in set \(P^b_{d,s}\), while the number of alighting passengers \(P^A_{d,s}\) is the sum of all values in set \(P^a_{d,s}\). The number of passengers on segment \(s \rightarrow s + 1\) is the difference between the sum of \(P^B_{d,s}\) and the sum of \(P^A_{d,s}\), each accumulated over the stations served by the run up to and including s.
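A minimal sketch of the crowdedness calculation follows, under the assumption that the on-board count on a segment equals cumulative boardings minus cumulative alightings up to the segment's start station. The record layout and values are hypothetical, and stations are indexed by their order along the route.

```python
from collections import defaultdict

# Hypothetical bus records: (run_id, boarding_station, alighting_station, passengers)
records = [
    ("run1", 0, 1, 1),
    ("run1", 0, 2, 2),
    ("run1", 1, 2, 1),
    ("run1", 1, 3, 1),
]

def segment_loads(records, run, n_stations):
    """Return the passenger count on each segment (s, s+1) of one bus run."""
    boarded = defaultdict(int)   # total boardings per station
    alighted = defaultdict(int)  # total alightings per station
    for d, s_b, s_a, passengers in records:
        if d == run:
            boarded[s_b] += passengers
            alighted[s_a] += passengers
    loads = {}
    on_board = 0
    for s in range(n_stations - 1):
        # Running balance: passengers who boarded so far minus those who alighted so far.
        on_board += boarded[s] - alighted[s]
        loads[(s, s + 1)] = on_board
    return loads

loads = segment_loads(records, "run1", 4)
```

Only card-paying passengers present in the records are counted, so the result is a lower bound on the actual load whenever other payment means are in use.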

In summary, the arrival times, departure times and passenger numbers at each station (obtained as described above) provide enough knowledge for us to infer the segment-level travel times and levels of service, to enable the estimation of those quantities for any path (i.e., between any two stations, which are not necessarily directly connected).

Algorithm 1 is not applicable to metro transaction processing because metro transactions do not contain any vehicle information. The travelers only swipe their cards at the gates of in- and out-bound metro stations; therefore, neither vehicle information nor transfer information of metro trips is recorded.

Algorithm 2 is developed to disaggregate metro transactions and estimate the possible metro vehicle utilization. Since we already know the origin and destination of each traveler’s metro trip, by considering the possible segments and vehicles he/she has taken, we are able to estimate the link-level metro travel time and crowdedness, in line with the logic of Algorithm 1. To this end, Algorithm 2 proceeds in three steps, as follows.

Step 1: for each metro AFC record \(R_r(N,r,s^b,s^a,t^b,t^a,P)\), we first find the shortest path between its boarding and alighting stations, using the available metro timetable and applying Dijkstra’s shortest path algorithm (Dijkstra 1959).

Step 2: Divide each record
$$\begin{aligned} (R_r(N,r,s^b,s^a,t^b,t^a,P)) \end{aligned}$$
into segment-level records
$$\begin{aligned} (R_{r,1}(N,r_1,s^b_1,s^a_1,t^b_1,t^a_1,P),...,R_{r,n}(N,r_n,s^b_n,s^a_n,t^b_n,t^a_n,P)). \end{aligned}$$
The segment-level travel time should be proportional to the travel time along the shortest path. Then, each segment record is assigned to a metro run identifier D. During this assignment, a 3-min time gap is assumed at the first origin/last destination station, as well as for any transfer that a metro passenger makes.

Step 3: once the metro run identifier D is assigned to the segment-level metro records, we apply Algorithm 1 to calculate metro link-level travel times and levels of service.
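Steps 1 and 2 of Algorithm 2 can be sketched as follows. The timetable graph, station labels and times are hypothetical. The sketch assumes a trip without line transfers, so the 3-min gap is applied once at the origin gate and once at the destination gate, and it interprets “proportional” as scaling the timetable link times so that they fill the remaining observed time; the assignment of segments to metro runs is omitted.

```python
import heapq

# Hypothetical timetable: scheduled seconds between adjacent metro stations.
timetable = {
    "A": {"B": 120},
    "B": {"A": 120, "C": 180, "D": 600},
    "C": {"B": 180, "D": 60},
    "D": {"B": 600, "C": 60},
}

def shortest_path(graph, origin, destination):
    """Dijkstra's algorithm; returns the station sequence or None if unreachable."""
    best = {origin: 0}
    previous = {}
    queue = [(0, origin)]
    while queue:
        cost, node = heapq.heappop(queue)
        if node == destination:
            break
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry
        for neighbor, weight in graph[node].items():
            new_cost = cost + weight
            if new_cost < best.get(neighbor, float("inf")):
                best[neighbor] = new_cost
                previous[neighbor] = node
                heapq.heappush(queue, (new_cost, neighbor))
    if destination not in best:
        return None
    path = [destination]
    while path[-1] != origin:
        path.append(previous[path[-1]])
    return path[::-1]

def split_record(graph, s_b, s_a, t_b, t_a, gap=180):
    """Split one gate-to-gate record into (from, to, start, end) segment records."""
    path = shortest_path(graph, s_b, s_a)
    if path is None or len(path) < 2:
        return []
    links = list(zip(path, path[1:]))
    scheduled = [graph[a][b] for a, b in links]
    in_vehicle = (t_a - t_b) - 2 * gap  # assumed: one gap at each gate
    scale = in_vehicle / sum(scheduled)
    segments, clock = [], t_b + gap
    for (a, b), sched in zip(links, scheduled):
        duration = sched * scale
        segments.append((a, b, clock, clock + duration))
        clock += duration
    return segments

segments = split_record(timetable, "A", "D", t_b=0, t_a=1080)
```

Records whose observed duration is shorter than the assumed gaps would yield negative segment times in this sketch and would need to be screened out beforehand.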

Path-Based Method

For each transaction record, there is only one boarding station and one alighting station. However, a traveler sometimes creates multiple transactions en route to their final destination. In this data processing method, if the same traveler takes the next bus or metro car within 30 min or less of the previous transaction, then the next transaction is considered a continuation of their trip. Note that this is exactly how the “transit system” defines a trip: those travelers who take bus/metro within 30 min of the last swipe of their smart card are eligible for a fare discount.

Therefore, if a traveler has not reached their final destination in a single trip segment, then they will try to transfer within this 30-min time gap. In this case, the time spent in transfer from the previous transaction destination to the next transaction origin adds to the transfer time of the trip, denoted by (W). Recall also, that a total amount of time to finish the whole trip is called trip travel time (T).

If a given trip has the same origin and destination as the path we are interested in, then the trip information can be directly used for extracting the path travel information.

The developed Algorithm 3 generates full trip information and aggregates it into the system-level path information. The entire AFC record dataset (R) is the input into this algorithm. Each AFC record (\(R_r\)) contains the following information: card ID (N), record ID (r), boarding station (\(s^b\)), alighting station (\(s^a\)), boarding time (\(t^b\)), alighting time (\(t^a\)), the number of passengers boarding/alighting (P) and travel cost (F) for this transaction. Algorithm 3 comprises eight steps, as follows.

Step 1: sort records by card number and boarding time. All the records having the same card ID N are sorted in ascending order of boarding time (\(t^b\)); the position of a record in this order is denoted by o.

Step 2: find the “previous” and “next” transfer times for each transaction. The transfer times between the records, corresponding to the same trips, are computed. For the records having the same card ID N, the transfer time between each previous alighting station (\(s_{o-1}^a\)) and current boarding station (\(s_{o}^b\)) is denoted by \(B_o\). The transfer time between the current alighting station (\(s_{o}^a\)) and next boarding station (\(s_{o+1}^b\)) is denoted by \(A_o\). Note that for the first transaction, the value of \(B_o\) is set to infinity, as is the value of \(A_o\) for the last transaction of a trip. Each record having card ID N is then updated to \(R_o(N,r,s_o^b,s_o^a,t_o^b,t_o^a,P_o,F_o,B_o,A_o)\).
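Steps 1 and 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Record` class, its field names and the use of minutes since midnight as the time unit are hypothetical and are not part of the AFC data specification.

```python
from dataclasses import dataclass
from itertools import groupby
from math import inf


@dataclass
class Record:
    card_id: str         # N
    record_id: int       # r
    board_time: float    # t^b, minutes since midnight (assumed unit)
    alight_time: float   # t^a, minutes since midnight (assumed unit)
    before: float = inf  # B_o, stays infinite for a card's first record
    after: float = inf   # A_o, stays infinite for a card's last record


def add_transfer_times(records):
    """Sort by (card ID, boarding time) and fill B_o and A_o in place."""
    ordered = sorted(records, key=lambda r: (r.card_id, r.board_time))
    for _, group in groupby(ordered, key=lambda r: r.card_id):
        same_card = list(group)
        for prev, cur in zip(same_card, same_card[1:]):
            gap = cur.board_time - prev.alight_time
            prev.after = gap   # time until the next boarding
            cur.before = gap   # time since the previous alighting
    return ordered
```

A negative `gap` would correspond to the outliers that Step 3 removes.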

Step 3: exclude outliers. In this step, we remove any outliers, i.e., records that turn out to have negative transfer times (indicating data entry errors).

Step 4: generate transaction category. Using the computed transfer times \(B_o\) and \(A_o\), we now see whether any given transaction is an isolated single record, or a starting/continuing record. If there is no travel information recorded within the previous 30 min (\(B_o\) > 30 min) and within the next 30 min (\(A_o\) > 30 min) of this transaction, it is labeled as a “single-transaction trip” (\(C_o\) = “single-transaction trip”). Conversely, if there is no transaction recorded within the previous 30 min (\(B_o\) > 30 min) but there is a transaction within the following 30 min (\(A_o\) < 30 min) of this transaction, then it is labeled as “Initial” (\(C_o\) = “Initial”). If there is a transaction recorded within the previous 30 min (\(B_o\) < 30 min) but there is no transaction within the following 30 min (\(A_o\) > 30 min) of this transaction, then it is labeled as “Stop” (\(C_o\) = “Stop”). Finally, if there is a transaction recorded within the previous 30 min (\(B_o\) < 30 min) and there is a transaction within the following 30 min (\(A_o\) < 30 min) of this transaction, then it is labeled as “Transfer” (\(C_o\) = “Transfer”).
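The four labels of Step 4 follow from two yes/no questions, as the sketch below shows. The function name `categorize` is hypothetical. The text leaves a gap of exactly 30 min unspecified in this step, so the sketch assumes it counts as falling within the window, in line with the "30 min or less" rule stated earlier.

```python
from math import inf

WINDOW_MIN = 30.0  # transfer window in minutes, taken from the text


def categorize(before, after, window=WINDOW_MIN):
    """Return the category C_o from the transfer times B_o and A_o.

    Assumes negative transfer times were already removed in Step 3.
    """
    has_prev = before <= window  # a transaction precedes this one within the window
    has_next = after <= window   # a transaction follows this one within the window
    if has_prev and has_next:
        return "Transfer"
    if has_prev:
        return "Stop"
    if has_next:
        return "Initial"
    return "single-transaction trip"
```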

Step 5: identify trips and single-transaction trips. By combining different transaction categories, labeled in Step 4, three types of trips are extracted: “initial-transfer-stop”, “initial-stop”, and “single-transaction trip”. For each type of trip, the number of passengers is calculated as the average number of passengers over all the transactions, while the travel costs and transfer times are generated by adding up the travel costs and transfer times over all the transactions, respectively.

The resulting trip record for card ID N is denoted by
$$\begin{aligned} R_o(N,s_m^b,s_{k}^a,t_m^b,t_{k}^a, \frac{1}{k-m+1}\sum _{o=m}^k P_o, \sum _{o=m}^k F_o,\sum _{o=m+1}^k B_o,\sum _{o=m}^{k-1} A_o). \end{aligned}$$
Here, the variables with subscript m are the ones extracted from the “Initial” transactions, the variables with subscript k are the ones extracted from the “Stop” transactions, and the variables with subscripts \(m+1,\ldots ,k-1\) are the ones extracted from the “Transfer” transactions.
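The chaining of “Initial”, “Transfer” and “Stop” transactions into trips can be sketched as follows. The dictionary keys (`cat`, `sb`, `sa`, `tb`, `ta`, `P`, `F`, `B`) are hypothetical names for the quantities defined above, and the input is assumed to hold one card's transactions in boarding-time order.

```python
from statistics import mean


def merge(chain):
    """Collapse one chain of transactions into a single trip record."""
    first, last = chain[0], chain[-1]
    return {
        "origin": first["sb"],
        "destination": last["sa"],
        "tb": first["tb"],
        "ta": last["ta"],
        "passengers": mean(r["P"] for r in chain),    # average over the chain
        "fare": sum(r["F"] for r in chain),           # total travel cost
        "transfer": sum(r["B"] for r in chain[1:]),   # B_o for o = m+1..k
    }


def build_trips(labeled):
    """Group one card's labeled transactions into trips."""
    trips, current = [], []
    for rec in labeled:
        if rec["cat"] == "single-transaction trip":
            trips.append(merge([rec]))
        elif rec["cat"] == "Initial":
            current = [rec]
        elif current:  # "Transfer" or "Stop" continuing an open trip
            current.append(rec)
            if rec["cat"] == "Stop":
                trips.append(merge(current))
                current = []
    return trips
```

Summing `B` over every transaction except the first gives the same total as summing `A` over every transaction except the last, so only one of the two is kept here.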

Step 6: generate trip travel times. The travel time of a trip between boarding station \(s_m^b\) and alighting station \(s_k^a\) is found as the difference between the boarding time \(t_m^b\) at station \(s_m^b\) and the alighting time \(t_k^a\) at station \(s_k^a\). The path travel time is the average of the trip travel times, over all such trips beginning at the same hour.

Step 7: generate trip transfer times. The transfer time of a trip between boarding station \(s_m^b\) and alighting station \(s_k^a\) is found as the sum of all transfer times taken between transactions (\(W_{s_m^b \rightarrow s_{k}^a} = \sum _{o=m+1}^k B_o \text {, or } \sum _{o=m}^{k-1} A_o\)). Path transfer time is the average trip transfer time, over all such trips beginning at the same hour.

Step 8: generate travel fare costs. The travel cost of a trip between boarding station \(s_m^b\) and alighting station \(s_k^a\) is found as the sum of the travel costs of all transactions (\(F_{s_m^b \rightarrow s_{k}^a} = \sum _{o=m}^k F_o\)). Path travel cost is the average trip travel cost over all such trips beginning at the same hour (Table 3).
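Steps 6 to 8 amount to grouping trips by origin, destination and starting hour and then averaging, as in the sketch below. The trip dictionaries and their keys are hypothetical, and times are assumed to be minutes since midnight.

```python
from collections import defaultdict
from statistics import mean


def path_averages(trips):
    """Average travel time T, transfer time W and fare F per (origin, destination, start hour)."""
    groups = defaultdict(list)
    for t in trips:
        hour = int(t["tb"] // 60)  # starting hour of the trip
        groups[(t["origin"], t["destination"], hour)].append(t)
    return {
        key: {
            "T": mean(t["ta"] - t["tb"] for t in ts),  # path travel time
            "W": mean(t["transfer"] for t in ts),      # path transfer time
            "F": mean(t["fare"] for t in ts),          # path travel cost
            "n": len(ts),                              # number of trips averaged
        }
        for key, ts in groups.items()
    }
```

Keeping the count `n` next to each average makes it easy to see which hourly estimates rest on only a handful of trips.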
Table 3 A summary of notations used in Algorithm 3

| Notation | Description |
| --- | --- |
| \(F\) | Travel fare cost |
| \(B_o\) | Transfer time before this transaction |
| \(A_o\) | Transfer time after this transaction |
| \(C_o\) | Transaction category |
| \(T_{s_1^b \rightarrow s_k^a}\) | Travel time for trip from station \(s_1^b\) to \(s_k^a\) |
| \(W_{s_1^b \rightarrow s_k^a}\) | Transfer time for trip from station \(s_1^b\) to \(s_k^a\) |
| \(F_{s_1^b \rightarrow s_k^a}\) | Travel fare cost for trip from station \(s_1^b\) to \(s_k^a\) |
| | Function to return the hour of the day |

Summary: Link-Based, Modified Link-Based and Path-Based Methods

In order to generate the complete system-level travel information, we employ both the link-based method and path-based method.

Since the data extracted with the link-based method are link-additive, this method is well suited to producing travel time and crowdedness estimates. Sometimes a path along a given route is observed in its entirety in the AFC data. Then, instead of adding up all the link travel times, we can treat this path as one “long link” and obtain its travel time simply as the time difference between the path origin and destination. We call this the “modified link-based method”. It requires only the departure time at the boarding station and the arrival time at the alighting station of the respective “long link”; the arrival/departure times at the in-between stations can remain unknown.

As the link-based method does not offer any information about the transfer times and travel cost, the path-based method is still necessary to obtain the complete path travel information. Besides, the path travel times generated by the path-based method can serve as a validation for the results returned by its counterpart, i.e., generated by the link-based method.

The outputs of our three methods differ: the link-based method is good for estimating the travel times and crowdedness; the modified link-based method is good for calculating the specific travel times only; and the path-based method is good for generating travel times, transfer times and travel costs. The number of transfers can be inferred using either of the algorithms presented above. To do so, one can count the number of transfers within the link-addition steps in the link-based method, or within the trip-generation steps in the path-based method (Table 4).
Table 4 Path information generated by the link-based, modified link-based and path-based methods (columns: link-based method, modified link-based method, path-based method; rows: travel time, transfer time, travel cost, number of transfers)

Figure 6 provides an illustrative example of input data to run the three calculation methods described above. On the map excerpt of the figure, the stations A-E are located along the same bus route, with \(d_1\) through \(d_5\) denoting five bus runs that occur around the same time (e.g., within one hour of the morning peak hours) between these stations.

In the following illustration, \(s^a,s^b, s^c, s^d, s^e\) are five consecutive stations, \(d_1\) to \(d_5\) are five bus runs and \(R_1, R_2\) are two transactions from the AFC data. Nodes represent the actual visits of each bus run or the travel of a passenger. The travel time between stations \(s^a\) and \(s^e\) will first be calculated by using the link-based method, then the modified link-based method, and finally, the path-based method.
Fig. 6 Data excerpt for illustrating travel time calculation

In executing the link-based method, we first calculate the average travel times for the links between each two consecutive stations. The travel time along \((s^a,s^b)\) can be found by averaging the observations of the times for bus runs \(d_1\), \(d_3\) and \(d_4\). The travel time along \((s^b,s^c)\) can be found by averaging the observations of the times for bus runs \(d_1\) and \(d_3\). The travel time along \((s^c,s^d)\) can be estimated from only one observed run, \(d_1\). Finally, the travel time along \((s^d,s^e)\) can be found by averaging the observations of the times for bus runs \(d_1\), \(d_2\) and \(d_3\). By adding up all the link-based average travel time estimates, one can evaluate the total travel time between \(s^a\) and \(s^e\).

For a bus with no passengers to be dropped off or picked up, finding the departure and arrival times for all links may be a hard task. This is where a modified link-based method for calculating travel times is in order. This modified method only requires the departure time at \(s^a\) and the arrival time at \(s^e\): now one can look up the \(s^a\) to \(s^e\) travel times for runs \(d_1\), \(d_3\) and \(d_5\), and average them.

Finally, the path-based method uses individual AFC records, instead of aggregated bus run information. With records \(R_1\) and \(R_2\) available, \(R_1\) can inform the \(s^a\) to \(s^e\) travel time estimation directly (other records, where travel also occurs directly between stations \(s^a\) to \(s^e\), will improve this estimate further).
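The three calculations can be compared on a small numerical example. The sketch below uses made-up observation times and does not reproduce the data of Fig. 6. It also assumes a single time stamp per run and station, so dwell times are ignored, and all names (`RUNS`, `RECORDS`, `link_based`, and so on) are hypothetical.

```python
from statistics import mean

# Hypothetical times (minutes) at which each bus run was observed at a station.
# A missing station means no card swipe revealed the run's time there.
RUNS = {
    "d1": {"a": 0, "b": 4, "c": 9, "d": 13, "e": 18},
    "d2": {"d": 20, "e": 26},
    "d3": {"a": 10, "b": 16, "c": 20, "e": 30},
    "d4": {"a": 30, "b": 35},
    "d5": {"a": 40, "e": 59},
}
# Hypothetical AFC records: (boarding station, alighting station, boarding time, alighting time).
RECORDS = [("a", "e", 0, 18), ("b", "d", 16, 25)]


def link_based(runs, stations):
    """Average each link over the runs that observed both of its ends, then add the links up."""
    total = 0.0
    for u, v in zip(stations, stations[1:]):
        observed = [r[v] - r[u] for r in runs.values() if u in r and v in r]
        total += mean(observed)  # raises StatisticsError if a link was never observed
    return total


def modified_link_based(runs, origin, dest):
    """Average the end-to-end time over the runs observed at both path ends."""
    return mean(r[dest] - r[origin] for r in runs.values() if origin in r and dest in r)


def path_based(records, origin, dest):
    """Average the travel time of the AFC records that match the path exactly."""
    return mean(ta - tb for sb, sa, tb, ta in records if sb == origin and sa == dest)


print(link_based(RUNS, list("abcde")))      # 19.0
print(modified_link_based(RUNS, "a", "e"))  # 19
print(path_based(RECORDS, "a", "e"))        # 18
```

In this toy data the link-based estimate draws on all five runs, while the modified method can only use the three runs seen at both ends of the path.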

Illustrative Examples

In this section, we discuss the insights extracted about the Seoul transit system using our methods: at the link-, path- and system-levels. We compare the methods, and summarize their advantages and drawbacks. A modified link-based method is then introduced in the following section. By evaluating the travel times and/or crowdedness at each level, we are able to develop and compute a reliability index, hereafter called IQR (for interquartile range), and use it to identify the reliable/unreliable route segments and paths.

Specifically, this illustrative analysis case focuses on bus route 0017, passing through the Seoul downtown area, on the north side of Han River. There are 42 stations on this loop route. It passes through two metro stations (Hyochang Park and Yongsan Station), six schools, several businesses and residential areas.

Link-Level Travel Time, Crowdedness and Reliability

The travel time and level of service values along different links become available and can be compared after applying Algorithm 1 to route 0017 (Route ID: 11110897). We take the 12th route segment as an example to show how the link travel time and crowdedness values change over the course of a day. This segment lies between the stations Yongsan Underground Passage and Yongsan Station Exit 3, and is marked in red in Fig. 7.
Fig. 7 Location of bus route 0017 and segment 12


By Time/Hour of Day

During weekdays, the morning peak hours feature lower travel times than the afternoon peak hours do. Peak hour travel times have higher values than non-peak hours (Fig. 8).

As for the crowdedness, the afternoon peak hours turn out to be the most crowded times, followed by the morning peak hours, and then, non-peak hours (Fig. 9). To be precise, the morning peak hours and afternoon peak hours are defined to span the time periods 7–9 AM and 5–7 PM, respectively, and the calculation is based on the starting hour of traveling this 12th segment. The box plots in Figs. 10 and 11 illustrate this hourly trend of the travel time and crowdedness fluctuation more clearly.
Fig. 8 Travel time by time of day

Fig. 9 Crowdedness by time of day

Fig. 10 Travel time by hour of day

Fig. 11 Crowdedness by hour of day

By Segments

As shown in Fig. 12, the segment travel times on bus route 0017 vary in proportion to the segment lengths. Therefore, to compare the segments, we rescale the travel times as follows:
$$\begin{aligned} T^{r}_{D,i}(t)=\frac{T_{D,i}(t)-\text {Min}({\mathbf {T}}_{D,i}(t))}{\text {Median}({\mathbf {T}}_{D,i}(t))}. \end{aligned}$$
For each bus run D, we calculate the travel time spent in passing through every segment i, denoted by \(T_{D,i}(t)\). The rescaled segment travel time is then obtained, denoted by \(T^{r}_{D,i}(t)\). The minimum and median travel times are calculated within each \(T_{D,i}(t)\) group with the same values of D and i.
Each box in Fig. 13 represents the difference between the first and third quartiles, and this difference is defined as the Interquartile range, abbreviated IQR. IQR signals the stability (robustness) of the travel times evaluated for each segment. The lower the IQR, the more reliable the derived average travel time estimate is. Based on Fig. 13, Segment 7 is identified as the most stable one, and segment 15 is identified as the least stable in terms of travel time reliability.
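The rescaling and the IQR can be computed with the Python standard library, as sketched below. Two assumptions are made: the minimum and median are taken over all observed travel times of one segment, and the quartiles use the "inclusive" interpolation rule, since the text does not state which quartile convention was used.

```python
from statistics import median, quantiles


def rescale(times):
    """Rescaled travel times (T - min) / median for one segment's observations."""
    lowest, mid = min(times), median(times)
    return [(t - lowest) / mid for t in times]


def iqr(values):
    """Interquartile range: third quartile minus first quartile."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


segment_times = [2.0, 3.0, 4.0, 5.0, 6.0]  # hypothetical travel times in minutes
rescaled = rescale(segment_times)
print(rescaled)       # [0.0, 0.25, 0.5, 0.75, 1.0]
print(iqr(rescaled))  # 0.5
```

Because every segment is divided by its own median, the resulting IQR values can be compared across segments of different lengths.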
Fig. 12 Travel time by segment order

Fig. 13 Rescaled travel time by segment order

The per-segment-levels of service also vary. Figure 14 shows the crowdedness values for every segment of route 0017.
Fig. 14 Crowdedness by segment order

By plotting the travel time IQR and the hour of day, we find that the travel times are usually more unreliable (variable) during daytime: e.g., see this for segment 12 and segment 13. However, for some segments, it may be steady throughout the entire day: such is the case with segment 4. We can see that the travel time reliability is decided both by the traffic conditions and segment locations. The travel times for most of the segments are fairly reliable even during peak hours.

On the other hand, the reliability of crowdedness shows a double peak pattern for most of the segments: the number of passengers is most unreliable during the morning and afternoon peak hours (Figs. 15, 16).
Fig. 15 Travel time IQR and hour of day

Fig. 16 Crowdedness IQR as a function of hour of day

Path-Level Information and Reliability

The link-based, modified link-based and path-based methods are all capable of generating travel time estimates for any given path. The link-based and modified link-based methods both use bus run information, whereas the path-based method uses the AFC transaction data directly. In order to illustrate the differences in their output, the travel time between stations 72983 (Cheongshim) and 8599 (Yongsan Station Exit 3) was evaluated using the three methods. The path under investigation consists of 9 bus segments and 10 stations on the same bus route 0017.

For the link-based method, the first step is to add all bus travel times and idle times for those 9 segments for each bus run. Then, the 12:30 bus run that has the travel time information recorded for all 9 segments is aggregated. For the path-based method, 4969 trip observations were extracted, with all of them having station 72983 and station 8599 as their origin and destination stations, respectively (Fig. 17).
Fig. 17 The map of the path between station Cheongshim and Yongsan Station Exit 3 on bus route 0017


Figure 18 shows the difference between the travel time values obtained using the link-based, modified link-based and path-based methods. The link-based method calculates all link-level travel times and then adds them up to get the path travel time (this output is shown in red in the output plot). Note that all the segment travel times must come from the same bus run. In order to represent the variability in the travel time, we use a box plot. The modified link-based method finds the departure time at station 72983 and the arrival time at station 8599 and then takes the difference (this output is shown in green in the output plot). The path-based method extracts the transactions that have station 72983 as the boarding station and station 8599 as the alighting station, finds the travel times of those transactions and then takes their average (this output is shown in blue in the output plot). Observe that all the methods return a similar range for the travel time for this path. However, since we always begin by finding all the segments’ travel times, the first way of applying the link method turns out to rely on a smaller number of observations, which affects the resulting range and accuracy of the output. To recap, the obtained results confirm that both algorithms have merit but reveal that each method’s accuracy depends on the volume of the data that it can exploit.

Figure 18 shows that there are differences among the IQR values obtained from the above three methods at the same time of day. The reason is that the transaction data used by the three methods are not the same. The link-based method averages all the travel times for a link and adds up the averages. The path-based method only uses the transaction data with the same origin and destination as this “path”. Figure 18 also shows that there may be a positive correlation in travel times during the afternoon peak hours (low IQR, suggesting that the travel times are close). Possible reasons include consistent traffic conditions. On the other hand, there may be a negative correlation in travel times during the morning peak hours or non-peak hours. Possible reasons include inconsistent traffic conditions or uneven headways of bus runs.

There are also limitations to our proposed methods. The most obvious one is that the extracted bus timetable (including the arrival and departure times at each station) relies heavily on passengers’ card-swiping behavior. If passengers tend to swipe their cards long before the arrival or long after the departure, the extracted timetable will be inaccurate, especially when there are only a few boarding/alighting passengers at that station.
Fig. 18 Comparison of the link- and path-based outputs by time of day

System-Level Segment Reliability Analysis

By visualizing the rescaled travel times for each route link on a box plot, we assess the reliability of the links, by looking at the travel times’ IQR.

The larger the IQR, the lower the reliability of that link. The maps below show the segment-level IQR for the bus segments at different times of day. The green segments exhibit a smaller IQR (indicated as “Low” in the legend), which means that these segments have stable travel times during weekdays irrespective of time of day. The red segments are the ones with a high IQR value (indicated as “High” in the legend), meaning that the travel times on those segments are more unstable. The same color indicates the same level of IQR in Fig. 19a, c, e. A similar representation is used for the average speed at the segment level in Fig. 19b, d, f.

Based on the map, we see that most of the unstable segments are usually in the Central Business District (CBD), or the residential center of Seoul. The travel time for most segments is unstable during the non-peak hours, compared to the peak hours. Obviously, travel time stability for buses does not necessarily imply efficiency, i.e., lower travel times. For example, despite a stable travel time during the afternoon peak hours, the vehicle speed at that time is lower than at other times (Fig. 19c, d). On the contrary, overall, the IQR and vehicle speed are both higher during the non-peak hours (Fig. 19e, f). Figure 20 clearly illustrates the difference between the patterns observed at different times of day. A road segment in green signals that its non-peak hour bus travel speed is higher than the corresponding peak hour speed. A red segment, on the other hand, signals that its non-peak hour bus travel speed is lower than the corresponding peak hour speed. A yellow segment signals that its non-peak hour travel speed lies in between the two peak hour extremes. As we can see from the plots, most of the segments are in green, which shows that although the peak hour bus travel time IQR is low (signaling reliable travel times), the travel speed on those segments is also low. In short, during peak hours, the bus system of Seoul is congested but reliable.
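The color coding described for Fig. 20 can be expressed as a three-way comparison. The sketch below reflects one reading of the text, namely that a segment is green only when its non-peak speed exceeds both peak-hour speeds and red only when it falls below both. The function name and the speed values are hypothetical.

```python
def speed_class(non_peak, morning_peak, afternoon_peak):
    """Color of a segment from its non-peak speed relative to the two peak-hour speeds."""
    low, high = sorted((morning_peak, afternoon_peak))
    if non_peak > high:
        return "green"   # faster outside the peaks than in either peak
    if non_peak < low:
        return "red"     # slower outside the peaks than in either peak
    return "yellow"      # between the two peak-hour speeds


print(speed_class(30, 20, 25))  # green
print(speed_class(15, 20, 25))  # red
print(speed_class(22, 20, 25))  # yellow
```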
Fig. 19 Travel time and IQR of bus segments in Seoul

Fig. 20 Comparison between non-peak hour and peak hour travel speed

Monitoring System-Level Congestion

Monitoring the congestion of a transit system is important to the operators and planners of a transit agency. In this section, we propose a methodology that uses the obtained AFC data to identify the congested road segments and the congested road clusters. By identifying and visualizing the segments and clusters on the map, transportation practitioners can easily identify the congested areas and propose mitigation strategies accordingly.

Input data: processed AFC data. For different times of day t={MP (morning peak), AP (afternoon peak), NP (non-peak)}, we have an edge-weighted dynamic network \(G_{d,t}=(V,E,W_{d,t}=\{w^1_{d,t},w^2_{d,t},\ldots ,w^E_{d,t}\})\), where V is the set of vertices (stations), E is the set of edges (route segments), and \(w^e_{d,t}\) is the corresponding weight (route segment travel time) for edge \(e=1,2,\ldots ,E\) and time of day t={MP,AP,NP} on day \(d=1,2,\ldots ,D\).

Step 1: pick out the congested road segments by date (d) and time of day (t). We denote the 98th percentile value of the set \(W^e_t=\{w^e_{1,t},w^e_{2,t},\ldots ,w^e_{D,t}\}\) by \(p^e_t\) for edge e at time of day t. For a specific day d and time of day t, in network \(G_{d,t}\), edge e is identified as a congested edge if \(w^e_{d,t} \ge p^e_t\). We use the 98th percentile as the cutoff because our focus is on the highly congested clusters, i.e., the ones signaling traffic jams that need to be dealt with immediately. The other parameters of this method, given below, are chosen for the same reason.
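Step 1 can be sketched as a per-edge percentile test. The text does not say how the 98th percentile is computed, so the sketch assumes the nearest-rank definition. The function names and the `weights` layout (one list of D daily travel times per edge, for a single time of day) are hypothetical.

```python
import math


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def congested_edges(weights, day, pct=98):
    """Edges whose travel time on `day` (0-based index) reaches their own percentile cutoff."""
    return [
        edge
        for edge, series in weights.items()
        if series[day] >= percentile_nearest_rank(series, pct)
    ]
```

With 60 days of data and no ties, the nearest-rank cutoff is the 59th smallest value, so each edge is flagged on its two slowest days.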

Step 2: identify the congested clusters by date (d) and time of day (t). Using the DBSCAN clustering algorithm (Ester et al. 1996), we identify the congested clusters based on the congested segments obtained from Step 1. For date \(d = 1,2,\ldots ,D\) and time of day t={MP, AP, NP}, we have \(G_{d,t}=(V,E,W_{d,t})\). Starting from any congested segment in the graph \(G_{d,t}\), as long as its 5-mile neighborhood contains 2 or more other congested segments, these congested segments are assigned to one cluster; that is, the radius parameter of DBSCAN, \(\epsilon\), is 5 miles, and the other DBSCAN parameter, the minimum number of points required to form a dense region, MinPt, is 3. This step is repeated for all the points. The noise points are the isolated congested segments or the segments located in a low-density area. To constrain the size of a cluster, we only treat a cluster generated by DBSCAN as a congested cluster if it contains 20 or more congested segments.
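To make Step 2 concrete, the sketch below is a compact pure-Python version of DBSCAN followed by the cluster size filter. It is a minimal illustration rather than the implementation used in the study. It assumes that each congested segment is represented by one planar point with coordinates in miles and that distances are Euclidean. The function names are hypothetical.

```python
from collections import Counter
from math import hypot

NOISE = -1


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)  # None marks an unvisited point

    def neighbors(i):
        xi, yi = points[i]
        return [j for j, (xj, yj) in enumerate(points) if hypot(xi - xj, yi - yj) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:  # not a core point (the count includes i itself)
            labels[i] = NOISE
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:  # border point reached from a core point
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:  # j is a core point, so keep expanding
                queue.extend(reachable)
    return labels


def congested_clusters(points, eps=5.0, min_pts=3, min_size=20):
    """Run DBSCAN and return (labels, ids of clusters with at least min_size points)."""
    labels = dbscan(points, eps, min_pts)
    sizes = Counter(label for label in labels if label != NOISE)
    return labels, {c for c, n in sizes.items() if n >= min_size}
```

Returning the labels together with the set of retained cluster ids keeps three groups apart: noise points, small DBSCAN clusters below the size threshold, and the final congested clusters.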

In comparison to other clustering methods, DBSCAN has its own advantages: (1) Researchers do not have to pre-define the number of congested clusters before performing the clustering analysis. (2) The shape of a cluster depends only on the algorithm parameters, and this is realistic: the cluster may lie along a traffic corridor, around a special event, etc. (3) DBSCAN can detect noise segments, defined as the ones in low-density areas. (4) The clustering result is sensitive only to the algorithm parameters, and insensitive to other algorithm settings, such as the order of processing the segments.

Step 3: cluster visualization. In the final step, we visualize all the clusters on the map and show them as a day-to-day animation, to help transportation practitioners identify road congestion, seek its causes and come up with solutions. The kernel density of the clusters is also visualized on the map. Kernel density estimation is a non-parametric density estimation method, and the estimated density is added to the map as a red layer: the redder the layer, the higher the cluster density.

The processed AFC data we have are the travel time data for all bus route segments in Seoul over the year 2013. There are 12 weeks recorded, 1 week per month. We assume the roads are more congested during weekdays, so we only monitored and visualized the weekdays during these 12 weeks; hence, we have data for 60 days (5 days a week × 1 week per month × 12 months = 60 days in total). In the Seoul metropolitan area, there are 18234 road segments on which buses operate.

After performing Step 1 above, the number of identified congested road segments is listed in Table 5 for each time of day over this 60-weekday period. On average, for each day, more roads are congested in the non-peak hours. However, during the peak hours, the maximum (minimum) number of congested road segments is higher (lower) than in the non-peak hours; that is, extreme cases (i.e., severely congested road conditions) occur more readily during peak hours.

After performing Step 2, the descriptive statistics for the congested segments in clusters and for the congested clusters are listed in Table 5. On average, there are around 26, 32 and 34 congested segments per day during the morning peak, afternoon peak and non-peak hours, respectively. Most of the days do not have any congested cluster at all, but a traffic-busy day can have as many as 8 congested clusters spread across the whole city.
Table 5 Descriptive statistics for congested segments (columns: time of day; data description; total over 60 days; number per day, including the standard deviation and the 75th, 50th and 25th percentiles; rows: extracted congested segments, extracted congested segments in clusters, and congested clusters, each for the morning peak, afternoon peak and non-peak hours)

The cluster pattern varies from month to month and across times of day. Many statistics can be extracted from the result to monitor and evaluate the road congestion. For example, the number of days with congested clusters in each month, and the number of congested clusters on those days, are listed in Table 6. During the morning peak hours, although three days in March show congested clusters, the number of clusters and the cluster sizes are small. In September, by contrast, only two days show congestion, but the number of clusters and the cluster sizes are relatively large. In general, both the number of days and the number/size of clusters need to be taken into consideration when dealing with road congestion conditions in different months.
Table 6 Number of days having clusters and number of congested clusters per day (columns: number of days with congested clusters; number of clusters each day (M, Tu, W, Th, F))

Another tool that helps us understand the road congestion conditions is the cluster visualization map. Based on the parameter settings for DBSCAN and the cluster size threshold, we obtain a visualization map of the clusters, whose pattern varies from day to day. For example, the distribution of congested road clusters in the morning peak hours is shown in Fig. 21 (only the days with clusters are shown here). Smaller black dots are congested road segments that have been identified as noise points. A noise point is a congested road segment that belongs to none of the clusters. Larger black dots are identified DBSCAN clusters with a size of less than 20 (the pre-defined cluster size threshold). The colored points are the final congested clusters. Points in the same color are in the same cluster. The red kernel density layer is also visible on the map. Under our parameter settings, for the morning peak hours, the result shows that 16 out of 60 days have congested road clusters, but the patterns are quite different. For instance, panels a, j and k of Fig. 21 share a similar pattern: they all show an overall congested transit network but with less congestion density throughout the city. For the rest of the days, however, the pattern shows only one or two highly compact congested clusters. Decision-makers can decide which congestion pattern (i.e., dense local congestion or system-wide congestion) they want to investigate, based on Table 6 and the corresponding maps in Fig. 21.
Fig. 21 Dates with congested road clusters of Seoul metropolitan area, morning peak hour, 2013


This paper presented three data processing algorithms to infer transit environment information based on AFC data. Due to the multi-modal nature of what the AFC data capture, none of the algorithms is perfect in isolation, yet they complement each other by presenting the traits of the transit environment from different angles.

The observations made from AFC data are summarized as follows:
  • Using Route 0017 as an example, we find that the travel times in the afternoon peak hours are usually higher than in the morning peak hours for a specific road segment, and that the travel times in peak hours are usually higher than in non-peak hours. Similar trends are observed for crowdedness on this road segment.

  • Using Route 0017 as an example, by normalizing the travel times on each segment, we can observe and compare the service reliability of this route on each segment. The lower its IQR, the more reliable service this road segment has. Also, we observe that service reliability is fluctuating with time: some segments are always reliable, but some segments are not.

  • We compare the traffic information extracted by the three proposed methods: the link-based, modified link-based and path-based methods. Overall, the extracted information is in a similar range. The main reason behind some observable variations is that the transaction data used by the three methods are not the same.

  • We extract and compare the value and IQR of travel times of the public transit system in Seoul City. In conclusion, the bus system of Seoul is congested but reliable during peak hours; it is smooth but unreliable during non-peak hours. Please note the reliability we are measuring is focused on the variations of travel times for each run, instead of the accuracy of bus arrival/departure times according to its scheduled time table.

  • We identify when and where the congested roads show a clustered pattern by using the DBSCAN method. Two typical clustered patterns are observed: one shows one or a few highly compact congested clusters, and the other shows an overall congested transit network but with less congestion density throughout the city.

The results show that the transit environment information extracted is well-detailed to enable query-specific studies such as monitoring the performance of transit system elements, analyzing the reliability of bus and metro modes, and detecting road congestion. It allows one to observe how transit environment accessibility and utilization patterns vary from hour to hour, day to day, and even month to month.



This work was, in part, supported by the National Science Foundation Award 1636602 and Transportation Informatics University Transportation Center. The authors are grateful for their generous support.



Copyright information

© Springer Nature Singapore Pte Ltd. 2019

Authors and Affiliations

  1. University at Buffalo, The State University of New York, Buffalo, USA
  2. Yeungnam University, Gyeongsan, Korea
