1 Introduction

In today’s age of data abundance, nearly every aspect of social activity is affected by the huge amounts of data available. This information is characterized by volume, velocity and variety, and is often referred to as “Big Data” [1]. Human life depends greatly on transportation systems, a domain that illustrates both the promises and the difficulties of Big Data. Intelligent transportation systems (ITS), introduced a few years ago, have gathered huge amounts of data over a vast geographical range that includes many countries. This data is very rich in information, but it is also disorganized and constantly growing. Experts could use it to analyze and understand traffic management and its performance, transportation patterns, and the nature of accidents. It could also support proactive, real-time measures concerning road safety, traffic snarls, road blocks and traffic flux.

Road safety has been considered a high-priority issue among highway authorities for many years. One of the most prominent applications of behavioral studies is proactive road safety diagnosis using surrogate safety methods. Traditional diagnosis is based on statistical methods using accident data, which require long observation periods: one must wait for enough accidents to occur in order to have a sufficiently large data volume for analysis. In the 1960s, many attempts were made to predict the number of collisions based on observations in which no collision was reached [2].

In recent years, most safety systems have been based on data taken during or after a crash. Crash data analysis has been used to predict accidents, and many different measures have consequently been taken to avoid them. With the involvement of big data, road safety could be improved at a much higher rate if the data is used properly. In large cities, ITS is being used to monitor and balance traffic congestion on the roads. In addition, many real-time monitoring systems have been installed either to provide security or to monitor irregular activities on the road. These monitoring systems could be used to analyze traffic data and hence the accidents occurring on the road.

This paper presents a review of Big Data and its involvement in road safety. The authors have tried to gather as much information as possible on big data and how it can be used to provide road safety to drivers. The paper also discusses the available methods and the types of data available during and after crashes that could be used to avoid or predict accidents. The remainder of this paper is organized as follows: Sect. 2 presents the background. Section 3 discusses the challenges of big data. Section 4 reviews existing studies that use big data analysis to increase road safety. Section 5 discusses the limitations of existing studies and future research directions. Section 6 introduces a proposed solution. Section 7 concludes the paper and suggests future work directions.

2 Background

Road and traffic accident analysis requires a deep knowledge of the factors affecting the uncertainty and unpredictability of accidents on the road. Accident data involves a number of variables and data sets and is discrete in nature. Therefore, the recorded data is heterogeneous and must be treated with variance [3]. Traffic accident analysis is mainly performed with two categories of techniques: statistical techniques and data mining techniques. Multiple studies have analyzed road accident data using statistical techniques [4, 5] in order to identify existing relationships between accidents and relevant data. Other studies have used data mining techniques [6, 7] in order to identify the main factors associated with road and traffic accidents. However, most of these techniques can only handle a small subset of traffic accidents, so certain relationships remain hidden. Nowadays, Big Data is increasingly being deployed by municipalities and police forces in order to make roads safer. Indeed, big data technology enables traffic monitoring systems that can handle large amounts of data. Therefore, effective strategies should be used to improve traffic operation using Big Data applications. Big Data generated from traffic congestion, car crashes, traffic flow, weather conditions, road design, human behavior and intelligent transport detection systems could be used to perform analysis and avoid accidents, hence improving safety on the road. Big Data thus plays an important role in providing safety to drivers.

2.1 Big Data Sources in Transportation

Big Data in the transportation system comes from many sources, such as traffic surveillance systems on the road and inside the vehicle. In addition, sensors installed in the vehicle play an important role in collecting data during normal, uncertain or crash conditions. At the same time, technologies such as the Global Positioning System (GPS), cellular phones, Bluetooth, ground-based radio navigation, Automatic Vehicle Identification, Automatic Vehicle Location and Radio Frequency Identification (RFID) play an important role in addition to the fixed infrastructure [8].

Many other data sources, such as mobile devices, social media data, demographic data, weather reporting systems, geometric characteristics, and crash data, are extensively used in traffic operation and safety management. However, efficient data integration and fusion have to be carried out to obtain the maximum output from the data, for example by integrating sensor data with GPS or road monitoring data. Therefore, a new method is required to filter the most important variables from the available data types, perform analysis and predict road safety. Data integration will play an important role in future Big Data applications in the transportation arena.

The benefits of Big Data technologies include direct and indirect applications. Direct applications include congestion reduction, incident prediction, and travel time estimation. Indirect applications are carried out through the enhancement of traffic modeling in the model development, calibration and validation processes. Traffic simulation could also be greatly improved based on real data collected from the field.

2.2 Big Data Applications

Traffic demand can vary from time to time on different roads. Traditionally, the volume-to-capacity ratio and the level of service (LOS) are used by transportation authorities as indicators for congestion control [9]. However, volume-to-capacity ratios lack the capability to capture the variability of congestion. For example, after an incident a number of cars may be diverted to other routes, depending on the conditions on a particular road. Big Data can provide authorities with more comprehensive and accurate real-time data about congestion from many different sources. They can zoom into a specific location and check the performance of the whole system in order to make decisions.
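
To illustrate why a single volume-to-capacity figure can hide the variability of congestion, the following minimal sketch computes the ratio per observation interval and compares the aggregated average with the peak. The hourly flows, the capacity of 2000 vehicles per hour and the function name vc_ratios are assumptions chosen for illustration; they are not taken from [9].

```python
from statistics import mean


def vc_ratios(volumes_vph, capacity_vph):
    """Return the volume-to-capacity ratio for each observation interval."""
    if capacity_vph <= 0:
        raise ValueError("capacity must be positive")
    return [v / capacity_vph for v in volumes_vph]


# Hypothetical hourly flows (vehicles per hour) on one road segment.
hourly_volumes = [600, 800, 1900, 2100, 1000, 700]
ratios = vc_ratios(hourly_volumes, capacity_vph=2000)

average_ratio = mean(ratios)  # what an aggregated indicator would report
peak_ratio = max(ratios)      # what per-interval data reveals
# Indices of the intervals in which demand reached or exceeded capacity.
congested_hours = [i for i, r in enumerate(ratios) if r >= 1.0]
```

In this example the average ratio stays well below 1 while one interval exceeds capacity, which is the kind of variability that an aggregated indicator does not capture.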

2.2.1 Real-Time Crash Control

In a normal road transportation system, drivers need to respond to many different complex events while maintaining high speed. These events include maneuvering the vehicle, taking instant decisions regarding routes, reading road signs and maintaining a safe distance from other maneuvering vehicles, all simultaneously. Hence, any additional disruption in the traffic conditions may cause a driving error that can eventually result in a crash. Crashes could be avoided by spotting disrupted traffic situations as early as possible. This would enable proactive measures such as sending warnings to the driver, applying traffic smoothing techniques, setting variable speed limits, keeping lanes and maneuvering the vehicle. Such actions would bring the traffic back to normal [10,11,12].

2.2.2 Vehicle Motion Planning

Many different driver braking behavior models have been proposed based on the potential field model [13,14,15]. A number of methods are discussed for avoiding other vehicles and keeping in lane. One technique assigns a potential field to obstacles in order to prevent crashes when a vehicle enters the potential field area within a certain distance. This could result in automatically decelerating the vehicle or alerting the driver to an incident that might occur, so that action can be taken. This method requires a large amount of data from each vehicle traveling in the same direction on a particular highway. Big Data can play an important role in capturing such data, analyzing it with high-speed data processing techniques and sending alerts to either the vehicle or mobile devices so that action can be taken before a crash occurs. However, this could be difficult in the case of sudden crashes or if an unusual scene occurs on the road, such as a blind corner at an intersection. The speed of the vehicle could be affected by the shape of the intersection or by the driver deciding to reduce speed to pass the intersection. Therefore, it is difficult to define a unique speed value for predicting the action [16].
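
The idea of assigning a potential field to an obstacle can be sketched as follows. This is only an illustration under stated assumptions: it uses a commonly used inverse-distance repulsive potential, which may differ from the formulations in [13,14,15], and the influence radius, the braking threshold and the function names are hypothetical.

```python
def repulsive_potential(distance_m, influence_radius_m, gain=1.0):
    """Inverse-distance repulsive potential; zero outside the influence radius."""
    if distance_m <= 0:
        raise ValueError("distance must be positive")
    if distance_m >= influence_radius_m:
        return 0.0
    return 0.5 * gain * (1.0 / distance_m - 1.0 / influence_radius_m) ** 2


def advise(distance_m, influence_radius_m=50.0, brake_threshold=0.002):
    """Map the potential to a hypothetical three-level advisory."""
    u = repulsive_potential(distance_m, influence_radius_m)
    if u == 0.0:
        return "none"  # vehicle is outside the obstacle's field
    # Inside the field: alert first, decelerate once the potential is high.
    return "brake" if u >= brake_threshold else "alert"
```

For example, with these assumed parameters advise(60.0) gives no advisory, advise(40.0) alerts the driver, and advise(10.0) requests deceleration.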

2.2.3 Vehicle Position Detection

Vehicle position or location can greatly affect the prediction of potential crashes if it is calculated precisely and in advance. However, creating a map of potential vehicles is a challenge because the velocity of each vehicle must also be taken into consideration. A vehicle detection method based on 3D point clouds is proposed in [17]. The data obtained from the 3D map could then be used to create a road map with multiple driving profiles, which can be compared in order to predict a crash. The point cloud based method has advantages in the accuracy and estimation of the motion. Every position is mapped on the standard map so that sensing data can be compared.

3 Big Data Challenges

Gartner defines big data challenges and opportunities as being three dimensional (3Vs model):

  • Volume (increasing amounts of data): There is a huge explosion in the data available. The challenge is no longer the availability, but the management of this data. In transport, the volume of data has increased because of growth in the amount of traffic (all modes) and detectors.

  • Velocity (speed of data in and out): The velocity of data has increased in transport due to improved communications technology and media (particularly fiber optic cabling) and increased processing power and speed for monitoring and processing. It is important to keep up with real-time data. This will help build better insights and enhance decision-making capabilities. Currently, there are a few reliable tools, though many still lack the necessary sophistication.

  • Variety (range of data types and their sources): Transportation big data can be obtained from many sources. The variety of transport-related data has increased significantly. Modern trains and aircraft report internal system telemetry in real time from anywhere in the world. Along with the rise in unstructured data, there has also been a rise in the number of data formats. Therefore, data integration would play an important role in Big Data applications in future transportation systems.

4 Utilizing Big Data Analysis to Increase Road Safety

Crash occurrences are often regarded as random events affected by human behaviour, roadway design, traffic flow and weather conditions. However, big data generated from the ITS could be leveraged to develop real-time traffic monitoring [18]. One of the key objectives in accident data analysis is to identify the main factors associated with road and traffic accidents. Accidents can then be prevented by detecting causal factors or traffic patterns early and providing timely alerts to drivers. In addition, the early detection of accident occurrence is important to save lives, and it can help to reduce losses and damage.

In [19], a system and method for classifying and identifying a driver using driving performance data has been proposed. The system provides an accurate and predictive way to measure and analyze driving behavior. Classification and identification of a driver (e.g., through a driving signature that represents driving patterns in a continuous manner) could be used for insurance purposes and/or driving risk evaluation. The system could also be used to analyze, classify, and/or provide feedback and coaching to drivers and vehicle owners, for example for fuel-efficient and environmentally friendly green driving, personal safety, family safety and fleet safety. However, such a system does not provide real-time detection of traffic accidents. Moreover, the author did not explain how the data of each individual vehicle is collected and processed.

In [20], the author explores information from In-Vehicle Data Recorders (IVDRs) to investigate undesirable driving events (such as hard braking, lane changing, and sharp turning) among 148 individuals. The information was logged over three years. The objective was to gain a deeper understanding of the heterogeneity among drivers with respect to behavior change over time, the effect of trip duration and the distribution of event counts. The paper introduced a statistical model that works on each driver’s data separately. In some respects drivers are similar, enabling the application of the same methodological approach to most drivers. In other respects, differences among drivers are substantial, and thus personalized examination of the data is advised. Analyzing individuals’ data may assist drivers, insurance agents, safety officers, or driving instructors who wish to understand how individuals’ behaviors can change over time and what variables explain this change. However, studying driver behavior using precise data from a small number of drivers is not enough to detect dangerous driving patterns. Therefore, more advanced analysis is required in which the behavior of thousands of drivers can be studied simultaneously.

In [18], the author introduced a real-time modeling framework to monitor traffic and study the relation between congestion and rear-end crashes. A Microwave Vehicle Detection System (MVDS) deployed on an expressway network was utilized to collect the traffic big data. It was found that congestion on urban expressways was highly localized and time-specific. Data mining (random forest) and Bayesian inference techniques were implemented in the real-time crash prediction models. The identified effects confirmed the significant impact of congestion on rear-end crash likelihood. In aggregate safety analysis, averaging congestion intensity might be the cause of the insignificant effects of congestion found in many crash frequency studies. Real-time congestion measurement based on Big Data is more desirable for identifying congestion patterns, as it considers both the temporal and spatial dimensions. This work focused on studying the effect of congestion on accident occurrence. However, multiple factors other than congestion could lead to crashes in the real world. Therefore, to fully realize the power of Big Data, more data sources should be utilized, especially real-time weather conditions, which are an important factor for expressway operation and safety.

The heterogeneous nature of road accident data makes the analysis task difficult. In [21], data segmentation is used to overcome this heterogeneity. A framework is proposed that uses K-modes clustering as a preliminary segmentation task. In addition, association rule mining is used to identify the various circumstances associated with the occurrence of an accident, both for the entire data set (EDS) and for the clusters identified by the K-modes clustering algorithm. The results reveal that the combination of K-modes clustering and association rule mining produces important information that would remain hidden if no segmentation were performed before generating association rules. In ITS, every car has its own mobile database of information, which makes the system database distributed. However, the mining technique is centralized on one server. Consequently, a real-time system will face the problem of increased overhead on the network communication system when sending and receiving alerts, new patterns and updated patterns.
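
As a small illustration of the association rule mining step, the sketch below computes the standard support and confidence measures of a rule over categorical accident records. The records and attribute names are invented for illustration and are not data from [21]; the clustering step is omitted, and the same functions could be applied to the records of each cluster separately.

```python
def support(records, itemset):
    """Fraction of records that contain every item in itemset."""
    itemset = frozenset(itemset)
    return sum(itemset <= r for r in records) / len(records)


def confidence(records, antecedent, consequent):
    """Estimate P(consequent | antecedent) from the records."""
    base = support(records, antecedent)
    if base == 0:
        return 0.0
    return support(records, set(antecedent) | set(consequent)) / base


# Hypothetical accident records, each a set of attribute=value items.
records = [
    frozenset({"weather=rain", "road=curve", "severity=high"}),
    frozenset({"weather=rain", "road=straight", "severity=low"}),
    frozenset({"weather=clear", "road=curve", "severity=low"}),
    frozenset({"weather=rain", "road=curve", "severity=high"}),
]

# Rule under test: {weather=rain, road=curve} -> {severity=high}
rule_support = support(records, {"weather=rain", "road=curve", "severity=high"})
rule_confidence = confidence(
    records, {"weather=rain", "road=curve"}, {"severity=high"}
)
```

A rule with high confidence inside one segment but low confidence over the entire data set is the kind of relationship that segmentation is meant to expose.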

The author of [22] proposed a method for analyzing driving behaviours based on large-scale and long-term vehicle recorder data. The method classifies drivers by their skill, safety, physical/mental fatigue, and aggressiveness. The study examined the usefulness of a dataset that is sparse but large-scale (over 100 fleet drivers) and long-term (10 months’ worth). The focus was on classifying drivers recently involved in accidents by examining correlations in driving behaviours. Drivers were classified using long-term records of their driving operations (braking, wheeling, etc.) with several attributes (max speed, acceleration, etc.). Following a machine learning approach, two methods were used to characterize drivers’ behaviours: an entropy-like model and a KL divergence model. Effective features were selected, and some informative outcomes were found. This work is an example of existing studies that do not consider real-time collection and processing of data, which makes them less efficient at providing timely information that can be used to improve road safety. In the future, however, real-time applications will be in higher demand in order to serve intelligent transportation applications.
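
The exact way [22] applies KL divergence is not detailed here. As one possible reading, the sketch below compares a driver's histogram of braking events with a fleet-wide reference histogram using the standard Kullback-Leibler divergence. The bin counts and function names are assumptions made for illustration.

```python
import math


def normalize(counts):
    """Turn raw histogram counts into a probability distribution."""
    total = sum(counts)
    if total == 0:
        raise ValueError("histogram is empty")
    return [c / total for c in counts]


def kl_divergence(p, q, eps=1e-12):
    """D_KL(P || Q) in nats for two discrete distributions of equal length."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of bins")
    # Bins with p_i == 0 contribute nothing; eps guards against q_i == 0.
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical braking-event counts in three deceleration bins
# (gentle, moderate, hard).
fleet = normalize([70, 25, 5])
driver = normalize([40, 40, 20])

score = kl_divergence(driver, fleet)  # larger means less like the fleet
```

A driver whose behaviour matches the fleet distribution gets a score of zero, while the score grows as the driver's share of hard braking events departs from the reference.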

Early detection of accidents can save lives and allows roads to be reopened more quickly, hence decreasing wasted time and resources. In [23], a real-time accident detection model is introduced that utilizes transport big data with computational intelligence techniques. In this model, Istanbul City traffic-flow data for the year 2015 from various sensor locations is processed using big data processing methodologies. The extracted features are then fed into a nearest neighbor model, a regression tree, and a feed-forward neural network model. The acquired raw data is passed through an ETL (Extract-Transform-Load) process using the Hadoop Distributed File System (HDFS) and Apache Spark. The original data is stored in SQL Server format and imported into the Hadoop environment via Sqoop. The imported data is then processed on a 10-PC cluster using Spark and HDFS. The results revealed that all models are very good at catching accidents; however, the number of false positives is considerably high. Indeed, road and traffic accidents are uncertain and it is difficult to predict incidents. Accordingly, some existing prediction models produce many false predictions. Such predictions may cause significant disturbances to the ITS. Hence, it is critical to find the features that provide accurate prediction and detection, and to design more accurate prediction models.

Self-driving vehicle technology promises to provide many economic and societal benefits, with safety at the top of the list. Trajectory or path planning is one of the essential and critical tasks in operating an autonomous vehicle. In [6], a method is proposed for predicting accidents and selecting a safe-optimal trajectory in an autonomous, cloud-based connected vehicle environment. The prediction is done by applying the Distributed Random Forest classification algorithm, and the estimated time of arrival is calculated using the Linear Regression (LR) algorithm. All experiments were done using 10-fold cross-validation with the H2O Big Data analytics software. Selecting the safe trajectory is based on Big Data mining and analysis of real-life accident data and real-time connected vehicle data. The trajectory is selected automatically, without any human intervention. Human involvement is limited to defining and prioritizing the driving preferences and concerns at the beginning of a planned trip. However, the proposed method still needs to be further tested in a more realistic environment.

In [5], an anticipation and alert system for congestion and accidents was designed by studying the causes of road accidents using big real-time accident data. The design divides the roadway into segments based on infrastructure availability. The system aims to prevent, or at least decrease, traffic congestion as well as crashes. It uses DSRC, cellular, Wi-Fi, and hybrid communication. The data is analyzed and validated using the H2O and R Big Data tools in a cloud infrastructure that combines historical data with the real-time data received from vehicles. The system receives online streamed data from vehicles on the road, in addition to real-time average speed data from roadside vehicle detectors, in order to (1) provide an accurate Estimated Time of Arrival (ETA) using a Linear Regression (LR) model, (2) predict accidents and congestion before they happen using Naive Bayes (NB) and Distributed Random Forest (DRF) classifiers, and (3) update the ETA if an accident or congestion takes place by predicting an accurate clearance time. The Lambda Architecture (LA) is considered a good fit for real-time solutions in Big Data analytics, as it has proved its scalability, robustness, ability for generalization, extensibility, and fault tolerance.
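
The ETA step can be illustrated with a one-variable ordinary least squares fit. The predictors actually used in [5] are not stated here, so this sketch assumes a single hypothetical predictor (vehicle density on a segment) and invented historical observations; it uses plain Python rather than the H2O or R tools mentioned above.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("predictor values must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical history for one segment:
# density (vehicles per km) against observed travel time (minutes).
density = [10, 20, 30, 40]
travel_time_min = [5.0, 7.0, 9.0, 11.0]

slope, intercept = fit_line(density, travel_time_min)
eta_min = slope * 25 + intercept  # predicted travel time at 25 vehicles/km
```

Summing such per-segment predictions along a route would give a route-level ETA, which could be recomputed whenever new real-time measurements arrive.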

In [24], the author built a prediction model for highway accidents. Data imbalance is one of the major problems encountered in training datasets for data mining. A dataset is imbalanced if the cases of the positive class are outnumbered by cases of the negative class. This can result in a high number of false negatives, mainly harming the minority class, which is the most important class. The primary approach for handling class imbalance is sampling, which transforms the dataset to be more balanced by adding or removing instances until a desired class ratio is reached. To overcome the imbalance, the author employed an over-sampling operation to repair the data and prevent biased results in the classification analysis. The data used to build the learning model was generated on the Gyeongbu Expressway, which connects Seoul and Busan. The data consists of text files containing traffic data created between January 1, 2011 and June 30, 2013. The traffic data is created by a vehicle detection sensor (VDS), which measures the speeds of cars every 30 s and records the number of cars on the road. The Hadoop framework was utilized to process and analyze the big traffic data efficiently. The performance of the data mining process was tested using total and target precision.
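
The over-sampling idea can be sketched as follows. The specific over-sampling variant used in [24] is not stated here, so this example assumes simple random over-sampling with replacement for a two-class problem; the sensor rows and the function name are hypothetical.

```python
import random


def random_oversample(rows, labels, minority_label, seed=0):
    """Duplicate randomly chosen minority rows until both classes are equal in size."""
    rng = random.Random(seed)
    minority = [r for r, y in zip(rows, labels) if y == minority_label]
    if not minority:
        raise ValueError("no minority samples to duplicate")
    majority_count = sum(1 for y in labels if y != minority_label)
    extra = max(0, majority_count - len(minority))
    # Sample with replacement from the minority class only.
    new_rows = list(rows) + [rng.choice(minority) for _ in range(extra)]
    new_labels = list(labels) + [minority_label] * extra
    return new_rows, new_labels


# Hypothetical detector rows: (mean speed in km/h, vehicle count);
# label 1 marks an interval with an accident, 0 an interval without one.
rows = [(95, 12), (90, 15), (88, 14), (92, 11), (35, 40)]
labels = [0, 0, 0, 0, 1]

balanced_rows, balanced_labels = random_oversample(rows, labels, minority_label=1)
```

Because the duplicated rows carry no new information, over-sampling is normally applied to the training split only, so that evaluation still reflects the original class ratio.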

Fig. 1. Comparison table based on 3Vs

5 Discussion and Future Research Directions

This section points out the limitations of existing studies. A comparison table of the different studies based on the three Vs is shown in Fig. 1, which shows that the majority of related works do not give a solution to the three big data challenges. Consequently, some critical issues may arise and need to be considered in future research to fully utilize the transportation big data.

  • The Big Data generated by ITS is worth further exploration to realize its full potential for more proactive traffic management. Most existing studies focus on one or two transportation big data sources, whereas there exist many other sources that produce valuable information (e.g. smartphones, traffic lights, weather stations, etc.).

  • Transportation big data can be obtained from many sources. However, such data is highly heterogeneous. Therefore, data integration would play an important role in Big Data applications in the future transportation systems.

  • Traffic and transportation systems simulation software can be greatly improved based on the real data collected from the intelligent transportation systems. Such simulation software will play an essential role in accelerating the development of ITS while reducing the costs of testing new applications.

Fig. 2. Proposed solution architecture

6 Proposed Solution

The data collected through the technologies of intelligent transportation systems (ITS) is increasingly complex and is characterized by heterogeneous formats, large volume, nuances in spatial and temporal processes, and frequent real-time processing requirements. Simple data processing, integration, and analytics tools do not meet the needs of complex ITS data processing tasks. Based on our review of related work, we conclude that an efficient method needs to address the three main big data challenges (volume, variety and velocity). In the following discussion, we present our approach, which combines a number of technologies and tools in order to address these challenges. The overall architecture of the proposed solution is shown in Fig. 2.

  • Solution to the Volume Problem

    A Hadoop system is created to handle massive amounts of data. Hadoop is the most well-known framework for big data processing and was developed to process large data sets in a distributed manner, with data stored on different nodes. It was designed for scalable applications and offers its own type of storage. Hadoop is not only a storage system but a platform for both large-scale data storage and processing. In fact, Hadoop can be divided into two parts: processing and storage. MapReduce is a programming model that allows the processing of huge amounts of data stored in Hadoop. However, MapReduce reads from and writes to disk, which slows down processing. Therefore, MapReduce is not suited to real-time data processing, as it was designed to perform batch processing on voluminous amounts of data.

    Apache Spark is an open-source, distributed processing system used for big data workloads. It utilizes in-memory caching and optimized query execution to run fast analytic queries against data of any size. It can also process real-time data: Spark’s strength is its ability to process live streams efficiently, which it achieves by reducing the number of read/write cycles to disk and storing intermediate data in memory. However, as it does not have its own storage system, it runs analytics on other storage systems.

    Beyond the differences in the design of Spark and Hadoop MapReduce, many organizations have found these big data frameworks to be complementary, using them together to solve a broader challenge. Hadoop MapReduce is used for batch processing, whereas Spark can be used for both batch and real-time processing. In this regard, Hadoop users can run MapReduce tasks where batch processing is required. Spark uses the best parts of Hadoop for reading and storing data. Hence, MapReduce and Spark can be used together, with MapReduce used for batch processing and Spark for real-time processing.

  • Solution to the Velocity Problem

    Data velocity management is a key component of Big Data analytics, not because of the speed at which data arrives at the data warehouse but because of the processing of that data. Data in ITS may arrive from many different sources, such as built-in vehicle sensors, data monitoring systems, external sensors, live cameras or IoT devices installed either inside or outside the vehicle. Data velocity is associated not only with speed but also with volume, because in the end the data has to be processed. It would be very difficult to send a large amount of data at very high speed, process it in the data warehouse and then send the results back to the vehicle so that a decision can be taken. A major challenge in predicting road conditions or accidents is therefore where the processing happens. At the moment, data is sent to a data warehouse to be processed and to produce a prediction or another output. This process may take a large amount of time and may not be feasible for predicting immediate hazards on the road. A mechanism must be built in which data processing can be executed within the vehicle after obtaining data from internal and external sources. This could involve installing an intelligent artificial neural network system that analyzes the data in real time and predicts the situation on the road so that the driver or the vehicle can take a decision.

    First, the data must be filtered within the vehicle to separate good data from bad data. This will allow the computational systems to process the good data faster while the vehicle is in motion. Second, the data must be processed within the vehicle so that immediate action can be taken against hazards. This would involve having a large cache in the on-board computer system, which can drastically reduce the processing time. The data can then be sent to the central location over a very high speed network infrastructure for later analysis. At the same time, customizable applications must be installed within the vehicle so that processing can be tailored to the traffic pattern, data latency, data filtration, and data processing.

  • Solution to the Variety Problem

    Data generated by ITS comes in different formats, such as numerical data gathered through sensors on both infrastructure and vehicles, multimedia and text data captured from social media, and GIS and image data loaded for digital maps. Collected data can therefore be structured or unstructured and comes from a wide variety of sources. One challenge involves finding methods to deal with unstructured data. Deep learning networks can learn to make sense of the data’s various input formats and feed the result into other networks to harvest meaning from the data. A second challenge consists of identifying the data that pertains to the decision-making process. Techniques related to data searching and filtering can be used for this purpose.
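
The MapReduce programming model referred to under the volume problem can be sketched in plain Python as follows. This single-process illustration only mirrors the map, shuffle and reduce phases; it does not use the Hadoop or Spark APIs, and the sensor identifiers and readings are hypothetical.

```python
from collections import defaultdict


def map_phase(records):
    """Map: emit a (sensor_id, speed) pair for every raw record."""
    for sensor_id, speed_kmh in records:
        yield sensor_id, speed_kmh


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse the values of each key into an average speed."""
    return {key: sum(values) / len(values) for key, values in groups.items()}


# Hypothetical detector readings: (sensor id, speed in km/h).
readings = [("S1", 80.0), ("S2", 40.0), ("S1", 90.0), ("S2", 60.0), ("S3", 100.0)]

average_speed = reduce_phase(shuffle_phase(map_phase(readings)))
```

In a real deployment the map and reduce functions would run in parallel on different nodes and the shuffle would move data between them, which is where the disk and network costs discussed above arise.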

7 Conclusion and Future Work

Big Data plays an important role in the rapid development of intelligent transport systems. Traffic accidents are unpredictable most of the time. However, the real-time nature of big data and the fast analysis of data are vital for crash prediction in the transportation system. Traditional congestion methods lack the ability to capture the variability or dynamics of congestion, so real-time congestion management based on Big Data is desirable. In this study, many techniques such as congestion control, variable mapping, cloud mapping, sensing data, vehicle location and vehicle velocity monitoring were surveyed in order to provide detailed information on how Big Data could be utilized in traffic monitoring systems to allow smooth movement on the highway with very few accidents. Many different methods suggested by authors were discussed and presented to show the importance of Big Data in intelligent transport systems. The authors would like to continue research on the topic and develop a methodology to predict road accidents or safety. The authors aim to collect real-time data from a transportation system running in the country and propose different techniques to predict or avoid accidents on the road. In conclusion, the application of Big Data for better operation should emphasize real-time monitoring of traffic conditions and a quick response based on the retrieved data. This is possible only if the method used is able to cope with data variety, velocity and volume at the same time. The authors of this paper proposed a solution combining methods and techniques that address the three big data challenges. One future direction consists of implementing the proposed model and testing it on some ITS applications. Another research direction consists of addressing security and privacy problems. Indeed, although the purpose of accident detection and prevention applications is to improve transportation safety, such applications need to consider the privacy, data protection and security issues of the participating individuals.