1 Introduction

With the rapid development of society, urban traffic has increased substantially in recent years, leading to transportation problems such as congestion and accidents. Intelligent Transportation Systems (ITS) aim to address these problems and manage transportation intelligently. Traffic flow prediction, an essential task in ITS, predicts future flow from historical flows. Accurate prediction helps individual travelers make better travel decisions, alleviates traffic congestion, and improves operational efficiency for public transport and transport planning. Thus, accurate and efficient prediction is of great significance for ITS.

Many methods exist for traffic flow prediction, and they can be divided into three main classes: data-driven statistical methods, machine learning methods, and deep learning methods. Early work relied mainly on conventional statistical time-series methods such as the Auto-Regressive Integrated Moving Average (ARIMA) model [1] and its seasonal variant (SARIMA) [2]. However, many studies have found that traffic flow data are stochastic, highly variable, and nonlinear. Because ARIMA assumes linear relationships, it cannot adequately model such nonlinear traffic flow data.

Several machine learning approaches have also been proposed for traffic flow prediction, such as Support Vector Machines (SVM) [3], K-Nearest Neighbors (KNN), and online Support Vector Regression (SVR) [4]. KNN was first applied to traffic flow prediction in [5], and Sun et al. use a flow-aware WPT-KNN method to predict traffic parameters [6]. In [7], a spatio-temporal Bayesian multivariate adaptive-regression splines (ST-BMARS) model is developed to predict short-term freeway traffic flow. Additionally, an Artificial Neural Network (ANN) model is used for road traffic prediction and congestion control in [8].

In recent years, deep learning has drawn growing attention from researchers. Deep learning methods exploit deeper and more complex architectures to extract inherent features in data from the lowest to the highest level. Many deep learning methods have therefore been proposed for traffic flow prediction, such as the Stacked Auto-Encoder (SAE) [9, 10], Deep Belief Networks (DBN) [11, 12], and Recurrent Neural Networks (RNN). In [13], Ma et al. combined the Restricted Boltzmann Machine (RBM) with an RNN to form an RBM-RNN model that inherits the advantages of both. Zhao et al. proposed a Long Short-Term Memory (LSTM) based method that uses LSTM to extract the temporal features of traffic flow data [14]. Compared with other deep learning models, the Convolutional Neural Network (CNN) performs better at understanding and exploring the pattern characteristics of traffic data. Thus, in [15], a CNN-based method that learns traffic as images was proposed to predict traffic speed.

In general, short-term traffic flow prediction has been addressed well by some deep learning models, but several defects remain. First, long-term prediction is still not well solved. Second, existing models usually rely on a single parameter and ignore the objective correlation among traffic parameters. Third, most models adopt classic architectures with poor scalability and lack a design tailored to the specific prediction problem. In our work, exploiting the advantages of deep learning, and of CNNs in particular, we develop a novel CNN-based model called ResDeconvNN with three input channels and apply it to long-term traffic flow prediction. The spatio-temporal relations and the correlations among the three traffic parameters flow, speed and occupancy (FSO) are fully considered and used simultaneously in the prediction. We combine a residual network with a deconvolutional neural network to form the ResDeconvNN model, which extracts the spatio-temporal information of traffic patterns well. Experiments demonstrate that the proposed approach achieves lower mean relative error, mean absolute error, and root mean square error than the other existing methods.

2 Proposed Methodology

2.1 Basic Principle

It is generally acknowledged that CNNs show remarkable learning ability in pattern recognition and extract input features well. Compared with other deep learning models, a CNN has fewer weight parameters, and the raw data can be fed directly as input for automatic feature learning without distorting the input. Based on this, and to adapt to the transportation setting, we design a ResDeconvNN model for long-term traffic flow prediction. Flow, speed and occupancy (FSO) are the three main elements of traffic data; they are mutually correlated and describe the traffic state at a certain time and place. In this model, we exploit these correlations and predict the flow of the next day from these three historical parameters to improve prediction performance.

2.2 FSO Matrix Generation

The raw flow, speed and occupancy (FSO) data are collected by detectors on the road. Generally, FSO data from a detector arrive at a time interval of 5 min, and there is a certain distance between the detectors installed on the highway. For each FSO parameter, traffic information in both the time and space dimensions should be considered to predict traffic flow. We therefore let the x- and y-axes of a matrix represent the time and space dimensions. Mathematically, the time-space matrix is denoted by:

$$ X = \begin{bmatrix} x_{11} & x_{12} & \ldots & x_{1n} \\ x_{21} & x_{22} & \ldots & x_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ x_{m1} & x_{m2} & \ldots & x_{mn} \end{bmatrix} $$
(1)

Matrix X can be viewed as one of the three channels of an image, where n is the number of time intervals and m is the number of detector locations along the road. Pixel \( x_{ij} \) is the corresponding FSO value associated with space i and time j.

As mentioned above, the raw data are converted into 3 matrices that serve as 3 channels, representing the values of flow, speed and occupancy in one day, respectively. For each matrix, in the time dimension, we choose data collected from 7 am to 10 pm, because there is little traffic at night and its pattern is simple. This gives 180 time steps at a 5-min sampling interval, so the width of the matrix is 180. In the space dimension, we have 35 detectors and map their spatial sequence directly to the height dimension, so the height of the matrix is 35.
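As an illustration, the following minimal NumPy sketch (the array names and the per-day input shapes are assumptions, not the paper's code) stacks one day's per-detector flow, speed and occupancy series into the 35 × 180 × 3 matrix described above:

```python
import numpy as np

N_DETECTORS, N_STEPS = 35, 180   # 35 detectors; 7 am-10 pm at 5-min intervals

def build_fso_matrix(flow, speed, occupancy):
    """Stack one day's flow, speed and occupancy (each of shape (35, 180))
    into a single 3-channel spatio-temporal matrix of shape (35, 180, 3)."""
    x = np.stack([flow, speed, occupancy], axis=-1)
    assert x.shape == (N_DETECTORS, N_STEPS, 3)
    return x
```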

Finally, we merge the 3 channel matrices to generate a time-space FSO matrix. Considering the difference in the numerical range of each parameter, we normalize the data of each channel. Here, we adopt min-max normalization, defined as:

$$ x_{norm} = \frac{x - x_{min}}{x_{max} - x_{min}} $$
(2)

where \( x_{norm} \) is the normalized data of a channel, x is the original data of that channel, and \( x_{max} \) and \( x_{min} \) are the maximum and minimum values of the original data of that channel.
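A minimal sketch of this per-channel min-max normalization (the small epsilon is an added numerical safeguard, not part of Eq. (2)):

```python
import numpy as np

def minmax_normalize(x):
    """Normalize each channel of a (35, 180, 3) FSO matrix to [0, 1] as in Eq. (2)."""
    x = x.astype(np.float32)
    x_min = x.min(axis=(0, 1), keepdims=True)      # per-channel minimum
    x_max = x.max(axis=(0, 1), keepdims=True)      # per-channel maximum
    return (x - x_min) / (x_max - x_min + 1e-8)    # epsilon avoids division by zero
```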

2.3 The ResDeconvNN Model

The overall structure of the proposed ResDeconvNN model is shown in Fig. 1. Our method mainly incorporates two parts, where the first part is the residual net module and the second part is the deconvolutional neural network module.

Fig. 1.

The structure of the ResDeconvNN model, where Conv denotes a convolution layer, Max-Pooling a max pooling layer, Max unpool an unpooling layer, and Deconv a deconvolution layer.

First, to make long-term traffic flow prediction more accurate, we draw on the idea of residual learning: we introduce a residual structure to mitigate the vanishing-gradient problem in deep networks. Second, we design a deconvolutional neural network (DeconvNN) module to decode the traffic flow of the next day from the integrated spatial and temporal features.

The principle of the residual module is as follows.

Previous research has shown that as network depth increases, accuracy saturates and then degrades rapidly when more layers are added. Such degradation is not caused by overfitting, but by the fact that deeper networks become harder to optimize. However, adding identity mappings to a shallow network and reformulating its optimization greatly reduces the optimization difficulty of the whole network. Figure 2 shows a residual building block. Let the input of the block be x and the expected output be H(x) = F(x) + x. By connecting the input x directly to the output, the optimization goal is recast as the residual F(x) = H(x) − x. In most cases, optimizing F(x) is much easier than optimizing H(x).
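For illustration, an assumed Keras sketch of such a building block (the layer sizes and the 1 × 1 projection are illustrative choices, not the exact configuration listed in Table 1):

```python
import tensorflow as tf

def residual_block(x, filters, kernel_size=3):
    """H(x) = F(x) + x: two convolutions form F(x); the skip connection adds x back."""
    f = tf.keras.layers.Conv2D(filters, kernel_size, padding="same",
                               activation="relu")(x)
    f = tf.keras.layers.Conv2D(filters, kernel_size, padding="same")(f)
    if x.shape[-1] != filters:                       # 1x1 conv to match channel counts
        x = tf.keras.layers.Conv2D(filters, 1, padding="same")(x)
    return tf.keras.layers.ReLU()(tf.keras.layers.Add()([f, x]))
```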

Fig. 2.

Residual learning: a building block.

The Input Layer.

Unlike traditional models with only a single input channel, the input layer of our model has 3 channels, so that we can fully exploit the parametric correlation among flow, speed and occupancy. As described in Sect. 2.2, we transform the raw data into a spatio-temporal matrix with 3 channels. The input of the model is therefore a four-dimensional tensor whose dimensions are the batch size, the number of detectors, the number of time steps, and the number of channels.
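In Keras terms (a sketch, not the paper's exact code), the input and label tensors would look like:

```python
import tensorflow as tf

# One sample: 35 detectors x 180 time steps x 3 channels (flow, speed, occupancy);
# the corresponding label is the next day's flow, i.e. a single channel.
inputs = tf.keras.Input(shape=(35, 180, 3))   # batch dimension is implicit
labels_shape = (None, 35, 180, 1)             # flow of the next day
```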

The Convolution Layer.

The convolution layer is the key part of the model for learning the complex spatio-temporal characteristics of traffic data. In a convolution layer, the spatio-temporal feature maps of the previous layer are first convolved with different kernels. The convolution result is then fed into a nonlinear activation function to form more complex spatio-temporal features. The convolutional output can be written as:

$$ x_{j}^{l} = \varphi \left( \sum_{i = 1}^{c^{l-1}} x_{i}^{l-1} * k_{ij}^{l} + b_{j}^{l} \right) $$
(3)

where * denotes the convolution operation, l is the layer index, j is the index of the feature map at the lth layer, and \( c^{l-1} \) is the number of feature maps of the previous layer. \( x_{i}^{l-1} \) denotes an output feature map of the (l − 1)th layer; \( x_{j}^{l} \), \( k_{ij}^{l} \) and \( b_{j}^{l} \) represent the output feature map, kernel weights and bias at the lth layer. \( \varphi \) is the rectified linear unit activation function, defined as:

$$ \varphi \left( x \right) = \max \left( 0, x \right) $$
(4)
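In a deep learning framework, Eqs. (3) and (4) together correspond to a single convolution layer with a ReLU activation, e.g. (the filter count here is illustrative):

```python
import tensorflow as tf

x = tf.zeros([1, 35, 180, 3])                  # dummy 3-channel FSO input
conv = tf.keras.layers.Conv2D(filters=32, kernel_size=3, padding="same",
                              activation="relu")      # Eq. (3) plus the ReLU of Eq. (4)
feature_maps = conv(x)
print(feature_maps.shape)                      # (1, 35, 180, 32)
```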

The Pooling Layer.

The pooling layer down-samples the convolution result so as to filter redundant information from the traffic features. The pooling operation therefore not only reduces the size of the feature maps and the number of trainable parameters, but also retains the significant pattern information. Common pooling operations include mean pooling, max pooling, and stochastic pooling. In this paper, max pooling is employed.
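For example (the pool size is illustrative):

```python
import tensorflow as tf

feature_maps = tf.zeros([1, 35, 180, 32])      # dummy convolution output
pooled = tf.keras.layers.MaxPool2D(pool_size=2)(feature_maps)
print(pooled.shape)                            # (1, 17, 90, 32): spatial size roughly halved
```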

The Deconvolution Layer.

The deconvolutional neural network is mainly composed of deconvolution layers and unpooling layers. Deconvolution is the inverse of convolution. In our model, we relate the forward pass of the deconvolution to the backward pass of the convolution and realize the deconvolution operation by following the backward derivation of the convolution layer; this is also called transposed convolution. The output of the deconvolution layer can therefore be defined as:

$$ x_{j}^{l} = \varphi \left( \sum_{i = 1}^{c^{l-1}} x_{i}^{l-1} * \left( k_{ij}^{l} \right)^{R} + b_{j}^{l} \right) $$
(5)

where * denotes the convolution operation, R denotes matrix transposition applied to the kernel \( k_{ij}^{l} \), and \( \varphi \) is the activation function.
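As a sketch, the transposed convolution of Eq. (5) can be realized with a standard framework layer (the shapes and filter counts below are illustrative):

```python
import tensorflow as tf

encoded = tf.zeros([1, 9, 45, 64])             # dummy encoded feature maps
deconv = tf.keras.layers.Conv2DTranspose(filters=32, kernel_size=3, strides=2,
                                         padding="same", activation="relu")
upsampled = deconv(encoded)
print(upsampled.shape)                         # (1, 18, 90, 32): spatial size doubled
```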

The Unpooling Layer.

The pooling operation loses information and is therefore irreversible. However, we can still realize an unpooling operation by following the backward derivation of the pooling layer. Since we adopt max pooling, for the jth feature map of the lth pooling layer we record the location of the maximum value while computing the pooling result. The output of the lth unpooling layer can then be defined as:

$$ x_{j}^{l} = \mathrm{unmp}\left( x_{j}^{l-1}, \mathrm{argmax}_{j} \right) $$
(6)

where unmp denotes the unpooling operation and \( \mathrm{argmax}_{j} \) is the recorded index of the position of the maximum value.
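A minimal NumPy sketch of max pooling with recorded argmax positions and the corresponding unpooling of Eq. (6), for a single 2D feature map (the paper's own TensorFlow implementation may differ):

```python
import numpy as np

def max_pool_with_argmax(x, k=2):
    """Max-pool a 2D map x with a k x k window, recording where each maximum was."""
    h, w = x.shape[0] // k, x.shape[1] // k
    pooled = np.zeros((h, w))
    argmax = np.zeros((h, w, 2), dtype=int)
    for i in range(h):
        for j in range(w):
            window = x[i*k:(i+1)*k, j*k:(j+1)*k]
            r, c = np.unravel_index(window.argmax(), window.shape)
            pooled[i, j] = window[r, c]
            argmax[i, j] = (i*k + r, j*k + c)   # location of the maximum in x
    return pooled, argmax

def max_unpool(pooled, argmax, out_shape):
    """Eq. (6): scatter pooled values back to their recorded positions; the rest stays 0."""
    out = np.zeros(out_shape)
    for i in range(pooled.shape[0]):
        for j in range(pooled.shape[1]):
            r, c = argmax[i, j]
            out[r, c] = pooled[i, j]
    return out
```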

2.4 Model Optimization

To predict traffic flow, the model parameters must be trained on training samples, and a loss function is needed to describe the prediction accuracy of the model. In the training phase, the loss function of our model, optimized by stochastic gradient descent, is defined as:

$$ L = L_{mse} + L_{reg} + L_{mgdl} $$
(7)

where \( L_{mse} \) is the mean squared error (MSE) between the ground truth and the prediction, \( L_{reg} \) is a regularization loss that helps avoid overfitting, and \( L_{mgdl} \) measures the gradient difference loss between the predicted and real values.
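A hedged sketch of this composite loss (the exact form of the gradient difference term and the regularization coefficient are assumptions; the paper does not spell them out):

```python
import tensorflow as tf

def composite_loss(y_true, y_pred, weights, reg_coef=1e-4):
    """L = L_mse + L_reg + L_mgdl as in Eq. (7), for tensors of shape (B, H, W, 1)."""
    l_mse = tf.reduce_mean(tf.square(y_true - y_pred))
    l_reg = reg_coef * tf.add_n([tf.nn.l2_loss(w) for w in weights])
    # gradient difference loss: compare spatial gradients of truth and prediction
    gd_h = (tf.abs(y_true[:, 1:, :, :] - y_true[:, :-1, :, :])
            - tf.abs(y_pred[:, 1:, :, :] - y_pred[:, :-1, :, :]))
    gd_w = (tf.abs(y_true[:, :, 1:, :] - y_true[:, :, :-1, :])
            - tf.abs(y_pred[:, :, 1:, :] - y_pred[:, :, :-1, :]))
    l_mgdl = tf.reduce_mean(tf.abs(gd_h)) + tf.reduce_mean(tf.abs(gd_w))
    return l_mse + l_reg + l_mgdl
```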

3 Experiment and Results

3.1 Dataset Description

The real FSO data, associated with position and time, were collected in 2011 from detectors deployed on the Yan'an elevated highway in Shanghai. As shown in Fig. 3, the Yan'an elevated highway is marked in red; it connects the HongQiao transportation hub with the city center.

Fig. 3.

Location of the Yan'an elevated highway in Shanghai.

The process of our methodology is illustrated in Fig. 4. Owing to missing data from March 20 to March 23, only 361 days of data are available for the experiment. In addition, some abnormal elements need to be repaired, so the raw data are first preprocessed to remove them and are then transformed into spatio-temporal matrices with 3 channels. Because the previous day's data are used to predict the flow of the following day, there are 360 samples.

Fig. 4.

The process of the proposed methodology for traffic flow prediction.

For the division into training and test sets, we first form sample-label pairs from consecutive days: the traffic data of day i (the experimental inputs) form the ith sample, and the traffic data of day i + 1 (the experimental labels) form the corresponding label (i = 1, 2, …, 360). The 360 pairs are then shuffled to disrupt their order. As mentioned in Sect. 2.3, the 4th dimension of the input data is the number of channels; since the flow, speed and occupancy of the previous day are used to predict the flow of the next day, the 4th dimension of the data is 3 and that of the labels is 1. For the training set we select the 1st to 330th data and labels, and for the test set the 331st to 360th data and labels. That is, the training set contains 330 samples and the test set contains 30 samples.
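A minimal sketch of this pairing and split (the array names and the zero-filled placeholders are illustrative):

```python
import numpy as np

fso  = np.zeros((361, 35, 180, 3))     # placeholder: 361 days of 3-channel FSO matrices
flow = np.zeros((361, 35, 180, 1))     # placeholder: 361 days of flow matrices

idx = np.random.default_rng(0).permutation(360)   # shuffle the 360 day pairs
data, labels = fso[:-1][idx], flow[1:][idx]       # day i -> inputs, day i+1 -> labels
train_x, train_y = data[:330], labels[:330]       # 330 training samples
test_x,  test_y  = data[330:], labels[330:]       # 30 test samples
```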

3.2 Learning Rate and Network Iteration

We adopt an exponential decay schedule for the learning rate, a flexible method that adjusts the learning rate dynamically. In our model, the initial learning rate is 1.0, the decay coefficient is 0.5, the total number of iterations is 30000, and the learning rate is decayed every 2000 iterations. As training proceeds, the learning rate decreases exponentially, and the model eventually stabilizes at an optimal value. In our experiment, the value of the loss function L was 0.1545 at the beginning (with \( {\text{L}}_{\text{mse}} \) = 0.03827, \( {\text{L}}_{\text{reg}} \) = 0.00088, \( {\text{L}}_{\text{mgdl}} \) = 0.11535); during training the loss fluctuated slightly while decreasing and gradually converged. After 30000 iterations, it stabilized at 0.0550 (with \( {\text{L}}_{\text{mse}} \) = 0.00312, \( {\text{L}}_{\text{reg}} \) = 0.00077, \( {\text{L}}_{\text{mgdl}} \) = 0.05111).
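For reference, the described schedule can be expressed with a standard TensorFlow decay object (using staircase decay is an assumption; the paper only states the coefficients):

```python
import tensorflow as tf

schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1.0,   # initial learning rate
    decay_steps=2000,            # decay every 2000 iterations
    decay_rate=0.5,              # decay coefficient
    staircase=True)
print(float(schedule(0)), float(schedule(2000)), float(schedule(30000)))
# 1.0  0.5  ~3.05e-05  (= 1.0 * 0.5 ** 15 after 30000 iterations)
```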

3.3 Experimental Environment and Model Configuration

The experiments are conducted on a server with an i7-5820K CPU, 48 GB of memory, and an NVIDIA GeForce GTX 1080 GPU. The proposed model and the baseline models are implemented with the TensorFlow deep learning framework.

The configuration of our proposed model is shown in Table 1.

Table 1. Configuration of ResDeconvNN for traffic flow prediction

3.4 Results and Evaluation

We compare our method with several existing methods, including a basic method (RW, which predicts the current value using the last observed value), a classical method (ANN), and advanced deep learning methods (DBN, RNN and SAE). Prediction performance is measured by three criteria: Mean Relative Error (MRE), Mean Absolute Error (MAE), and Root Mean Square Error (RMSE). MAE and RMSE evaluate the absolute error between the prediction and the reality, while MRE evaluates the relative error. MRE, MAE and RMSE are defined as:

$$ MRE = \frac{1}{N}\sum_{i = 1}^{N} \frac{\left| y_{i}^{\prime} - y_{i} \right|}{y_{i}} $$
(8)
$$ MAE = \frac{1}{N}\sum_{i = 1}^{N} \left| y_{i}^{\prime} - y_{i} \right| $$
(9)
$$ RMSE = \sqrt{\frac{1}{N}\sum_{i = 1}^{N} \left( y_{i}^{\prime} - y_{i} \right)^{2}} $$
(10)

where \( y_{i} \) denotes the prediction for the ith test sample, \( y_{i}^{\prime} \) denotes the corresponding ground truth, and N denotes the number of samples in the test set.
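Direct NumPy implementations of Eqs. (8)-(10), keeping the paper's symbols (y is the prediction array and y_prime the ground truth, as defined above):

```python
import numpy as np

def mre(y_prime, y):   # Eq. (8): mean relative error
    return np.mean(np.abs(y_prime - y) / y)

def mae(y_prime, y):   # Eq. (9): mean absolute error
    return np.mean(np.abs(y_prime - y))

def rmse(y_prime, y):  # Eq. (10): root mean square error
    return np.sqrt(np.mean((y_prime - y) ** 2))
```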

The comparison results are shown in Table 2. In terms of MRE, our model obtains the lowest error among all methods except DBN, and its value is very close to that of DBN. In terms of the other two criteria, our method performs best among all six methods. We can therefore conclude that our method achieves better performance than the existing methods.

Table 2. Results of Experiment

Figures 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 and 16 show the predicted and real flow curves for randomly selected detectors on the Yan'an elevated highway in Shanghai. Figures 5, 6, 7, 8, 9 and 10 show the flow fitting curves of the 25th detector on December 12, and Figs. 11, 12, 13, 14, 15 and 16 show the flow fitting curves of the 2nd detector on December 28. As shown in these figures, the predicted curves fit the ground-truth curves closely, except where the ground-truth curve has pronounced peaks.

Fig. 5.

Prediction results of ResDeconvNN.

Fig. 6.

Prediction results of ANN.

Fig. 7.

Prediction results of DBN.

Fig. 8.

Prediction results of RNN.

Fig. 9.

Prediction results of SAE.

Fig. 10.

Prediction results of RW.

Fig. 11.

Prediction results of ResDeconvNN.

Fig. 12.

Prediction results of ANN.

Fig. 13.

Prediction results of DBN.

Fig. 14.

Prediction results of RNN.

Fig. 15.

Prediction results of SAE.

Fig. 16.

Prediction results of RW.

Figures 17, 18, 19 and 20 show heat maps of the predicted and real flow matrices. Figures 17 and 18 show the heat maps of the predicted and real flow matrix on December 2, respectively, and Figs. 19 and 20 show those on December 30. The heat maps clearly reveal the real traffic flow situation over a day.

Fig. 17.

Heat map visualized by the predicted flow matrix of the next day (2011/12/2).

Fig. 18.

Heat map visualized by the real flow matrix of the next day (2011/12/2).

Fig. 19.

Heat map visualized by the predicted flow matrix of the next day (2011/12/30).

Fig. 20.

Heat map visualized by the real flow matrix of the next day (2011/12/30).

4 Conclusions

In this paper, a model combining a residual network and a deconvolutional neural network is developed to predict long-term traffic flow accurately. The proposed method takes advantage of the residual idea, which allows it to learn the latent nonlinear traffic flow features. Its performance is also attributed to the correlation of FSO and the spatio-temporal correlations of the traffic data. Finally, we apply a deconvolutional neural network to decode the flow of the next day accurately. Experimental results show that our method is robust and obtains better prediction results than existing methodologies.