Automotive Innovation, Volume 2, Issue 2, pp 127–136

End-to-End Self-Driving Using Deep Neural Networks with Multi-auxiliary Tasks

  • Dan Wang
  • Junjie Wen
  • Yuyong Wang
  • Xiangdong Huang
  • Feng Pei


End-to-end self-driving is a method that directly maps raw visual images to vehicle control signals using a deep convolutional neural network (CNN). Although steering angle prediction has achieved good results as a single task, current approaches do not effectively predict the steering angle and the speed simultaneously. In this paper, various end-to-end multi-task deep learning networks that combine a deep convolutional neural network with a long short-term memory recurrent neural network (CNN-LSTM) are designed and compared. These networks capture not only the visual spatial information but also the dynamic temporal information in driving scenarios, and improve steering angle and speed predictions. Furthermore, two auxiliary tasks based on semantic segmentation and object detection are proposed to improve the understanding of driving scenarios. Experiments are conducted on the public Udacity dataset and a newly collected Guangzhou Automotive Cooperate dataset. The results show that the proposed network architecture predicts steering angles and vehicle speed accurately. In addition, the impact of the multi-auxiliary tasks on network performance is analyzed with a visualization method that shows the saliency map of the network. Finally, the proposed network architecture has been verified on the autonomous driving simulation platform Grand Theft Auto V (GTAV) and on an experimental road, with an average takeover rate of two times per 10 km.


Keywords: Self-driving · Multi-auxiliary tasks · CNN-LSTM · Deep learning

1 Introduction

In recent years, with the development of deep learning and hardware improvements, autonomous driving technology has advanced rapidly. Traditionally, a rule-based modular method is adopted for self-driving systems [1, 2, 3, 4]. This method generally includes modules such as perception, fusion, decision and control, which makes it reliable and interpretable and makes errors easy to trace. However, such a method relies heavily on the sophisticated design of each module and on handcrafted features. This approach may omit useful information for decision-making and be incapable of covering various driving scenarios. Therefore, a rule-based system is limited in its ability to handle complex road situations. Moreover, the perception module often requires a cumbersome data labeling process that consumes substantial resources. In contrast, end-to-end imitation learning avoids complex module design by directly mapping raw data to control commands. In addition, supervised training data are easily obtained from the vehicle's controller area network (CAN) and camera recordings, avoiding intensive data labeling. In the 1990s, Carnegie Mellon University’s ALVINN system was the first to apply the end-to-end concept to autonomous driving [5]. The system used a fully connected neural network with only one hidden layer to predict the steering angle from image inputs and realized autonomous driving under various conditions. In 2006, LeCun et al. adopted the end-to-end approach and used a convolutional neural network to perform obstacle avoidance in self-driving [6]. As technologies based on CNNs have flourished in recent years [7, 8, 9, 10], the end-to-end autonomous driving concept has once again attracted research attention. In 2016, NVIDIA adopted an end-to-end deep convolutional neural network to achieve autonomous driving by controlling the steering wheel angle in various road environments [11]. Using a CNN alone can yield a suitable prediction of the steering angle.
However, a CNN relies only on the spatial information of a single-frame image and cannot capture the dynamic temporal information, which limits its performance in simultaneously predicting both the steering angle and the speed. Building on the long short-term memory (LSTM) recurrent neural network, Xu et al. [12] proposed a fully convolutional network (FCN) LSTM that is capable of learning both visual and temporal dependencies. An auxiliary semantic task further improved the network’s understanding of driving scenarios. Yang et al. [13] proposed a multi-modal and multi-task CNN-LSTM, which simultaneously inputs images and speed information into the network to predict the steering angle and speed.

In this paper, a CNN-LSTM network with multi-auxiliary tasks of semantic segmentation and object detection is proposed to achieve better vehicle control in end-to-end steering angle and speed prediction. The networks are evaluated off-line on both virtual and real datasets, and online tests are conducted on the GTAV virtual environment and real experimental roads. A visualization analysis is also implemented to demonstrate the impact of multi-auxiliary tasks on the network performance, explaining what the network learns and how it makes decisions in inference.

2 Architecture and Method

The CNN-LSTM model is trained using supervised learning. First, various CNN-LSTM models are designed and compared for end-to-end steering angle and speed prediction. Based on the benchmarked CNN-LSTM structure, the CNN–state-transitive LSTM structure is proposed to improve on the traditional LSTM structure with respect to inference time and memory consumption. Finally, the multi-auxiliary tasks of semantic segmentation and object detection are incorporated into the CNN–state-transitive LSTM model. In the training stage, the two auxiliary tasks share the CNN feature layer parameters with the end-to-end self-driving master model. The master model draws on the two auxiliary tasks while simultaneously predicting the steering angle and speed. Each auxiliary task needs its own sub-network to complete the semantic segmentation or the object detection. During training, the two auxiliary tasks influence the CNN feature layer parameters of the master model through backpropagation, ultimately affecting its steering angle and speed predictions. Thus, in inference, only the CNN-LSTM model parameters are needed to output the steering angle and speed, and the sub-networks for semantic segmentation and object detection are not needed.

2.1 Basic Network Architecture

A single image frame from the front-view camera is not sufficient for accurate speed prediction, because it carries no temporal information about the driving process. Even without obstacles, the same image may correspond to different speeds because the steering angle and vehicle speed interact with each other. Thus, it is crucial to add the LSTM structure to the CNN and to use the image sequences and the feedback speed as a low-dimensional feature to accurately predict the steering angles and speeds simultaneously. In this work, CNNs and CNN-LSTMs with different concatenated architectures are compared, and an improved CNN-LSTM structure is then proposed. Because a low-dimensional speed feature that is directly concatenated to the high-dimensional image feature may be ignored by the network, the original speed is encoded as a one-hot feature, which is then connected to the network through a fully connected (FC) layer. The three different network architectures in this paper are shown in Fig. 1. The CNN is the deep residual network ResNet 50 [10], pre-trained on the ILSVRC2012 image classification task. The LSTM structure is a two-layer LSTM with 256 neurons in each layer.
Fig. 1 Basic network architectures
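The one-hot speed encoding described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the paper does not state the number of bins or the speed range, so `max_speed_mps` and `num_bins` are assumed values, and `one_hot_speed` is a hypothetical helper name.

```python
def one_hot_speed(speed_mps, max_speed_mps=30.0, num_bins=30):
    """Encode a scalar speed (m/s) as a one-hot list of length num_bins.

    The bin count and speed range are assumptions for illustration.
    """
    # Clamp so that out-of-range readings fall into the first or last bin.
    clamped = min(max(speed_mps, 0.0), max_speed_mps)
    bin_width = max_speed_mps / num_bins
    index = min(int(clamped / bin_width), num_bins - 1)
    vector = [0.0] * num_bins
    vector[index] = 1.0
    return vector


encoded = one_hot_speed(12.4)
print(encoded.index(1.0))  # index of the active speed bin
```

In a full model, a vector like `encoded` would pass through a fully connected layer before being concatenated with the image feature, so that the speed input is not swamped by the much larger image feature.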

2.2 State-Transitive LSTM Architecture

The prediction of the steering angle and speed involves temporal dependencies: the steering angle and speed corresponding to the previous frame may affect the prediction for the next frame. The LSTM can address such a time sequence problem. Figure 2 shows that the inputs of the LSTM network are the image and speed feature sequences. The sequence length is between 10 and 16 frames. The end-to-end CNN-LSTM model consumes substantial time and memory in forward propagation because it calculates the features of the whole image sequence. For self-driving, the inference time of the network is crucial and directly affects the response time of the vehicle. To address this problem, a method is proposed to improve the CNN-LSTM structure. The improved LSTM network receives only the feature vector that the CNN encodes for the current frame and the state vector produced when processing the previous frame of the sequence. In contrast, the traditional LSTM structure takes in a sequence of feature maps, including those corresponding to the current frame and the previous frames. It is unnecessary to process the complete CNN feature vectors of 10 to 16 consecutive frames to output the predictions. The improved model retains the ability of the LSTM network to process continuous space–time states while greatly reducing the time and memory consumption. Eliminating the repeated inference calculations allows the improved CNN-LSTM structure to achieve real-time prediction.
Fig. 2 LSTM structures for inference (left column: traditional LSTM; right column: state-transitive LSTM)
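The difference between the two inference schemes can be sketched with a toy LSTM cell. This is an illustrative sketch under stated assumptions, not the authors' network: `TinyLSTMCell`, `traditional_inference` and `StateTransitiveInference` are hypothetical names, the weights are random, and short lists of numbers stand in for the per-frame CNN features. The traditional scheme re-runs the cell over the whole window for every prediction, whereas the state-transitive scheme keeps the hidden and cell states and advances them by a single step per new frame.

```python
import math
import random


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


class TinyLSTMCell:
    """Single-layer LSTM cell with random weights (illustrative only)."""

    def __init__(self, input_size, hidden_size, seed=0):
        rng = random.Random(seed)
        self.hidden_size = hidden_size
        rows = 4 * hidden_size  # input, forget, candidate and output gates
        cols = input_size + hidden_size
        self.weights = [[rng.uniform(-0.5, 0.5) for _ in range(cols)]
                        for _ in range(rows)]
        self.step_calls = 0  # counts how many frames have been processed

    def zero_state(self):
        return [0.0] * self.hidden_size, [0.0] * self.hidden_size

    def step(self, features, state):
        """Advance the (h, c) state by one frame of features."""
        self.step_calls += 1
        h_prev, c_prev = state
        joined = list(features) + list(h_prev)
        pre = [sum(w * v for w, v in zip(row, joined)) for row in self.weights]
        n = self.hidden_size
        i_gate = [sigmoid(p) for p in pre[0:n]]
        f_gate = [sigmoid(p) for p in pre[n:2 * n]]
        g_cand = [math.tanh(p) for p in pre[2 * n:3 * n]]
        o_gate = [sigmoid(p) for p in pre[3 * n:4 * n]]
        c_new = [f * c + i * g
                 for f, c, i, g in zip(f_gate, c_prev, i_gate, g_cand)]
        h_new = [o * math.tanh(c) for o, c in zip(o_gate, c_new)]
        return h_new, c_new


def traditional_inference(cell, window):
    """Re-run the cell over every frame in the window for one prediction."""
    state = cell.zero_state()
    for features in window:
        state = cell.step(features, state)
    return state[0]


class StateTransitiveInference:
    """Carry (h, c) between calls so each new frame costs one cell step."""

    def __init__(self, cell):
        self.cell = cell
        self.state = cell.zero_state()

    def predict(self, features):
        self.state = self.cell.step(features, self.state)
        return self.state[0]


# Sixteen frames of stand-in features (assumed values, not real CNN outputs).
frames = [[0.1 * t, -0.2, 0.05 * t] for t in range(16)]
cell = TinyLSTMCell(input_size=3, hidden_size=2, seed=1)

hidden_from_window = traditional_inference(cell, frames)
steps_traditional = cell.step_calls  # cell steps for one prediction

runner = StateTransitiveInference(cell)
for features in frames[:-1]:
    runner.predict(features)  # earlier frames were handled as they arrived
cell.step_calls = 0
hidden_from_state = runner.predict(frames[-1])
steps_state_transitive = cell.step_calls  # cell steps for the newest frame

print(steps_traditional, steps_state_transitive)
```

In this toy setting the two schemes return the same hidden state while the history fits in one window. Once the history is longer than the window, the carried state also summarizes older frames, so the outputs need not match exactly.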

2.3 Multi-auxiliary Task Learning

Drivers make driving decisions quickly by focusing on important information and understanding driving scenarios. An end-to-end autonomous driving model trained with deep learning imitates an expert driver through supervised learning, backpropagating from driving behavior labels such as the steering angle and speed. With these labels alone, it is difficult for the end-to-end model to learn and understand the mapping between significant feature information and the driving behavior. Nothing directs the model toward the salient regions that influence driving decisions, such as vehicles, pedestrians, traffic lights and the free space in the scene. If the network does not focus on salient regions, its control command predictions are adversely affected. Thus, multi-task learning is used with the multi-auxiliary tasks of semantic segmentation and object detection based on the CNN-LSTM. This facilitates focus on the salient regions as well as the understanding of driving scenarios.

The complete network architecture is shown in Fig. 3. The network uses image and speed sequences as inputs. The self-driving master network is a CNN–state-transitive LSTM. Sub-networks for the multi-auxiliary tasks of semantic segmentation and object detection are added after the shared CNN feature layers. The sub-networks are not used during model inference. The single-shot multi-box detector (SSD) [14] is used for object detection and the original DeepLabV3+ [15] method for semantic segmentation. For the SSD detection sub-network, the input is the last convolutional layer feature of the shared CNN ResNet 50 [10], which has 2048 dimensions. The SSD sub-network consists of four convolutional layers that produce sufficient multi-scale feature maps. Finally, an extra multi-box layer is added to output the predictions. For the DeepLabV3+ segmentation sub-network, the input is also the last convolutional layer feature of the shared CNN ResNet 50 [10]. Atrous spatial pyramid pooling (ASPP) is used for the encoder in the segmentation sub-network, and a decoder built from up-sampling and convolutional layers is then applied to output the predictions.
Fig. 3 End-to-end network architecture with auxiliary tasks
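How an auxiliary task shapes the shared parameters during training, while being unnecessary at inference, can be illustrated with a scalar toy model. This is a sketch under strong simplifying assumptions, not the paper's network: a single shared weight `w` stands in for the shared CNN layers, fixed scalars `a` and `b` stand in for the driving head and one auxiliary sub-network, squared error replaces the actual losses, and all names and numbers are hypothetical.

```python
def train_step(w, a, b, x, y_drive, y_aux, aux_weight, lr=0.1):
    """One gradient step on the shared parameter w.

    feature = w * x        (stand-in for the shared CNN layers)
    drive   = a * feature  (stand-in for the master driving head)
    aux     = b * feature  (stand-in for an auxiliary sub-network)
    """
    feature = w * x
    drive_error = a * feature - y_drive
    aux_error = b * feature - y_aux
    # Gradient of drive_error**2 + aux_weight * aux_error**2 with respect to w.
    grad_w = 2.0 * drive_error * a * x + aux_weight * 2.0 * aux_error * b * x
    return w - lr * grad_w


def drive_inference(w, a, x):
    """Inference needs only the shared parameter and the driving head."""
    return a * (w * x)


w_start = 0.5
w_without_aux = train_step(w_start, a=1.0, b=2.0, x=1.0,
                           y_drive=1.0, y_aux=3.0, aux_weight=0.0)
w_with_aux = train_step(w_start, a=1.0, b=2.0, x=1.0,
                        y_drive=1.0, y_aux=3.0, aux_weight=0.5)

print(w_without_aux, w_with_aux)
print(drive_inference(w_with_aux, a=1.0, x=1.0))
```

The auxiliary term changes the update of `w`, yet `drive_inference` never touches `b`, which mirrors why the sub-networks can be dropped once training is finished.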

2.4 Visualization Method

Understanding the prediction mechanism inside the end-to-end network is critical to the safety of autonomous driving. Visualizing salient regions is an important way to illustrate where the network focuses. Therefore, the VisualBackProp method proposed by Bojarski et al. [16] is used for visualization, as shown in Fig. 4.
Fig. 4 Visualization method
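The core idea of VisualBackProp, as commonly described, is to average the feature maps of each convolutional layer, then walk from the deepest layer back toward the input, scaling the running mask up to the size of the previous layer and multiplying it point-wise with that layer's averaged map. The sketch below follows this outline in a simplified form. Nearest-neighbour up-sampling is an assumption made here in place of the deconvolution used in the original method, and the function names and tiny activation grids are hypothetical.

```python
def channel_mean(feature_maps):
    """Average a list of HxW maps (one per channel) into a single HxW map."""
    channels = len(feature_maps)
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    return [[sum(fm[r][c] for fm in feature_maps) / channels
             for c in range(width)] for r in range(height)]


def upsample_nearest(grid, height, width):
    """Resize a grid to height x width by nearest-neighbour lookup."""
    src_h, src_w = len(grid), len(grid[0])
    return [[grid[r * src_h // height][c * src_w // width]
             for c in range(width)] for r in range(height)]


def multiply(left, right):
    """Point-wise product of two grids of the same size."""
    return [[a * b for a, b in zip(row_l, row_r)]
            for row_l, row_r in zip(left, right)]


def normalize(grid):
    """Rescale values to the range [0, 1]."""
    flat = [v for row in grid for v in row]
    low, high = min(flat), max(flat)
    if high == low:
        return [[0.0 for _ in row] for row in grid]
    return [[(v - low) / (high - low) for v in row] for row in grid]


def visual_backprop(layer_activations):
    """layer_activations: shallowest to deepest, each a list of channel maps."""
    averaged = [channel_mean(maps) for maps in layer_activations]
    mask = averaged[-1]
    for previous in reversed(averaged[:-1]):
        mask = upsample_nearest(mask, len(previous), len(previous[0]))
        mask = multiply(mask, previous)
    return normalize(mask)


# Hypothetical post-ReLU activations: a 4x4 shallow layer and a 2x2 deep layer.
zeros_4 = [[0.0] * 4 for _ in range(4)]
zeros_2 = [[0.0] * 2 for _ in range(2)]
shallow = [[[2.0, 2.0, 0.0, 0.0],
            [2.0, 2.0, 0.0, 0.0],
            [0.0, 0.0, 0.0, 0.0],
            [0.0, 0.0, 0.0, 4.0]], zeros_4]
deep = [[[6.0, 0.0],
         [0.0, 2.0]], zeros_2]

mask = visual_backprop([shallow, deep])
for row in mask:
    print([round(v, 2) for v in row])
```

Pixels keep a high mask value only where both the shallow and the deep layers respond, which is what makes the result readable as a map of the regions that drive the prediction.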

3 Experimental Data

3.1 GTAV and GAC Dataset

The GTAV dataset contains 30 h of data, corresponding to approximately half a million images sampled on the autonomous driving simulation platform of Grand Theft Auto V (GTAV). Figure 5 shows the complete map of the GTAV driving area. To verify the performance of the model on a real experimental road, we built a dataset owned by Guangzhou Automotive Cooperate (GAC), collecting more than 30 h of data on the main roads in Guangzhou. The GTAV and GAC datasets contain both daytime and nighttime data. Video streams, speed and steering angle data are recorded simultaneously. The video streams are captured from a center camera and two side front-view cameras, each with a frame rate of 30 frames per second. The data collected from the GTAV environment and the real experimental roads are shown in Fig. 6.
Fig. 5 GTAV map

Fig. 6 Data collected on GTAV and the main roads of Guangzhou

3.2 Data Pre-processing

Several data augmentation and data balancing techniques are adopted to improve the robustness and generalization of the end-to-end network for better prediction. Because various lighting conditions may affect the performance of an end-to-end network for image-based systems, the image saturation, contrast and brightness are adjusted with a certain probability to obtain more training samples of different lighting conditions. Moreover, a data synthesis technique is adopted to use images from the side cameras to generate simulated failure cases for training. According to the distribution of steering angles, the normal driving process yields too much steering angle data near zero; thus, the steering angles near zero are randomly discarded to avoid biasing toward going straight. Random horizontal flips are used for data balancing.
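The balancing steps described above (randomly discarding near-zero steering samples and applying random horizontal flips) can be sketched as follows. The thresholds and probabilities are assumed values, since the paper does not report them, and negating the steering label when an image is mirrored is a common convention assumed here rather than something the text states. `balance_and_flip` is a hypothetical helper, and images are represented as nested lists of pixel rows.

```python
import random


def balance_and_flip(samples, zero_band_deg=1.0, keep_prob=0.3,
                     flip_prob=0.5, seed=0):
    """Drop most near-zero steering samples and randomly mirror the rest.

    samples: list of (image_rows, steering_deg, speed_mps) tuples.
    All thresholds and probabilities are illustrative assumptions.
    """
    rng = random.Random(seed)
    balanced = []
    for image, steering, speed in samples:
        # Keep only a fraction of the samples recorded while driving straight.
        if abs(steering) < zero_band_deg and rng.random() > keep_prob:
            continue
        # Mirror the image left to right and negate the steering label.
        if rng.random() < flip_prob:
            image = [list(reversed(row)) for row in image]
            steering = -steering
        balanced.append((image, steering, speed))
    return balanced


raw = [([[1, 2, 3]], 0.2, 10.0),   # nearly straight
       ([[1, 2, 3]], 15.0, 8.0)]   # turning
print(balance_and_flip(raw, keep_prob=0.0, flip_prob=1.0))
```

With `keep_prob=0.0` and `flip_prob=1.0` the behavior is deterministic: the nearly straight sample is dropped and the turning sample is mirrored.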

4 Experiments

The proposed method is evaluated on the GTAV, GAC and Udacity datasets. The CNN and CNN-LSTM network structures are compared. Furthermore, CNN-LSTM networks with different concatenated architectures are compared. Finally, the state-transitive LSTM is compared with the traditional LSTM network structure. In addition, the multi-auxiliary tasks are introduced based on the CNN-LSTM network. The end-to-end self-driving network is evaluated with respect to different aspects including the architecture, time consumption, auxiliary tasks and salient region visualization. Real-world tests are conducted to evaluate the performance of the end-to-end self-driving models.

The hyper-parameters of the model are set as follows. The images are downsampled to 224 × 224 pixels, and the Adam [17] optimizer is used with an initial learning rate of 1e−4 and a batch size of four. The learning rate is decayed by a factor of 0.5 whenever the training loss plateaus. Gradient clipping with a threshold of 10 is applied to avoid gradient explosion in the LSTM. For the end-to-end self-driving master model, both the steering angle prediction and the speed prediction are treated as regression problems with the mean absolute error as the loss function. To alleviate the data imbalance that stems from going straight more frequently than turning, different loss weights are applied: data corresponding to a steering angle near zero have a small training loss weight, and data corresponding to turning have a higher weight. For the two auxiliary tasks, the detection loss is a regression loss, whereas the segmentation loss is a pixel-wise cross-entropy loss. Weight parameters are therefore tuned to balance the self-driving loss, the segmentation loss and the detection loss.

Off-line and online tests are used to evaluate the model. In the off-line tests, the mean absolute error (MAE) is calculated between the predicted steering angle (speed) and the label value. A lower MAE means that the predicted value is closer to the real value and that the model fits the data better. In the online tests, the intervention frequency is used as the evaluation index on the simulation platform GTAV and on the experimental road.
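The loss design in this section can be sketched in a few lines. This is a minimal illustration, not the authors' code: the weight values, the width of the "near zero" steering band, the normalization by the sum of the weights, and the names `steering_weights`, `weighted_mae` and `total_loss` are all assumptions, because the paper only states that straight-driving samples receive a smaller weight and that the three losses are balanced by tuned weights.

```python
def steering_weights(steering_labels_deg, straight_band_deg=1.0,
                     straight_weight=0.3, turning_weight=1.0):
    """Give near-zero steering samples a smaller weight (assumed values)."""
    return [straight_weight if abs(s) < straight_band_deg else turning_weight
            for s in steering_labels_deg]


def weighted_mae(predictions, labels, weights):
    """Weighted mean absolute error, normalized by the sum of the weights."""
    total = sum(w * abs(p - y)
                for p, y, w in zip(predictions, labels, weights))
    return total / sum(weights)


def total_loss(driving_loss, segmentation_loss, detection_loss,
               seg_weight=0.5, det_weight=0.5):
    """Combine the master loss with the two auxiliary losses."""
    return (driving_loss
            + seg_weight * segmentation_loss
            + det_weight * detection_loss)


labels = [0.0, 10.0]       # steering labels in degrees
predictions = [0.5, 8.0]   # hypothetical model outputs
weights = steering_weights(labels)

steering_loss = weighted_mae(predictions, labels, weights)
plain_mae = weighted_mae(predictions, labels, [1.0, 1.0])
print(steering_loss, plain_mae)
print(total_loss(steering_loss, segmentation_loss=2.0, detection_loss=4.0))
```

With equal weights, `weighted_mae` reduces to the plain MAE used as the off-line evaluation metric, so the same function covers both training and evaluation in this sketch.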

4.1 Comparison Between CNN and CNN-LSTM

In this paper, the end-to-end self-driving network model is used to simultaneously predict the steering angle and speed. As shown in Fig. 1 in Sect. 2.1, the inputs of network structure 1 are a single image frame and low-dimensional speed, in which the steering angle and speed are predicted by CNN. The inputs of network structure 2 are image and low-dimensional speed sequences, in which the steering angle is still predicted using a single image frame by the CNN, while the speed is predicted using the image and low-dimensional speed sequences by the LSTM network. The difference between structure 3 and structure 2 is that the CNN-LSTM in structure 3 predicts the steering angle and speed simultaneously using the image and low-dimensional speed sequences. The experimental results in Table 1 show that the speed prediction is more accurate and the MAE value is smaller for network structure 2 than network structure 1, indicating that LSTM is helpful for speed prediction. For both the GTAV dataset and the GAC real driving dataset, the MAE value obtained by network structure 3 for the steering angle and speed predictions is lower than that of network structure 2, demonstrating that the steering angle prediction also depends on temporal information. Combining the spatial and temporal characteristics of image and speed sequences facilitates superior steering angle and speed prediction.
Table 1 Network structure comparison
Columns: Network structure; Steering angle (°)/speed (m·s−1) MAE on the GTAV dataset; Steering angle (°)/speed (m·s−1) MAE on the GAC dataset

4.2 Comparison Between Traditional and State-Transitive LSTM

The input sequence length to the LSTM network is set to 16, which requires more time and memory consumption in every forward propagation process for the repeated calculation of image sequence features. The CNN-LSTM structure 3 (Fig. 1) is used because of its superior results, and the traditional LSTM is improved based on the state transfer method described in Sect. 2.2. The results in Table 2 show that the MAE values of both the steering angle and speed on the GTAV and GAC datasets are relatively close between the traditional LSTM and state-transitive LSTM. However, the prediction time is reduced from 300 to 100 ms. The modified LSTM network structure does not significantly adversely affect the prediction performance and adequately learns the temporal sequence characteristics of the images without repeated calculations. Because the prediction time is greatly reduced, the response time of the vehicle is noticeably improved during the real driving tests, such that the car can be controlled at a higher speed.
Table 2 LSTM comparison
Columns: Network structure; Prediction time (ms); Steering angle (°)/speed (m·s−1) MAE on the GTAV dataset; Steering angle (°)/speed (m·s−1) MAE on the GAC dataset
Rows: CNN–state-transitive LSTM3
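The practical meaning of the prediction times reported in this section (300 ms for the traditional LSTM and 100 ms for the state-transitive LSTM) can be illustrated with simple arithmetic: the distance the vehicle covers before a new command is available, and the highest command rate the network can sustain. The vehicle speed of 10 m/s is a hypothetical value chosen for illustration, and the helper names are not from the paper.

```python
def distance_during_inference(speed_mps, prediction_time_ms):
    """Distance travelled (m) while one prediction is being computed."""
    return speed_mps * prediction_time_ms / 1000.0


def max_update_rate_hz(prediction_time_ms):
    """Upper bound on control updates per second for a given latency."""
    return 1000.0 / prediction_time_ms


speed_mps = 10.0  # hypothetical vehicle speed (36 km/h)

traditional_m = distance_during_inference(speed_mps, 300)
state_transitive_m = distance_during_inference(speed_mps, 100)

print(traditional_m, state_transitive_m)
print(max_update_rate_hz(300), max_update_rate_hz(100))
```

At the assumed speed, the vehicle travels three times as far between commands with the slower network, which is consistent with the observation that the shorter prediction time allows the car to be controlled at a higher speed.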




4.3 Evaluation of Auxiliary Tasks

The experiments for multi-auxiliary tasks of semantic segmentation and object detection on the GTAV and GAC datasets are conducted based on the method described in Sect. 2.3. The CNN–state-transitive LSTM with structure 3 (Fig. 1) is used as the end-to-end self-driving master model because of its suitable prediction accuracy and time. The results with and without the use of the auxiliary tasks are compared. As shown in Table 3, the MAE values for the steering angle and speed predictions on both the GTAV and GAC datasets of the self-driving model with auxiliary tasks are lower than without auxiliary tasks. This result indicates that the auxiliary task training has a positive effect on the end-to-end self-driving model. The MAE value of the self-driving model with only the semantic segmentation auxiliary task is lower than the self-driving model with only the object detection auxiliary task. Thus, learning to understand the scene (semantic segmentation) may affect the final decision more than focusing on the salient regions (object detection). The MAE value is lowest when simultaneously using both the multi-auxiliary tasks of semantic segmentation and object detection. This further indicates that the multi-auxiliary tasks of semantic segmentation and object detection enable the end-to-end self-driving master model to understand the driving scenarios and focus on the salient characteristic information, ultimately yielding more accurate decisions.
Table 3 Auxiliary task evaluation
Columns: Auxiliary task usage; Steering angle (°)/speed (m·s−1) MAE on the GTAV dataset; Steering angle (°)/speed (m·s−1) MAE on the GAC dataset
Rows: CNN–state-transitive LSTM3 (no auxiliary tasks); CNN–state-transitive LSTM3 + object detection; CNN–state-transitive LSTM3 + semantic segmentation; CNN–state-transitive LSTM3 + semantic segmentation + object detection


4.4 Results on the Open Udacity Dataset

The performance of the final end-to-end model, which uses the CNN–state-transitive LSTM network structure 3 (Fig. 1) with the multi-auxiliary tasks of semantic segmentation and object detection, is further verified. The experiment is conducted on the open Udacity dataset, and the results are compared with the benchmark results of existing open end-to-end models such as PilotNet [11] by NVIDIA, the CNN-LSTM [18] by Kim and the CNN-LSTM [13] by SAIC. Because the open end-to-end self-driving models only predict the steering angle, the comparisons are conducted only with respect to steering angle prediction. Table 4 shows that the model outperforms the existing models on the Udacity test set. The combination of the CNN-LSTM structure with multi-auxiliary tasks significantly influences the ultimate self-driving decisions. Another reason for the significant performance improvement achieved by the method may be that our base model uses a more complex residual network (ResNet 50) than the simple CNNs of the open models. Nevertheless, the comparison of all of the experimental results reflects the effectiveness of the proposed network architecture.
Table 4 Comparison results on the open Udacity dataset
Columns: Model; Steering angle (MAE)
Rows: Nvidia’s PilotNet CNN [11]; Kim’s CNN-LSTM [18]; CNN–state-transitive LSTM3 + semantic segmentation + object detection


4.5 Visualization Results

With the visualization method described in Sect. 2.4, the visualization results of the different networks are compared. Figure 7 illustrates that each network focuses on important information for driving such as lane division lines and vehicles. The information of interest potentially exerts a significant impact on the predicted controls. According to the images in the left column, the CNN focuses mostly on the road signs and sky, while the CNN–state-transitive LSTM structure 3 focuses on lane lines, but also focuses on the lines on other lanes, potentially adversely affecting the lane keeping. In contrast, the CNN–state-transitive LSTM structure 3 with multi-auxiliary tasks not only focuses on the lane lines but also distinguishes the current lane lines from other lane lines. For the scenario in the left column, the steering prediction of the final model outperforms the other two, but its speed prediction does not show any significant improvements. This may be attributed to the fact that in scenarios where the road is straight and clear, the available speed range is too wide to make a precise prediction. The images in the right column show another driving scenario where an obstacle appears on the road. In this situation, the speed range is smaller to avoid collision. The right column shows that although all the three networks focus on the bus, the CNN–state-transitive LSTM structure 3 with multi-auxiliary tasks performs better than the other two networks for both steering and speed predictions.
Fig. 7 Comparison of network prediction visualization

4.6 Online Testing on Experimental Roads

In the previous sections, the effectiveness of the end-to-end self-driving model was verified using off-line and online tests. It was observed that the off-line test result of CNN–state-transitive LSTM structure 3 with the multi-auxiliary tasks of semantic segmentation and object detection was better than those of existing methods. The ultimate goal of the research is to achieve autonomous driving of real vehicles on real experimental roads. To further verify the effectiveness and reliability of the proposed end-to-end self-driving network architecture, the autonomous driving testing platform is implemented on a GAC Trumpchi GE3 vehicle, mainly using the Nvidia PX2 hardware platform, as shown in Fig. 8. The outer loop of the Guangzhou biological island is selected for the road test, which provides a relatively simple experimental scenario with few cars and pedestrians. The intervention frequency along this outer loop, which is approximately 10 km long, is used as the evaluation standard in the online test. The final model achieves an average intervention frequency of two over this route, as shown in Table 5. Because the prediction time of traditional LSTM network models is relatively long, which affects the vehicle control response time, the state-transitive LSTM network is used throughout the experiment. Table 5 demonstrates that the average intervention frequency for the CNN–state-transitive LSTM structure 3 with multi-auxiliary tasks is the lowest, which is consistent with the off-line test result. The average intervention frequency for the CNN is high, indicating that a single image frame input cannot capture the temporal dependencies. This manifested as abrupt changes in steering angle and sudden deviations from the road. Adding auxiliary tasks does not increase the inference time of the end-to-end self-driving model, but has a beneficial impact on the ultimate decision-making of the end-to-end autonomous driving system.
Fig. 8 Testing on experimental road

Table 5 Online testing results
Columns: Model; Intervention frequency
Rows: CNN–state-transitive LSTM3; CNN–state-transitive LSTM3 + object detection; CNN–state-transitive LSTM3 + semantic segmentation; CNN–state-transitive LSTM3 + semantic segmentation + object detection

5 Conclusions

The proposed network combining the CNN and state-transitive LSTM with multi-auxiliary tasks is robust and effective. Its improved performance and reliability are verified experimentally on the simulation platform GTAV and on a real self-driving vehicle. The online and off-line test results on the simulation platform for autonomous driving are consistent with the online test results in the real-world autonomous driving environment. Resource and environmental constraints limit the experiments to simple scenarios with few vehicles and pedestrians. In the future, large-scale data covering various driving scenarios will be collected to improve the network’s ability to handle complex road situations. The current network will be continually optimized and extended. For example, many scenarios that are rarely available in the real world, such as recovering from collisions and off-road driving, can be generated on the self-driving virtual simulation platform by using data synthesis technology. Virtual samples and real samples will be combined to train the end-to-end self-driving model by using generative adversarial networks. Moreover, reinforcement learning will be applied to the end-to-end self-driving system. Through the additional losses in reinforcement learning, the network model will be rewarded for progress and penalized for undesirable events, thereby improving its ability to respond to complex environments.



This work is funded by the Youth Talent Lifting Project of China Society of Automotive Engineers.

Compliance with Ethical Standards

Conflict of interest

On behalf of all authors, the corresponding authors state that there is no conflict of interest.


References

  1. Chen, C., Seff, A., Kornhauser, A., et al.: Learning affordance for direct perception in autonomous driving. In: IEEE International Conference on Computer Vision, pp. 2722–2730 (2015)
  2. Huval, B., Wang, T., Tandon, S., et al.: An empirical evaluation of deep learning on highway driving. Comput. Sci. Robot. (2015). arXiv:1504.01716
  3. Gurghian, A., Koduri, T., Bailur, S.V., et al.: DeepLanes: end-to-end lane position estimation using deep neural networks. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 38–45 (2016)
  4. Geiger, A., Lauer, M., Wojek, C., et al.: 3D traffic scene understanding from movable platforms. IEEE Trans. Pattern Anal. Mach. Intell. 36(5), 1012–1025 (2014)
  5. Pomerleau, D.A.: Knowledge-Based Training of Artificial Neural Networks for Autonomous Robot Driving. Robot Learning. Springer, Boston (1993)
  6. Muller, U., Ben, J., Cosatto, E., et al.: Off-road obstacle avoidance through end-to-end learning. Adv. Neural Inf. Process. Syst. 18, 739–746 (2005)
  7. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 25, 1097–1105 (2012)
  8. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. Comput. Sci. Comput. Vis. Pattern Recognit. (2014). arXiv:1409.1556
  9. Szegedy, C., Liu, W., Jia, Y., et al.: Going deeper with convolutions. Comput. Sci. Comput. Vis. Pattern Recognit. (2014). arXiv:1409.4842
  10. He, K., Zhang, X., Ren, S., et al.: Deep residual learning for image recognition. Comput. Sci. Comput. Vis. Pattern Recognit. (2015). arXiv:1512.03385
  11. Bojarski, M., Del Testa, D., Dworakowski, D., et al.: End to end learning for self-driving cars. Comput. Sci. Comput. Vis. Pattern Recognit. (2016). arXiv:1604.07316
  12. Xu, H., Gao, Y., Yu, F., et al.: End-to-end learning of driving models from large-scale video datasets. Comput. Sci. Comput. Vis. Pattern Recognit. (2017). arXiv:1612.01079
  13. Yang, Z., Zhang, Y., Yu, J., et al.: End-to-end multi-modal multi-task vehicle control for self-driving cars with visual perception. Comput. Sci. Comput. Vis. Pattern Recognit. (2018). arXiv:1801.06734
  14. Liu, W., Anguelov, D., Erhan, D., et al.: SSD: Single shot multibox detector. In: European Conference on Computer Vision, pp. 21–37 (2016)
  15. Chen, L.C., Zhu, Y., Papandreou, G., et al.: Encoder-decoder with atrous separable convolution for semantic image segmentation. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 801–818 (2018)
  16. Bojarski, M., Choromanska, A., Choromanski, K., et al.: VisualBackProp: visualizing CNNs for autonomous driving. arXiv preprint arXiv:1611.05418 (2016)
  17. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: International Conference on Learning Representations (2015)
  18. Kim, J., Canny, J.: Interpretable learning for self-driving cars by visualizing causal attention. In: IEEE International Conference on Computer Vision, pp. 2961–2969 (2017)

Copyright information

© China Society of Automotive Engineers (China SAE) 2019

Authors and Affiliations

  1. GAC R&D Center, Guangzhou, China
