1 Introduction

Autonomous driving has seen dramatic advances in recent years, for instance in road scene parsing [1,2,3,4], lane following [5,6,7], path planning [8,9,10,11], and end-to-end driving models [12,13,14,15]. By now, autonomous vehicles have driven many thousands of miles and companies aspire to sell such vehicles in a few years. Yet, significant technical obstacles, such as the necessary robustness of driving models to adverse weather/illumination conditions [2,3,4] or the capability to anticipate potential risks in advance [16, 17], must be overcome before assisted driving can be turned into full-fledged automated driving. At the same time, research on the next steps towards ‘complete’ driving systems is becoming less and less accessible to the academic community. We argue that this is mainly due to the lack of large, shared driving datasets delivering more complete sensor inputs.

Fig. 1. An illustration of our driving system. Cameras provide a 360-degree view of the area surrounding the vehicle. The driving maps or GPS coordinates generated by the route planner and the videos from our cameras are synchronized, and both are used as inputs to train the driving model. The driving model consists of CNNs for feature encoding, LSTMs to integrate the outputs of the CNNs over time, and fully-connected networks (FN) to integrate information from multiple sensors and predict the driving maneuvers

Surround-View Cameras and Route Planners. Driving is inarguably a highly visual and intellectual task. Information from all around the vehicle needs to be gathered and integrated to make safe decisions. As a virtual extension of the limited field of view of our eyes, side-view mirrors and a rear-view mirror have been used since 1906 [18] and have since become obligatory. Human drivers also use their internal maps [19, 20] or a digital map to select a route to their destination. Similarly, for automated vehicles, a decision-making system must select a route through the road network from the current position to the requested destination [21,22,23].

As said, a single front-view camera is inadequate to learn a safe driving model. It has already been observed in [24] that upon reaching a fork - and without a clear-cut idea of where to head - the model may output multiple widely discrepant travel directions, one for each choice. This results in unsafe driving decisions, like oscillations in the selected travel direction. Nevertheless, current research often focuses on this setting because it still allows one to study plenty of challenges [6, 12, 25]. This is partly due to the simplicity of training models with a single camera, both in terms of available datasets and of the complexity an effective model needs to have. Our work includes a surround-view camera system, a route planner, and a data reader for the vehicle’s CAN bus. This setting provides a 360-degree view of the area surrounding the vehicle, a planned driving route, and the ‘ground-truth’ maneuvers by human drivers. Hence, we obtain a learning task similar to that of a human apprentice, where a (cognitive/digital) map gives an overall sense of direction, and the actual steering and speed controls need to be set based on the observation of the local road situation.

Driving Models. In order to keep the task tractable, we chose to learn the driving model in an end-to-end manner, i.e. to map inputs from our surround-view cameras and the route planner directly to low-level maneuvers of the car. The incorporation of detection and tracking modules for traffic agents (e.g. cars and pedestrians) and traffic control devices (e.g. traffic lights and signs) is future work. We designed a specialized deep network architecture which integrates all information from our surround-view cameras and the route planner, and then maps these sensor inputs directly to low-level car maneuvers. See Fig. 1 and the supplemental material for the network’s architecture. The route planner is exploited in two ways: (1) by representing planned routes as a stack of GPS coordinates, and (2) by rendering the planned routes on a map and recording the progression as a video.

Our main contributions are twofold: (1) a new driving dataset of 60 h, featuring videos from eight surround-view cameras, two forms of data representation for a route planner, low-level driving maneuvers, and GPS-IMU data of the vehicle’s odometry; (2) a learning algorithm to integrate information from the surround-view cameras and planned routes to predict future driving maneuvers. Our experiments show that: (a) 360-degree views help avoid failures made with a single front-view camera; and (b) a route planner also improves the driving significantly.

2 Related Work

Our work is relevant to (1) driving models, (2) assistive features for vehicles with surround-view cameras, (3) navigation and maps, and (4) driving scene understanding.

2.1 Driving Models for Automated Cars

Significant progress has been made in autonomous driving, especially due to the deployment of deep neural networks. Driving models can be clustered into two groups [7]: mediated perception approaches and end-to-end mapping approaches, with some exceptions like [7]. Mediated perception approaches require the recognition of all driving-relevant objects, such as lanes, traffic signs, traffic lights, cars, and pedestrians [1, 26, 27]. Excellent work [28] has been done to integrate such results. Systems of this kind, developed by the automotive industry, represent the current state of the art for autonomous driving. Most use diverse sensors, such as cameras, laser scanners, radar, GPS, and high-definition maps [29]. End-to-end mapping methods construct a direct mapping from the sensory input to the maneuvers. The idea can be traced back to the 1980s, when a neural network was used to learn a direct mapping from images to steering angles [24]. Other end-to-end examples are [5, 12, 14, 15, 25]. In [12], the authors trained a neural network to map camera inputs directly to the vehicle’s ego-motion. Methods have also been developed to explain how end-to-end networks work for the driving task [30] and to predict when they fail [17]. Most end-to-end work has been demonstrated with a front-facing camera only. To the best of our knowledge, we present the first end-to-end method that exploits more realistic input. Please note that our data can also be used for mediated perception approaches. Recently, reinforcement learning for driving has received increasing attention [31,32,33]. The trend is especially fueled by the release of excellent driving simulators [34, 35].

2.2 Assistive Features for Vehicles with Surround-View Cameras

Over the last decades, more and more assistive technologies that increase driving safety have been deployed in vehicles. Technologies such as lane keeping, blind spot checking, forward collision avoidance, adaptive cruise control, and driver behavior prediction alert drivers to potential dangers [36,37,38,39]. Research in this vein has recently shifted its focus to surround-view cameras, as a panoramic view around the vehicle is needed for many such applications. Notable examples include object detection, object tracking, lane detection, maneuver estimation, and parking guidance. For instance, a bird’s-eye view has been used to monitor the surroundings of the vehicle in [40]. Trajectories and maneuvers of surrounding vehicles are estimated with surround-view camera arrays in [41, 42]. Datasets, methods, and evaluation metrics for object detection and tracking with multiple overlapping cameras are studied in [43, 44]. Lane detection with surround-view cameras is investigated in [45] and the parking problem in [46]. Advanced driver assistance systems often use a 3-D surround view, which informs drivers about the environment and eliminates blind spots [47]. Our work adds autonomous driving to this list. Our dataset can also be used for all aforementioned problems, and it provides a platform to study the usefulness of route planners.

2.3 Navigation and Maps

In-car navigation systems have been widely used to show the vehicle’s current location on a map and to inform drivers how to get from the current position to the destination. Increasing the accuracy and robustness of systems for positioning, navigation, and digital maps has been another research focus for many years. Several methods for high-definition mapping have been proposed [48, 49], some specifically for autonomous driving [50, 51]. Route planning has been extensively studied as well [52,53,54,55,56], mainly to compute the fastest, most fuel-efficient, or a customized trajectory to the destination through a road network. Yet, thus far their usage is mostly restricted to helping human drivers. Their accessibility as an aid to learn autonomous driving models has been limited. This work reports on two ways of using two kinds of maps: the state-of-the-art commercial TomTom Maps and the excellent collaborative project OpenStreetMap [57].

While considerable progress has been made both in computer vision and in route planning, their integration for learning driving models has not received due attention in the academic community. A trending topic is to combine digital maps and street-view images for accurate vehicle localization [58,59,60,61].

2.4 Driving Scene Understanding

Road scene understanding is a crucial enabler for assisted or autonomous driving. Typical examples include the detection of roads [62], traffic lights [63], cars and pedestrians [1, 2, 64, 65], and the tracking of such objects [66,67,68]. We refer the reader to the comprehensive surveys [69, 70]. Integrating recognition results like those of the aforementioned algorithms may well be necessary but is beyond the scope of this paper.

3 The Driving Dataset

We first present our sensor setup, then describe our data collection, and finally compare our dataset to other driving datasets.

3.1 Sensors

Three kinds of sensors are used for data collection in this work: cameras, a route planner (with a map), and a USB reader for data from the vehicle’s CAN bus.

Cameras. We use eight cameras and mount them on the roof of the car using a specially designed rig with 3D-printed camera mounts. The cameras are mounted at the following angles relative to the vehicle’s heading direction: \(0^\circ \), \(45^\circ \), \(90^\circ \), \(135^\circ \), \(180^\circ \), \(225^\circ \), \(270^\circ \), and \(315^\circ \). We installed GoPro Hero 5 Black cameras, due to their ease of use, their good image quality when moving, and their weather resistance. All videos are recorded at 60 frames per second (fps) in 1080p. In fact, a full 360-degree view can already be covered by four cameras. Please see Fig. 2 for our camera configuration.

Route Planners. Route planners have been a research focus for many years [53, 54]. While considerable progress has been made both in computer vision and in route planning, their integration for learning to drive has not received due attention in the academic community. Routing has become ubiquitous with commercial maps such as Google Maps, HERE Maps, and TomTom Maps, and on-board navigation devices are in virtually every new car. Albeit available in a technical sense, their routing algorithms and the underlying road networks are not yet accessible to the public. In this work, we exploited two route planners: one based on TomTom Maps and the other on OpenStreetMap.

TomTom Maps represents one of the state-of-the-art commercial maps for driving applications. Like all other commercial counterparts, it does not provide open APIs to access its ‘raw’ data. We thus exploit the visual information provided by the TomTom GO Mobile App [71] and recorded the rendered map views using the native screen-recording software of the smartphone, an iPhone 7. Since map rendering comes with rather slow updates, we capture the screen at 30 fps. The video resolution was set to \(1280 \times 720\) pixels.

Fig. 2. The configuration of our cameras. The rig is 1.6 m wide so that the side-view cameras have a good view of the road surface without obstruction by the roof of the vehicle. The cameras are evenly distributed laterally and angularly

Apart from commercial maps, OpenStreetMap (OSM) [57] has gained great attention for supporting routing services. The OSM geodata includes detailed spatial and semantic information about roads, such as road names, road types (e.g. highway or footpath), speed limits, addresses of buildings, etc. The effectiveness of OSM for robot navigation has been demonstrated by Hentschel and Wagner [72]. In this work, we thus use the real-time routing method developed by Luxen and Vetter for OSM data [73] as our second route planner. The past driving trajectory (a stack of GPS coordinates) is provided to the routing algorithm to localize the vehicle on the road network, and the GPS tags of the planned road for the next 300 m ahead are taken as the representation of the planned route for the ‘current’ position. Because the GPS tags of the OSM road network are not distributed evenly over distance, we fit a cubic smoothing spline to the obtained GPS tags and then sample 300 data points from the fitted spline with a stride of 1 m. Thus, for the OSM route planner, we have a \(300 \times 2\) matrix (300 GPS coordinates) as the representation of the planned route for every ‘current’ position.
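As an illustration of this resampling step, the Python sketch below fits smoothing splines to unevenly spaced (latitude, longitude) tags and samples 300 points at a 1 m stride; the function name, the local equirectangular projection, and the smoothing factor are our own assumptions, not the authors' implementation.

import numpy as np
from scipy.interpolate import UnivariateSpline

EARTH_RADIUS_M = 6371000.0

def resample_route(gps_tags, n_points=300, stride_m=1.0, smooth=1e-6):
    """Resample unevenly spaced (lat, lon) route tags into n_points coordinates
    spaced stride_m meters apart along the route (illustrative sketch only)."""
    gps = np.asarray(gps_tags, dtype=float)                # shape (M, 2): lat, lon in degrees
    lat0 = np.radians(gps[:, 0].mean())
    # Local equirectangular projection to approximate arc length in meters.
    x = np.radians(gps[:, 1]) * np.cos(lat0) * EARTH_RADIUS_M
    y = np.radians(gps[:, 0]) * EARTH_RADIUS_M
    dist = np.concatenate([[0.0], np.cumsum(np.hypot(np.diff(x), np.diff(y)))])
    keep = np.concatenate([[True], np.diff(dist) > 0])     # drop duplicated positions
    dist, gps = dist[keep], gps[keep]
    # Cubic smoothing splines of latitude/longitude as functions of arc length.
    lat_spl = UnivariateSpline(dist, gps[:, 0], k=3, s=smooth)
    lon_spl = UnivariateSpline(dist, gps[:, 1], k=3, s=smooth)
    s_new = np.minimum(np.arange(n_points) * stride_m, dist[-1])
    return np.column_stack([lat_spl(s_new), lon_spl(s_new)])  # (300, 2) matrix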

Human Driving Maneuvers. We record low-level driving maneuvers, i.e. the steering wheel angle and the vehicle speed, registered on the CAN bus of the car at 50 Hz. The CAN protocol is a simple ID-and-data-payload broadcasting protocol that is used for low-level information broadcasting in a vehicle. As such, we read out the specific CAN IDs and their corresponding payloads for steering wheel angle and vehicle speed via a CAN-to-USB device and record them on a computer connected to the bus.
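As a rough sketch of how such messages can be read with the python-can library, consider the following; the CAN IDs, byte layout, and scaling factors are placeholders, since the real message definitions are vehicle-specific and not given here.

import can  # python-can

# Placeholder CAN IDs and scalings; the actual values are vehicle-specific.
STEERING_ID, SPEED_ID = 0x025, 0x0D0
STEERING_SCALE, SPEED_SCALE = 0.1, 0.01      # deg per LSB, km/h per LSB

bus = can.interface.Bus(channel='can0', bustype='socketcan')
while True:
    msg = bus.recv(timeout=1.0)              # frame from the CAN-to-USB device
    if msg is None:
        continue
    if msg.arbitration_id == STEERING_ID:
        raw = int.from_bytes(msg.data[0:2], 'big', signed=True)
        steering_deg = raw * STEERING_SCALE  # logged with a timestamp in practice
    elif msg.arbitration_id == SPEED_ID:
        raw = int.from_bytes(msg.data[0:2], 'big', signed=False)
        speed_kmh = raw * SPEED_SCALE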

Vehicle’s Odometry. We use the GoPro cameras’ built-in GPS and IMU modules to record GPS data at 18 Hz and IMU measurements at 200 Hz while driving. This data is then extracted and parsed from the metadata track of the video created by the GoPro.

3.2 Data Collection

Synchronization. The correct synchronization among all data streams is of utmost importance. For this we devised an automatic procedure that synchronizes to the GPS clock for fast dataset generation. During all recordings, the internal clocks of all sensors are synchronized to the GPS clock. The resulting synchronization error for the video frames is up to 8.3 ms, i.e. half the frame interval. If the vehicle travels at a speed of 100 km/h, the resulting error in the vehicle’s longitudinal position is about 23 cm. We acknowledge that cameras which can be triggered by accurate trigger signals are preferable with respect to synchronization error. Our cameras, however, provide good photometric image quality and high frame rates, at the price of a moderate synchronization error. The synchronization error of the maps to our video frames is up to 0.5 s. This is acceptable, as the planned route (regardless of its representation) is only needed to provide a global view for navigation. The synchronization error of the CAN bus signals to our video frames is up to 10 ms. This is also tolerable, as human drivers issue driving actions at a relatively low rate; for instance, the mean reaction times of human drivers to unexpected and expected events are 1.3 and 0.7 s, respectively [74].
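For reference, the quoted error bounds follow directly from the frame rate and the vehicle speed, as the short calculation below illustrates.

frame_rate_hz = 60.0
sync_error_s = 0.5 / frame_rate_hz            # half the frame interval: ~8.3 ms
speed_mps = 100.0 / 3.6                       # 100 km/h expressed in m/s
position_error_m = speed_mps * sync_error_s   # ~0.23 m, i.e. about 23 cm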

Drive360 Dataset. With the sensors described, we collect a new dataset, Drive360. Drive360 is recorded by driving in and around multiple cities in Switzerland. We focus on delivering a realistic dataset for training driving models. Inspired by how a driving instructor teaches a human apprentice to drive, we chose the routes and the driving times with the aim of maximizing exposure to all typical driving scenarios. This reduces the chance of generating a biased dataset with many ‘repetitive’ scenarios, and thus allows for an accurate judgment of the performance of the driving models. Drive360 contains 60 h of driving data.

The drivers always obeyed Swiss driving rules, such as respecting the speed limits carefully, driving in the right lane when not overtaking a vehicle, and leaving the required distance to the vehicle in front. A second person accompanied the drivers to help (remind) the driver to always follow the route planned by our route planner. We used a manual setup procedure to make sure that the two route planners generate the ‘same’ planned route, up to the differences between their own representations of the road networks. After choosing the starting point and the destination, we first generate a driving route with the OSM route planner. For the TomTom route planner, we obtain the same driving route by using the same starting point and destination, and by adding a consecutive sequence of waypoints (intermediate places) on the route. We manually verified every part of the route before each drive to make sure that the two planned routes are indeed the same. After this synchronization, TomTom GO Mobile is used to guide our human drivers due to its high-quality visual information. The data for our OSM route planner is obtained by using the routing algorithm proposed in [73]. In particular, for each ‘current’ location, the ‘past’ driving trajectory is provided to localize the vehicle on the originally planned route in OSM. Then the GPS tags of the route for the next 300 m ahead are retrieved.

3.3 Comparison to Other Datasets

In comparison to other datasets, see Table 1, ours has some unique characteristics.

Planned Routes. Since our dataset is aimed at understanding and improving the fallacies of current end-to-end driving models, we supply map data for navigation and offer the only real-world dataset to do so. It is noteworthy that planned routes cannot be obtained by post-processing the GPS coordinates recorded by the vehicle, because planned routes and actual driving trajectories intrinsically differ. The differences between the two result from the actual driving (e.g. changing lanes in road construction zones or overtaking a stopped bus), and are precisely what the driving models are meant to learn.

Table 1. Comparison of our dataset to others compiled for driving tasks (cam = camera)

Surround Views and Low-Level Driving Maneuvers. Equally important, ours is the only dataset working with real data and offering surround-view videos with low-level driving maneuvers (e.g. steering angle and speed control). This is particularly valuable for end-to-end driving, as it allows the model to learn correct steering for lane changes, which require ‘mirrors’ when carried out by human drivers, or correct driving actions for making turns at intersections. Compared with BDDV [12] and the Oxford dataset [75], we offer low-level driving maneuvers of the vehicle via the CAN bus, whereas they only supply the car’s ego-motion via GPS devices. This allows us to predict the input control of the vehicle, which is one step closer to a fully autonomous end-to-end trained driving model. Udacity [76] also offers low-level driving maneuvers via the CAN bus. It, however, lacks route planners and contains only a few hours of driving data.

Dataset Focus. As shown in Table 1, there are multiple datasets compiled for tasks relevant to autonomous driving. These datasets, however, all have their own focus. KITTI, Cityscapes, and GTA focus more on semantic and geometric understanding of the driving scenes. The Oxford dataset focuses on capturing the temporal (seasonal) changes of driving scenes, and thus limits the driving to a ‘single’ driving route. BDDV [12] is a very large dataset, collected from many cities in a crowd-sourced manner. It, however, only features a front-facing dashboard camera.

4 Approach

The goal of our driving model is to map directly from the planned route, the historical vehicle states, and the current road situation to the desired driving actions.

4.1 Our Driving Model

Let us denote by \(\mathrm {I}\) the surround-view video, \(\mathrm {P}\) the planned route, \(\mathrm {L}\) the vehicle’s location, and \(\mathrm {S}\) and \(\mathrm {V}\) the vehicle’s steering angle and speed. We assume that the driving model works in discrete time and makes driving decisions every \(1 \slash f\) seconds. The inputs are all synchronized and sampled at sampling rate f. Unless stated otherwise, our inputs and outputs are all represented in this discretized form.

We use subscript t to indicate the time stamp. For instance, the current video frame is \(I_t\), the current vehicle speed is \(V_t\), the \(k^{th}\) previous video frame is \(I_{t-k}\), and the \(k^{th}\) previous steering angle is \(S_{t-k}\), etc. Then, the k most recent samples can be denoted by \(\mathbf {S}_{[t-k+1,t]} \equiv \langle S_{t-k+1}, ..., S_{t}\rangle \), \(\mathbf {V}_{[t-k+1,t]} \equiv \langle V_{t-k+1}, ..., V_{t}\rangle \), and \(\mathbf {L}_{[t-k+1,t]} \equiv \langle L_{t-k+1}, ..., L_{t}\rangle \), respectively. Our goal is to train a deep network that predicts desired driving actions from the vehicle’s historical states, historical and current visual observations, and the planned route. The learning task can be defined as:

$$\begin{aligned} F: (\mathcal {S}_{[t-k+1,t]}, \mathcal {V}_{[t-k+1,t]}, \mathcal {L}_{[t-k+1,t]}, \mathcal {I}_{[t-k+1,t]}, P_t) \rightarrow \mathcal {S}_{t+1} \times \mathcal {V}_{t+1} \end{aligned}$$
(1)

where \(\mathcal {S}_{t+1}\) represents the steering angle space and \(\mathcal {V}_{t+1}\) the speed space for future time \(t+1\). \(\mathcal {S}\) and \(\mathcal {V}\) can be defined at several levels of granularity. We consider the continuous values directly recorded from the car’s CAN bus, where \(\mathcal {V}=\{V | 0 \le V \le 180\}\) for speed and \(\mathcal {S}=\{S | -720 \le S \le 720\}\) for steering angle. Here, kilometer per hour (km/h) is the unit of V, and degree (\(^\circ \)) the unit of S. Since there is not much to learn from the historical values of \(\mathrm {P}\), only \(P_t\) is used for the learning. \(P_t\) is either a video frame from our TomTom route planner or a \(300 \times 2\) matrix from our OSM route planner.

Given N training samples collected during real drives, learning to predict the driving actions for the future time \(t+1\) is based on minimizing the following cost:

$$\begin{aligned} \begin{aligned} L(\theta ) = \sum _{n=1}^N \Big ( l({S}^{n}_{t+1}, F_{\text {s}}(\mathbf {S}^{n}_{[t-k+1, t]}, \mathbf {V}^{n}_{[t-k+1, t]}, \mathbf {L}^{n}_{[t-k+1, t]}, \mathbf {I}^{n}_{[t-k+1, t]}, P_t)) \\ + \, \lambda l({V}^{n}_{t+1}, F_{\text {v}}(\mathbf {S}^{n}_{[t-k+1, t]}, \mathbf {V}^{n}_{[t-k+1, t]}, \mathbf {L}^{n}_{[t-k+1, t]}, \mathbf {I}^{n}_{[t-k+1, t]}, P_t))\Big ), \end{aligned} \end{aligned}$$
(2)

where \(\lambda \) is a parameter balancing the two losses, one for steering angle and the other for speed. We use \(\lambda =1\) in this work. F is the learned function for the driving model. For the continuous regression task, l(.) is the L2 loss function. Finding a better way to balance the two loss functions constitutes our future work. Our model learns from multiple previous frames in order to better understand traffic dynamics.
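A minimal PyTorch sketch of this cost for one mini-batch, assuming the network returns separate steering and speed predictions, could look as follows (it averages rather than sums over samples, which only rescales the objective):

import torch.nn.functional as F  # PyTorch functional module, not the model F above

def driving_loss(pred_steer, pred_speed, gt_steer, gt_speed, lam=1.0):
    """L2 losses of Eq. (2) for the two regression heads, balanced by lambda."""
    return F.mse_loss(pred_steer, gt_steer) + lam * F.mse_loss(pred_speed, gt_speed)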

4.2 Implementation

Our driving system is trained with four cameras (front, left, right, and rear view), which provide a full panoramic view already. We recorded the data with all eight cameras in order to keep future flexibility.

This work develops a customized network architecture for the learning problem defined in Sect. 4.1, which consists of deep hierarchical sub-networks. It comprises multiple CNNs as feature encoders, four LSTMs as temporal encoders for the information from the four surround-view cameras, a fully-connected network (FN) to fuse information from all cameras and the map, and finally two FNs to output the future speed and steering angle of the car. The architecture is illustrated in Fig. 1.

During training, videos are all resized to \(256 \times 256\) and we augment our data by using \(227 \times 227\) crops, without mirroring. For the CNN feature encoder, we take a ResNet34 [77] model pre-trained on the ImageNet [78] dataset. Our network architecture is inspired by the Long-term Recurrent Convolutional Network developed in [79]. A more detailed description of the network architecture is provided in the supplementary material.
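The sketch below illustrates this structure in PyTorch, with per-camera LSTMs on top of ResNet34 features and fully-connected fusion/output heads. For brevity it shares a single ResNet34 across cameras and frames and treats the planned route as a flat vector; the layer sizes are our assumptions, and the exact architecture is the one given in the supplementary material.

import torch
import torch.nn as nn
from torchvision import models

class SurroundViewDrivingModel(nn.Module):
    """Illustrative sketch of the CNN + LSTM + FN driving architecture."""
    def __init__(self, n_cams=4, feat_dim=512, hidden=128, route_dim=600):
        super().__init__()
        resnet = models.resnet34(pretrained=True)             # ImageNet pre-trained encoder
        self.cnn = nn.Sequential(*list(resnet.children())[:-1])
        self.lstms = nn.ModuleList(
            [nn.LSTM(feat_dim, hidden, batch_first=True) for _ in range(n_cams)])
        self.fuse = nn.Sequential(nn.Linear(n_cams * hidden + route_dim, 256), nn.ReLU())
        self.steer_head = nn.Linear(256, 1)                    # future steering angle
        self.speed_head = nn.Linear(256, 1)                    # future speed

    def forward(self, frames, route):
        # frames: (B, n_cams, T, 3, 227, 227); route: (B, route_dim), e.g. a flattened 300x2 matrix
        B, C, T = frames.shape[:3]
        cam_codes = []
        for c in range(C):
            x = frames[:, c].reshape(B * T, *frames.shape[3:])
            feats = self.cnn(x).flatten(1).reshape(B, T, -1)   # (B, T, feat_dim)
            _, (h, _) = self.lstms[c](feats)                   # temporal encoding per camera
            cam_codes.append(h[-1])                            # last hidden state, (B, hidden)
        fused = self.fuse(torch.cat(cam_codes + [route], dim=1))
        return self.steer_head(fused), self.speed_head(fused)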

5 Experiments

We train our models on 80% of our dataset, corresponding to 48 h of driving time and around 1.7 million unique synchronized sequence samples. Our driving routes are normally 2 h long. We have selected 24 of the 30 driving routes for training and the other 6 for testing. This way, the network does not overfit to any specific type of road or weather. Synchronized video frames are extracted at a rate of 10 fps, as 60 fps would generate a very large dataset. A synchronized sample contains four frames at a resolution of \(256\times 256\) for the corresponding front, left, right, and rear facing cameras, a rendered image at \(256 \times 256\) pixels for the TomTom route planner or a \(300 \times 2\) matrix for the OSM route planner, the CAN bus data, and the GPS data of the ‘past’.
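For concreteness, one such synchronized sample can be thought of as the following record; the field names and array layouts are illustrative, not the released data format.

from dataclasses import dataclass
import numpy as np

@dataclass
class SyncedSample:
    """One synchronized sample of Drive360 (field names are illustrative)."""
    cam_frames: np.ndarray     # (4, 256, 256, 3): front, left, right, rear views
    tomtom_frame: np.ndarray   # (256, 256, 3): rendered TomTom route view
    osm_route: np.ndarray      # (300, 2): GPS coordinates of the planned route
    past_steering: np.ndarray  # recent steering wheel angles from the CAN bus (deg)
    past_speed: np.ndarray     # recent vehicle speeds from the CAN bus (km/h)
    past_gps: np.ndarray       # recent GPS positions of the vehicle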

We train our models using the Adam optimizer with an initial learning rate of \(10^{-4}\) and a batch size of 16 for 5 epochs, resulting in a training time of around 3 days. For the four surround-view cameras, we use four frames to train the network: 0.9 s in the past, 0.6 s in the past, 0.3 s in the past, and the current frame. This corresponds to a sampling rate of \(f=3.33\). A higher value can be used at the price of a higher computational cost. This leads to \(4 \times 4 = 16\) CNNs for capturing the street-view visual scene.
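Under these settings, the temporal sampling and the optimization loop can be set up roughly as follows; the data loader is a placeholder, and SurroundViewDrivingModel and driving_loss refer to the sketches above rather than the released code.

import torch

# Frames at 0.9 s, 0.6 s and 0.3 s in the past plus the current frame; with frames
# extracted at 10 fps, these are offsets of -9, -6, -3 and 0 frames (f = 3.33 Hz).
FRAME_OFFSETS = (-9, -6, -3, 0)

model = SurroundViewDrivingModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # initial learning rate 1e-4

for epoch in range(5):                                      # 5 epochs, batch size 16
    for frames, route, gt_steer, gt_speed in train_loader:  # placeholder DataLoader
        pred_steer, pred_speed = model(frames, route)
        loss = driving_loss(pred_steer, pred_speed, gt_steer, gt_speed)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()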

We structure our evaluation into two parts: evaluating our method against existing methods, and evaluating the benefits of using a route planner and/or a surround-view camera system.

5.1 Comparison to Other Single-Camera Methods

We compare our method to the methods of [12] and [25]. Since the BDDV dataset does not provide data for driving actions (e.g. steering angle) [12], we train their networks on our dataset and compare with our method directly. For a fair comparison, we follow their settings, using only a single front-facing camera and predicting the driving actions 0.3 s into the future.

Table 2. MSE of speed prediction and steering angle prediction when a single front-facing camera is used (previous driving states are given)

We use the mean squared error (MSE) for evaluation. The results for speed prediction and steering angle prediction are shown in Table 2. We include a baseline reference trained only on CAN bus information (no image information given). The table shows that our method outperforms [25] significantly and is slightly better than [12]. [25] does not use a pre-trained CNN, which probably explains why its performance is considerably worse. The comparison to these two methods verifies that our frontal-view driving model represents the state of the art, so that the extensions to multiple cameras and route planners are built on a sensible basis.
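For completeness, the reported MSE for each quantity is simply the average squared difference between predictions and ground truth over the evaluation samples:

import numpy as np

def mse(pred, target):
    """Mean squared error between predicted and ground-truth values."""
    pred, target = np.asarray(pred, dtype=float), np.asarray(target, dtype=float)
    return float(np.mean((pred - target) ** 2))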

We note that the baseline reference performs quite well, suggesting that, due to the inertia of driving maneuvers, the network can already predict speed and steering angle 0.3 s into the future quite well, solely based on the supplied ground-truth maneuvers of the past. For instance, if one steers the wheels to the right at time t, then at \(t+0.3\,\mathrm{s}\) the wheels are very likely to be at a similar angle to the right. In a true autonomous vehicle, however, the past driving states might not always be correct. Therefore, we argue that the policy employed by some existing methods of relying on the past ‘ground-truth’ states of the vehicle should be used with caution: for real autonomous cars, the errors will be amplified via a feedback loop. Based on this finding, we remove \(\mathcal {S}_{[t-k+1,t]}\) and \(\mathcal {V}_{[t-k+1,t]}\), i.e. we do not use the previous human driving maneuvers, and learn the desired speed and steering angle only based on the planned route and the visual observations of the local road situation. This new setting ‘forces’ the network to learn knowledge from route planners and road situations.

5.2 Benefits of Route Planners

We evaluate the benefit of a route planner by designing two networks using either our visual TomTom or our numerical OSM guidance system, and compare these against our network that does not incorporate a route planner. The speed and steering angle prediction results of each network are summarized in Table 3. The evaluation shows that our visual TomTom route planner significantly improves prediction performance, while the OSM approach does not yield a clear improvement. Since the prediction of speed is easier than the prediction of steering angle, a route planner has a more noticeable benefit on the prediction of steering angles.

Table 3. MSE (smaller = better) of speed and steering angle prediction by our method under different settings. Predictions on the full evaluation set and on the subset with human driving speed \({\le }30\) km/h

Why Is the Visual TomTom Planner Better? It is tempting to think that GPS coordinates contain more accurate information than the rendered videos do, and thus provide a better representation for planned routes. This is, however, not the case if the GPS coordinates are used directly without further careful processing. The visualization of a planned route on navigation devices such as TomTom GO Mobile makes use of accurate vehicle localization based on the vehicle’s moving trajectory to provide accurate procedural knowledge of the route along the driving direction. Localization based on the vehicle’s moving trajectory is tackled under the name of map-matching, which in itself is a long-standing research problem [80,81,82]. For our TomTom route planner, this is done with TomTom’s excellent underlying map-matching method, which is, however, not publicly known. The rendering process converts the ‘raw’ GPS coordinates into a more structured representation. Our implemented OSM route planner, however, encodes more global spatial information at the map level, making the integration of navigation information and street-view videos more challenging. Readers are referred to Fig. 3 for exemplar representations of the two route planners.

In addition to map-matching, we provide further possible explanations: (1) raw GPS coordinates are accurate for locations, but fall short of other high-level and contextual information (road layouts, road attributes, etc.) which is ‘visible’ in the visual route planner. For example, raw GPS coordinates do not distinguish a ‘highway exit’ from a ‘slight right bend’ and do not reveal the alternative roads at an intersection, while the visual route planner does. It seems that the semantic features optimized in navigation devices to assist human driving are useful for machine driving as well. Feature design/extraction for the navigation task of autonomous driving is an interesting future topic. (2) The quality of the underlying road networks differs between TomTom and OSM. OSM is crowd-sourced, so the quality/accuracy of its road network is not always guaranteed. A direct comparison is hard to make, though, as TomTom’s road networks are inaccessible to the public.

5.3 Benefits of Surround-View Cameras

Surround-view cameras offer a modest improvement for predicting steering angle on the full evaluation set. They, however, appear to reduce the overall performance for speed prediction. Further investigation has shown that surround-view cameras are especially useful in situations where the ego-car is required to give the right of way to other (potential) road users by controlling its driving speed. Notable examples include (1) busy city streets and residential areas, where the human drives at low velocity, and (2) intersections, especially those without traffic lights and stop signs. For instance, the speed at an intersection is determined by whether the ego-car has a clear path for the planned route. Surround-view cameras can see whether other cars are coming from any side, whereas a front camera alone is blind to many directions. In order to examine this, we explicitly selected two specific types of scenes across our evaluation dataset for a more fine-grained evaluation of front-view vs. surround-view: (1) low-speed (city) driving, selected according to the speed of the human driver; and (2) intersection scenarios, selected by human annotation. The evaluation results are shown in Tables 3 and 4, respectively. The better-performing TomTom route planner models are used for the experiments in Table 4. Surround-view cameras significantly improve the performance of speed control in these two very important driving situations. For ‘high-speed’ driving on highways or countryside roads, surround-view cameras do not show clear advantages, in line with human driving: human drivers also consult non-frontal views less frequently at high speed.

Table 4. MSE (smaller = better) of speed prediction by our Front-view+TomTom and Surround-view+TomTom driving models. Evaluated on manually annotated intersection scenarios over a 2-h subset of our evaluation dataset. Surround-view significantly outperforms front-view in intersection situations

As human drivers, we consult our navigation system mostly when there are multiple road choices, namely at road intersections. To evaluate whether route planning improves performance specifically in these scenarios, we select the subset of our test set with low human driving speed, and report the results for this subset also in Table 3. The results in Table 3 support our claim that route planning is beneficial to a driving model, and improves the driving performance especially in situations where a turning maneuver is performed. In future work, we plan to select other interesting situations for more detailed evaluation.

Qualitative Evaluation. While standard evaluation metrics for neural networks, such as mean squared error, offer global insight into the performance of models, they are less intuitive for evaluating where, at a local scale, using surround-view cameras or route planning improves prediction accuracy. To this end, we use our visualization tool to inspect and evaluate the model performance in different ‘situations’.

Figure 3 shows examples of three model comparisons (TomTom, Surround, Surround+TomTom) row-wise, wherein the model with additional information is directly compared to our front-camera-only model, shown by the speed and steering wheel angle gauges. The steering wheel angle gauge is a direct map of the steering wheel angle in degrees, whereas the speed gauge ranges from 0 km/h to 130 km/h. The additional information a model receives is ‘image framed’ by the respective color. Gauges should be used for relative model comparison, with the front-camera-only model prediction in orange, the model with additional information in red, and the human maneuver in blue. Thus, for our purposes, we consider a model to perform well when the magnitude of its gauge is identical (or similar) to the human gauge. Column-wise we show examples where: (a) both models perform well, (b) the model with additional information outperforms, (c) both models fail.

Fig. 3. Qualitative results for future driving action prediction, comparing three cases to the front-camera-only model: (1) learning with the TomTom route planner, (2) learning with surround-view cameras, (3) learning with the TomTom route planner and surround-view cameras. The TomTom route planner and surround-view images are shown in red boxes, while the OSM route planner is shown in a black box. Better seen on screen (Color figure online)

Our qualitative results, in Fig. 3(1, b) and (3, b), support our hypothesis that a route planner is indeed useful at intersections where there is ambiguity with regard to the correct direction of travel. Both models with route planning information are able to predict the correct direction at the intersection, whereas the model without this information predicts the opposite. While this ‘wrong’ prediction may be a valid driving maneuver in terms of safety, it is nonetheless not correct in terms of arriving at the correct destination. Our map model, on the other hand, is able to overcome this. Figure 3(2, b) shows that surround-view cameras are beneficial for predicting the correct speed. The frontal view supplied could suggest that one is on a country road, where the speed limit is significantly higher than in the city; as such, our front-camera-only model predicts a speed much greater than the human maneuver. However, our surround-view system can pick up on the pedestrians to the right of the car and thus adjusts the speed accordingly. The surround-view model thus has a more precise understanding of its surroundings.

Visualization Tool. To obtain further insights into where current driving models perform well or fail, we have developed a visual evaluation tool that lets users select scenes in the evaluation set by clicking on a map, and then renders the corresponding four camera views, the ground-truth and predicted vehicle maneuvers (steering angle and speed), along with the map at that point in time. These evaluation tools, along with the dataset, will be released to the public. Visual evaluation is particularly helpful to understand where and why a driving model predicted a certain maneuver, as sometimes, while not coinciding with the human action, the network may still predict a safe driving maneuver.

6 Conclusion

In this work, we have extended learning end-to-end driving models from only using a single front-view camera to a more realistic setting. We have presented a novel task of learning end-to-end driving models with surround-view cameras and rendered maps, enabling the car to ‘look’ to the side and rearward, and to ‘check’ the driving direction. We have presented two main contributions: (1) a new driving dataset, featuring 60 h of driving videos with eight surround-view cameras, low-level driving maneuvers recorded via the car’s CAN bus, two representations of planned routes from two route planners, and GPS-IMU data for the vehicle’s odometry; (2) a novel deep network to map directly from the sensor inputs to future driving maneuvers. Our data features high temporal resolution, \(360^\circ \) view coverage, frame-wise synchronization, and diverse road conditions, making it ideal for learning end-to-end driving models. Our experiments have shown that an end-to-end learning method can effectively use surround-view cameras and route planners. The rendered videos outperform a stack of raw GPS coordinates for representing planned routes.