1 Introduction

Detecting and tracking vehicles in a video makes it possible to estimate the trajectory of every vehicle while it remains in the scene. This has applications in a wide range of tasks: vehicle counting, accident detection, roundabout entry/exit analysis and assisted traffic surveillance. In a real-life scenario, speed and robustness are a must, which translates into two requirements: real-time performance and occlusion handling.

Current tracking solutions can be divided into two types: low-level and high-level trackers. The former exploit the visual information in the current frame to find the object of interest, while the latter can use more complex information to estimate the new object position (probabilistic models, environment maps, etc.). Current low-level trackers [2, 5, 7] cannot handle total occlusions and do not provide a framework for multiple object tracking. In addition, the best current solutions require a high-end GPU or do not operate in real time with multiple objects on a CPU [7, 20].

In recent years, the high-level tracking problem has been framed as tracking-by-detection [1]. This framework treats tracking as a data association problem between detections and trackers over time. It assumes the existence of reliable detections in every frame of a video, an assumption that does not hold in real-life scenarios, as current state-of-the-art deep-learning based detectors take more than 75 ms per frame [17].

In this paper, we present a traffic monitoring system that performs multiple object detection and tracking in video in real time while handling total occlusions. The system is composed of a deep-learning based detector, a low-level Discriminative Correlation Filter (DCF) based tracker, a high-level Kalman Filter based tracker, and data association based on the Hungarian algorithm. The contributions of our proposal are:

  • A traffic monitoring system that can process more than 400 vehicles simultaneously in HD-resolution videos in real time.

  • The system also handles occlusions by detecting the upcoming occlusion and searching for the occluded vehicle in a Region-Of-Interest (ROI) whose size is proportional to the degree of error in the tracking process. We provide a metric for on-line tracking failure detection that estimates the distance between two independent tracking methods, allowing us to update the system's tracking error accordingly.

  • We extend our system to solve a real-life traffic application: roundabout I/O (Input/Output) analysis with nearly 1,000 vehicles.

The rest of this paper is structured as follows. Section 2 gives an overview of closely related work. In Sect. 3 we explain the details of our approach. In Sect. 4 we discuss the implementation details of our system and introduce the traffic application developed. Finally, conclusions are given in Sect. 5.

2 Related Work

Traffic monitoring systems detect and track all the vehicles in a video sequence. This task presents two main challenges: managing total occlusions and operating in real time with multiple vehicles.

The work in the field of object detection is mainly based on deep convolutional neural networks (ConvNets). One of the first works in this area was R-CNN [12], which uses a region proposal algorithm (such as selective search [23] or edge boxes [25]) and applies a classification network to each proposed region. Improving on this approach, Fast-RCNN [11] introduces the regions at an intermediate stage of the network, thus saving a great deal of computing time. Finally, Faster-RCNN [22], which became a milestone in the object detection field, introduces a region proposal algorithm based entirely on a neural network, the Region Proposal Network (RPN). The RPN uses the information from intermediate layers of a standard classification network to propose locations in which an object may appear.

To improve region proposals across all possible scales, Lin et al. [18] replicate the RPN from Faster-RCNN at several layers of the network, combining deeper feature maps with shallower ones. The shallower the layer, the smaller the objects it locates. This approach, called Feature Pyramid Network (FPN), obtains outstanding results, as shown in the COCO detection challenge 2016 [19]. All these approaches achieve a high level of performance, but their main limitation is their computational cost, which makes them hard to use in applications that demand real-time performance.

In recent years, the top trackers in the Visual Object Tracking (VOT) challenge [15] have been based on two approaches: Discriminative Correlation Filter (DCF) based trackers and deep-learning based trackers. On the one hand, DCF based trackers predict the target position by training a correlation filter that can differentiate between the object of interest and the background [5, 6, 13]. On the other hand, deep-learning based trackers use ConvNets. SiamFC [2] is one of the first approaches of this kind. This tracker consists of two branches that apply an identical transformation (a deep feature extractor) to two inputs: the search image and the exemplar. Both representations are then combined through cross-correlation, generating a score map that indicates the most probable position of the object.

Due to the increase in performance of deep learning detectors in recent years, tracking is increasingly treated as a data association problem, i.e. tracking-by-detection. In this approach, the primary concern is to assign detections to trackers over time. International challenges [1] have emerged to rank solutions to this problem, evaluating precision, robustness and speed, among other performance metrics. In the past few years, complex solutions to this tracking approach have appeared that obtain outstanding results. Some of them focus on extending traditional high-level tracking approaches. As an example, Kim et al. [14] and Chen et al. [4] propose extensions to classical multiple hypothesis tracking (MHT) [21]. The former introduces on-line appearance representations, while the latter enhances classical MHT by incorporating a detection model that includes detection-scene and detection-detection analysis.

All these approaches have demonstrated good performance on classic multiple object tracking metrics, as discussed above. Their fundamental limitation is speed: none of the work discussed in this section reports performance above 2.6 Hz, even without accounting for detection time. They also assume the existence of detections in every frame of a video, without taking into account the inference time of high-performance object detectors.

Some work in the traffic monitoring field has been done in recent years [8]. In [10], vehicle counting is performed employing an environment segmentation strategy. In [9], a tracking approach using background subtraction and Kalman filter tracking is proposed to tackle data collection in roundabouts. These approaches usually run at real-time speed due to the use of background subtraction for detecting moving objects. However, such object identification methods are a limitation in scenarios that present camera movement (on-board cameras), shadows, image artifacts, or objects that appear very close to each other, since the latter are usually identified as a single object by the background subtraction algorithm.

3 Video Traffic Monitoring

We propose a complete traffic monitoring system that combines tracking and detection and can operate as a baseline for multiple applications.

Fig. 1.

Architecture of our traffic monitoring system. It is formed by three modules: detection (yellow), tracking (red) and data association (blue). (Color figure online)

Our system is made up of three blocks (Fig. 1): detection, tracking and data association. To detect vehicles in an image, we use a deep learning based detector. For tracking, we combine a DCF-based tracker with a Kalman-based one, which enables us to compute a failure detection metric to identify occluded vehicles. Finally, in the data association module, we assign each detection to its corresponding tracker through the Hungarian method [16, 24] and update the trackers.

Algorithm 1 presents the main steps of the system. The inputs to the system at every time instant t are the new frame (\( Im_t \)) of the video and the set of trackers from the previous time instant (\(\varPhi _{t-1}\)). First, the trackers' positions in the new image (\( Im_t \)) are estimated. We start by calculating the new position of the object with a DCF tracker (Algorithm 1, line 3; hereafter Algorithm 1:3). Tracking based solely on DCF trackers has two limitations: (i) it cannot handle occlusions (Fig. 3); (ii) it does not provide robust tracking failure detection (i.e. knowing when the tracking fails), as the PSR (Peak to Sidelobe Ratio) value [3], which measures the degree of spread of the correlation filter response, is not a reliable measure. As shown in Fig. 4, the PSR takes different threshold values for different videos and scenarios, which makes it difficult to identify when a tracker is lost.
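For reference, a minimal sketch of how a PSR value is typically computed from a DCF response map [3]; the size of the exclusion window around the peak is a common choice, not a value taken from this paper:

```python
import numpy as np

def psr(response: np.ndarray, peak_win: int = 5) -> float:
    """Peak-to-Sidelobe Ratio of a correlation response map.

    A small window around the peak is excluded; the PSR is the peak
    value normalized by the mean and std of the remaining sidelobe.
    """
    py, px = np.unravel_index(np.argmax(response), response.shape)
    peak = response[py, px]
    mask = np.ones_like(response, dtype=bool)
    mask[max(0, py - peak_win):py + peak_win + 1,
         max(0, px - peak_win):px + peak_win + 1] = False
    sidelobe = response[mask]
    return (peak - sidelobe.mean()) / (sidelobe.std() + 1e-12)
```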

Algorithm 1. Main steps of the traffic monitoring system.
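A sketch of the main loop of Algorithm 1, reconstructed from the description in this section; all helper names (dcf_predict, kalman_predict, estimate_roi, detect_fpn, associate, new_tracker) are hypothetical placeholders rather than the authors' actual identifiers:

```python
def monitoring_step(im_t, trackers, t, t_last_det, tau):
    for phi in trackers:
        phi.bbox_dcf = dcf_predict(phi, im_t)              # Algorithm 1:3
        phi.bbox_kf = kalman_predict(phi)                  # Algorithm 1:4
        phi.roi = estimate_roi(phi.bbox_dcf, phi.bbox_kf)  # Algorithm 1:5

    if t - t_last_det >= tau:                              # Algorithm 1:6-7
        detections = detect_fpn(im_t)                      # Psi_t
        # Build the IOU cost matrix between predicted trackers and
        # detections and solve it with the Hungarian method
        matches, lost, new = associate(trackers, detections)  # 1:8-11
        for phi, psi in matches:
            phi.update(psi)                                # Algorithm 1:13
        for phi in lost:                                   # Algorithm 1:14-19
            phi.mark_for_deletion()
        trackers += [new_tracker(psi) for psi in new]
        t_last_det = t
    # If no detection is run, the predictions alone give Phi_t (1:20)
    return trackers, t_last_det
```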

To address both problems, we introduce a Kalman Filter (KF) tracker that, by modeling the movement of the object, can handle occlusions and, in combination with the DCF tracker, can estimate the error in the tracking process. Once the vehicle's new position has been calculated by the DCF tracker, we estimate the position using the Kalman filter. We use a linear constant velocity model in the KF, so the state of each vehicle is modeled as:

$$\begin{aligned} \mu := [x, y, v_x, v_y] \end{aligned} \qquad (1)$$

Here x and y are the position of the object, and \(v_x\) and \(v_y\) represent the linear velocity along each axis. We perform the Kalman prediction in Algorithm 1:4. With the bounding boxes proposed by both methods, we estimate the region of interest (ROI) in which the object might be located (Algorithm 1:5). The larger the difference between the two trackers, the larger the ROI. Occlusions can be identified when both predictors propose very different bounding boxes, since the bounding boxes provided by the DCF remain static, while those from the Kalman filter follow the previous movement pattern of the object (Fig. 2).
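A minimal numpy sketch of this constant-velocity model for the state of Eq. (1); the time step and noise covariances are illustrative values, not parameters reported in the paper:

```python
import numpy as np

dt = 1.0  # one frame (illustrative)
F = np.array([[1, 0, dt, 0],   # x  <- x + vx*dt
              [0, 1, 0, dt],   # y  <- y + vy*dt
              [0, 0, 1,  0],   # vx <- vx
              [0, 0, 0,  1]])  # vy <- vy
H = np.array([[1, 0, 0, 0],    # only the position (x, y) is observed
              [0, 1, 0, 0]])

def kf_predict(mu, P, Q):
    """Predict step: propagate state and covariance one frame ahead."""
    return F @ mu, F @ P @ F.T + Q

def kf_update(mu, P, z, R):
    """Update step: correct the prediction with a position measurement z."""
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    mu = mu + K @ (z - H @ mu)
    P = (np.eye(4) - K @ H) @ P
    return mu, P
```

One natural instantiation of the search ROI of Algorithm 1:5, consistent with the rule above, is the union of the two predicted boxes padded proportionally to the distance between their centres.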

Fig. 2.

Images courtesy of Aplygenia S.L.

Creation of a search ROI for occlusion handling. (a) Both tracking methods agree on the object position. (b) As the DCF fails to track the occluded object, the distance between both estimations increases, and so does the search ROI. Finally, in (c), the detector finds the vehicle on the other side of the road and the tracker recovers.

Fig. 3.

Images courtesy of Aplygenia S.L. (Color figure online)

(a) The low-level DCF tracker (in green) cannot recover the identity of the object once it is occluded, as it relies only on appearance. (b) The combination of a DCF and a KF manages occlusions, as it also takes into account the object's motion model.

Fig. 4.

Images courtesy of Aplygenia S.L.

PSR values are poor predictors of tracking failures for the DCF tracker. The image shows a case in which the object is being tracked successfully but the PSR value fluctuates and shows a high degree of dispersion. The opposite case, a tracking failure not detected by the PSR values, is also frequent.

Our system is robust enough that we do not need to call the detector in every frame. The aim of the detection component is twofold. First, it initializes a tracker for every object of interest in the scene. Second, it refines the location and size of the trackers' bounding boxes along their trajectories through the data association component (see Fig. 1), improving tracking performance metrics. If the time elapsed since the previous detection is greater than or equal to \(\tau \), detection is performed using a convolutional neural network (Algorithm 1:6–7), which returns a set of detections \(\varPsi _{t}\). In practice, this is done with a fully convolutional network, the FPN [18], which uses feature map information at different scales to locate objects from small to large through a pyramidal architecture with lateral connections. The FPN provides high precision at a high computational cost, taking about 130 ms to perform a full detection on an HD image. If no detection is performed at the current time t, the tracking prediction alone (\(\overline{\varPhi }_t\)) determines the current tracker state (\(\varPhi _t\), Algorithm 1:20).

The data association block assigns each detection to its corresponding tracker and identifies objects that enter or leave the scene. To do so, we build the cost matrix \(IOU_t\) (see Algorithm 1:8–10), where every entry is the Intersection Over Union (IOU) between a tracker \(\overline{\varphi }_{t}^i\) and a detection \(\psi _{t}^{j}\). The association is solved by the Hungarian method (Algorithm 1:11). For every successful assignment (\({<}\varphi ^\alpha _t,\psi ^\beta _t{>}\)), tracker \(\varphi ^\alpha _t\) is updated with detection \(\psi ^\beta _t\) (Algorithm 1:13). Finally, trackers not updated in the data association phase become candidates for deletion, and detections not assigned are initialized as new trackers (Algorithm 1:14–19).
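A minimal sketch of this association step using scipy's Hungarian solver; the gating threshold min_iou is an illustrative choice, not a value from the paper:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IOU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-12)

def associate(tracker_boxes, det_boxes, min_iou=0.3):
    """Hungarian assignment on the IOU matrix (Algorithm 1:8-11)."""
    iou_mat = np.array([[iou(t, d) for d in det_boxes]
                        for t in tracker_boxes])
    rows, cols = linear_sum_assignment(-iou_mat)  # maximize total IOU
    matches = [(r, c) for r, c in zip(rows, cols)
               if iou_mat[r, c] >= min_iou]       # gate weak pairs
    unmatched_trk = set(range(len(tracker_boxes))) - {r for r, _ in matches}
    unmatched_det = set(range(len(det_boxes))) - {c for _, c in matches}
    return matches, unmatched_trk, unmatched_det
```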

4 Results

The proposed system (Fig. 1) runs on a server with an Intel Xeon E5-2623 v4 2.60 GHz CPU, 128 GB of RAM and an Nvidia Tesla P40 (GP102GL) GPU with 24 GB of memory. Table 1 shows the times of the two most computationally expensive operations of our system, detection and tracking; the computing times of the other tasks are negligible. In a 30 fps video, we have about 0.033 s per frame for the tracking task. Using 15 threads for parallelization, the system is theoretically able to process up to 148 objects in the image while maintaining real-time performance, i.e. 30 fps. As mentioned before, detection is the slowest part of our system, taking on average 0.135 s on an HD image and 0.075 s at VGA resolution. These values are below the 0.2 s threshold required by the system for the detection module, as we only perform detection 5 times every 30 frames, i.e. once every 6 frames, or every 0.2 s at 30 fps.
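This capacity follows from a simple budget argument; a back-of-the-envelope form, assuming tracking parallelizes perfectly across the threads:

$$N_{\max } = \left\lfloor \frac{n_{threads} \cdot t_{frame}}{t_{track}} \right\rfloor$$

With \(n_{threads} = 15\) and \(t_{frame} \approx 33.3\) ms, the reported capacity of 148 objects implies a per-object tracking time \(t_{track}\) of roughly 3.4 ms; the exact value comes from Table 1, which is not reproduced in this text.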

Table 1. Computational times for the detection and tracking modules of the traffic monitoring system.

4.1 Roundabout Monitoring

In this section, we analyze our complete system (Fig. 1) for roundabout monitoring. The objective is to identify the entry and the exit each vehicle takes, maintaining its identity while it remains in the roundabout. The final goal is to provide the I/O matrix R, in which every element \(R(i,j)\) represents the number of vehicles that joined the roundabout through entry i and left through exit j. If a vehicle enters the roundabout and exits it with the same ID, we count that as a tracking success. On the contrary, if the identity changes along the video, we count that vehicle as a tracking failure.
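For illustration, a minimal sketch of the bookkeeping behind R; the roundabout geometry and helper names are hypothetical, not taken from the paper:

```python
import numpy as np

n_entries, n_exits = 4, 4      # illustrative roundabout geometry
R = np.zeros((n_entries, n_exits), dtype=int)
entry_of = {}                  # track ID -> entry index

def on_vehicle_enters(track_id, entry_idx):
    # Record the entry the vehicle took when its track first appears.
    entry_of[track_id] = entry_idx

def on_vehicle_exits(track_id, exit_idx):
    # A vehicle counts as a tracking success only if it keeps a single
    # ID from entry to exit; in that case R(i, j) is incremented.
    if track_id in entry_of:
        R[entry_of.pop(track_id), exit_idx] += 1
```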

Table 2. Computational times for the fast version of our traffic monitoring system.

To compute the metrics, we use a video dataset consisting of five videos of roundabouts recorded from an Unmanned Aerial Vehicle (UAV) at 30 fps in HD resolution. The videos present different conditions that are challenging for traffic monitoring: shadows, total occlusions (two-level roads), camera movement, etc. Figure 5 shows snapshots of some of these videos.

As explained before, the robustness of our system allows us to avoid calling the detector at every frame. This led us to develop a fast version that performs tracking on one of every 3 frames and detection on one of every 6 frames, without degrading the performance metrics for roundabout monitoring. Table 2 shows the times for this fast version.
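One possible realization of this schedule, where run_tracking and run_detection stand for the tracking and detection blocks of Fig. 1 (a sketch, not the authors' code):

```python
for t, frame in enumerate(video):
    if t % 3 == 0:
        run_tracking(frame)      # DCF + Kalman prediction
    if t % 6 == 0:
        run_detection(frame)     # FPN detection + data association
```

Since the detection frames are a subset of the tracking frames, only 10 of every 30 frames are processed at all.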

Table 3. Results in the video dataset for roundabout monitoring. The columns are: video, number of occlusions (\(\#\)occ), number of vehicles occluded (\(\#\)vocc), duration of video, total number of vehicles (\(\#\)vehicles) and success rate obtained by our tracking system.

Table 3 shows the results obtained from processing the I/O matrices of five videos with 995 vehicles in total. We used the fast version of our traffic monitoring system to highlight the robustness of the proposal even when processing just 10 of every 30 frames. Theoretically, this version can track up to 492 objects, although in these videos the maximum number of concurrent objects was 60. An average success rate of 91% is obtained. The results also show our system's ability to handle occlusions, as two of the videos are scenarios with a high rate of total occlusions: in one of them, 50% of the vehicles are totally occluded, nearly twice each on average.

Fig. 5.

Example frames from some videos of the roundabout monitoring dataset. These videos are recorded from a UAV flying over a roundabout. Images courtesy of Aplygenia S.L.; their distribution is restricted.

5 Conclusions

We have presented a traffic monitoring system that combines convolutional neural network based detection, DCF and Kalman trackers, and Hungarian data association. The system is able to track hundreds of objects in real time while being robust to occlusions. The combination of the DCF and Kalman filters makes it possible to estimate the error of each tracker, thus increasing the robustness and reliability of the system. We have applied the traffic monitoring system to the problem of roundabout monitoring. Our system achieves a 91% success rate for the I/O matrix, even in cases with high occlusion rates, shadows and movement of the UAV on-board camera.