Definition of the Subject

The demand on today’s transportation systems is growing quite rapidly, with an estimated 30% travel demand increase predicted over the next decade [1]. Transportation infrastructure growth is not keeping pace with this traffic demand, therefore researchers and practitioners have turned to intelligent transportation systems (ITS) to improve overall traffic efficiency, thereby maximizing the current infrastructure’s capacity.

ITS uses sensors, communication, and traffic control technologies to better handle the increased demands in traffic, to enhance public safety, and to reduce environmental impacts of transportation [2]. One key ITS function that is important for many applications is the ability to gather traffic information using vehicle detection and surveillance techniques. Vehicle detection technology is widely used to provide information for vehicle counting, classification, and traffic characterization. Further, when it is implemented on a moving vehicle, it can even be utilized for vehicle navigation purposes [3]. The generalized vehicle detection problem from a moving vehicle is challenging, which aims to determine the surrounding vehicle’s (relative) position, speed, and trajectory. A driver is able to determine a short-term and long-term trajectory based on the vehicle’s current position and information about the surrounding vehicles.

In the case where vehicle detection is carried out on a moving vehicle, the surrounding vehicles’ information is commonly collected by various sensing systems. These sensing systems typically consist a suite of sensors that can provide real-time measurements, and play an important role in the development of driver assistant systems (DAS) .

One of the most common sensing techniques used is computer vision. Computer vision sensors provide a large amount of information on the surrounding environment. However, the computer vision sensors often suffer from intensity variations, narrow fields of view, and low-accuracy depth information [4]. In contrast, a laser ranging method (i.e., LIDAR) measures distance and relative angle from the sensor to the target by calculating time of flight of the laser. Its measurements depend on the size and reflectivity of the target, so the probability of detection decreases with distance [5]. Since their characteristics of computer vision and LIDAR complement each other, it is useful to integrate both computer vision and LIDAR for detecting different objects around the vehicle’s environment.

A tightly coupled LIDAR and computer vision system is proposed in this entry to solve the vehicle detection problem. This sensing system is mounted on a test vehicle, as is shown in Fig. 1. A pair of LIDAR sensors is mounted on the front bumper of the vehicle, and a camera is mounted behind the front windshield.

Vehicle Detection, Tightly Coupled LIDAR and Computer Vision Integration for. Figure 1
figure 2019

The mobile sensing platform. This probe vehicle carries one camera and two IBEO ALASCA XT LIDAR sensors

Since sensor fusion systems are commonly used to integrate sensory data from disparate sources, the output will be more accurate and complete in comparison to the output of a single sensor. In order to effectively extract and integrate 3D information from both computer vision and LIDAR systems, the relative position and orientation between these two sensor modalities should be obtained. A sensor calibration process is used to identify the parameters that describe the relative geometric transformation between sensors [6, 7], which is a key step in the sensor fusion process. However, current calibration methods work only for visible beam LIDAR, 3D LIDAR, and 2D LIDAR [812]. To date, there does not exist any convenient calibration methods for multi-planar “invisible-beam” LIDAR and computer vision systems.

A novel calibration approach of a camera with a multi-planar LIDAR system is proposed in this entry, where the laser beams are invisible to the camera. The camera and LIDAR are required to observe a planar pattern at different positions and orientations. Geometric constraints of the “views” from the LIDAR and camera images are resolved as the coordinate transformation coefficients. The proposed approach consists of two stages: solving a closed-form equation, followed by applying a nonlinear algorithm based on a maximum likelihood criterion. Compared with the classical methods which use “beam-visible” cameras or 3D LIDAR systems, this approach is easy to implement at low cost.

The combination of LIDAR and camera is employed in vehicle detection since the geometric transformation between two sensors is known from calibration. In the sensor fusion system, LIDAR sensor estimates possible vehicle positions. This information is transformed into the image coordinates. Different regions of interest (ROIs) in the imagery are defined based on the LIDAR object hypotheses. An Adaboost object classifier is then utilized to classify the vehicle in ROIs. A classifier error correction approach chooses an optimal position of the detected vehicle. Finally, the vehicle’s position and dimensions are derived from both the LIDAR and image data. This sensor fusion system can be used in ITS applications such as traffic surveillance and roadway navigation tasks.

This entry is organized as follows: section “Introduction and Background” reviews background and related work, including the calibration methods for LIDAR and camera, and sensor fusion–based vehicle detection algorithms. Section “A Novel Multi-Planar LIDAR and Computer Vision Calibration Procedure Using 2D Patterns” focuses on the calibration of LIDAR and camera system. The coordinate systems of the LIDAR and camera sensors are introduced, followed by the mathematical derivation of geometric relations between the two sensors. The equations are then solved in two stages: a closed-form solution, followed by applying a nonlinear algorithm based on a maximum likelihood criterion. Section “Tightly Coupled LIDAR and Computer Vision Integrated System for Vehicle Detection” describes the sensor fusion–based vehicle detection system. Both hardware and data processing software of the sensor fusion system are introduced in this section. Finally, future work is discussed in section “Conclusions and Future Work”.

Introduction and Background

During the past decade, a variety of research has been carried out in the traffic surveillance area, where numerous techniques have been developed to obtain parameters such as vehicle counts, location, speed, trajectories and classification data, for both in-vehicle navigation and freeway traffic surveillance applications [13].

As one of the most popular traffic surveillance techniques, computer vision-based approaches are one of the most widely used and promising techniques. LIDAR is another attractive technology due to its high accuracy in ranging, wide-area field of view, and low data processing requirements [5]. The other sensors used in vehicle detection include radar and embedded loop sensors. A brief comparison of the sensor technologies and their advantages as well as disadvantages is given in Table 1.

Vehicle Detection, Tightly Coupled LIDAR and Computer Vision Integration for. Table 1 Performance comparison of existing sensor technologies used in ITS [14]

Sensor fusion systems in the vehicle detection application aim to gather information from the far-field as well as near-field sensors, and combine them in a meaningful way [15]. The output of the sensor fusion system should be the states of objects around the test vehicle.

In this section, a few of the most popular LIDAR sensors that are available commercially today are introduced. This is followed by a discussion of current LIDAR and computer vision calibration methods. The calibration methods include the visible LIDAR beam–based calibration, 3D LIDAR and the 2D single planar LIDAR calibration. Sensor fusion techniques for vehicle detection and tracking systems are also discussed here.

Laser Range Finder (LIDAR)

The vehicle detection solution aims to estimate the states of surrounding vehicles. Vehicle state includes position, orientation, speed, and acceleration. State estimation addresses the problems of estimating quantities from sensors that are not directly observable [16].

LIDAR sensors are commonly utilized in vehicle navigation for detecting surrounding vehicles, infrastructure, and pedestrians. It can also be used in vehicle localization, either as the only sensor or in some combination with GPS and INS.

One of the most popular LIDAR sensors is the SICK LMS 2xx series [17]. A SICK LIDAR operates at distance up to 80 m with an angular resolution of 0.5o and a measurement accuracy of typically 5 cm. The distance between the sensor and an object is calculated by measuring the time interval between an emitted laser pulse and a reception of the reflected pulse. Amplitude of the received signal is used to determine reflectivity of the object surface. The SICK LIDAR is able to detect dark objects at long ranges. Moreover, compared to the CCD cameras and RADAR systems, the view angle of SICK LIDAR is larger, e.g., 180o. Figure 2a illustrates the SICK LMS200 LIDAR.

Vehicle Detection, Tightly Coupled LIDAR and Computer Vision Integration for. Figure 2
figure 2020

A variety of LIDAR sensors: (a) SICK LMS LIDAR, (b) HOKUYO UXM-30LN LIDAR, (c) IBEO ALASCA XT LIDAR, and (d) Velodyne HDL-64E LIDAR

The HOKUYO UXM-30LN LIDAR is another single planar range sensor for intelligent robots and vehicles [18]. Its detection range is up to 60 m, and the horizontal field of view is 190o. The distance accuracy is 30 mm when the range is less than 10 m, and 50 mm when the range is between 10 and 30 m. The angular resolution is 0.25o. The device is shown in Fig. 2b.

As another example, the ALASCA XT laser scanner made by IBEO is a multi-planar LIDAR, which splits the laser beam into four vertical layers. The aperture angle is 3.2o. The distance range is up to 200 m, and the horizontal field of view is 240o [19]. Figure 2c shows the IBEO sensor.

The Velodyne HDL-64E LIDAR is a 3D sensor which is specifically designed for autonomous vehicle navigation [20]. With 360o horizontal by 25o vertical field of view, 0.09o angular resolution, and 10 Hz refresh rate, Velodyne provides surrounding 3D traffic information with high accuracy (<5 cm resolution) and efficiency. The detection range is 100 m for cars, and the latency is less than 0.05 ms. Figure 2d illustrates the Velodyne sensor.

LIDAR sensors were initially used by automatic guided vehicles in indoor environments. More recently, the performance of these ranging sensors has been improved, so that they now can be used in outdoor environments on vehicles. A primary application example is the DARPA Urban Challenge in 2007, in which autonomous vehicles were capable of driving in traffic and performing complex maneuvers such as acceleration and deceleration, lane change, and parking [13]. IBEO and SICK LIDARs were used in many of the finalists for object detection and localization. The Velodyne sensor was used by the five out of six of the finishing teams.

LIDAR and Computer Vision Calibration

Sensor fusion systems are commonly used to combine the sensory data from disparate sources, so that the result will be more accurate and complete in comparison to the output of one sensor. As an example, the winner of the 2007 DARPA Urban Grand Challenge performed sensor fusion between a GPS receiver, long- and short-range LIDAR sensors, and stereo cameras [21].

In order to effectively extract and integrate 3D information from both computer vision and LIDAR systems, the relative position and orientation between these two sensor modalities can be obtained. The relative geometric transformation can be solved through a calibration process [6, 7]. Several approaches have been defined and utilized for LIDAR and computer vision calibration. These techniques can be roughly classified into three categories.

Visible Beam Calibration

Visible beam calibration is performed by using cameras to observe the LIDAR beams or reflection points. The calibration system usually consists of an active LIDAR and some infrared or near-infrared cameras. The LIDAR system typically projects stripes with a known frequency, while these stripes are visible to the camera [810]. For example, the LIDAR beams used in [9] are captured by a 955fps high-speed camera. However, the image of the high-speed camera is not suitable for monitoring. Color image of the LIDAR beams is generated by letting the vision output go through a beam splitter.

This approach requires a high-cost infrared camera, which should be sensitive to the spectral emission band of the LIDAR. Therefore, this method is not suitable for the low-cost sensor fusion systems.

Three-Dimensional (3D) LIDAR-Based Calibration

The 3D LIDAR-based calibration method calibrates the computer vision system with a 3D LIDAR system. Various features are captured by both the camera and the LIDAR. These features are usually in the form of planes, corner, or edges of specific calibration object. An elaborate setup is required. Moreover, dense LIDAR beams in both the vertical and horizontal directions are necessary for the calibration.

The 3D calibration algorithm presented in [11] uses checkerboard in calibration, which is commonly used for camera calibration. Coefficients of the checkerboard plane are first calculated by LIDAR, and then the coefficients are computed by camera in computer vision coordinates. A two-stage optimization procedure is implemented to minimize the distance between the calculated results and the measurement output.

When the features are edges or corners, accuracy of the calibration method depends on the accuracy by which features are localized [22]. When the features are planes, the LIDAR beams must be sufficiently dense [11]. Therefore, these methods cannot easily be applied to single planar or sparse multi-planar LIDAR systems.

Two-Dimensional (2D) Planar-Based Calibration

This approach works for the calibration of camera and 2D LIDAR integration system. The calibration system proposed in [23] consists of a monochrome CCD camera and a LIDAR. The camera and the LIDAR have been pre-calibrated so that their coordinates are parallel to each other. A “V”-shaped pattern is designed to obtain the translation between these two sensors. The calibration procedure is implemented in two steps: the LIDAR detects the “V” shape and finds the vertex, and camera detects the intersection line which cuts the pattern into two parts.

Another calibration approach is proposed in [12] using a checkerboard for calibration. This method is based on observing a plane of an object and solving distance constraints from the camera and LIDAR systems. This approach works for a single planar LIDAR only, e.g., the SICK LMS 2xx series LIDAR.

To date, there does not exist any convenient calibration methods for multi-planar “invisible-beam” LIDAR and computer vision systems. Section “A Novel Multi-Planar LIDAR and Computer Vision Calibration Procedure Using 2D Patterns” proposes a method to handle this case, which is the first calibration method for this system as to the author’s best knowledge.

Sensor Fusion–Based Vehicle Detection and Tracking

Computer vision is generally used on mobile platform–based object detection and tracking systems, either separately or along with LIDAR sensors [24]. Most of the computer vision techniques utilize a simple segmentation method such as background subtraction or temporal difference to detect objects [25]. However, these approaches suffer with the fast background changes due to camera motion. A trainable object detection method is proposed in [26] based on a wavelet template, which defines the shape of an object in terms of a subset of the wavelet coefficients of the image. However, the application of vision sensors in vehicle navigation is far from sufficient: clustering, illumination, occlusion, among many other factors, affect the overall performance. Fusion of camera and active sensors such as LIDAR or RADAR, is being investigated in the context of on-board vehicle detection and classification.

A LIDAR and a monocular camera–based detection and classification system is proposed in [27]. Detection and tracking are implemented in the LIDAR space, and the object classification work both in LIDAR space (Gaussian Mixture Model classifier ) and in computer vision system (Adaboost classifier ). A Bayesian decision rule is proposed to combine the results from both classifiers, and thus a more reliable classification is achieved.

Another integration structure is proposed in [28], in which a LIDAR is integrated with a far-infrared camera and an ego motion sensor. LIDAR-based shape extraction is employed to select region of interests (ROIs). This system combines a straightforward methodology with a backward loop one. Kalman filtering is used as the data fusion algorithm.

A similar technique is presented in [29], which makes use of RADAR, velocity, and steering sensors to generate position hypotheses. Examination of the hypotheses is implemented by a computer vision sensor. Classification is performed using a shape model for either the monocular camera vision or the infrared spectrum images.

A Novel Multi-planar LIDAR and Computer Vision Calibration Procedure Using 2D Patterns

In this section, a novel calibration approach is proposed for a LIDAR and computer vision sensor fusion system. This system consists of a camera with a multi-planar LIDAR , where the laser beams are invisible to the camera. This calibration method also works for computer vision and 3D LIDAR systems.

Although several calibration methods have been developed to obtain the geometric relationship between two sensors, only a few of them have provided a complete sensitivity analysis of the calibration procedure (see section “LIDAR and Computer Vision Calibration”). As part of the calibration method proposed in this section, the effects of LIDAR noise level as well as total number of poses on calibration accuracy are also discussed.

This section is organized as follows: section “Sensor Alignment” gives the setup using planar planes and defines the calibration constraint. Section “Calibration Solutions” describes in detail how to solve this constraint in two steps. Both a closed-form solution and a nonlinear minimization solution based on maximum likelihood criterion are introduced. Experimental results with different poses are provided in section “Experimental Results”. Finally, a brief summary is given in section “Summary”.

Sensor Alignment

The setup for a multi-planar LIDAR and camera calibration is described here.

Sensor Configuration

In the calibration system, an instrumented vehicle is equipped with two IBEO ALASCA XT LIDAR sensors which are mounted on the front bumper. The LIDAR sensor scans with four separate planes. The distance range is up to 200 m, the horizontal field of view angle of a single LIDAR is 240o, and the total vertical field of view for the four planes is 3.2o. A camera is mounted on the vehicle behind the front windshield, as is shown in Fig. 1.

In order to use the measurements from different kinds of sensors at various positions on the vehicle, the measurements should be transformed from their own coordinate into some common coordinate system. This section focuses on obtaining the spatial relationship between video and LIDAR sensors. The geometric sensor model is shown in Fig. 3.

Vehicle Detection, Tightly Coupled LIDAR and Computer Vision Integration for. Figure 3
figure 2021

Geometric model with the camera and the LIDAR

Vision, LIDAR, and World Coordinate Systems

There are several coordinate systems in the overall system to be considered: the camera coordinates, the LIDAR coordinates, and the world coordinate systems.

A camera can be represented by the standard pinhole model. One 3D point in the camera coordinate denoted by \( {{\mathbf{P}}_c} = {\left[ {\matrix{ {{{\mathbf{X}}_c}}& {{{\mathbf{Y}}_c}}& {{{\mathbf{Z}}_c}} \cr } } \right]^T} \) is projected to a pixel \( {\mathbf{p}} = {\left[ {\matrix{ u& v \cr } } \right]^T} \) in the image coordinate. The pinhole model is given as [30]:

$$ s{\bf{p}} \sim {\bf{A}} \left[ {{\bf{R}} \quad {\bf{t}} } \right]{{\bf{P}}_{c}} \quad {{\rm{with}} \quad {\bf{A}} = \left[{\matrix{ \alpha& \gamma& {{u_0}} \cr 0& \beta& {{v_0}} \cr 0& 0& 1 \cr } } \right]} $$
(1)

where \( s \) is an arbitrary scale factor. \( {\mathbf{A}} \) is the camera intrinsic matrix defined by coordinates of the principal point \( \left( {\matrix{ {{u_0}}& {{v_0}} \cr } } \right) \), scale factors \( \alpha \) and \( \beta \) in image \( u \) and \( v \) axes, and skewness of the two image axes \( \gamma \). \( \left( {\matrix{ {{\mathbf{R}},}& {\mathbf{t}} \cr } } \right) \) are called extrinsic parameters. The \( 3 \times 3 \) orthonormal rotation matrix \( {\mathbf{R}} \) represents the orientation of world coordinates to the camera coordinate system. The translation matrix \( {\mathbf{t}} \) is a three-vector representing origin of world coordinates in the camera’s frame of reference. In the real world, lens of the camera may also have image distortion coefficients, which include radial and tangential distortions and are usually stored in a five-vector [31]. In this entry, the lens is assumed to have no significant distortion, or the distortion has already been eliminated.

The LIDAR sensor provides distance and direction of each scan point in LIDAR coordinates. Distances and directions can be converted into a 3D point denoted by \( {{\mathbf{P}}_l} = {\left[ {\matrix{ {{X_l}}& {{Y_l}}& {{Z_l}} \cr } } \right]^T} \) [19]. The origin of LIDAR coordinates is the equipment itself. \( X \), \( Y \), and \( Z \) axes are defined as forward, leftward, and upward from the equipment, respectively. The camera and LIDAR reference systems are shown in Fig. 4.

Vehicle Detection, Tightly Coupled LIDAR and Computer Vision Integration for. Figure 4
figure 2022

Two coordinate systems. (a) The camera coordinates and screen coordinate systems, and (b) the LIDAR coordinate system

In addition to the camera and LIDAR reference systems, another coordinate system is used in the calibration procedure: the world frame of reference. In the calibration process, a checkerboard is placed in front of the sensors. The first grid on the upper-left corner of this board is defined to be the origin of the world coordinates [31].

Suppose a fixed point \( {\mathbf{P}} \) is denoted as \( {{\mathbf{P}}_c} = {\left[ {\matrix{ {{X_c}}& {{Y_c}}& {{Z_c}} \cr } } \right]^T} \) in the camera coordinates, and \( {{\mathbf{P}}_l} = {\left[ {\matrix{ {{X_l}}& {{Y_l}}& {{Z_l}} \cr } } \right]^T} \) in the LIDAR coordinates . The transformation from LIDAR coordinate to camera coordinate is given as:

$$ {{\mathbf{P}}_c} = {\mathbf{R}}_l^c{{\mathbf{P}}_l} + {\mathbf{t}}_l^c $$
(2)

where \( \left( {\matrix{ {{\mathbf{R}}_l^c,}& {{\mathbf{t}}_l^c} \cr } } \right) \) are the rotation and translation parameters which relate LIDAR coordinate system to the camera coordinate system.

The purpose of this calibration method is to solve Eq. 2 and obtain coefficients \( \left( {\matrix{ {{\mathbf{R}}_l^c,}& {{\mathbf{t}}_l^c} \cr } } \right) \), so that any given point in the LIDAR reference system can be transformed to the camera coordinates.

Basic Geometric Interpretation

A checkerboard visible to both sensors is used for calibration. In the following sections, the planar surface defined by the checkerboard is called the checkerboard plane. Without loss of generality, the checkerboard plane is assumed to be on \( Z = 0 \) in the world coordinates. Let \( {{\mathbf{r}}_3} \) denotes the i-th column of the rotation matrix \( {\mathbf{R}} \). \( {{\mathbf{r}}_3} \) is also the surface normal vector of the calibration plane in camera coordinate systems [31].

Note the origin of world coordinate is the upper-left corner of the checkerboard, and the origin of camera coordinate is the camera itself. The translation vector \( {\mathbf{t}} \) represents relative position of the checkerboard’s upper-left corner in the camera’s reference system. Since both \( {\mathbf{t}} \) and \( {{\mathbf{P}}_c} \) are points on the checkerboard plane denoted in camera coordinates, a vector \( \vec{v} \) is defined as \( \vec{v} = {{\mathbf{P}}_c} - {\mathbf{t}} \). Note that \( \vec{v} \) is a vector on the checkerboard plane, and \( {{\mathbf{r}}_3} \) is orthogonal to this plane, so:

$$ {{\mathbf{r}}_3} \cdot \vec{v} = 0 $$
(3)

where ⋅ denotes the inner product. The geometric interpretation for Eq. 3 is illustrated in Fig. 5.

Vehicle Detection, Tightly Coupled LIDAR and Computer Vision Integration for. Figure 5
figure 2023

Geometric interpretation of the camera coordinates, the LIDAR coordinates, and checkerboard

By substituting Eq. 2 into Eq. 3, Eq. 3 becomes:

$$ {\mathbf{r}}_3^T\left( {{\mathbf{R}}_l^c{{\mathbf{P}}_l} + {\mathbf{t}}_l^c - {\mathbf{t}}} \right) = 0 $$
(4)

Since point \( {{\mathbf{P}}_l} \) in LIDAR coordinates is \( {\left[ {\matrix{ {{X_l}}& {{Y_l}}& {{Z_l}} \cr } } \right]^T} \), from Eq. 4:

$$ {\mathbf{r}}_3^T\, \left[ {\matrix{ {{\mathbf{R}}_l^c}& {{\mathbf{t}}_l^c - {\mathbf{t}}} \cr } } \right]\, \left[ {\matrix{ {{X_l}} \cr {{Y_l}} \cr {{Z_l}} \cr 1 \cr } } \right] = 0 $$
(5)

For each LIDAR point on the checkerboard plane, Eq. 5 explains the geometric relationships and constraints on \( \left( {\matrix{ {{\mathbf{R}}_l^c,}& {{\mathbf{t}}_l^c} \cr } } \right) \). This is the basic constraints for the calibration from the LIDAR to the vision coordinate system.

Calibration Solutions

This subsection provides the method to efficiently obtain calibration coefficients \( \left( {\matrix{ {{\mathbf{R}}_l^c,}& {{\mathbf{t}}_l^c} \cr } } \right) \). An analytical solution is proposed, followed by a nonlinear optimization technique based on the maximum likelihood criterion.

Closed-Form Solution

Initially, the camera’s intrinsic parameters are calibrated using a standard Camera Calibration Toolbox [31]. For each pose of the checkerboard, there is one set of camera extrinsic parameters \( \left( {\matrix{ {{\mathbf{R}},}& {\mathbf{t}} \cr } } \right) \). Each \( \left( {\matrix{ {{\mathbf{R}},}& {\mathbf{t}} \cr } } \right) \) is determined also using the toolbox, after which \( {{\mathbf{r}}_3} \) and \( {\mathbf{t}} \) in Eq. 5 are obtained.

For simplicity, it is defined that \( {\mathbf{r}}_3 = {\left[ {\matrix{ {r_{{31}}}& {r_{{32}}}& {r_{{33}}} \cr } } \right]^T} \), \( {{\mathbf{\Delta }}_i}{ = }{\mathbf{t}}_l^c - {\mathbf{t}} = {\left[ {\matrix{ {\Delta_x}& {\Delta_y}& {\Delta_z} \cr } } \right]^T} \), and \( {m_{{ij}}} \) is the element on the i-th row, j-th column in matrix \( {\mathbf{R}}_l^c \). Suppose for one pose of the checkerboard, there are \( p \) LIDAR points on the checkerboard plane, denoted as \( {{\mathbf{P}}_{{l,1}}} = {\left[ {\matrix{ {{X_{{l,1}}}}& {{Y_{{l,1}}}}& {{Z_{{l,1}}}} \cr } } \right]^T} \), \( {{\mathbf{P}}_{{l,2}}} = {\left[ {\matrix{ {{X_{{l,2}}}}& {{Y_{{l,2}}}}& {{Z_{{l,2}}}} \cr } } \right]^T} \),…,\( {{\mathbf{P}}_{{l,p}}} = {\left[ {\matrix{ {{X_{{l,p}}}}& {{Y_{{l,p}}}}& {{Z_{{l,p}}}} \cr } } \right]^T} \). The geometric interpretation becomes a \( {\mathbf{Ax}} = {\mathbf{0}} \) problem, where \( {\mathbf{A}} \) is a \( N \times 12 \) matrix, and \( {\mathbf{x}} \) is a 12-vector to be solved. \( {\mathbf{A}} \) and \( {\mathbf{x}} \) are given in Eq. 6.

$$ \eqalign{ & {\bf{A}} = [ {{r}_{31}} {\bf{P}} _{l,p} \, {r}_{31} {\bf{E}} \, {r}_{32}{\mathop{\bf P}\nolimits} _{l,p}\,\, r_{32} {\mathop{\bf E}\, r_{33}{\mathop{\bf E}\nolimits} _{l,p}\,r_{33} {\mathop{\bf E}\nolimits} }] \cr & {\mathop{\bf x}\nolimits} = [ {{\mathop{\bf m}\nolimits} _1 \,\Delta _x \,{\mathop{\bf m}\nolimits} _2 \Delta _y\, {\mathop{\bf m}\nolimits} _3\, \Delta _z } ]^T} $$
(6)

where \({\mathop{\bf P}\nolimits} _{l,p} = \left[ {{\mathop{\bf P}\nolimits} _{l,1} \,\,\,{\mathop{\bf P}\nolimits} _{l,2} \,\, \ldots \,\,{\mathop{\bf P}\nolimits} _{l,p} } \right]^T \), \( {\mathop{\bf E}\nolimits} = \left[ {1\,\,\,1\,\, \cdots \,\,1} \right]^T \) is a \((p \times 1)\) vector, and \({\mathop{\bf m}\nolimits} _i = \left[ {m_{i1} \,\,\,m_{i2} \,\,\,m_{i3} } \right]\), \(i = 1,2,3\).

By getting the LIDAR points \( {{\mathbf{P}}_{{l,1}}} \), \( {{\mathbf{P}}_{{l,2}}} \), …, \( {{\mathbf{P}}_{{l,p}}} \), \( {\mathbf{x}} \) is estimated using the least square method. In order to avoid the solution \( {\mathbf{x}} = {\mathbf{0}} \), normalization constraints are proposed. Faugeras and Toscani suggested the constraint \( m_{{31}}^2 + m_{{32}}^2 + m_{{33}}^2 = 1 \), which is singularity free [32]. This restriction is proposed from the coincidence that \( \left[ {\matrix{ {m_{{31}}}& {m_{{32}}}& {m_{{33}}} \cr } } \right] \) is the third row of the rotation matrix \( {\mathbf{R}}_l^c \). Thus solving the equation \( {\mathbf{Ax}} = {\mathbf{0}} \) is transformed into minimizing the norm of \( {\mathbf{Ax}} \), i.e., minimizing \( \left| {{\mathbf{Ax}}} \right| \) with the restriction \( m_{{31}}^2 + m_{{32}}^2 + m_{{33}}^2 = 1 \).

\( \left| {{\mathbf{Ax}}} \right| \) can be minimized using a Lagrange method [33]. Let \( {{\mathbf{m}}_3} = \left[ {{m_{{{31}}}}\;{m_{{{32}}}}\;{m_{{{33}}}}} \right] \) and \( {{\mathbf{m}}_9} \) be a vector containing the remaining nine elements in \( {\mathbf{x}} \). The Lagrange equation is written as:

$$ L = {{\mathbf{A}}_9} \cdot {{\mathbf{m}}_9} + {{\mathbf{A}}_3} \cdot {{\mathbf{m}}_3} + \lambda \left( {{\mathbf{m}}_3^T{\mathbf{m}}_3 - 1} \right) $$
(7)

where \( {{\mathbf{A}}_3} \) contains the 9th to 11th columns of \( {\mathbf{A}} \), and \( {{\mathbf{A}}_9} \) contains the remaining nine columns corresponding to \( {{\mathbf{m}}_9} \).

The closed-form linear solution is:

$$ \eqalign{ & \lambda {{\bf{m}}_3} = \left({{\bf{A}}_{{3}}^T{{\bf{A}}_{{3}}} - {\bf{A}}_{{3}}^T{{\bf{A}}_9}{{\left({{\bf{A}}_{{9}}^T{{\bf{A}}_9}} \right)}^{{ -1}}}{\bf{A}}_{{9}}^T{{\bf{A}}_{3}}} \right){{\bf{m}}_3} \cr & {{\bf{m}}_9} = - {\left({{\bf{A}}_{{9}}^T{{\bf{A}}_9}} \right)^{{ - 1}}}{\bf{A}}_{{9}}^T{{\bf{A}}_{{3}}}{{\bf{m}}_3}} $$
(8)

It is well known that \( {{\mathbf{m}}_3} \) is the eigenvector of the symmetric positive definite matrix \( {\mathbf{A}}_{{3}}^T{{\mathbf{A}}_{{3}}} - {\mathbf{A}}_{{3}}^T{{\mathbf{A}}_9}{\left( {{\mathbf{A}}_{{9}}^T{{\mathbf{A}}_9}} \right)^{{ - 1}}}{\mathbf{A}}_{{9}}^T{{\mathbf{A}}_{{3}}} \) associated with the smallest eigenvalue. \( {{\mathbf{m}}_9} \) is obtained after \( {{\mathbf{m}}_3} \). Once \( {{\mathbf{m}}_3} \) and \( {{\mathbf{m}}_9} \) are known, the rotation and translation matrix \( \left( {{\mathbf{R}}_l^c,{\mathbf{t}}_l^c} \right) \) is available.

Because of data noise, the rotation matrix \( {\mathbf{R}}_l^c \) may not in general satisfy \( {\left( {{\mathbf{R}}_l^c} \right)^T}{\mathbf{R}}_l^c = {\mathbf{I}} \). One solution is to obtain \( \hat{\mathbf{ {R}}}_l^c \), which is the best approximation of given \( {\mathbf{R}}_l^c \). This \( \hat{\mathbf{{R}}}_l^c \) has the smallest Frobenius norm of the difference \( {{\hat {\mathbf R}}}_l^c - {\mathbf{R}}_l^c \), subject to \( {({{\lower-0.7em\hbox{$\smash{\scriptscriptstyle\frown}$}}{R}} _l^c )^T}\hat{\mathbf{{R}}}_l^c = {\mathbf{I}} \) [30].

Maximum Likelihood Estimation

The closed-form solution is obtained by minimizing an algebraic distance \( \left| {{\mathbf{Ax}}} \right| \), which is not physically meaningful. In this subsection, the problem is refined through maximum likelihood function using multi-pose checkerboard planes, which is more meaningful.

In the proposed camera calibration approach, differences of image points and the corresponding projection of the ground truth point in an image are minimized [30]. This method is also valid for visible-beam LIDAR calibration [12]. In the test, the Euclidean distances from camera to the checkerboard are checked. Note that Eq. 4 can be written as:

$$ {\mathbf{r}}_3^T\left( {{\mathbf{R}}_l^c{{\mathbf{P}}_l}{\mathbf{ + t}}_l^c} \right) = {\mathbf{r}}_3^T{\mathbf{t}} $$
(9)

where both \( {\mathbf{R}}_l^c{{\mathbf{P}}_l}{\mathbf{ + t}}_l^c \) and \( {\mathbf{t}} \) are points on the calibration plane surface, and \( {\mathbf{r}}_3 \) is the normal vector to this surface. Therefore, both the left and right sides of Eq. 9 are the distance between the checkerboard plane and the origin of the camera reference system.

Suppose there are totally n poses of the calibration plane. For the i-th pose, there is a set of \( \left( {{\mathbf{r}}_3,{\mathbf{t}}} \right) \) denoted as \( \left( {{\mathbf{r}}_3^i,{{\mathbf{t}}^i}} \right) \). LIDAR points are assumed to be corrupted by Gaussian distributed noise. The maximum likelihood function is defined by minimizing sum of the difference between \( {\mathbf{r}}_3^T\left( {{\mathbf{R}}_l^c{{\mathbf{P}}_l}{\mathbf{ + t}}_l^c} \right) \) and \( {\mathbf{r}}_3^T{\mathbf{t}} \) for all the LIDAR points. Suppose for the i-th plane, there are \( {p_i} \) LIDAR points. The solution satisfies:

$$ \arg\, \mathop{{\min }}\limits_{{{\mathbf{R}}_l^c,{\mathbf{t}}_l^c}} \sum\limits_{{i = 1}}^n {\frac{1}{{{p_i}}}\sum\limits_{{j = 1}}^{{{p_i}}} {{{\left( {{{\left( {{\mathbf{r}}_3^i} \right)}^T}\left( {{\mathbf{R}}_l^c{\mathbf{P}}_{{l,j}}^i{\mathbf{ + t}}_l^c} \right) - {{\left( {{\mathbf{r}}_3^i} \right)}^T}{{\mathbf{t}}^i}} \right)}^2}} } $$
(10)

where \( {\mathbf{R}}_l^{{c,i}}{\mathbf{P}}_{{l,j}}^i{\mathbf{ + t}}_l^{{c,i}} \) is the coordinate of \( {\mathbf{P}}_{{l,j}}^i \) in the camera reference system, according to Eq. 2.

By using Rodriguez formula [32], the rotation matrix \( {\mathbf{R}}_l^c \) is transformed into a vector, which is parallel to the rotation axis and whose magnitude is equal to the rotation angle. Thus \( \left( {{\mathbf{R}}_l^c,{\mathbf{t}}_l^c} \right) \) forms a vector. Equation 10 is solved using the Levenberg–Marquardt algorithm (LMA) [34, 35], which provides numerical solutions to the problem of minimizing nonlinear functions. LMA requires an initial guess for the parameters to be estimated. In this algorithm, \( \left( {{\mathbf{R}}_l^c,{\mathbf{t}}_l^c} \right) \) in the closed form is used as this initial state. For each pose, a set of \( \left( {{\mathbf{R}}_l^c,{\mathbf{t}}_l^c} \right) \) is obtained. The weighted average is used as an initial guess, where the scalar weight is normalized as a relative contribution of each checkerboard pose. Then LMA gives a robust solution even if the initial state starts far off the final solution.

Summary of Calibration Procedure

The calibration procedure proposed in this approach can be summarized as:

  1. 1.

    Place the checkerboard in view of the camera and LIDAR systems. Make sure that the plane is within the detection zone of both sensors. The different poses of checkerboard cannot be parallel to each other, otherwise the parallel poses do not provide enough constraints on \( {\mathbf{R}}_l^c \).

  2. 2.

    Take a few measurements (images) of the checkerboard under different orientations. For each orientation, read the LIDAR points on this plane from the output.

  3. 3.

    Estimate the coefficients using the closed-form solution given in section “Closed-Form Solution”.

  4. 4.

    Refine all the coefficients using the maximum likelihood estimation in section “Maximum Likelihood Estimation”.

Experimental Results

The proposed vision-LIDAR calibration algorithm has been tested on both a computer simulation platform and with real-world data.

Computer Simulations

The camera is assumed to have been calibrated. It is simulated to have the following properties: \( \alpha = 1,200 \), \( \beta = 1,000 \), and the skewness coefficient \( \gamma = 0 \). The principal point is (320, 240), and the image resolution is \( 640 \times 480 \). The calibration checkerboard consists of \( 1{0} \times 1{0} \) grids. The size of each square grid is \( {\text{5cm}} \times {\text{5cm}} \). The position and orientation of the LIDAR relative to the camera have also been defined. The LIDAR’s position in camera coordinates is \( {\mathbf{t}}_l^c = {\left[ {\matrix{ {{1}0}& {150}& {100} \cr } } \right]^T} \) centimeters, and the rotation matrix \( {\mathbf{R}}_l^c \) is parameterized by a three-vector rotation vector \( {\left[ {\matrix{ { - {8}{{5}^{\text{o}}}}& {{1}{{0}^{\text{o}}}}& { - {8}{{0}^{\text{o}}}} \cr } } \right]^T} \).

The LIDAR points are calculated based on the location of the camera, and relative position as well as orientation of the checkerboard. Gaussian noise is added to the points.

Performance with respect to Gaussian Noise Level

A checkerboard plane is placed in front of the camera and the LIDAR. Three poses are used here. All of them have \( {\mathbf{t}} = {\left[ {\matrix{ { - 20}& { - 20}& { - 550} \cr } } \right]^T} \). Three rotation matrices are defined by the rotation vectors as \( {{\mathbf{r}}_1} = {\left[ {\matrix{ {{{170}^{\text{o}}}}& { - {5^{\text{o}}}}& {{{85}^{\text{o}}}} \cr } } \right]^T} \), \( {{\mathbf{r}}_2} = {\left[ {\matrix{ {{{170}^{\text{o}}}}& {{{15}^{\text{o}}}}& {{{85}^{\text{o}}}} \cr } } \right]^T} \), \( {{\mathbf{r}}_3} = {\left[ {\matrix{ {{{170}^{\text{o}}}}& { - {{25}^{\text{o}}}}& {{{85}^{\text{o}}}} \cr } } \right]^T} \), respectively. Gaussian noise with zero mean and standard deviation (from 1 to 10 cm) is added to the LIDAR points. The estimation results are then compared with ground truth. For each noise level, 100 independent random trials are delivered. The averaged calibration error is shown in Fig. 6, where the calculation results are denoted as \( {\hat{{\mathbf R}}}_l^c \) and \( {\hat{{\mathbf t}}}_l^c \), respectively. This figure illustrates that the calibration error increases with noise level, as expected. For \( \sigma < 7 \)cm (which is larger than the normal standard deviation for most LIDAR sensors), the error of norm of \( {\mathbf{R}}_l^c \) is less than 0.01. With three checkerboard poses, the relative translation error is less than 5% when \( \sigma < 5 \)cm.

Vehicle Detection, Tightly Coupled LIDAR and Computer Vision Integration for. Figure 6
figure 2024

Rotation and translation error with respect to the noise level. (a) Rotation error with respect to noise level. (b) Translation error with respect to noise level

Performance with respect to the Number of Checkerboard Positions

The checkerboard was originally setup as parallel to the image plane. Then it is rotated by, where the rotation axis is randomly selected in a uniform sphere. The number of checkerboards used for calibration varies from 4 to 20. Gaussian noise with zero mean and standard deviation \( \sigma = 4 \)cm is added to the LIDAR points. For each position, 100 trials of independent rotation axes are implemented. The averaged result is illustrated in Fig. 7. This figure shows that when the number of checkerboard positions increases, the calibration error decreases.

Vehicle Detection, Tightly Coupled LIDAR and Computer Vision Integration for. Figure 7
figure 2025

Rotation and translation error with respect to number of checkerboard positions. (a) Rotation error with respect to number of checkerboard. (b) Translation error with respect to number of checkerboard

Performance with respect to the Orientation of Checkerboard

The checkerboard plane is initially set as parallel to the image plane. It is then rotated around a randomly chosen axis with angle \( \theta \). The rotation axis is randomly selected from a uniform sphere. The rotation angle \( \theta \) varies from \( {1}{{0}^{\text{o}}} \) to \( {8}{{0}^{\text{o}}} \), and 10 checkerboards are used for each \( \theta \). Gaussian noise with zero mean and standard deviation \( \sigma = 4 \)cm is added to the LIDAR points. For each rotation angle, 100 trials are repeated and the average error is calculated. The simulation result is shown in Fig. 8. The calibration error decreases when the rotation angle increases. When the rotation angle is too small, the calibration planes are almost parallel to each other, which cause error. When the rotation angle is too large, the calibration plane is almost perpendicular to the image plane, which makes the LIDAR measurement less precise.

Vehicle Detection, Tightly Coupled LIDAR and Computer Vision Integration for. Figure 8
figure 2026

Rotation and translation error with respect to the orientation of the checkerboard plane. (a) Rotation error with respect to orientation of the checkerboard plane. (b) Translation error with respect to orientation of the checkerboard plane

Real Data Calibration

The calibration method has been tested using an IBEO ALASCA XT LIDAR system and a Sony CCD digital camera with a 6 mm lens. The image resolution is \( 640 \times 480 \). The checkerboard plane consists of a pattern of \( {16} \times {16} \) squares, so there are totally 256 grids on the plane. The size of each grid is \( 2.54{\text{cm}} \times 2.54{\text{cm}} \) (\( 1\; \times 1\;{\text{in}}{.} \)).

Twenty images of the plane were taken with different orientations, and the LIDAR points are recorded simultaneously. Two examples of the calibration results are shown in Fig. 9, where the LIDAR points are mapped to image reference system using estimated \( {\mathbf{R}}_l^c \) and \( {\mathbf{t}}_l^c \). Although the ground truth of \( {\mathbf{R}}_l^c \) and \( {\mathbf{t}}_l^c \) are not known, Fig. 9 shows that the estimation results are pretty reasonable.

Vehicle Detection, Tightly Coupled LIDAR and Computer Vision Integration for. Figure 9
figure 2027

Two checkerboard positions. The LIDAR points are indicated by blue dots. The calibration method proposed in this entry is used to estimate \( {\mathbf{R}}_l^c \) and \({\mathbf{t}}_l^c \)

Application in Vehicle Detection

The calibration method has been integrated into a mobile sensing system. This mobile sensing system is designed to detect and track surrounding vehicles, which is the first and fundamental step for any of the automatic traffic surveillance systems. However, object detection is a big challenge for the moving platform. Both the foreground and the background are rapidly changing, which makes it difficult to extract the foreground regions from the background. The sensor fusion technique is used to compensate for the spatial motion of the moving platform. Figure 10 gives two images from the vehicle detection video.

Vehicle Detection, Tightly Coupled LIDAR and Computer Vision Integration for. Figure 10
figure 2028

Two sensor fusion image frames. The red rectangle is an enlarged image of the detected area

In Fig. 10a, there are totally four vehicles detected by the LIDAR, where the farthest vehicle is 55 m away from the mobile sensing system. It is hard to obtain the vehicle’s distance and orientation from an image alone. The LIDAR points provide a reliable estimation of this vehicle’s position. In Fig. 10b, one car parallel to the probe vehicle is detected by the LIDAR. Meanwhile, it is partially visible in the image, together with its shadow on the ground. Although this vehicle is hardly recognizable in the image, with a wide angle of view, LIDAR data provide enough information to reconstruct the location of this vehicle.

The experiment results illustrate that the calibration algorithm provides good results. The sensor fusion system combining LIDAR and computer vision information sources presents distance and orientation information. This system is helpful for vehicle detection and tracking applications.

Summary

In this section, a novel calibration algorithm was developed to obtain the geometry transformation between a multi-plane LIDAR system and a camera vision system. This calibration method requires LIDAR and camera to observe a checkerboard simultaneously. A few checkerboard poses are observed and recorded. The calibration approach has two stages: closed-form solution followed by a maximum likelihood criterion–based optimization. Both simulation and real-world experiments have been carried out. The experiment results show that the calibration approach is reliable. This approach will be used in the vehicle detection and tracking system.

Tightly Coupled LIDAR and Computer Vision Integrated System for Vehicle Detection

Computer vision sensors are generally used in current mobile platform–based object detection and tracking systems. However, the application of vision sensors is far from sufficient: clustering, illumination, occlusion, among many other factors, affect the overall performance. In contract, a LIDAR sensor provides range and azimuth measurements from the sensor to the targets. However, the accuracy of its measurements depends on the reflectivity of the targets and the weather. The fusion of camera and active sensors such as LIDAR is being investigated in the context of on-board vehicle detection and tracking.

In this section, a tightly coupled LIDAR/CV system is proposed, in which the LIDAR scanning points are used for hypothesizing regions of interest and for providing error correction to the classifier, while the vision image provides object classification information. LIDAR object points are first transformed into image space. ROIs are generated using the LIDAR feature detection method. An Adaboost classifier based on computer vision systems [36] is then used to detect vehicles in the image space. Dimensions and distance information of the detected vehicles are calculated in body-frame coordinates. This approach provides a more complete and accurate map of surrounding vehicles in comparison to the single sensors used separately. One of the key features of this technique is that it uses LIDAR data to correct the Adaboost classification pattern. Moreover, the Adaboost algorithm is utilized both for vehicle detection and for vehicle distance and dimension extraction. Then the classification results provide compensatory information to the LIDAR measurements.

This section is organized as follows: in section “Overview of the Vehicle Detection System” a brief introduction of the vehicle detection system is given. Section “Vision-Based System” describes the vision-based system. The vehicle detection algorithm is introduced in section “Moving Vehicle Detection System”. A vehicle tracking approach using particle filter is proposed in section “Vehicle Tracking System”. Experimental results of vehicle detection are provided in section “Experiment Results”, followed by conclusions and future work in section “Summary and Discussion”.

Overview of the Vehicle Detection System

Vehicle detection is the first and fundamental step for any driver assistant system (DAS) . With sensors mounted on a moving platform, the detected data change rapidly, making it difficult to extract objects of interest. In the proposed approach, spatial motion of the moving platform has been compensated by using the sensor fusion approach.

Multi-module Architecture

The input of vehicle detection system consists of two LIDAR sensors and a single camera, as is shown in Fig. 1. The detection area is covered by two LIDAR sensors overlapping with each other. The camera is placed behind the rearview mirror. The field of view of this camera is fully covered by the LIDAR ranging space.

Figure 11 presents the flowchart of this vehicle detection system. It consists of four subsystems: a LIDAR-based subsystem, a coordinate transformation subsystem, a vision-based subsystem, and a vehicle detection subsystem.

Vehicle Detection, Tightly Coupled LIDAR and Computer Vision Integration for. Figure 11
figure 2029

Flowchart of the mobile sensing system

LIDAR-Based Subsystem

The LIDAR Data Acquisition Module uses IBEO External Control Unit (ECU) to communicate, collect, and combine data from a pair of LIDAR sensors.

The Prefiltering and Clustering Module aims to transform scan data from distances and azimuths to positions, and cluster the incoming data into segments using a Point-Distance-Based Methodology (PDBM) [12]. If there exists any segment consisting of less than three points, and the distance of this segment is greater than the given threshold, these points are considered as noise. The segment will be disregarded.

The Feature Extraction Module extracts primary features in the cluster. The main feature information in one segment is its geometrical representation, such as lines and circles. One of the advantages of geometrical features is that they occupy far less space than storing all the scanned points.

A vehicle may have any possible orientation. The contour of a vehicle is constructed by four sides: front, back, left side, and right side. The LIDAR sensor can capture one side, or two neighboring sides, as is shown in Fig. 12.

Vehicle Detection, Tightly Coupled LIDAR and Computer Vision Integration for. Figure 12
figure 2030

Orientation of the vehicle significantly changes its appearance in the scan data frame. The rectangles show position of the LIDAR

When the object is close to the probe vehicle, the extracted feature provides enough information for object classification. However, if the target is far away, it may be represented by only one or two scanning points. For those objects with only a few LIDAR points, it is difficult to get reliable size, location, and orientation information from the scan data alone. Note that the computer vision image also contains size and orientation information, which can be extracted by the object classification technique. By employing the sensor fusion technology, LIDAR scan data and Adaboost output are complementary to each other.

The ROI Generation Module calculates positions of ROI bounding boxes in LIDAR coordinates. It is worth mentioning that the ROI is not defined by LIDAR points alone, since the scan points of one target may not be able to represent its full dimension. In this algorithm, the width and length of ROI are defined by both LIDAR data points and the maximum dimension of a potential vehicle.

Each ROI is defined as a rectangular area in the image. The bottom of the rectangle is the ground. The top of the rectangle is set to be the maximum height of a car. The left and right edges are obtained from the furthest left and right scanning points in a cluster, as well as the typical width of a car.

The LIDAR to Image Coordinate Transformation Module transforms all the LIDAR points into the image frame. The relative position and orientation between LIDAR and camera sensors should be obtained for the transformation. A unique multi-planar LIDAR and computer vision calibration algorithm is described in section “A Novel Multi-Planar LIDAR and Computer Vision Calibration Procedure Using 2D Patterns”, which calculates the geometry transformation matrix between “multiple invisible beams” of LIDAR sensors and the camera. The calibration results are used in the sensor fusion system.

During the road test, each LIDAR scan point \( {{\mathbf{P}}_l} \) is transformed into camera coordinate as \( {{\mathbf{P}}_c} \). \( {{\mathbf{P}}_c} \) is then transformed into point \( {{\mathbf{p}}_c} \) in the image plane. Figure 13 illustrates LIDAR scan points and the transformation results in the image reference system.

Vehicle Detection, Tightly Coupled LIDAR and Computer Vision Integration for. Figure 13
figure 2031

The LIDAR scan points. (a) Points in the LIDAR coordinate system. Each green dot is one scan point. (b) These points are transformed to the image frame. Each blue star in the image is one LIDAR point. Vehicles in the red circles are the enlarged image of the detected area

After the LIDAR to image transformation coefficients are calculated, ROIs generated in the LIDAR-based subsystem is converted into image frame. A larger ROI is generated due to inaccuracy of the transformation from LIDAR data to image data.

The Image to LIDAR Coordinate Transformation Module is called to correct Adaboost classification result. More details are given in the following sections.

Vision-Based System

Object classification from the hypothesized ROIs is required for vehicle detection purpose. Feature representations are used for object classifiers by an Adaboost algorithm [36]. Viola et al. proposed that the object is detected based on a boosted cascade of feature classifiers, which performs feature extraction and combines features such as simple weak classifiers to a strong one.

The Adaboost classifier requires off-line training using target as well as nontarget images. In the vehicle detection applications, the target images, i.e., the rearview of vehicles, are called positive samples; while the non-vehicles are named as negative samples. Figure 14 illustrates some of the positive as well as negative samples in the training dataset. Training samples are taken from both Caltech vehicle image dataset [37] and video collected by the test vehicle. The positive samples include passenger cars, vans, trucks, and trailers. The negative sample sets include roads, traffic signs, buildings, plants, and pedestrians.

Vehicle Detection, Tightly Coupled LIDAR and Computer Vision Integration for. Figure 14
figure 2032

Positive and negative samples used in Adaboost training

Image Training Preprocessing

Both the positive and negative samples are used by computer vision system for data training. All the samples are originally colored images. In order to remove the effects of various illumination conditions and camera differences, gray-level transformation is required as a preprocessing step. The gray-level normalization method is applied to the whole image dataset, which transforms gray level of the image to be in [0, 1] domain. The color image is transformed by [38]:

$$ {I^{ * }}\left( {x,y} \right) = \frac{{I\left( {x,y} \right) - {I_{{\min }}}}}{{{I_{{\max }}} - {I_{{\min }}}}} $$
(11)

where \( I\left( {x,y} \right) \) is the intensity of pixel \( \left( {x,y} \right) \), \( {I_{{\min }}} \) and \( {I_{{\max }}} \) are the minimum and maximum values in this image, respectively. \( {I^{ * }}\left( {x,y} \right) \) is the normalized gray-level value.

The next step is to normalize the sizes of all the positive samples. It is implemented before training since different resolutions may cause different number of features to be counted. The sizes of the normalized positive samples determine the minimum size of objects that can be detected [39]. In this test, the normalized image size is set as \( 25 \times 25 \) pixels.

Haar Training

In the vision-based system, Haar-like features are used to classify objects [36]. This approach combines more complex classifiers in a “cascade” which quickly discard the background regions while spending more computation on the Haar-like area [36].

More specifically, 14 feature prototypes are utilized for the Haar training [40]. These features represent characteristic properties like edge, line, and symmetry. The features prototypes can be grouped into three categories:

  • Edge features: The difference between sums of the pixels within two rectangular regions

  • Line features: The sum within two outside rectangular regions subtracted from the sum in the center rectangular

  • Center-surround features: The difference between the sums of the center rectangular and the outside area

Here black areas with “–1” have negative weights, and white areas with “1” have positive weights.

After the weak classifier has been trained at each stage, the classifier is able to detect almost all the targets of interest while rejecting certain nontarget objects. A cascade of classifiers is generated to form a decision tree. The training process consists of totally 15 stages. Each stage is trained to eliminate 60% of the non-vehicle patterns, and the hit rate (HR) in each stage is set to be 0.998. Therefore, the total false alarm rate (FAR) for this cascade classifier is supposed to be \( {0.4^{{15}}} \approx 1.07e - 06 \), and the hit rate should be around \( {0.998^{{15}}} \approx 0.97 \).

Moving Vehicle Detection System

The Adaboost algorithm designs a strong classifier that can detect multiple objects in the given image. However, there is no guarantee that this strong classifier is optimal for the object detection. In contrast to the classic Adaboost algorithm, in this test there is only one vehicle in each ROI defined by the LIDAR clustering algorithm. Therefore, it is not necessary for the Adaboost algorithm to detect several possible targets in one ROI. A classification correction technique is proposed to utilize the LIDAR scanning data to reduce redundancy in the Adaboost detection results.

There are two kinds of redundancy errors in the classification results. Figure 15 gives some examples of these two cases. Both have detected more than one object, while the ground truth is that there is only one vehicle. One kind of error is that the Adaboost detects two possible targets, while the area of the smaller one is almost covered by the larger one, as is shown in Fig. 15a. Another error shown in Figure 15b is that all the detected areas belong to the same object, while none of them cover the full body of the target.

Vehicle Detection, Tightly Coupled LIDAR and Computer Vision Integration for. Figure 15
figure 2033

Two kinds of redundancy errors in Adaboost detection. (a) Two bounding boxes overlap on the same target. (b) Two bounding boxes detected with no overlap on the same target

Suppose that in the i-th ROI there exists a LIDAR point cluster \( {\Re_i} \) with the following features in the LIDAR coordinate system: \( c_i^{{LIDAR}} \), the center of the cluster; \( w_i^{{LIDAR}} \), the width of the object; and \( l_i^{{LIDAR}} \), the possible length of the vehicle. On the image side, the detected target candidates are denoted \( {d_1} \), \( {d_2} \),…, \( {d_n} \). Initially, \( {d_1} \), \( {d_2} \),…, \( {d_n} \) are transformed from the image coordinate frame to the camera coordinate frame, and then to the LIDAR coordinate frame.

The transformed candidates in the LIDAR coordinate frame are denoted \( {D_1} \), \( {D_2} \),…, \( {D_n} \). The j-th candidate \( {D_j} \) has center \( c_{{i,j}}^{{DETECT}} \), width \( w_{{i,j}}^{{DETECT}} \), and height \( h_{{i,j}}^{{DETECT}} \). In the LIDAR coordinate frame, two vectors are defined as \( {\mathbf{m}}_i^{{LIDAR}} = \left( {c_i^{{LIDAR}},w_i^{{LIDAR}}} \right) \) and \( {\mathbf{m}}_i^{{DETECT}} = \left( {c_{{i,1}}^{{DETECT}},w_{{i,1}}^{{DETECT}}, \cdots, c_{{i,n}}^{{DETECT}},w_{{i,n}}^{{DETECT}}} \right) \). Here \( {\mathbf{m}}_i^{{LIDAR}} \) is the size information obtained by the LIDAR, and \( {\mathbf{m}}_i^{{DETECT}} \) consists of measurements in the LIDAR coordinate system that were transformed from the image reference frame. The coefficient vector \( {{\mathbf{w}}_i} \) for mapping the multiple target areas to the LIDAR information satisfies:

$$ {{\mathbf{w}}_i} = \arg\ \min \left\| {{\mathbf{m}}_i^{{LIDAR}} - {\sum\nolimits_{{j = 1:n}}}{{\mathbf{w}}_{{i,j}}}{\mathbf{m}}_{{i,j}}^{{DETECT}}} \right\| $$
(12)

where \( \left\| {{ }.{ }} \right\| \) is the Euclidean norm. \( {{\mathbf{w}}_i} \) is used as a weight to recalculate the detected area, which is a combination of all the detected objects in one ROI.

LIDAR scanning points and Adaboost classification results are then combined to generate a complete map of vehicles. A summary of the vehicle detection process is given here:

Algorithm 1: Vehicle Detection

Given the LIDAR point cluster \( {\Re_i} \) with features \( \left( {c_i^{{LIDAR}},\;w_i^{{LIDAR}},\;l_i^{{LIDAR}}} \right) \), and detected target candidates \( {d_1} \), …, \( {d_n} \) in the image;

if no object detected in the ROI

 enlarge ROI and search again

else

 define \( {\mathbf{m}}_i^{{LIDAR}} = \left( {c_i^{{LIDAR}},w_i^{{LIDAR}}} \right) \)

 transform \( {d_1} \), …, \( {d_n} \) in the image coordinate frame to LIDAR reference frame

 define \( {\mathbf{m}}_i^{{DETECT}} = \left( {c_{{i,1}}^{{DETECT}},w_{{i,1}}^{{DETECT}}, \cdots, c_{{i,n}}^{{DETECT}},w_{{i,n}}^{{DETECT}}} \right) \)

 calculate the weight vector \( {{\mathbf{w}}_i} \) that minimizes \( \left\| {{\mathbf{m}}_i^{{LIDAR}} - \sum\nolimits_{{j = 1:n}} {{\mathbf{w}}_{{i,j}}}{\mathbf{m}}_{{i,j}}^{{DETECT}}} \right\| \)

end if

The detected vehicle is located at \( {\sum_{{j = 1:n}}}{{\mathbf{w}}_{{i,j}}}c_{{i,j}}^{{DETECT}} \).
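A minimal Python sketch of this correction step is given below. It assumes the candidates have already been transformed into the LIDAR coordinate frame and, for simplicity, represents each candidate by a scalar center and width; the function and variable names are illustrative, and the weights of Eq. 12 are obtained with an ordinary least-squares solver.

import numpy as np

def correct_detection(m_lidar, m_detect_list):
    # m_lidar: (center, width) of the LIDAR cluster, as a length-2 vector.
    # m_detect_list: list of (center_j, width_j) vectors for the n candidates,
    # already transformed into the LIDAR coordinate frame.
    A = np.stack([np.asarray(m, dtype=float) for m in m_detect_list], axis=1)
    b = np.asarray(m_lidar, dtype=float)
    # Weights minimizing || m_lidar - sum_j w_j * m_detect_j ||  (Eq. 12).
    w, *_ = np.linalg.lstsq(A, b, rcond=None)
    # Corrected position: weighted combination of the candidate centers.
    centers = np.array([m[0] for m in m_detect_list], dtype=float)
    return w, float(np.dot(w, centers))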

The vehicle detection system proposed in this section can be summarized as follows:

  • The Adaboost classifier training is implemented off-line with both positive and negative samples.

  • ROI is defined by the LIDAR scan data. No more than one vehicle is assumed to exist in each ROI.

  • The Adaboost classifier makes a preliminary vehicle detection in each ROI.

  • LIDAR data are used to correct Adaboost redundancy errors and to merge detected areas within one ROI.

  • The Adaboost detected area (in LIDAR coordinates) and the LIDAR output are combined to generate a complete vehicle distance and dimension map.

Vehicle Tracking System

LIDAR and computer vision sensors are integrated in a probabilistic manner for vehicle tracking. A sampling importance resampling (SIR) particle filter, a sequential Monte Carlo estimation technique, is used as the tracker [41]. Unlike the commonly used Kalman filter and extended Kalman filter (EKF), the particle filter does not assume a linear dynamic system perturbed by Gaussian noise. The key idea of the particle filter is to represent the estimate by a set of random samples (called particles) with associated weights.

Particle Filter

Let \( {\mathbf{x}}_t^{{(i)}} \) be the i-th sample of the position and velocity of the target at time \( t \), \( i = 1,2, \cdots, {N_s} \), where \( {N_s} \) is the total number of samples, or particles, in the particle filter. \( w_t^{{(i)}} \) is the weight at time \( t \) associated with \( {\mathbf{x}}_t^{{(i)}} \). The procedure of the SIR particle filter is defined as follows (a minimal code sketch is given after the list):

  1. Initial Particle Generation

    Generate \( {N_s} \) particles \( \left\{ {{\mathbf{x}}_0^{{(i)}},\;w_0^{{(i)}}} \right\} \), \( i = 1,2, \cdots, {N_s} \). Here \( {\mathbf{x}}_0^{{(i)}} \) is obtained from vehicle detection results and \( w_0^{{(i)}} = {{1} \left/ {{{N_s}}} \right.} \).

  2. Particle Updating

    For each particle \( {\mathbf{x}}_{{t - 1}}^{{(i)}} \) at time \( t - 1 \), generate a particle \( {\mathbf{x}}_t^{{(i)}} \) at time \( t \). This step corresponds to the prediction step of the Kalman filter and EKF. However, in the Kalman filter and EKF the state is updated only once at time \( t \), whereas in the particle filter each particle must be updated, so \( {N_s} \) particle updates are performed in total. In this system, \( {\mathbf{x}}_t^{{(i)}} = \left[ {\matrix{ {{\mathbf{p}}_t^{{(i)}}}& {{\mathbf{v}}_t^{{(i)}}} \cr } } \right] \) is a sample of the position and velocity, so a linear model is used to update the particles:

    $$ {\mathbf{p}}_t^{{(i)}} = {\mathbf{p}}_{{t - 1}}^{{(i)}} + {\mathbf{v}}_{{t - 1}}^{{(i)}}T + n $$
    (13)

    where \( T \) is the time interval and \( n \) is the process noise.

  3. Particle Weighting

    Each particle \( {\mathbf{x}}_t^{{(i)}} \) at time \( t \) is associated with the weight \( w_t^{{(i)}} \), also called the importance factor. The weight at time \( t \) is a function of the weight at time \( t - 1 \) and the probability of the measurements given the state at time \( t \). The importance factor is commonly calculated as [41]:

    $$ w_t^{{(i)}} = w_{{t - 1}}^{{(i)}}p\left( {{{\mathbf{z}}_t}\left| {{\mathbf{x}}_t^{{(i)}}} \right.} \right) $$
    (14)

    where \( {{\mathbf{z}}_t} \) is the measurement at time \( t \). In the proposed sensor fusion system, both LIDAR and computer vision sensors are utilized for vehicle tracking, so the measurement is \( {{\mathbf{z}}_t} = \left\{ {\matrix{ {^l{{\mathbf{z}}_t};}& {^c{{\mathbf{z}}_t}} \cr } } \right\} \), where \( ^l{{\mathbf{z}}_t} \) is the output of the LIDAR and \( ^c{{\mathbf{z}}_t} \) is the measurement of the camera. This step corresponds to the update step of the Kalman filter and EKF. The probability \( p\left( {{{\mathbf{z}}_t}\left| {{\mathbf{x}}_t^{{(i)}}} \right.} \right) \) is discussed in the following subsection.

  4. Resampling

    A common problem with the particle filter is degeneracy, the phenomenon in which, after a few iterations, all but one particle have negligible weight [41]. The quantity \( \widehat{{{N_{{eff}}}}} \) has been defined to measure degeneracy and is calculated as [41]:

    $$ \widehat{{{N_{{eff}}}}} = \frac{1}{{\sum\nolimits_{{i = 1}}^{{{N_s}}} {{{\left( {w_t^{{(i)^\prime}}} \right)}^2}} }} $$
    (15)

    where \( w_t^{{(i)^\prime}} \) is the normalized weight of the \( i \)-th particle at time \( t \), with \( w_t^{{(i)^\prime}} = \frac{{w_t^{{(i)}}}}{{\sum\nolimits_{{i = 1}}^{{{N_s}}} {w_t^{{(i)}}} }} \). If \( \widehat{{{N_{{eff}}}}} \) is less than a given threshold \( {N_T} \), degeneracy is detected.

    Resampling is performed whenever degeneracy is detected. It eliminates particles that have small weights and replicates particles that have large weights; particles with large weights are considered “good” particles, while particles with small weights are “bad” ones. The resampled weights are set to \( w_t^{{(i)}} = {{1} \left/ {{{N_s}}} \right.} \).

    After obtaining \( w_t^{{(i)}} \), the posterior filtered density can be approximated as [41]:

    $$ p\left( {{{\mathbf{x}}_t}\left| {{\mathbf{z}}_{{1:t}}} \right.} \right) \approx \sum\nolimits_{{i = 1}}^{{{N_s}}} {w_t^{{(i)}}\delta \left( {{{\mathbf{x}}_t} - {\mathbf{x}}_t^{{(i)}}} \right)} $$
    (16)
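The following Python sketch illustrates one SIR iteration under the assumptions stated above: a constant-velocity model for Eq. 13, a user-supplied measurement likelihood for Eq. 14, and resampling triggered by the effective sample size of Eq. 15. Function and parameter names are illustrative only, not the original implementation.

import numpy as np

def sir_step(particles, weights, z, likelihood, T=0.1, noise_std=0.1, n_threshold=None):
    # particles: (N, 4) array of [px, py, vx, vy]; weights: (N,) array.
    n = len(particles)
    particles = particles.copy()
    # Prediction (Eq. 13): p_t = p_{t-1} + v_{t-1} * T + noise.
    particles[:, :2] += particles[:, 2:] * T + np.random.normal(0.0, noise_std, (n, 2))
    # Weighting (Eq. 14): w_t = w_{t-1} * p(z_t | x_t), then normalize.
    weights = weights * np.array([likelihood(z, x) for x in particles])
    weights = weights / weights.sum()
    # Degeneracy check (Eq. 15) and resampling.
    n_eff = 1.0 / np.sum(weights ** 2)
    if n_threshold is None:
        n_threshold = n / 2
    if n_eff < n_threshold:
        idx = np.random.choice(n, size=n, p=weights)
        particles, weights = particles[idx], np.full(n, 1.0 / n)
    return particles, weights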

The Sensor Model

The sensor model describes the process by which sensor measurements are made in the physical world. It relates sensor output to the state of the vehicle. In the vehicle detection application it is defined as the conditional probability \( p\left( {^l{{\mathbf{z}}_t};\;^c{{\mathbf{z}}_t} \left| {{\mathbf{x}}_t^{{(i)}}} \right.} \right) \), which is the probability of LIDAR and computer vision sensor measurements given the state of the vehicle.

A LIDAR sensor model is described in [42], which represents the probability as a mixture of four distributions corresponding to four types of measurement errors: the small measurement noise, errors due to unexpected objects or obstacles, errors due to failure to detect objects, and random unexplained noise.

Let \( z_t^{{(k)*}} \) denote the true distance to an obstacle, \( z_t^{{(k)}} \) denote the recorded measurement, and \( z_{{\max }} \) denote the maximum possible reading. The small measurement error is defined as a Gaussian distribution \( p_{\text{hit}} \) over the range \( \left[ {\matrix{ {0,}& {z_{{\max }}} \cr } } \right] \) with mean \( z_t^{{(k)*}} \) and standard deviation \( \sigma_{\text{hit}} \).

The LIDAR detection zone is often partially blocked by moving vehicles, which yields measurements shorter than the true distance. This type of error is modeled by a truncated exponential distribution \( p_{\text{short}} \) with coefficient \( \lambda_{\text{short}} \).

The LIDAR sometimes fails to detect obstacles because of the low reflectivity of the target. Errors due to failure to detect objects are modeled by a pseudo point-mass distribution \( p_{{\max }} \) centered at \( z_{{\max }} \).

Finally, unexplainable measurements may be returned by the LIDAR sensor, which is caused by interference. This type of error is modeled by a uniform distribution \( p_{\text{rand}} \) over the entire measurement range.

\( p\left( {^l{\mathbf{z}}_t\left| {{\mathbf{x}}_t^{{(i)}}} \right.} \right) \) is calculated as a mixture of the four error distributions [42]:

$$ p\left( {^l{\mathbf{z}}_t\left| {{\mathbf{x}}_t^{(i)}} \right.} \right) = {\alpha_{\text{hit}}}\,{p_{\text{hit}}}\left( {^l{\mathbf{z}}_t\left| {{\mathbf{x}}_t^{(i)}} \right.} \right) + {\alpha_{\text{short}}}\,{p_{\text{short}}}\left( {^l{\mathbf{z}}_t\left| {{\mathbf{x}}_t^{(i)}} \right.} \right) + {\alpha_{\max }}\,{p_{\max }}\left( {^l{\mathbf{z}}_t\left| {{\mathbf{x}}_t^{(i)}} \right.} \right) + {\alpha_{\text{rand}}}\,{p_{\text{rand}}}\left( {^l{\mathbf{z}}_t\left| {{\mathbf{x}}_t^{(i)}} \right.} \right) $$
(17)

where \( {\alpha_{\text{hit}}} \), \( {\alpha_{\text{short}}} \), \( {\alpha_{{{ \max }}}} \), and \( {\alpha_{\text{rand}}} \) are the weights of \( {p_{\text{hit}}} \), \( {p_{\text{short}}} \), \( {p_{{{ \max }}}} \), and \( {p_{\text{rand}}} \), respectively, with \( {\alpha_{\text{hit}}} + {\alpha_{\text{short}}} + {\alpha_{{{ \max }}}} + {\alpha_{\text{rand}}} = 1 \). The parameters in Eq. 17 are treated as a priori information and are obtained by data training.
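A simplified Python sketch of the four-component measurement model of Eq. 17 is shown below. Normalization constants of the truncated distributions are omitted for brevity, the point mass at the maximum range is approximated by an indicator, and all parameter names are illustrative; in practice the mixture weights and noise parameters come from data training.

import math

def lidar_likelihood(z, z_star, z_max, sigma_hit, lam_short,
                     a_hit, a_short, a_max, a_rand):
    # Small measurement noise: Gaussian centered at the true range.
    p_hit = (math.exp(-(z - z_star) ** 2 / (2 * sigma_hit ** 2))
             / (math.sqrt(2 * math.pi) * sigma_hit)) if 0 <= z <= z_max else 0.0
    # Unexpected objects: truncated exponential for short readings.
    p_short = lam_short * math.exp(-lam_short * z) if 0 <= z <= z_star else 0.0
    # Detection failure: pseudo point mass at the maximum range.
    p_max = 1.0 if z >= z_max else 0.0
    # Random unexplained noise: uniform over the measurement range.
    p_rand = 1.0 / z_max if 0 <= z < z_max else 0.0
    # Mixture of the four error types (Eq. 17).
    return a_hit * p_hit + a_short * p_short + a_max * p_max + a_rand * p_rand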

A camera weight model is proposed in [43] as:

$$ p\left( {^c{\mathbf{z}}_t\left| {{\mathbf{x}}_t^{(i)}} \right.} \right) = \begin{cases} S, & \text{the object is in the camera detection zone} \\ 1 - S, & \text{the object is out of the camera detection zone} \end{cases} $$
(18)

where \( S \) is a constant, \( 0 \leq S \leq 1 \). This model is based on the assumption that the camera is able to detect all the objects in its detection zone.

The sensor fusion probability model is calculated from the LIDAR probability model and the camera probability model. It is proposed in [43] that \( ^l{\mathbf{z}}_t \) and \( ^c{\mathbf{z}}_t \) are independent measurements, so that

$$ p\left( {^l{\mathbf{z}}_t;^c{\mathbf{z}}_t\left| {{\mathbf{x}}_t^{{(i)}}} \right.} \right) = p\left( {^l{\mathbf{z}}_t\left| {{\mathbf{x}}_t^{{(i)}}} \right.} \right)p\left( {^c{\mathbf{z}}_t\left| {{\mathbf{x}}_t^{{(i)}}} \right.} \right) $$
(19)

However, in the proposed sensor fusion system, LIDAR and the camera are not two independent sensors. They have been calibrated to observe the same target, and the geometric relationships are given as a priori information. Moreover, the classification result of the camera is corrected by the LIDAR output.

In this section, a novel sensor fusion probability model is proposed. As defined in [42], the LIDAR tracking process is modeled as a mixture of four types of errors: small measurement noise, unexpected-object detection errors, detection failure errors, and random unexplained noise. In the field test, small measurement noise was found to be the exclusive error source of the LIDAR sensor; the other types of errors are removed by the integration of LIDAR and camera. The LIDAR tracking model is [42]:

$$ p\left( {^l{\mathbf{z}}_t\left| {{\mathbf{x}}_t^{(i)}} \right.} \right) = \begin{cases} \dfrac{1}{\sqrt{2\pi }\,\sigma_{\text{hit}}}\exp \left( { - \dfrac{{\left( {^l{{\mathbf{z}}_t} - {}^l{\mathbf{z}}_t^{*}} \right)}^2}{2\sigma_{\text{hit}}^2}} \right), & 0 \leq {}^l{{\mathbf{z}}_t} \leq {z_{\max }} \\ 0, & \text{otherwise} \end{cases} $$
(20)

The computer vision-based vehicle tracking is implemented with KLT tracking [36, 47]. A function \( s\left( {x_{{t(i)}}^m} \right) \) is used to determine whether a predicted corner \( x_{{t(i)}}^m \) is close to the observed corners. It is defined as \( s\left( {x_{{t(i)}}^m} \right) = - \sum\limits_{{j = 1}}^M {\exp {{\left( {d_{{t(i,j)}}^m} \right)}^2}} \), where \( M \) is the total number of corners and \( {\left( {d_{{t(i,j)}}^m} \right)^2} = {\left\| {z_{{t(j)}} - x_{{t(i)}}^m} \right\|^2} \) is the squared Euclidean distance between the i-th predicted corner of the m-th particle \( x_{{t(i)}}^m \) and the j-th extracted corner \( z_{{t(j)}} \) [47]. The camera model is defined as [47]:

$$ p\left( {^c{\mathbf{z}}_t\left| {{\mathbf{x}}_t^{{(i)}}} \right.} \right) = \exp \left( { - \sum {{{\left( {s\left( {x_{{t(i)}}^m} \right) - 1} \right)}^2}} } \right) $$
(21)

Finally, the sensor fusion tracking system has three measurement situations: (1) the vehicle is tracked by both the LIDAR and the camera, in which case single-sensor tracking errors are eliminated by the sensor integration technique proposed in section “Moving Vehicle Detection System”; (2) the vehicle is out of the camera detection zone, so it is tracked by the LIDAR alone; and (3) the vehicle is tracked by the camera but not detected by the LIDAR sensor because of factors such as distance or weak reflection. The sensor fusion model is given as:

$$ p\left( {^l{\mathbf{z}}_t;{}^c{\mathbf{z}}_t\left| {{\mathbf{x}}_t^{(i)}} \right.} \right) = \begin{cases} \alpha\, p\left( {^l{\mathbf{z}}_t\left| {{\mathbf{x}}_t^{(i)}} \right.} \right) + \beta\, p\left( {^c{\mathbf{z}}_t\left| {{\mathbf{x}}_t^{(i)}} \right.} \right), & \text{target is tracked by both LIDAR and camera} \\ p\left( {^l{\mathbf{z}}_t\left| {{\mathbf{x}}_t^{(i)}} \right.} \right), & \text{target is tracked by LIDAR alone} \\ p\left( {^c{\mathbf{z}}_t\left| {{\mathbf{x}}_t^{(i)}} \right.} \right), & \text{target is tracked by camera alone} \end{cases} $$
(22)

where the weights \( 0 \leq \alpha \leq 1 \) and \( 0 \leq \beta \leq 1 \) are two coefficients obtained by data training, with \( \alpha + \beta = 1 \), which allows the LIDAR and computer vision information to be balanced.
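The case structure of Eq. 22 can be captured in a few lines. The sketch below assumes the per-sensor likelihoods of Eqs. 20 and 21 have already been evaluated for a given particle; the flags indicating which sensor currently tracks the target, and the function name, are hypothetical.

def fused_likelihood(p_lidar, p_camera, lidar_hit, camera_hit, alpha=0.5, beta=0.5):
    # p_lidar, p_camera: likelihood values for one particle (Eqs. 20 and 21).
    # lidar_hit / camera_hit: whether each sensor currently tracks the target.
    if lidar_hit and camera_hit:
        return alpha * p_lidar + beta * p_camera  # both sensors (alpha + beta = 1)
    if lidar_hit:
        return p_lidar                            # LIDAR-alone case
    return p_camera                               # camera-alone case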

The weight \( w_t^{{(i)}} \) can then be calculated using Eq. 14. The particle filter is summarized in Algorithm 2. Unlike the Kalman filter or EKF, the particle filter can track the vehicle state under multi-modal or arbitrary distributions.

Algorithm 2: Particle Filter for Sensor Fusion Systems

Input: \( \left\{ {x_{{t - 1}}^{{(i)}},\,w_{{t - 1}}^{{(i)}}} \right\} \), \( i = 1, \,2,\, \cdots, \,{N_s} \): set of weighted particles at time \( t - 1 \)

      \( {\mathbf{z}}_t = \left( {^l{\mathbf{z}}_t;^c{\mathbf{z}}_t} \right) \): LIDAR and computer vision measurement at time \( t \)

Output: \( \left\{ {x_t^{{(i)}},\,w_t^{{(i)}}} \right\} \), \( i = 1, \,2,\, \cdots, \,{N_s} \): set of weighted particles at time \( t \)

Process:

 for \( i = 1 \) to \( {N_s} \) do

  Predict \( {\mathbf{x}}_t^{{(i)}} \) as \( {\mathbf{p}}_t^{{(i)}} = {\mathbf{p}}_{{t - 1}}^{{(i)}} + {\mathbf{v}}_{{t - 1}}^{{(i)}}T + n \)

   \( w_t^{{(i)}} = w_{{t - 1}}^{{(i)}}p\left( {{\mathbf{z}}_t\left| {{\mathbf{x}}_t^{{(i)}}} \right.} \right) \)

 end for

calculate \( \widehat{{{N_{{eff}}}}} \)

if \( \widehat{{{N_{{eff}}}}} < {N_T} \)

 for \( i = 1 \) to \( {N_s} \) do

  compute \( w_t^{{(i)}} \) using Eq. 14, in which \( p\left( {^l{\mathbf{z}}_t;^c{\mathbf{z}}_t\left| {{\mathbf{x}}_t^{{(i)}}} \right.} \right) \) is given in Eq. 22

  update the particle with \( \left\{ {x_t^{{(i)}},\;{{1} \left/ {{{N_s}}} \right.}} \right\} \)

 end for

else

\( Z = \sum\nolimits_{{i = 1}}^{{{N_s}}} {w_t^{{(i)}}} \)

 for \( i = 1 \) to \( {N_s} \) do

  update the particle with \( \left\{ {x_t^{{(i)}},\,{Z^{{ - 1}}}w_t^{{(i)}}} \right\} \)

 end for

end if

Experiment Results

The experimental results of vehicle detection are discussed in this section. To evaluate the performance of the system, a dataset of 377 images was used for training and testing. These images were taken from both the Caltech vehicle image dataset [37] and video samples collected by probe vehicles. A separate test dataset consists of 526 images with synchronized scanning data used for performance evaluation. The test dataset was recorded in a local parking lot on different days during different seasons.

Hit rate (HR), false alarm rate (FAR), and region detection rate (RDR) are used to evaluate the performance of the system. Here HR is the number of detected vehicles over the total number of vehicles. RDR denotes the percentage of “real” vehicle detections: a “real” vehicle detection is one in which the majority of the vehicle's area is covered by a rectangle, and only one rectangle covers that object. Therefore, a target that is hit may not be a region that is “really” detected, whereas a region-detected object is always a hit. The higher the RDR, the more accurate the detection result. Figure 16 illustrates several cases of hit, false alarm, and region detection. In this figure, TR represents the target region in the image, and DR is the region detected by the classifier.

Vehicle Detection, Tightly Coupled LIDAR and Computer Vision Integration for. Figure 16
figure 2034

Three target detection cases. The first row is region detected, the second row is hit but not region detected, and the third row is false alarm
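For illustration only, the snippet below computes HR and RDR from aggregate counts, under the assumption that RDR is taken over the total number of vehicles; the counts and names are hypothetical.

def hit_and_region_rates(n_hit, n_region_detected, n_vehicles):
    hr = n_hit / n_vehicles               # detected vehicles / total vehicles
    rdr = n_region_detected / n_vehicles  # "really" detected vehicles / total vehicles
    assert rdr <= hr                      # a region-detected object is always a hit
    return hr, rdr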

Table 2 gives the detection performance of the Adaboost classifier (detection with camera only), the classic LIDAR–camera sensor fusion system, and the proposed LIDAR and computer vision-based detection and error-correction approach. In both the classic sensor fusion technique and the proposed approach, LIDAR data are utilized for ROI generation; the difference is that in the proposed approach the LIDAR data also help correct the classification result. Table 2 shows that this approach both improves the hit rate and reduces the false alarm rate compared with the Adaboost classifier. Compared with the classic LIDAR–camera fusion system, this approach improves the region detection rate from 84.85% to 89.32%, since for each object that is hit but not accurately covered, the LIDAR scanning data help recompute the position of the target. Most of the overlapping or partial target detection areas are merged during the LIDAR correction process.

Vehicle Detection, Tightly Coupled LIDAR and Computer Vision Integration for. Table 2 Detection result

Figure 17 illustrates some of the vehicle detection results. The left column presents the LIDAR scan points, and the right column shows the camera images with information from the sensor fusion system. In Fig. 17a, all the vehicles are detected and marked with rectangles. Figure 17b shows that the classifier found only two vehicles, which are bounded by red rectangles. The other two ROIs (shown as blue rectangles) are classified as containing non-vehicle objects. In fact, one of them is a trash can; the other is a vehicle at a distance of 92.78 m, which is too far away and too small in the image for the classifier to recognize.

Vehicle Detection, Tightly Coupled LIDAR and Computer Vision Integration for. Figure 17
figure 2035

LIDAR scan points and the final vehicle detection results

During the test, the hit rate decreases as the distance between the probe vehicle and the target vehicles increases. The targets are detected frame by frame, so a target vehicle may not be recognized by the classifier in a given frame even if it was recognized in the previous frame. The vehicle tracking technique helps solve this problem. With the particle filter–based tracking algorithm, the target is detected in the first frame (or first few frames) and then tracked in the following frames. This approach both improves the detection accuracy and reduces the required amount of computation. HR and RDR are therefore further improved by vehicle tracking.

Summary and Discussion

A novel vehicle detection system has been proposed based on tightly integrating LIDAR and computer vision sensors. Distances to objects are first determined by the LIDAR sensors, and each object is then classified from the computer vision images. In addition, data from these two complementary sensors are combined for classifier correction and vehicle detection. The experimental results indicate that, compared with image-based and classic sensor fusion–based vehicle detection systems, this approach achieves a higher hit rate and a lower false alarm rate. It is useful for modeling and predicting traffic conditions over a variety of roadways, and the system may be used in future autonomous navigation systems.

Conclusions and Future Work

This entry presents a multi-sensor vehicle detection system developed specifically to obtain the state of surrounding vehicles. It involves the development of a tightly coupled LIDAR and computer vision system, the calibration of a pair of multi-planar LIDAR sensors with the camera system, and a methodology for sensor fusion-based vehicle detection.

This section provides a brief summary of this entry, as well as possible future work.

Summary

Automatic vehicle detection techniques are becoming an essential part of our daily lives. They open up many potential opportunities but they also come with challenges in terms of sensing capability and accuracy. In this entry, the problem of vehicle detection is addressed, and some novel approaches have been demonstrated to solve the problem in a traffic environment.

The goal of this research is to provide a solution for measuring the state of surrounding vehicles, where the state includes position, orientation, speed, and acceleration. Sensor fusion techniques are utilized to provide a direct measurement of the state, using a variety of sensors including LIDAR and computer vision. The aim is to show quantitatively that the integration of sensors provides a more accurate and effective estimate of the vehicle state; the proposed system has successfully met this goal.

The developed multi-planar LIDAR and computer vision sensor calibration approach is, to the author's best knowledge, the first calibration method for an “invisible-beam” multi-planar LIDAR and a camera. In comparison with commonly used calibration methods that require an infrared camera to “see” the LIDAR beams or a specially designed calibration target, this approach is easy to implement at low cost. It has been shown theoretically and experimentally to estimate the geometric relationship between the two sensors.

Based on this unique calibration method, a sensor fusion-based vehicle detection system is designed and implemented. It consists of three major components: (1) ROI generation from the LIDAR scan data; (2) vehicle classification using a computer vision-based Adaboost algorithm; and (3) vehicle position verification using the output of the LIDAR sensor. A vehicle tracking model is also presented in this entry, which uses a joint probability model-based particle filter to predict the state of the vehicle. The experimental results show that the designed sensor fusion system achieves a higher detection rate and lower false positive and false negative rates than a single sensor-based detection method. The positions of the detected vehicles are represented in vehicle coordinates to generate a local traffic map.

Taken together, the tests in this entry demonstrate that good vehicle detection performance can be achieved using a LIDAR and computer vision sensor-based moving platform. Such results are especially important for vehicle navigation systems, as well as for traffic surveillance systems that are equipped with multiple sensors.

Future Directions

Although a sensor fusion system is developed in this entry for the on-board vehicle detection application, the introduction of sensor fusion-based systems in the automobile industry is believed to be still a couple of years away. In future driver assistance systems, sensor fusion techniques can be employed to support, or even replace, the driver. Moreover, the falling costs of sensors such as RADAR, GPS, inertial sensors (INS), and LIDAR, combined with increasing image processing capability, provide a bright future for on-board intelligent transportation applications.

Disclaimer

Much of the material for this entry comes from the 2010 dissertation of Lili Huang, at the University of California-Riverside (see [44]). Portions of this entry have also appeared in [45] and [46].