
Deep Learning in Object Detection

  • Yanwei Pang
  • Jiale Cao

Abstract

Object detection is an important research area in image processing and computer vision. Its performance has improved significantly through the application of deep learning, and among these methods, convolutional neural network (CNN)-based approaches are the most frequently used. CNN methods mainly fall into two classes: two-stage methods and one-stage methods. This chapter first introduces some typical CNN-based architectures in detail. After that, pedestrian detection, as a classical subset of object detection, is further introduced. According to whether CNN is used or not, pedestrian detection methods can be divided into two types: handcrafted feature-based methods and CNN-based methods. Among these methods, NNNF (non-neighboring and neighboring features), inspired by pedestrian attributes (i.e., appearance constancy and shape symmetry), and MCF, based on handcrafted channels and each layer of CNN, are specifically illustrated. Finally, some challenges of object detection (i.e., scale variation, occlusion, and deformation) are discussed.

1 Introduction

Object detection can be applied to many computer vision areas, such as video surveillance, robotics, and human-computer interaction. However, due to factors such as complex backgrounds, illumination variation, scale variation, occlusion, and object deformation, object detection remains very challenging. In the past few decades, researchers have devoted considerable effort to advancing object detection. Figure 1 shows the mean average precision (mAP) on PASCAL VOC2007 of some representative methods (i.e., [18, 23, 24, 25, 51, 52, 63]).
Fig. 1

The progress of object detection on PASCAL VOC2007

Depending on whether deep learning is used or not, object detection methods can be divided into two main classes: handcrafted feature-based methods [6, 14, 63, 73] and deep learning-based methods [23, 25, 36, 38, 52]. In the first decade of the twenty-first century, traditional handcrafted feature-based methods were mainstream. During this period, many famous and successful image feature descriptors and models (e.g., SIFT [39], HOG [10], Haar [61], LBP [44], and DPM [19]) were proposed. Based on these feature descriptors and classical classifiers (e.g., SVM and AdaBoost), such methods achieved great success at the time.

However, around 2010 the performance of object detection began to plateau. Though many methods were still proposed, the performance improvement was relatively limited. Meanwhile, deep learning began to show superior performance in some computer vision areas (e.g., image classification [26, 33, 56, 57]). In 2012, with large-scale image data (i.e., ImageNet [13]), a deep CNN (called AlexNet [33]) achieved the best classification performance in the ILSVRC-2012 competition, outperforming the second-best method by 10.9%.

With the great success of deep learning on image classification [13, 26, 56, 57], researchers started to explore how to improve object detection performance with deep learning. In recent years, object detection based on deep learning has also achieved great progress [23, 25, 52]. The mAP of object detection on PASCAL VOC2007 [17] dramatically increased from 58% (based on RCNN with AlexNet [23]) to 86% (based on Faster RCNN with ResNet [26]). Currently, the state-of-the-art methods for object detection are based on deep convolutional neural networks (CNN) [23, 27, 36, 52].

In Sect. 2, some typical CNN architectures of object detection will be introduced. Pedestrian detection, as a special case of object detection, will be specifically discussed in Sect. 3. Finally, some representative challenges (i.e., occlusion, scale variation, and deformation) of object detection will be illustrated in Sect. 4.

2 The CNN Architectures of Object Detection

According to the pipeline of deep object detection, the methods can be divided into two main classes, as shown in Fig. 2: two-stage methods [11, 23, 24, 25, 52] and one-stage methods [36, 38, 49, 54]. Two-stage methods first generate candidate object proposals and then classify these proposals into specific categories. One-stage methods simultaneously extract and classify all object proposals. Generally speaking, two-stage methods have a relatively slower detection speed but higher detection accuracy, while one-stage methods have a much faster detection speed and comparable detection accuracy. In the following part of this section, two-stage methods and one-stage methods are introduced, respectively.
Fig. 2

Object detection in deep learning can be mainly divided into two different classes: two-stage methods and one-stage methods

2.1 Two-Stage Methods for Deep Object Detection

Two-stage methods treat object detection as a multistage process. Given an input image, proposals of possible objects are first extracted. After that, these proposals are classified into specific object categories by a trained classifier. The benefits of these methods can be summarized as follows: (1) Proposal generation greatly reduces the number of candidate regions passed to the subsequent classifier, which accelerates detection. (2) The step of proposal generation can be seen as a bootstrap technique: based on the proposals of possible objects, the classifier can focus on the classification task with little influence from background (or easy negatives) in the training stage, which improves detection accuracy. Among these two-stage methods, the RCNN series, including RCNN [23], SPPnet [25], Fast RCNN [24], and Faster RCNN [52], is very representative.

With the great success of deep convolutional neural networks (CNN) on image classification [33, 56], Girshick et al. [23] made an early attempt to apply deep CNNs to object detection and proposed RCNN. Compared to the traditional, highly tuned DPM [19], RCNN improves mean average precision (mAP) by 21% on PASCAL VOC2010. Figure 3 shows the architecture of RCNN. It can be divided into three main steps: (1) It first extracts candidate object proposals, which are category-independent. They can be extracted by objectness methods such as selective search [60], EdgeBox [75], and BING [9]. (2) For each object proposal of arbitrary scale, the image data is warped to a fixed size (e.g., 227 × 227) and fed into a deep CNN (e.g., AlexNet [33]) to compute a 4096-d feature vector. (3) Finally, based on the feature vector extracted by the CNN, SVM classifiers predict the specific category of each proposal.
Fig. 3

The architecture of RCNN. It consists of three steps: proposal generation, CNN feature computation, and proposal classification. (Ⓒ[2014] IEEE. Reprinted, with permission, from Ref. [23])

In the training stage, object proposals are first generated by selective search for training the CNN and the SVM classifiers. The CNN (e.g., AlexNet [33]) is first pre-trained on ImageNet [13] and then fine-tuned on a specific object detection dataset (e.g., PASCAL VOC [17]). Because the numbers of object categories on ImageNet [13] and PASCAL VOC [17] differ, the outputs of the final fully connected layer of the CNN are changed from 1000 to 21 when fine-tuning on PASCAL VOC, where 21 corresponds to the 20 object classes of PASCAL VOC plus the background. When fine-tuning the CNN, a proposal is labelled as positive for the matched class if it has the maximum IoU overlap with a ground-truth bounding box and the overlap is at least 0.5; otherwise, the proposal is labelled as the background class. Based on the CNN features extracted from the trained network, linear SVM classifiers for the different classes are then trained, respectively. When training the SVM classifier for each class, only the ground-truth bounding boxes are labelled as positives, and a proposal is labelled as negative if its IoU overlap with all ground-truth bounding boxes is below 0.3. Because the extracted CNN features are too large to load into memory, a bootstrap technique is used to mine hard negatives when training the SVM classifiers.
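The two labelling rules above (best-matched class when IoU ≥ 0.5 for fine-tuning; ground-truth boxes as positives and IoU < 0.3 as negatives for the SVMs) can be summarized in a short sketch. The snippet below is only an illustration with a simple `iou` helper; it is not the original RCNN code.

```python
import numpy as np

def iou(box, gt):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(box[0], gt[0]), max(box[1], gt[1])
    x2, y2 = min(box[2], gt[2]), min(box[3], gt[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area(box) + area(gt) - inter + 1e-10)

def label_proposal(proposal, gt_boxes, gt_classes, stage="finetune"):
    """Assign a class label (0 = background) to one proposal."""
    overlaps = [iou(proposal, g) for g in gt_boxes]
    best = int(np.argmax(overlaps))
    if stage == "finetune":
        # Fine-tuning: positive for the best-matched class if IoU >= 0.5.
        return gt_classes[best] if overlaps[best] >= 0.5 else 0
    else:
        # SVM training: only ground-truth boxes themselves are positives;
        # proposals with IoU < 0.3 against every ground truth are negatives,
        # and everything in between is ignored (returned as None here).
        if max(overlaps) < 0.3:
            return 0
        return None
```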

To improve the localization accuracy of the proposals, a linear regression model is further trained to predict a more accurate bounding box based on the pool5 features of the trained CNN (i.e., AlexNet [33]). Assume that the original bounding box of a proposal P is represented by (Px, Py, Pw, Ph), where Px and Py are the coordinates of the center of P, and Pw and Ph are the width and height of P. The bounding box of the corresponding ground truth G is represented by (Gx, Gy, Gw, Gh). Then, the regression target for a given pair (P, G) can be written as:
$$\displaystyle \begin{aligned} & t_x=(G_x-P_x)/P_w, \end{aligned} $$
(1)
$$\displaystyle \begin{aligned} & t_y=(G_y-P_y)/P_h, \end{aligned} $$
(2)
$$\displaystyle \begin{aligned} & t_w=\log(G_w/P_w), \end{aligned} $$
(3)
$$\displaystyle \begin{aligned} & t_h=\log(G_h/P_h). \end{aligned} $$
(4)
To predict the regression target (i.e., \(({\hat t_x},{\hat t_y},{\hat t_w},{\hat t_h})\)) for a new proposal P, the pool5 features of the proposal, denoted ϕ5(P), are used. Thus, \({\hat t_*}(P) = {{{\mathbf {w}}_*}^T}{\phi _5}(P)\), where w∗ is a learnable weight vector and ∗ denotes one of x, y, w, h. Given the training sample pairs {(Pi, Gi)}, i = 1, 2, …, N, w∗ can be optimized by the regularized least squares objective as follows:
$$\displaystyle \begin{aligned} {{\mathbf{w}}_*} = \operatorname*{\arg\min}_{\hat{\mathbf{w}}_*}\sum_{i}^{N}(t_*^i-{\hat{\mathbf{w}}_*}^T \phi_5(P^i))^2+\lambda ||{\hat{\mathbf{w}}_*}||{}^2, \end{aligned} $$
(5)
where λ is a regularization factor, which is usually set as 1000. When training the regression model per class, the proposal that has an IoU overlap over 0.6 with a ground-truth bounding box is used. Otherwise, the proposal is ignored.
Based on the learned weights w∗ (i.e., wx, wy, ww, wh), the predicted bounding box of a proposal P can be calculated as follows:
$$\displaystyle \begin{aligned} &{{{\hat P}_x} = {P_w}{{\mathbf{w}}_x}^T{\phi _5}(P) + {P_x}}, \end{aligned} $$
(6)
$$\displaystyle \begin{aligned} &{{{\hat P}_y} = {P_h}{{\mathbf{w}}_y}^T{\phi _5}(P) + {P_y}}, \end{aligned} $$
(7)
$$\displaystyle \begin{aligned} &{{{\hat P}_w} = {P_w}\exp ({{\mathbf{w}}_w}^T{\phi _5}(P))}, \end{aligned} $$
(8)
$$\displaystyle \begin{aligned} &{{{\hat P}_h} = {P_h}\exp ({{\mathbf{w}}_h}^T{\phi _5}(P))}. \end{aligned} $$
(9)
The refined proposals thus have more accurate locations.
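As a compact illustration of Eqs. (1)–(4) and the inverse transform in Eqs. (6)–(9), the following sketch computes regression targets and applies predicted offsets, assuming boxes are stored as (center x, center y, width, height):

```python
import numpy as np

def bbox_targets(P, G):
    """Regression targets (t_x, t_y, t_w, t_h) for proposal P and ground truth G,
    both given as (x, y, w, h) with (x, y) the box center."""
    tx = (G[0] - P[0]) / P[2]
    ty = (G[1] - P[1]) / P[3]
    tw = np.log(G[2] / P[2])
    th = np.log(G[3] / P[3])
    return np.array([tx, ty, tw, th])

def apply_deltas(P, t):
    """Inverse transform: predicted box from proposal P and predicted offsets t."""
    x = P[2] * t[0] + P[0]
    y = P[3] * t[1] + P[1]
    w = P[2] * np.exp(t[2])
    h = P[3] * np.exp(t[3])
    return np.array([x, y, w, h])

# Round trip: applying the targets of (P, G) to P recovers G.
P = np.array([50.0, 60.0, 40.0, 80.0])
G = np.array([55.0, 58.0, 44.0, 90.0])
assert np.allclose(apply_deltas(P, bbox_targets(P, G)), G)
```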
Though RCNN dramatically improves object detection performance, each object proposal has to be warped to a fixed size and fed into the CNN separately. Because the CNN feature computation of different proposals is not shared, RCNN is very time-consuming. To remove the fixed-size constraint and accelerate detection, He et al. [25] proposed SPPnet. Figure 4 compares RCNN and SPPnet. Instead of cropping or warping the image data of every proposal before computing the CNN features, SPPnet first computes the convolutional features of the whole image and then uses spatial pyramid pooling to extract a fixed-size feature for each proposal. Figure 5 illustrates the spatial pyramid pooling (SPP) layer. Based on the feature maps of the last convolutional layer, SPP splits the feature maps of the proposal into 4 × 4 spatial bins, 2 × 2 spatial bins, and 1 × 1 spatial bins, respectively. In each spatial bin, the feature response is calculated as the maximum of all the features that fall into that bin (i.e., max-pooling). After that, the outputs of the 4 × 4, 2 × 2, and 1 × 1 spatial bins are concatenated into a 21c-d feature vector, where c = 256 is the number of feature maps of the last convolutional layer. After concatenation, two fully connected layers are attached to this 21c-d feature vector. For training the CNN of SPPnet, two different strategies can be adopted: single-scale training and multi-scale training. Single-scale training uses a fixed-size input (i.e., 224 × 224) cropped or warped from the input images. Multi-scale training uses images of multiple sizes, where in each iteration only images of one scale are used for training. The size of the input image is represented by s × s, where s is uniformly sampled from 180 to 224. Because multi-scale training simulates the varying sizes of images, it improves detection accuracy. For training the SVM classifiers, the steps are the same as in RCNN. In the test stage, images of arbitrary scale can be fed into SPPnet. Compared to RCNN, SPPnet has the following advantages: (1) All candidate proposals share all the convolutional layers before the fully connected layers, so it has a faster detection speed than RCNN. (2) SPPnet makes use of multilevel spatial information of objects, which is more robust to object deformation; meanwhile, multi-scale training enlarges the training data. Thus, it has higher detection accuracy than RCNN.
Fig. 4

Comparison of RCNN and SPPnet

Fig. 5

Spatial pyramid pooling layer. The feature maps of a given proposal are pooled into 4 × 4 spatial bins, 2 × 2 spatial bins, and 1 × 1 spatial bins, respectively. After that, they are concatenated into a fixed-size feature vector and fed into two fully connected layers
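The pooling described above can be sketched in a few lines: the proposal's feature map is max-pooled into 4 × 4, 2 × 2, and 1 × 1 grids and the results are concatenated into a fixed 21c-d vector regardless of the input size. This is a simplified NumPy illustration, not the original SPPnet implementation.

```python
import numpy as np

def spp_layer(feature_map, levels=(4, 2, 1)):
    """Spatial pyramid pooling over a (c, h, w) feature map.
    Returns a fixed-length vector of size c * sum(l*l for l in levels)."""
    c, h, w = feature_map.shape
    outputs = []
    for level in levels:
        # Bin boundaries that cover the whole map for any h, w.
        ys = np.linspace(0, h, level + 1).astype(int)
        xs = np.linspace(0, w, level + 1).astype(int)
        for i in range(level):
            for j in range(level):
                bin_region = feature_map[:, ys[i]:max(ys[i + 1], ys[i] + 1),
                                            xs[j]:max(xs[j + 1], xs[j] + 1)]
                outputs.append(bin_region.reshape(c, -1).max(axis=1))  # max-pooling
    return np.concatenate(outputs)

# A 256-channel map of arbitrary spatial size always yields a 21 * 256-d vector.
print(spp_layer(np.random.rand(256, 13, 9)).shape)   # (5376,)
print(spp_layer(np.random.rand(256, 40, 27)).shape)  # (5376,)
```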

Though SPPnet accelerates detection by sharing the computation of all the convolutional layers, the training of SPPnet is still a multistage process similar to RCNN. Namely, one needs to separately fine-tune the CNN on the object detection dataset, train multiple SVM classifiers, and learn the bounding box regressors. To fix the disadvantages of RCNN and SPPnet, Girshick [24] further proposed Fast RCNN, which integrates CNN training, object classification, and bounding box regression into a unified framework. Figure 6 shows the architecture of Fast RCNN. Fast RCNN first computes all the convolutional layers over the whole image. For each proposal, Fast RCNN uses an ROI pooling layer to extract fixed-size feature maps from the feature maps of the last convolutional layer, then feeds the fixed-size feature maps into two fully connected layers, and finally generates two sibling fully connected branches for object classification and box regression. For object classification, it has c + 1 softmax outputs, where c is the number of object classes. For box regression, it has 4c outputs, where every four outputs correspond to the box offsets of one class. The ROI pooling layer warps the feature maps of an object proposal into fixed-size spatial bins (e.g., 7 × 7) and uses a max-pooling operation to calculate the feature response in each bin. Because Fast RCNN has two sibling outputs for object classification and box regression, the multitask training loss L is the joint of the classification loss Lcls and the regression loss Lloc for each ROI as follows:
$$\displaystyle \begin{aligned} L(p,t) = L_{cls}(p,c)+\lambda [c\ge 1] L_{loc}(t,v), \end{aligned} $$
(10)
where λ balances the classification loss and the regression loss, and [c ≥ 1] equals 1 if c ≥ 1 and 0 otherwise. Namely, ROIs belonging to the background class do not contribute to the regression loss. The classification loss \(L_{cls}(p,c)=-\log p_c\) is the log loss for the true class c. The regression loss Lloc is defined over the ground-truth regression target (i.e., (vx, vy, vw, vh)) and the predicted target (i.e., (tx, ty, tw, th)) as follows:
$$\displaystyle \begin{aligned} L_{loc}(t,v) = \sum_{i\in {x,y,w,h}} smooth_{L_1} (t_i,v_i), \end{aligned} $$
(11)
where
$$\displaystyle \begin{aligned} smooth_{L_1}(x)= \begin{cases} 0.5x^2,&\text{if {$|x|<1$}}, \\ |x|-0.5,&\text{otherwise}. \end{cases} \end{aligned} $$
(12)
Compared to the L2 loss used in RCNN and SPPnet, the smooth L1 loss is more robust to outliers.
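A direct transcription of Eq. (12) is given below; the printed gradients illustrate why the smooth L1 loss is bounded for large errors and hence less sensitive to outliers than the L2 loss.

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1 loss of Eq. (12), applied elementwise."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) < 1.0, 0.5 * x ** 2, np.abs(x) - 0.5)

def smooth_l1_grad(x):
    """Gradient of smooth L1: x inside (-1, 1), +/-1 outside (bounded)."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) < 1.0, x, np.sign(x))

diffs = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])  # t_i - v_i
print(smooth_l1(diffs))       # [2.5   0.125 0.    0.125 2.5  ]
print(smooth_l1_grad(diffs))  # [-1.  -0.5   0.    0.5   1.  ]
```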
Fig. 6

The architecture of Fast RCNN. Compared to RCNN and SPPnet, it joins the classification and regression into a unified framework. (Ⓒ[2015] IEEE. Reprinted, with permission, from Ref. [24])

In the training stage, the CNN is first initialized from the pre-trained ImageNet network and then trained with back-propagation in an end-to-end way. Each mini-batch of R ROIs is sampled from N images, where each image provides R∕N proposals. Proposals from the same image share all the convolutional computation. Generally, N is set to 2 and R is set to 128. To achieve scale invariance, two different strategies are used: a brute-force approach and a multi-scale approach. In the brute-force approach, the images are resized to a fixed size in the training and test stages. In the multi-scale approach, the images are randomly rescaled to one scale of a pyramid in each training iteration. In the test stage, the multi-scale images are fed into the trained network, respectively, and the detection results are combined by NMS.

For proposal generation, RCNN [23], SPPnet [25], and Fast RCNN [24] are all based on selective search. Selective search [60] uses handcrafted features and adopts hierarchical grouping strategies to capture all possible object proposals. Generally, it runs at about 2 s per image on a common CPU, while the detection network of Fast RCNN runs at about 100 ms per image on a GPU. Thus, proposal generation is far more time-consuming than the detection network of Fast RCNN. Though selective search can also be re-implemented on the GPU, proposal extraction is still isolated from the detection network. Thus, region proposal extraction becomes the bottleneck of Fast RCNN for object detection. To solve this problem, Ren et al. [52] proposed Faster RCNN. It integrates proposal generation, proposal classification, and proposal regression into a unified network. Figure 7 shows the network architecture of Faster RCNN. It consists of two modules. One module is the region proposal network (RPN), which extracts candidate object proposals. The other module is Fast RCNN, which classifies these proposals into specific categories and predicts more accurate proposal locations. The two modules share the same base sub-network. On the one hand, RPN generates candidate object proposals from deep convolutional features, which improves proposal quality. On the other hand, Faster RCNN is an end-to-end framework with a multitask loss. Compared to Fast RCNN, Faster RCNN can achieve better detection performance with far fewer proposals.
Fig. 7

The architecture of Faster RCNN. Proposal generation (RPN) and proposal classification (Fast RCNN) are integrated into a unified framework

RPN slides a small network over the output layer of the base network. The small network consists of one 3 × 3 convolutional layer and two sibling 1 × 1 convolutional layers for box regression and box classification. Box classification is class-agnostic. For each sliding-window location, RPN predicts multiple proposals based on anchors of different aspect ratios and scales. Assuming the number of anchors is k, the box regression layer has 4k outputs per sliding window, and the box classification layer has 2k outputs per sliding window. Generally speaking, three aspect ratios of {1 : 2, 1 : 1, 2 : 1} and three scales of {0.5, 1, 2} are used. Thus, there are nine (i.e., 3 × 3) anchors (i.e., k = 9) at each sliding-window location. The multitask loss of RPN consists of two parts, the classification loss Lcls and the regression loss Lreg, and can be written as follows:
$$\displaystyle \begin{aligned} L(p,v) = \frac{1}{N_{cls}}\sum_i L_{cls}(p_i,c_i) + \lambda \frac{1}{N_{reg}}\sum_i [c_i\ge 1] L_{reg}(t_i,v_i), \end{aligned} $$
(13)
where Ncls (i.e., 256) and Nreg (about 2400) are terms that normalize the classification loss and the location loss, respectively, λ balances the classification loss and the regression loss, and [ci ≥ 1] is 1 if ci ≥ 1 and 0 otherwise. The classification loss and regression loss are the same as those of Fast RCNN. The following two kinds of anchors are labelled as positives: (1) the anchor with the highest IoU overlap with a ground-truth bounding box and (2) any anchor that has an IoU overlap over 0.7 with a ground-truth bounding box. An anchor is labelled as negative if it is not labelled as positive and its IoU overlap with all ground-truth bounding boxes is below 0.3.
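As a rough illustration of how the k = 9 anchors are enumerated, the sketch below generates anchors of three aspect ratios and three scales at every sliding-window position given a feature stride. The base size, stride, and scale values are illustrative assumptions rather than the exact Faster RCNN settings.

```python
import numpy as np

def generate_anchors(feat_h, feat_w, stride=16, base_size=128,
                     ratios=(0.5, 1.0, 2.0), scales=(0.5, 1.0, 2.0)):
    """Return anchors of shape (feat_h * feat_w * k, 4) in (x1, y1, x2, y2)."""
    # k = len(ratios) * len(scales) anchor shapes centered at the origin.
    shapes = []
    for r in ratios:
        for s in scales:
            w = base_size * s * np.sqrt(1.0 / r)
            h = base_size * s * np.sqrt(r)
            shapes.append([-w / 2, -h / 2, w / 2, h / 2])
    shapes = np.array(shapes)                                        # (k, 4)

    # Centers of every sliding-window position mapped back to the image.
    xs = (np.arange(feat_w) + 0.5) * stride
    ys = (np.arange(feat_h) + 0.5) * stride
    cx, cy = np.meshgrid(xs, ys)
    centers = np.stack([cx, cy, cx, cy], axis=-1).reshape(-1, 1, 4)  # (H*W, 1, 4)

    return (centers + shapes[None]).reshape(-1, 4)

anchors = generate_anchors(38, 50)   # e.g., a ~600x800 image with stride 16
print(anchors.shape)                 # (17100, 4) = 38 * 50 * 9 anchors
```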

Because RPN and Fast RCNN share the same base network, they cannot be trained independently. Faster RCNN provides three different ways to train RPN and Fast RCNN in the unified network: (1) Alternating training. RPN is trained first. Based on the proposals generated by RPN and the filter weights of RPN, Fast RCNN is then trained. The above two steps are iterated twice. (2) Approximate joint training. RPN and Fast RCNN are seen as one unified network whose loss is the joint of the RPN loss and the Fast RCNN loss. In each training iteration, the proposals generated by RPN are treated as fixed proposals when training the Fast RCNN detector; namely, the derivatives with respect to the proposal coordinates are ignored. (3) Non-approximate joint training. The difference from approximate joint training is that the derivatives with respect to the proposal coordinates are taken into account. Because the standard ROI pooling layer is not differentiable with respect to the proposal coordinates, the first two ways are usually used.

Generally, the above state-of-the-art object detection methods (i.e., RCNN [23], SPPnet [25], Fast RCNN [24], and Faster RCNN [52]) use a CNN pre-trained on an image classification dataset (i.e., ImageNet [13]). Dai et al. [11] argued that this design involves a dilemma to some degree: a deep CNN for image classification usually favors translation invariance, while a deep CNN for object detection needs to be translation variant. To address this dilemma between image classification and object detection, Dai et al. [11] proposed R-FCN. It encodes object position information through a position-sensitive ROI pooling layer (PSROI) for the following Fast RCNN subnet. Figure 8 shows the architecture of R-FCN. Region proposal generation is the same as in Faster RCNN. Based on the output layer of the original base network, R-FCN generates k × k position-sensitive convolutional banks. The convolutional banks correspond to the k × k spatial grids, respectively. In each convolutional bank, there are c + 1 convolutional layers (c is the number of object categories, and + 1 accounts for the background category). Namely, k × k × (c + 1) convolutional feature maps are generated. For a given proposal, the position-sensitive ROI pooling layer generates k × k score maps from the position-sensitive feature maps, where the response of the (i, j)-th bin pools over the (i, j)-th position-sensitive score maps as follows:
$$\displaystyle \begin{aligned} r_{c^*}(i,j)=\sum_{(x,y)\in bin(i,j)} z_{i,j,c^*}(x+x_0,y+y_0)/n, \end{aligned} $$
(14)
where \(z_{i,j,c^*}\) is one of the k × k × (c + 1) feature maps, (x0, y0) denotes the top-left corner of the ROI proposal, n is the number of pixels in bin(i, j), and \(r_{c^*}(i,j)\) is the pooled response of bin(i, j) for the c∗-th category. After that, average pooling or max-pooling is conducted to output the scores for object classification. For box regression, sibling 4 × k × k convolutional feature maps are also generated, and the position-sensitive ROI pooling operation is conducted to construct four feature maps of size k × k. Finally, the 4-d vector of box parameters is calculated by average voting. For example, if k is set to 3, the position information is encoded as {top − left, top − center, …, bottom − right}.
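A naive version of the position-sensitive pooling in Eq. (14) is sketched below: bin (i, j) of a proposal averages only over the score maps dedicated to that bin. This is a simplified sketch (integer bin boundaries, a single ROI), not the implementation used in R-FCN.

```python
import numpy as np

def psroi_pool(score_maps, roi, k, num_classes):
    """Position-sensitive ROI pooling.
    score_maps: (k*k*(num_classes+1), H, W); roi: (x1, y1, x2, y2) in map coords.
    Returns per-class scores of shape (num_classes + 1, k, k)."""
    x1, y1, x2, y2 = [int(round(v)) for v in roi]
    bin_w = max((x2 - x1), k) / float(k)
    bin_h = max((y2 - y1), k) / float(k)
    out = np.zeros((num_classes + 1, k, k))
    for i in range(k):          # row of the bin grid
        for j in range(k):      # column of the bin grid
            ys = slice(int(y1 + i * bin_h), int(y1 + (i + 1) * bin_h))
            xs = slice(int(x1 + j * bin_w), int(x1 + (j + 1) * bin_w))
            for c in range(num_classes + 1):
                # Bin (i, j) reads only its own bank of score maps.
                m = score_maps[(i * k + j) * (num_classes + 1) + c]
                out[c, i, j] = m[ys, xs].mean()
    return out

maps = np.random.rand(3 * 3 * 21, 40, 60)       # k = 3, 20 classes + background
scores = psroi_pool(maps, (10, 5, 34, 29), k=3, num_classes=20)
# Average voting over the k x k grid gives the classification scores.
print(scores.mean(axis=(1, 2)).shape)           # (21,)
```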
Fig. 8

The architecture of R-FCN. Position information is encoded into the network by position-sensitive ROI pooling (PSROI)

The loss of R-FCN is similar to that of Faster RCNN. Note that the proposal location predictions per category in Faster RCNN are based on different outputs of the box regression layer, while all the proposal location predictions in R-FCN share the same output of the box regression layer. R-FCN is a fully convolutional network, which shares almost all the CNN computation over the whole image. Thus, it can achieve competitive detection accuracy and faster detection speed compared to Faster RCNN.

Most object detection methods only predict object locations as bounding boxes and do not provide more accurate segmentation information. In recent years, some researchers have studied instance segmentation, which usually combines object detection and segmentation. Mask RCNN is a famous method for instance segmentation and object detection. Figure 9 shows the architecture of Mask RCNN. Mask RCNN incorporates instance segmentation and object detection into a unified framework based on the Faster RCNN architecture. Specifically, it adds an extra mask branch to predict the object mask alongside the branches for object classification and box regression. The mask branch outputs c binary masks of size m × m, where c is the number of object categories. The multitask loss of Mask RCNN on each sampled ROI is the joint of the classification loss, the regression loss, and the mask loss. It can be represented as L = Lcls + Lreg + Lmask. The losses Lcls and Lreg are the same as those of Faster RCNN. For an ROI proposal associated with ground-truth class c, Lmask is only defined on the c-th mask, and the other mask outputs do not contribute to the loss. This design allows the network to generate masks for every class without competition among classes. In the test stage, the output mask of an object is determined by the predicted category of the classification branch. To extract a small feature map for each ROI, ROI pooling quantizes the floating-point ROI location into discrete values. This quantization causes a misalignment between the input and the output, which has a negative effect on instance segmentation. To solve this problem, ROIAlign is proposed. It uses bilinear interpolation to compute the feature values at four regularly sampled locations in each spatial bin and then aggregates the feature response of each bin by max-pooling. Based on multitask learning, Mask RCNN achieves state-of-the-art performance on object detection and instance segmentation, which shows that combining the instance segmentation task with the object detection task can also help improve detection performance.
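The key difference between ROI pooling and ROIAlign is that ROIAlign keeps floating-point bin boundaries and reads feature values by bilinear interpolation at a few regularly placed sample points inside each bin. The sketch below illustrates this for one bin; the choice of four sample points and max aggregation follows the description above, but it is only an approximation of the actual layer.

```python
import numpy as np

def bilinear(feat, y, x):
    """Bilinearly interpolate a (H, W) feature map at a float location (y, x)."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, feat.shape[0] - 1), min(x0 + 1, feat.shape[1] - 1)
    dy, dx = y - y0, x - x0
    return (feat[y0, x0] * (1 - dy) * (1 - dx) + feat[y0, x1] * (1 - dy) * dx +
            feat[y1, x0] * dy * (1 - dx) + feat[y1, x1] * dy * dx)

def roialign_bin(feat, y1, x1, y2, x2, samples=2):
    """Pool one bin [y1, y2) x [x1, x2) (float coordinates) with samples x samples
    regularly placed bilinear samples, aggregated by max."""
    vals = []
    for i in range(samples):
        for j in range(samples):
            sy = y1 + (i + 0.5) * (y2 - y1) / samples
            sx = x1 + (j + 0.5) * (x2 - x1) / samples
            vals.append(bilinear(feat, sy, sx))
    return max(vals)

feat = np.random.rand(32, 32)
# No rounding of the bin boundaries: the quantization misalignment is avoided.
print(roialign_bin(feat, 3.7, 5.2, 7.1, 9.8))
```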
Fig. 9

The architecture of Mask RCNN. Apart from detection branch of Faster RCNN, the extra branch of mask segmentation is added. (Ⓒ[2017] IEEE. Reprinted, with permission, from Ref. [27])

Some other improvements of two-stage methods have also been proposed. Yang et al. [67] proposed a "divide and conquer" solution based on Fast RCNN, which first rejects the background from the candidate proposals and then uses c trained category-specific networks to judge the proposal category. Based on the classification loss of proposals, Shrivastava et al. [55] proposed online hard example mining to select hard samples. Gidaris et al. [22] proposed LocNet to improve localization accuracy, which localizes the bounding box via probabilities on each row and column of a given region. Bell et al. [1] proposed to use contextual information and skip-layer pooling to improve detection performance. Kong et al. [31] proposed to use skip-layer connections for object detection. Zagoruyko et al. [70] proposed to use multiple branches to extract object context at multiple resolutions. Yu et al. [69] proposed dilated residual networks to enlarge the resolution of the output layer without reducing the receptive field.

2.2 One-Stage Methods for Deep Object Detection

Different from the multistage process of two-stage methods, one-stage methods aim to simultaneously predict object category and object location. Compared to two-stage methods, one-stage methods have much faster detection speed and comparable detection accuracy. Among the one-stage methods, OverFeat [54], YOLO [49], SSD [38], and RetinaNet [36] are the representative methods.

YOLO [49] divides the input image into k × k grid cells. Each grid cell predicts B bounding boxes with objectness scores and c conditional class probabilities. The prediction of each bounding box is (x, y, w, h, s), where (x, y, w, h) gives the location of the bounding box and s is its objectness confidence score. Thus, the output layer has the size k × k × (5B + c). Based on the bounding box predictions and the corresponding class predictions, YOLO can simultaneously give the object probability and the category probability of each cell. Figure 10 shows the architecture of YOLO. It consists of 24 convolutional layers and 2 fully connected layers. To accelerate detection, alternating 1 × 1 convolutional layers are used in some middle layers. The input image size of YOLO is fixed (i.e., 448 × 448). On PASCAL VOC, B is set to 2 and c is 20, so the output layer for PASCAL VOC has the size 7 × 7 × 30. The loss of YOLO is the joint of the classification loss, the localization loss, and the confidence loss.
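The output layout above can be made concrete with a short sketch that splits a k × k × (5B + c) tensor into boxes, objectness scores, and class probabilities (k = 7, B = 2, c = 20 for PASCAL VOC). It only illustrates the tensor bookkeeping, not the full YOLO post-processing.

```python
import numpy as np

def decode_yolo_output(output, B=2, C=20):
    """Split a (k, k, 5B + C) YOLO output into boxes, objectness, class scores."""
    k = output.shape[0]
    boxes = output[..., :5 * B].reshape(k, k, B, 5)  # (x, y, w, h, s) per box
    class_probs = output[..., 5 * B:]                # conditional class probabilities
    xywh = boxes[..., :4]
    objectness = boxes[..., 4]
    # Class-specific confidence of each box: objectness * conditional class prob.
    scores = objectness[..., None] * class_probs[:, :, None, :]
    return xywh, objectness, scores

out = np.random.rand(7, 7, 30)              # PASCAL VOC: 7 x 7 x (5*2 + 20)
xywh, obj, scores = decode_yolo_output(out)
print(xywh.shape, obj.shape, scores.shape)  # (7,7,2,4) (7,7,2) (7,7,2,20)
```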
Fig. 10

The architecture of YOLO. (Ⓒ[2016] IEEE. Reprinted, with permission, from Ref. [49])

In the training stage, the first 20 convolutional layers of YOLO are first pre-trained, and the whole network is then fine-tuned on the object detection dataset. YOLO uses a sum-squared error for optimization. Because many grid cells do not contain objects, their gradients would overwhelm those from the cells that do contain objects. Thus, the weight of the box location prediction loss is increased, while the weight of the confidence loss for boxes that do not contain objects is decreased. Each bounding box is assigned to the ground truth with which it has the highest IoU overlap.

SSD [38] is a single-shot detector for generic object detection. The base network is VGG16 [56]. Following the base network, several convolutional layers are added to generate additional convolutional layers (i.e., C6–C11) of different resolutions. SSD then uses multiple convolutional layers of different resolutions to predict objects of different scales. Specifically, for the output convolutional layer of a given resolution, it first uses a 3 × 3 convolutional filter to generate new feature maps and then predicts object category scores and object locations on these feature maps. Figure 11 shows the architecture of SSD. Assuming that the number of object classes is c and each feature map location predicts k boxes, this results in a (c + 4) × k × m × n output for a given m × n feature map. The objective loss of SSD is a weighted sum of the localization loss and the confidence loss, similar to that of Fast RCNN.
Fig. 11

The architecture of SSD. The base network is VGG16. (Reprinted from Ref. [38], with permission of Springer)

For multi-scale object detection based on different convolutional layers, the anchor scale for each convolutional layer is computed as:
$$\displaystyle \begin{aligned} s_k=s_{min}+\frac{s_{max}-s_{min}}{K-1}(k-1), ~k\in [1,K], \end{aligned} $$
(15)
where smin and smax are 0.2 and 0.9 and K is the number of convolutional layers used for prediction. The aspect ratios of the anchors are set as {1, 2, 3, 1∕2, 1∕3}. For the aspect ratio of 1, an anchor with the extra scale of \(\sqrt {s_k*s_{k+1}}\) is also added. Thus, there are six default boxes per feature map location. A default anchor is labelled as positive for the matched class if it has the highest Jaccard overlap with a ground truth or its Jaccard overlap with a ground truth is over 0.5. Namely, SSD can predict high scores for multiple overlapping anchors. Generally, there is a significant imbalance between negative and positive training samples, so a bootstrap technique is used: the negative anchors are sorted by their confidence loss, and only the top negatives are kept so that the ratio of negatives to positives is about 3:1. Compared to YOLO, SSD makes full use of multi-scale information. Thus, it has a faster detection speed and higher detection accuracy.
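Equation (15) and the extra box for the aspect ratio of 1 can be reproduced in a few lines. With smin = 0.2, smax = 0.9, and K = 6 prediction layers, the sketch below lists the default box widths and heights (relative to the image size) at each layer; the scale s_{K+1} used for the last extra box is assumed to be 1.0 here.

```python
import math

def ssd_default_boxes(K=6, s_min=0.2, s_max=0.9, ratios=(1, 2, 3, 1/2, 1/3)):
    """Return, for each prediction layer k, the (w, h) of its default boxes."""
    scales = [s_min + (s_max - s_min) * (k - 1) / (K - 1) for k in range(1, K + 1)]
    scales.append(1.0)  # assumed s_{K+1}, used only for the extra ratio-1 box
    boxes_per_layer = []
    for k in range(K):
        boxes = [(scales[k] * math.sqrt(r), scales[k] / math.sqrt(r)) for r in ratios]
        # Extra box with aspect ratio 1 and scale sqrt(s_k * s_{k+1}).
        s_extra = math.sqrt(scales[k] * scales[k + 1])
        boxes.append((s_extra, s_extra))
        boxes_per_layer.append(boxes)   # six default boxes per location
    return boxes_per_layer

for k, boxes in enumerate(ssd_default_boxes(), start=1):
    print(f"layer {k}: s_k = {boxes[0][0]:.2f}, {len(boxes)} default boxes")
```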
Though one-stage methods (e.g., YOLO and SSD) have a faster detection speed than two-stage methods, most state-of-the-art methods for object detection are still two-stage. Lin et al. [36] investigated why one-stage methods have inferior detection performance compared to two-stage methods and found that the extreme imbalance between positives and negatives is the main cause. To solve this imbalance, a bootstrap technique is usually used to choose some hard negatives. However, this ignores the information of the easy negatives. If all the negatives are used, the accumulated loss of the easy negatives dominates, which degrades detection performance. To solve this problem, RetinaNet is proposed. RetinaNet adopts the focal loss for object detection based on the FPN architecture. The focal loss can also be seen as a bootstrap technique. In the training stage, the focal loss FL(pt) adds a modulating factor to the cross-entropy loss as follows:
$$\displaystyle \begin{aligned} FL(p_t)=-(1-p_t)^\gamma \log (p_t) \end{aligned} $$
(16)
where
$$\displaystyle \begin{aligned} p_t= \begin{cases} p,&\text{if {$c=1$}}, \\ 1-p,&\text{otherwise}, \end{cases} \end{aligned} $$
(17)
where p is the predicted probability of the anchor belonging to class c = 1 and γ > 0. This factor enlarges the relative weight of hard samples and reduces the weight of easy samples. Namely, training pays more attention to the hard negatives, and the influence of the easy negatives is reduced. Moreover, an α-balanced variant of the focal loss is further proposed as follows:
$$\displaystyle \begin{aligned} FL(p_t)=-\alpha(1-p_t)^\gamma \log (p_t), \end{aligned} $$
(18)
where α belongs to [0,1]. The α-balanced variant of the focal loss can further improve detection accuracy.
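A direct transcription of the α-balanced focal loss in Eq. (18) is shown below; the printed values illustrate how an easy, well-classified example is down-weighted much more strongly than a hard one. The settings γ = 2 and α = 0.25 are the commonly reported choices and are used here as assumptions.

```python
import numpy as np

def focal_loss(p, target, gamma=2.0, alpha=0.25):
    """Alpha-balanced focal loss of Eq. (18) for binary labels (target in {0, 1})."""
    p_t = np.where(target == 1, p, 1.0 - p)
    alpha_t = np.where(target == 1, alpha, 1.0 - alpha)
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t + 1e-12)

def cross_entropy(p, target):
    p_t = np.where(target == 1, p, 1.0 - p)
    return -np.log(p_t + 1e-12)

p = np.array([0.95, 0.6])        # easy positive vs. hard positive
t = np.array([1, 1])
print(cross_entropy(p, t))       # [0.051 0.511]
print(focal_loss(p, t))          # [3.2e-05 2.0e-02]: the easy example almost vanishes
```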
Figure 12 shows the architecture of RetinaNet. The base network is FPN, which constructs multiple feature maps. For box classification and box regression, four 3 × 3 convolutional layers are attached to the output layers of FPN, respectively. After that, a 3 × 3 convolutional layer with K × C filters is used for box classification, and a 3 × 3 convolutional layer with 4 × K filters is used for box regression. Namely, each position of each layer has K anchors, and there are C classes. An anchor is assigned to the matched class if its highest IoU overlap with a ground truth is over 0.5, and it is assigned to the background if its highest IoU overlap with all ground truths is below 0.4; the other anchors are ignored in the training stage. The anchors have areas of 32 × 32 to 512 × 512 on pyramid levels P3 to P7. The aspect ratios for each feature map are set as {1∕2, 1, 2}. The focal loss can make full use of all the negatives in the training stage, while the easy negatives have a relatively small influence. Compared to SSD, FPN uses a top-down structure to enhance the semantic level of the high-resolution, low-level convolutional layers.
Fig. 12

The architecture of RetinaNet. The base network is FPN with ResNet. (Ⓒ[2017] IEEE. Reprinted, with permission, from Ref. [36])

Some other one-stage methods have also been proposed. Najibi et al. [42] proposed to initially divide the input image into multi-scale regular grids and iteratively update the locations of these grids toward the objects. Based on SSD, Ren et al. [50] further proposed to use recurrent rolling convolution to add deep context information. Fu et al. [20] proposed to use deconvolutional layers after SSD to incorporate object context.

3 Pedestrian Detection

As a canonical and important case of object detection, pedestrian detection can be applied to many areas (autonomous driving, human-computer interaction, video surveillance, and robotics). In the past 10 years, pedestrian detection has also achieved great success [3]. Figure 13 shows the progress of pedestrian detection on the Caltech pedestrian dataset. According to whether CNN is used, pedestrian detection methods can be mainly classified into two classes: handcrafted feature-based methods and CNN-based methods.
Fig. 13

The progress of pedestrian detection on Caltech pedestrian dataset

3.1 Handcrafted Feature-Based Methods for Pedestrian Detection

In 2004, Viola and Jones [61] proposed robust real-time face detection, which uses cascaded AdaBoost to learn a strong classifier from a candidate pool of Haar features. In 2005, Dalal and Triggs [10] proposed histograms of oriented gradients (HOG) for pedestrian detection, which dramatically improved pedestrian detection performance. Inspired by these two ideas, Dollár et al. [14] integrated them for pedestrian detection in what is called integral channel features (ICF). The top row of Fig. 14 shows the architecture of ICF. It first converts the original input image into ten feature channels (i.e., HOG+LUV), then extracts local pixel-sum features in each channel, and finally learns a strong classifier from the candidate feature pool by cascaded AdaBoost. HOG+LUV consists of six gradient orientation histogram channels, one gradient magnitude channel, and three LUV color channels. The local pixel-sum features are efficiently calculated using integral images. To accelerate detection speed, Dollár et al. [16] further proposed aggregated channel features (ACF). ACF downsamples the original feature channels (i.e., HOG+LUV) by a factor of 4 and uses every single pixel value in each channel as a candidate feature. After that, several variants of ICF were proposed. To avoid the randomness of candidate features in ICF, SquaresChnFtrs [2] deterministically generates candidate features by calculating the pixel-sum features of all squares inside each channel. To reduce the local correlation in each feature channel, LDCF [43] convolves filters learned by PCA with the feature channels (i.e., HOG+LUV) to generate new feature maps.
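The local pixel-sum features of ICF rely on integral images, so the sum over any rectangle in a channel costs only four lookups. A minimal sketch is given below, where `channel` stands for any one of the ten HOG+LUV channels.

```python
import numpy as np

def integral_image(channel):
    """Integral image with a zero first row/column for easy indexing."""
    ii = np.zeros((channel.shape[0] + 1, channel.shape[1] + 1))
    ii[1:, 1:] = channel.cumsum(axis=0).cumsum(axis=1)
    return ii

def rect_sum(ii, y1, x1, y2, x2):
    """Pixel sum over rows [y1, y2) and columns [x1, x2) in O(1)."""
    return ii[y2, x2] - ii[y1, x2] - ii[y2, x1] + ii[y1, x1]

channel = np.random.rand(128, 64)        # one of the HOG+LUV channels
ii = integral_image(channel)
assert np.isclose(rect_sum(ii, 10, 8, 30, 24), channel[10:30, 8:24].sum())
```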
Fig. 14

Integral channel features (ICF) and filtered channel features (FCF). (Ⓒ[2015] IEEE. Reprinted, with permission, from Ref. [73])

By observing and summarizing ICF and its variants, Zhang et al. [73] generalized these methods into a unified framework called filtered channel features (FCF). The bottom row of Fig. 14 shows the pipeline of FCF. It first converts the input image to the HOG+LUV channels, then convolves a filter bank with HOG+LUV to generate new feature channels in which each pixel value is used as a candidate feature, and finally learns a strong classifier with a decision forest from the candidate feature pool. The filter bank can be of different types, including SquaresChnFtrs filters, Checkerboards filters, Random filters, Informed filters, and so on. Thus, ICF [14], ACF [16], SquaresChnFtrs [2], and LDCF [43] can be seen as special cases of FCF. It was found that, with Checkerboards filters, FCF can outperform ICF, ACF, SquaresChnFtrs, and LDCF on the Caltech pedestrian dataset [15] and the KITTI benchmark [21].

The above methods achieved great success on pedestrian detection. However, the design of handcrafted features does not consider the inherent attributes of pedestrians. Zhang et al. [72] treated pedestrians as three parts (i.e., head, upper body, and legs) and designed Haar-like features (i.e., InformedHaar) for pedestrian detection. However, this still does not make full use of pedestrian attributes. To make better use of pedestrian attributes, Cao et al. [6, 7] further proposed two non-neighboring features inspired by the appearance constancy and shape symmetry of pedestrians.

Generally, a pedestrian can be seen as three parts: head, upper body, and legs. Usually, the appearance of these parts is constant and contrasts with the surrounding background. Based on appearance constancy, side-inner difference features (SIDF) are proposed. Figure 15 gives an illustration of SIDF. Patch B is randomly sampled between patch A and the symmetrical patch A′. The height of patch B is the same as that of patch A, while the width of patch B can differ from that of patch A. The direction of patch A can be horizontal or vertical, and the size of patch A can vary arbitrarily within a maximum of 8 × 8 cells, where each cell is 2 × 2 pixels. SIDF (i.e., f(A, B)) calculates the difference between a local patch in the background (i.e., patch A) and a local patch on the pedestrian (i.e., patch B) as:
$$\displaystyle \begin{aligned} f(A,B)=\frac{S_A}{N_A}-\frac{S_B}{N_B}, \end{aligned} $$
(19)
where SA and SB are, respectively, the pixel sums of patch A and patch B, and NA and NB are, respectively, the numbers of pixels in patch A and patch B.
Fig. 15

Illustration of side-inner difference features (SIDF). SIDF calculates the difference of patch A and patch B, where patch B is randomly sampled between patch A and the symmetrical patch A′

Meanwhile, pedestrians usually appear upright. Thus, the shape of a pedestrian is loosely symmetrical in the horizontal direction. Based on shape symmetry, symmetrical similarity features (SSF) are proposed. Figure 16 shows SSF. Patch A and patch A′ are two symmetrical patches on the pedestrian. The direction of patch A can be horizontal or vertical, and the size of patch A can be changed from 6 × 6 cells to 12 × 12 cells, where each cell is 2 × 2 pixels. SSF (i.e., f(A, A′)) calculates the similarity between the two symmetrical patches (i.e., patch A and patch A′) of the pedestrian as follows:
$$\displaystyle \begin{aligned} f(A,A^{\prime})=|f_A-f_{A^{\prime}}| = |\frac{S_A}{N_A}-\frac{S_{A^{\prime}}}{N_{A^{\prime}}}|, \end{aligned} $$
(20)
where fA and \(f_{A^{\prime }}\) represent the features of patches A and A′. Because the shape symmetry of pedestrians is not very strict, a max-pooling technique is further incorporated to improve the robustness of the features. Specifically, the feature value fM(A) of patch A is represented by the maximum feature of three sub-patches of patch A (i.e., patches A1, A2, and A3) as follows:
$$\displaystyle \begin{aligned} f_M(A)=\max_{i=1,2,3}{\frac{S_i}{N_i}}. \end{aligned} $$
(21)
The three sub-patches are randomly sampled in patch A or patch A′. Thus, they can have different aspect ratios, positions, and sizes. Based on this, SSF (i.e., f(A, A′)) can be rewritten as:
$$\displaystyle \begin{aligned} f(A,A^{\prime})=|f_M(A)-f_M(A^{\prime})|. \end{aligned} $$
(22)
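A simplified sketch of the two non-neighboring features is given below, assuming all patches are axis-aligned rectangles in a single feature channel and using per-patch means for the S∕N terms of Eqs. (19)–(22). Patch sampling, cell sizes, and channel pooling are omitted.

```python
import numpy as np

def patch_mean(channel, y1, x1, y2, x2):
    """Mean pixel value of one rectangular patch (S / N in Eqs. (19)-(21))."""
    return channel[y1:y2, x1:x2].mean()

def sidf(channel, patch_a, patch_b):
    """Side-inner difference feature, Eq. (19): mean(A) - mean(B)."""
    return patch_mean(channel, *patch_a) - patch_mean(channel, *patch_b)

def ssf(channel, subpatches_a, subpatches_a_prime):
    """Symmetrical similarity feature with max-pooling, Eqs. (21)-(22):
    |max over sub-patches of A - max over sub-patches of A'|."""
    f_a = max(patch_mean(channel, *p) for p in subpatches_a)
    f_ap = max(patch_mean(channel, *p) for p in subpatches_a_prime)
    return abs(f_a - f_ap)

channel = np.random.rand(128, 64)   # one feature channel of a detection window
print(sidf(channel, (40, 2, 72, 10), (40, 20, 72, 30)))
print(ssf(channel, [(10, 4, 30, 18), (12, 6, 28, 16), (14, 8, 26, 14)],
                   [(10, 46, 30, 60), (12, 48, 28, 58), (14, 50, 26, 56)]))
```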
Fig. 16

Illustration of symmetrical similarity features (SSF). SSF abstracts the similarity between patch A and the symmetrical patch A′. Each patch is represented by three random sub-patches

Because SIDF and SSF are calculated from non-neighboring patches, they are called non-neighboring features (NNF). To achieve state-of-the-art detection performance, neighboring features (NF) are also designed for pedestrian detection. Figure 17 shows the neighboring features, which contain local mean features and neighboring difference features. In Fig. 17a, the sizes and aspect ratios of the local mean features can be varied. In Fig. 17b, the partition line is where two neighboring patches meet; its direction can be horizontal or vertical, and its location can also vary.
Fig. 17

Illustration of neighboring features. (a) is the local mean feature, and (b) is the neighboring difference feature

Based on the Caltech2x training set, detectors that select features from NF, NF+SIDF, NF+SSF, or NNNF are trained; 2048 level-2 decision forests are used. Table 1 compares NF, NF+SIDF, NF+SSF, and NNNF. It can be seen that, with the non-neighboring features (SIDF or SSF), NF+SIDF and NF+SSF outperform NF by 1.83% and 2.30%, respectively. When SIDF and SSF are both combined with NF, NNNF outperforms NF by 4.44%. Figure 18 shows the ratios of NF, SSF, and SIDF in NNNF: 69.97% of the selected features are NF, 11.34% are SSF, and 18.69% are SIDF. Namely, about 70% of the features are neighboring features, and about 30% are non-neighboring features. This demonstrates the effectiveness of the proposed non-neighboring features. Some representative features are also shown in Fig. 18.
Fig. 18

Among all the selected features in NNNF, about 30% are non-neighboring features, and 70% are neighboring features. Several representative non-neighboring features and neighboring features are also shown

Table 1

Comparison of log-average miss rates on Caltech test set

Method      MR        Δ MR
NF          27.50%    N/A
NF+SIDF     25.67%    +1.83%
NF+SSF      25.20%    +2.30%
NNNF        23.06%    +4.44%

To compare with the state-of-the-art methods, NNNF with 4096 level-4 decision forests is trained on the Caltech10x training set; this detector is called NNNF-L4. Figure 19 shows the ROC on the Caltech test set. It can be seen that NNNF outperforms Checkerboards [73] by 2.27%. Table 2 shows the average precision (AP) on the KITTI test set. NNNF outperforms Checkerboards by 1.36% on the moderate test set. Namely, among the methods without using CNN, NNNF achieves state-of-the-art performance by combining NNF with neighboring features (NF) on the Caltech pedestrian dataset and the KITTI benchmark.
Table 2

Average precision (AP) of some methods without using CNN on KITTI test set

Method                  Easy      Moderate   Hard
ACF [16]                44.49%    39.81%     37.21%
SquaresChnFtrs [2]      57.33%    44.42%     40.08%
SpatialPooling+ [46]    65.26%    54.49%     48.60%
Checkerboards [73]      67.75%    56.75%     51.12%
NNNF-L4                 69.16%    58.01%     52.77%

Fig. 19

ROC on Caltech test set (reasonable) of some methods without using CNN

3.2 CNN-Based Methods for Pedestrian Detection

With the success of convolutional neural networks in many fields of computer vision, researchers have also explored how to apply CNNs to pedestrian detection to improve detection performance. As an initial attempt, Hosang et al. [28] studied the effectiveness of CNNs for pedestrian detection. The method in [28] first extracts candidate pedestrian proposals using a handcrafted feature-based method (i.e., SquaresChnFtrs [2]) and then uses a small network (i.e., CifarNet) or a large network (i.e., AlexNet [33]) to classify these proposals. Traditional pedestrian detection usually uses a fixed-size detection window (i.e., 128 × 64). Following this, the input size of the CNN (i.e., CifarNet and AlexNet) is changed to 128 × 64, and the number of outputs is changed to 2 (i.e., pedestrian or non-pedestrian). In the training stage, a proposal with an IoU overlap over 0.5 with a ground-truth bounding box is labelled as positive, and a proposal with an IoU below 0.3 with all ground-truth bounding boxes is labelled as negative. The ratio of positives to negatives in the mini-batch is set to 1:5. Experimental results on the Caltech pedestrian dataset demonstrate that ConvNets can achieve state-of-the-art performance and are useful for pedestrian detection.

After that, pedestrian detection based on CNN achieved great success. Sermanet et al. [53] proposed to merge the down-sampled first convolutional layer with the second convolutional layer to add global information for pedestrian detection. Instead of using handcrafted feature channels (HOG+LUV), Yang et al. [66] proposed convolutional channel features (CCF). CCF treats the feature maps of the last convolutional layer as the feature channels and uses decision forests to learn the strong classifier. Because of the richer representation ability of CNN features, CCF achieves state-of-the-art performance on pedestrian detection. Zhang et al. [71] proposed to use RPN to extract candidate proposals and features and a decision forest to classify these proposals. Zhang et al. [74] proposed to improve the performance of Faster RCNN on pedestrian detection by some specific tricks (e.g., quantized RPN scales and an upsampled input image). To reduce the computational complexity, Cai et al. [4] proposed the complexity-aware cascaded detector (i.e., CompACT), which integrates features of different complexities. The complexity-aware cascaded detector aims to achieve the best trade-off between classification accuracy and computational complexity. In the training stage, the loss of CompACT is the joint of the classification loss and a computational complexity loss. As a result, the first few stages of the strong classifier mainly use features of lower computational complexity, and the last few stages use features of higher computational complexity. To further improve detection performance, CNN features of the highest complexity are embedded into the last stage of CompACT, which is called CompACT-Deep. Because most detection proposals are rejected by the first few stages, the CNN features of only very few proposals need to be calculated. Thus, CompACT-Deep can improve detection performance without increasing the computation cost too much.

Some methods exploit extra feature information to help pedestrian detection. Tian et al. [59] proposed to join pedestrian detection with semantic tasks, including scene attributes and pedestrian attributes, in TA-CNN. The attributes are transferred from existing scene datasets, and TA-CNN is learned by iterating between the two tasks. Mao et al. [41] proposed to apply extra features to deep pedestrian detection, where extra feature channels are used as additional input channels for CNN detectors. The extra feature channels contain low-level semantic feature channels (e.g., gradient and edge), high-level semantic feature channels (e.g., segmentation and heatmap), depth channels, and temporal channels. It was found that segmentation and edge channels used as extra inputs can significantly help pedestrian detection. Based on this observation, Mao et al. [41] further proposed HyperLearner. It consists of four modules: the base network, the channel feature network (CFN), the region proposal network, and Fast RCNN. The base network, region proposal network, and Fast RCNN are the same as in the original Faster RCNN. Multiple convolutional layers of the base network first go through two 3 × 3 convolutional layers and are then upsampled to the same size as conv1. After that, the output layers are appended together to generate the aggregated feature maps. The aggregated feature maps are fed into the CFN for channel feature prediction and concatenated with the output layer of the base network. In the training stage, the CFN is supervised by the semantic segmentation ground truth, and the loss of HyperLearner is the joint of the Faster RCNN loss and the segmentation loss. In the test stage, the CFN outputs the predicted feature channel map (e.g., a semantic segmentation map). Before the Fast RCNN subnet, the ROI pooling layer pools over the concatenation of the output layer of the base network and the aggregated feature maps of the CFN. With the help of the extra feature information, HyperLearner improves detection performance.

Generally, CNN-based methods for pedestrian detection have superior detection performance. However, their computational complexity is very high, so these methods are very slow when they run on a common CPU. Meanwhile, in many situations the computing device only has a CPU. Thus, speeding up CNN-based methods on the CPU is very important and necessary. Compared to CNN-based methods, the traditional handcrafted feature-based methods are relatively simple and have a faster detection speed on the CPU. To accelerate detection speed on the common CPU, Cao et al. [8] proposed to integrate the traditional channel features and each layer of a CNN into multilayer feature channels (MCF). Figure 20 shows the architecture of MCF. First, multilayer feature channels are constructed from HOG+LUV and each layer of a CNN (e.g., AlexNet [33] or VGG16 [56]). Based on these feature channels, the candidate features are then extracted. Finally, a multistage cascade AdaBoost is learned from the candidate features of the corresponding layers. On the one hand, based on the handcrafted feature channels and each layer of the CNN, MCF can learn more abundant features to improve detection performance. On the other hand, MCF can quickly reject many negative detection windows and thus reduce the computation of the remaining CNN layers. As a result, it can accelerate detection speed.
Fig. 20

The architecture of MCF. It consists of three parts: multiple feature channels, feature extraction, and multistage cascade AdaBoost

Table 3 shows the specific parameters of the multilayer feature channels based on HOG+LUV and VGG16. They consist of six layers (i.e., L1, L2, …, L6). L1 is the handcrafted feature channels (i.e., HOG+LUV), whose size is 128 × 64. L2–L6 correspond to the convolutional layers of VGG16 (i.e., C1–C5). The sizes of L2–L6 are 64 × 32, 32 × 16, 16 × 8, 8 × 4, and 4 × 2, respectively. The numbers of channels in L1–L6 are 10, 64, 128, 256, 512, and 512, respectively. In fact, only part of the convolutional layers can be used to construct the multilayer feature channels. For example, a five-layer feature channel set can be generated from HOG+LUV and C2–C5 of VGG16, with C1 of VGG16 not used.
Table 3

Multilayer image channels. The first layer is HOG+LUV, and the remaining layers are the convolutional layers (i.e., C1 to C5) in VGG16

Layer   L1         L2        L3        L4       L5      L6
Name    HOG+LUV    C1        C2        C3       C4      C5      (C1–C5 from VGG16)
Size    128 × 64   64 × 32   32 × 16   16 × 8   8 × 4   4 × 2
Num     10         64        128       256      512     512

The feature extraction methods used in the multilayer feature channels are different. For L1 (i.e., HOG+LUV), many successful methods have been proposed (e.g., [6, 16, 43, 72, 73]); ACF [16] and NNNF [6] are selected for feature extraction in L1. ACF [16] has a very fast detection speed, while NNNF [6, 7] has very good detection performance and a relatively fast detection speed. For L2–L6 (the CNN channels), the number of channels is relatively large, so using relatively complex features would incur a large computation cost. Thus, only single pixel values are used as the candidate features in L2–L6 for computational efficiency. Figure 21 shows the feature extraction in L1 and L2–L6.
Fig. 21

Feature extraction in L1 and L2–L6. (a) is NNNF for L1, and (b) is the single pixel feature for L2–L6

Based on the multilayer image channels and the candidate features extracted in each layer, a multistage cascade AdaBoost is used to learn the strong classifier. Rows 2–4 in Fig. 20 illustrate the multistage cascade AdaBoost. The weak classifiers in each stage are learned from the candidate features extracted from the corresponding feature channels. The strong classifier H(x) of the multistage cascade AdaBoost can be written as follows:
$$\displaystyle \begin{aligned} H(\mathbf{x})=\sum_{j=1}^{k_1}{\alpha_1^j h_1^j(\mathbf{x})}+\ldots+\sum_{j=1}^{k_i}{\alpha_i^j h_i^j(\mathbf{x})}+\ldots +\sum_{j=1}^{k_N}{\alpha_N^j h_N^j(\mathbf{x})} =\sum_{i=1}^{N}\sum_{j=1}^{k_i}{\alpha_i^j h_i^j(\mathbf{x})}, \end{aligned} $$
(23)
where x denotes a detection window, \(h_i^j({\mathbf {x}})\) is the j-th weak classifier in stage i, \(\alpha _i^j\) is the weight of \(h_i^j({\mathbf {x}})\), and k1, k2, …, kN are the numbers of weak classifiers in each stage. How to set k1, k2, …, kN is an open problem. Generally speaking, one empirical setting is as follows:
$$\displaystyle \begin{aligned} \begin{aligned} & k_1=N_{All}/2, \\& k_2=k_3=\ldots=k_N=N_{All}/(2\times(N-1)), \end{aligned} \end{aligned} $$
(24)
where NAll is the total number of weak classifiers in the strong classifier, which is usually set as 2048 or 4096.
Figure 22 shows the test process of MCF. In the test stage, the channels of HOG+LUV are first computed. Detection windows are generated by scanning the whole input image. These detection windows are first classified by the classifier of stage S1. For the detection windows accepted by S1, the channels of L2 are then computed, and these windows are classified by S2. The above process is repeated from L1 to LN. Finally, the detection windows accepted by all stages of the strong classifier H(x) are merged by NMS to give the final pedestrian detection result. Generally, detection windows around pedestrians highly overlap and require much computation. Thus, the highly overlapped detection windows are eliminated by NMS with a threshold of 0.8 after the first stage. This can further accelerate detection with little performance loss.
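The multistage test procedure can be sketched as a simple loop over stages: each stage computes its own feature channels only for the windows that survived the previous stages, and a window is rejected as soon as its accumulated score falls below the stage threshold. The functions `compute_channels_for_stage` and `score_stage` are placeholders for the channel computation and the per-stage AdaBoost classifiers; this is an illustration of the control flow, not the MCF implementation.

```python
def mcf_detect(windows, num_stages, compute_channels_for_stage, score_stage,
               thresholds):
    """Multistage cascade test of MCF (simplified sketch).
    windows: list of candidate detection windows.
    compute_channels_for_stage(i, windows): feature channels of stage i (L_i),
        computed only for the surviving windows.
    score_stage(i, channels, window): partial AdaBoost score of stage i.
    thresholds[i]: rejection threshold after stage i."""
    survivors = [(w, 0.0) for w in windows]
    for i in range(num_stages):
        if not survivors:
            break
        channels = compute_channels_for_stage(i, [w for w, _ in survivors])
        updated = []
        for w, score in survivors:
            score += score_stage(i, channels, w)
            if score >= thresholds[i]:       # keep only windows passing stage i
                updated.append((w, score))
        survivors = updated                  # later (CNN) stages see far fewer windows
    return survivors                         # remaining windows go to NMS
```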
Fig. 22

The test process of MCF
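A minimal sketch of this staged test procedure is shown below. The window generator, per-layer channel computation, per-stage classifiers, and NMS routine are passed in as callables; they are assumptions of this sketch rather than the authors' implementation.

```python
def mcf_detect(image, gen_windows, channel_fns, stage_fns, nms, first_stage_iou=0.8):
    """Stage i classifies only the windows that survived stage i-1; the
    channels of layer L_{i+1} are computed just for those survivors.
    Highly overlapping windows are pruned by NMS (IoU 0.8) after stage 1."""
    survivors = gen_windows(image)
    for i, (compute_channels, classify) in enumerate(zip(channel_fns, stage_fns)):
        layer = compute_channels(image)                   # channels of L_{i+1}
        survivors = [w for w in survivors if classify(layer, w)]
        if i == 0:
            survivors = nms(survivors, iou_threshold=first_stage_iou)
    return nms(survivors, iou_threshold=0.5)              # final merge of detections
```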

Table 4 shows the miss rates (MR) of MCF based on HOG+LUV and different layers of VGG16 on the Caltech test set. The detectors are trained on the Caltech10x training set with 4096 level-2 decision trees. A check mark (✓) means that the corresponding layer is used; for example, MCF-2 is constructed from HOG+LUV and C5 of VGG16. It can be seen that, with more convolutional layers, MCF achieves a lower miss rate. MCF-6 has the best detection performance, outperforming MCF-2 and MCF-5 by 4.21% and 0.47%, respectively. This demonstrates that the middle layers of the CNN enrich the feature abstraction: by exploiting each layer of the CNN, MCF extracts more discriminative features for pedestrian detection.
Table 4

Miss rates (MR) of MCF based on HOG+LUV and different layers of VGG16 on the Caltech test set. A check mark (✓) means that the corresponding layer is used. HOG+LUV is always used for the first layer; the remaining layers are taken from VGG16

| Name | HOG+LUV | C1 | C2 | C3 | C4 | C5 | MR (%) | ΔMR (%) |
|-------|---------|----|----|----|----|----|--------|---------|
| MCF-2 | ✓ | | | | | ✓ | 18.52 | N/A |
| MCF-3 | ✓ | | | | ✓ | ✓ | 17.14 | 1.38 |
| MCF-4 | ✓ | | | ✓ | ✓ | ✓ | 15.40 | 3.12 |
| MCF-5 | ✓ | | ✓ | ✓ | ✓ | ✓ | 14.78 | 3.74 |
| MCF-6 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 14.31 | 4.21 |

Table 5 compares the miss rates and detection times of MCF-2, MCF-6, and MCF-6-f. MCF-6-f eliminates the highly overlapping detection windows after stage 1 by NMS with a threshold of 0.8. The miss rate of MCF-6 is 4.21% lower than that of MCF-2, while its detection speed is 1.43 times faster. By eliminating the highly overlapping detection windows after stage 1, MCF-6-f further accelerates detection with little performance loss (i.e., 0.58%). The miss rate of MCF-6-f is still 3.63% lower than that of MCF-2, while its detection speed is 4.07 times faster. Because MCF rejects many detection windows in the first few stages, it accelerates detection.
Table 5

Miss rate (MR) and detection time of MCF-2, MCF-6, and MCF-6-f, all based on HOG+LUV and VGG16. MCF-2 uses HOG+LUV and C5; MCF-6 uses HOG+LUV and C1–C5; MCF-6-f is the fast version of MCF-6 that eliminates highly overlapping detection windows after stage 1

| | MCF-2 | MCF-6 | MCF-6-f |
|----------|-------|-------|---------|
| MR (%) | 18.52 | 14.31 | 14.89 |
| Time (s) | 7.69 | 5.37 | 1.89 |

To compare with the state of the art, MCF is trained on the Caltech10x training set with 4096 level-4 decision trees. Figure 23 compares MCF with some state-of-the-art methods (i.e., LatSvm [19], ACF [16], LDCF [43], Checkerboards [73], CCF+CF [66], DeepParts [58], and CompACT-Deep [4]) on the Caltech test set. MCF achieves state-of-the-art performance, outperforming DeepParts and CompACT-Deep by 1.49% and 1.35%, respectively.
Fig. 23

ROC on Caltech test set (reasonable). MCF is compared with some state-of-the-art methods

Moreover, MCF is further trained on the KITTI training set with 4096 level-4 decision trees. Table 6 compares MCF with some state-of-the-art methods (i.e., ACF [16], SpatialPooling+ [46], Checkerboards [73], DeepParts [58], and CompACT-Deep [4]) on the KITTI test set. MCF also outperforms DeepParts and CompACT-Deep. On the moderate test set, MCF outperforms CompACT-Deep by 0.71%, and on the hard test set by 1.57%.
Table 6

Average precision (AP) of some methods on KITTI

| Method | Easy | Moderate | Hard |
|--------|------|----------|------|
| ACF [16] | 44.49% | 39.81% | 37.21% |
| SpatialPooling+ [46] | 65.26% | 54.49% | 48.60% |
| Checkerboards [73] | 67.75% | 56.75% | 51.12% |
| DeepParts [58] | 70.49% | 58.67% | 52.78% |
| CompACT-Deep [4] | 70.69% | 58.74% | 52.71% |
| MCF | 70.87% | 59.45% | 54.28% |

4 Challenges of Object Detection

Though object detection has achieved great progress in the past decades, many challenges remain. In the following, three common and typical challenges of object detection are discussed, together with some representative solutions.

4.1 Scale Variation Problem

Because the distance between objects and the camera varies, objects of various scales appear in an image; scale variation is therefore an inevitable problem for object detection. The solutions to scale variation can be divided into two main classes: (1) image pyramid-based methods and (2) feature pyramid-based methods. Image pyramid-based methods first resize the original image to multiple scales and then apply the same detector to each rescaled image. Feature pyramid-based methods first generate multiple feature maps of different resolutions from the input image and then use different feature maps to detect objects of different scales.
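The image pyramid side of this contrast can be sketched as follows. The code assumes PyTorch; `detector` is a hypothetical single-scale detector returning boxes and scores, and the scale set is illustrative.

```python
import torch
import torch.nn.functional as F

def detect_with_image_pyramid(image, detector, scales=(0.5, 1.0, 2.0)):
    """Run the same detector on several rescaled copies of the image and map
    the boxes back to the original resolution (the image-pyramid strategy)."""
    all_boxes, all_scores = [], []
    for s in scales:
        resized = F.interpolate(image.unsqueeze(0), scale_factor=s,
                                mode='bilinear', align_corners=False)
        boxes, scores = detector(resized.squeeze(0))
        all_boxes.append(boxes / s)        # undo the rescaling of coordinates
        all_scores.append(scores)
    return torch.cat(all_boxes), torch.cat(all_scores)
```

Every scale requires a full forward pass, which is exactly the cost that feature pyramid-based methods try to avoid.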

At first, deep object detection adopted image pyramids to detect objects of various scales; RCNN [23], SPPnet [25], Fast RCNN [24], and Faster RCNN [52] all do so. In the training stage, the CNN detector is trained on images of a given scale, while in the test stage image pyramids are used for multi-scale detection. On the one hand, this causes an inconsistency between training and test-time inference. On the other hand, each image of the pyramid must be fed through the CNN separately, which is very time-consuming.

In fact, the feature maps of different resolutions in a CNN can be seen as a feature pyramid. If feature maps of different resolutions are used to detect objects of different scales, resizing the input image can be avoided and detection is accelerated. Thus, feature pyramid-based methods have become popular, and researchers have made many attempts in this direction.

Li et al. [34] proposed scale-aware Fast RCNN (SAF RCNN) for pedestrian detection. The base network is split into two sub-networks, one for large-scale and one for small-scale pedestrian detection. Given a detection window, the final detection score is a weighted sum of the two sub-network scores: if the detection window is relatively large, the large-scale sub-network receives the larger weight; if it is relatively small, the small-scale sub-network receives the larger weight.
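One plausible form of such a scale-aware fusion is sketched below; the sigmoid parameterization, reference height, and slope are illustrative assumptions and not necessarily the exact weighting used in [34].

```python
import math

def scale_aware_score(score_large, score_small, box_height,
                      ref_height=100.0, alpha=0.1):
    """Weighted fusion of the two sub-network scores: taller windows trust
    the large-scale branch more, shorter windows the small-scale branch.
    ref_height and alpha are illustrative values, not those of [34]."""
    w = 1.0 / (1.0 + math.exp(-alpha * (box_height - ref_height)))
    return w * score_large + (1.0 - w) * score_small

print(scale_aware_score(0.9, 0.4, box_height=160))  # dominated by the large branch
print(scale_aware_score(0.9, 0.4, box_height=50))   # dominated by the small branch
```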

Yang et al. [68] proposed scale-dependent pooling (SDP) to handle the scale variation problem in multi-scale object detection. It is based on the Fast RCNN architecture, with proposals extracted by the selective search method [60]. SDP pools the features of a proposal from different convolutional layers according to the height of the proposal: if the height lies in [0, 64], the feature maps of the third convolutional block are pooled; if it lies in [64, 128], the fourth convolutional block is used; and if it lies in [128, +inf), the fifth convolutional block is used. Because the feature maps fed to the ROI pooling layer come from different convolutional layers, three different subnets for classifying and locating proposals are trained, respectively.
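The layer-selection rule itself is simple enough to state as a few lines of code; the block names are generic labels for the third to fifth convolutional blocks rather than identifiers from the SDP implementation.

```python
def sdp_pooling_layer(proposal_height):
    """Scale-dependent pooling: choose which convolutional block to pool
    ROI features from, based on the proposal height in pixels."""
    if proposal_height < 64:
        return 'conv3'      # heights in [0, 64)
    elif proposal_height < 128:
        return 'conv4'      # heights in [64, 128)
    else:
        return 'conv5'      # heights in [128, +inf)

print(sdp_pooling_layer(90))   # 'conv4'
```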

Generally, Faster RCNN extracts proposals by sliding the RPN on a fixed convolutional layer (e.g., conv5_3 of VGG16). Because the receptive field of a single convolutional layer is fixed, it cannot match the sizes of all objects well: the receptive field of an earlier convolutional layer is relatively small and matches small-scale objects better, while that of a later layer is relatively large and matches large-scale objects better. To solve this problem, Cai et al. [5] proposed the multi-scale deep convolutional neural network (MS-CNN) to generate object proposals of different scales. Figure 24 shows the architecture of MS-CNN. It outputs proposals from multiple convolutional layers of different resolutions. An anchor in a given convolutional layer is labelled as positive if its IoU overlap with a ground-truth bounding box exceeds 0.5, and as negative if its IoU overlap with every ground-truth bounding box is below 0.2. Because of the imbalance between negatives and positives, three sampling strategies are explored: random, bootstrapping, and mixture. Random sampling selects negatives at random; bootstrapping selects the hardest negatives according to their objectness scores; mixture sampling selects half of the negatives randomly and half by bootstrapping. Bootstrapping and mixture sampling are found to have similar detection performance and to outperform random sampling. The ratio of positives to negatives is 1:R.
Fig. 24

The architecture of MS-CNN. (Reprinted from Ref. [5], with permission of Springer)
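The anchor labelling and negative-sampling strategies just described can be sketched as follows; array shapes and the assumption that at least one ground-truth box and enough negatives exist are simplifications of this illustration, not details from [5].

```python
import numpy as np

def label_anchors(ious, pos_thr=0.5, neg_thr=0.2):
    """MS-CNN-style labelling: positive if IoU with some ground truth exceeds
    0.5, negative if IoU with every ground truth is below 0.2, else ignored.
    `ious` has shape (num_anchors, num_ground_truths), assumed non-empty."""
    max_iou = ious.max(axis=1)
    labels = np.full(ious.shape[0], -1)        # -1: ignored
    labels[max_iou > pos_thr] = 1              # positive
    labels[max_iou < neg_thr] = 0              # negative
    return labels

def sample_negatives(neg_scores, n_needed, strategy="mixture", rng=np.random):
    """The three negative-sampling strategies explored in MS-CNN; `neg_scores`
    are the objectness scores of the negative anchors, indices are returned."""
    idx = np.arange(len(neg_scores))
    if strategy == "random":
        return rng.choice(idx, n_needed, replace=False)
    hardest = idx[np.argsort(-neg_scores)]     # highest objectness = hardest
    if strategy == "bootstrapping":
        return hardest[:n_needed]
    half = n_needed // 2                       # mixture: half hard, half random
    remaining = np.setdiff1d(idx, hardest[:half])
    return np.concatenate([hardest[:half],
                           rng.choice(remaining, n_needed - half, replace=False)])
```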

Despite the success of MS-CNN, its output layers have different semantic levels. To address the semantic inconsistency between convolutional layers, Lin et al. [37] proposed the feature pyramid network (FPN) for object detection. FPN incorporates a top-down structure to raise the semantic level of the earlier convolutional layers. Figure 25 shows the architecture of FPN. Specifically, the top-down structure combines each low-level-semantics convolutional layer with the upsampled, high-level-semantics layer above it by element-wise addition. As a result, the semantics of the output layers used for proposal generation are high-level and consistent. The final feature maps used for predicting proposals are called {P2, P3, P4, P5}, corresponding to the convolutional layers {C2, C3, C4, C5}. The anchors have areas of {32 × 32, 64 × 64, 128 × 128, 256 × 256, 512 × 512} pixels on {P2, P3, P4, P5, P6}, respectively, where P6 is a stride-2 subsampling of P5 used only for proposal generation. The aspect ratios of the anchors are {1 : 2, 1 : 1, 2 : 1}, giving 15 anchors over the pyramid. An anchor is labelled as positive if it has the highest IoU overlap with a ground-truth box or an IoU overlap over 0.7 with any ground-truth box, and as negative if its IoU overlap with every ground-truth box is below 0.3. With consistent feature maps for detection, FPN improves detection performance, especially for small-scale objects. A similar idea is adopted in [32].
Fig. 25

The architecture of FPN. (Ⓒ[2017] IEEE. Reprinted, with permission, from Ref. [37])
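The top-down merge at the heart of FPN can be sketched as a small module. This is a sketch assuming PyTorch; the 1 × 1 lateral convolution, element-wise addition with the upsampled top-down map, and 3 × 3 smoothing follow the FPN design, while the channel widths in the usage example are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownMerge(nn.Module):
    """One FPN merge step: a 1x1 lateral convolution on the bottom-up map C_i
    is added element-wise to the 2x-upsampled map P_{i+1} from the level above,
    then smoothed with a 3x3 convolution to give P_i."""
    def __init__(self, c_channels, out_channels=256):
        super().__init__()
        self.lateral = nn.Conv2d(c_channels, out_channels, kernel_size=1)
        self.smooth = nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)

    def forward(self, c_i, p_above):
        upsampled = F.interpolate(p_above, size=c_i.shape[-2:], mode='nearest')
        return self.smooth(self.lateral(c_i) + upsampled)

# Example: merge a 512-channel C4 map with the coarser 256-channel P5 map.
merge = TopDownMerge(c_channels=512)
p4 = merge(torch.randn(1, 512, 40, 40), torch.randn(1, 256, 20, 20))
print(p4.shape)   # torch.Size([1, 256, 40, 40])
```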

Because small-scale objects are usually of low resolution and noisy, detecting them is more challenging than detecting large-scale objects. Though multi-scale methods treat objects of different scales as different classes, the improvement is relatively limited; improving small-scale object detection is therefore key to multi-scale object detection. To address this, Li et al. [35] proposed the perceptual generative adversarial network (Perceptual GAN). Based on the difference between the feature representations of small-scale and large-scale objects, Perceptual GAN aims to compensate the feature representations of small-scale objects.

Generally, large-scale objects carry more abundant information than small-scale objects. To improve small-scale pedestrian detection, Pang et al. [47] proposed JCS-Net, whose main idea is to use large-scale pedestrians to help small-scale pedestrian detection. Figure 26 shows the pipeline of JCS-Net. The training process for small-scale pedestrians can be summarized as follows: (1) the network for large-scale pedestrian detection is first fine-tuned on large-scale pedestrians (i.e., the top row of Fig. 26); (2) the super-resolution sub-network for small-scale pedestrian detection (i.e., the left of the bottom row of Fig. 26) is pre-trained on large-scale pedestrians and the corresponding small-scale pedestrians; (3) the classification sub-network for small-scale pedestrian detection (i.e., the right of the bottom row of Fig. 26) is initialized from the large-scale network; (4) the super-resolution and classification sub-networks are jointly trained on small-scale pedestrians and the corresponding negatives.
Fig. 26

The architecture of JCS-Net

The loss of JCS-Net combines the losses of the two sub-networks. The loss of the super-resolution sub-network is the mean squared error:
$$\displaystyle \begin{aligned} L_{similarity} = \frac{1}{n}\sum_{i=1}^{n}||{\mathbf{y}}_i-F({\mathbf{x}}_i)||{}^2, \end{aligned} $$
(25)
where yi is the large-scale pedestrian, xi is the corresponding small-scale pedestrian, F(xi) is the pedestrian reconstructed by the super-resolution sub-network, and n is the number of training samples. The classification loss can be written as:
$$\displaystyle \begin{aligned} L_{cls} = \frac{1}{n}\sum_{i=1}^{n}-\log p_c({\mathbf{x}}_i), \end{aligned} $$
(26)
where c is the ground-truth label of xi and pc(xi) is the probability that xi belongs to class c. Based on Lsimilarity and Lcls, the joint loss of JCS-Net (i.e., LJCS) can be expressed as:
$$\displaystyle \begin{aligned} L_{JCS} = L_{cls}+\lambda L_{similarity}, \end{aligned} $$
(27)
where λ balances the two terms (i.e., Lcls and Lsimilarity) and is set to 0.1 by cross-validation.
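A direct transcription of Eqs. (25)–(27) in PyTorch style might look as follows; the tensor layout and the use of the cross-entropy routine for the log-loss of Eq. (26) are the only assumptions of this sketch.

```python
import torch.nn.functional as F

def jcs_net_loss(reconstructed, large_scale_target, class_logits, labels, lam=0.1):
    """L_JCS = L_cls + lambda * L_similarity (Eqs. (25)-(27)): the MSE between
    the super-resolved small-scale pedestrian and its large-scale counterpart,
    plus the classification cross-entropy, balanced by lambda = 0.1."""
    l_similarity = F.mse_loss(reconstructed, large_scale_target)   # Eq. (25)
    l_cls = F.cross_entropy(class_logits, labels)                  # Eq. (26)
    return l_cls + lam * l_similarity                              # Eq. (27)
```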
To detect multi-scale pedestrians, multi-scale MCF can be trained based on either JCS-Net or the original HOG+LUV and VGG16. Generally, MCF-V is trained on the original HOG+LUV and VGG16 and is used for large-scale pedestrian detection, while MCF-J is trained based on JCS-Net and is used for small-scale pedestrian detection. Figure 27 illustrates multi-scale MCF. Pedestrians of different scales are divided into several subsets (i.e., subset 1, subset 2, …, subset N) according to their height, and the different subsets may contain overlapping images. In Fig. 27, the detectors on subset 1 and subset 2 are trained with MCF-J, and the remaining detectors with MCF-V. In the test stage, the detection results of all detectors are combined before NMS.
Fig. 27

The illustration of multi-scale MCF

The original pedestrians in the Caltech training set are split into three subsets, called "train-all," "train-small," and "train-large." The "train-all" subset contains all pedestrians; the "train-small" subset contains the pedestrians under 100 pixels tall together with interpolated versions of the pedestrians over 100 pixels tall; and the "train-large" subset contains the pedestrians over 80 pixels tall. Two subsets of the Caltech test set (i.e., reasonable and small) are used for evaluation. The reasonable test set contains pedestrians over 50 pixels tall under no or partial occlusion, and the small test set contains pedestrians between 50 and 100 pixels tall; the small test set is thus a subset of the reasonable test set.

Table 7 shows the miss rates of MCF-V and MCF-J trained on the "train-small" training set. MCF-J outperforms MCF-V on both the reasonable and the small test set, especially on the latter. Two ablation experiments are also conducted: (1) setting λ in (27) to 0, which isolates the influence of the depth of JCS-Net (called MCF-C); (2) training the super-resolution and classification sub-networks separately rather than jointly, which shows the importance of joint multitask training (called MCF-S). Both MCF-C and MCF-S are superior to MCF-V but inferior to MCF-J. This means that although greater depth and simple super-resolution each improve detection, the joint multitask training of JCS-Net is important and further improves detection performance.
Table 7

Miss rates (MR) of MCF-V and MCF-J on the Caltech test set. MCF-V is learned based on HOG+LUV and the fine-tuned VGG16. MCF-J is learned based on HOG+LUV and the proposed JCS-Net

| Method | Training set | Reasonable | Small |
|--------|--------------|------------|-------|
| MCF-V | "train-small" | 13.20% | 14.28% |
| MCF-J | "train-small" | 11.07% | 11.72% |
| ΔMR | | 2.13% | 2.56% |
| Ablation experiments | | | |
| MCF-C | "train-small" | 12.23% | 13.02% |
| MCF-S | "train-small" | 12.65% | 13.50% |

Based on the three training sets (i.e., "train-all," "train-small," and "train-large"), MS-V and MS-J are trained; each contains three detectors. The only difference is that, on the "train-small" training set, MS-V uses MCF-V whereas MS-J uses MCF-J. Table 8 shows the miss rates of MS-V and MS-J on the Caltech test set. The miss rates of MS-J are 0.86% and 0.91% lower than those of MS-V on the reasonable and small test sets, respectively, which demonstrates the effectiveness of MS-J.
Table 8

Miss rates (MR) of MS-V and MS-J on the Caltech test set. MS-V means multi-scale MCF based on fine-tuned VGG16. MS-J means multi-scale MCF based on JCS-Net

| Method | Detectors | Training set | Reasonable | Small |
|--------|-----------|---------------|------------|-------|
| MS-V | MCF-V | "train-small" | 9.67% | 10.48% |
| | MCF-V | "train-large" | | |
| | MCF-V | "train-all" | | |
| MS-J | MCF-J | "train-small" | 8.81% | 9.57% |
| | MCF-V | "train-large" | | |
| | MCF-V | "train-all" | | |
| ΔMR | | | 0.86% | 0.91% |

Finally, MS-J is compared to some state-of-the-art methods (i.e., Roerei [2], ACF [16], LDCF [43], TA-CNN [59], Checkerboards [73], DeepParts [58], CompACT-Deep [4], and MS-CNN [5]) on the Caltech test set. Figure 28 shows the ROC curves. MS-J achieves state-of-the-art performance, outperforming MS-CNN by 1.14%.
Fig. 28

ROC on Caltech test set. MS-J is compared to some state-of-the-art methods

4.2 Occlusion Problem

Object occlusion is very common. For example, Dollár et al. [15] found that most pedestrians (about 70%) in street scenes are occluded in at least one frame. Thus, detecting occluded objects is necessary and important for computer vision applications. Over the past decade, researchers have made many attempts to solve the occlusion problem.

Wang et al. [62] found that if some parts of a pedestrian are occluded, the block features of the corresponding region respond uniformly to the block scores of a linear classifier. Based on this phenomenon, they proposed to use the score of each block to judge whether the corresponding region is occluded. Based on the block scores, an occlusion likelihood image is segmented by the mean shift approach. If occlusion occurs, a part detector is applied to the unoccluded regions to produce the final detection result.

To maximize detection performance on occluded pedestrians, Mathias et al. [40] proposed to learn a set of occlusion-specific pedestrian detectors, each serving a certain type of occlusion. In [40], occlusions are divided into three types: occlusions from the bottom, from the right, and from the left, with the degree of occlusion ranging from 0% to 50% for each type. Eight left/right occlusion detectors and 16 bottom-up occlusion detectors are trained. One naive way to obtain these classifiers is to train a classifier for every occlusion level separately, but this is very time-consuming. To reduce the training time, Franken-classifiers are proposed: training starts from the full-body biased classifier, and weak classifiers are removed to generate the first occlusion classifier; additional weak classifiers for the first occlusion classifier are then learned without bias, and the second occlusion classifier is learned from the first in the same way. With Franken-classifiers, training the whole set of occlusion-specific detectors requires only one tenth of the computation cost.

Inspired by the set of occlusion-specific pedestrian detectors, Tian et al. [58] extended the idea by constructing an extensive deep part pool and automatically choosing important parts for occlusion handling with a linear SVM. The extensive part pool contains various body parts. A pedestrian is treated as a rigid object divided into 2m × m spatial grid cells, and the part pool consists of all rectangles inside the grid whose height and width are at least 2 cells. For m = 3, the part pool contains 45 part models. To alleviate the test-time cost of evaluating 45 part models, the 6 part models with the highest SVM scores are selected, which yield comparable performance with faster detection.
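The size of the part pool follows directly from the definition and can be checked with a short enumeration; the (top, left, height, width) representation of a part is an assumption of this sketch.

```python
def part_pool(m=3, min_side=2):
    """Enumerate the part pool: all rectangles inside a (2m x m) grid whose
    height and width are at least `min_side` cells. Each part is returned
    as (top, left, height, width) in grid-cell units."""
    rows, cols = 2 * m, m
    parts = []
    for h in range(min_side, rows + 1):
        for w in range(min_side, cols + 1):
            for top in range(rows - h + 1):
                for left in range(cols - w + 1):
                    parts.append((top, left, h, w))
    return parts

print(len(part_pool(3)))   # 45, as stated for m = 3
```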

Wang et al. [64] argued that a data-driven strategy is important for solving the occlusion problem: if the training data contained enough examples of all occlusion situations, the trained detector would handle occluded objects better. However, a dataset generally cannot cover all occlusion cases, and occlusions are relatively rare; because occlusions follow a long-tail distribution, it is impractical to collect a dataset that covers all of them. To solve this problem, Wang et al. proposed A-Fast-RCNN, which uses an adversarial network to generate hard examples by blocking some feature maps spatially. After the ROI pooling layer, an extra branch, consisting of two fully connected layers and a mask prediction layer, generates an occlusion mask. The feature maps used for the final classification and regression are the combination of the mask and the ROI-pooled feature maps: if a cell of the mask equals 1, the corresponding responses of the feature maps are set to 0 (i.e., dropout). Training adopts stage-wise steps: (1) with the occlusion mask fixed, the Fast RCNN network is trained; (2) given the trained Fast RCNN, the adversarial network generates the occlusion mask that maximizes the Fast RCNN loss. Through this two-stage training, the final Fast RCNN network becomes robust to occlusion. In the test stage, the mask branch is removed, and the rest of the process is the same as the original Fast RCNN.
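The masking step itself is a simple element-wise operation on the ROI-pooled features; the tensor layout below is an assumption of this sketch rather than the layout used in the A-Fast-RCNN code.

```python
import torch

def apply_occlusion_mask(roi_features, mask):
    """Adversarial dropout as in A-Fast-RCNN: wherever the predicted mask is 1,
    the corresponding ROI-pooled responses are zeroed before the
    classification and regression heads.
    roi_features: (N, C, H, W); mask: (N, 1, H, W) with entries in {0, 1}."""
    return roi_features * (1.0 - mask)

feats = torch.randn(2, 256, 7, 7)
mask = torch.randint(0, 2, (2, 1, 7, 7)).float()
print(apply_occlusion_mask(feats, mask).shape)   # torch.Size([2, 256, 7, 7])
```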

4.3 Deformation Problem

Object deformation can be caused by non-rigid deformation, intra-class shape variability, and so on; for example, people can jump or squat. Thus, a good object detection method should be robust to object deformation. Before CNNs, researchers made many attempts to handle deformation. For example, DPM [19] uses mixtures of multi-scale deformable part models (i.e., one low-resolution root model and six high-resolution part models) to handle deformation, and HSC [51] further incorporates histograms of sparse codes into the deformable part model. To accelerate the detection speed of DPM, CDPM [18] and FDPM [65] were proposed. Park et al. [48] proposed to detect large-scale pedestrians with a deformable part model and small-scale pedestrians with a rigid template. Regionlets [63] represents a region by a set of small sub-regions with different sizes and aspect ratios.

Though CNN-based methods are robust to object deformation to some degree, they are still not good enough, so researchers have incorporated deformation-specific designs into CNN-based methods. Ouyang et al. [45] proposed the deformation-constrained pooling layer (def-pooling) to model the deformation of object parts; traditional pooling (e.g., max pooling and average pooling) can be replaced by def-pooling to better represent the deformation properties of objects. Recently, Dai et al. [12] proposed two deformable modules (i.e., deformable convolution and deformable ROI pooling) to enhance the ability to represent geometric transformations: 2D offsets, learned during training, are added to the regular grid sampling locations of the standard convolution. Jeon and Kim [30] also proposed the active convolution, in which the shape of the convolution is learned during training. To improve invariance to large deformations and transformations, Jaderberg et al. [29] proposed the spatial transformer network, which performs scaling, rotation, and non-rigid deformations on the feature map.
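For deformable convolution, the key idea can be shown with torchvision's operator (assuming a torchvision version that ships `torchvision.ops.deform_conv2d`). In a real network the offsets below would be predicted by a small convolutional branch and learned end to end; here they are zeros, which reduces the operator to a standard 3 × 3 convolution.

```python
import torch
from torchvision.ops import deform_conv2d

x = torch.randn(1, 16, 32, 32)                 # input feature map
weight = torch.randn(32, 16, 3, 3)             # 3x3 kernel, 16 -> 32 channels
# One (dy, dx) pair per kernel tap, per output location: 2 * 3 * 3 = 18 channels.
offset = torch.zeros(1, 18, 32, 32)            # zero offsets = regular sampling grid
out = deform_conv2d(x, offset, weight, padding=(1, 1))
print(out.shape)                               # torch.Size([1, 32, 32, 32])
```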

References

1. Bell, S., Zitnick, C. L., Bala, K., and Girshick, R.: Inside-Outside Net: Detecting objects in context with skip pooling and recurrent neural networks. in Proc. IEEE Conf. Computer Vision and Pattern Recognition (2016)
2. Benenson, R., Mathias, M., Tuytelaars, T., and Gool, L. V.: Seeking the strongest rigid detector. in Proc. IEEE Conf. Computer Vision and Pattern Recognition (2013)
3. Benenson, R., Omran, M., Hosang, J., and Schiele, B.: Ten years of pedestrian detection, what have we learned? in Proc. Eur. Conf. Comput. Vis. (2014)
4. Cai, Z., Saberian, M., and Vasconcelos, N.: Learning complexity-aware cascades for deep pedestrian detection. in Proc. IEEE Int. Conf. Comput. Vis. (2015)
5. Cai, Z., Fan, Q., Feris, R. S., and Vasconcelos, N.: A unified multi-scale deep convolutional neural network for fast object detection. in Proc. Eur. Conf. Comput. Vis. (2016)
6. Cao, J., Pang, Y., and Li, X.: Pedestrian detection inspired by appearance constancy and shape symmetry. in Proc. IEEE Conf. Computer Vision and Pattern Recognition (2016)
7. Cao, J., Pang, Y., and Li, X.: Pedestrian detection inspired by appearance constancy and shape symmetry. IEEE Trans. Image Processing 25(12), 5538–5551 (2016)
8. Cao, J., Pang, Y., and Li, X.: Learning multilayer features for pedestrian detection. IEEE Trans. Image Processing 26(7), 3310–3320 (2017)
9. Cheng, M. M., Zhang, Z., Lin, W. Y., and Torr, P.: BING: Binarized normed gradients for objectness estimation at 300fps. in Proc. IEEE Conf. Computer Vision and Pattern Recognition (2014)
10. Dalal, N. and Triggs, B.: Histograms of oriented gradients for human detection. in Proc. IEEE Conf. Computer Vision and Pattern Recognition (2005)
11. Dai, J., Li, Y., He, K., and Sun, J.: R-FCN: Object detection via region-based fully convolutional networks. in Proc. Advances in Neural Information Processing Systems (2016)
12. Dai, J., Qi, H., Xiong, Y., Li, Y., Zhang, G., Hu, H., and Wei, Y.: Deformable convolutional networks. in Proc. IEEE Conf. Computer Vision and Pattern Recognition (2017)
13. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. in Proc. IEEE Conf. Computer Vision and Pattern Recognition (2009)
14. Dollár, P., Tu, Z., Perona, P., and Belongie, S.: Integral channel features. in Proc. Brit. Mach. Vis. Conf. (2009)
15. Dollár, P., Wojek, C., Schiele, B., and Perona, P.: Pedestrian detection: An evaluation of the state of the art. IEEE Trans. Pattern Analysis and Machine Intelligence 34(4), 743–761 (2012)
16. Dollár, P., Appel, R., Belongie, S., and Perona, P.: Fast feature pyramids for object detection. IEEE Trans. Pattern Analysis and Machine Intelligence 36(8), 1532–1545 (2014)
17. Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., and Zisserman, A.: The PASCAL visual object classes (VOC) challenge. Int. J. Comput. Vis. 88(2), 303–338 (2010)
18. Felzenszwalb, P. F., Girshick, R., and McAllester, D.: Cascade object detection with deformable part models. in Proc. IEEE Conf. Computer Vision and Pattern Recognition (2010)
19. Felzenszwalb, P., Girshick, R., McAllester, D., and Ramanan, D.: Object detection with discriminatively trained part based models. IEEE Trans. Pattern Analysis and Machine Intelligence 32(9), 1627–1645 (2010)
20. Fu, C.-Y., Liu, W., Ranga, A., Tyagi, A., and Berg, A. C.: DSSD: Deconvolutional single shot detector. CoRR abs/1701.06659 (2017)
21. Geiger, A., Lenz, P., and Urtasun, R.: Are we ready for autonomous driving? The KITTI vision benchmark suite. in Proc. IEEE Conf. Computer Vision and Pattern Recognition (2012)
22. Gidaris, S. and Komodakis, N.: LocNet: Improving localization accuracy for object detection. in Proc. IEEE Conf. Computer Vision and Pattern Recognition (2016)
23. Girshick, R., Donahue, J., Darrell, T., and Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. in Proc. IEEE Conf. Computer Vision and Pattern Recognition (2014)
24. Girshick, R.: Fast RCNN. in Proc. Int. Conf. Comput. Vis. (2015)
25. He, K., Zhang, X., Ren, S., and Sun, J.: Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Analysis and Machine Intelligence 37(9), 1904–1916 (2015)
26. He, K., Zhang, X., Ren, S., and Sun, J.: Deep residual learning for image recognition. in Proc. IEEE Conf. Computer Vision and Pattern Recognition (2016)
27. He, K., Gkioxari, G., Dollár, P., and Girshick, R.: Mask R-CNN. in Proc. Int. Conf. Comput. Vis. (2017)
28. Hosang, J., Omran, M., Benenson, R., and Schiele, B.: Taking a deeper look at pedestrians. in Proc. IEEE Conf. Computer Vision and Pattern Recognition (2015)
29. Jaderberg, M., Simonyan, K., Zisserman, A., and Kavukcuoglu, K.: Spatial transformer networks. in Proc. Advances in Neural Information Processing Systems (2015)
30. Jeon, Y. and Kim, J.: Active convolution: Learning the shape of convolution for image classification. in Proc. IEEE Conf. Computer Vision and Pattern Recognition (2017)
31. Kong, T., Yao, A., Chen, Y., and Sun, F.: HyperNet: Towards accurate region proposal generation and joint object detection. in Proc. IEEE Conf. Computer Vision and Pattern Recognition (2016)
32. Kong, T., Sun, F., Yao, A., Liu, H., Lu, M., and Chen, Y.: RON: Reverse connection with objectness prior networks for object detection. in Proc. IEEE Conf. Computer Vision and Pattern Recognition (2017)
33. Krizhevsky, A., Sutskever, I., and Hinton, G. E.: ImageNet classification with deep convolutional neural networks. in Proc. Advances in Neural Information Processing Systems (2012)
34. Li, J., Liang, X., Shen, S., Xu, T., and Yan, S.: Scale-aware Fast R-CNN for pedestrian detection. CoRR abs/1510.08160 (2015)
35. Li, J., Liang, X., Wei, Y., Xu, T., Feng, J., and Yan, S.: Perceptual generative adversarial networks for small object detection. in Proc. IEEE Conf. Computer Vision and Pattern Recognition (2017)
36. Lin, T.-Y., Goyal, P., Girshick, R., He, K., and Dollár, P.: Focal loss for dense object detection. in Proc. Int. Conf. Comput. Vis. (2017)
37. Lin, T.-Y., Dollár, P., Girshick, R., He, K., Hariharan, B., and Belongie, S.: Feature pyramid networks for object detection. in Proc. IEEE Conf. Computer Vision and Pattern Recognition (2017)
38. Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., and Berg, A. C.: SSD: Single shot multibox detector. in Proc. Eur. Conf. Comput. Vis. (2016)
39. Lowe, D. G.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 60(2), 91–110 (2004)
40. Mathias, M., Benenson, R., Timofte, R., and Van Gool, L.: Handling occlusions with Franken-classifiers. in Proc. Int. Conf. Comput. Vis. (2013)
41. Mao, J., Xiao, T., Jiang, Y., and Cao, Z.: What can help pedestrian detection? in Proc. IEEE Conf. Computer Vision and Pattern Recognition (2017)
42. Najibi, M., Rastegari, M., and Davis, L. S.: G-CNN: An iterative grid based object detector. in Proc. IEEE Conf. Computer Vision and Pattern Recognition (2016)
43. Nam, W., Dollár, P., and Han, J.: Local decorrelation for improved pedestrian detection. in Proc. Advances in Neural Information Processing Systems (2014)
44. Ojala, T., Pietikainen, M., and Maenpaa, T.: Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Trans. Pattern Analysis and Machine Intelligence 24(7), 971–987 (2002)
45. Ouyang, W., Wang, X., Zeng, X., Qiu, S., Luo, P., Tian, Y., Li, H., Yang, S., Wang, Z., Loy, C.-C., and Tang, X.: DeepID-Net: Deformable deep convolutional neural networks for object detection. in Proc. IEEE Conf. Computer Vision and Pattern Recognition (2015)
46. Paisitkriangkrai, S., Shen, C., and van den Hengel, A.: Pedestrian detection with spatially pooled features and structured ensemble learning. IEEE Trans. Pattern Analysis and Machine Intelligence 38(6), 1243–1257 (2016)
47. Pang, Y., Cao, J., and Shao, L.: Small-scale pedestrian detection by joint classification and super-resolution into a unified network. Tech. report (2017)
48. Park, D., Ramanan, D., and Fowlkes, C.: Multiresolution models for object detection. in Proc. Eur. Conf. Comput. Vis. (2010)
49. Redmon, J., Divvala, S., Girshick, R., and Farhadi, A.: You only look once: Unified, real-time object detection. in Proc. IEEE Conf. Computer Vision and Pattern Recognition (2016)
50. Ren, J., Chen, X., Liu, J., Sun, W., Pang, J., Yan, Q., Tai, Y., and Xu, L.: Accurate single stage detector using recurrent rolling convolution. in Proc. IEEE Conf. Computer Vision and Pattern Recognition (2017)
51. Ren, X. and Ramanan, D.: Histograms of sparse codes for object detection. in Proc. IEEE Conf. Computer Vision and Pattern Recognition (2013)
52. Ren, S., He, K., Girshick, R., and Sun, J.: Faster R-CNN: Towards real-time object detection with region proposal networks. in Proc. Advances in Neural Information Processing Systems (2015)
53. Sermanet, P., Kavukcuoglu, K., Chintala, S., and LeCun, Y.: Pedestrian detection with unsupervised multi-stage feature learning. in Proc. IEEE Conf. Computer Vision and Pattern Recognition (2013)
54. Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R., and LeCun, Y.: OverFeat: Integrated recognition, localization and detection using convolutional networks. in Proc. Int. Conf. Learning Representations (2014)
55. Shrivastava, A., Gupta, A., and Girshick, R.: Training region-based object detectors with online hard example mining. in Proc. IEEE Conf. Computer Vision and Pattern Recognition (2016)
56. Simonyan, K. and Zisserman, A.: Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556 (2014)
57. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A.: Going deeper with convolutions. in Proc. IEEE Conf. Computer Vision and Pattern Recognition (2015)
58. Tian, Y., Luo, P., Wang, X., and Tang, X.: Deep learning strong parts for pedestrian detection. in Proc. Int. Conf. Comput. Vis. (2015)
59. Tian, Y., Luo, P., Wang, X., and Tang, X.: Pedestrian detection aided by deep learning semantic tasks. in Proc. IEEE Conf. Computer Vision and Pattern Recognition (2015)
60. Uijlings, J. R. R., van de Sande, K. E. A., Gevers, T., and Smeulders, A. W. M.: Selective search for object recognition. Int. J. Comput. Vis. (2013)
61. Viola, P. and Jones, M. J.: Robust real-time face detection. Int. J. Comput. Vis. 57(2), 137–154 (2004)
62. Wang, X., Han, T. X., and Yan, S.: An HOG-LBP human detector with partial occlusion handling. in Proc. Int. Conf. Comput. Vis. (2009)
63. Wang, X., Yang, M., Zhu, S., and Lin, Y.: Regionlets for generic object detection. in Proc. Int. Conf. Comput. Vis. (2013)
64. Wang, X., Shrivastava, A., and Gupta, A.: A-Fast-RCNN: Hard positive generation via adversary for object detection. in Proc. Int. Conf. Comput. Vis. (2017)
65. Yan, J., Lei, Z., Wen, L., and Li, S. Z.: The fastest deformable part model for object detection. in Proc. IEEE Conf. Computer Vision and Pattern Recognition (2014)
66. Yang, B., Yan, J., Lei, Z., and Li, S. Z.: Convolutional channel features. in Proc. Int. Conf. Comput. Vis. (2015)
67. Yang, B., Yan, J., Lei, Z., and Li, S. Z.: CRAFT objects from images. in Proc. IEEE Conf. Computer Vision and Pattern Recognition (2016)
68. Yang, F., Choi, W., and Lin, Y.: Exploit all the layers: Fast and accurate CNN object detector with scale dependent pooling and cascaded rejection classifiers. in Proc. IEEE Conf. Computer Vision and Pattern Recognition (2016)
69. Yu, F., Koltun, V., and Funkhouser, T.: Dilated residual networks. in Proc. IEEE Conf. Computer Vision and Pattern Recognition (2017)
70. Zagoruyko, S., Lerer, A., Lin, T.-Y., Pinheiro, P. O., Gross, S., Chintala, S., and Dollár, P.: A multipath network for object detection. in Proc. British Machine Vision Conference (2016)
71. Zhang, L., Lin, L., Liang, X., and He, K.: Is Faster R-CNN doing well for pedestrian detection? in Proc. Eur. Conf. Comput. Vis. (2016)
72. Zhang, S., Bauckhage, C., and Cremers, A. B.: Informed Haar-like features improve pedestrian detection. in Proc. IEEE Conf. Computer Vision and Pattern Recognition (2014)
73. Zhang, S., Benenson, R., and Schiele, B.: Filtered channel features for pedestrian detection. in Proc. IEEE Conf. Computer Vision and Pattern Recognition (2015)
74. Zhang, S., Benenson, R., Hosang, J., and Schiele, B.: CityPersons: A diverse dataset for pedestrian detection. in Proc. IEEE Conf. Computer Vision and Pattern Recognition (2016)
75. Zitnick, C. L. and Dollár, P.: Edge boxes: Locating object proposals from edges. in Proc. Eur. Conf. Comput. Vis. (2014)

Copyright information

© Springer Nature Singapore Pte Ltd. 2019

Authors and Affiliations

1. School of Electrical and Information Engineering, Tianjin University, Tianjin, China
