Deep Learning in Object Detection
Object detection is an important research area in image processing and computer vision, and its performance has improved significantly through the application of deep learning. Among deep learning approaches, convolutional neural network (CNN)-based methods are the most frequently used, and they mainly fall into two classes: two-stage methods and one-stage methods. This chapter first introduces some typical CNN-based architectures in detail. After that, pedestrian detection, as a classical subset of object detection, is further introduced. Depending on whether a CNN is used, pedestrian detection methods can be divided into two types: handcrafted feature-based methods and CNN-based methods. Among these methods, NNNF (non-neighboring and neighboring features), inspired by pedestrian attributes (i.e., appearance constancy and shape symmetry), and MCF (multilayer channel features), built on handcrafted channels and the layers of a CNN, are illustrated in detail. Finally, some challenges of object detection (i.e., scale variation, occlusion, and deformation) are discussed.
1 Introduction
Depending on whether deep learning is used, object detection methods can be divided into two main classes: handcrafted feature-based methods [6, 14, 63, 73] and deep learning-based methods [23, 25, 36, 38, 52]. In the first decade of the twenty-first century, traditional handcrafted feature-based methods were the mainstream. During this period, many famous and successful feature descriptors (e.g., SIFT, HOG, Haar, and LBP) and models (e.g., DPM) were proposed. Built on these feature descriptors and classical classifiers (e.g., SVM and AdaBoost), such methods achieved great success at the time.
Around 2010, however, the performance of object detection began to plateau. Though many methods were still being proposed, the performance improvements were relatively limited. Meanwhile, deep learning began to show superior performance in some areas of computer vision (e.g., image classification [26, 33, 56, 57]). In 2012, trained on large-scale image data (i.e., ImageNet), the deep CNN called AlexNet achieved the best classification performance in the ILSVRC-2012 competition, outperforming the second-best method by 10.9%.
With the great success of deep learning in image classification [13, 26, 56, 57], researchers began to explore how to improve object detection performance with deep learning. In recent years, object detection based on deep learning has also made great progress [23, 25, 52]: the mAP on PASCAL VOC2007 increased dramatically from 58% (RCNN with AlexNet) to 86% (Faster RCNN with ResNet). Currently, the state-of-the-art methods for object detection are based on deep convolutional neural networks (CNNs) [23, 27, 36, 52].
In Sect. 2, some typical CNN architectures for object detection are introduced. Pedestrian detection, as a special case of object detection, is discussed in Sect. 3. Finally, some representative challenges of object detection (i.e., occlusion, scale variation, and deformation) are illustrated in Sect. 4.
2 The CNN Architectures of Object Detection
2.1 Two-Stage Methods for Deep Object Detection
Two-stage methods treat object detection as a multistage process. Given an input image, proposals for possible objects are first extracted; these proposals are then classified into specific object categories by a trained classifier. The benefits of this design can be summarized as follows: (1) It greatly reduces the number of candidate regions passed to the subsequent classifier, which accelerates detection. (2) The proposal-generation step can be seen as a bootstrapping technique: because the classifier only sees proposals for possible objects, it can focus on the classification task during training with little influence from background regions (i.e., easy negatives), which improves detection accuracy. Among two-stage methods, the RCNN series, including RCNN, SPPnet, Fast RCNN, and Faster RCNN, is the most representative.
In the RCNN training stage, object proposals are first generated by selective search for training the CNN and the SVM classifiers. The CNN (e.g., AlexNet) is first pre-trained on ImageNet and then fine-tuned on the specific object detection dataset (e.g., PASCAL VOC). Because the numbers of object categories in ImageNet and PASCAL VOC differ, the number of outputs of the final fully connected layer must be changed from 1000 to 21 when fine-tuning on PASCAL VOC; the 21 outputs correspond to the 20 PASCAL VOC object classes plus the background. When fine-tuning the CNN, a proposal is labelled as a positive for the matched class if its maximum IoU overlap with a ground-truth bounding box is at least 0.5; otherwise, it is labelled as background. Based on the CNN features extracted from the trained network, a linear SVM classifier is then trained for each class. When training the per-class SVMs, only the ground-truth bounding boxes are labelled as positives, and a proposal is labelled as a negative only if its IoU overlap with every ground-truth bounding box is below 0.3 (both labeling rules are sketched below). Because the extracted CNN features are too large to fit in memory, a bootstrapping technique is used to mine hard negatives when training the SVM classifiers.
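The following minimal Python sketch illustrates the two IoU-based labeling rules just described; the box format (x1, y1, x2, y2) and the helper names are illustrative assumptions, while the 0.5 and 0.3 thresholds come from the text.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def label_for_finetuning(proposal, gt_boxes, gt_classes, pos_thresh=0.5):
    """Fine-tuning rule: positive for the best-matching class if the maximum
    IoU is at least 0.5, otherwise background (class 0)."""
    overlaps = [iou(proposal, gt) for gt in gt_boxes]
    best = int(np.argmax(overlaps))
    return gt_classes[best] if overlaps[best] >= pos_thresh else 0

def is_svm_negative(proposal, gt_boxes, neg_thresh=0.3):
    """SVM rule: a proposal is a negative only if its IoU with every
    ground truth is below 0.3 (ground-truth boxes are the positives)."""
    return all(iou(proposal, gt) < neg_thresh for gt in gt_boxes)
```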
In the Fast RCNN training stage, the CNN is first initialized from the pre-trained ImageNet network and then trained end-to-end with back-propagation. Each mini-batch of R proposals is sampled from N images, with each image contributing R∕N proposals; the proposals from the same image share all of the convolutional computation. Generally, N is set to 2 and R to 128 (see the sketch below). To achieve scale invariance, two different strategies are used: a brute-force approach and a multi-scale approach. In the brute-force approach, the images are resized to a fixed size in both the training and test stages. In the multi-scale approach, each image is randomly rescaled to one scale of a pyramid in each training iteration; in the test stage, the multi-scale images are put through the trained network separately, and the detection results are merged by NMS.
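A minimal sketch of this hierarchical mini-batch sampling, under the assumption that proposals are stored per image; the function name and data layout are illustrative.

```python
import random

def sample_minibatch(image_ids, proposals_per_image, N=2, R=128):
    """Draw N images, then R/N proposals from each, so that proposals from
    the same image can share all convolutional computation."""
    chosen = random.sample(image_ids, N)
    per_image = R // N  # 64 proposals per image when N = 2 and R = 128
    batch = []
    for img in chosen:
        batch.extend(random.sample(proposals_per_image[img], per_image))
    return chosen, batch
```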
Because RPN and Fast RCNN share the same base network, they cannot be trained independently. Faster RCNN provides three ways to train RPN and Fast RCNN as a unified network: (1) Alternating training. RPN is trained first; based on the proposals generated by RPN and the RPN filter weights, Fast RCNN is then trained. These two steps are iterated twice. (2) Approximate joint training. RPN and Fast RCNN are treated as a single network whose loss is the sum of the RPN loss and the Fast RCNN loss. In each training iteration, the proposals generated by RPN are treated as fixed when training the Fast RCNN detector; that is, the derivatives with respect to the proposal coordinates are ignored (see the sketch below). (3) Non-approximate joint training, which differs from approximate joint training in that the derivatives with respect to the proposal coordinates are taken into account. Because the standard ROI pooling layer is not differentiable with respect to the proposal coordinates, the first two schemes are usually used.
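The following sketch shows approximate joint training in PyTorch-style code; the per-term balancing weights are omitted, and treating the proposals as fixed corresponds to detaching them from the computation graph. Function names are illustrative.

```python
import torch

def approximate_joint_loss(rpn_cls, rpn_reg, frcnn_cls, frcnn_reg):
    """Unified loss: the RPN and Fast RCNN terms are simply summed
    (balancing weights between the terms are omitted here)."""
    return rpn_cls + rpn_reg + frcnn_cls + frcnn_reg

def fixed_proposals(rpn_proposals):
    """The 'approximate' part: no gradient flows back through the
    proposal coordinates produced by the RPN."""
    return rpn_proposals.detach()
```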
The loss of R-FCN is similar to that of Faster RCNN. Note, however, that in Faster RCNN the per-category location predictions come from different outputs of the box regression layer, while in R-FCN all location predictions share the same output of the box regression layer. R-FCN is a fully convolutional network that shares almost all CNN computation over the whole image. Thus, it achieves competitive detection accuracy with faster detection speed compared to Faster RCNN.
Several other improvements of two-stage methods have also been proposed. Yang et al. proposed a "divide and conquer" solution based on Fast RCNN, which first rejects background candidates and then uses trained category-specific networks to judge the proposal category. Based on the classification loss of proposals, Shrivastava et al. proposed online hard example mining to select hard samples. Gidaris et al. proposed LocNet to improve localization accuracy, which localizes a bounding box via probabilities over each row and column of a given region. Bell et al. proposed to use contextual information and skip-layer pooling to improve detection performance. Kong et al. proposed skip-layer connections for object detection. Zagoruyko et al. proposed multiple branches to extract object context at multiple resolutions. Yu et al. proposed dilated residual networks to enlarge the resolution of the output layer without reducing the receptive field.
2.2 One-Stage Methods for Deep Object Detection
In contrast to the multistage pipeline of two-stage methods, one-stage methods predict object category and location simultaneously. Compared with two-stage methods, one-stage methods offer much faster detection speed with comparable accuracy. Among one-stage methods, OverFeat, YOLO, SSD, and RetinaNet are representative.
In the training stage, the first 20 convolutional layers of YOLO are pre-trained, and the whole network is then fine-tuned on the object detection dataset. YOLO uses a sum-squared error loss for optimization. Because many grid cells contain no objects, their confidence gradients can overwhelm the gradients from cells that do contain objects. Thus, the weight of the bounding-box location loss is increased, while the weight of the confidence loss for boxes that contain no objects is decreased (see the sketch below). Each ground-truth object is assigned to the bounding-box predictor with which it has the highest IoU overlap.
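A sketch of this reweighted sum-squared-error loss; the λ values (5 and 0.5) follow the YOLO paper, and each argument is assumed to be the accumulated squared error of the corresponding term.

```python
LAMBDA_COORD = 5.0   # up-weights the box location terms
LAMBDA_NOOBJ = 0.5   # down-weights confidence terms of boxes with no object

def yolo_sse_loss(coord_err, conf_err_obj, conf_err_noobj, class_err):
    """Sum-squared-error loss with the reweighting described above."""
    return (LAMBDA_COORD * coord_err + conf_err_obj
            + LAMBDA_NOOBJ * conf_err_noobj + class_err)
```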
Some other one-stage methods have also been proposed. Najibi et al. proposed to divide the input image into multi-scale regular grids and iteratively shift the grids toward the objects. Based on SSD, Ren et al. proposed recurrent rolling convolution to incorporate deep context information. Fu et al. proposed adding deconvolutional layers after SSD to incorporate object context.
3 Pedestrian Detection
3.1 Handcrafted Feature-Based Methods for Pedestrian Detection
By observing and summarizing ICF and its variants, Zhang et al. generalized these methods into a unified framework called filtered channel features (FCF). The bottom row of Fig. 14 shows the FCF pipeline: the input image is first converted into HOG+LUV channels; a filter bank is then convolved with HOG+LUV to generate new feature channels, where each pixel value of each channel is a candidate feature; and a strong classifier is finally learned by a decision forest over the candidate feature pool (see the sketch below). The filter bank can be of different types, including SquaresChntrs, Checkerboards, Random, and Informed filters. Thus, ICF, ACF, SquaresChntrs, and LDCF can all be seen as special cases of FCF. With Checkerboards filters, FCF outperforms ICF, ACF, SquaresChntrs, and LDCF on the Caltech pedestrian dataset and the KITTI benchmark.
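A minimal sketch of the FCF feature-generation step; the filter examples are only illustrative members of a bank, and the decision-forest training that would follow is omitted.

```python
import numpy as np
from scipy.signal import convolve2d

def filtered_channel_features(channels, filter_bank):
    """Convolve every HOG+LUV channel with every filter in the bank; each
    pixel of each filtered channel is one candidate feature."""
    feats = [convolve2d(ch, f, mode="same")
             for ch in channels for f in filter_bank]
    return np.stack(feats)  # shape: (n_channels * n_filters, H, W)

# Illustrative filters: a 2x2 sum filter (SquaresChntrs-like) and a
# 2x2 checkerboard-like difference filter.
square = np.ones((2, 2))
checker = np.array([[1.0, -1.0], [-1.0, 1.0]])
```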
The methods above achieved great success in pedestrian detection. However, the design of handcrafted features did not consider the inherent attributes of pedestrians. Zhang et al. treated a pedestrian as three parts (i.e., head, upper body, and legs) and designed Haar-like features (i.e., InformedHaar) for pedestrian detection accordingly. However, InformedHaar still does not make full use of pedestrian attributes. To exploit pedestrian attributes more fully, Cao et al. [6, 7] further proposed two non-neighboring features inspired by the appearance constancy and shape symmetry of pedestrians.
[Table: Comparison of log-average miss rates on the Caltech test set.]
3.2 CNN-Based Methods for Pedestrian Detection
With the success of convolutional neural networks in many fields of computer vision, researchers have also explored how to apply CNNs to pedestrian detection. As an initial attempt, Hosang et al. studied the effectiveness of CNNs for pedestrian detection. Their approach first extracts candidate pedestrian proposals using a handcrafted feature-based method (i.e., SquaresChnFtrs) and then uses a small network (i.e., CifarNet) or a large network (i.e., AlexNet) to classify these proposals. Traditional pedestrian detection usually uses a fixed-size detection window (i.e., 128 × 64). Accordingly, the input size of the CNN (CifarNet or AlexNet) is changed to 128 × 64, and the number of outputs is changed to 2 (i.e., pedestrian or non-pedestrian). In the training stage, a proposal with an IoU overlap over 0.5 with a ground-truth bounding box is labelled as a positive, and a proposal with IoU below 0.3 with all ground-truth bounding boxes is labelled as a negative. The ratio of positives to negatives in each mini-batch is set to 1:5. Experimental results on the Caltech pedestrian dataset demonstrate that ConvNets can achieve state-of-the-art performance on pedestrian detection.
Since then, pedestrian detection based on CNNs has achieved great success. Sermanet et al. proposed to merge the down-sampled first convolutional layer with the second convolutional layer to add global information for pedestrian detection. Instead of using handcrafted feature channels (HOG+LUV), Yang et al. proposed convolutional channel features (CCF), which treat the feature maps of the last convolutional layer as the feature channels and use decision forests to learn the strong classifier. Thanks to the richer representation ability of CNN features, CCF achieves state-of-the-art performance on pedestrian detection. Zhang et al. proposed to use an RPN to generate proposals and extract features, and a decision forest to classify the proposals. Zhang et al. proposed to improve the performance of Faster RCNN on pedestrian detection with some specific techniques (e.g., quantized RPN scales and an upsampled input image). To reduce computational complexity, Cai et al. proposed the complexity-aware cascaded detector (CompACT), which integrates features of different complexities and aims at the best trade-off between classification accuracy and computational cost. In the training stage, the loss of CompACT combines a classification loss with a computational-complexity loss. As a result, the early stages of the strong classifier favor features of lower computational complexity, while the later stages use more features of higher complexity. To further improve detection performance, CNN features of the highest complexity are embedded into the last stage of CompACT, yielding CompACT-Deep. Because most detection proposals are rejected by the early stages, the CNN features need to be computed for only a few proposals. Thus, CompACT-Deep improves detection performance without adding much computational cost.
Some methods exploit extra feature information to help pedestrian detection. Tian et al. proposed TA-CNN, which jointly optimizes pedestrian detection with semantic tasks, including scene attributes and pedestrian attributes; the attributes are transferred from existing scene datasets, and TA-CNN is learned by iterating over the two kinds of tasks. Mao et al. proposed to feed extra feature channels into deep pedestrian detectors as additional input channels. These extra channels include low-level semantic channels (e.g., gradient and edge), high-level semantic channels (e.g., segmentation and heatmap), depth channels, and temporal channels. They found that segmentation and edge channels, used as extra inputs, significantly improve pedestrian detection. Based on this observation, Mao et al. further proposed HyperLearner, which consists of four modules: a base network, a channel feature network (CFN), a region proposal network, and Fast RCNN. The base network, region proposal network, and Fast RCNN are the same as in the original Faster RCNN. Multiple convolutional layers of the base network first go through two 3 × 3 convolutional layers each and are then upsampled to the size of conv1; the resulting layers are concatenated to form the aggregated feature maps (see the sketch below). The aggregated feature maps are fed into the CFN for channel feature prediction and are concatenated with the output layer of the base network. In the training stage, the CFN is supervised by semantic segmentation ground truth, and the loss of HyperLearner is the combination of the Faster RCNN loss and the segmentation loss. In the test stage, the CFN outputs the predicted feature channel map (e.g., a semantic segmentation map). Before the Fast RCNN subnet, the ROI pooling layer pools over the concatenation of the base network's output layer and the aggregated feature maps of the CFN. With the help of this extra feature information, HyperLearner improves detection performance.
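A minimal PyTorch-style sketch of the aggregation step just described, assuming the feature maps have already passed through their 3 × 3 convolutions (omitted here); shapes and names are illustrative.

```python
import torch
import torch.nn.functional as F

def aggregate_feature_maps(feature_maps, conv1_hw):
    """Upsample each selected layer to the resolution of conv1 and
    concatenate the results along the channel axis."""
    upsampled = [F.interpolate(fm, size=conv1_hw, mode="bilinear",
                               align_corners=False)
                 for fm in feature_maps]
    return torch.cat(upsampled, dim=1)

# Example with three dummy feature maps of decreasing resolution.
fms = [torch.randn(1, c, s, s) for c, s in [(64, 112), (128, 56), (256, 28)]]
agg = aggregate_feature_maps(fms, conv1_hw=(112, 112))  # (1, 448, 112, 112)
```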
[Fig.: Multilayer image channels. The first layer is HOG+LUV (128 × 64); the remaining layers are the convolutional layers C1 to C5 of VGG16 (64 × 32, 32 × 16, 16 × 8, 8 × 4, and 4 × 2).]
[Table: Miss rates (MR, %) of MCF based on HOG+LUV and different layers of the CNN on the Caltech test set. A check mark means the corresponding layer is used; HOG+LUV is always used as the first layer, and the remaining layers come from VGG16.]
[Table: Miss rate (MR) and detection time of MCF-2, MCF-6, and MCF-6-f. MCF-2 is based on HOG+LUV and C5; MCF-6 is based on HOG+LUV and C1–C5; MCF-6-f is a fast version of MCF-6 that eliminates overlapped detection windows.]
4 Challenges of Object Detection
Though object detection has made great progress over the past decades, many challenges remain. In the following, three common and typical challenges of object detection are discussed, together with some representative solutions.
4.1 Scale Variation Problem
Because the distance from objects to the camera varies, objects appear at various scales in an image; scale variation is therefore an inevitable problem for object detection. Solutions can be divided into two main classes: (1) image pyramid-based methods and (2) feature pyramid-based methods. Generally, image pyramid-based methods first resize the original image to multiple scales and then apply the same detector to each rescaled image (see the sketch below). Feature pyramid-based methods generate multiple feature maps of different resolutions from the input image and use different feature maps to detect objects of different scales.
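A minimal sketch of the image-pyramid strategy, assuming a `detector` callable that returns boxes as (x1, y1, x2, y2, score) tuples; OpenCV is used here only for resizing, and the NMS step that would merge the results is omitted.

```python
import cv2  # used only for image resizing in this sketch

def detect_with_image_pyramid(image, detector, scales=(0.5, 1.0, 2.0)):
    """Run the same detector on each rescaled copy of the image and map
    the resulting boxes back to the original resolution."""
    all_boxes = []
    for s in scales:
        resized = cv2.resize(image, None, fx=s, fy=s)
        for (x1, y1, x2, y2, score) in detector(resized):
            all_boxes.append((x1 / s, y1 / s, x2 / s, y2 / s, score))
    return all_boxes  # NMS over the merged list would follow
```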
At first, deep object detectors adopted image pyramids to detect objects of various scales; RCNN, SPPnet, Fast RCNN, and Faster RCNN all do so. In the training stage, the CNN detector is trained on images of a given scale, while in the test stage image pyramids are used for multi-scale detection. On the one hand, this causes an inconsistency between training and test inference. On the other hand, each image of the pyramid must be put through the CNN separately, which is very time-consuming.
In fact, the feature maps of different resolutions in a CNN can themselves be seen as a feature pyramid. If feature maps of different resolutions are used to detect objects of different scales, the input image does not need to be resized, which accelerates detection. Feature pyramid-based methods have therefore become popular, and researchers have made many attempts along this line.
Li et al. proposed scale-aware Fast RCNN (SAF RCNN) for pedestrian detection. The base network is split into two sub-networks, one for large-scale and one for small-scale pedestrian detection. Given a detection window, the final detection score is the weighted sum of the two sub-networks' scores: if the detection window is relatively large, the large-scale sub-network receives a relatively large weight; if it is relatively small, the small-scale sub-network does.
Yang et al. proposed scale-dependent pooling (SDP) to handle the scale variation problem in multi-scale object detection. It is based on the Fast RCNN architecture, with proposals extracted by selective search. SDP pools the features of each proposal from a different convolutional layer according to the proposal's height: if the height is in [0, 64], SDP pools the feature maps from the third convolutional block; if it is in [64, 128], from the fourth convolutional block; and if it is in [128, +∞), from the fifth convolutional block (see the sketch below). Because the ROI features are pooled from different convolutional layers, three different subnets for classifying and locating proposals are trained, one per block.
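The layer-selection rule reduces to a simple lookup, sketched below; the boundary values 64 and 128 come from the text (the assignment at the exact boundaries is ambiguous in the source, so the choice here is arbitrary).

```python
def sdp_pooling_block(proposal_height):
    """Choose the convolutional block to pool ROI features from,
    based on the proposal height in pixels."""
    if proposal_height <= 64:
        return "conv3"
    elif proposal_height <= 128:
        return "conv4"
    return "conv5"

assert sdp_pooling_block(50) == "conv3"
assert sdp_pooling_block(100) == "conv4"
assert sdp_pooling_block(300) == "conv5"
```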
Because small-scale objects are usually low-resolution and noisy, detecting them is more challenging than detecting large-scale objects. Though multi-scale methods treat objects of different scales as different classes, the resulting improvement is relatively limited; improving small-scale object detection is therefore key to multi-scale object detection. To address this, Li et al. proposed the perceptual generative adversarial network (Perceptual GAN). Based on the difference between the feature representations of small-scale and large-scale objects, Perceptual GAN aims to compensate the feature representations of small-scale objects so that they resemble those of large-scale objects.
The pedestrians in the Caltech training set are split into three subsets, called "train-all," "train-small," and "train-large." "Train-all" contains all the pedestrians; "train-small" contains the pedestrians under 100 pixels tall plus interpolated versions of the pedestrians over 100 pixels tall; "train-large" contains the pedestrians over 80 pixels tall. Two subsets of the Caltech test set (i.e., reasonable and small) are used for evaluation. The reasonable test set contains pedestrians over 50 pixels tall under no or partial occlusion, while the small test set contains pedestrians between 50 and 100 pixels tall; that is, the small test set is a subset of the reasonable test set.
[Table: Miss rates (MR) of MCF-V and MCF-J on the Caltech test set. MCF-V is learned from HOG+LUV and the fine-tuned VGG16; MCF-J is learned from HOG+LUV and the proposed JCS-Net.]
[Table: Miss rates (MR) of MS-V and MS-J on the Caltech test set. MS-V is multi-scale MCF based on the fine-tuned VGG16; MS-J is multi-scale MCF based on JCS-Net.]
4.2 Occlusion Problem
Object occlusion is very common. For example, Dollár et al. found that most pedestrians (about 70%) in street scenes are occluded in at least one frame. Thus, detecting occluded objects is necessary and important for computer vision applications. Over the past decade, researchers have made many attempts to solve the occlusion problem.
Wang et al. found that if some parts of a pedestrian are occluded, the block features of the corresponding regions respond consistently in the blockwise scores of the linear classifier. Based on this phenomenon, they proposed to use the score of each block to judge whether the corresponding region is occluded. The occlusion likelihood image formed by the block scores is segmented by the mean shift approach, and if occlusion occurs, a part detector is applied to the unoccluded regions to produce the final detection result.
To maximize detection performance on occluded pedestrians, Mathias et al. proposed to learn a set of occlusion-specific pedestrian detectors, each handling one type of occlusion. Occlusions are divided into three types: from the bottom, from the right, and from the left; for each type, the degree of occlusion ranges from 0% to 50%. Eight left/right occlusion detectors and sixteen bottom occlusion detectors are trained. One naïve way to obtain these classifiers is to train a classifier for every occlusion level separately, but this is very time-consuming. To reduce the training time, Franken-classifiers are proposed: training starts from a biased full-body classifier, from which weak classifiers are removed to generate the first occlusion classifier; additional weak classifiers for the first occlusion classifier are then learned without bias, and the second occlusion classifier is learned from the first in the same way. With Franken-classifiers, training the whole set of occlusion-specific detectors requires only about one tenth of the computation cost.
Inspired by the set of occlusion-specific detectors, Tian et al. extended the idea by constructing an extensive deep part pool and automatically choosing the important parts for occlusion handling with a linear SVM. The part pool contains various body parts: a pedestrian is treated as a rigid object divided into a 2m × m spatial grid, and the pool consists of all rectangles inside the grid whose height and width are at least 2 cells. If m = 3, the part pool has 45 part models (see the sketch below). To reduce the test-time cost of 45 part models, the 6 part models with the highest SVM scores are selected, which yields approximately the same performance with faster detection speed.
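The size of the part pool can be checked by direct enumeration, as in the short sketch below; the (row, column, height, width) encoding of a part is an illustrative choice.

```python
def part_pool(m=3, min_side=2):
    """All rectangles inside the 2m x m grid whose height and width are
    at least 2 cells, encoded as (row, col, height, width)."""
    rows, cols = 2 * m, m
    return [(r, c, h, w)
            for h in range(min_side, rows + 1)
            for w in range(min_side, cols + 1)
            for r in range(rows - h + 1)
            for c in range(cols - w + 1)]

print(len(part_pool(3)))  # 45, matching the count given in the text
```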
Wang et al. argued that a data-driven strategy is important for solving the occlusion problem: if the training data contained enough images of all occlusion situations, the trained detector would perform better on occluded objects. However, datasets generally cannot cover all occlusion cases, and occlusions are relatively rare; they follow a long-tail distribution, so it is impractical to collect a dataset that covers them all. To solve this problem, Wang et al. proposed A-Fast-RCNN, which uses an adversarial network to generate hard examples by spatially blocking some feature maps. After the ROI pooling layer, an extra branch, consisting of two fully connected layers and a mask prediction layer, generates an occlusion mask. The feature maps used for the final classification and regression combine the mask with the ROI-pooled feature map: if a cell of the mask equals 1, the corresponding feature responses are set to 0 (i.e., dropped out), as sketched below. The training adopts stage-wise steps: (1) with the occlusion mask fixed, the Fast RCNN network is trained; (2) given the trained Fast RCNN, the adversarial network generates the occlusion mask that maximizes the Fast RCNN loss. Through this two-stage training, the final Fast RCNN becomes robust to occlusion. In the test stage, the mask branch is removed, and the rest of the process is the same as in the original Fast RCNN.
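A minimal PyTorch-style sketch of the masking step; the shapes and the random mask are illustrative, and the adversarial branch that would predict the mask is omitted.

```python
import torch

def apply_occlusion_mask(roi_features, mask):
    """Zero out ROI feature responses wherever the occlusion mask is 1
    (a spatial dropout on the pooled feature map)."""
    # roi_features: (N, C, H, W); mask: (N, 1, H, W) with values in {0, 1}
    return roi_features * (1.0 - mask)

feats = torch.randn(4, 256, 7, 7)
mask = (torch.rand(4, 1, 7, 7) > 0.8).float()  # illustrative random mask
occluded = apply_occlusion_mask(feats, mask)
```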
4.3 Deformation Problem
Object deformation can be caused by non-rigid deformation, intra-class shape variability, and so on; for example, people can jump or squat. A good object detection method should therefore be robust to object deformation. Before CNNs, researchers made many attempts to handle deformation. For example, DPM uses mixtures of multi-scale deformable part models (i.e., one low-resolution root model and six high-resolution part models). HSC further incorporates histograms of sparse codes into the deformable part model. To accelerate the detection speed of DPM, CDPM and FDPM were proposed. Park et al. proposed to detect large-scale pedestrians with a deformable part model and small-scale pedestrians with a rigid template. Regionlets represents a region by a set of small sub-regions with different sizes and aspect ratios.
Though CNN-based methods are robust to object deformation to some degree, they are still not good enough, and researchers have incorporated specific designs into CNN-based methods to improve robustness further. Ouyang et al. proposed the deformation-constrained pooling layer (def-pooling) to model the deformation of object parts; traditional pooling (e.g., max pooling and average pooling) can be replaced by def-pooling to better represent the deformation properties of objects. Recently, Dai et al. proposed two deformable modules (i.e., deformable convolution and deformable ROI pooling) to enhance the ability to represent geometric transformations: 2D offsets, learned during training, are added to the regular grid sampling locations of the standard convolution (see the sketch below). Jeon and Kim also proposed active convolution, in which the shape of the convolution is learned during training. To improve invariance to large deformations and transformations, Jaderberg et al. proposed the spatial transformer network, which performs scaling, rotation, and non-rigid deformations on the feature map.
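A simplified PyTorch sketch of the core sampling idea behind deformable convolution: learned per-location 2D offsets shift the regular grid before features are gathered bilinearly. The convolution weights that would follow, and the per-kernel-tap offsets of the full method, are omitted; names and shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def deformable_sample(x, offsets):
    """Sample x at the regular grid shifted by learned offsets (in pixels).
    x: (N, C, H, W); offsets: (N, H, W, 2)."""
    N, C, H, W = x.shape
    ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                            torch.arange(W, dtype=torch.float32),
                            indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).unsqueeze(0) + offsets
    # Normalize coordinates to [-1, 1] as required by grid_sample.
    grid[..., 0] = 2.0 * grid[..., 0] / (W - 1) - 1.0
    grid[..., 1] = 2.0 * grid[..., 1] / (H - 1) - 1.0
    return F.grid_sample(x, grid, mode="bilinear", align_corners=True)

x = torch.randn(1, 8, 16, 16)
offsets = torch.zeros(1, 16, 16, 2)  # zero offsets recover regular sampling
y = deformable_sample(x, offsets)    # equals x up to interpolation error
```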
- 1. Bell, S., Zitnick, C. L., Bala, K., and Girshick, R.: Inside-Outside Net: Detecting objects in context with skip pooling and recurrent neural networks. in Proc. IEEE Conf. Computer Vision and Pattern Recognition (2016)
- 2. Benenson, R., Mathias, M., Tuytelaars, T., and Van Gool, L.: Seeking the strongest rigid detector. in Proc. IEEE Conf. Computer Vision and Pattern Recognition (2013)
- 3. Benenson, R., Omran, M., Hosang, J., and Schiele, B.: Ten years of pedestrian detection, what have we learned? in Proc. Eur. Conf. Comput. Vis. (2014)
- 4. Cai, Z., Saberian, M., and Vasconcelos, N.: Learning complexity-aware cascades for deep pedestrian detection. in Proc. Int. Conf. Comput. Vis. (2015)
- 5. Cai, Z., Fan, Q., Feris, R. S., and Vasconcelos, N.: A unified multi-scale deep convolutional neural network for fast object detection. in Proc. Eur. Conf. Comput. Vis. (2016)
- 6. Cao, J., Pang, Y., and Li, X.: Pedestrian detection inspired by appearance constancy and shape symmetry. in Proc. IEEE Conf. Computer Vision and Pattern Recognition (2016)
- 9. Cheng, M.-M., Zhang, Z., Lin, W.-Y., and Torr, P.: BING: Binarized normed gradients for objectness estimation at 300fps. in Proc. IEEE Conf. Computer Vision and Pattern Recognition (2014)
- 11. Dai, J., Li, Y., He, K., and Sun, J.: R-FCN: Object detection via region-based fully convolutional networks. in Proc. Advances in Neural Information Processing Systems (2016)
- 12. Dai, J., Qi, H., Xiong, Y., Li, Y., Zhang, G., Hu, H., and Wei, Y.: Deformable convolutional networks. in Proc. Int. Conf. Comput. Vis. (2017)
- 13. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. in Proc. IEEE Conf. Computer Vision and Pattern Recognition (2009)
- 14. Dollár, P., Tu, Z., Perona, P., and Belongie, S.: Integral channel features. in Proc. Brit. Mach. Vis. Conf. (2009)
- 18. Felzenszwalb, P. F., Girshick, R., and McAllester, D.: Cascade object detection with deformable part models. in Proc. IEEE Conf. Computer Vision and Pattern Recognition (2010)
- 20. Fu, C.-Y., Liu, W., Ranga, A., Tyagi, A., and Berg, A. C.: DSSD: Deconvolutional single shot detector. CoRR abs/1701.06659 (2017)
- 21. Geiger, A., Lenz, P., and Urtasun, R.: Are we ready for autonomous driving? The KITTI vision benchmark suite. in Proc. IEEE Conf. Computer Vision and Pattern Recognition (2012)
- 22. Gidaris, S., and Komodakis, N.: LocNet: Improving localization accuracy for object detection. in Proc. IEEE Conf. Computer Vision and Pattern Recognition (2016)
- 23. Girshick, R., Donahue, J., Darrell, T., and Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. in Proc. IEEE Conf. Computer Vision and Pattern Recognition (2014)
- 24. Girshick, R.: Fast R-CNN. in Proc. Int. Conf. Comput. Vis. (2015)
- 26. He, K., Zhang, X., Ren, S., and Sun, J.: Deep residual learning for image recognition. in Proc. IEEE Conf. Computer Vision and Pattern Recognition (2016)
- 27. He, K., Gkioxari, G., Dollár, P., and Girshick, R.: Mask R-CNN. in Proc. Int. Conf. Comput. Vis. (2017)
- 28. Hosang, J., Omran, M., Benenson, R., and Schiele, B.: Taking a deeper look at pedestrians. in Proc. IEEE Conf. Computer Vision and Pattern Recognition (2015)
- 29. Jaderberg, M., Simonyan, K., Zisserman, A., and Kavukcuoglu, K.: Spatial transformer networks. in Proc. Advances in Neural Information Processing Systems (2015)
- 30. Jeon, Y., and Kim, J.: Active convolution: Learning the shape of convolution for image classification. in Proc. IEEE Conf. Computer Vision and Pattern Recognition (2017)
- 31. Kong, T., Yao, A., Chen, Y., and Sun, F.: HyperNet: Towards accurate region proposal generation and joint object detection. in Proc. IEEE Conf. Computer Vision and Pattern Recognition (2016)
- 32. Kong, T., Sun, F., Yao, A., Liu, H., Lu, M., and Chen, Y.: RON: Reverse connection with objectness prior networks for object detection. in Proc. IEEE Conf. Computer Vision and Pattern Recognition (2017)
- 33. Krizhevsky, A., Sutskever, I., and Hinton, G. E.: ImageNet classification with deep convolutional neural networks. in Proc. Advances in Neural Information Processing Systems (2012)
- 34. Li, J., Liang, X., Shen, S., Xu, T., and Yan, S.: Scale-aware Fast R-CNN for pedestrian detection. CoRR abs/1510.08160 (2015)
- 35. Li, J., Liang, X., Wei, Y., Xu, T., Feng, J., and Yan, S.: Perceptual generative adversarial networks for small object detection. in Proc. IEEE Conf. Computer Vision and Pattern Recognition (2017)
- 36. Lin, T.-Y., Goyal, P., Girshick, R., He, K., and Dollár, P.: Focal loss for dense object detection. in Proc. Int. Conf. Comput. Vis. (2017)
- 37. Lin, T.-Y., Dollár, P., Girshick, R., He, K., Hariharan, B., and Belongie, S.: Feature pyramid networks for object detection. in Proc. IEEE Conf. Computer Vision and Pattern Recognition (2017)
- 38. Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., and Berg, A. C.: SSD: Single shot multibox detector. in Proc. Eur. Conf. Comput. Vis. (2016)
- 40. Mathias, M., Benenson, R., Timofte, R., and Van Gool, L.: Handling occlusions with Franken-classifiers. in Proc. Int. Conf. Comput. Vis. (2013)
- 41. Mao, J., Xiao, T., Jiang, Y., and Cao, Z.: What can help pedestrian detection? in Proc. IEEE Conf. Computer Vision and Pattern Recognition (2017)
- 42. Najibi, M., Rastegari, M., and Davis, L. S.: G-CNN: An iterative grid based object detector. in Proc. IEEE Conf. Computer Vision and Pattern Recognition (2016)
- 43. Nam, W., Dollár, P., and Han, J.: Local decorrelation for improved pedestrian detection. in Proc. Advances in Neural Information Processing Systems (2014)
- 45. Ouyang, W., Wang, X., Zeng, X., Qiu, S., Luo, P., Tian, Y., Li, H., Yang, S., Wang, Z., Loy, C.-C., and Tang, X.: DeepID-Net: Deformable deep convolutional neural networks for object detection. in Proc. IEEE Conf. Computer Vision and Pattern Recognition (2015)
- 47. Pang, Y., Cao, J., and Shao, L.: Small-scale pedestrian detection by joint classification and super-resolution into a unified network. Tech. report (2017)
- 48. Park, D., Ramanan, D., and Fowlkes, C.: Multiresolution models for object detection. in Proc. Eur. Conf. Comput. Vis. (2010)
- 49. Redmon, J., Divvala, S., Girshick, R., and Farhadi, A.: You only look once: Unified, real-time object detection. in Proc. IEEE Conf. Computer Vision and Pattern Recognition (2016)
- 50. Ren, J., Chen, X., Liu, J., Sun, W., Pang, J., Yan, Q., Tai, Y., and Xu, L.: Accurate single stage detector using recurrent rolling convolution. in Proc. IEEE Conf. Computer Vision and Pattern Recognition (2017)
- 51. Ren, X., and Ramanan, D.: Histograms of sparse codes for object detection. in Proc. IEEE Conf. Computer Vision and Pattern Recognition (2013)
- 52. Ren, S., He, K., Girshick, R., and Sun, J.: Faster R-CNN: Towards real-time object detection with region proposal networks. in Proc. Advances in Neural Information Processing Systems (2015)
- 53. Sermanet, P., Kavukcuoglu, K., Chintala, S., and LeCun, Y.: Pedestrian detection with unsupervised multi-stage feature learning. in Proc. IEEE Conf. Computer Vision and Pattern Recognition (2013)
- 54. Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R., and LeCun, Y.: OverFeat: Integrated recognition, localization and detection using convolutional networks. in Proc. Int. Conf. Learning Representations (2014)
- 55. Shrivastava, A., Gupta, A., and Girshick, R.: Training region-based object detectors with online hard example mining. in Proc. IEEE Conf. Computer Vision and Pattern Recognition (2016)
- 56. Simonyan, K., and Zisserman, A.: Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556 (2014)
- 57. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A.: Going deeper with convolutions. in Proc. IEEE Conf. Computer Vision and Pattern Recognition (2015)
- 58. Tian, Y., Luo, P., Wang, X., and Tang, X.: Deep learning strong parts for pedestrian detection. in Proc. Int. Conf. Comput. Vis. (2015)
- 59. Tian, Y., Luo, P., Wang, X., and Tang, X.: Pedestrian detection aided by deep learning semantic tasks. in Proc. IEEE Conf. Computer Vision and Pattern Recognition (2015)
- 62. Wang, X., Han, T. X., and Yan, S.: An HOG-LBP human detector with partial occlusion handling. in Proc. Int. Conf. Comput. Vis. (2009)
- 63. Wang, X., Yang, M., Zhu, S., and Lin, Y.: Regionlets for generic object detection. in Proc. Int. Conf. Comput. Vis. (2013)
- 64. Wang, X., Shrivastava, A., and Gupta, A.: A-Fast-RCNN: Hard positive generation via adversary for object detection. in Proc. IEEE Conf. Computer Vision and Pattern Recognition (2017)
- 65. Yan, J., Lei, Z., Wen, L., and Li, S. Z.: The fastest deformable part model for object detection. in Proc. IEEE Conf. Computer Vision and Pattern Recognition (2014)
- 66. Yang, B., Yan, J., Lei, Z., and Li, S. Z.: Convolutional channel features. in Proc. Int. Conf. Comput. Vis. (2015)
- 67. Yang, B., Yan, J., Lei, Z., and Li, S. Z.: CRAFT objects from images. in Proc. IEEE Conf. Computer Vision and Pattern Recognition (2016)
- 68. Yang, F., Choi, W., and Lin, Y.: Exploit all the layers: Fast and accurate CNN object detector with scale dependent pooling and cascaded rejection classifiers. in Proc. IEEE Conf. Computer Vision and Pattern Recognition (2016)
- 69. Yu, F., Koltun, V., and Funkhouser, T.: Dilated residual networks. in Proc. IEEE Conf. Computer Vision and Pattern Recognition (2017)
- 70. Zagoruyko, S., Lerer, A., Lin, T.-Y., Pinheiro, P. O., Gross, S., Chintala, S., and Dollár, P.: A multipath network for object detection. in Proc. Brit. Mach. Vis. Conf. (2016)
- 71. Zhang, L., Lin, L., Liang, X., and He, K.: Is Faster R-CNN doing well for pedestrian detection? in Proc. Eur. Conf. Comput. Vis. (2016)
- 72. Zhang, S., Bauckhage, C., and Cremers, A. B.: Informed Haar-like features improve pedestrian detection. in Proc. IEEE Conf. Computer Vision and Pattern Recognition (2014)
- 73. Zhang, S., Benenson, R., and Schiele, B.: Filtered channel features for pedestrian detection. in Proc. IEEE Conf. Computer Vision and Pattern Recognition (2015)
- 74. Zhang, S., Benenson, R., Hosang, J., and Schiele, B.: CityPersons: A diverse dataset for pedestrian detection. in Proc. IEEE Conf. Computer Vision and Pattern Recognition (2017)
- 75. Zitnick, C. L., and Dollár, P.: Edge boxes: Locating object proposals from edges. in Proc. Eur. Conf. Comput. Vis. (2014)