1 Introduction

Pedestrian detection, as a key component of real-world applications such as automatic driving and intelligent surveillance, has attracted special attention beyond general object detection. Despite the prevalent success of deeply learned features in computer vision, current leading pedestrian detectors (e.g., [14]) are in general hybrid methods that combine traditional, hand-crafted features [5, 6] and deep convolutional features [7, 8]. For example, in [1] a stand-alone pedestrian detector [9] (that uses Squares Channel Features) is adopted as a highly selective proposer (<3 regions per image), followed by R-CNN [10] for classification. Hand-crafted features appear to be of critical importance for state-of-the-art pedestrian detection.

On the other hand, Faster R-CNN [11] is a particularly successful method for general object detection. It consists of two components: a fully convolutional Region Proposal Network (RPN) for proposing candidate regions, followed by a downstream Fast R-CNN [12] classifier. The Faster R-CNN system is thus a purely CNN-based method without using hand-crafted features (e.g., Selective Search [13] that is based on low-level features). Despite its leading accuracy on several multi-category benchmarks, Faster R-CNN has not presented competitive results on popular pedestrian detection datasets (e.g., the Caltech set [14]).

In this paper, we investigate the issues involving Faster R-CNN as a pedestrian detector. Interestingly, we find that an RPN specially tailored for pedestrian detection achieves competitive results as a stand-alone pedestrian detector. But surprisingly, the accuracy is degraded after feeding these proposals into the Fast R-CNN classifier. We argue that such unsatisfactory performance is attributed to two reasons as follows.

Fig. 1. Two challenges for Fast/Faster R-CNN in pedestrian detection. (a) Small objects that may fail RoI pooling on low-resolution feature maps. (b) Hard negative examples that receive no careful attention in Fast/Faster R-CNN.

First, the convolutional feature maps of the Fast R-CNN classifier are of low resolution for detecting small objects. Typical scenarios of pedestrian detection, such as automatic driving and intelligent surveillance, generally present pedestrian instances of small sizes (e.g., \(28\times 70\) for Caltech [14]). On small objects (Fig. 1(a)), the Region-of-Interest (RoI) pooling layer [12, 15] performed on a low-resolution feature map (usually with a stride of 16 pixels) can lead to “plain” features caused by collapsing bins. These features are not discriminative on small regions, and thus degrade the downstream classifier. We note that this is in contrast to hand-crafted features that have finer resolutions. We address this problem by pooling features from shallower but higher-resolution layers, and by the hole algorithm (namely, “à trous” [16] or filter rarefaction [17]) that increases feature map size.
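To make the bin-collapse issue concrete, the following back-of-the-envelope Python sketch (our illustration, not part of the original method) counts roughly how many feature-map cells a \(28\times 70\) Caltech-sized pedestrian spans at the strides of different VGG-16 layers, compared against the fixed \(7\times 7\) RoI pooling grid.

```python
# Illustrative arithmetic only: how a typical Caltech-sized pedestrian box
# maps onto feature maps of different strides before 7x7 RoI pooling.
def cells_covered(box_w, box_h, stride):
    """Approximate number of feature-map cells spanned by a box at a given stride."""
    return max(1, round(box_w / stride)), max(1, round(box_h / stride))

roi_out = 7  # fixed RoI pooling output resolution (7x7), as in Fast/Faster R-CNN
for stride in (16, 8, 4):  # Conv5_3, Conv4_3, Conv3_3 of VGG-16, respectively
    w_cells, h_cells = cells_covered(28, 70, stride)
    collapsed = w_cells < roi_out or h_cells < roi_out
    print(f"stride {stride:2d}: ~{w_cells}x{h_cells} cells -> "
          f"{'bins collapse' if collapsed else 'enough cells for a 7x7 grid'}")
# stride 16: ~2x4 cells  -> many of the 49 output bins replicate the same cell
# stride  8: ~4x9 cells  -> the width dimension still collapses
# stride  4: ~7x18 cells -> roughly enough cells for a 7x7 grid
```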

Second, in pedestrian detection the false predictions are predominantly caused by confusion with hard background instances (Fig. 1(b)). This is in contrast to general object detection, where a main source of confusion is from multiple categories. To address hard negative examples, we adopt a cascaded Boosted Forest (BF) [18, 19], which performs effective hard negative mining (bootstrapping) and sample re-weighting, to classify the RPN proposals. Unlike previous methods that use hand-crafted features to train the forest, in our method the BF reuses the deep convolutional features of the RPN. This strategy not only reduces the computational cost of the classifier by sharing features, but also exploits the deeply learned features.

As such, we present a surprisingly simple but effective baseline for pedestrian detection based on RPN and BF. Our method overcomes two limitations of Faster R-CNN for pedestrian detection and gets rid of traditional hand-crafted features. We present compelling results on several benchmarks, including Caltech [14], INRIA [20], ETH [21], and KITTI [22]. Remarkably, our method has substantially better localization accuracy and shows a relative improvement of 40 % on the Caltech dataset under an Intersection-over-Union (IoU) threshold of 0.7 for evaluation. Meanwhile, our method has a test-time speed of 0.5 s per image, which is competitive with previous leading methods.

In addition, our paper reveals that traditional pedestrian detectors have been inherited in recent methods for at least two reasons. First, the higher resolution of hand-crafted features (such as [5, 6]) and their pyramids is good for detecting small objects. Second, effective bootstrapping is performed for mining hard negative examples. When appropriately handled in a deep learning system, however, these key factors lead to excellent results.

2 Related Work

The Integral Channel Features (ICF) detector [5], which extends the Viola-Jones framework [23], is among the most popular pedestrian detectors that do not use deep learning features. The ICF detector involves channel feature pyramids and boosted classifiers. The feature representations of ICF have been improved in several ways, including ACF [6], LDCF [24], SCF [9], and many others, but the boosting algorithm remains a key building block for pedestrian detection.

Driven by the success of (“slow”) R-CNN [10] for general object detection, a recent series of methods [1–3] adopt a two-stage pipeline for pedestrian detection. In [1], the SCF pedestrian detector [9] is used to propose regions, followed by an R-CNN for classification; TA-CNN [2] employs the ACF detector [6] to generate proposals, and trains an R-CNN-style network to jointly optimize pedestrian detection with semantic tasks; the DeepParts method [3] applies the LDCF detector [24] to generate proposals and learns a set of complementary parts by neural networks. We note that these proposers are stand-alone pedestrian detectors consisting of hand-crafted features and boosted classifiers.

Unlike the above R-CNN-based methods, the CompACT method [4] learns boosted classifiers on top of hybrid hand-crafted and deep convolutional features. Most closely related to our work, the CCF detector [25] consists of boosted classifiers on pyramids of deep convolutional features, but uses no region proposals. Our method uses no feature pyramid, and is much faster and more accurate than [25].

3 Approach

Our approach consists of two components (illustrated in Fig. 2): an RPN that generates candidate boxes as well as convolutional feature maps, and a Boosted Forest that classifies these proposals using these convolutional features.

Fig. 2. Our pipeline. RPN is used to compute candidate bounding boxes, scores, and convolutional feature maps. The candidate boxes are fed into cascaded Boosted Forests (BF) for classification, using the features pooled from the convolutional feature maps computed by RPN.

3.1 Region Proposal Network for Pedestrian Detection

The RPN in Faster R-CNN [11] was developed as a class-agnostic detector (proposer) in the scenario of multi-category object detection. For single-category detection, RPN is naturally a detector for the only category concerned. We specially tailor the RPN for pedestrian detection, as introduced in the following.

We adopt anchors (reference boxes) [11] of a single aspect ratio of 0.41 (width to height). This is the average aspect ratio of pedestrians as indicated in [14], unlike the original RPN [11] that has anchors of multiple aspect ratios. Anchors of inappropriate aspect ratios are associated with few examples, and so are noisy and harmful for detection accuracy. In addition, we use anchors of 9 different scales, starting from a height of 40 pixels with a scaling stride of 1.3\(\times \). This spans a wider range of scales than [11]. The use of multi-scale anchors waives the requirement of using feature pyramids to detect multi-scale objects.
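As a concrete illustration, a minimal NumPy sketch of this anchor design is given below (our reconstruction from the text; the exact box convention and rounding used by the authors are not specified, so they are assumptions here).

```python
import numpy as np

# Sketch of the single-ratio, multi-scale pedestrian anchors described above:
# aspect ratio w/h = 0.41, 9 heights starting at 40 px with a 1.3x scaling stride.
def pedestrian_anchors(base_height=40.0, scale_step=1.3, num_scales=9, ratio=0.41):
    heights = base_height * scale_step ** np.arange(num_scales)
    widths = ratio * heights
    # anchors centered at the origin, in (x1, y1, x2, y2) form (our convention)
    return np.stack([-widths / 2, -heights / 2, widths / 2, heights / 2], axis=1)

anchors = pedestrian_anchors()
print(np.round(anchors[:, 3] - anchors[:, 1]))  # heights: 40, 52, 68, ..., ~326
```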

Following [11], we adopt the VGG-16 net [8] pre-trained on the ImageNet dataset [26] as the backbone network. The RPN is built on top of the Conv5_3 layer, which is followed by an intermediate \(3\times 3\) convolutional layer and two sibling \(1\times 1\) convolutional layers for classification and bounding box regression (more details in [11]). In this way, RPN regresses boxes with a stride of 16 pixels (Conv5_3). The classification layer provides confidence scores of the predicted boxes, which can be used as the initial scores of the Boosted Forest cascade that follows.
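For readers who prefer code, a schematic PyTorch version of this RPN head is sketched below (our approximation; the original implementation is in Caffe, and the channel widths simply follow the standard VGG-16 Conv5_3 output).

```python
import torch
import torch.nn as nn

# A schematic RPN head: a 3x3 intermediate conv on Conv5_3, then sibling 1x1
# convs for per-anchor classification scores and box regression, k = 9 anchors.
class RPNHead(nn.Module):
    def __init__(self, in_channels=512, mid_channels=512, num_anchors=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, mid_channels, 3, padding=1)
        self.relu = nn.ReLU(inplace=True)
        self.cls = nn.Conv2d(mid_channels, 2 * num_anchors, 1)  # pedestrian vs. background
        self.reg = nn.Conv2d(mid_channels, 4 * num_anchors, 1)  # box regression deltas

    def forward(self, conv5_3):
        x = self.relu(self.conv(conv5_3))
        return self.cls(x), self.reg(x)

# Conv5_3 of VGG-16 has stride 16, so a 720x960 input gives a 45x60 feature map.
head = RPNHead()
scores, deltas = head(torch.zeros(1, 512, 45, 60))
print(scores.shape, deltas.shape)  # (1, 18, 45, 60), (1, 36, 45, 60)
```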

It is noteworthy that although we will use the “à trous” [16] trick in the following section to increase resolution and reduce stride, we keep using the same RPN with a stride of 16 pixels. The à trous trick is only exploited when extracting features (as introduced next), but not for fine-tuning.

3.2 Feature Extraction

With the proposals generated by RPN, we adopt RoI pooling [12] to extract fixed-length features from regions. These features will be used to train the BF as introduced in the next section. Unlike Faster R-CNN, which requires these features to be fed into the original fully-connected (fc) layers and thus limits their dimensions, the BF classifier imposes no constraint on the dimensions of features. For example, we can extract features from RoIs on Conv3_3 (with a stride of 4 pixels) and Conv4_3 (with a stride of 8 pixels). We pool the features into a fixed resolution of \(7\times 7\). These features from different layers are simply concatenated without normalization, thanks to the flexibility of the BF classifier; in contrast, feature normalization needs to be carefully addressed [27] for deep classifiers when concatenating features.
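A minimal sketch of this multi-layer feature extraction, using torchvision's RoI pooling as a stand-in for the original Caffe implementation, is shown below; the image size, the single RoI, and the random feature maps are placeholders.

```python
import torch
from torchvision.ops import roi_pool

# Sketch: pool 7x7 RoI features from two layers of different strides and
# concatenate them for the downstream Boosted Forest (random stand-in maps).
conv3_3 = torch.randn(1, 256, 180, 240)  # stride 4 on a 720x960 input
conv4_3 = torch.randn(1, 512, 90, 120)   # stride 8
rois = torch.tensor([[0, 100.0, 80.0, 128.0, 150.0]])  # (batch_idx, x1, y1, x2, y2) in image coords

f3 = roi_pool(conv3_3, rois, output_size=(7, 7), spatial_scale=1.0 / 4)
f4 = roi_pool(conv4_3, rois, output_size=(7, 7), spatial_scale=1.0 / 8)
# Flatten and concatenate without normalization; the BF classifier accepts
# feature vectors of arbitrary dimension.
features = torch.cat([f3.flatten(1), f4.flatten(1)], dim=1)
print(features.shape)  # (1, (256 + 512) * 7 * 7) = (1, 37632)
```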

Remarkably, as there is no constraint imposed on feature dimensions, it is flexible for us to use features of increased resolution. In particular, given the fine-tuned layers from RPN (stride = 4 on Conv3, 8 on Conv4, and 16 on Conv5), we can use the à trous trick [16] to compute convolutional feature maps of higher resolution. For example, we can set the stride of Pool3 to 1 and dilate all Conv4 filters by 2, which reduces the stride of Conv4 from 8 to 4. Unlike previous methods [16, 17] that fine-tune the dilated filters, in our method we only use them for feature extraction, without fine-tuning a new RPN.
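The following PyTorch sketch illustrates this à trous modification on a torchvision VGG-16; the layer indices into vgg16().features (16 for Pool3; 17, 19, 21 for the Conv4 convolutions) and the use of torchvision in place of the authors' Caffe model are our assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

# Sketch of the "a trous" trick described above: set Pool3 stride to 1 and
# dilate the Conv4 filters by 2, so Conv4_3 features have stride ~4 instead of 8.
net = vgg16(weights=None).features
net[16] = nn.MaxPool2d(kernel_size=2, stride=1)           # Pool3: stride 2 -> 1
for idx in (17, 19, 21):                                  # Conv4_1, Conv4_2, Conv4_3
    net[idx].dilation, net[idx].padding = (2, 2), (2, 2)  # dilate to keep the receptive field

x = torch.zeros(1, 3, 720, 960)
print(net[:23](x).shape)  # ~(1, 512, 179, 239): Conv4_3 now at roughly 4x finer stride
# (the unmodified network gives (1, 512, 90, 120), i.e. stride 8, for the same input)
```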

Though we adopt the same RoI resolution (\(7\times 7\)) as Faster R-CNN [11], these RoIs are on higher-resolution feature maps (e.g., Conv3_3, Conv4_3, or Conv4_3 à trous) than Fast R-CNN (Conv5_3). If an RoI’s input resolution is smaller than the output resolution (i.e., \(7\times 7\)), the pooling bins collapse and the features become “flat” and not discriminative. This problem is alleviated in our method, as our downstream classifier is not constrained to use the features of Conv5_3.

3.3 Boosted Forest

The RPN has generated the region proposals, confidence scores, and features, all of which are used to train a cascaded Boosted Forest classifier. We adopt the RealBoost algorithm [18], and mainly follow the hyper-parameters in [4]. We bootstrap the training 6 times, and the forests in the 6 stages have 64, 128, 256, 512, 1024, and 1536 trees, respectively. Initially, the training set consists of all positive examples (\(\sim \)50k on the Caltech set) and the same number of randomly sampled negative examples from the proposals. After each stage, additional hard negative examples (whose number is 10 % of the positives, \(\sim \)5k on Caltech) are mined and added into the training set. Finally, a forest of 2048 trees is trained after all bootstrapping stages. This final forest classifier is used for inference. Our implementation is based on [28].
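The bootstrapping schedule above can be summarized by the following schematic loop (a sketch only; train_forest and score_proposals are hypothetical stand-ins for a RealBoost implementation such as the toolbox of [28], and the features are assumed to be pre-extracted NumPy arrays).

```python
import numpy as np

# Schematic bootstrapping loop for the cascaded Boosted Forest.
def train_bf(pos_feats, neg_pool_feats, train_forest, score_proposals, rng=None):
    rng = rng or np.random.default_rng(0)
    stages = [64, 128, 256, 512, 1024, 1536]            # trees per bootstrapping stage
    neg_idx = rng.choice(len(neg_pool_feats), size=len(pos_feats), replace=False)
    neg_feats = neg_pool_feats[neg_idx]                 # start with as many negatives as positives
    for num_trees in stages:
        forest = train_forest(pos_feats, neg_feats, num_trees)
        # mine hard negatives: the highest-scoring proposals from the negative pool
        scores = score_proposals(forest, neg_pool_feats)
        num_hard = len(pos_feats) // 10                 # 10% of the positives (~5k on Caltech)
        hard = neg_pool_feats[np.argsort(-scores)[:num_hard]]
        neg_feats = np.concatenate([neg_feats, hard], axis=0)  # duplicates not handled in this sketch
    # final forest of 2048 trees, trained on the fully bootstrapped set and used at inference
    return train_forest(pos_feats, neg_feats, 2048)
```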

We note that it is not necessary to handle the initial proposals equally, because our proposals have initial confidence scores computed by RPN. In other words, the RPN can be considered as the stage-0 classifier \(f_0\), and we set \(f_{0} = \frac{1}{2} \log \frac{s}{1 - s}\) following the RealBoost form where s is the score of a proposal region (\(f_0\) is a constant in standard boosting). The other stages are as in standard RealBoost.
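In code, this stage-0 initialization is a one-liner; the epsilon clipping below is our addition to guard against RPN scores of exactly 0 or 1.

```python
import numpy as np

# Stage-0 score of the boosting cascade from the RPN confidence s:
# f0 = 0.5 * log(s / (1 - s)), following the RealBoost form described above.
def stage0_score(s, eps=1e-6):
    s = np.clip(s, eps, 1.0 - eps)  # avoid log(0) / division by zero (our safeguard)
    return 0.5 * np.log(s / (1.0 - s))

print(stage0_score(np.array([0.5, 0.9, 0.99])))  # [0.  1.0986...  2.2976...]
```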

3.4 Implementation Details

We adopt single-scale training and testing as in [11, 12, 15], without using feature pyramids. An image is resized such that its shorter edge has N pixels (N = 720 pixels on Caltech, 600 on INRIA, 810 on ETH, and 500 on KITTI). For RPN training, an anchor is considered a positive example if it has an Intersection-over-Union (IoU) ratio greater than 0.5 with any ground truth box, and a negative example otherwise. We adopt the image-centric training scheme [11, 12], and each mini-batch consists of 1 image and 120 randomly sampled anchors for computing the loss. The ratio of positive to negative samples is 1:5 in a mini-batch. Other hyper-parameters of RPN are as in [11], and we adopt the publicly available code of [11] to fine-tune the RPN. We note that in [11] the cross-boundary anchors are ignored during fine-tuning; in our implementation, we preserve the cross-boundary negative anchors during fine-tuning, which empirically improves accuracy on these datasets.
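A simplified NumPy sketch of this anchor labeling and sampling step is given below; the IoU matrix (anchors × ground-truth boxes) is assumed to be computed elsewhere, and edge cases such as images with too few negatives are not handled.

```python
import numpy as np

# Sketch of RPN anchor labeling and mini-batch sampling as described above:
# IoU > 0.5 with any ground-truth box -> positive, otherwise negative;
# 120 anchors per image are sampled at a 1:5 positive:negative ratio.
def sample_anchors(iou_matrix, num_samples=120, pos_neg_ratio=(1, 5), rng=None):
    rng = rng or np.random.default_rng(0)
    max_iou = iou_matrix.max(axis=1)                      # best overlap per anchor
    labels = (max_iou > 0.5).astype(np.int64)             # 1 = pedestrian, 0 = background
    pos_idx = np.flatnonzero(labels == 1)
    neg_idx = np.flatnonzero(labels == 0)
    num_pos = min(len(pos_idx), num_samples * pos_neg_ratio[0] // sum(pos_neg_ratio))  # 20 of 120
    num_neg = num_samples - num_pos
    chosen = np.concatenate([rng.choice(pos_idx, size=num_pos, replace=False),
                             rng.choice(neg_idx, size=num_neg, replace=False)])
    return chosen, labels[chosen]
```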

With the fine-tuned RPN, we adopt non-maximum suppression (NMS) with a threshold of 0.7 to filter the proposal regions. The proposal regions are then ranked by their scores. For BF training, we construct the training set by selecting the top-ranked 1000 proposals (and ground truths) of each image. The tree depth is set to 5 for the Caltech and KITTI sets, and 2 for the INRIA and ETH sets, empirically determined according to the different sizes of the datasets. At test time, we only use the top-ranked 100 proposals in an image, which are classified by the BF.
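For concreteness, this proposal filtering step can be sketched with torchvision's NMS as below; the toy boxes and scores are placeholders, while the thresholds follow the text.

```python
import torch
from torchvision.ops import nms

# Suppress overlapping RPN boxes at IoU 0.7, rank by score, and keep the
# top 1000 proposals for BF training (top 100 at test time).
def filter_proposals(boxes, scores, topk=1000, nms_thresh=0.7):
    keep = nms(boxes, scores, nms_thresh)   # kept indices, sorted by decreasing score
    return boxes[keep][:topk], scores[keep][:topk]

boxes = torch.tensor([[0., 0., 20., 50.], [2., 1., 22., 52.], [100., 40., 130., 110.]])
scores = torch.tensor([0.9, 0.8, 0.6])
print(filter_proposals(boxes, scores, topk=2)[0])  # 2nd box suppressed (IoU > 0.7 with the 1st)
```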

4 Experiments and Analysis

4.1 Datasets

We comprehensively evaluate on 4 benchmarks: Caltech [14], INRIA [20], ETH [21] and KITTI [22]. By default an IoU threshold of 0.5 is used for determining True Positives in these datasets.

On Caltech [14], the training data is augmented 10-fold (42,782 images) following [1]. The 4024 images in the standard test set are used for evaluation on the original annotations under the “reasonable” setting (pedestrians that are at least 50 pixels tall and at least 65 % visible) [14]. The evaluation metric is the log-average Miss Rate over the False Positives Per Image (FPPI) range \([10^{-2}, 10^{0}]\) (denoted as MR\(_{-2}\) following [29], or MR for short). In addition, we also test our model on the new annotations provided by [29], which correct the errors in the original annotations. This set is denoted as “Caltech-New”. The evaluation metrics on Caltech-New are MR\(_{-2}\) and MR\(_{-4}\), corresponding to the log-average Miss Rate over the FPPI ranges \([10^{-2}, 10^{0}]\) and \([10^{-4}, 10^{0}]\), following [29].
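For reference, a NumPy sketch of the MR\(_{-2}\) metric is given below; the sampling and interpolation details follow our reading of the Caltech protocol and may differ slightly from the official evaluation code, so treat it as illustrative only.

```python
import numpy as np

# Sketch of the log-average miss rate (MR_-2): the geometric mean of miss rates
# sampled at 9 FPPI values evenly spaced in log space over [1e-2, 1e0].
def log_average_miss_rate(fppi, miss_rate, low=-2.0, high=0.0, num_points=9):
    fppi, miss_rate = np.asarray(fppi, float), np.asarray(miss_rate, float)
    order = np.argsort(fppi)
    fppi, miss_rate = fppi[order], miss_rate[order]
    sampled = []
    for ref in np.logspace(low, high, num_points):
        below = np.flatnonzero(fppi <= ref)
        # miss rate at the largest FPPI not exceeding the reference point;
        # if the curve never gets that low, fall back to its lowest-FPPI point
        sampled.append(miss_rate[below[-1]] if below.size else miss_rate[0])
    return np.exp(np.mean(np.log(np.maximum(sampled, 1e-10))))
```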

The INRIA [20] and ETH [21] datasets are often used for verifying the generalization capability of the models. Following the settings in [30], our model is trained on the 614 positive and 1218 negative images in the INRIA training set. The models are evaluated on the 288 testing images of INRIA and the 1804 images of ETH, using MR\(_{-2}\).

The KITTI dataset [22] consists of images with stereo data available. We perform training on the 7481 images of the left camera, and evaluate on the standard 7518 test images. KITTI evaluates the PASCAL-style mean Average Precision (mAP) under three difficulty levels: “Easy”, “Moderate”, and “Hard”.

4.2 Ablation Experiments

In this subsection, we conduct ablation experiments on the Caltech dataset.

Is RPN Good for Pedestrian Detection?

In Fig. 3 we investigate RPN in terms of proposal quality, evaluated by the recall rates under different IoU thresholds. We evaluate on average 1, 4, or 100 proposals per image. Figure 3 shows that in general RPN performs better than three leading methods that are based on traditional features: SCF [9], LDCF [24], and Checkerboards [31]. With 100 proposals per image, our RPN achieves >95 % recall at an IoU of 0.7.

Fig. 3. Comparison of RPN and three existing methods in terms of proposal quality (recall vs. IoU) on the Caltech set, evaluated with on average 1, 4, or 100 proposals per image.

Fig. 4. Comparisons on the Caltech set (legends indicate MR).

More importantly, RPN as a stand-alone pedestrian detector achieves an MR of 14.9 % (Table 1). This result is competitive and is better than all but two state-of-the-art competitors on the Caltech dataset (Fig. 4). We note that unlike RoI pooling that may suffer from small regions, RPN is essentially based on fixed-size sliding windows (in a fully convolutional fashion) and thus avoids collapsing bins. RPN predicts small objects by using small anchors.

Table 1. Comparisons of different classifiers and features on the Caltech set. All methods are based on VGG-16 (including R-CNN). The same set of RPN proposals are used for all entries.
Table 2. Comparisons of different features in our RPN + BF method on the Caltech set. All entries are based on VGG-16 and the same set of RPN proposals.

How Important is Feature Resolution?

We first report the accuracy of (“slow”) R-CNN [10]. For fair comparisons, we fine-tune R-CNN using the VGG-16 network, and the proposals are from the same RPN as above. This method has an MR of 13.1 % (Table 1), better than its proposals (stand-alone RPN, 14.9 %). R-CNN crops raw pixels from images and warps to a fixed size (\(224\times 224\)), so suffers less from small objects. This result suggests that if reliable features (e.g., from a fine resolution of \(224\times 224\)) can be extracted, the downstream classifier is able to improve the accuracy.

Surprisingly, training a Fast R-CNN classifier on the same set of RPN proposals actually degrades the results: the MR is considerably increased to 20.2 % (vs. RPN’s 14.9 %, Table 1). Even though R-CNN performs well on this task, Fast R-CNN presents a much worse result.

This problem is partially because of the low-resolution features. To show this, we train a Fast R-CNN (on the same set of RPN proposals as above) with the à trous trick adopted on Conv5, reducing the stride from 16 pixels to 8. The problem is alleviated (16.2 %, Table 1), demonstrating that higher resolution can be helpful. Yet, this result still lags far behind the stand-alone RPN or R-CNN (Table 1).

The effects of low-resolution features are also observed in our Boosted Forest classifiers. A BF using Conv5_3 features has an MR of 18.2 % (Table 1), worse than the stand-alone RPN’s 14.9 %. Using the à trous trick on Conv5 when extracting features (Sec. 3.2), the BF achieves a much better MR of 13.7 %.

But the BF classifier is more flexible and is able to take advantage of features of various resolutions. Table 2 shows the results of using different features in our method. Conv3_3 or Conv4_3 alone yields good results (12.4 % and 12.6 %), showing the effects of higher resolution features. Conv2_2 starts to show degradation (15.9 %), which can be explained by the weaker representation of the shallower layers. BF on the concatenation of Conv3_3 and Conv4_3 features reduces the MR to 11.5 %. The combination of features in this way is nearly cost-free. Moreover, unlike previous usage of skip connections [27], it is not necessary to normalize features in a decision forest classifier.

Finally, combining Conv3_3 with the à trous version of Conv4_3, we achieve the best result of 9.6 % MR. We note that this comes at the cost of extra computation (Table 2), because it requires re-computing the Conv4 feature maps with the à trous trick. Nevertheless, the speed of our method is still competitive (Table 4).

Table 3. Comparisons of with/without bootstrapping on the Caltech set.
Table 4. Comparisons of running time on the Caltech set. The time of LDCF and CCF is reported in [25], and that of CompACT-Deep is reported in [4].

How Important Is Bootstrapping?

To verify that the bootstrapping scheme in BF is of central importance (rather than the tree structure of the BF classifiers), we replace the last-stage BF classifier with a Fast R-CNN classifier. The results are in Table 3. Specifically, after the 6 stages of bootstrapping, the bootstrapped training set is used to train a Fast R-CNN classifier (instead of the final BF with 2048 trees). We perform this comparison using RoI features on Conv5_3 (à trous). The bootstrapped Fast R-CNN has an MR of 14.3 %, which is close to the BF counterpart’s 13.7 %, and better than the non-bootstrapped Fast R-CNN’s 16.2 %. This comparison indicates that the major improvement of BF over Fast R-CNN comes from bootstrapping, whereas the form of the classifier (forest vs. MLP) is less important.

Fig. 5. Comparisons on the Caltech set using an IoU threshold of 0.7 to determine True Positives (legends indicate MR).

4.3 Comparisons with State-of-the-Art Methods

Caltech. Figures 4 and 6 show the results on Caltech. Using the original annotations (Fig. 4), our method has an MR of 9.6 %, which is over 2 points better than the closest competitor (11.7 % of CompACT-Deep [4]). Using the corrected annotations (Fig. 6), our method has an MR\(_{-2}\) of 7.3 % and an MR\(_{-4}\) of 16.8 %, both being 2 points better than the previous best methods.

Fig. 6. Comparisons on the Caltech-New set (legends indicate MR\(_{-2}\) (MR\(_{-4}\))).

In addition, except for CCF (MR 18.7 %) [25], ours (MR 9.6 %) is the only method that uses no hand-crafted features. Our results suggest that hand-crafted features are not essential for good accuracy on the Caltech dataset; rather, high-resolution features and bootstrapping are the key to good accuracy, both of which are missing in the original Fast R-CNN detector.

Figure 5 shows the results on Caltech where an IoU threshold of 0.7 is used to determine True Positives (instead of the default 0.5). With this more challenging metric, most methods exhibit dramatic performance drops, e.g., the MRs of CompACT-Deep [4] and DeepParts [3] increase from 11.7 % and 11.9 % to 38.1 % and 40.7 %, respectively. Our method has an MR of 23.5 %, a relative improvement of \(\sim \) 40 % over the closest competitors. This comparison demonstrates that our method has substantially better localization accuracy. It also indicates that there is much room to improve localization performance on this widely evaluated dataset.

Table 4 compares the running time on Caltech. Our method is as fast as CompACT-Deep [4], and is much faster than CCF [25], which adopts feature pyramids. Our method shares features between the RPN and the BF, and achieves a good balance between speed and accuracy.

INRIA and ETH. Figures 7 and 8 show the results on the INRIA and ETH datasets. On the INRIA set, our method achieves an MR of 6.9 %, considerably better than the best available competitor’s 11.2 %. On the ETH set, our result (30.2 %) is better than the previous leading method (TA-CNN [2]) by 5 points.

KITTI. Table 5 shows the performance comparisons on KITTI. Our method has competitive accuracy and fast speed.

Fig. 7. Comparisons on the INRIA dataset (legends indicate MR).

Fig. 8. Comparisons on the ETH dataset (legends indicate MR).

Table 5. Comparisons on the KITTI dataset collected at the time of submission (Feb 2016). The timing records are collected from the KITTI leaderboard. \(^\dag \): region proposal running time ignored (estimated 2 s).

5 Conclusion and Discussion

In this paper, we present a very simple but effective baseline that uses RPN and BF for pedestrian detection. On top of the RPN proposals and features, the BF classifier is flexible for (i) combining features of arbitrary resolutions from any layers, without being limited by the classifier structure of the pre-trained network; and (ii) incorporating effective bootstrapping for mining hard negatives. These nice properties overcome two limitations of the Faster R-CNN system for pedestrian detection. Our method is a self-contained solution and does not resort to hybrid features.

Interestingly, we show that bootstrapping is a key component, even with the advance of deep neural networks. Using the same bootstrapping strategy and the same RoI features, both the tree-structured BF classifier and the region-wise MLP classifier (Fast R-CNN) are able to achieve similar results (Table 3). Concurrent with this work, a method called Online Hard Example Mining (OHEM) [32] has been independently developed for training Fast R-CNN for general object detection. It would be interesting to investigate this end-to-end, online mining fashion vs. the multi-stage, cascaded bootstrapping one.