
1 Introduction

Pedestrian detection is an important research topic in computer vision and has attracted considerable attention over the past few years [4, 7, 9, 11, 18, 32, 37, 39, 43, 45, 48]. It plays a key role in several applications such as autonomous driving, robotics and intelligent video surveillance. Despite recent progress, pedestrian detection remains a challenging problem because of large scale variations, low resolution and occlusion.

Existing methods for pedestrian detection can mainly be grouped into two categories: those based on hand-crafted features [7, 9, 40, 44] and those based on deep learning features [4, 11, 18, 48]. In the first category, human shape based features such as Haar [39] and HOG [7] are extracted to train SVM [7] or boosting classifiers [9]. While these methods are sufficient for simple applications, hand-crafted feature representations are not robust enough for detecting pedestrians in complex scenes. In the second category, a deep convolutional neural network (CNN) learns high-level semantic features from raw pixels, which are more discriminative for recognizing pedestrians with complex poses against noisy backgrounds. Deep learning features have considerably improved pedestrian detection performance. While many CNN based methods have been proposed [4, 11, 18, 26, 48], there are still some shortcomings for methods in this category. On one hand, most methods employ heavy deep networks and need a refinement stage to boost the detection results. Inference time is sacrificed to ensure accuracy, making these methods unsuitable for real-time applications. On the other hand, feature maps of coarse resolution and fixed receptive field are often used for prediction, which is ineffective for distinguishing small-size targets from the background.

Fig. 1.

Overview of our proposed framework. The model includes three key parts: convolutional backbone, pedestrian attention module and zoom-in-zoom-out module (ZIZOM). Given an image, the backbone generates multiple features representing pedestrians of different scales. The attention masks are encoded into backbone feature maps to highlight pedestrians and suppress background interference. ZIZOM incorporates local details and context information to further enhance the feature maps.

In this paper, we propose a graininess-aware deep feature learning (GDFL) based detector for pedestrian detection. We incorporate fine-grained details into deep convolutional features for robust pedestrian detection. Specifically, we propose a scale-aware pedestrian attention module to guide the detector to focus on pedestrian regions. It generates pedestrian attentional masks which indicate the probability of a human at each pixel location. Owing to its fine-grained property, the attention module is well suited to recognizing small-size targets and human body parts. By encoding these masks into the convolutional feature maps, background interference is significantly reduced while pedestrians are highlighted. The resulting graininess-aware deep features are much more discriminative for distinguishing pedestrians, especially small-size and occluded ones, from complex backgrounds. In addition, we introduce a zoom-in-zoom-out module to further facilitate the detection of small-size targets. It mimics the intuitive zoom-in and zoom-out operations we perform when trying to locate an object in an image. The module incorporates local details and context information in a convolutional manner to enhance the graininess-aware deep features for small-size target detection. Figure 1 illustrates the overview of our proposed framework. The two proposed modules can be easily integrated into a basic deep network, leading to an end-to-end trainable model. This results in a fast and robust single-stage pedestrian detector without any extra refinement steps. Extensive experimental results on four widely used pedestrian detection benchmarks demonstrate the effectiveness of the proposed method. Our GDFL approach achieves competitive performance on the Caltech [10], INRIA [7], KITTI [14] and MOT17Det [29] datasets and executes about 4 times faster than competitive methods.

2 Related Work

Pedestrian Detection: With the prevalence of deep convolutional neural networks, which have achieved impressive results in various domains, most recent pedestrian detection methods are CNN-based. Many methods are variations of Faster R-CNN [35], which has shown great accuracy in general object detection. RPN+BF [43] replaced the downstream classifier of Faster R-CNN with a boosted forest and used aggregated features with a hard mining strategy to boost small-size pedestrian detection performance. SA-FastRCNN [19] and MS-CNN [5] extended Fast and Faster R-CNN [15, 35], respectively, with a multi-scale network to deal with the scale variation problem. Instead of a single downstream classifier, F-DNN [11] employed multiple deep classifiers in parallel to post-verify each region proposal using a soft-reject strategy. Different from these two-stage methods, our proposed approach directly outputs detection results without post-processing [23, 34]. Apart from the above full-body detectors, several human part based methods [12, 31, 32, 37, 47, 48] have been introduced to handle occlusion issues. These occlusion-specific methods learned a set of part detectors, each responsible for detecting a human part. The results from these part detectors were then fused properly to locate partially occluded pedestrians. The occlusion-specific detectors were able to give a high confidence score based on the visible parts when the full-body detector was confused by the presence of background. Instead of part-level classification, we explore pixel-level masks which guide the detector to pay more attention to human body parts.

Segmentation in Detection: Since our pedestrian attention masks are generated in a segmentation manner [17, 25], we review here methods that have also exploited semantic segmentation information. Tian et al. [38] optimized pedestrian detection with semantic tasks, including pedestrian attributes and scene attributes. Instead of simple binary detection, this method considered multiple classes according to the attributes to handle pedestrian variations and discarded hard negative samples using scene attributes. Mao et al. [27] demonstrated that fusing semantic segmentation features with detection features improves performance. Du et al. [11] exploited segmentation as a strong cue in their F-DNN+SS framework. The segmentation mask was used in a post-processing manner to suppress predicted bounding boxes that contain no pedestrian. Brazil et al. [4] extended Faster R-CNN [35] by replacing the downstream classifier with an independent deep CNN and added a segmentation loss to implicitly supervise the detection, which made the features more semantically meaningful. Instead of exploiting segmentation masks for post-processing or implicit supervision, our attention mechanism directly encodes masks into the feature maps and explicitly highlights pedestrians.

3 Approach

In this section, we present the proposed GDFL method for pedestrian detection in detail. Our framework is composed of three key parts: a convolutional backbone, a scale-aware pedestrian attention module and a zoom-in-zoom-out module. The convolutional backbone generates multiple feature maps representing pedestrians at different scales. The scale-aware pedestrian attention module generates several attention masks which are encoded into these convolutional feature maps. This forms graininess-aware feature maps which are better able to distinguish pedestrians and body parts from the background. The zoom-in-zoom-out module incorporates extra local details and context information to further enhance the features. We then slide two sibling 3 \(\times \) 3 convolutional layers over the resulting feature maps to output a detection score and a shape offset relative to the default box at each location [23].
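For concreteness, the pair of sibling prediction layers can be written as a small PyTorch-style module; this is a minimal sketch rather than the authors' implementation, and the number of default boxes per location is an assumed placeholder:

```python
import torch.nn as nn

class DetectionHead(nn.Module):
    """Two sibling 3x3 convolutions slid over a feature map: one outputs
    per-default-box class scores (pedestrian vs. background), the other
    outputs 4 shape offsets relative to each default box."""
    def __init__(self, in_channels, num_default_boxes=4, num_classes=2):
        super().__init__()
        self.cls = nn.Conv2d(in_channels, num_default_boxes * num_classes,
                             kernel_size=3, padding=1)
        self.reg = nn.Conv2d(in_channels, num_default_boxes * 4,
                             kernel_size=3, padding=1)

    def forward(self, feature_map):
        # scores: (N, A*2, H, W), offsets: (N, A*4, H, W) for A default boxes per location
        return self.cls(feature_map), self.reg(feature_map)
```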

Fig. 2.

Visualization of feature maps from different convolutional layers. Shallow layers have strong activations for small-size targets but are unable to recognize large-size instances, while deep layers tend to encode large pedestrians and ignore small ones. For clarity, only one channel of the feature maps is shown here. Best viewed in color.

3.1 Multi-layer Pedestrian Representation

Pedestrians exhibit a large variance in scale, which is a critical problem for accurate detection because the features of small and large instances differ. We exploit the hierarchical architecture of the deep convolutional network to address this multi-scale issue. The network computes feature maps of different spatial resolutions through successive sub-sampling layers, which naturally forms a feature pyramid [22]. We use multiple feature maps to detect pedestrians at different scales. Specifically, we tailor the VGG16 network [36] for detection by removing all classification layers and converting the fully connected layers into convolutional layers. Two extra convolutional layers are added at the end of the converted VGG16 in order to cover large-scale targets. The architecture of the network is presented at the top of Fig. 1. Given an input image, the network generates multiple convolutional feature layers with increasing receptive field sizes. We select four intermediate convolutional layers {conv4_3, conv5_3, conv_fc7, conv6_2} as detection layers for multi-scale detection. As illustrated in Fig. 2, shallower convolutional layers with high-resolution feature maps have strong activations for small-size targets, while large-size pedestrians emerge at deeper layers. We regularly place a series of default boxes [23] with different scales on top of the detection layers according to their representation capability. The detection bounding boxes are predicted based on the offsets with respect to these default boxes, as well as the pedestrian probability in each of those boxes. The high-resolution feature maps from layers conv4_3 and conv5_3 are associated with default boxes of small scales for detecting small targets, while those from layers conv_fc7 and conv6_2 are designed for large pedestrian detection.
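A rough sketch of such a truncated backbone in PyTorch is given below; the torchvision layer indices chosen for conv4_3/conv5_3 and the channel widths of the added conv_fc7 and conv6_2 blocks are assumptions for illustration, not the authors' exact configuration:

```python
import torch.nn as nn
import torchvision

class MultiLayerBackbone(nn.Module):
    """Truncated VGG16 exposing four intermediate feature maps for
    multi-scale detection (sketch)."""
    def __init__(self):
        super().__init__()
        vgg = torchvision.models.vgg16(weights=None).features
        self.stage4 = vgg[:23]    # up to conv4_3 + ReLU (stride 8)
        self.stage5 = vgg[23:30]  # pool4 and up to conv5_3 + ReLU (stride 16)
        # fc6/fc7 converted into convolutions (dilated 3x3 followed by 1x1)
        self.conv_fc7 = nn.Sequential(
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.Conv2d(512, 1024, 3, padding=6, dilation=6), nn.ReLU(inplace=True),
            nn.Conv2d(1024, 1024, 1), nn.ReLU(inplace=True))
        # extra stride-2 layer for the largest pedestrians
        self.conv6_2 = nn.Sequential(
            nn.Conv2d(1024, 256, 1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 512, 3, stride=2, padding=1), nn.ReLU(inplace=True))

    def forward(self, x):
        c4 = self.stage4(x)
        c5 = self.stage5(c4)
        fc7 = self.conv_fc7(c5)
        c6 = self.conv6_2(fc7)
        return {"conv4_3": c4, "conv5_3": c5, "conv_fc7": fc7, "conv6_2": c6}
```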

Fig. 3.

Visualization of pedestrian attention masks generated from Caltech test images. From left to right: images with the ground truth bounding boxes, pedestrian vs. background mask, small-size pedestrian mask, and large-size pedestrian mask. The pedestrian/background mask corresponds to the sum of the last two masks and can be seen as a single-scale pedestrian mask. Best viewed in color.

3.2 Pedestrian Attention Module

Despite the multi-layer representation, the feature maps from the backbone are still too coarse, e.g., stride 8 on conv4_3, to effectively locate small-size pedestrians and recognize human body parts. In addition, even if each detection layer tends to represent pedestrians of a particular size, it also responds to targets of other scales, which is undesirable and may lead to box-in-box detections. We propose a scale-aware pedestrian attention module to make our detector pay more attention to pedestrians, especially small-size ones, and to guide the feature maps to focus on targets of a specific scale via pixel-wise attentional maps. By encoding the fine-grained attention masks into the convolutional feature maps, the features representing pedestrians are enhanced, while the background interference is significantly reduced. The resulting graininess-aware features are better able to recognize human body parts and can infer occluded pedestrians from the visible parts.

The attention module is built on the layers conv3_3 and conv4_3 of the backbone network. It generates multiple masks that indicate the probability of a pedestrian of a specific size at each pixel location. The architecture of the attention module is illustrated in Fig. 1. We construct a max-pooling layer and three atrous convolutional layers [20] on top of conv4_3 to obtain the conv_mask layer, which has high resolution and a large receptive field. Each of the conv3_3, conv4_3 and conv_mask layers is first reduced to \((S_c+1)\)-channel maps and spatially up-sampled to the image size, where \(S_c\) is the number of scale classes. They are then concatenated and followed by a \(1 \times 1\) convolution and a softmax layer to output the attention maps. By default, we distinguish small and large pedestrians according to a height threshold of 120 pixels and set \(S_c = 2\). Figure 3 illustrates some examples of pedestrian masks, which effectively highlight pedestrian regions.
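The structure described above can be sketched as follows; channel widths, dilation rates and the pooling configuration are assumptions (the text only specifies a max-pooling layer followed by three atrous convolutions), so this is illustrative rather than the authors' exact design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PedestrianAttention(nn.Module):
    """Scale-aware attention head built on conv3_3 and conv4_3 (sketch)."""
    def __init__(self, c3_channels=256, c4_channels=512, num_scale_classes=2):
        super().__init__()
        out_ch = num_scale_classes + 1  # scale classes + background
        # max-pooling followed by three atrous (dilated) convolutions -> conv_mask
        self.conv_mask = nn.Sequential(
            nn.MaxPool2d(2, stride=2),
            nn.Conv2d(c4_channels, 512, 3, padding=2, dilation=2), nn.ReLU(inplace=True),
            nn.Conv2d(512, 512, 3, padding=2, dilation=2), nn.ReLU(inplace=True),
            nn.Conv2d(512, 512, 3, padding=2, dilation=2), nn.ReLU(inplace=True))
        self.reduce3 = nn.Conv2d(c3_channels, out_ch, 1)
        self.reduce4 = nn.Conv2d(c4_channels, out_ch, 1)
        self.reduce_m = nn.Conv2d(512, out_ch, 1)
        self.fuse = nn.Conv2d(3 * out_ch, out_ch, 1)

    def forward(self, conv3_3, conv4_3, image_size):
        mask_feat = self.conv_mask(conv4_3)
        branches = [self.reduce3(conv3_3), self.reduce4(conv4_3), self.reduce_m(mask_feat)]
        # reduce each branch to (S_c + 1) channels and up-sample to the image size
        up = [F.interpolate(b, size=image_size, mode='bilinear', align_corners=False)
              for b in branches]
        logits = self.fuse(torch.cat(up, dim=1))
        return F.softmax(logits, dim=1)  # channels: background, small, large
```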

Fig. 4.

Visualization of feature maps from detection layers of the backbone network (top), and visualization of feature maps with pedestrian attention (bottom). With our attention mechanism, the background interference is significantly attenuated and each detection layer is more focused on pedestrians of specific size. Best viewed in color.

Once the attention masks \(M \in {\mathcal {R}}^{W\times H \times 3}\) are generated, we encode them into the feature maps from the convolutional backbone to obtain our graininess-aware feature maps, by resizing them to the corresponding spatial sizes and taking an element-wise product:

$$\begin{aligned} \tilde{F}_i = F_i \odot R(M_S,i), \quad i\in \{\text {conv}4,\text {conv}5\} \end{aligned}$$
(1)
$$\begin{aligned} \tilde{F}_j = F_j \odot R(M_L,j), \quad j\in \{\text {conv}\_\text {fc}7,\text {conv}6\} \end{aligned}$$
(2)

where \(M_S \in {\mathcal {R}}^{W\times H \times 1}\) and \(M_L\in {\mathcal {R}}^{W\times H \times 1}\) correspond to the attention masks highlighting small and large pedestrians, respectively. W and H are the width and height of the input image. \(R(\cdot ,i)\) is the function that resizes the input to the spatial size of the \(i^{th}\) layer. \(\odot \) denotes the element-wise product, broadcast across channels. \(F_i\) represents the feature maps from the backbone network, while \({\tilde{F_i}}\) denotes the graininess-aware feature maps with pedestrian attention. The mask \(R(M_S,i)\) is encoded into the feature maps from layers conv4_3 and conv5_3, which are responsible for small pedestrian detection, while the mask \(R(M_L,j)\) is encoded into the feature maps from conv_fc7 and conv6_2, which are used for large pedestrian detection. The feature maps with and without attention masks are shown in Fig. 4, where pedestrian information is highlighted while the background is smoothed by the masks.
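Equations (1) and (2) amount to a resize followed by a broadcast multiplication; a minimal sketch (the channel ordering of the softmax output is an assumption) could read:

```python
import torch.nn.functional as F

def encode_attention(features, mask):
    """Encode a single-channel attention mask into a feature map (Eqs. 1-2):
    resize the mask to the feature map's spatial size, then multiply,
    broadcasting the single mask channel across all feature channels."""
    m = F.interpolate(mask, size=features.shape[-2:],
                      mode='bilinear', align_corners=False)
    return features * m

# attn: (N, 3, H, W) softmax output; channel 1 = small, channel 2 = large (assumed order)
# feats: dict of backbone feature maps as in the backbone sketch above
# m_small, m_large = attn[:, 1:2], attn[:, 2:3]
# f4 = encode_attention(feats["conv4_3"], m_small)   # Eq. (1)
# f7 = encode_attention(feats["conv_fc7"], m_large)  # Eq. (2)
```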

Fig. 5.

Zoom-in-zoom-out module. (a) According to their receptive fields, the layer conv5_3 is better able to capture context information while the layer conv3_3 is able to capture more local details. (b) Architecture of the module. Features from adjacent detection layers are re-sampled and encoded with the corresponding attention mask before being fused with the current detection features.

3.3 Zoom-In-Zoom-Out Module

When we try to find and recognize a small object in an image, we often zoom in and zoom out several times to correctly locate the target. The zoom-in process allows us to obtain detailed information and improves localization precision, while the zoom-out process brings in context information, which is a key factor when reasoning about the probability of a target being in a region, e.g., pedestrians are more likely to appear on the ground or next to cars than in the sky. Inspired by these intuitive operations, we introduce a zoom-in-zoom-out module (ZIZOM) to further enhance the features. It explores rich context information and local details to facilitate detection.

We implement the zoom-in-zoom-out module in a convolutional manner by exploiting feature maps with different receptive fields and resolutions. Feature maps with smaller receptive fields provide rich local details, while feature maps with larger receptive fields import context information. Figure 5(b) depicts the architecture of the zoom-in-zoom-out module. Specifically, given the graininess-aware feature maps \({\tilde{F_i}}\), we incorporate the features from the directly adjacent layers \({F}_{i-1}\) and \({F}_{i+1}\) to mimic the zoom-in and zoom-out processes. Each adjacent layer is followed by a \(1 \times 1\) convolution to select features and an up- or down-sampling operation to harmonize the spatial size of the feature maps. The sampling operations consist of max-pooling and bi-linear interpolation without learnable parameters, for simplicity. The attention mask of the current layer, \(Mask_i\), is encoded into these re-sampled feature maps, making them focus on targets of the corresponding size. We then fuse these feature maps along the channel axis and generate the feature maps for final prediction with a \(1 \times 1\) convolutional layer for dimension reduction as well as feature recombination. Since the feature maps from different layers have different scales, we use L2-normalization [24] to rescale their norm to 10 and learn the scale during back-propagation.
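The following is a rough PyTorch sketch of a ZIZOM block for one detection layer, under the stated assumptions (channel widths are placeholders, and adaptive max-pooling/bilinear interpolation stand in for the parameter-free re-sampling operations):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class L2Norm(nn.Module):
    """Channel-wise L2 normalization with a learnable scale, initialized to 10."""
    def __init__(self, channels, init_scale=10.0):
        super().__init__()
        self.scale = nn.Parameter(torch.full((channels,), init_scale))

    def forward(self, x):
        return F.normalize(x, p=2, dim=1) * self.scale.view(1, -1, 1, 1)

class ZoomInZoomOut(nn.Module):
    """ZIZOM sketch: fuse the current graininess-aware features with
    re-sampled, mask-encoded features from the adjacent layers."""
    def __init__(self, c_prev, c_cur, c_next, c_mid=256):
        super().__init__()
        self.sel_prev = nn.Conv2d(c_prev, c_mid, 1)   # zoom-in branch (finer layer)
        self.sel_next = nn.Conv2d(c_next, c_mid, 1)   # zoom-out branch (coarser layer)
        self.norm_prev, self.norm_cur, self.norm_next = L2Norm(c_mid), L2Norm(c_cur), L2Norm(c_mid)
        self.fuse = nn.Conv2d(2 * c_mid + c_cur, c_cur, 1)  # dimension reduction / recombination

    def forward(self, f_prev, f_cur_tilde, f_next, mask_cur):
        size = f_cur_tilde.shape[-2:]
        p = F.adaptive_max_pool2d(self.sel_prev(f_prev), size)      # down-sample finer features
        n = F.interpolate(self.sel_next(f_next), size=size,
                          mode='bilinear', align_corners=False)     # up-sample coarser features
        m = F.interpolate(mask_cur, size=size, mode='bilinear', align_corners=False)
        p, n = p * m, n * m                                         # encode current layer's mask
        fused = torch.cat([self.norm_prev(p), self.norm_cur(f_cur_tilde), self.norm_next(n)], dim=1)
        return self.fuse(fused)
```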

Figure 5(a) analyzes the effect of the ZIZOM in terms of the receptive fields of the involved convolutional layers. The features from conv5_3 enrich the context information with the presence of a car and another pedestrian. Since the receptive field of conv3_3 matches the size of the target, its features are able to import more local details about the pedestrian. Concatenating these two adjacent features with conv4_3 results in more powerful feature maps, as illustrated in Fig. 5(b).

3.4 Objective Function

All three components form a unified framework which is trained end-to-end. We formulate the following multi-task loss function L to supervise our model:

$$\begin{aligned} L = L_{\text {conf}} + \lambda _l L_{\text {loc}} + \lambda _m L_{\text {mask}} \end{aligned}$$
(3)

where \(L_{\text {conf}}\) is the confidence loss, \(L_{\text {loc}}\) corresponds to the localization loss and \(L_{\text {mask}}\) is the loss function of pedestrian attention masks. \(\lambda _l\) and \(\lambda _m\) are two parameters to balance the importance of different tasks. In our experiments we empirically set \(\lambda _l\) to 2 and \(\lambda _m\) to 1.

The confidence score branch is supervised by a Softmax loss over two classes (pedestrian vs. background). The box regression loss \(L_{\text {loc}}\) minimizes the Smooth L1 loss [15] between the predicted bounding-box regression offsets and the ground truth box regression targets. We develop a weighted Softmax loss to supervise our pedestrian attention module. There are two main motivations for this weighting policy: (1) Most regions are background, and only a few pixels correspond to pedestrians; this imbalance makes training inefficient. (2) Large-size instances naturally occupy a larger area than small ones; this size inequality pushes the classifier to ignore small pedestrians. To address these imbalances, we introduce an instance-sensitive weight \(\omega _i = \alpha + \beta \frac{1}{h_i}\) and define the attention mask loss \(L_{\text {mask}}\) as a weighted Softmax loss:

$$\begin{aligned} L_{\text {mask}} = -\frac{1}{N_s}\sum _{i=1}^{N_s} \sum _{l_s=0}^{S_c} {\mathbbm {1}}\{y_i=l_s\}\, \omega _i^{{\mathbbm {1}}\{l_s\ne 0 \}} \log (c_i^{l_s}) \end{aligned}$$
(4)

where \(N_s\) is the number of pixels in the mask, \(S_c\) is the number of scale classes, and \(h_i\) is the height of the target represented by the \(i^{th}\) pixel. \({\mathbbm {1}}\{\cdot \}\) is the indicator function. \(y_i\) is the ground truth label, \(l_s=0\) corresponds to the background label and \(c_i^{l_s}\) is the predicted score of the \(i^{th}\) pixel for class \(l_s\). The constants \(\alpha \) and \(\beta \) are set to 3 and 10 by cross validation.
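A possible implementation of Eq. (4) is sketched below; the tensor layout and the convention for passing per-pixel instance heights are assumptions, not part of the paper:

```python
import torch
import torch.nn.functional as F

def weighted_mask_loss(logits, targets, heights, alpha=3.0, beta=10.0):
    """Instance-sensitive weighted Softmax loss (Eq. 4), as a sketch.
    logits:  (N, S_c + 1, H, W) raw attention-map scores
    targets: (N, H, W) integer labels, 0 = background, 1..S_c = scale class
    heights: (N, H, W) float heights (pixels) of the instance each foreground
             pixel belongs to (any positive value on background pixels)."""
    log_probs = F.log_softmax(logits, dim=1)
    picked = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)  # log c_i^{y_i}
    # omega_i = alpha + beta / h_i on foreground pixels, 1 on background
    weights = torch.where(targets > 0,
                          alpha + beta / heights.clamp(min=1.0),
                          torch.ones_like(heights))
    return -(weights * picked).mean()
```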

4 Experiments and Analysis

4.1 Datasets and Evaluation Protocols

We comprehensively evaluated our proposed method on four benchmarks: Caltech [10], INRIA [7], KITTI [14] and MOT17Det [29]. Here we give a brief description of these benchmarks.

The Caltech dataset [10] consists of \(\sim \)10 h of urban driving video with 350K labeled bounding boxes. It results in 42,782 training images and 4,024 test images. The log-average miss rate is used to evaluate the detection performance and is calculated by averaging miss rate on false positive per-image (FPPI) points sampled within the range of \([10^{-2}, 10^{0}]\). As the purpose of our approach is to alleviate occlusion and small-size issues, we evaluated our GDFL on three subsets: Heavy Occlusion, Medium and Reasonable. In the Heavy Occlusion subset, pedestrians are taller than 50 pixels and 36 to 80% occluded. In the Medium subset, people are between 30 and 80 pixels tall, with partial occlusion. The Reasonable subset consists of pedestrians taller than 50 pixels with partial occlusion.

The INRIA dataset [7] includes 614 positive and 1,218 negative training images. There are 288 test images available for evaluating pedestrian detection methods. The evaluation metric is the log-average miss rate on FPPI. Due to limited available annotations, we only considered the Reasonable subset for comparison with state-of-the-art methods.

The KITTI dataset [14] consists of 7,481 training images and 7,518 test images, comprising about 80K annotations of cars, pedestrians and cyclists. KITTI evaluates the PASCAL-style mean Average Precision (mAP) with three metrics: easy, moderate and hard. The difficulties are defined based on minimum pedestrian height, occlusion and truncation level.

The MOT17Det dataset [29] consists of 14 video sequences recorded in unconstrained environments, resulting in 11,235 images. The dataset is split into a training part and a testing part, each composed of 7 video sequences. The Average Precision (AP) is used for evaluating different methods.

4.2 Implementation Details

Weakly Supervised Training for Attention Module: To train the pedestrian attention module, we only use the bounding box annotations in order to be independent of any pixel-wise annotation. To achieve this, we explore a weakly supervised strategy that creates artificial foreground segmentation from bounding box information. In practice, we consider pixels within a bounding box as foreground while the rest are labeled as background. Pixels that belong to multiple bounding boxes are assigned to the box with the smallest area. As illustrated in Fig. 3, despite the weakly supervised training, our generated pedestrian masks carry significant semantic segmentation information.
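A simple sketch of this mask construction follows; the scale-class labeling by the 120-pixel height threshold is taken from Sect. 3.2, while the painting-order trick for resolving overlaps is one possible realization of the "smallest area wins" rule:

```python
import numpy as np

def boxes_to_mask(boxes, image_size, small_threshold=120):
    """Build an artificial segmentation label map from bounding boxes (sketch).
    boxes: iterable of (x1, y1, x2, y2) in pixels; image_size: (H, W).
    Returns an (H, W) map: 0 = background, 1 = small pedestrian, 2 = large.
    Overlapping pixels go to the smallest box, achieved here by painting
    boxes from largest to smallest so smaller boxes overwrite larger ones."""
    h, w = image_size
    mask = np.zeros((h, w), dtype=np.int64)
    for x1, y1, x2, y2 in sorted(boxes, key=lambda b: -(b[2] - b[0]) * (b[3] - b[1])):
        label = 1 if (y2 - y1) < small_threshold else 2
        mask[int(y1):int(y2), int(x1):int(x2)] = label
    return mask
```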

Training: Our network is trained end-to-end using the stochastic gradient descent (SGD) algorithm. We partially initialize our model with the pre-trained model of [23], and all newly added layers are randomly initialized with the "xavier" method [16]. We adopt the data augmentation strategies of [23] to make our model more robust to scale and illumination variations. In addition, during training, negative samples vastly outnumber positive samples, and most of them are easy samples. For more stable training, instead of using all negative samples, we sort them by loss value and keep the hardest ones so that the ratio of negatives to positives is at most 3:1.
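This hard negative mining step can be sketched as follows (a minimal per-image version; the tensor layout is assumed):

```python
import torch

def mine_hard_negatives(conf_loss, positive, neg_pos_ratio=3):
    """Online hard negative mining (sketch).
    conf_loss: (N,) per-default-box confidence loss
    positive:  (N,) boolean mask of positive default boxes
    Returns a boolean mask keeping all positives plus the highest-loss
    negatives, capped at neg_pos_ratio negatives per positive."""
    num_neg = min(int(positive.sum()) * neg_pos_ratio, int((~positive).sum()))
    neg_loss = conf_loss.clone()
    neg_loss[positive] = float('-inf')      # never select positives as negatives
    _, idx = neg_loss.topk(num_neg)         # hardest negatives
    keep = positive.clone()
    keep[idx] = True
    return keep
```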

Inference: We use the original input image size to avoid loss of information and to save inference time: \(480\times 640\) for Caltech and INRIA, and \(384\times 1280\) for KITTI. At the inference stage, a large number of bounding boxes are generated by our detector. We perform non-maximum suppression (NMS) with an Intersection over Union (IoU) threshold of 0.45 to filter redundant detections. We use a single GeForce GTX 1080 Ti GPU for computation and our detector runs at about 20 frames per second with inputs of size \(480\times 640\) pixels.
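For reference, a standard greedy NMS procedure such as the one used here can be written as follows (a generic sketch, not the authors' code):

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.45):
    """Greedy non-maximum suppression (sketch).
    boxes: (N, 4) array of (x1, y1, x2, y2); scores: (N,).
    Returns indices of the kept boxes, highest scoring first."""
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU between the current best box and the remaining boxes
        xx1, yy1 = np.maximum(x1[i], x1[order[1:]]), np.maximum(y1[i], y1[order[1:]])
        xx2, yy2 = np.minimum(x2[i], x2[order[1:]]), np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_threshold]  # drop overlapping boxes
    return keep
```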

Table 1. Comparison with the state-of-the-art methods on the Caltech heavy occlusion subset in terms of speed and miss rate.

4.3 Results and Analysis

Comparison with State-of-the-Art Methods: We evaluated our proposed GDFL method on four challenging pedestrian detection benchmarks, Caltech [10], INRIA [7], KITTI [14] and MOT17Det [29].

Caltech: We trained our model on the Caltech training set and evaluated it on the Caltech testing set. Table 1 lists the comparison with state-of-the-art methods on the Caltech heavy occlusion subset in terms of execution time and miss rate. Figure 6 illustrates the ROC plot of miss rate against FPPI for the available top performing methods reported on the Caltech medium and reasonable subsets [1, 4–6, 8, 11, 19, 37, 43]. In the heavy occlusion case, our GDFL achieves a \(43.18\%\) miss rate, which is significantly better than the existing occlusion-specific detectors. This performance suggests that our detector, guided by fine-grained information, is better able to identify human body parts and thus to locate occluded pedestrians. On the Caltech medium subset, our method has a miss rate of \(32.50\%\), which is slightly better than the previous best method [11]. In more reasonable scenarios, our approach achieves performance comparable to the method that reports the best results on the Caltech reasonable subset [4].

Since our goal is to propose a fast and accurate pedestrian detector, we also examined the efficiency of our method. Table 1 compares the running time on the Caltech dataset. Our GDFL method is much faster than F-DNN+SS [11] and is about \(10 \times \) faster than the previous best method on the Caltech heavy occlusion subset, JL-TopS [48]. While SDS-RCNN [4] performs slightly better than our method on the Caltech reasonable subset (\(7.36\%\) vs. \(7.84\%\)), it needs about \(4\times \) more inference time than our approach. The comparison shows that our pedestrian detector achieves a favorable trade-off between speed and accuracy.

Fig. 6.

Comparison with state-of-the-art methods on the Caltech dataset.

Fig. 7.

Comparison with state-of-the-art methods on the INRIA dataset using the reasonable setting.

INRIA: We trained our model with the 614 positive images, excluding the negative images, and evaluated it on the test set. Figure 7 illustrates the results of our approach and of the methods that perform best on the INRIA set [2, 3, 21, 28, 30, 33, 44]. Our detector yields state-of-the-art performance with a \(5.04\%\) miss rate, outperforming the competitive methods by more than 1%. This shows that our method achieves strong results even when the training set is limited.

Table 2. Comparison with published pedestrian detection methods on the KITTI dataset. The mAP (%) and running time are collected from the KITTI leaderboard.
Table 3. Comparison with published state-of-the-art methods on MOT17Det benchmark. The symbol \(^*\) means that external data are used for training.

KITTI: We trained our model on the KITTI training set and evaluated it on the designated test set. We compared our proposed GDFL approach with current pedestrian detection methods on KITTI [4–6, 18, 37, 43, 46]. The results are listed in Table 2. Our detector achieves performance competitive with MS-CNN [5] yet executes about \(3\times \) faster with the original input size. Apart from its scale-specific design, MS-CNN [5] explored input and feature up-sampling strategies which are crucial for improving small object detection performance. Following this process, we up-sampled the inputs by 1.5 times and observed a significant improvement on the hard subset, but with more execution time. Note that in the KITTI evaluation protocol, cyclists are regarded as false detections while sitting people are ignored. With this setting, our pedestrian attention mechanism is less helpful since it tends to highlight all human-shaped targets, including a person riding a bicycle. This explains why our model does not perform as well on KITTI as on Caltech or INRIA.

MOT17Det: We trained and evaluated our detector on the designated training and testing sets, respectively, and compared it with existing methods. Table 3 tabulates the detection results of our method and the state-of-the-art approaches. Our proposed detector achieves a competitive 0.81 average precision without using external datasets for training. This performance demonstrates the generalization capability of our model.

Table 4. Ablation experiments evaluated on the Caltech test set. The analysis shows the effects of various components and design choices on detection performance.

Ablation Experiments: To better understand our model, we conducted ablation experiments on the Caltech dataset. We considered our convolutional backbone as the baseline and successively added the different key components to examine their contributions to performance. Table 4 summarizes our comprehensive ablation experiments.

Multi-layer Detection: We first analyzed the advantages of using multiple detection layers. To this end, instead of the multi-layer representation, we only used the conv_fc7 layer to predict pedestrians of all scales. The experimental results of these two architectures demonstrate the superiority of multi-layer detection, with a notable gain of 7% on the Caltech Reasonable subset.

Attention Mechanism: We analyzed the effects of our attention mechanism, in particular the difference between a single-scale attention mask and multiple scale-aware attention masks. To this end, we compared two models with these two attention designs. From Table 4, we can see that both models improve the results, but the model with scale-aware attention clearly performs better. Confusions such as box-in-box detections are suppressed by our scale-aware attention masks. We observe an impressive improvement on the Caltech heavy occlusion subset, which demonstrates that the fine-grained masks better capture body parts. Some examples of occlusion cases are depicted in Fig. 8. We can see that the features without attention are unable to recognize human parts and tend to ignore occluded pedestrians. When we encode the pedestrian masks into these feature maps, human body parts are considerably highlighted. The detector becomes able to deduce the occluded parts from the visible parts, which makes the detection of occluded targets plausible.

Instance-Sensitive Weight in Softmax Loss: During the training stage, our attention module was supervised by a weighted Softmax loss, and we examined how the instance-sensitive weight contributed to the performance. We compared two models trained with and without the weight term. As listed in the \(5{^\text {th}}\) column of Table 4, the performance drops on all three subsets of Caltech with the conventional Softmax loss. In particular, the miss rate increases from 44.68% to 47.69% in the heavy occlusion case. These results indicate that the instance-sensitive weight term is a key component for generating accurate attention masks.

ZIZOM: We further built the zoom-in-zoom-out module on top of our model with attention masks. Table 4 shows that with the ZIZOM on top of the graininess-aware features \({\tilde{F}}_{conv4\_3}\), the performance improves by 1% on all subsets of Caltech. However, when we further constructed a ZIZOM on \({\tilde{F}}_{conv5\_3}\), the results were nearly the same. Since the feature maps \({\tilde{F}}_{conv5\_3}\) represent pedestrians of about 100 pixels tall, these results confirm our intuition that context information and local details are important for small targets but less helpful for large ones. To better isolate the effectiveness of this module, we disabled the attention mechanism and considered a convolutional backbone with the ZIZOM on \(F_{conv4\_3}\). The comparison with the baseline shows a gain of 4% on the Caltech heavy occlusion subset. These results prove the effectiveness of the proposed zoom-in-zoom-out module.

Fig. 8.

Hard detection samples where a box-based detector is often fooled by noisy representations. The first row illustrates the images with pedestrians located by green bounding boxes. The second and third rows show the feature maps without attention masks and the graininess-aware feature maps, respectively. Best viewed in color.

5 Conclusion

In this paper, we have proposed a framework which incorporates pixel-wise information into deep convolutional feature maps for pedestrian detection. We have introduced scale-aware pedestrian attention masks and a zoom-in-zoom-out module to improve the capability of the feature maps to identify small and occluded pedestrians. Experimental results on four widely used pedestrian benchmarks have validated the advantages of the proposed method in terms of detection robustness and efficiency.