1 Introduction

Object detection technology has been widely used in intelligent transportation, road inspection, and military target detection. With the advent of deep learning and large-scale visual recognition datasets, object detection has developed rapidly; the two-step detection frameworks based on R-CNN [1,2,3,4] and the one-step, regression-based frameworks [5,6,7] are the most representative.

An object detection framework based on R-CNN mainly consists of image convolution, region proposal, and classification and regression of the proposed regions. In 2014, Girshick et al. [1] proposed the R-CNN detection framework by combining region proposals with a CNN, which opened a new era for object detection with deep learning. Since then, many improved versions of the R-CNN model targeting detection accuracy and speed have been proposed, such as Fast R-CNN [2], which incorporates the ROI-Pool layer and a multi-task loss function; Faster R-CNN [3], which incorporates the Region Proposal Network (RPN); and Mask R-CNN [4], which incorporates instance segmentation for multi-task collaboration. Compared with traditional target detection methods, the R-CNN methods avoid the subjectivity of manual feature extraction and realize end-to-end object feature extraction and classification.

Because two-step object detection is slow, one-step methods avoid the region proposal process and perform detection by directly regressing the category and bounding box of each region in the image. YOLO [5] divides the image into grids and performs regression on those grids to obtain the category and bounding box of each object, which boosts the detection speed to 45 fps. SSD [6] introduces the anchor representation of Faster R-CNN into YOLO to generate multi-scale regions at each location in the image, which not only greatly improves the accuracy of the detector but also raises the detection speed to 58 fps. DSSD [7] fuses the deep and shallow convolutional layers with an encoder-decoder network, which leverages both high-level semantics and low-level image features and thus boosts detection performance on small and dense objects. These methods do not require region proposals, so compared with R-CNN-based methods they are faster but less accurate.

A common weakness of both families is poor detection accuracy on small objects. Therefore, in this paper, we design an object detection model based on the parallel connection of a Deep and a Shallow CNN (DS-CNN) through innovative use of skip-layer connections, region proposals, and anchors. This model not only keeps the detection accuracy for large objects but also significantly improves the accuracy for small objects, with very little reduction in detection speed.

Fig. 1. The framework of DS-CNN. Given an input image, after two shared convolutional layers, the shallow network and the deep network process the features into maps of the same size, which are combined in the concat layer. The RPN generates 500 region proposals, and the feature map is then processed by the dimensional reducer and a fully-connected layer. Finally, softmax is used for classification and a multi-task loss function for regression.

2 DS-CNN

The framework of our detection model is shown in Fig. 1. It consists of four parts: the first is the feature extraction network, comprising the deep CNN and the shallow CNN; the second is the region proposal network, which generates object candidates; the third is the dimensional reducer, which reduces the feature dimension of each object candidate; and the fourth is the fully-connected (FC) layer together with the classification and regression networks.

2.1 The Design of Deep and Shallow CNN

In general, the deeper the convolutional layers are, the more salient the semantic features become and the easier the object is to classify, but the more information is lost. For large objects, this loss is not enough to affect classification and identification, but the same does not hold for small objects. Taking Fast R-CNN [2] as an example, the feature map of the last convolutional layer, conv5-3, has been downsampled 16 times. In a 500 \(\times \) 300 image, a small object of about 32 \(\times \) 32 pixels shrinks to 2 \(\times \) 2 after conv5-3. Although upsampling can expand the map to 7 \(\times \) 7, the loss of information is irreversible. This is why the R-CNN series of models has relatively poor detection accuracy on small objects.

For this purpose, we design a deep CNN and a shallow CNN based on the VGG16 network. The deep network captures the high-level semantics of large objects, while the shallow network retains the low-level image features of small objects. In the deep network, the parameters of conv1-1 to pool4 are the same as those of VGG16, but the conv5-1 to conv5-3 layers are all modified to use dilated convolution with a pad of 2, a kernel size of 3 \(\times \) 3, a stride of 1, and a dilation of 2. Dilated convolution [8] is a common method in image segmentation; it expands the receptive field without changing the size of the feature map and thus captures more global information. The principle is shown in Fig. 2, where (a) is the normal feature map and (b) is a dilated convolutional map with a dilation factor of 2. For the 7 \(\times \) 7 feature area, the actual convolution kernel is 3 \(\times \) 3 with holes of size 1; that is, all weights except those at the 9 black points are 0. Although the kernel size is unchanged compared to (a), the receptive field of this convolution grows to 7 \(\times \) 7, which allows each convolution output to contain more global information.
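To make the layer configuration concrete, the following is a minimal PyTorch-style sketch of one modified conv5 layer (the paper's own implementation is not given): with a pad of 2 and a dilation of 2, the output keeps the input's spatial size while each output position sees a wider context.

```python
import torch
import torch.nn as nn

# conv5-style layer from Sect. 2.1: 3x3 kernel, stride 1, pad 2, dilation 2
dilated_conv = nn.Conv2d(512, 512, kernel_size=3, stride=1, padding=2, dilation=2)

x = torch.randn(1, 512, 32, 32)  # a conv4-level feature map at 1/16 resolution
y = dilated_conv(x)
print(y.shape)  # torch.Size([1, 512, 32, 32]): spatial size preserved
```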

Fig. 2. Principle of dilated convolution.

In the shallow network, it is no longer necessary to capture high-level semantic features of the image, but rather to retain its low-level features, so a very deep network, i.e., a large number of convolutional layers, is not needed. To make the parallel structure work better, we use the skip-layers connection method to share the parameters of conv1-1 and conv1-2. Starting from conv2-1, only 4 convolutional layers are used, each with 24 filters of kernel size 5 \(\times \) 5. To give the deep network and the shallow network the same final spatial resolution, we place an average pooling layer with a kernel size of 4 \(\times \) 4 and a stride of 2 after each convolutional layer in the shallow network. Using average pooling instead of max pooling ensures that no excessive image information is lost.
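The shallow branch can be sketched as follows under the stated hyperparameters; the ReLU activations and the paddings are our assumptions, chosen so that four stride-2 pools bring the branch to the same 1/16 resolution as the deep branch.

```python
import torch.nn as nn

def shallow_stage(in_ch):
    # one of the four conv + average-pool stages of the shallow branch
    return nn.Sequential(
        nn.Conv2d(in_ch, 24, kernel_size=5, padding=2),    # 24 filters, 5x5 kernel
        nn.ReLU(inplace=True),                             # activation: our assumption
        nn.AvgPool2d(kernel_size=4, stride=2, padding=1),  # 4x4 average pooling, stride 2
    )

# The branch starts from the shared conv1-2 output (64 channels in VGG16);
# four stride-2 pools give the same 1/16 downsampling as pool1..pool4.
shallow_net = nn.Sequential(
    shallow_stage(64),
    shallow_stage(24),
    shallow_stage(24),
    shallow_stage(24),
)
```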

After extracting image features, we need to combine the feature maps of the deep network and the shallow network and integrate them into a unified space. In this paper, we employ the concat layer to do this, and the dimension of the joint feature is 536-d.
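As a sanity check on the joint dimension (a sketch; 536 is consistent with 512 channels from conv5-3 of the deep branch plus the 24 shallow filters):

```python
import torch

deep_feat = torch.randn(1, 512, 32, 32)    # conv5-3 output of the deep branch
shallow_feat = torch.randn(1, 24, 32, 32)  # shallow-branch output at the same resolution

joint = torch.cat([deep_feat, shallow_feat], dim=1)  # the concat layer
print(joint.shape)  # torch.Size([1, 536, 32, 32]): the 536-d joint feature
```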

2.2 Object Candidate Generation

The number and quality of candidate region proposals directly affect the speed and accuracy of object detection. RPN [3] generates candidate regions directly on the convolutional feature map by using "anchors". Although RPN is still essentially a sliding-window method, it greatly improves the detection speed of the whole network because region proposal, classification, and regression share the convolutional feature map, so we adopt the RPN in the proposed DS-CNN.

RPN scans and convolves the feature maps with a 3 \(\times \) 3 sliding window. At the center of each 3 \(\times \) 3 sliding window, we define 4 scales (64, 128, 256, 512) and 3 aspect ratios (1:1, 1:2, 2:1), which generate 12 different region proposal boxes, i.e., 12 kinds of anchors. Thus, for a 14 \(\times \) 14 feature map, there are about 2300 (14 \(\times \) 14 \(\times \) 12) region proposal boxes. All region proposal boxes are then sent to the fully-connected classification and regression layers to classify and refine the regions. The classification layer contains two elements for calculating the probability of target or non-target. The regression layer contains four coordinate elements (x, y, w, h) for determining the target position. To obtain valid region proposals, we adjust some parameters and apply non-maximum suppression, keeping candidate regions whose overlap with a ground-truth region is greater than or equal to 0.5 as positive samples and those with overlap below 0.3 as negative samples. Finally, the 500 positive samples with the highest overlap are selected as the final region proposals for object detection.
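The anchor layout can be sketched as follows; this is our reading of the scale/ratio arithmetic, consistent with the 1024 \(\times \) 512 maximum anchor mentioned in the next paragraph, since the exact parameterization is not spelled out above.

```python
def anchors_at(cx, cy):
    """Return the 12 anchors (x1, y1, x2, y2) centered at (cx, cy)."""
    scales = (64, 128, 256, 512)
    ratios = ((1, 1), (1, 2), (2, 1))  # w:h = 1:1, 1:2, 2:1
    boxes = []
    for s in scales:
        for rw, rh in ratios:
            w, h = s * rw, s * rh      # e.g. scale 512, ratio 2:1 -> 1024 x 512
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes

# 12 anchors per position; a 14 x 14 map yields 14 * 14 * 12 = 2352 (about 2300) boxes
assert len(anchors_at(0, 0)) == 12
```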

In the RPN, the input image is scaled to at most 1000 \(\times \) 600, but the largest of the 12 anchors is 1024 \(\times \) 512; since 1024 exceeds 1000, the parts beyond the border are cut off. As a result, the maximum anchor size is 1000 \(\times \) 512, which is large enough to cover the big objects in an image. Similarly, the 256 and 128 scales handle medium-sized objects. Because each anchor performs single-label detection, a large object with salient features can mask a small object with weak features. By using the minimum scale of 64, we prevent small objects from being masked by large ones during detection, thereby improving the detection accuracy of small objects.
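A small sketch of the border clipping described above:

```python
def clip_box(box, img_w=1000, img_h=600):
    """Cut off the parts of an anchor extending beyond the image border."""
    x1, y1, x2, y2 = box
    return (max(0.0, x1), max(0.0, y1), min(float(img_w), x2), min(float(img_h), y2))

# The 1024-wide anchor is at most 1000 wide after clipping, as noted above.
```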

2.3 Dimensional Reducer

The FC layer integrates the extracted image features and plays the role of a classifier in a CNN. Because FC layers easily cause parameter redundancy, many classical methods replace them with other layer types: fully convolutional networks use a convolutional layer instead of the FC layer, while ResNet [9] and GoogLeNet [10] use global average pooling. Because our model draws on the classification and regression layers of Fast R-CNN, it cannot completely remove the FC layer. Therefore, in our model, a dimensional reducer is designed to replace one FC layer of VGG16 and reduce parameter redundancy. The dimensional reducer consists of a ROI-Pool layer and a 1 \(\times \) 1 convolutional layer. The ROI-Pool layer outputs a fixed-size feature map after the RPN and is used here to compress feature maps. The convolutional layer, with a kernel size of 1 \(\times \) 1 and a stride of 1, sits behind the ROI-Pool layer; it not only makes the structure more compact but also reduces the dimension of the feature map. We use the dimensional reducer to fix the feature map size at 7 \(\times \) 7 and reduce the feature dimension from 536 to 512, and the compressed features are then fed into the FC layer. Experimental results show that our dimensional reducer+FC6+Loss structure is faster than the FC6+FC7+Loss structure of VGG16, with slightly improved detection accuracy.
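A minimal PyTorch-style sketch of the dimensional reducer (the original is implemented in a different framework; torchvision.ops.roi_pool stands in for the ROI-Pool layer here):

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_pool

reduce_1x1 = nn.Conv2d(536, 512, kernel_size=1, stride=1)  # 1x1 conv: 536 -> 512 channels

feat = torch.randn(1, 536, 38, 63)                 # joint feature map at 1/16 resolution
rois = torch.tensor([[0., 10., 10., 200., 150.]])  # (batch_idx, x1, y1, x2, y2) in image coords

pooled = roi_pool(feat, rois, output_size=(7, 7), spatial_scale=1.0 / 16)
out = reduce_1x1(pooled)
print(out.shape)  # torch.Size([1, 512, 7, 7]): fixed-size, reduced feature per RoI
```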

Similar to the R-CNN series, in the FC layer we use SoftmaxWithLoss for classification and SmoothL1Loss for regression during training, and Softmax for classification during testing.
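In PyTorch terms, the two training losses correspond to the sketch below; the weighting between the terms, and the usual restriction of the regression term to foreground samples, are standard practice rather than something stated above.

```python
import torch.nn.functional as F

def detection_loss(cls_scores, labels, bbox_pred, bbox_targets, lam=1.0):
    cls_loss = F.cross_entropy(cls_scores, labels)        # SoftmaxWithLoss
    reg_loss = F.smooth_l1_loss(bbox_pred, bbox_targets)  # SmoothL1Loss
    # In practice, only foreground RoIs contribute to the regression term.
    return cls_loss + lam * reg_loss
```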

2.4 Joint Training

Like some advanced models, DS-CNN can also be trained and tested end-to-end. However, by comparison, we find that the alternating training method obtains a better mAP than end-to-end training on our model. The main steps of the alternating optimization are as follows. First, we initialize the feature extraction network with a model pre-trained on ImageNet [11] and obtain candidate regions by training the RPN alone on PASCAL VOC. Second, we reinitialize the feature extraction network with the ImageNet pre-trained model and add the candidate regions generated in the first step; meanwhile, a separate detection network is trained on the PASCAL VOC dataset using DS-CNN to obtain the convolutional-layer parameters from the loss values of the fully-connected layer and the candidate regions of the RPN. Third, we retrain DS-CNN, using the model obtained in the second step to initialize and fix the parameters of the convolutional layers so that they do not participate in back-propagation, and using the RPN model trained in the first step to initialize and fix the parameters of the RPN in DS-CNN so that it is likewise excluded from back-propagation; the purpose of this step is to connect the feature extraction network with the RPN. Finally, we use the convolutional-layer and RPN parameters obtained in the third step to reinitialize DS-CNN, again fixing both so that neither participates in back-propagation; this last step fine-tunes the fully-connected layer to obtain the final optimized model.
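In modern framework terms, the "fix and exclude from back-propagation" steps amount to disabling gradients for the frozen modules; the module names below are illustrative, not the authors' code.

```python
import torch

def freeze(module):
    # exclude a sub-network from back-propagation (steps 3 and 4 above)
    for p in module.parameters():
        p.requires_grad = False

# Step 3 (illustrative): load the step-2 weights, freeze the shared convolutional
# layers, load the step-1 RPN weights and freeze the RPN, then train the rest:
# freeze(ds_cnn.feature_extractor)
# freeze(ds_cnn.rpn)
# optimizer = torch.optim.SGD(
#     (p for p in ds_cnn.parameters() if p.requires_grad), lr=0.001, momentum=0.9)
```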

3 Experimental Evaluation

PASCAL VOC [12] and MS COCO [17] are two widely used datasets in the object detection field, and both are used to evaluate our DS-CNN. The mAP is used as the main evaluation criterion, with the convergence speed and detection speed of the model as two auxiliary criteria. We also compare our model with state-of-the-art models, all of which use VGG16. All experimental results were obtained by running these models on a PC with an Intel Core i7-7700K 4.20 GHz CPU, a GeForce GTX 1080Ti GPU, and 16 GB RAM.

3.1 Experiments on PASCAL VOC

The PASCAL VOC 2007 dataset includes 20 object categories, with about 5k training images and 5k testing images; the PASCAL VOC 2012 dataset is similar, but its volume of data is doubled. The small objects in the PASCAL VOC dataset are mostly indoor, including bottle, chair, dining table, potted plant, sofa, and tv.

In the first experiment, we use the alternating training method to train our DS-CNN on the PASCAL VOC 2007 training set and test the model on the PASCAL VOC 2007 test set. Experimental results are shown in Table 1, where bold entries such as bottle and chair denote small objects. From the table, we can observe that the mAP of DS-CNN is 72.1%, higher than the other models. For small objects, the detection accuracy of our model improves significantly, most obviously for bottle and potted plant. Although the accuracy on tv is lower than that of OHEM+FRCN [14], it is still 3.2% higher than Faster R-CNN. The detection results on larger objects appear less stable, but we note that most of them maintain high accuracy. To present the detection results more intuitively, Fig. 3 shows some examples on the PASCAL VOC 2007 dataset.

Table 1. The average detection precision (%) of all models on PASCAL VOC 2007, where the training dataset is from PASCAL VOC 2007, and the best score is highlighted in color.
Fig. 3. Selected examples of object detection results on the PASCAL VOC 2007 dataset.

To eliminate interference caused by insufficient data, we designed a second experiment. As in the first experiment, we still use the PASCAL VOC 2007 test set for testing, but the training data come from PASCAL VOC 2007+2012, which expands the training set to three times that of the first experiment. The experimental results are shown in Table 2; the mAP of DS-CNN is 75.8%. Like our method, SSD500 also connects convolutional features from different layers in parallel. However, its features all come from high-level rather than low-level layers, so the features of small objects cannot be effectively extracted and trained. In contrast, in our model, the shallow network and the average pooling layers preserve the information of small objects, and the scale 64 in the RPN strengthens the detection of small objects, so DS-CNN performs better than SSD on small objects. However, SSD strengthens the combination of different convolutional layers, uses data augmentation, and abandons the fully-connected layers and candidate-region generation, so its overall detection performance is better than that of DS-CNN. Structurally, DS-CNN is similar to Faster R-CNN and also draws on its RPN, so the two are highly comparable. The accuracy of DS-CNN is higher than that of Faster R-CNN on all objects except boat, with significantly increased accuracy on small objects, which demonstrates the effectiveness of DS-CNN.

Table 2. The average detection precision (%) of all models on testing dataset of PASCAL VOC 2007, where the training dataset is from PASCAL VOC 2007+2012, and the best score is highlighted in color.
Table 3. The average detection precision (%) of all models on testing dataset of PASCAL VOC 2012, where the training dataset is from PASCAL VOC 2007+2012, and the best score is highlighted in color.

To show that our model also achieves good results on a different dataset, we designed a third experiment. Here, the training data consist of the PASCAL VOC 2007+2012 training sets plus the PASCAL VOC 2007 test set, and testing is performed on the PASCAL VOC 2012 test set. We also compare DS-CNN with FRCN+YOLO [5] and HyperNet [16]; the experimental results are given in Table 3. It is easy to see that our model not only keeps high detection accuracy for large objects but also significantly improves the detection accuracy for small objects, with very little reduction in detection speed.

3.2 Experiments on MS COCO

The MS COCO dataset is more complex than PASCAL VOC, containing 80 object categories with about 80k images in the training set and 40k images in the validation set. In particular, the dataset has many small objects, so it is well suited to evaluating DS-CNN. We use the end-to-end training method, setting the base learning rate to 0.001 with the 'step' learning policy; the total number of iterations is 490k, and the learning rate is reduced to 0.0001 after 350k iterations. We report mAP@IoU\(\in \)[0.5:0.05:0.95] (COCO's standard metric) and mAP@0.50 (PASCAL VOC's metric). Experimental results are shown in Table 4. Our model achieves 23.1% mAP on the COCO metric and 43.6% mAP on the VOC metric. It is also interesting to note that our model performs well on small and medium objects, reaching 6.3% and 25.4% mAP respectively, while its performance on large objects is mediocre.
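In PyTorch terms, this schedule corresponds to the sketch below; the momentum value and the placeholder model are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

model = nn.Linear(1, 1)  # placeholder standing in for DS-CNN
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
# 'step' policy: lr = 0.001 for the first 350k iterations, then 0.0001 up to 490k
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=350_000, gamma=0.1)

for it in range(490_000):
    # ... forward / backward / optimizer.step() ...
    scheduler.step()
```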

Table 4. MS COCO 2015 test-dev detection average precision (%). All methods use VGG16, and 'area' refers to the size of the object.

3.3 Combine from Which Layers?

When using the skip-layers connection method, we need to consider which layers should be combined to obtain the best detection result. For example, the combination conv3+conv4+conv5 is best in ION [18], while conv1+conv3+conv5 is best in HyperNet [16]. We evaluate different combinations of conv1, conv2, conv3, and conv5, training and testing each combination end-to-end on the PASCAL VOC 2007 dataset. The experimental results are shown in Table 5. We find that it is not true that more layers yield higher accuracy; the best combination is conv1+conv5.

Table 5. The detection accuracy of different combinations of multiple layers.

3.4 The Evaluation of Speed

Detection speed and convergence speed are two important indicators of the performance of an object detection model. We compare DS-CNN with Faster R-CNN on PASCAL VOC 2007. For a fair comparison, we also set the number of final candidate regions in Faster R-CNN to 500 and run both models on our PC. We recorded the per-image detection times and averaged them: the detection speed of DS-CNN was about 12 fps, while that of Faster R-CNN was about 14 fps. This is an expected result, because DS-CNN spends more time than Faster R-CNN on feature extraction. However, the gap between 12 and 14 fps is small, so the speed also meets our target: only a slight reduction in detection speed.
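The timing protocol can be sketched as follows; `detect` here stands for one full forward pass of either model and is not a name from the paper.

```python
import time

def mean_fps(detect, images):
    """Average per-image detection time over a test set, reported as fps."""
    times = []
    for img in images:
        t0 = time.perf_counter()
        detect(img)  # one full detection pass
        times.append(time.perf_counter() - t0)
    return 1.0 / (sum(times) / len(times))
```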

Fig. 4. The mAP at different iterations.

We also used the end-to-end training and testing method to evaluate the convergence speed of the two models on the PASCAL VOC 2007 dataset, recording the mAPs of intermediate models at different iterations before the final models were produced. The comparison is shown in Fig. 4. It is easy to see that both models had converged by 70k iterations, with a mAP of 71.4% for DS-CNN versus 69.5% for Faster R-CNN. DS-CNN converges faster than Faster R-CNN: its mAP is about 6% higher at 2k iterations, and it begins to converge after 50k iterations.

4 Conclusions

We designed a new object detection model based on R-CNN. First, we designed a deep network using dilated convolution and a shallow network, and connected the two networks with the skip-layers connection method. Second, we used the RPN to generate object candidates. Third, we designed a dimensional reducer to reduce the dimension of the feature maps. Finally, the model outputs the classification and regression results. The experimental results show that our model not only keeps the detection accuracy for large objects but also significantly improves the detection accuracy for small objects, with very little reduction in detection speed. However, many more advanced structures cannot be applied to our model due to the limitations of the VGG16 and Fast R-CNN frameworks. In future work, we will investigate more advanced image feature extraction methods to further improve the accuracy and speed of object detection.