
1 Introduction

In recent years, thanks to advances in deep convolutional neural networks, a large number of computer vision tasks have enjoyed significant progress, including segmentation [7, 8], image classification [1,2,3], and object detection [4,5,6]. Among them, object detection is one of the fundamental problems and has been widely studied. Currently, there are two mainstream frameworks for object detection: one-stage frameworks such as SSD [9] and YOLO [10], which directly cast object localization as a regression problem without extracting proposals; and two-stage frameworks such as Fast R-CNN [5] and Faster R-CNN [6], which generate proposals with an RPN layer [6] and then apply classification and regression to each proposal.

Fig. 1. The framework of our method. We feed an image into VGG16 pre-trained on the ImageNet dataset to obtain a feature map named conv5_3. The residual joint attention module then recalibrates the conv5_3 feature map. Next, the resulting feature map, named attention5_3, is passed to an RPN layer followed by the graph structure inference part, which incorporates two kinds of contextual information into the inference of node states. Eventually, the final state of each node is used to predict the category and refine the location of the corresponding RoI.

Most object detectors comprise three main parts: CNN feature extraction, proposal classification, and duplicate detection removal. Among these, improving the quality of the features produced by the ConvNet backbone is a straightforward idea through which many algorithms have made major breakthroughs [12,13,14,15]. Most of them use effective methods to increase the receptive field or the semantic information of the feature maps extracted from ConvNet backbones. However, none of them considers exploiting the spatial and channel information of the feature maps to improve detection accuracy.

Motivated by the success of attention modules in the image classification field [16], we consider combining spatial attention with channel attention. As indicated in SENet [16], channel-wise features can be adaptively recalibrated by effectively modeling the interdependencies between the channels of the feature map extracted from a ConvNet backbone. Analogously to SENet [16], we also model the interdependencies between spatial features. The joint, multiplicative result of the spatial attention map and the channel attention map is then applied to recalibrate the original features. Intuitively, we conjecture that this injects the complementary and compatible information of the spatial and channel attentions into the proposed network, enhancing useful features and suppressing less informative ones. In addition, we combine residual learning [3] with the joint attention to form a residual joint attention module. Together, these boost the model's discriminative power. The proposed object detection network is shown in Fig. 1.

In this paper, the proposed module is incorporated into an advanced object detector with graph structure inference [11], adding only a small number of parameters. In principle, the residual joint attention module is universal and is not restricted to object detection.

2 Related Work

With the rise of deep convolutional neural networks, two-stage detectors have rapidly come to dominate object detection over the past few years [4,5,6]. These advanced object detectors largely follow the pioneering work R-CNN [4]. R-CNN first generates object proposals by Selective Search [21] and then performs classification and bounding-box regression on every proposal. The biggest problem of R-CNN is that its repetitive convolutional operations consume too much time. To speed it up, Fast R-CNN [5] introduces a novel RoI pooling layer that extracts features for each proposal from the shared ConvNet feature map of the whole image. However, the proposal generator is still not trained jointly with Fast R-CNN. To solve this problem, Faster R-CNN [6] develops an RPN that can generate precise proposals and be trained together with the detection subnetwork. Different from two-stage detectors, one-stage detectors remove the proposal generator and directly perform classification and regression on a series of pre-computed anchors for real-time detection. Nevertheless, these state-of-the-art methods consider only the appearance features of objects, ignoring the connections between the context and the objects in an image. Consequently, it is natural to exploit contextual information to improve object detection.

Many papers have argued that scene information or relations between objects help object detection [17,18,19]. However, after the rise of deep learning, there were no significant breakthroughs in using contextual information for object detection until the emergence of [11, 20]. In SIN [11], two kinds of contextual information are introduced: one is scene-level context, and the other is instance-level relationships. These two complementary kinds of contextual information are combined through GRUs [22] to aid detection. Hu et al. [20] propose an object relation module for object detection. By modeling the interdependencies between object appearance features and object geometry features, the object relation module can be used for instance recognition.

Most object detectors include CNN feature extraction, proposal classification, and duplicate detection removal. Using contextual information operates on the proposal classification part. Another way to improve object detection is to promote the quality of the features of ConvNet backbones. At present, many works focus on increasing the receptive field and semantic information of the features extracted from ConvNet backbones [12,13,14,15]. To involve multi-scale features, FPN [12] utilizes the hierarchical feature maps from different depths of a CNN. DES [13] augments the low-level feature maps of VGG16 with strong semantic information trained with weak bounding-box-level segmentation ground truth. To give the feature maps higher resolution and a larger receptive field at the same time, DetNet [14] designs a new backbone. RFB [15] adds dilated convolution layers on top of SSD [9] to effectively increase the receptive field of the feature maps.

Attention can be seen as a way of allocating limited computational resources to the most useful parts of an image. Therefore, attention can be used to improve the quality of the features of ConvNet backbones by selectively emphasizing informative features and suppressing noise. However, to the best of our knowledge, only one prior work [13] applies an attention mechanism to ConvNet backbones in object detection.

3 Method

In this section, we present the details of the proposed network. First, we describe the graph structure inference part; then we elaborate on the residual joint attention module.

3.1 Graph Structure Inference

Contextual information plays an important role in accurate object detection. Therefore, advanced detectors not only consider object visual appearance but also take advantage of two kinds of structured contextual information: scene-level information and object relationship information. SIN [11] is one such detector, which treats object detection as a problem of graph structure inference. Given an image, the objects are treated as graph nodes, while the relationships between the objects are regarded as graph edges, jointly under the supervision of the scene context formed by the whole image. More specifically, an object receives information passed from the scene and from the other objects closely related to it. In this way, the object state is finally determined by both its appearance features and the contextual information. To encode the different kinds of information into objects, SIN chooses Gated Recurrent Units (GRUs) [22] as the tool for graph structure inference. The graph structure inference part is shown in Fig. 1. The specific steps are described as follows.

Initially, an image is passed through pre-trained VGG16 and the residual joint attention module. The feature map named attention5_3 is extracted and sent to the graph structure inference part. After the RPN, a fixed number of RoIs (Regions of Interest) are obtained. To get the 4096-dimensional descriptors of the graph nodes, RoI pooling followed by an FC layer is applied to each RoI. The scene descriptor is extracted from the conv5_3 feature map by the same layers as the graph nodes. As for the 4096-dimensional descriptors of the graph edges, object-object relationships are modeled from both the spatial features and the visual features of the objects. Finally, GRUs whose inputs are the 4096-dimensional scene and edge vectors and whose initial states are the 4096-dimensional object vectors are iteratively updated for two steps to determine the final node states.
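To make the two-step update concrete, the following is a minimal PyTorch-style sketch of the inference loop. The class name GraphInference, the mean aggregation over edge descriptors, and the averaging of the two GRU outputs are our illustrative assumptions; SIN's exact message functions differ, so this is a sketch of the idea rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class GraphInference(nn.Module):
    """Sketch of SIN-style graph inference with GRUs (assumed simplification)."""
    def __init__(self, dim=4096, steps=2):
        super().__init__()
        self.scene_gru = nn.GRUCell(dim, dim)  # message from scene context
        self.edge_gru = nn.GRUCell(dim, dim)   # message from other objects
        self.steps = steps

    def forward(self, nodes, scene, edges):
        # nodes: (N, dim) RoI descriptors; scene: (dim,) whole-image descriptor
        # edges: (N, N, dim) pairwise edge descriptors
        h = nodes                                  # initial state = object vectors
        for _ in range(self.steps):                # two update steps, as described
            msg = edges.mean(dim=1)                # aggregate incoming edge messages
            h_scene = self.scene_gru(scene.expand_as(h), h)
            h_edge = self.edge_gru(msg, h)
            h = (h_scene + h_edge) / 2             # fuse the two contextual updates
        return h  # final node states for classification and regression

# usage with toy sizes:
# gi = GraphInference(dim=256)
# states = gi(torch.randn(8, 256), torch.randn(256), torch.randn(8, 8, 256))
```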

Fig. 2. The structure of the residual joint attention module. The three items in each convolution-layer block are the filter shape, filter number, and stride.

3.2 Residual Joint Attention Module

As Fig. 2 shows, our residual joint attention module is the union of spatial attention, channel attention, and residual learning. The spatial attention aims at selecting spatially important features (independent of channels), while the channel attention is dedicated to seeking the channels vital for our task. The ultimate goal of both is to promote the quality of the features of ConvNet backbones by performing feature recalibration. Intuitively, if the two are compatible and complementary in functionality, their combination applies the attention mechanism to every pixel of a feature map and should outperform either attention alone. At the same time, residual learning is introduced to retain the good attributes of the original features. To this end, we package the spatial attention, the channel attention, and the residual learning into a module that can conveniently be embedded anywhere in a CNN with only a small number of additional parameters.

Mathematically, let \(X\in {{\varvec{R}}}^{h\times w\times c}\) be the input to a residual joint attention module, where h, w, and c denote the height, width, and channel dimensions of the input feature map, respectively. X then goes through three branches: the channel attention branch, which produces a weight map \(C\in {{\varvec{R}}}^{1\times 1\times c}\); the spatial attention branch, which produces a weight map \(S\in {{\varvec{R}}}^{h\times w\times 1}\); and the residual learning branch. Eventually, we combine the two weight maps in a natural way to obtain the final weight map \(A\in {{\varvec{R}}}^{h\times w\times c}\):

$$\begin{aligned} A=C\times S \end{aligned}$$
(1)

Next, we describe the designs of the three branches in detail.

Channel Attention Branch. Our channel attention branch is derived from SENet [16], which promotes the quality of the features of a convolutional neural network by effectively modeling the interactions between the channels of a specific layer. There are two main steps. The first is the squeeze operation (0 parameters), which produces a channel descriptor by squeezing global spatial information via global average pooling (GAP). The squeeze stage produces \( Z\in {{\varvec{R}}}^{1\times 1\times c} \):

$$\begin{aligned} Z_i=\frac{1}{h\times w}\sum _{j=1}^{h}\sum _{k=1}^{w}X_{j,k,i} \end{aligned}$$
(2)

The second is the excitation operation (\( \frac{2c^2}{r} \) parameters), which aims to capture the interactions between the channels. We use two convolutions to generate \(C\in {{\varvec{R}}}^{1\times 1\times c}\):

$$\begin{aligned} C=\mathrm{ReLU}(W_2\,\mathrm{ReLU}(W_1Z)) \end{aligned}$$
(3)

where \( W_1\in {{\varvec{R}}}^{1\times 1\times \frac{c}{r}} \) and \( W_2\in {{\varvec{R}}}^{1\times 1\times c} \) (bias terms are omitted to simplify the notation). In our work, we set \(r=4\) to reduce the number of parameters.
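A minimal PyTorch sketch of this branch follows, assuming channel-first (B, c, h, w) tensors and bias-free 1x1 convolutions so the parameter count matches the \( \frac{2c^2}{r} \) stated above; the class name and tensor layout are our assumptions.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Sketch of the channel attention branch (Eqs. 2-3)."""
    def __init__(self, c, r=4):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)   # GAP, Eq. (2), 0 parameters
        self.excite = nn.Sequential(             # Eq. (3), 2c^2/r parameters
            nn.Conv2d(c, c // r, kernel_size=1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(c // r, c, kernel_size=1, bias=False),
            nn.ReLU(inplace=True))

    def forward(self, x):                        # x: (B, c, h, w)
        return self.excite(self.squeeze(x))      # C: (B, c, 1, 1)
```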

Spatial Attention Branch. To explicitly model the interactions between the spatial features of a convolution layer, we follow SENet [16] in building a spatial attention branch. The first step eliminates channel information with a global cross-channel average pooling operation (0 parameters), producing \( M\in {{\varvec{R}}}^{h\times w\times 1} \). The operation is defined as follows:

$$\begin{aligned} M_{i,j}=\frac{1}{c}\sum _{k=1}^{c}X_{i,j,k} \end{aligned}$$
(4)

A convolution with a kernel size of \(3\times 3 \) (9 parameters) is then applied to M. This filter serves to model the interactions between the spatial features. Lastly, we use a harmonizing \( 1\times 1\) convolution (1 parameter) to obtain \( S\in {{\varvec{R}}}^{h\times w\times 1} \).
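The spatial branch admits an equally small sketch under the same channel-first convention; the two convolutions below are bias-free so their parameter counts match the 9 and 1 stated above.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Sketch of the spatial attention branch (Eq. 4 plus two convolutions)."""
    def __init__(self):
        super().__init__()
        self.conv3 = nn.Conv2d(1, 1, kernel_size=3, padding=1, bias=False)  # 9 params
        self.conv1 = nn.Conv2d(1, 1, kernel_size=1, bias=False)             # 1 param

    def forward(self, x):                     # x: (B, c, h, w)
        m = x.mean(dim=1, keepdim=True)       # Eq. (4): cross-channel average, 0 params
        return self.conv1(self.conv3(m))      # S: (B, 1, h, w)
```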

After obtaining the two attention maps S and C, we combine them by tensor multiplication. However, the union of the spatial attention and the channel attention is not inherently harmonious, so we further apply a \( 1\times 1\times c\) convolution (\( c^2 \) parameters) to make the combination more harmonious. A sigmoid activation function then maps the combination into the range between 0.5 and 1.

Residual Learning Branch. In our experiments, we notice that the element-wise product of the original features and the joint attention map, which ranges from 0.5 to 1, degrades the values of the original features. In fact, the values of the useless features decrease more significantly than those of the useful features. To ease this situation, we apply residual learning to the joint attention mechanism. Following ResNet [3], if the attention module can be built as an identity mapping, the performance should be no worse than without attention. The new output is expressed as:

$$\begin{aligned} X^{'}=X+A\times X \end{aligned}$$
(5)

To harmoniously fuse the original features with the weighted features, we apply a convolution with a \( 1\times 1\times c\) filter (\( c^2 \) parameters) to the original features before fusion.
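Putting the pieces together, a hedged sketch of the whole module, reusing the two branch sketches above, might look as follows. The placement of the sigmoid and of the two 1x1 fusion convolutions follows the description in this subsection, but details such as the omitted bias terms remain assumptions.

```python
import torch
import torch.nn as nn

class ResidualJointAttention(nn.Module):
    """Sketch of the full module (Eqs. 1 and 5), channel-first tensors assumed."""
    def __init__(self, c, r=4):
        super().__init__()
        self.ca = ChannelAttention(c, r)
        self.sa = SpatialAttention()
        self.fuse = nn.Conv2d(c, c, kernel_size=1, bias=False)      # harmonize C x S, c^2 params
        self.identity = nn.Conv2d(c, c, kernel_size=1, bias=False)  # 1x1 conv on original features

    def forward(self, x):                                   # x: (B, c, h, w)
        joint = self.ca(x) * self.sa(x)                     # Eq. (1): A = C x S (broadcast)
        a = torch.sigmoid(self.fuse(joint))                 # map weights into (0, 1)
        return self.identity(x) + a * x                     # Eq. (5): X' = X + A*X
```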

Table 1. Overall performance on VOC2007 test.

4 Experiments

In our experiments, we evaluate our model on the PASCAL VOC dataset [23]. In addition, several ablation studies are conducted on the various branches to verify the effectiveness of our method. All experiments are evaluated using the VOC metric with IoU = 0.5.

Table 2. Ablation studies on VOC2007 test.

4.1 Experimental Settings

During training and testing, the number of proposals is set to 128, because too many proposals lead to out-of-memory errors during graph structure inference. We follow the popular split that takes the combination of VOC2007 trainval and VOC2012 trainval as the training data and VOC2007 test as the test data. The number of training steps is set to 130,000: we use a learning rate of 0.0005 for the first 80,000 iterations and reduce it by a factor of 10 for the remaining 50,000 iterations. We train the parameters of our network with momentum gradient descent, using a momentum of 0.9 and a batch size of 1.
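For clarity, the stepwise schedule described above can be expressed as a small helper; this is only a sketch of the schedule, as the paper does not specify the training framework or its API.

```python
def learning_rate(step, base_lr=5e-4, drop_step=80_000):
    """0.0005 for the first 80k iterations, divided by 10 for the remaining 50k."""
    return base_lr if step < drop_step else base_lr / 10.0
```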

4.2 Overall Performance

The results are shown in Table 1. To allow a fair comparison, the ConvNet backbone for all methods in Table 1 is VGG16. Compared with the baseline SIN [11], our mAP is higher, which shows that the residual joint attention module genuinely helps the detector achieve better detection accuracy. Interestingly, our model performs very well on some specific classes, including aero, bird, bus, chair, and plant. Our method also outperforms ION [24], a network that explicitly models contextual information with an RNN, and Shrivastava et al. [25], which exploits segmentation information in the Faster R-CNN framework. We show some detection examples in Fig. 3. The top row shows the results of the original SIN, and the bottom row shows the results of our network. From these examples, it can be seen that our method is good at detecting objects in difficult situations, such as a dog visible only by its head, a blurry ship, and occluded cows and sheep. These results also directly indicate that our residual joint attention module makes the originally inconspicuous features more powerful and discriminative through feature recalibration.

Fig. 3. Examples of detection results. Top row: SIN. Bottom row: ours.

4.3 Ablation Studies

To verify the effectiveness of each branch of our proposed method, we conduct several ablation studies using the same dataset settings as above. Table 2 shows the results for different branches, where Baseline stands for SIN, SA stands for the spatial attention branch, and CA stands for the channel attention branch. Compared with the baseline, adding either kind of attention mechanism yields a slight improvement, showing that feature recalibration through attention mechanisms is effective. Moreover, the combination of the spatial attention and the channel attention improves more than either attention alone, which supports our conjecture that the spatial attention and the channel attention are complementary and compatible. Similarly, the residual learning further improves our model, which demonstrates its validity.

5 Conclusion

In this paper, we proposed a residual joint attention module embedded in an advanced network with graph structure inference. The graph structure inference part is used in the detection subnetwork of the detector, and the residual joint attention module, composed of the spatial attention, the channel attention, and the residual learning, follows VGG16. Owing to the complementarity and compatibility of the spatial attention and the channel attention, the joint attention mechanism improves the representational power of a network through feature recalibration more significantly than either attention alone. Moreover, the residual learning retains the good attributes of the original features. Quantitative evaluations show that our residual joint attention module boosts the model's discriminative power. We hope that this paper can serve as a useful reference for researchers applying attention.