
1 Introduction

With the rapid development of spaceborne and airborne imaging technology, high-resolution remote sensing imagery (RSI) has become increasingly accessible, providing abundant information on the spatial structure, texture, and other properties of geographic objects. As a result, automatic building localization can potentially achieve higher accuracy, which benefits many remote sensing applications such as land planning, environmental management, and disaster assessment.

Developing automatic methods for building localization is therefore a significant task, and many approaches have been proposed over the past decades. In the early days, low-level handcrafted features were used to locate buildings. Kim et al. [1] extracted edge segments and detected possible building structures using a graph search strategy. Jung et al. [2] proposed a Hough transform-based method to extract rectangular building roofs.

Moreover, to obtain building contours, image segmentation can be used to partition RSI into regions and classify each pixel into a fixed set of categories [3], distinguishing buildings from the surrounding background. For example, Kampffmeyer et al. [4] combined different deep architectures, including patch-based and pixel-to-pixel approaches, to achieve good accuracy for small-object segmentation in urban remote sensing. Wu et al. [5] proposed a multi-constraint fully convolutional network to improve the performance of the U-Net model for building segmentation from aerial imagery. Troya-Galvis et al. [6] presented two extensions of a collaborative framework called CoSC, which outperform a hybrid pixel-object oriented approach as well as a deep learning approach. However, such methods produce only rough building segmentation boundaries, which are often irregular and cannot differentiate building instances.

Fig. 1. Example of building localization results from TQR-Net on a Google Earth image of Calgary, Alberta, Canada (\({51.05}^{\circ }N\), \({114.07}^{\circ }W\)).

Fig. 2. The architecture of the proposed multi-stage TQR-Net: (a) the feature extraction stage generates a rich, multi-scale feature pyramid; (b) the region proposal network outputs a set of object proposals with objectness scores \(s_{i}\) (e.g., i = 0, 1, 2 denotes three aspect ratios); (c) the bounding box branch regresses rectangular bounding boxes at each pyramid level; (d) the TQR box branch predicts quadrangular bounding boxes and obtains building contours.

Over the past five years, CNN-based object detectors [7,8,9,10] have greatly improved the detection of remotely sensed targets [11,12,13,14,15,16,17]. Consequently, CNN-based building detectors have also made a breakthrough. For example, Zhang et al. [18] proposed a CNN-based detector using a multi-scale saliency-based sliding window and improved non-maximum suppression (NMS) to detect suburban buildings. Li et al. [19] presented a cascaded CNN architecture utilizing the Hough transform to guide the CNN to extract mid-level building features. Chen et al. [20] proposed a two-stage CNN-based detector for multi-sized building localization, in which a multi-sized fusion region proposal network (RPN) and a novel dynamic weighting algorithm were used to generate and classify multi-sized region proposals, respectively. Although such object detection-based methods can identify individual buildings, they report detections as rectangular bounding boxes and cannot generate building contours. To tackle this problem, instance segmentation-based methods [21,22,23] can be adopted to detect buildings in RSI, but the contours they generate are still irregular.

As mentioned above, there are generally two kinds of bounding boxes for locating building targets. One is rectangular, which cannot capture the contours of buildings. The other is polygonal, produced by instance segmentation detectors (e.g., Mask R-CNN [10]), which locate buildings by predicting their segmentation masks and polygonal contours. However, such polygonal contours are often inaccurate due to their uncertain numbers of nodes and irregular shapes.

In this paper, aiming to strike a trade-off between these two kinds of bounding boxes, we propose to use quadrangular bounding boxes, generated directly by a tighter quadrangle-based convolutional neural network (TQR-Net). Considering that most buildings are quadrilateral, we adopt quadrangular bounding boxes with four nodes, which not only avoid irregular shapes but also preserve certain structural constraints.

Without bells and whistles, the experimental results show that the proposed TQR-Net better captures corner and contour features of building targets and achieves higher building localization precision. An example of localization results produced by TQR-Net on a Google Earth image of an urban area in Calgary is shown in Fig. 1.

2 Proposed Approach

As shown in Fig. 2, our method is based on a multi-stage region-based object detection framework. In this section, we elaborate on the proposed network in the following subsections.

2.1 Multi-stage Region-Based TQR-Net

TQR-Net consists of four main stages, i.e., feature extraction, a region proposal network, a bounding box branch, and a tighter quadrangle box branch. We detail each stage as follows.

Feature Extraction. A feature extraction network extracts features from the input image. Here we use ResNeXt-101 [24], whose multi-scale feature maps are extracted at five levels, denoted \(\{C_{1}, C_{2}, C_{3}, C_{4}, C_{5}\}\); at each level, the convolutional layers produce feature maps of the same size. To detect buildings at different scales, we adopt a Feature Pyramid Network (FPN) [25] on top of the convolutional backbone, which uses top-down lateral connections to build an in-network feature pyramid. The FPN takes \(\{C_{2}, C_{3}, C_{4}, C_{5}\}\) as input and generates the final set of feature maps defined as follows:

$$\begin{aligned} P_{*} = \{P_{2}, P_{3}, P_{4}, P_{5}, P_{6}\}. \end{aligned}$$
(1)
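
To make the construction of \(P_{*}\) concrete, the following is a minimal PyTorch sketch of the FPN top-down pathway, assuming the usual 256-channel width and a max-pooled \(P_{6}\) as in [25]; the class and variable names are ours, not part of the released implementation.

```python
import torch.nn as nn
import torch.nn.functional as F

class SimpleFPN(nn.Module):
    """Minimal FPN top-down pathway: {C2..C5} -> {P2..P6} (illustrative sketch)."""
    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        # 1x1 lateral convs reduce each C level to a common channel width
        self.lateral = nn.ModuleList([nn.Conv2d(c, out_channels, 1) for c in in_channels])
        # 3x3 convs smooth each merged map
        self.smooth = nn.ModuleList([nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                     for _ in in_channels])

    def forward(self, c2, c3, c4, c5):
        feats = [lat(c) for lat, c in zip(self.lateral, (c2, c3, c4, c5))]
        # top-down: upsample the coarser map and add the lateral connection
        for i in range(len(feats) - 2, -1, -1):
            feats[i] = feats[i] + F.interpolate(feats[i + 1],
                                                size=feats[i].shape[-2:], mode="nearest")
        p2, p3, p4, p5 = [s(f) for s, f in zip(self.smooth, feats)]
        p6 = F.max_pool2d(p5, kernel_size=1, stride=2)   # coarsest level for the RPN
        return p2, p3, p4, p5, p6
```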

Region Proposal Network. A region proposal network (RPN) generates regions of interest (RoIs) on the feature maps \(P_{*}\) using anchors pre-defined at five scales and three aspect ratios. In the RPN, classification and bounding box regression are performed by a \(3\times 3\) convolutional layer followed by two sibling \(1\times 1\) convolutions.
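
A minimal sketch of such an RPN head is shown below; the 256-channel input and three anchors per location (one scale per pyramid level, three aspect ratios) are assumptions based on the standard FPN-RPN configuration rather than details stated in this paper.

```python
import torch.nn as nn
import torch.nn.functional as F

class RPNHead(nn.Module):
    """RPN head shared across the pyramid levels P2..P6 (illustrative sketch)."""
    def __init__(self, in_channels=256, num_anchors=3):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, in_channels, 3, padding=1)
        # two sibling 1x1 convs: objectness score and 4 box deltas per anchor
        self.objectness = nn.Conv2d(in_channels, num_anchors, 1)
        self.bbox_deltas = nn.Conv2d(in_channels, num_anchors * 4, 1)

    def forward(self, pyramid):            # pyramid: iterable of P2..P6 feature maps
        scores, deltas = [], []
        for feat in pyramid:
            t = F.relu(self.conv(feat))
            scores.append(self.objectness(t))
            deltas.append(self.bbox_deltas(t))
        return scores, deltas
```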

Bounding Box Branch. After the RPN, \(7\times 7\) feature maps are extracted from the RoIs by applying RoIAlign [10] on {\(P_{2}\), \(P_{3}\), \(P_{4}\), \(P_{5}\)} and fed into the bounding box branch, which performs classification and rectangular bounding box regression.
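
The sketch below illustrates this step with torchvision's RoIAlign; for brevity it pools from a single pyramid level instead of assigning each RoI to its level, and the two 1024-d fully connected layers are an assumption in line with standard FPN box heads rather than a detail given in the paper.

```python
import torch.nn as nn
from torchvision.ops import roi_align

class BBoxBranch(nn.Module):
    """Classification and rectangular box regression on 7x7 RoI features (sketch)."""
    def __init__(self, in_channels=256, num_classes=2):   # background + building
        super().__init__()
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(in_channels * 7 * 7, 1024), nn.ReLU(inplace=True),
            nn.Linear(1024, 1024), nn.ReLU(inplace=True))
        self.cls_score = nn.Linear(1024, num_classes)
        self.bbox_pred = nn.Linear(1024, num_classes * 4)

    def forward(self, feature_map, rois, spatial_scale):
        # rois: (N, 5) tensor of (batch_idx, x1, y1, x2, y2) in image coordinates
        pooled = roi_align(feature_map, rois, output_size=(7, 7),
                           spatial_scale=spatial_scale, sampling_ratio=2)
        x = self.fc(pooled)
        return self.cls_score(x), self.bbox_pred(x)
```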

Tighter Quadrangle Box Branch. In the proposed network, a tighter quadrangle (TQR) box branch is applied to generate building contours as quadrangular bounding boxes. Similar to the sequential protocol of coordinates proposed in [26], we define a quadrangular bounding box uniquely by ordering its four nodes. By default, the four nodes are arranged clockwise, and the node closest to the grid origin is taken as the first. In particular, if two nodes are at the same distance from the grid origin, the node with the smaller x value is taken as the first. After determining the order of the nodes, and inspired by the coordinates of a rectangular bounding box,

$$\begin{aligned} r_{*} = (x, y, w, h), \end{aligned}$$
(2)

the TQR box, defined by four nodes (eight raw coordinates), can be represented as follows:

$$\begin{aligned} t_{*} = (x, y, w_{1}, h_{1}, w_{2}, h_{2}, w_{3}, h_{3}, w_{4}, h_{4}). \end{aligned}$$
(3)

Here, x and y denote the center coordinates of the TQR box's minimum bounding rectangle, and \(w_{n}, h_{n}\) represent the position of the n-th node (\(n = 1, 2, 3, 4\)) relative to the center coordinates.
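
The node-ordering protocol and the 10-value parameterization can be summarized by the short NumPy sketch below; the function names are ours, and the tie-break on the smaller x value follows the rule stated above.

```python
import numpy as np

def order_quad_nodes(quad):
    """Order four nodes clockwise (image coordinates, y pointing down), starting
    from the node closest to the grid origin; ties use the smaller x value."""
    quad = np.asarray(quad, dtype=np.float64)            # shape (4, 2): (x, y) nodes
    center = quad.mean(axis=0)
    angles = np.arctan2(quad[:, 1] - center[1], quad[:, 0] - center[0])
    quad = quad[np.argsort(angles)]                      # clockwise on screen
    dist = np.hypot(quad[:, 0], quad[:, 1])
    closest = np.flatnonzero(np.isclose(dist, dist.min()))
    first = closest[np.argmin(quad[closest, 0])]         # smaller x breaks the tie
    return np.roll(quad, -first, axis=0)

def tqr_parameterize(quad):
    """Return the 10-coordinate representation (x, y, w1, h1, ..., w4, h4) of Eq. (3)."""
    quad = order_quad_nodes(quad)
    x_min, y_min = quad.min(axis=0)
    x_max, y_max = quad.max(axis=0)
    cx, cy = (x_min + x_max) / 2.0, (y_min + y_max) / 2.0   # min-bounding-rect center
    offsets = quad - np.array([cx, cy])                      # (w_n, h_n) for each node
    return np.concatenate(([cx, cy], offsets.ravel()))
```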

As mentioned above, to generate the TQR box, {\(P_{2}\), \(P_{3}\), \(P_{4}\), \(P_{5}\)} are fed into the TQR box branch, which uses RoIAlign to extract \(7\times 7\) feature maps from the boxes \((x_{b},y_{b},w_{b},h_{b})\) output by the bounding box branch. Then, three fully connected layers collapse these small feature maps into two 10-d vectors \(\{t_{0}, t_{1}\}\), where \(t_{0}\), corresponding to the background class, is ignored in the loss computation, and \(t_{1}\) represents the predicted TQR box. For TQR box regression, we adopt the following parameterization of the 10 coordinates:

$$\begin{aligned} {\begin{matrix} &{}d_{x} = (x - x_{b})/w_{b},\ d_{w_{n}} = w_{n}/w_{b}, \\ &{}d_{y} = (y - y_{b})/h_{b},\ d_{h_{n}} = h_{n}/h_{b}, \\ &{}d^{*}_{x} = (x^{*} - x_{b})/w_{b},\ d^{*}_{w_{n}} = w^{*}_{n}/w_{b}, \\ &{}d^{*}_{y} = (y^{*} - y_{b})/h_{b},\ d^{*}_{h_{n}} = h^{*}_{n}/h_{b}, \end{matrix}} \end{aligned}$$
(4)

where \(x^{*}, y^{*}, w^{*}_{n}, h^{*}_{n}\ (n = 1, 2, 3, 4)\) stand for the ground-truth TQR box.
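
For illustration, Eq. (4) can be implemented as the encode/decode pair below; the function and variable names are ours, and the box layout follows \(r_{*}\) and \(t_{*}\) above.

```python
import numpy as np

def encode_tqr(tqr, roi):
    """Turn a 10-coordinate TQR box into regression targets w.r.t. an RoI, as in Eq. (4).
    tqr: (x, y, w1, h1, ..., w4, h4); roi: (x_b, y_b, w_b, h_b)."""
    x, y = tqr[0], tqr[1]
    offsets = np.asarray(tqr[2:]).reshape(4, 2)
    x_b, y_b, w_b, h_b = roi
    d = [(x - x_b) / w_b, (y - y_b) / h_b]
    for w_n, h_n in offsets:
        d.extend([w_n / w_b, h_n / h_b])
    return np.asarray(d)

def decode_tqr(d, roi):
    """Invert the encoding: recover the center and the four nodes in image coordinates."""
    x_b, y_b, w_b, h_b = roi
    x, y = d[0] * w_b + x_b, d[1] * h_b + y_b
    offsets = np.asarray(d[2:]).reshape(4, 2) * np.array([w_b, h_b])
    nodes = np.array([x, y]) + offsets                   # the quadrangle contour
    return (x, y), nodes
```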

Fig. 3. Joint loss curves of TQR-Net with ResNeXt-101 in three typical areas.

Fig. 4. Precision-recall comparisons of bounding boxes between TQR-Net and the baseline methods with different backbones on the Qinghai Province dataset in three different kinds of areas (\(\mathrm{IoU}=0.5\)). Key: R = ResNet-101-FPN; X = ResNeXt-101-FPN; M = Mask Branch.

2.2 Loss Function

For end-to-end training, we optimize our network with a joint loss, which combines \(L_{rpn}\), \(L_{bbox}\) and \(L_{tqr}\) for the region proposal network, the bounding box branch, and the TQR box branch, respectively. Formally, we compute the joint loss L for each mini-batch as follows:

$$\begin{aligned} L = \sum ^{\varTheta }_{\theta }L^{(\theta )}_{rpn}+\sum ^{\varTheta }_{\theta }L^{(\theta )}_{bbox}+\sum ^{\varTheta }_{\theta }L^{(\theta )}_{tqr}+\varphi \parallel \mathbf {w}\parallel ^2, \end{aligned}$$
(5)

where \(\varphi \) is a hyper-parameter and \(\mathbf {w}\) is the vector of network weights; the definitions of the RPN loss \(L^{(\theta )}_{rpn}\) and the bounding box branch loss \(L^{(\theta )}_{bbox}\) for the \(\theta \)-th image in a mini-batch follow [9, 10] (e.g., batch size \(\varTheta = 3\) in our experiments). Moreover, the TQR box branch loss \(L_{tqr}\) for one image is defined as follows:

$$\begin{aligned} \begin{aligned}&L_{tqr}(\{d_{i}\},\{d^{*}_{i}\}) = \lambda \frac{1}{N_{tqr}}\sum _{i}smooth_{L_{1}}(d_{i} - d^{*}_{i}), \\&smooth_{L_{1}}(x) = {\left\{ \begin{array}{ll} 0.5x^{2} &{}\text{ if } |x|<1\\ |x|-0.5 &{}\text{ otherwise } \end{array}\right. }. \end{aligned} \end{aligned}$$
(6)

Here, i and \(N_{tqr}\) are the index and the number of TQR boxes, and \(d_{i}\) and \(d^{*}_{i}\) represent the 10 parameterized coordinates of the predicted and ground-truth TQR boxes, respectively. For the regression loss, we use \(smooth_{L_{1}}\), the robust loss function defined in [8].
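
A minimal PyTorch sketch of Eq. (6) is given below; torch.nn.functional.smooth_l1_loss with its default transition at \(|x| = 1\) matches the robust term above, and the default values used here for \(N_{tqr}\) and \(\lambda \) correspond to the settings reported below. The function name is ours.

```python
import torch.nn.functional as F

def tqr_branch_loss(d_pred, d_gt, n_tqr=1000, lam=10.0):
    """TQR box branch loss of Eq. (6): (lambda / N_tqr) * sum_i smooth_L1(d_i - d_i*).
    d_pred, d_gt: (num_boxes, 10) tensors of parameterized coordinates."""
    # beta=1.0 switches from 0.5*x^2 to |x| - 0.5 at |x| = 1, matching Eq. (6)
    loss = F.smooth_l1_loss(d_pred, d_gt, reduction="sum", beta=1.0)
    return lam * loss / n_tqr
```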

Fig. 5. Building localization results in Qinghai Province, China. First two rows (urban): Tianjun Dist. in Haixi Mongolian T.A.P (\({37.30}^{\circ }N\), \({99.02}^{\circ }E\)) and Xinghai Dist. in Hainan T.A.P (\({35.58}^{\circ }N\), \({99.99}^{\circ }E\)). Middle two rows (suburban): Tu E.A.D in Haidong City (\({36.82}^{\circ }N\), \({101.99}^{\circ }E\)) and Tongde Dist. in Hainan T.A.P (\({35.26}^{\circ }N, {100.55}^{\circ }E\)). Last two rows (rural): Gonghe Dist. in Hainan T.A.P (\({36.40}^{\circ }N, {100.97}^{\circ }E\)) and Datong Hui and Tu E.A.D in Xining City (\({37.03}^{\circ }N, {101.50}^{\circ }E\)). Key: T.A.P = Tibetan Autonomous Prefecture; E.A.D = Ethnic Autonomous District.

In this paper, we set the weight decay \(\varphi = 0.0001, N_{tqr} = 1000\), and the loss weight \(\lambda = 10\). The joint loss curves of TQR-Net with ResNeXt-101 in three typical kinds of areas are shown in Fig. 3.

Table 1. Comparisons of bounding box \(\mathrm{{AP}}^\mathrm{{bb}}_{}\)(%) and \(\mathrm{{AR}}^\mathrm{{bb}}_{}\)(%) among the baseline methods and the proposed method on Qinghai Province dataset in three different kinds of areas. Key: M.R. = Mask R-CNN [10]; R = ResNet-101-FPN; X = ResNeXt-101-FPN; M = Mask Branch.

3 Experiments and Discussion

3.1 Dataset

To evaluate our method, we collect a large building dataset from Google Earth, in which all buildings are manually labeled with minimum bounding rectangles. The RGB images in this dataset cover rural, suburban, and urban areas in Qinghai Province, China. In total, there are 48222 labeled buildings (7628, 16533, and 24061 in rural, suburban, and urban areas, respectively) in 1660 images (296, 631, and 733, respectively). For each area, the images are randomly split into \(50\%\) for training and \(50\%\) for testing.

3.2 Implementation and Results

All models are implemented in PyTorch and trained on three NVIDIA GeForce GTX 1080 Ti GPUs with 11 GB of memory each. We evaluate ResNet-101 [27] and ResNeXt-101 [24] pre-trained on ImageNet [28] as backbones. For the parameters of the new layers, we adopt the weight initialization strategy introduced in [29]. We train the network using stochastic gradient descent (SGD) with a fixed learning rate of 0.002 and a momentum of 0.9.
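
These settings translate directly into PyTorch as sketched below; treating [29] as He (Kaiming) initialization for the newly added layers is our reading, the weight decay value comes from Sect. 2.2, and the function and argument names are ours.

```python
import torch
import torch.nn as nn

def configure_training(model, new_heads=()):
    """SGD with the fixed learning rate, momentum, and weight decay used in the paper;
    'new_heads' are the freshly added modules, initialized following [29]."""
    for head in new_heads:
        for m in head.modules():
            if isinstance(m, (nn.Conv2d, nn.Linear)):
                nn.init.kaiming_normal_(m.weight, mode="fan_out", nonlinearity="relu")
                if m.bias is not None:
                    nn.init.zeros_(m.bias)
    return torch.optim.SGD(model.parameters(), lr=0.002,
                           momentum=0.9, weight_decay=0.0001)
```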

The proposed TQR-Net is compared with Mask R-CNN [10] in the three typical kinds of areas, and the TQR box branch is also compared with the mask branch. Table 1 reports the COCO-style bounding box average precision (\(\mathrm{{AP}}^\mathrm{{bb}}_{}\)) and average recall (\(\mathrm{{AR}}^\mathrm{{bb}}_{}\)), following the definitions in [30].

Table 1 shows that TQR-Net outperforms the baseline methods in both \(\mathrm{{AP}}^\mathrm{{bb}}_{}\) and \(\mathrm{{AR}}^\mathrm{{bb}}_{}\) in all three areas. For example, compared with Mask R-CNN with the mask branch and a ResNeXt-101 backbone, TQR-Net improves \(\mathrm{{AP}}^\mathrm{{bb}}_{}\) by \(3.7\%\) and \(\mathrm{{AR}}^\mathrm{{bb}}_{}\) by \(5.5\%\) in the rural area. Moreover, Fig. 4 shows precision-recall curve comparisons between our method and the competitors with different backbones in the three kinds of areas (for convenience, the precision-recall curves are drawn following the PASCAL VOC format). Some localization results generated by TQR-Net with a ResNeXt-101 backbone are shown in Fig. 5. Overall, our method preserves more geometric information while maintaining certain structural constraints, which aids building localization.
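
For reference, the COCO-style \(\mathrm{{AP}}^\mathrm{{bb}}_{}\) and \(\mathrm{{AR}}^\mathrm{{bb}}_{}\) values follow the metrics of [30] and can be computed with pycocotools as sketched below; the annotation and result file paths are placeholders.

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

def evaluate_bbox(ann_file, det_file):
    """COCO-style bounding-box evaluation (sketch; file paths are placeholders)."""
    coco_gt = COCO(ann_file)                 # ground-truth annotations in COCO format
    coco_dt = coco_gt.loadRes(det_file)      # detections as a COCO results JSON file
    coco_eval = COCOeval(coco_gt, coco_dt, iouType="bbox")
    coco_eval.evaluate()
    coco_eval.accumulate()
    coco_eval.summarize()                    # prints AP/AR averaged over IoU 0.50:0.95
    return coco_eval.stats                   # stats[0]: AP, stats[8]: AR at 100 detections
```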

4 Conclusion

In this paper, a multi-stage CNN-based method called TQR-Net has been proposed to locate buildings with quadrangular bounding boxes; it can be trained end-to-end with a joint loss function. Our method strikes a trade-off between rectangular and polygonal bounding boxes to acquire high-quality building contours. Unlike traditional object detection-based and instance segmentation-based methods, TQR-Net directly generates TQR boxes, which offer more degrees of freedom than rectangular bounding boxes while avoiding the irregular shapes and the extra time and resource overheads associated with predicting masks. Experiments on a large Google Earth dataset covering three typical kinds of areas demonstrate its effectiveness for the building instance localization task.