1 Introduction

Unmanned aerial vehicles (UAVs) are widely used in civilian applications. Compared with standard airborne aerial systems, a UAV system is more flexible and efficient, especially for small-area coverage [1], which makes it well suited to time-critical events where rapidly acquiring current and accurate spatial information is essential. In such cases, a large number of UAV images must be processed in a short time, e.g. to build an image mosaic, with moderate, near-orthophoto accuracy. However, traditional photogrammetry and automatic aerial triangulation (AAT) cannot create such mosaics efficiently because of the high variation among UAV images.

One approach is to decompose an image into local regions of interest, or so-called ‘features’, to reduce the time complexity. To this end, the past decades have seen considerable advances in feature descriptors and matching strategies. Representative approaches include SIFT [2, 3] and its variants, which are distinctive and invariant to various image transformations. However, since SIFT is computationally expensive, alternatives such as SURF [4] and PCA-SIFT [5] have been proposed to speed it up. These methods achieve matching rates similar to SIFT while running much faster.

Unfortunately, these gradient-based methods still struggle to meet the requirements of real-time image mosaicking, especially on mobile devices with low computing power and memory capacity. As a result, algorithms with fixed-point operations and a low memory load are preferred, such as Binary Robust Independent Elementary Features (BRIEF) [7], Oriented FAST and Rotated BRIEF (ORB) [8], Binary Robust Invariant Scalable Keypoints (BRISK) [9], and Fast Retina Keypoint (FREAK) [10]. However, such hand-crafted and heavily engineered features are difficult to generalize to new domains. Other works [14, 20] learn short binary codes by minimizing the distances between positive (matching) pairs of training descriptors while maximizing the distances between negative pairs. Binarization is usually performed by multiplying the descriptors by a projection matrix, subtracting a threshold vector, and retaining only the sign of the result. Despite their low memory load, these descriptors tend to perform worse than floating-point descriptors.
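The binarization step just described (project, subtract a threshold, keep the sign) can be sketched in a few lines. This is only an illustration assuming NumPy; the names `binarize`, `P` and `t` and the random values are hypothetical and are not taken from the cited works, where the projection and thresholds are learned from training data.

```python
import numpy as np

def binarize(desc, P, t):
    """Return the bits of sign(P @ desc - t), encoded as 0/1."""
    return (P @ desc - t > 0).astype(np.uint8)

rng = np.random.default_rng(0)
desc = rng.normal(size=128)      # stand-in for a 128-D floating-point descriptor
P = rng.normal(size=(32, 128))   # hypothetical projection matrix (learned in practice)
t = np.zeros(32)                 # hypothetical threshold vector (learned in practice)
bits = binarize(desc, P, t)      # 32-bit binary code
```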

To bridge the performance gap between floating-point and binary descriptors without increasing the computational cost, we propose a Compressed Binary Discriminative Feature (CBDF) for fast image registration. First, Gaussian smoothing is applied to the image patch around the keypoint, and local image gradients are computed. Second, the image patch is split into smaller sub-regions, similar to GLOH [6], and a vector of gradient statistics is calculated for each sub-region; the binary descriptor is obtained by comparing these vectors. Since a low bit-rate descriptor is fast to match and has a low memory footprint, our goal is to produce low bit-rate descriptors that maintain the highest possible fidelity. We therefore optimize the descriptor with a supervised learning method that finds its most informative dimensions. The advantage of our descriptor is twofold: it encodes gradient information, which makes it much more discriminative than descriptors based on simple pixel-intensity comparisons, and the learning process performs dimension reduction, which makes it more compact.

The rest of this paper is organized as follows. Related work is presented in Sect. 2. The construction of our CBDF descriptor is described in detail in Sect. 3. In Sect. 4, we compare the performance of our CBDF descriptor with state-of-the-art methods. Finally, we conclude our work in Sect. 5.

2 Related Work

Owing to the rapid development of UAV systems, low-altitude remote sensing is becoming increasingly attractive for commercial and military applications. Such techniques can be widely used in earthquake relief, forest fire surveillance and flood disaster response.

In such a scenario, one important step is to mosaic the UAV images in real time, and a key technique for this is image registration. Unlike state-of-the-art image registration algorithms such as SIFT matching [11], where registration accuracy is of fundamental importance, efficiency matters more in the UAV setting. To speed up registration, several approaches have been proposed in the literature, such as PCA-SIFT [5], SURF [4], CHOG [21] and DAISY [23]. For instance, PCA-SIFT reduces the descriptor from 128 to 36 dimensions using principal component analysis. The matching time is reduced, but the time to build the descriptor increases, leading to a small gain in speed and a loss of distinctiveness. The SURF descriptor sums responses of Haar wavelets, which is fast when integral images are used, and thus addresses the issue of speed. However, since the descriptor is a 64-dimensional floating-point vector, its representation still requires 256 bytes, which becomes critical when millions of descriptors must be stored. Chandrasekhar et al. [21] apply a tree-coding method for lossy compression of probability distributions to SIFT-like descriptors to obtain the compressed histogram of gradients (CHOG). Brown et al. [22] use a training method to optimize the filtering and normalization steps that produce a SIFT-like vector. However, the dimensionality of the feature vector is still too high for large-scale applications such as image retrieval or 3D reconstruction.

Much recent research has focused on designing binary descriptors to reduce both matching time and storage cost [12, 13]. For example, Calonder et al. [7] show that it is possible to shortcut the dimensionality reduction step by directly building a short binary descriptor, called BRIEF, in which each bit is independent. The descriptor vector is obtained by comparing the intensities of 512 pairs of pixels, or even 256 pairs, after applying Gaussian smoothing, which reduces the sensitivity to noise. The positions of the pixels are pre-selected randomly according to a Gaussian or uniform distribution around the patch center. However, this descriptor is not invariant to scale and rotation changes. Rublee et al. [8] propose the Oriented FAST and Rotated BRIEF (ORB) descriptor. This binary descriptor is invariant to rotation and robust to noise, but it is not invariant to scale change and relies on a greedy optimization. Leutenegger et al. [9] propose a binary descriptor called BRISK, which is invariant to scale and rotation. Its key design lies in the application of a novel scale-space. To build the descriptor bit-stream, a limited number of points in a polar sampling pattern are used. Each point contributes to many pairs, which are divided into short-distance and long-distance subsets. The long-distance subset is used to estimate the direction of the keypoint, while the short-distance subset is used to build the binary descriptor. Alahi et al. [10] propose a keypoint descriptor termed Fast Retina Keypoint (FREAK), which is inspired by the topology of the human retina. A cascade of binary strings is computed by efficiently comparing image intensities over a retinal sampling pattern. Strecha et al. [20] map SIFT descriptor vectors into Hamming space using an LDA-based method. Although these binary feature descriptors are fast to compute and match, they tend to be less robust than their floating-point equivalents.

3 The Method

Most local image descriptors are extracted at keypoints returned by interest point detectors such as Harris [17], DoG [2], MSER [18], Hessian-Affine [19] or FAST [15]. Our proposed CBDF descriptor can be combined with any of these detectors. Because an efficient, rotation-invariant detector is important for image registration, especially for aerial images, we use FAST as our keypoint detector. Unfortunately, FAST does not provide the orientation of the keypoint, so we propose an enhanced version of FAST, called oriented FAST. After extracting oriented keypoints, we build the CBDF descriptor by dividing the image patch surrounding each keypoint into sub-regions and accumulating gradient statistics for each sub-region, which yields a four-dimensional vector per sub-region. The CBDF descriptor is constructed by comparing and thresholding these vectors. Finally, a dimension reduction scheme is applied to make the descriptor more compact.

The computation of CBDF includes the following three steps (as illustrated in Fig. 1): (1) Oriented FAST feature point extraction; (2) Oriented binary discriminative feature descriptors; (3) Descriptor compression.

Fig. 1. Overview of the computation of the CBDF descriptor

3.1 Oriented FAST Keypoint Extraction

The FAST keypoint detector is extensively used in computer vision applications for its efficiency and rotation invariance. FAST has only one parameter, the intensity threshold between the center pixel and the pixels in a circular ring around it. Typically the 9–16 mask is used, which requires at least 9 consecutive pixels in the 16-pixel circle to be sufficiently brighter or darker than the central pixel. Note that the FAST detector has large responses along edges. We define the FAST score s as the maximum threshold at which an image point is still considered a corner. A point under test must attain the maximum score among the FAST scores of its 8 neighbors. We do not employ a scale pyramid of the image, because the scales of consecutive UAV frames are almost identical; omitting the pyramid also saves considerable time in feature point detection and matching. To measure the corner orientation, we assume that the intensity centroid C of a corner [16] is offset from its center, and the vector from the center to the centroid is used to calculate the orientation. The patch moments are defined as:

$$ m_{pq} = \sum\limits_{x,y} {x^{p} y^{q} I(x,y)} $$
(1)

Then these moments are used to compute the centroid of a patch as defined in Eq. (2).

$$ {\mathbf{C}} = \left( {\frac{{m_{10} }}{{m_{00} }},\frac{{m_{01} }}{{m_{00} }}} \right) $$
(2)

The orientation of the patch is:

$$ \theta = \hbox{atan}2(m_{01} ,m_{10} ) $$
(3)

where atan2 is the quadrant-aware version of arctan. To guarantee rotation invariance, the moments are computed only over pixels that lie within a circular area of radius r.
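Equations (1) to (3) can be illustrated with a short sketch. It assumes NumPy, measures coordinates from the patch center, and uses image coordinates with y pointing down; the function name `patch_orientation` and the example patch are hypothetical and only meant to show the computation.

```python
import numpy as np

def patch_orientation(patch, radius):
    """Orientation from the intensity centroid, Eqs. (1)-(3).

    Moments are taken over pixels inside a circle of the given radius,
    with x and y measured from the patch center (y points down).
    """
    h, w = patch.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    ys, xs = np.mgrid[0:h, 0:w]
    x = xs - cx
    y = ys - cy
    mask = x**2 + y**2 <= radius**2        # circular support for rotation invariance
    intensity = patch.astype(np.float64) * mask
    m10 = np.sum(x * intensity)            # Eq. (1) with p = 1, q = 0
    m01 = np.sum(y * intensity)            # Eq. (1) with p = 0, q = 1
    return np.arctan2(m01, m10)            # Eq. (3)

# A patch that is bright on its right half has its centroid to the right,
# so the estimated orientation is 0 radians.
example = np.zeros((31, 31))
example[:, 16:] = 1.0
theta = patch_orientation(example, radius=15)
```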

3.2 Oriented Binary Discriminative Feature Descriptor

Most recent local feature descriptors, e.g. SIFT, SURF and GLOH, are based on the statistics of the gradients of pixel intensity in a patch. This is because gradients are highly distinctive yet as invariant as possible to remaining variations, such as changes in illumination or 3D viewpoint [3]. We follow this trend to build our binary descriptor, but in a much simplified way. To normalize rotation, the proposed CBDF descriptor applies its sampling pattern rotated by \( \theta \) around the detected keypoint, with a patch size of 32 × 32. The intensities of the rotated patch are calculated by nearest-neighbor interpolation.

After rotation, the gradient at each pixel is computed with a discrete derivative mask. Several derivative masks can be used, such as the 1-D point derivatives, uncentred [−1, 1] and centred [−1, 0, 1], the 2 × 2 diagonal masks \( \left[ {\begin{array}{*{20}c} 0 & 1 \\ { - 1} & 0 \\ \end{array} } \right] \) and \( \left[ {\begin{array}{*{20}c} { - 1} & 0 \\ 0 & 1 \\ \end{array} } \right] \), and the 3 × 3 Sobel and Prewitt masks; we tested these masks in our experiments to choose the best one. The gradients are then smoothed by Gaussian smoothing. The size of the smoothing template is also tested, including \( \sigma = 0 \) (none). From the experimental results, we find that the Sobel mask at \( \sigma = 4 \) with a 5 × 5 Gaussian kernel window works best, and that the uncentred 1-D point derivative mask performs almost as well as Sobel; since it is much cheaper to compute, we choose it for our final descriptor. Experimental results are given in Sect. 4. The image patch is then split regularly into 16 smaller sub-regions in two styles, as shown in Fig. 2, and an accumulated gradient magnitude vector v is calculated for each sub-region:

$$ {\mathbf{v}} = \left( {\sum {d_{x} } ,\sum {d_{y} } ,\sum {\left| {d_{x} } \right|} ,\sum {\left| {d_{y} } \right|} } \right) $$
(4)

After this step, each sub region has a four-parameter vector, from which our bit-vector descriptor x is assembled by a comparison of these parameters between each vector, such that each bit b corresponds to:

$$ \forall i = 1, \ldots ,{\text{S}},j = 1, \ldots ,{\text{S}},k = 1, \ldots ,4,i \ne j\quad {\text{b}} = \left\{ {\begin{array}{*{20}l} {1,} \hfill & {{\mathbf{v}}_{ik} > {\mathbf{v}}_{jk} } \hfill \\ {0,} \hfill & {otherwise} \hfill \\ \end{array} } \right. $$
(5)

where S is the number of sub-regions and k denotes which parameter of the vector is compared.

To generate a low bit-rate binary descriptor, we do not compare all pairs of sub-regions. Doing so would give a descriptor length of \( C_{16}^{2} \times 4 = 480 \), which is too long. Instead, we compare the vectors of the sub-regions linked by arrows in different, sparse styles, as shown in Fig. 2. The radii of the three circles in GV, GVI, GVII and GVIII are set to 2, 10 and 15 pixels. The performance of the different test strategies is given in Sect. 4; experimental results show that even as a 224-bit descriptor, our method outperforms several longer descriptors, e.g. BRIEF, ORB and BRISK.
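The construction in Eqs. (4) and (5) can be sketched as follows. This is a simplified illustration assuming NumPy, a regular 4 × 4 grid of sub-regions, the uncentred [−1, 1] derivative, and no Gaussian smoothing of the gradients. The pair list used here is the exhaustive set of 120 unordered pairs, which reproduces the 480-bit count; the sparse patterns of Fig. 2 are not reproduced, and all function names are hypothetical.

```python
import numpy as np
from itertools import combinations

def subregion_vectors(patch, grid=4):
    """Eq. (4): (sum dx, sum dy, sum |dx|, sum |dy|) for each sub-region."""
    p = patch.astype(np.float64)
    # Uncentred 1-D point derivative [-1, 1]; the last column/row keeps zero gradient.
    dx = np.zeros_like(p)
    dy = np.zeros_like(p)
    dx[:, :-1] = p[:, 1:] - p[:, :-1]
    dy[:-1, :] = p[1:, :] - p[:-1, :]
    step = p.shape[0] // grid
    vecs = []
    for r in range(grid):
        for c in range(grid):
            region = (slice(r * step, (r + 1) * step),
                      slice(c * step, (c + 1) * step))
            gx, gy = dx[region], dy[region]
            vecs.append([gx.sum(), gy.sum(), np.abs(gx).sum(), np.abs(gy).sum()])
    return np.array(vecs)  # shape (grid * grid, 4)

def binary_descriptor(vecs, pairs):
    """Eq. (5): one bit per (i, j, k), set when v_ik > v_jk."""
    bits = [vecs[i, k] > vecs[j, k] for (i, j) in pairs for k in range(4)]
    return np.array(bits, dtype=np.uint8)

rng = np.random.default_rng(0)
patch = rng.integers(0, 256, size=(32, 32))      # stand-in for a rotated 32 x 32 patch
vecs = subregion_vectors(patch)
all_pairs = list(combinations(range(16), 2))     # exhaustive pairs, not the Fig. 2 patterns
desc = binary_descriptor(vecs, all_pairs)        # 120 pairs x 4 parameters = 480 bits
```

Passing a shorter pair list, for example 56 pairs, yields a 224-bit descriptor of the same form.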

Fig. 2. The binary test strategies. Note that we compare the vector parameters of the sub-regions linked by the arrows

3.3 Descriptor Compression

Fewer dimensions mean a lower memory footprint and faster matching. Although our descriptor is already much shorter than several state-of-the-art binary descriptors, we apply a dimension reduction process to it. It is important, however, that this does not adversely affect the performance of the descriptor.

Our keypoint descriptor x is represented as an n-dimensional binary vector in Hamming space \( {\mathbf{H}}^{n} \). We seek an \( m \times n \) matrix \( {\mathbf{P}} \) with entries in {0, 1} that maps our descriptor to an m-dimensional Hamming space \( {\mathbf{H}}^{m} \). Our goal in finding such a matrix is twofold. First, \( {\mathbf{H}}^{m} \) should be a more efficient representation, which implies that m must be smaller than n. Second, the performance should not degrade too much under this mapping. To take better advantage of training data, we present a supervised optimization scheme inspired by [20, 24]. In [24], AdaBoost is used to compute the projection matrix, but there is no guarantee that the solution it finds is optimal. We compute a projection matrix designed to minimize the in-class covariance of the descriptors and maximize the covariance across classes. In essence, we perform Linear Discriminant Analysis (LDA) on the descriptor.

Here, we limit our attention to dimension reduction of the form:

$$ {\mathbf{y}} = {\mathbf{Px}} $$
(6)

y is constructed to minimize the expectation of the Hamming distance on the set of positive pairs, while maximizing it on the set of negative pairs. This can be expressed as minimization of the loss function:

$$ L = {\text{E}}\{ d_{{{\mathbf{H}}^{m} }} ({\mathbf{y}},{\mathbf{y}}^{{\prime }} )\,|\,P\} - {\text{E}}\{ d_{{{\mathbf{H}}^{m} }} ({\mathbf{y}},{\mathbf{y}}^{{\prime }} )\,|\,N\} $$
(7)

Equation (7) is equivalent to the minimization of:

$$ L = {\text{E}}\{ \left\| {{\mathbf{y}} - {\mathbf{y}}^{{\prime }} } \right\|^{2}\,|\,P\} - {\text{E}}\{ \left\| {{\mathbf{y}} - {\mathbf{y}}^{{\prime }} } \right\|^{2}\,|\,N\} $$
(8)

Direct minimization of Eq. (8) is difficult, since the resulting problem in \( m \times n \) variables is non-convex. It can be shown that:

$$ {\text{E}}\{ \left\| {{\mathbf{Px}} - {\mathbf{Px}}^{{\prime }} } \right\|^{2} |P\} = {\text{tr}}({\mathbf{P}}\Sigma _{P} {\mathbf{P}}^{{\mathbf{T}}} ) $$
(9)

where \( \Sigma _{P} = {\text{E}}\{ ({\mathbf{x}} - {\mathbf{x}}^{{\prime }} )({\mathbf{x}} - {\mathbf{x}}^{{\prime }} )^{T} |P\} \) is the covariance matrix of the positive descriptor vector differences. With \( \Sigma _{N} \) defined analogously for the negative pairs, Eq. (8) becomes:

$$ L = {\text{tr}}\{ {\mathbf{P}}\Sigma _{P} {\mathbf{P}}^{T} \} - {\text{tr}}\{ {\mathbf{P}}\Sigma _{N} {\mathbf{P}}^{{\rm T}} \} $$
(10)

Pre-multiplying x by \( \Sigma _{N}^{ - 1/2} \) turns the second term of Eq. (10) into a constant, leaving:

$$ L \propto {\text{tr}}\{ {\mathbf{P}}\Sigma _{P}\Sigma _{N}^{ - 1} {\mathbf{P}}^{\text{T}} \} = {\text{tr}}\{ {\mathbf{P}}\Sigma _{R} {\mathbf{P}}^{\text{T}} \} $$
(11)

where \( \Sigma _{R} =\Sigma _{P}\Sigma _{N}^{ - 1} \) is the ratio of the positive and negative covariance matrices. Since \( \Sigma _{R} \) is a symmetric positive semi-definite matrix, it admits the eigendecomposition \( \Sigma _{R} = {\mathbf{USU}}^{{\mathbf{T}}} \), where S is a non-negative diagonal matrix. An orthogonal \( m \times n \) matrix P minimizing \( {\text{tr}}\{ {\mathbf{P}}\Sigma _{R} {\mathbf{P}}^{\text{T}} \} \) is a projection onto the space spanned by the eigenvectors of \( \Sigma _{R} \) associated with its m smallest eigenvalues, which yields:

$$ {\mathbf{P}} = {\tilde{\mathbf{U}}}_{m} $$
(12)

where \( {\tilde{\mathbf{U}}}_{m} \) is the \( m \times n \) matrix whose rows are the corresponding eigenvectors. Recall that we seek an \( m \times n \) projection matrix P with entries in {0, 1}; the result in Eq. (12) does not satisfy this constraint. We therefore denote by S the indices of the m smallest elements on the principal diagonal of \( \Sigma _{R} \), and approximate P by setting \( {\mathbf{P}}(ind,{\mathbf{S}}_{ind} ) = 1,ind = 1, \ldots ,m \) and all other elements to 0. We set m to 192 and 128, so the original 224-bit U-CBDF descriptor is compressed to 192 bits and 128 bits, denoted CBDF192 and CBDF128.
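The bit-selection approximation of P can be sketched as below. This is a minimal illustration assuming NumPy and synthetic training pairs; the function name `learn_selection`, the use of a pseudo-inverse for \( \Sigma _{N} \), and the toy data (16-bit descriptors in which bits 0 to 3 are deliberately unreliable) are assumptions made for the example, not details from the experiments.

```python
import numpy as np

def learn_selection(pos_a, pos_b, neg_a, neg_b, m):
    """Build a 0/1 selection matrix P from the diagonal of Sigma_R.

    pos_a/pos_b and neg_a/neg_b are (num_pairs, n) arrays of 0/1 descriptors
    for matching and non-matching pairs respectively.
    """
    dp = pos_a.astype(np.float64) - pos_b
    dn = neg_a.astype(np.float64) - neg_b
    sigma_p = dp.T @ dp / len(dp)               # E{(x - x')(x - x')^T | P}
    sigma_n = dn.T @ dn / len(dn)               # E{(x - x')(x - x')^T | N}
    # The pseudo-inverse guards against a singular Sigma_N (an implementation choice).
    sigma_r = sigma_p @ np.linalg.pinv(sigma_n)
    keep = np.argsort(np.diag(sigma_r))[:m]     # indices of the m smallest diagonal entries
    P = np.zeros((m, sigma_r.shape[0]), dtype=np.uint8)
    P[np.arange(m), keep] = 1                   # P(ind, S_ind) = 1, all other entries 0
    return P, keep

rng = np.random.default_rng(0)
n, num_pairs = 16, 5000
pos_a = rng.integers(0, 2, size=(num_pairs, n))
flip_prob = np.where(np.arange(n) < 4, 0.5, 0.02)   # bits 0-3 flip often within matching pairs
flips = rng.random((num_pairs, n)) < flip_prob
pos_b = np.where(flips, 1 - pos_a, pos_a)
neg_a = rng.integers(0, 2, size=(num_pairs, n))
neg_b = rng.integers(0, 2, size=(num_pairs, n))

P, keep = learn_selection(pos_a, pos_b, neg_a, neg_b, m=12)
x = pos_a[0]
y = P @ x            # Eq. (6); identical to picking the kept bits, x[keep]
```

Because each row of P contains a single 1, applying P amounts to selecting bits, so the compressed descriptor stays binary and can be computed without any multiplication.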

4 Experimental Results

In this section, we first describe our evaluation framework and then present a set of initial experiments. These experiments validate our approach and allow us to select appropriate parameters for the descriptor. Next, we compare our method with other descriptors, including BRIEF, ORB, BRISK, SIFT and SURF. Finally, we apply the proposed CBDF descriptor to a real UAV image registration application.

4.1 Performance Evaluation Protocol

We evaluate the performance of our method using two datasets. The first dataset was proposed by Mikolajczyk and Schmid [6, 11]. It contains several sub-datasets, each consisting of a sequence of six images exhibiting an increasing amount of transformation. This dataset is used to select appropriate parameters for our descriptor. We use the precision rate as the quality criterion: we take the \( N_{b} \) best matches and count the number of correct matches \( n_{c} \). The precision rate is calculated as \( r = n_{c} /N_{b} \). We set \( N_{b} = 300 \) in our experiments and tune the threshold of each method to obtain the 300 best matches. Since it is usually hard to get exactly 300 matches, we accept an approximate number that deviates from 300 by no more than 2.

The second dataset contains two sub-datasets: Notre Dame and Liberty [14]. Each contains over 400k scale- and rotation-normalized 64 × 64 patches. These patches are sampled around interest points detected using the Difference of Gaussians, and the correspondences between patches are found using a multi-view stereo algorithm. The resulting datasets exhibit substantial perspective distortion and varying lighting conditions. The ground truth available for each of these datasets describes 100k, 200k and 500k pairs of patches. We train the matrix P with these datasets. The performance of our CBDF descriptor is compared against U-CBDF and the state-of-the-art descriptors. The test set contains 100,000 pairs, of which 50 % are matching pairs and 50 % are non-matching pairs.

4.2 Initial Experiments

Several parameters influence the performance of our descriptor, as mentioned in Sect. 3: the smoothing scale \( \sigma \), the size of the smoothing template, the mask used to compute the gradient, and the test strategy used to generate the descriptor. We use the Wall dataset proposed in [6, 11] to test these parameters. It contains five image pairs, with the first image being the same in all pairs and the second image shot from a monotonically growing baseline, which makes matching increasingly difficult. Figure 4(a) shows the first image of the Wall sequence. All the initial experiments are run on the U-CBDF descriptor. When we test the influence of one parameter, the other parameters are fixed at the values used in the final descriptor. Figure 3(a) shows the results obtained for different values of \( \sigma \). In most cases the performance is best for \( \sigma = 2 \), so we keep \( \sigma = 2 \) in the remaining experiments. Figure 3(b) shows the precision rates for different smoothing templates. The 5 × 5 mask outperforms the other masks, so we keep the 5 × 5 Gaussian smoothing template for our final descriptor. Figure 3(c) shows the influence of different gradient masks. The Sobel mask performs slightly better than the uncentred 1-D point derivative mask, but since the latter is much cheaper to compute, we choose it for our final descriptor. The influence of different test strategies is shown in Fig. 3(d). We also calculate the precision rate of BRIEF256, which has a length of 256 bits. Clearly, the symmetrical and regular GV, GVI, GVII and GVIII strategies enjoy a big advantage over the other four in most cases. GVIII performs the best and has a length of 224 bits. For this reason, it is the one we use in all further experiments presented in this paper. We also find that the GVII and GVIII strategies perform better than BRIEF256 in all cases, even though GVII has a length of only 128 bits.

Fig. 3. Precision rate comparison for different parameters and BRIEF256

Using the above-mentioned parameters for our U-CBDF descriptor, we train the matrix P with both the Notre Dame and Liberty datasets. P is used to compress U-CBDF descriptor.

4.3 Descriptor Comparison

In this section, we use the Notre Dame and Liberty datasets as our training and test datasets. Figure 4(b) shows some image patches from the Liberty dataset. We compare our binary descriptors, both uncompressed and compressed, with the recent BRIEF, ORB and BRISK binary descriptors; results obtained with SIFT and SURF are also presented. All the experiments are performed on a desktop computer with an Intel Core2 2.80 GHz CPU. For SIFT, BRIEF, ORB and BRISK, we use the publicly available OpenCV 2.4.3 library. For SURF, we use the implementation available from its authors. During testing, we compute the distances of all match/non-match descriptor pairs and sweep a threshold on the descriptor distance to generate a ROC curve. We also report the 85 % error rate in Table 1; the 85 % error rate is the percentage of incorrect matches obtained when 85 % of the true matches are found.
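The threshold sweep and the 85 % error rate can be sketched as follows, assuming NumPy. The sketch interprets "percentage of incorrect matches" as the fraction of non-match pairs accepted at the threshold that recovers 85 % of the match pairs; this interpretation, the function names and the toy distances are assumptions of the example rather than the exact evaluation code used here.

```python
import numpy as np

def roc_points(dist, is_match):
    """True/false positive rates for every distinct distance threshold."""
    dist = np.asarray(dist, dtype=np.float64)
    is_match = np.asarray(is_match, dtype=bool)
    thresholds = np.unique(dist)
    tpr = np.array([(dist[is_match] <= t).mean() for t in thresholds])
    fpr = np.array([(dist[~is_match] <= t).mean() for t in thresholds])
    return fpr, tpr

def error_rate_at_recall(dist, is_match, recall=0.85):
    """Fraction of non-match pairs accepted at the smallest threshold
    that accepts at least `recall` of the match pairs."""
    dist = np.asarray(dist, dtype=np.float64)
    is_match = np.asarray(is_match, dtype=bool)
    match_d = np.sort(dist[is_match])
    k = int(np.ceil(recall * len(match_d))) - 1
    threshold = match_d[k]
    return float(np.mean(dist[~is_match] <= threshold))

# Toy data: 100 match pairs with distances 1..100, and 100 non-match pairs,
# half at distance 50 and half at distance 200.
dist = np.concatenate([np.arange(1, 101), np.full(50, 50), np.full(50, 200)])
is_match = np.concatenate([np.ones(100, dtype=bool), np.zeros(100, dtype=bool)])
fpr, tpr = roc_points(dist, is_match)
err85 = error_rate_at_recall(dist, is_match)
```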

Fig. 4. (a) The first image of the Wall sequence of the Mikolajczyk and Schmid dataset; (b) some image patches from the Liberty dataset

Figure 5 provides the ROC curves for U-CBDF, CBDF and the state-of-the-art methods on different training and test datasets. Both Fig. 5(a) and (b) show that although CBDF192 is 32 bits shorter than U-CBDF224, its performance does not degrade much. CBDF192 performs better than its binary competitors at all error rates and remains competitive with SURF, even though it has a much shorter representation. SIFT performs the best of all tested descriptors, though its complexity is prohibitive for real-time applications. BRISK performs the worst at high false positive rates, although it is much longer.

Fig. 5. Comparison of our CBDF descriptor to the state-of-the-art binary and floating-point descriptors

The first row of Table 1 shows that CBDF192 provides up to 28 % improvement over BRISK and up to 11 % improvement over BRIEF and ORB in terms of the 85 % error rate, while CBDF128 provides up to 24 % improvement over BRISK and up to 7 % improvement over BRIEF and ORB, even though CBDF128 requires only 16 bytes instead of 64 bytes for BRISK and 32 bytes for BRIEF. It also shows that CBDF128 remains competitive with the much longer and much more computationally expensive floating-point SURF. The second row of Table 1 shows results similar to the first row.

Table 1. 85 % error rate for our CBDF descriptor and BRIEF, BRISK, SURF, SIFT. We give the results for U-CBDF128 and U-CBDF224, which correspond to GVII and GVIII in Fig. 2, and for CBDF128 and CBDF192, which are compressions of U-CBDF224.

Table 2. Timings for different methods

4.4 Timings

The timings of our method and its competitors are tested extensively on the boat image sequence of the Mikolajczyk and Schmid dataset and are shown in Table 2. We use the first and second images of this sequence (size: 850 × 680). As with ORB, we use the FAST corner detector for BRIEF, because BRIEF has no dedicated keypoint extractor; the detection timings of the two are therefore almost the same, and they differ only in the descriptor. By tuning the threshold of each method, we extract 1000 keypoints from each image. Matching for each method is based on brute-force descriptor distance computation. We run each method 100 times and report the average time cost.
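For binary descriptors, brute-force matching reduces to computing Hamming distances between all descriptor pairs. The following sketch assumes NumPy and descriptors stored as 0/1 arrays of type uint8; the function names and the synthetic 192-bit descriptors are illustrative, and the code favors clarity over the speed of an optimized implementation.

```python
import numpy as np

def hamming_matrix(a, b):
    """Pairwise Hamming distances between the rows of two 0/1 uint8 arrays."""
    pa = np.packbits(a, axis=1)                  # 8 bits per byte
    pb = np.packbits(b, axis=1)
    xor = pa[:, None, :] ^ pb[None, :, :]        # differing bits for every pair
    return np.unpackbits(xor, axis=2).sum(axis=2)

def brute_force_match(a, b):
    """For each row of a, the index of and distance to its nearest row of b."""
    d = hamming_matrix(a, b)
    nn = d.argmin(axis=1)
    return nn, d[np.arange(len(a)), nn]

rng = np.random.default_rng(0)
a = rng.integers(0, 2, size=(5, 192), dtype=np.uint8)   # five synthetic 192-bit descriptors
perm = rng.permutation(5)
b = a[perm].copy()
b[np.arange(5), np.arange(5)] ^= 1                       # flip one bit in each row of b
nn, dist = brute_force_match(a, b)
```

Packing the bits first keeps the memory of the intermediate array proportional to the descriptor length in bytes, which is one reason shorter descriptors match faster.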

The timings show an advantage for CBDF192. Its descriptor computation is typically two times faster than that of SURF and three times faster than that of SIFT. Its matching time per point is lower than that of SIFT, SURF, BRIEF, ORB and BRISK.

4.5 UAV Image Registration

To demonstrate the performance of the proposed CBDF feature descriptor on real UAV image registration, we test it on 100 pairs of real UAV images. Figure 6 shows one such pair. The image size is 533 × 400 pixels, and there is a large rotation between the two images. CBDF192, BRIEF, ORB, BRISK, SURF and SIFT are all tested on these images. The registration results are shown in Fig. 7, and the timings are listed in Table 3. Note that our method is almost 10 times faster than SURF and 4 times faster than SIFT.

Fig. 6. UAV images to be registered

Fig. 7. UAV image registration results

Table 3. Timings for different methods with UAV images

5 Conclusions

In this paper, we have defined a new oriented binary discriminative feature for UAV image registration. With only 192 bits or even 128 bits per descriptor, CBDF outperforms its state-of-the-art binary competitors in terms of accuracy while significantly reducing the memory footprint, and compared with SIFT and SURF, it offers a faster alternative with comparable matching performance. Experiments with real UAV images show that this framework can meet the requirements of near-real-time image registration.