
1 Introduction

Computing local descriptors on interest regions serves as a subroutine of various computer vision applications such as panorama stitching [12], wide baseline matching [18], image retrieval [22], and Structure-from-Motion (SfM) [26, 37, 40, 41]. A powerful descriptor is expected to be invariant to both photometric and geometric changes, such as illumination, blur, rotation, scale and perspective changes. Due to their reliability, efficiency and portability, hand-crafted descriptors such as SIFT [14] have dominated this field for more than a decade. Recently, great efforts have been made on developing learned descriptors based on Convolutional Neural Networks (CNNs), which have achieved impressive results on patch-based benchmarks such as the HPatches dataset [3]. However, on image-based datasets such as the ETH local features benchmark [25], learned descriptors are found to underperform advanced variants of hand-crafted descriptors. These contradictory findings raise concerns about integrating those purportedly better descriptors into real applications, and show significant room for improvement in developing more powerful descriptors that generalize to a wider range of scenarios.

One possible cause of the above contradictions, as demonstrated in [25], is a lack of generalization ability as a consequence of data insufficiency. Although previous research [4, 27, 28] discusses several effective sampling methods that produce a seemingly large amount of training data, the generalization ability is still bounded by limited data sources, e.g., the widely-used Brown dataset [6] with only 3 image sets. Hence, it is not surprising that the resulting descriptors tend to overfit to particular scenarios. To overcome this, research such as [29, 38] applies extra regularization for compact feature learning. Meanwhile, LIFT [33] and [19] seek to enhance data diversity by generating training data from reconstructions of Internet tourism data. However, the existing limitation has not yet been fully mitigated, and intermediate geometric information is overlooked in the learning process despite the robust geometric properties that local patches preserve, e.g., the good approximation of local deformations [20].

Besides, we lack guidelines for integrating learned descriptors into practical pipelines such as SfM. In particular, the ratio criterion, as suggested in [14] and justified in [10], has received almost no individual attention or has even been considered inapplicable for learned descriptors [25], whereas it delivers excellent improvements in matching efficiency and accuracy, and is a necessity for pipelines such as SfM to reject false matches and seed feasible initialization. A general method to apply the ratio criterion to learned descriptors is needed in practice.

In this paper, we tackle the above issues by presenting a novel learning framework that takes advantage of geometry constraints from multi-view reconstructed data. In particular, we address the importance of data sampling for descriptor learning and summarize our contributions as threefold. (i) We propose a novel batch construction method that simulates pair-wise matching and effectively samples useful data for the learning process. (ii) Collaboratively, we propose a loss formulation that reduces overfitting and improves performance with geometry constraints. (iii) We provide guidelines on the ratio criterion, compactness and scalability towards practical portability of learned descriptors.

We evaluate the proposed descriptor, referred to as GeoDesc, on a traditional benchmark [9] and two recent large-scale datasets [3, 25]. Superior performance is shown over the state-of-the-art hand-crafted and learned descriptors. We mitigate previous limitations by showing consistent improvements on both patch-based and image-based datasets, and further demonstrate its success on challenging 3D reconstructions.

2 Related Works

Network Design. Due to weak semantics and efficiency requirements, existing descriptor learning often relies on shallow and thin networks, e.g., the three-layer network in DDesc [27] with 128-dimensional output features. Moreover, although widely used in high-level computer vision tasks, max pooling is found to be unsuitable for descriptor learning, and is thus replaced by L2 pooling in DDesc [27] or even removed in L2-Net [29]. To further incorporate scale information, DeepCompare [35] and L2-Net [29] use a two-stream central-surround structure which delivers consistent improvements at extra computational cost. To improve rotational invariance, an orientation estimator is proposed in [34]. Besides feature learning, previous efforts have also been made on joint metric learning as in [7, 8, 35], whereas comparison in Euclidean space is preferred by recent works [4, 5, 27, 29, 33] in order to guarantee efficiency.

Loss Formulation. Various loss formulations have been explored for effective descriptor learning. Initially, networks with a learned metric use the softmax loss [8, 35] and cast descriptor learning as a binary classification problem (similar/dissimilar). With weakly annotated data, [15] formulates the loss on keypoint bags. More generally, the pair-wise loss [27, 33] and triplet loss [4, 5, 7] are used by networks without a learned metric. Both loss formulations encourage matching patches to be close and non-matching patches to be far away in some measure space. In particular, the triplet loss delivers better results [4, 7] as it suffers less from overfitting [13]. For effective training, the recent L2-Net [29] and HardNet [17] use a structured loss for data sampling, which drastically improves performance. To further boost performance, extra regularizations are introduced for learning compact representations in [29, 38].

Evaluation Protocol. Previous works often evaluate on datasets such as [9, 16, 31]. However, those datasets are either small or lack the diversity to generalize well to various applications of descriptors. As a result, the evaluation results are commonly inconsistent or even contradictory to each other as pointed out in [3], which limits the application of learned descriptors. Two novel benchmarks, HPatches [3] and the ETH local features benchmark [25], have recently been introduced with clearly defined protocols and better generalization properties. However, inconsistency still exists between the two benchmarks: HPatches [3] demonstrates that learned descriptors significantly outperform hand-crafted ones, whereas the ETH local features benchmark [25] finds that advanced variants of traditional descriptors are at least on par with learning-based ones. The inconclusive results indicate that there is still significant room for improvement in learning more powerful feature descriptors.

3 Method

3.1 Network Architecture

We borrow the network architecture of L2-Net [29], where the feature tower is constructed by eschewing pooling layers and using strided convolutional layers for in-network downsampling. Each convolutional layer except the last is followed by a batch normalization (BN) layer whose weighting and bias parameters are fixed to 1 and 0. An L2-normalization layer after the last convolution produces the final 128-dimensional feature vector.
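For concreteness, the following is a minimal tf.keras sketch of such a feature tower. The filter counts and strides follow our reading of the L2-Net design (32-32-64-64-128-128 convolutions with two stride-2 layers, plus a final 8×8 convolution on 32×32 input patches); it is an illustration under these assumptions, not the released model.

```python
import tensorflow as tf

def conv_bn_relu(filters, stride=1):
    """3x3 convolution + BN with affine parameters fixed (gamma=1, beta=0)."""
    return [
        tf.keras.layers.Conv2D(filters, 3, strides=stride,
                               padding='same', use_bias=False),
        tf.keras.layers.BatchNormalization(scale=False, center=False),
        tf.keras.layers.ReLU(),
    ]

def build_feature_tower():
    layers = []
    for filters, stride in [(32, 1), (32, 1), (64, 2),
                            (64, 1), (128, 2), (128, 1)]:
        layers += conv_bn_relu(filters, stride)
    # The final 8x8 convolution (no BN after the last convolution) maps the
    # 8x8x128 feature map to a single vector, followed by L2-normalization
    # to produce the 128-dimensional unit-length descriptor.
    layers += [
        tf.keras.layers.Conv2D(128, 8, padding='valid', use_bias=False),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Lambda(lambda x: tf.math.l2_normalize(x, axis=1)),
    ]
    return tf.keras.Sequential(layers)

# Usage: descriptors = build_feature_tower()(patches)  # patches: (N, 32, 32, 1)
```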

3.2 Training Data Generation

Acquiring high-quality training data is important for learning tasks. In this section, we discuss a practical pipeline that automatically produces well-annotated data suitable for descriptor learning.

2D Correspondence Generation. Similar to LIFT [33], we rely on successful 3D reconstructions to generate ground-truth 2D correspondences in an automatic manner. First, sparse reconstructions are obtained from a standard SfM pipeline [24]. Then, 2D correspondences are generated by projecting 3D point clouds. In general, SfM filters out most mismatches among images.

Although verified by SfM, the generated correspondences are still contaminated by outliers from image noise and wrongly registered cameras. This happens particularly often on Internet tourism datasets such as [23, 30] (illustrated in Fig. 1(a)), and such outliers usually cannot be filtered by simply limiting the reprojection error. To improve data quality, we take one step further than LIFT by performing a visibility check based on 3D Delaunay triangulation [11], which is widely used for outlier filtering in dense stereo. Empirically, \(30\%\) of 3D points are discarded by the filtering, so that only points with high precision are kept for ground-truth generation. Figure 1(b) gives an example to illustrate its effect.

Fig. 1. (a) Outlier matches after SfM verification (by COLMAP [24]) on the Gendarmenmarkt dataset [30]. The reprojection error (next to the image) cannot be used to identify false matches. (b) Reconstructed sparse point cloud (top), where points in red (bottom) are filtered by Delaunay triangulation and only reliable points in green are kept. The number of points decreases from 75k to 53k after the filtering. (Color figure online)

Matching Patch Generation. Next, the interest region of a 2D projection is cropped similarly to LIFT, as formulated by a similarity transformation

$$\begin{aligned} \begin{bmatrix} x^s_i \\ y^s_i \end{bmatrix} = \sigma k \begin{bmatrix} \cos \theta &{} -\sin \theta \\ \sin \theta &{} \cos \theta \end{bmatrix} \begin{bmatrix} x^t_i \\ y^t_i \end{bmatrix} + \begin{bmatrix} x \\ y \end{bmatrix}, \end{aligned}$$

(1)

where \((x^s_i, y^s_i)\) and \((x^t_i, y^t_i)\) are the input and output regular sampling grids, and \((x, y, \sigma , \theta )\) are the keypoint parameters (x-y coordinates, scale and orientation) from the SIFT detector. The constant k is set to 12 as in LIFT, resulting in \(12\sigma \times 12\sigma \) patches.

Because SIFT estimates the scale (\(\sigma \)) and orientation (\(\theta \)) parameters robustly even in extreme cases [39], the resulting patches are mostly free of scale and rotation differences, and thus suitable for training. In the later experiments on image matching and SfM, we rely on the same cropping method to achieve scale and rotation invariance for learned descriptors.
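As an illustration of Eq. 1, the sketch below crops and resamples a keypoint patch with an inverse-mapped affine warp. It assumes a grayscale numpy image and SIFT keypoint parameters; the output size of 32 matches the network input, and the OpenCV-based resampling is our own choice rather than the paper's implementation.

```python
import cv2
import numpy as np

def crop_patch(image, x, y, sigma, theta_deg, k=12.0, out_size=32):
    """Crop the k*sigma x k*sigma region around a SIFT keypoint and
    resample it to out_size x out_size, removing scale and rotation
    as in Eq. 1."""
    scale = k * sigma / out_size  # input pixels per output pixel
    c = np.cos(np.radians(theta_deg))
    s = np.sin(np.radians(theta_deg))
    half = out_size / 2.0
    # Affine map from output grid coordinates to input image coordinates.
    M = np.array([
        [scale * c, -scale * s, x - scale * (c * half - s * half)],
        [scale * s,  scale * c, y - scale * (s * half + c * half)],
    ])
    # WARP_INVERSE_MAP tells OpenCV that M maps destination -> source.
    return cv2.warpAffine(image, M, (out_size, out_size),
                          flags=cv2.INTER_LINEAR | cv2.WARP_INVERSE_MAP)
```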

3.3 Geometric Similarity Estimation

Geometries at a 3D point are robust and provide rich information. Inspired by the MVS (Multi-View Stereo) accuracy measurement in [36], we define two types of geometric similarity: patch similarity and image similarity, which will facilitate later data sampling and loss formulation.

Patch Similarity. We define the patch similarity \(S_{patch}\) to measure the difficulty of matching a patch pair with respect to perspective changes. Formally, given a patch pair, we relate it to its corresponding 3D track P, which is seen by cameras centered at \(C_i\) and \(C_j\). Next, we compute the vertex normal \(P_n\) at P from the surface model. The geometric relationship is illustrated in Fig. 2(a). Finally, we formulate \(S_{patch}\) as

$$\begin{aligned} S_{patch} = s_1 s_2 = g(\angle C_iPC_j, \sigma _1)g(\angle C_iPP_n - \angle C_jPP_n, \sigma _2), \end{aligned}$$
(2)

where \(s_1\) measures the intersection angle between the two viewing rays from the 3D track (\(\angle C_iPC_j\)), while \(s_2\) measures the difference of incidence angles between a viewing ray and the vertex normal of the 3D track (\(\angle C_iPP_n, \angle C_jPP_n\)). The angle metric is defined as \(g(\alpha , \sigma ) = \exp (-\frac{\alpha ^2}{2\sigma ^2})\). In interpretation, \(s_1\) and \(s_2\) measure the perspective change with respect to a 3D point and the local 3D surface, respectively. The effect of \(S_{patch}\) is illustrated in Fig. 2(b).

The accuracy of \(s_1\) and \(s_2\) depends on the sparse and mesh reconstructions, respectively, and is generally sufficient for our use as shown in [36]. The similarity does not consider scale and rotation changes as these are already resolved by Eq. 1. Empirically, we choose \(\sigma _1 = 15\) and \(\sigma _2 = 20\) (in degrees).

Image Similarity. Based on the patch similarity, we define the image similarity \(S_{image}\) as the average patch similarity over the correspondences between an image pair. The image similarity measures the difficulty of matching an image pair and can be interpreted as a measurement of perspective change. Examples are given in Fig. 2(c). The image similarity will be beneficial for data sampling in Sect. 3.4.
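Both similarities follow directly from the reconstruction geometry. Below is a minimal numpy sketch under our reading of Eq. 2, with camera centers C_i, C_j, the 3D track P and its vertex normal P_n given as 3-vectors and all angles in degrees.

```python
import numpy as np

def angle_deg(v1, v2):
    """Angle in degrees between two 3-d vectors."""
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def g(alpha, sigma):
    """Angle metric g(alpha, sigma) = exp(-alpha^2 / (2 sigma^2))."""
    return np.exp(-alpha ** 2 / (2.0 * sigma ** 2))

def patch_similarity(C_i, C_j, P, P_n, sigma1=15.0, sigma2=20.0):
    """S_patch of Eq. 2: s1 from the intersection angle of the viewing
    rays, s2 from the difference of their incidence angles to the normal."""
    ray_i, ray_j = C_i - P, C_j - P
    s1 = g(angle_deg(ray_i, ray_j), sigma1)
    s2 = g(angle_deg(ray_i, P_n) - angle_deg(ray_j, P_n), sigma2)
    return s1 * s2

def image_similarity(patch_sims):
    """S_image: average patch similarity over an image pair's correspondences."""
    return float(np.mean(patch_sims))
```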

Fig. 2. (a) The patch similarity relies on the geometric relationship between cameras, tracks and the surface normal. (b) The effect of patch similarity, which measures the difficulty of matching a patch pair with respect to the perspective change. (c) The effect of image similarity, which measures the perspective change between an image pair. (d) Batch data constructed by L2-Net [29] and HardNet [17] (top), whose in-batch patch pairs are often distinctive from each other and thus contribute nothing to the loss late in learning (e.g., with a margin-based loss). In contrast, the batch data from the proposed batch construction method (bottom) consists of similar patch pairs due to spatially close keypoints or repetitive patterns, which are harder to distinguish and thus pose greater challenges for learning

3.4 Batch Construction

For descriptor learning, most existing frameworks take patch pairs (matching/non-matching) or patch triplets (reference, matching and non-matching) as input. As shown in previous studies, the convergence rate is highly dependent on being able to see useful data [21]. Here, “useful” data refers to patch pairs/triplets that produce a meaningful loss for learning. However, effectively sampling such data is generally challenging due to the intractably large number of patch pair/triplet combinations in the database. Hence, on one hand, sampling strategies such as hard negative mining [27] and anchor swap [4] have been proposed; on the other hand, effective batch construction is used in [7, 17, 29] to compare the reference patch against all in-batch samples in the loss computation.

Inspired by previous works, we propose a novel batch construction method that effectively samples “useful” data by relying on geometry constraints from SfM, including the image matching results and the image similarity \(S_{image}\), to simulate pair-wise image matching. Formally, given one image pair, we extract a match set \(X=\{(x_1, x^+_1), (x_2, x^+_2), ..., (x_{N_1}, x_{N_1}^+)\}\), where \(N_1\) is the set size and \((x_i, x^+_i)\) is a matching patch pair surviving the SfM verification. A training batch is then constructed from \(N_2\) match sets, and the learning objective becomes to improve the matching quality within each match set. In Sect. 3.5, we discuss the loss computation on each match set and on the batch data.

Compared with L2-Net [29] and HardNet [17], whose training batches are randomly sampled from the whole database, the proposed method produces harder samples and thus poses greater challenges for learning. As shown in the example in Fig. 2(d), a training batch constructed by the proposed method consists of many similar patterns, due to spatially close keypoints or repetitive textures. In general, such training batches have two major advantages for descriptor learning:

  • It reflects in-practice complexity. In real applications, image patches are often extracted between image pairs for matching. The proposed method simulates this scenario so that training and testing become more consistent.

  • It generates hard samples for training. As observed in [4, 17, 21, 27], hard samples are critical to fast convergence and performance improvement for descriptor learning. The proposed method effectively generates batch data that is sufficiently hard, while avoiding overfitting as it is constructed from real matching results instead of model inference results [27].

To further boost training efficiency, we exclude image pairs that are too similar to contribute to the learning. Such pairs are effectively identified by the image similarity defined in Sect. 3.3. In practice, we discard image pairs whose \(S_{image}\) is larger than 0.85 (e.g., the top pair in Fig. 2(c)), which shrinks the training samples by \(30\%\).
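Putting Sects. 3.3 and 3.4 together, one training batch can be assembled as in the sketch below. The data structures (match_sets and image_sims, keyed by image pair) are illustrative assumptions; the sizes \(N_1 = 64\), \(N_2 = 12\) and the 0.85 threshold follow the text.

```python
import random

def build_batch(match_sets, image_sims, n1=64, n2=12, sim_thresh=0.85):
    """Construct one training batch of n2 match sets, each holding n1
    SfM-verified matching patch pairs from a single image pair."""
    # Exclude image pairs that are too similar to contribute to learning.
    candidates = [pair for pair in match_sets
                  if image_sims[pair] <= sim_thresh]
    batch = []
    for pair in random.sample(candidates, n2):
        pairs = match_sets[pair]
        # Each match set simulates the pair-wise matching of one image pair.
        batch.append(random.sample(pairs, min(n1, len(pairs))))
    return batch
```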

3.5 Loss Formulation

We formulate the loss with two terms: structured loss and geometric loss.

Structured Loss. The structured loss used in L2-Net [29] and HardNet [17] is well suited to consuming the batch samples constructed in Sect. 3.4. In particular, the formulation in HardNet, based on the “hardest-in-batch” strategy and a distance margin, is shown to be more effective than the log-likelihood formulation in L2-Net. However, we observe severe overfitting when applying the HardNet loss to our batch data, which we ascribe to the overly strong constraint of the “hardest-in-batch” strategy. In this strategy, the loss is computed on the data sample that produces the largest loss, and a large margin (1.0 in HardNet) is set to push non-matching pairs away from matching pairs. Since our batch data already consists of effectively sampled “hard” data that is often visually similar, forcing a large margin is inapplicable and stalls the learning. One simple solution is to decrease the margin value, but then the performance drops significantly in our experiments.

To avoid the above limitation and better take advantage of our batch data, we propose the following loss formulation. First, we compute the structured loss for one match set. Given normalized features \(\mathbf {F}_1, \mathbf {F}_2 \in \mathbb {R}^{N_1\times 128}\) computed on match set X for all \((x_i, x_i^+)\), the cosine similarity matrix is derived as \(\mathbf {S} = \mathbf {F}_1 \mathbf {F}_2^T\). Next, we compute \(\mathbf {L} = \mathbf {S} - \alpha \,\mathbf {diag}(\mathbf {S})\) and formulate the loss as

$$\begin{aligned} E_1 = \frac{1}{N_1(N_1-1)}\sum _{i, j}(\max (0, l_{i, j} - l_{i, i}) + \max (0, l_{i, j} - l_{j, j})), \end{aligned}$$
(3)

where \(l_{i, j}\) is an element of \(\mathbf {L}\), and \(\alpha \in (0,1)\) is the distance ratio that mimics the behavior of the ratio test [14] and pushes non-matching pairs away from matching pairs. Finally, we average the loss over all match sets to derive the final loss for one training batch.

The proposed formulation differs from HardNet in three aspects. First, we compute the cosine similarity instead of the Euclidean distance for computational efficiency. Second, we apply a distance ratio margin instead of a fixed distance margin, serving as an adaptive margin that reduces overfitting. Finally, we compute the mean of the loss elements instead of the maximum (“hardest-in-batch”) in order to cooperate with the proposed batch construction.
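In numpy, the structured loss of Eq. 3 for one match set can be sketched as follows, with F1 and F2 being the L2-normalized \(N_1 \times 128\) feature matrices of the reference and matching patches.

```python
import numpy as np

def structured_loss(F1, F2, alpha=0.4):
    """E_1 of Eq. 3 for one match set."""
    S = F1 @ F2.T                          # cosine similarity matrix
    L = S - alpha * np.diag(np.diag(S))    # scale the diagonal by (1 - alpha)
    diag = np.diag(L)
    # Penalize off-diagonal (non-matching) entries that are not at least
    # the ratio margin below the corresponding diagonal (matching) entries.
    loss = (np.maximum(0.0, L - diag[:, None]) +
            np.maximum(0.0, L - diag[None, :]))
    np.fill_diagonal(loss, 0.0)            # i == j terms vanish by definition
    n = len(S)
    return loss.sum() / (n * (n - 1))
```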

Geometric Loss. Although \(E_1\) ensures that matching patch pairs are distant from non-matching ones, it does not explicitly encourage matching pairs to be close in the measure space. One simple solution is to apply a typical pair-wise loss as in [27], at the risk of positive collapse and overfitting as observed in [13]. To overcome this, we adaptively set the margin according to the patch similarity defined in Sect. 3.3, serving as a soft constraint for maximizing the positive similarity. We refer to this term as the geometric loss and formulate it as

$$\begin{aligned} E_2 = \sum _{i}{\max (0, \beta - s_{i, i})}, \beta = {\left\{ \begin{array}{ll} 0.7 &{} s_{patch} \ge 0.5 \\ 0.5 &{} 0.2 \le s_{patch} < 0.5 \\ 0.2 &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(4)

where \(\beta \) is the adaptive margin, \(s_{i, i}\) is an element of \(\mathbf {S}\), namely the cosine similarity of patch pair \((x_i, x_i^+)\), and \(s_{patch}\) is the patch similarity of \((x_i, x_i^+)\). We use \(E_1 + \lambda E_2\) as the final loss, and empirically set \(\alpha \) and \(\lambda \) to 0.4 and 0.2.
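The geometric loss of Eq. 4 and the final combination can be sketched in the same vein; s_patch holds the per-pair patch similarities of Sect. 3.3, and structured_loss is the sketch given above.

```python
import numpy as np

def geometric_loss(S, s_patch):
    """E_2 of Eq. 4: an adaptive margin on the positive cosine similarity."""
    beta = np.where(s_patch >= 0.5, 0.7,
                    np.where(s_patch >= 0.2, 0.5, 0.2))
    positives = np.diag(S)                 # s_{i,i} of the matching pairs
    return np.maximum(0.0, beta - positives).sum()

def final_loss(F1, F2, s_patch, alpha=0.4, lam=0.2):
    """E_1 + lambda * E_2 for one match set (Sect. 3.5)."""
    S = F1 @ F2.T
    return structured_loss(F1, F2, alpha) + lam * geometric_loss(S, s_patch)
```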

3.6 Training

We use the image sets of [30] as in LIFT [33], the SfM data of [23], and several further collected image sets to form the training database. Based on COLMAP [24], we run 3D reconstructions to establish the necessary geometry constraints. Image sets that overlap with the benchmark data are manually excluded. We train the networks from scratch using Adam with a base learning rate of 0.001 and a weight decay of 0.0001. The learning rate decays by 0.9 every 10,000 steps. Data augmentation includes random flipping, 90-degree rotation, and brightness and contrast adjustment. The match set size \(N_1\) and batch size \(N_2\) are 64 and 12, respectively. Input patches are standardized to have zero mean and unit norm.

4 Experiments

We evaluate the proposed descriptor on three datasets: the patch-based HPatches benchmark [3], the image-based Heinly benchmark [9] and the ETH local features benchmark [25]. We further demonstrate its use on challenging SfM examples.

4.1 HPatches Benchmark

The HPatches benchmark [3] defines three complementary tasks: patch verification, patch matching, and patch retrieval. Different levels of geometric perturbations are imposed to form the EASY, HARD and TOUGH patch groups. In the verification task, two subtasks are defined based on whether negative pairs are sampled from images within the same sequence (SAMESEQ) or different sequences (DIFFSEQ). In the matching task, two subtasks are defined based on whether the principal variation is viewpoint (VIEW) or illumination (ILLUM). Following [3], we use mean average precision (mAP) to measure the performance on all three tasks on the HPatches ‘full’ split.

We select five descriptors to compare against: SIFT as the baseline, and RSIFT [2] and DDesc [27] as the best-performing hand-crafted and learned descriptors concluded in [3]. Moreover, we experiment with the recent learned descriptors L2-Net [29] and HardNet [17]. The proposed descriptor is referred to as GeoDesc.

Fig. 3. Left to right: verification, matching and retrieval results on the HPatches dataset, split ‘full’. Results on different patch groups are colorized, while DIFFSEQ/SAMESEQ and ILLUM/VIEW denote the subtasks of verification and matching, respectively

As shown in Fig. 3, GeoDesc surpasses all the other descriptors on all three tasks by a large margin. In particular, the performance on the TOUGH patch group is significantly improved, which indicates GeoDesc's superior invariance to large image changes. Interestingly, comparing GeoDesc with HardNet, we observe some performance drop on the EASY groups, especially for illumination changes, which can be ascribed to the bias of SfM training data. Although we apply photometric randomness such as brightness and contrast changes during training, we cannot fully mitigate this limitation, which calls for more diverse real data in descriptor learning.

In addition, we evaluate different configurations of GeoDesc on HPatches as shown in Table 1 to demonstrate the effect of each part of our method.

  • Config. 1: the HardNet framework as the baseline.

  • Config. 2: trained with the SfM data in Sect. 3.2. Compared with Config. 1, it shows that crowd-sourced training data essentially improves the generalization ability. Meanwhile, Config. 2 can be regarded as an extension of LIFT [33] with a more advanced loss formulation.

  • Config. 3: equipped with the proposed batch construction in Sect. 3.4. As discussed in Sect. 3.5, the “hardest-in-batch” strategy in HardNet is inapplicable to hard batch data and thus leads to a performance drop compared with Config. 2. In practice, we need to decrease the margin value from 1.0 in HardNet to 0.6, otherwise the training does not even converge. Though trainable, the smaller margin value harms the final performance.

  • Config. 4: equipped with the modified structured loss in Sect. 3.5. Notable performance improvements are achieved over Config. 2 due to the collaborative use of the proposed methods, showing the effectiveness of simulating pair-wise matching and sampling hard data. Besides, replacing the distance margin with the distance ratio improves training efficiency, as shown in Fig. 4.

  • Config. 5: equipped with the geometric loss in Sect. 3.5. Further improvements are obtained over Config. 4 as \(E_2\) constrains the solution space and enhances the training efficiency.

To sum up, the “hardest-in-batch” strategy is beneficial when no other sampling is applied and most in-batch samples do not contribute to the loss. However, once harder batch data is effectively constructed, it is advantageous to replace “hardest-in-batch” to further boost the descriptor performance.

Table 1. Evaluation of different configurations of GeoDesc on HPatches. Modules are enabled if marked with “Y” and disabled if marked with “-”. SfM Data denotes training with our SfM data, Batch Construct. denotes the proposed batch construction, while \(E_1\) and \(E_2\) denote the proposed structured loss and geometric loss, respectively. The last configuration (Config. 5) is our best model, GeoDesc
Fig. 4. Effect of taking the distance ratio in loss computation. The metric is the validation accuracy of patch triplets with a margin of 0.5 by cosine similarity.

4.2 Heinly Benchmark

Different from HPatches, which experiments on image patches, the benchmark by Heinly et al. [9] evaluates pair-wise image matching with respect to different types of photometric or geometric changes, aiming to provide practical insights into the strengths and weaknesses of descriptors. We use two standard metrics as in [9] to quantify the matching quality: the Matching Score = #Inlier Matches/#Features, and the Recall = #Inlier Matches/#True Matches. Four descriptors are selected for comparison: SIFT, the baseline hand-crafted descriptor; DSP-SIFT, the best hand-crafted descriptor, superior even to previous learning-based ones as evaluated in [25]; and L2-Net and HardNet, the recent advanced learned descriptors. For a fair comparison, no ratio test is applied and only the cross check (mutual test) is used for all descriptors.

Table 2. Evaluation results on pair-wise image matching on benchmark by Heinly et al. [9] with respect to different types of image changes

Evaluation results are shown in Table 2. Compared with DSP-SIFT, GeoDesc performs comparably with regard to image quality changes (compression, blur), while being notably better for illumination and geometric changes (rotation, scale, viewpoint). On the other hand, GeoDesc delivers significant improvements over L2-Net and HardNet and particularly narrows the gap in terms of photometric changes, which makes GeoDesc applicable to different scenarios in real applications.

4.3 ETH Local Features Benchmark

The ETH local features benchmark [25] focuses on image-based 3D reconstruction tasks. We compare GeoDesc with SIFT, DSP-SIFT and L2-Net, and follow the same protocols as in [25] to quantify the SfM quality, including the number of registered images (# Registered), reconstructed sparse points (# Sparse Points), image observations (# Observations), mean track length (Track Length) and mean reprojection error (Reproj. Error). For a fair comparison, we apply no distance ratio test for descriptor matching and extract features at the same keypoints as in [25].

As observed in Table 3, first, GeoDesc performs best on # Registered, which is generally considered the most important SfM metric as it directly quantifies the reconstruction completeness. Second, GeoDesc achieves the best results on # Sparse Points and # Observations, which indicates superior matching quality in the early steps of SfM. However, GeoDesc does not obtain the best statistics on Track Length and Reproj. Error, as it computes these two metrics on a significantly larger # Sparse Points. On datasets that are small in scale and have a similar number of tracks (Fountain, Herzjesu), GeoDesc gives the longest Track Length.

To sum up, GeoDesc surpasses both the previously best-performing DSP-SIFT and the recent advanced L2-Net by a notable margin. In addition, it is noted that L2-Net also shows consistent improvements over DSP-SIFT, which demonstrates the power of the structured loss for learned descriptors.

Table 3. Evaluation results on ETH local features benchmark [25] for SfM tasks

4.4 Challenging 3D Reconstructions

To further demonstrate the effect of the proposed descriptor in the context of 3D reconstruction, we showcase selected image sets whose reconstructions fail or are of low quality with a typical SIFT-based 3D reconstruction pipeline, but are significantly improved by integrating GeoDesc.

Fig. 5. Test cases of challenging image sets, where a traditional SIFT-based reconstruction pipeline fails but GeoDesc delivers significant improvement.

From the examples shown in Fig. 5, the benefit of deploying GeoDesc in a reconstruction pipeline is clear. First, with robust matching resistant to photometric and geometric changes, a complete sparse reconstruction registered with more cameras can be obtained. Second, thanks to more accurate camera pose estimation, a refined final mesh reconstruction is then derived.

5 Practical Guidelines

In this section, we discuss several practical guidelines to complement the performance evaluation and provide insights towards real applications. The following experiments are conducted with 231 extra high-resolution image pairs, whose keypoints are downsampled to \(\sim \)10k per image. We use a single NVIDIA GTX 1080 GPU with TensorFlow [1], and forward each batch with 256 patches.

5.1 Ratio Criterion

The ratio criterion [14] compares the distances to the first and the second nearest neighbor, and establishes a match only if the former is smaller than the latter by some ratio. For SfM tasks, the ratio criterion improves overall matching quality and RANSAC efficiency, and seeds robust initialization. Despite those benefits, the ratio criterion has received little attention, or has even been considered inapplicable to learned descriptors in previous studies [25]. Here, we propose a general method to determine a ratio that cooperates well with existing SfM pipelines.

The general idea is simple: the new ratio should function similarly to SIFT's, as most SfM pipelines are parameterized for SIFT. To quantify the effect of the ratio criterion, we use the metric Precision = #Inlier Matches/#Putative Matches, and determine the ratio that achieves a similar Precision to SIFT's. As the example in Fig. 6 shows, we compute the Precision of SIFT and GeoDesc on our experimental dataset, and find that the best ratio for GeoDesc is 0.89, at which it gives a similar Precision (0.70) to SIFT's (0.69). This ratio is applied to the experiments in Sect. 4.4 and shows robust results and compatibility with the practical SfM pipeline.
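A brute-force sketch of the ratio criterion on L2-normalized descriptors is given below; the 0.89 default is the value determined above for GeoDesc, and a precision-based sweep over candidate ratios would reproduce the tuning procedure.

```python
import numpy as np

def ratio_test_matches(desc1, desc2, ratio=0.89):
    """Lowe's ratio criterion [14]: keep a match only if the nearest
    neighbor is closer than `ratio` times the second nearest."""
    # For unit-length descriptors, Euclidean distance is monotone in
    # cosine similarity, so distances follow from one matrix product.
    dists = np.sqrt(np.maximum(0.0, 2.0 - 2.0 * desc1 @ desc2.T))
    matches = []
    for i, row in enumerate(dists):
        first, second = np.argsort(row)[:2]
        if row[first] < ratio * row[second]:
            matches.append((i, int(first)))
    return matches
```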

5.2 Compactness Study

A compact feature representation generally indicates better performance with respect to discriminativeness and scalability. To quantify the compactness, we rely on the intermediate result of Principal Component Analysis (PCA). First, we compute the explained variance \(v_i\) of each feature dimension indexed by i, sorted in decreasing order. Then we estimate the compact dimensionality (denoted as Compact-Dim) by finding the minimal k that satisfies \(\sum ^k_i{v_i}/\sum ^D_i{v_i} \ge t\), where t is a given threshold and D is the original feature dimensionality. In this experiment, we set \(t = 0.9\), so that the Compact-Dim can be interpreted as the minimal dimensionality required to convey more than \(90\%\) of the information of the original feature. A larger Compact-Dim indicates less redundancy, namely greater compactness.
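As a sketch, Compact-Dim can be computed from the eigenvalues of the feature covariance matrix, which are the PCA explained variances; this is our own minimal implementation of the above definition.

```python
import numpy as np

def compact_dim(features, t=0.9):
    """Minimal k whose k largest explained variances carry a fraction
    >= t of the total variance; `features` is an N x D matrix."""
    centered = features - features.mean(axis=0)
    cov = centered.T @ centered / (len(features) - 1)
    var = np.sort(np.linalg.eigvalsh(cov))[::-1]   # decreasing order
    cumulative = np.cumsum(var) / var.sum()
    return int(np.searchsorted(cumulative, t) + 1)
```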

As a result, the Compact-Dim estimated on 4 million feature vectors for SIFT, DSP-SIFT, L2-Net and GeoDesc is 56, 63, 75 and 100, respectively. The ranking of Compact-Dim is consistent with the previous performance evaluations, where descriptors with larger Compact-Dim yield better results.

5.3 Scalability Study

Computational Cost. As evaluated in [3, 25], the efficiency of learned descriptors is on par with traditional CPU-based descriptors such as SIFT. Here, we further compare with GPU-based SIFT [32] to provide insights about practicality. We evaluate the running time in three steps: first, keypoint detection and canonical orientation estimation by SIFT-GPU; next, patch cropping by Eq. 1; finally, feature inference on the image patches. The computational cost and memory demand are shown in Table 4, indicating that with GPU support, not surprisingly, SIFT (0.20s) is still faster than the learned descriptor (0.31s), though with a narrow gap due to the parallel implementation. For applications heavily relying on matching quality (e.g., 3D reconstruction), the proposed descriptor achieves a good trade-off when replacing SIFT.

Quantization. To conserve disk space, I/O and memory, we linearly map the feature vectors of GeoDesc from \([-1, 1]\) to [0, 255] and round each element to an unsigned-char value. The quantization does not affect the performance, as evaluated on the HPatches benchmark.
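The mapping is a one-liner; the sketch below uses np.rint for the rounding, an implementation detail we assume.

```python
import numpy as np

def quantize(desc):
    """Linearly map a descriptor from [-1, 1] to [0, 255] as unsigned char."""
    return np.clip(np.rint((desc + 1.0) * 127.5), 0, 255).astype(np.uint8)
```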

Fig. 6. Determining the ratio criterion of GeoDesc so that it has the same Precision as SIFT (at a ratio of 0.89)

Table 4. Computational cost and memory demand of feature extraction with GeoDesc in three steps: SIFT-GPU extraction, patch cropping and feature inference. The total time cost is evaluated with the three steps implemented in a parallel fashion

6 Conclusions

In contrast to prior work, we have addressed the advantages of integrating geometry constraints into descriptor learning, which benefits the learning process in terms of ground-truth data generation, data sampling and loss computation. We have also discussed several practical guidelines, in particular the ratio criterion, towards practical portability. Finally, we have demonstrated the superior performance and generalization ability of the proposed descriptor, GeoDesc, on three benchmark datasets in different scenarios. We have further shown the significant improvement brought by GeoDesc on challenging reconstructions, and the good trade-off between efficiency and accuracy when deploying GeoDesc in real applications.