1 Introduction

Registration of point clouds integrates 3D data from different sources into a common coordinate system, serving as an essential component of many high-level applications such as 3D modeling [1, 2], SLAM [3] and robotic perception [4]. Critical to a registration task is the determination of correspondences between spatially localized 3D points within each cloud. To tackle the correspondence problem, on the one hand, a variety of 3D local descriptors [5,6,7,8,9,10,11,12] have been developed to facilitate the description of 3D keypoints. On the other hand, matching strategies [13,14,15] have also been progressing towards higher accuracy and robustness.

The exploration of 3D geometric descriptors has long been a focus of interest in point cloud registration. It spans hand-crafted geometric descriptors [5,6,7,8,9] as well as learned ones [10,11,12, 16], both of which mainly rely on 3D local geometries. Meanwhile, with the significant progress of CNN-based 2D patch descriptors [17,18,19,20,21], more importance is being attached to leveraging 2D projections for the description of underlying 3D structures [22,23,24,25]. Particularly for point cloud data that is generally co-registered with camera images [26,27,28,29], the fusion of multiple image views, which has reported success on various tasks [30,31,32,33], is expected to further improve the discriminative power of 3D local descriptors. With this motivation, we propose a multi-view descriptor, named MVDesc, for the description of 3D keypoints based on the synergy of multi-view fusion techniques and patch descriptor learning. Rather than a replacement, MVDesc is complementary to existing geometric descriptors [5,6,7,8,9,10,11,12].

Given the local descriptors, the matching problem is another vital issue in point cloud registration. A set of outlier-free point matches is desired by most registration algorithms [15, 34,35,36,37,38]. Currently, matching strategies, e.g., nearest neighbor search, mutual best [15] and the ratio test [13], basically estimate correspondences according to the similarities of local descriptors alone. Without considering global geometric consistency, these methods are prone to spurious matches between locally-similar 3D structures. Efforts have also been made to jointly solve outlier suppression via a line process and the optimization of global registration [15], but the spatial organizations of 3D point matches are still overlooked when identifying outliers. To address this, we develop a robust matching approach by explicitly considering the spatial consistency of point matches in 3D space. We seek to filter outlier matches based on a graphical model describing their spatial properties and provide an efficient solution via belief propagation.

The main contributions of this work are twofold. (1) We are the first to leverage the fusion of multiple image views for the description of 3D keypoints when tackling point cloud registration. (2) The proposed effective and efficient outlier filter, which is based on a graphical model and solved by belief propagation, remarkably enhances the robustness of 3D point matching.

2 Related Works

3D Local Descriptor. The representation of a 3D local structure has traditionally relied on geometric descriptors such as Spin Images [5], PFH [6], FPFH [7], SHOT [8] and USC [9], among others, which are mainly produced from histograms over local geometric attributes. Recent studies seek to learn descriptors from different representations of local geometries, like volumetric representations of 3D patches [10], point sets [12] and depth maps [16]. The CGF [11] still leverages traditional spherical histograms to capture the local geometry but learns to map the high-dimensional histograms to a low-dimensional space for compactness.

Rather than only using geometric properties, some existing works extract descriptors from RGB images that are commonly co-registered with point clouds, as in scanning datasets [26,27,28] and 3D reconstruction datasets [29, 39]. Registration frameworks like [22,23,24,25] use SIFT descriptors [13] as the representations of 3D keypoints based on their projections in single-view RGB images. Besides, other state-of-the-art 2D descriptors like DeepDesc [17], L2-Net [18] and others [19,20,21] can easily migrate here for the description of 3D local structures.

Multi-view Fusion. The multi-view fusion technique is used to integrate information from multiple views into a single representation. The literature has widely shown that this technique effectively boosts the performance of instance-level detection [30], recognition [31, 32] and classification [33] compared with a single view. Su et al. [30] first propose a probabilistic representation of a 3D-object class model for the scenario where an object is positioned at the center of a dense viewing sphere. A more general strategy of multi-view fusion is view pooling [31,32,33, 40], which aggregates the feature maps of multiple views via an element-wise maximum operation.

Matching. The goal of matching for registration is to find correspondences across 3D point sets given keypoint descriptors. Almost all registration algorithms [15, 34,35,36,37,38] demand accurate point correspondences as input. Nearest-neighbor search, mutual best filtering [15] and the ratio test [13] are effective ways of searching for potential matches based on local similarities for general matching tasks. However, as mentioned above, these strategies are prone to mismatches without considering geometric consistency. To absorb geometric information, [41] and [42] discover matches in geometric agreement using a game-theoretic scheme. Ma et al. [43] propose to reject outliers by enforcing consistency in local neighborhoods. Zhou et al. [15] use a RANSAC-style tuple test to eliminate matches with inconsistent scales. Besides, the line process model [44] has been applied in the registration domain to account for the presence of outliers implicitly [15].

3 Multi-view Local Descriptor (MVDesc)

In this section, we propose to learn multi-view descriptors (MVDesc) for 3D keypoints, which combines multi-view fusion techniques [30,31,32,33] with patch descriptor learning [17,18,19,20,21]. Specifically, we first propose a new view-fusion architecture to integrate feature maps across views into a single representation. Second, we build the MVDesc network for learning by placing the fusion architecture above multiple feature networks [19], each of which extracts feature maps from the local patch of one view.

Fig. 1. An overview of the proposed Fuseption-ResNet (FRN). (a) Architecture of FRN that fuses feature maps of multiple views. Backed by the view pooling branch as a shortcut connection, (b) the Fuseption branch takes charge of learning the residual mapping. (c) The parameters of the convolutional layers are listed

3.1 Multi-view Fusion

Currently, view pooling is the dominant fusion technique used to merge feature maps from different views [31,32,33, 40]. However, as reported in the literature [32, 45, 46], the pooling operation is somewhat risky for feature aggregation because it smooths out subtle local patterns. Inspired by ResNet [47], we propose an architecture termed Fuseption-ResNet, which uses view pooling as a shortcut connection and adds a sibling branch, termed Fuseption, in charge of learning the underlying residual mapping.

Fuseption. As shown in Fig. 1(b), Fuseption is an Inception-style [48, 49] architecture capped above the multi-view feature maps. First, following the structure of inception modules [48, 49], three lightweight cross-spatial filters with different kernel sizes, \(3\,\times \,3\), \(1\,\times \,3\) and \(3\,\times \,1\), are adopted to extract different types of features. Second, the \(1\,\times \,1\) convolution Conv6, applied to the concatenated feature maps, is responsible for merging correlation statistics across channels and for dimension reduction, as suggested by [48, 50].

Fuseption-ResNet (FRN). Inspired by the effectiveness of skip connections in ResNet [47], we take view pooling as a shortcut in addition to Fuseption, as shown in Fig. 1(a). As opposed to the view pooling branch, which extracts the strongest responses across views [31], the Fuseption branch is responsible for learning the underlying residual mapping. The two branches reinforce each other in terms of accuracy and convergence rate. On the one hand, the residual branch, Fuseption, guarantees accuracy no worse than using view pooling alone, because if view pooling is optimal, the residual can easily be pulled to zero during training. On the other hand, the shortcut branch, view pooling, greatly accelerates the convergence of learning MVDesc, as illustrated in Fig. 2(a). Intuitively, since the view pooling branch has already extracted the essential strongest responses across views, it is easier for the Fuseption branch to learn just the residual mapping.
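To make the fusion concrete, the following PyTorch sketch mirrors the FRN block under stated assumptions: the view feature maps are concatenated channel-wise before the Fuseption filters, each filter is followed by a ReLU, and the layer widths match the per-view channel count. These choices are ours for illustration; the exact layer parameters are those listed in Fig. 1(c).

```python
import torch
import torch.nn as nn

class FuseptionResNet(nn.Module):
    """Illustrative sketch of FRN (Fig. 1): view pooling as a shortcut
    connection plus an Inception-style branch learning the residual."""
    def __init__(self, n_views, channels):
        super().__init__()
        cin = n_views * channels  # assumption: views concatenated channel-wise
        # Three lightweight cross-spatial filters of the Fuseption branch.
        self.conv3x3 = nn.Conv2d(cin, channels, 3, padding=1)
        self.conv1x3 = nn.Conv2d(cin, channels, (1, 3), padding=(0, 1))
        self.conv3x1 = nn.Conv2d(cin, channels, (3, 1), padding=(1, 0))
        # Conv6: 1x1 conv merging cross-channel statistics and reducing
        # the concatenated maps back to the per-view channel count.
        self.conv6 = nn.Conv2d(3 * channels, channels, 1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, views):
        # views: (batch, n_views, channels, H, W) from the shared feature networks
        shortcut = views.max(dim=1).values              # view pooling branch
        x = views.flatten(1, 2)                         # concat views along channels
        x = torch.cat([self.relu(self.conv3x3(x)),
                       self.relu(self.conv1x3(x)),
                       self.relu(self.conv3x1(x))], dim=1)
        residual = self.conv6(x)                        # Fuseption branch output
        return self.relu(shortcut + residual)           # residual + shortcut
```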

3.2 Learning MVDesc

Fig. 2. (a) Test error of our MVDesc network with or without the view pooling branch. Using view pooling as a shortcut connection contributes to much faster convergence of learning. (b) Test error and forward time per iteration of the MVDesc network with respect to the number of fused views. Three views are chosen as a good trade-off between accuracy and efficiency. (c) Sample multi-view patches produced from the collected SfM database for training

Network. The network for learning MVDesc is built by putting the proposed FRN above multiple parallel feature networks. We use the feature network from MatchNet [19] as the basis, with the bottleneck layer and the metric network removed. The feature networks of the multiple views share the parameters of corresponding convolutional layers. The channel number of Conv6 is set to be the same as that of the feature maps output by a feature network. A ReLU activation [51] follows each convolutional layer except the last, Conv7. A layer of L2 normalization is appended after Conv7, whose channel number can be set flexibly to adjust the dimension of the descriptors. The parameters of the full network are detailed in the supplemental material.

Loss. The two-tower Siamese architecture [17, 52] is adopted here for training, with the double-margin contrastive loss [53], i.e.,

$$\begin{aligned} L(\mathbf {x}_a, \mathbf {x}_b) =y \max (||\mathbf {d}_a - \mathbf {d}_b||_2 - \tau _1, 0) + (1-y) \max (\tau _2 - ||\mathbf {d}_a - \mathbf {d}_b||_2, 0), \end{aligned}$$
(1)

where \(y=1\) for positive pairs and 0 otherwise. \(\mathbf {d}_a\) and \(\mathbf {d}_b\) are L2-normalized MVDesc descriptors of the two sets of multi-view patches \(\mathbf {x}_a\) and \(\mathbf {x}_b\), output by the two towers. We set the margins \(\tau _1 = 0.3, \tau _2 = 0.6\) in experiments.
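A minimal PyTorch rendering of Eq. (1) could look as follows; averaging over the batch is our assumption, as the reduction is not specified above.

```python
import torch
import torch.nn.functional as F

def double_margin_contrastive_loss(d_a, d_b, y, tau1=0.3, tau2=0.6):
    """Eq. (1): y = 1 for positive pairs, 0 for negatives.
    d_a, d_b: L2-normalized descriptors of shape (batch, dim)."""
    dist = (d_a - d_b).norm(dim=1)           # Euclidean distance per pair
    pos = y * F.relu(dist - tau1)            # pull positives within tau1
    neg = (1 - y) * F.relu(tau2 - dist)      # push negatives beyond tau2
    return (pos + neg).mean()                # batch mean (our assumption)
```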

View Number. Unlike [31,32,33], which use 12 or 20 views for objects, we adopt only 3 views in our MVDesc network for the description of local keypoints, a good trade-off between accuracy and efficiency as shown in Fig. 2(b).

Data Preparation. Currently available patch datasets generally lack sufficient multi-view patches for training. For example, one of the largest training sets, Brown [54, 55], possesses fewer than 25k 3D points with at least 6 views. Therefore, we prepare the training data similarly to [20], based on a self-collected Structure-from-Motion (SfM) database. The database consists of 31 outdoor scenes of urban and rural areas captured by UAV and reconstructed by a standard 3D reconstruction pipeline [1, 2, 56,57,58,59,60]. Each scene contains on average about 900 images and 250k tracks with at least 6 projections. The multi-view patches of size 64\(\,\times \,\)64 are cropped from the projections of each track according to SIFT scales and orientations [20], as displayed in Fig. 2(c). A positive training pair is formed by two independent sets of triple-view patches from the same track, while a negative pair is formed from different tracks. A total of 10 million pairs with an equal ratio of positives and negatives are sampled evenly from all 31 scenes. We convert the patches to grayscale, subtract 128 from the intensities and divide them by 160 [19].
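For illustration, the per-patch normalization could be implemented as below; the grayscale conversion weights (BT.601 luma) are an assumption, since the text does not specify the conversion.

```python
import numpy as np

def normalize_patch(patch_rgb):
    """Grayscale conversion and normalization of a 64x64 RGB patch
    (sketch; the luma weights are our assumption, not stated in the text)."""
    gray = patch_rgb.astype(np.float32) @ np.array([0.299, 0.587, 0.114],
                                                   dtype=np.float32)
    return (gray - 128.0) / 160.0    # subtract 128, divide by 160 [19]
```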

Training. We train the network from scratch using SGD with a momentum of 0.9, a weight decay of 0.0005 and a batch size of 256. The learning rate starts at 0.0001 and decays exponentially, dropping by 30% after every epoch. Training is generally done within 10 epochs.
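A sketch of this schedule, reusing the loss function above, might look as follows in PyTorch; `model` and `loader` are hypothetical placeholders, and we read "drops by 30%" as multiplying the rate by 0.7 per epoch.

```python
import torch

def train_mvdesc(model, loader, epochs=10):
    """Training loop matching the schedule above (a sketch; `model` maps a
    batch of multi-view patch sets to L2-normalized descriptors and
    `loader` yields (x_a, x_b, y) pairs -- both are placeholders)."""
    opt = torch.optim.SGD(model.parameters(), lr=1e-4,
                          momentum=0.9, weight_decay=5e-4)
    # "drops by 30% after every epoch": multiply the rate by 0.7 per epoch.
    sched = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=0.7)
    for _ in range(epochs):
        for x_a, x_b, y in loader:          # batches of 256 pairs
            opt.zero_grad()
            loss = double_margin_contrastive_loss(model(x_a), model(x_b), y)
            loss.backward()
            opt.step()
        sched.step()
```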

4 Robust Matching Using Belief Propagation (RMBP)

In this section, we focus on enhancing the accuracy and robustness of 3D point matching. First, a graphical model is defined to describe the spatial organizations of point matches. Second, each match pair is verified by inference on the graphical model via belief propagation. Notably, the proposed method is complementary to existing matching algorithms [13,14,15, 61].

Fig. 3. (a) \(c_{1\sim 4}\) are four pairs of point matches. (b) The graph defined to model the neighboring relationship between \(c_{1\sim 4}\). The solid/dashed lines link compatible/incompatible neighbors. (c) 4,721 pairs of putative point matches. (d) 109 true match pairs refined by our RMBP

4.1 Matching Model

It can be readily observed that inlier point correspondences generally hold spatial proximity. We illustrate this in Fig. 3(a), where \(c_1=(\mathbf {p}_1, \mathbf {q}_1)\), \(c_2=(\mathbf {p}_2, \mathbf {q}_2)\) and \(c_4=(\mathbf {p}_4, \mathbf {q}_4)\) are three pairs of inlier correspondences. For any two pairs of inlier matches, their points in each point cloud are either spatially close to each other, like \(\langle \mathbf {p}_1, \mathbf {p}_2 \rangle \) and \(\langle \mathbf {q}_1, \mathbf {q}_2 \rangle \), or far away from each other, like \(\langle \mathbf {p}_2, \mathbf {p}_4 \rangle \) and \(\langle \mathbf {q}_2, \mathbf {q}_4 \rangle \), at the same time. On the contrary, outlier correspondences tend to show spatial disorder. This observation implies a probabilistic dependence between neighboring point correspondences, which can be modeled by a probabilistic graph.

Formally, we first define the neighborhood of point correspondences as follows. Two pairs of point correspondences \(c_i = (\mathbf {p}_i, \mathbf {q}_i)\) and \(c_j = (\mathbf {p}_j, \mathbf {q}_j)\) are considered as neighbors if either \(\mathbf {p}_i\) and \(\mathbf {p}_j\), or \(\mathbf {q}_i\) and \(\mathbf {q}_j\), are mutually k-nearest neighbors, i.e.,

$$\begin{aligned} \max ({\text {rank}}(\mathbf {p}_i, \mathbf {p}_j), {\text {rank}}(\mathbf {p}_j, \mathbf {p}_i))&< k, \end{aligned}$$
(2)
$$\begin{aligned} \text {or} \,\,\,\, \max ({\text {rank}}(\mathbf {q}_i, \mathbf {q}_j), {\text {rank}}(\mathbf {q}_j, \mathbf {q}_i))&< k, \end{aligned}$$
(3)

where \({\text {rank}}(\mathbf {x}, \mathbf {y})\) denotes the rank of the distance of point \(\mathbf {y}\) with respect to point \(\mathbf {x}\). Then, the neighboring relationship between \(c_i\) and \(c_j\) can be further divided into two categories: if Conditions (2) and (3) are satisfied simultaneously, \(c_i\) and \(c_j\) are called compatible neighbors, as they are very likely to co-exist as inlier matches. But if only one of Conditions (2) and (3) is satisfied, while the corresponding points in the other point cloud lie far apart from each other, e.g.,

$$\begin{aligned} \min ({\text {rank}}(\mathbf {p}_i, \mathbf {p}_j), {\text {rank}}(\mathbf {p}_j, \mathbf {p}_i))&> l, \end{aligned}$$
(4)
$$\begin{aligned} \text {or} \,\,\,\, \min ({\text {rank}}(\mathbf {q}_i, \mathbf {q}_j), {\text {rank}}(\mathbf {q}_j, \mathbf {q}_i))&> l, \end{aligned}$$
(5)

\(c_i\) and \(c_j\) are called incompatible neighbors, as it is impossible for two match pairs that breach spatial proximity to be inliers simultaneously. The threshold parameter k in Conditions (2) and (3) is set to a relatively small value, while the parameter l in Conditions (4) and (5) is set larger than k by a considerable margin. These settings ensure sufficiently strict conditions for identifying compatible or incompatible neighbors, for the sake of robustness.
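The following sketch illustrates how Conditions (2)-(5) classify a pair of matches; the values of k and l are illustrative only, and a practical implementation would precompute the rank queries with a k-d tree rather than brute force.

```python
import numpy as np

def rank(X, a, b):
    """Rank of point X[b] in the distance ordering around X[a] (self has rank 0)."""
    d = np.linalg.norm(X - X[a], axis=1)
    return int(np.where(np.argsort(d) == b)[0][0])

def neighbor_type(P, Q, i, j, k=10, l=50):
    """Classify the relation between matches c_i = (p_i, q_i) and
    c_j = (p_j, q_j) per Conditions (2)-(5); k and l are illustrative."""
    near_p = max(rank(P, i, j), rank(P, j, i)) < k   # Condition (2)
    near_q = max(rank(Q, i, j), rank(Q, j, i)) < k   # Condition (3)
    far_p = min(rank(P, i, j), rank(P, j, i)) > l    # Condition (4)
    far_q = min(rank(Q, i, j), rank(Q, j, i)) > l    # Condition (5)
    if near_p and near_q:
        return "compatible"
    if (near_p and far_q) or (near_q and far_p):
        return "incompatible"
    return "unrelated"                               # no edge in the graph
```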

Based on the spatial property of point matches stated above, a graphical model is built to capture the pairwise interactions between neighboring match pairs, as shown in Fig. 3(a) and (b). The nodes of the graphical model are defined as a set of binary variables \(\mathcal {X} = \{x_i \}\), each associated with a point correspondence. \(x_i \in \{0, 1 \}\) indicates the latent state of being an outlier or an inlier, respectively. The undirected edges between nodes are then formed based on the compatible and incompatible neighboring relationships defined above. With the purpose of rejecting outliers, the objective is to compute the marginal probability of being an inlier for each point correspondence by performing inference on the defined model.

4.2 Inference by Belief Propagation

The task of computing marginals on nodes of a cyclic network is known to be NP-hard [62]. As a disciplined inference algorithm, loopy belief propagation (LBP) provides approximate yet compelling inference on arbitrary networks [63].

In the case of our graphical network with binary variables and pairwise interactions, the probabilistic distributions of all node variables are first initialized as \([0.5, 0.5]^T\) with no prior imposed. Then the iterative message update step of a standard LBP algorithm at iteration t can be written as

$$\begin{aligned} \mathbf {m}_{ij}^{t+1} = \frac{1}{Z} \mathbf {F}_{ij} \mathbf {m}_i \prod _{k \in \partial i \backslash j} \mathbf {m}_{ki}^t. \end{aligned}$$
(6)

Here, \(\partial i\) denotes the set of neighbors of node \(x_i\) and Z normalizes the message to unit L1 norm. The message \(\mathbf {m}_{ij}\) passed from node \(x_i\) to \(x_j\) is a two-dimensional vector representing the belief about \(x_j\)'s probability distribution inferred from \(x_i\). The same holds for the constant message \(\mathbf {m}_i\) passed from the observation of node \(x_i\), which encodes the likelihood distribution of \(x_i\) given its observation measurements, such as descriptor similarity. The first and second components of a message are the probabilities of being an outlier and an inlier, respectively. The product of messages is component-wise. The \(2\,\times \,2\) matrix \(\mathbf {F}_{ij}\) is the compatibility matrix of nodes \(x_i\) and \(x_j\). Based on the neighboring relationships analyzed above, the compatibility matrix should favor both nodes being inliers if they are compatible neighbors, and the reverse if they are incompatible neighbors. To explicitly encode this bias, the compatibility matrix takes one of two forms in the two cases, respectively:

$$\begin{aligned} \mathbf {F}^+= \begin{bmatrix} 1&1 \\ 1&\lambda \end{bmatrix} \quad \text {or}\quad \mathbf {F}^-= \begin{bmatrix} \lambda&\lambda \\ \lambda&1 \end{bmatrix} . \end{aligned}$$
(7)

The parameter \(\lambda \) takes a biased value greater than 1. To guarantee the convergence of LBP, Simon’s condition [64] is enforced and the value of \(\lambda \) is thus constrained by

$$\begin{aligned} \max _{x_i \in \mathcal {X}} |\partial i| \cdot \log \lambda < 2, \end{aligned}$$
(8)

so that \(\lambda \) is set adaptively according to this bound. The proof of the convergence condition is detailed in the supplemental material. After convergence, the marginal distribution of node \(x_i\) is derived by

$$\begin{aligned} \mathbf {b}_i = \frac{1}{Z} \mathbf {m}_i \prod _{k \in \partial i} \mathbf {m}_{ki}, \end{aligned}$$
(9)

which unifies the implication from individual observations and the beliefs from the structured local neighborhood. After inference, point matches with low marginals (e.g. \(<0.5\)) are discarded as outliers. This greatly improves matching accuracy, as shown in Fig. 3(d), where 109 true match pairs are refined from 4,721 noisy putative match pairs.
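Putting Eqs. (6)-(9) together, a compact NumPy sketch of the inference loop is given below; the fixed iteration count and the exact value chosen for \(\lambda \) under the bound of Eq. (8) are illustrative choices, not the paper's implementation.

```python
import numpy as np

def rmbp_inference(n, edges, likelihoods, n_iter=50):
    """Illustrative RMBP inference (Eqs. 6-9). `edges` is a list of
    (i, j, compatible) tuples; `likelihoods` is an (n, 2) array of
    observation messages m_i = [P(outlier), P(inlier)] per match."""
    likelihoods = np.asarray(likelihoods, dtype=float)
    nbrs = {i: [] for i in range(n)}
    comp = {}
    for i, j, c in edges:
        nbrs[i].append(j)
        nbrs[j].append(i)
        comp[(i, j)] = comp[(j, i)] = c
    # Eq. (8): keep max|di| * log(lambda) strictly below 2 (Simon's condition).
    max_deg = max(len(v) for v in nbrs.values())
    lam = np.exp(2.0 / max_deg) ** 0.99
    Fp = np.array([[1.0, 1.0], [1.0, lam]])     # compatible neighbors, Eq. (7)
    Fm = np.array([[lam, lam], [lam, 1.0]])     # incompatible neighbors
    msgs = {e: np.array([0.5, 0.5]) for e in comp}   # uniform init, no prior
    for _ in range(n_iter):
        new_msgs = {}
        for (i, j) in msgs:
            # product of observation and incoming messages, excluding x_j's
            belief = likelihoods[i].copy()
            for k in nbrs[i]:
                if k != j:
                    belief *= msgs[(k, i)]
            F = Fp if comp[(i, j)] else Fm
            m = F.T @ belief                    # Eq. (6), up to normalization
            new_msgs[(i, j)] = m / m.sum()      # Z = L1 norm
        msgs = new_msgs
    # Eq. (9): marginal combines the observation and all incoming beliefs.
    marg = likelihoods.copy()
    for (k, i) in msgs:
        marg[i] *= msgs[(k, i)]
    marg /= marg.sum(axis=1, keepdims=True)
    return marg[:, 1]                           # P(inlier); keep matches >= 0.5
```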

Complexity Analysis. The complexity of LBP is known to be linear in the number of edges of the graph [65]. Conditions (2) and (3) bound the degree of each node to be less than 2k, so the complexity of RMBP is linear in the number of nodes.

5 Experiments

In this section, we first individually evaluate the proposed MVDesc and RMBP in Sects. 5.1 and 5.2 respectively. Then the two approaches are validated on the tasks of geometric registration in Sect. 5.3.

All the experiments, including the training and testing of neural networks, are done on a machine with an 8-core Intel i7-4770K, 32 GB of memory and an NVIDIA GTX 980 graphics card. In the experiments below, when we say putative matching, we mean finding the correspondences between points whose descriptors are mutually closest to each other in Euclidean space across the two point sets. The matching is implemented based on OpenBLAS [66]. The traditional geometric descriptors [6,7,8,9] are produced with PCL [67].
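For reference, this mutual-closest criterion amounts to the brute-force sketch below; the actual implementation relies on OpenBLAS [66] for the distance computations.

```python
import numpy as np

def putative_matches(desc_p, desc_q):
    """Mutual nearest-neighbor matching in descriptor space.
    desc_p: (n, d) and desc_q: (m, d) descriptor arrays."""
    # Pairwise Euclidean distances between the two descriptor sets.
    d = np.linalg.norm(desc_p[:, None, :] - desc_q[None, :, :], axis=2)
    nn_pq = d.argmin(axis=1)      # closest q for each p
    nn_qp = d.argmin(axis=0)      # closest p for each q
    return [(i, j) for i, j in enumerate(nn_pq) if nn_qp[j] == i]
```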

5.1 Evaluation of MVDesc

The goal here is to compare the descriptive power of the proposed MVDesc against state-of-the-art patch descriptors [13, 17, 18] and geometric descriptors [6,7,8,9,10,11].

Table 1. The mAPs of descriptors in the three tasks of the HPatches benchmark [68]. Our MVDesc holds the top place in all three tasks

5.1.1 Comparisons with Patch Descriptors

Setup. We choose HPatches [68], one of the largest local patch benchmarks, for evaluation. It consists of 59 cases and 96,315 6-view patch sets. First, we partition each patch set into two subsets by splitting the 6 views into halves. Then, the 3-view patches are taken as input to generate descriptors. We set up the three benchmark tasks of [68] by reorganizing the 3-view patches and use the mean average precision (mAP) as the measurement. For the patch verification task, we collect all 96,315 positive pairs of 3-view patches and 100,000 random negatives. For the image matching task, we apply putative matching across the two half sets of each case after mixing 6,000 unrelated distractors into every half. For the patch retrieval task, we use all 96,315 6-view patch sets, each of which contributes one 3-view patch set as a query and the other 3-view set to the database. Besides, we mix 100,000 3-view patch sets from an independent image set into the database as distractors.

We make comparisons with the baseline SIFT [13] and the state-of-the-art DeepDesc [17] and L2-Net [18], for which we randomly choose a single view from the 3-view patches. To verify the advantage of our FRN over the widely-used view pooling [31,32,33] in multi-view fusion, we remove the Fuseption branch from our MVDesc network and train it with the same data and configuration. All descriptors have a dimensionality of 128.

Results. The statistics in Table 1 show that our MVDesc achieves the highest mAPs in all the three tasks. First, it demonstrates the advantage of our FRN over view pooling [31,32,33, 40, 69] in terms of multi-view fusion. Second, the improvement of MVDesc over DeepDesc [17] and L2-Net [18] suggests the benefits of leveraging more image views than a single one. Additionally, we illustrate in Fig. 4(a) the trade-off between the mAP of the image matching task and the dimension of our MVDesc. The mAP rises but gradually saturates with the increase of dimension.

Fig. 4. (a) The trade-off of mAP versus dimension of our MVDesc on the HPatches benchmark [68]; (b) The change of matching precisions w.r.t. the varying threshold of point distances on the TUM dataset [28]. The 128- and 32-dimensional MVDesc rank first and second at any threshold

5.1.2 Comparisons with Geometric Descriptors

Setup. Here, we perform evaluations on matching tasks of the RGB-D dataset TUM [28]. Following [11], we collect up to 3,000 pairs of overlapping point cloud fragments from 10 scenes of TUM. Each fragment is recovered from an independent RGB-D sequence of 50 frames. We detect keypoints from the fragments by SIFT3D [70] and then generate geometric descriptors, including PFH [6], FPFH [7], SHOT [8], USC [9], 3DMatch [10] and CGF [11]. Our MVDesc is derived from the projected patches of keypoints in three randomly-selected camera views. For easier comparison, two dimensions of MVDesc, 32 and 128 (MVDesc-32 and MVDesc-128), are adopted. Putative matching is applied to all the fragment pairs to obtain correspondences. Following [11], we measure the performance of descriptors by the fraction of correspondences whose ground-truth distances lie below a given threshold.

Table 2. Precisions and recalls of matching on the TUM dataset [28] when the threshold of point distances equals 0.1 m. The average time taken to encode 1,000 descriptors is also compared. Our MVDesc performs best in terms of precision, recall and efficiency

Results. The precision of point matches w.r.t. the threshold of point distances is depicted in Fig. 4(b). MVDesc-128 and MVDesc-32 rank first and second in precision at any threshold, outperforming the state-of-the-art works by a considerable margin. We report in Table 2 the precisions and recalls when setting the threshold to 0.1 m, together with the average time of producing 1,000 descriptors. Producing geometric descriptors is generally slower than producing MVDesc due to the cost of computing local histograms, even though the computation has been accelerated by multi-thread parallelism.

5.2 Evaluation of RMBP

Setup. To evaluate the performance of outlier rejection, we compare RMBP with RANSAC and two state-of-the-art works, Sparse Matching Game (SMG) [42] and Locality Preserving Matching (LPM) [43]. The parameters of all methods have been tuned to give the best results. We match 100 pairs of same-scale point clouds from 20 diverse indoor and outdoor scenes of the TUM [28], ScanNet [72] and EPFL [39] datasets. We keep a constant number of inlier correspondences and continuously add outlier correspondences as distraction. The evaluation uses the following metrics: the mean precisions and recalls of outlier rejection and inlier selection. Formally, we write \(OP=\frac{\#true\,\, rejections}{\#rejections}\), \(OR=\frac{\#true\,\, rejections}{\#outliers}\), \(IP=\frac{\#kept\,\, inliers}{\#kept\,\, matches}\), \(IR=\frac{\#kept\,\, inliers}{\#inliers}\). We also estimate the transformations from the kept matches using RANSAC and report the median point distance after registration.
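Given boolean masks over the putative matches, these four metrics reduce to simple counting, as in the sketch below; the guards against empty denominators are ours.

```python
import numpy as np

def filter_metrics(kept, inlier):
    """OP, OR, IP, IR from boolean masks over all putative matches:
    kept[i] = match i survives filtering, inlier[i] = match i is a true inlier."""
    kept, inlier = np.asarray(kept, bool), np.asarray(inlier, bool)
    true_rej = np.sum(~kept & ~inlier)
    OP = true_rej / max(np.sum(~kept), 1)                # rejection precision
    OR = true_rej / max(np.sum(~inlier), 1)              # rejection recall
    IP = np.sum(kept & inlier) / max(np.sum(kept), 1)    # selection precision
    IR = np.sum(kept & inlier) / max(np.sum(inlier), 1)  # selection recall
    return OP, OR, IP, IR
```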

Fig. 5. The mean precisions and recalls of outlier rejection (OP, OR) and inlier selection (IP, IR), as well as the median point distance after registration, w.r.t. the inlier ratio. RMBP performs well in all metrics and at all inlier ratios, while RANSAC, SMG [42] and LPM [43] fail to give valid registrations when the inlier ratio drops below \(\frac{1}{8}\), \(\frac{1}{8}\) and \(\frac{1}{64}\), respectively

Results. The measurements with respect to the inlier ratio are shown in Fig. 5. First, RMBP is the only method achieving high performance in all metrics and at all inlier ratios. Second, RANSAC, SMG and LPM fail to give valid registrations when the inlier ratio drops below \(\frac{1}{8}\), \(\frac{1}{8}\) and \(\frac{1}{64}\), respectively. They obtain high OP and OR but low IP or IR when the inlier ratio is smaller than \(\frac{1}{16}\), because they tend to reject almost all the matches.

5.3 Geometric Registration

In this section, we verify the practical usage of the proposed MVDesc and RMBP through geometric registration tasks. We operate on point clouds obtained from two different sources: those scanned by RGB-D sensors and those reconstructed by multi-view stereo (MVS) algorithms [71].

5.3.1 Registration of Scanning Data

Setup. We use the task of loop closure as in [10, 11] based on the dataset ScanNet [72], where we check whether two overlapping sub-maps of an indoor scene can be effectively detected and registered. Similar to [10, 11], we build up independent fragments of 50 sequential RGB-D frames from 6 different indoor scenes of ScanNet [72]. For each scene, we collect more than 500 fragment pairs with labeled overlap obtained from the ground truth for registration.

The commonly-used registration algorithm, putative matching plus RANSAC, is adopted in combination with various descriptors [6,7,8,9,10,11]. The proposed RMBP serves as an optional step before RANSAC. We use the same metric as [10, 11], i.e., the precision and recall of registration of fragment pairs. Following [10], a registration is viewed as a true positive if the estimated Euclidean transformation yields more than \(30\%\) overlap between the registered fragments and a transformation error of less than 0.2 m. We deem a registration positive if more than 20 pairs of point correspondences remain after RANSAC.

Fig. 6. Challenging cases of loop closures from the ScanNet dataset [72]. The images in the top row indicate the very limited overlap shared by the fragment pairs. Our MVDesc-32+RMBP succeeds in the registration of these cases while the top-performing geometric descriptor CGF-32 [11] fails no matter whether RMBP is employed

Table 3. The quantitative comparisons of 3D descriptors on registration of the ScanNet dataset [72]. The superscript * means the proposed RMBP is applied. RMBP generally lifts the precisions and recalls of registration for almost all the descriptors. Our MVDesc clearly surpasses the state-of-the-art works in recall, with comparable precision and run-time of registration

Results. The precisions, recalls and the average time of registration per pair are reported in Table 3. Our MVDesc-32 and MVDesc-128 both surpass their counterparts by a significant margin in recall, with comparable precision and efficiency. Our versatile RMBP improves the precisions for 6 out of 8 descriptors and lifts the recalls for 7 out of 8 descriptors. Sample registration results of overlap-deficient fragments are visualized in Fig. 6.

Indoor Reconstruction. The practical usage of MVDesc is additionally evaluated by indoor reconstruction on the ScanNet dataset [72]. We first build reliable local fragments through RGB-D odometry following [73, 74] and then globally register the fragments based on [15]. RMBP is applied for outlier filtering. The FPFH [7] used in [15] is replaced by SIFT [13], CGF-32 [11] or MVDesc-32 to establish correspondences. We also test the collaboration of CGF-32 and MVDesc-32 by combining their correspondences. Our MVDesc-32 contributes to a visually compelling reconstruction, as shown in Fig. 7(a). We find that MVDesc-32 serves as a solid complement to CGF-32, as shown in Fig. 7(b), especially for scenes with rich textures.

Fig. 7. (a) A complete reconstruction of an apartment from ScanNet [72] using our MVDesc-32. (b) The reconstructions using SIFT [13], CGF-32 [11] and MVDesc-32. The collaboration of MVDesc-32 and CGF-32 yields the best reconstruction, as shown in the last cell of (b)

5.3.2 Registration of Multi-view Stereo (MVS) Data

Setup. Aside from the scanning data, we run registration on the four scenes of the MVS benchmark EPFL [39]. First, we split the cameras of each scene into two spatially separated clusters, highlighting the difference between camera views, as shown in Fig. 8. Then, the ground-truth camera poses of each cluster are utilized to independently reconstruct dense point clouds with the MVS algorithm [71]. Next, we detect keypoints by SIFT3D [70] and generate descriptors [6,7,8,9,10,11] for each point cloud. The triple-view patches required by MVDesc-32 are obtained by projecting keypoints into 3 randomly-chosen visible image views, with occlusion tests performed by ray tracing [75]. Afterwards, the correspondences between the two point clouds of each scene are obtained by putative matching and then RMBP filtering. Finally, we register the two point clouds of each scene based on FGR [15] using the estimated correspondences.

Results. Our MVDesc-32 and RMBP help to achieve valid registrations for all four scenes, whilst none of the geometric descriptors, including CGF [11], 3DMatch [10], FPFH [7], PFH [6], SHOT [8] and USC [9], do, as shown in Fig. 8. We find that the failures are mainly caused by the geometrically symmetric patterns of the four scenes. We show the correspondences between CGF-32 features [11] in Fig. 9 as an example. Putative matching results in a large number of ambiguous correspondences between keypoints located at symmetric positions. In essence, our RMBP cannot resolve the ambiguity in such cases, because correspondences in a symmetric structure still adhere to the geometric consistency. Ultimately, the ambiguous matches lead to the horizontally-flipped registration results shown in Fig. 8. At least on the EPFL benchmark [39], the proposed MVDesc descriptor shows a superior ability of description compared to the geometric ones.

Fig. 8. Registrations of models of the EPFL benchmark [39]. Given the same keypoints, our MVDesc-32+RMBP accomplishes correct registrations while CGF-32 [11]+RMBP fails due to the symmetric ambiguity of the geometry

Fig. 9. Spurious correspondences of CGF features [11] before and after RMBP filtering. The two point clouds of [39] (colored in yellow and red) have been overlaid by the ground-truth transformation for visualization. Although our RMBP has eliminated most of the unorganized false matches, it is incapable of rejecting the ambiguous outliers arising from the symmetric geometry

6 Conclusion

In this paper, we address the correspondence problem for the registration of point clouds. First, a multi-view descriptor, named MVDesc, has been proposed for the encoding of 3D keypoints, which strengthens the representation by applying the fusion of image views [31,32,33] to patch descriptor learning [17,18,19,20,21]. Second, a robust matching method, abbreviated as RMBP, has been developed to reject outlier matches by means of efficient inference through belief propagation [63] on the defined graphical matching model. Both approaches have been validated to be conducive to forming point correspondences of better quality for registration, as demonstrated by extensive comparative evaluations and registration experiments [6,7,8,9,10,11, 15, 17, 18, 61].