Abstract
Convolutional neural networks (CNNs) have been successfully applied to solve the problem of correspondence estimation between semantically related images. Due to non-availability of large training datasets, existing methods resort to self-supervised or unsupervised training paradigm. In this paper we propose a semi-supervised learning framework that imposes cyclic consistency constraint on unlabeled image pairs. Together with the supervised loss the proposed model achieves state-of-the-art on a benchmark semantic matching dataset.
You have full access to this open access chapter, Download conference paper PDF
Similar content being viewed by others
Keywords
1 Introduction
The task of estimating correspondences across different images is one of the challenging problems in computer vision. Some popular lines of work include optical flow estimation [1], tracking [2] or stereo fusion [3] that all require estimating pixel-wise correspondences between an image pair. However, such problems deal with images of the same scene or object without much change in appearance or geometry of the scene. Such variations are challenges usually observed in semantic matching, where the objective is to estimate correspondences between semantically similar but different instances of the same object or scene.
The current approaches, like other fields in computer vision, can be broadly categorized into hand-crafted [4,5,6] and deep learning based methods. Hand-crafted methods employ features such as HOG,SURF or SIFT descriptors [7,8,9] in association with a geometric regularizer to establish spatially consistent correspondences. The deep learning based methods can be divided into following: (i)direct methods: that directly learn the correspondence as a matching function [10, 11] and (ii)indirect methods: that first learn an embedding where representations from similar image patch are mapped close to each other in Euclidean space [12, 13] followed by correspondence estimation using nearest neighbor search. However, this involves computationally heavy pairwise matching between putative image patches (or regions) from each image.
One of the problems with training deep learning models is the requirement of large amount of labeled data [14]. Using deep models to solve the semantic matching problem also faces a similar issue, where popular datasets like Proposal Flow [15] consists of about only 1400 image pairs with sparse ground-truth key-point correspondences, with 700 image pairs used for training. Generating datasets with ground-truth transformations for semantic matching is a challenging task. Sampling images and transformations from a 3D model is feasible for a single object but is not straightforward for multiple objects with intra-class variations. Recent deep learning approaches have addressed this issue using the self-supervised [10, 16] and unsupervised paradigm [11]. On the other hand obtaining image pairs with sparse ground-truth point correspondences is relatively simple for small-sized datasets (e.g. Proposal Flow). However, it is even simpler to obtain larger datasets with only image level correspondence. In this paper we explore the semi-supervised semantic correspondence learning framework where only a subset of the training image pairs are labeled with ground-truth correspondences and the rest are unlabeled (or weakly labeled) i.e. only image level correspondence information is available. In particular, we make the following key contributions: (i) we show that extending [10] to the supervised setting brings significant increase in semantic matching performance, (ii) we propose a novel loss function based on geometric re-projection error via cycle consistency that better complements the above supervised loss to make use of weakly-labeled data.
2 Related Work
Semantic Matching. Much of the earlier decade until recently has seen hand-crafted features and descriptors like SIFT [7], DAISY [8] being used in matching cross-instances of semantically related objects in images. SIFTFlow [17] uses dense SIFT descriptors in an optimization pipeline that minimizes the matching energy. More recently, Ham et al [15] introduced proposal flow that generates dense correspondence by finding putative matches between object proposals. This work along with Taniai et al [18] propose the use of HOG descriptors. With the success of deep learning, CNN representations were used instead of hand-crafted descriptors to establish correspondences. However, [15] shows that the performance still lags behind hand-designed descriptors. This performance gap is attributed to lack of fine-tuning the CNN representations for the target task of semantic matching.
Deep Learning for Dense Correspondence. The success of learning deep features in related problems like optical flow [1], stereo fusion has motivated similar application for semantic matching. Choy et al [19] propose a universal correspondence network for learning fine-grained high resolution feature representation using metric learning. The representations are then used to establish correspondences after geometric verification. Similarly, [12] uses correspondences between region proposals that pass a geometric verification check to fine tune the representations of the network. Kim et al [20] introduce a CNN descriptor termed fully convolutional self-similarity which are then combined with the proposal flow based geometric consistency check. The proposed CNN based approaches are at the same level or better than hand-engineered features, but, include costly pairwise matching between candidate regions. On the other hand, [10, 11, 21] learn the correspondence and feature representation in an end-to-end framework.
Unsupervised Correspondence Learning. It is common knowledge that neural networks are data hungry models. Transfer learning alleviates the problem to certain extent, but the main challenge lies in the effective use of large amount of unlabeled data. Zhou et al [22] propose a training procedure for their network that can learn to predict relative camera motion and depth without supervision using large amount of videos. Similarly, [23] propose to learn homography transformation between an image pair without using ground-truth transformation information. [24] introduced a semi-supervised paradigm based on GAN that learns optical flow from both labeled synthetic datasets and unlabeled real videos. The key recipe in all these algorithms is the idea of photometric consistency. However, in the field of semantic matching, due to large appearance variations, this color constancy constraint does not hold.
Rocco et al [10] propose learning a set of transformations in an iterative manner using synthetically generated ground-truth transformations. As a follow up [11] an unsupervised transformation learning method is proposed that uses feature similarity score at pixel positions consistent with the predicted transformation.
Cycle Consistency Loss. Cycle consistency has been used to learn correspondence in a variety of settings [25, 26] where images are defined as nodes and the pairwise flow fields define the edges. The main idea is to minimize the net distance between the key-points in source image and its estimated position obtained by traversing the cycle using respective flow fields. Zhou et al [21] extends the idea to the framework of CNNs by leveraging 3D models to create cyclic graphs between the rendered synthetic views and pairs of images. The network is made to predict transformations for image-image and image-synthetic pairs. Using the 4-cycle constraint, the synthetic-synthetic transformation is estimated and compared with the ground-truth to generate gradients. However, the method necessitates the availability of 3D models and sampling appropriate synthetic views.
3 Proposed Method
3.1 Background
In this section we give the reader a brief background of the correspondence estimation via geometric transformation. In [10] a CNN based architecture is proposed, which given two images \(I_A\), \(I_B\), estimates the parameters \(\theta _{AB}\) of the geometric transformation \(\mathbb {T}_{AB}\) between them. The network consists of three sequential stages as explained next.
Feature extraction layer that extracts feature representation for each image \(f_A\), \(f_B\), where \(f \in \mathbb {R}^{w \times h \times d}\) is a tensor. This can be interpreted as d dimensional representations at \(w \times h\) locations. The tensor representations are then L2 normalized.
Correlation layer computes the correlation between the normalized features resulting in a tensor \(c_{AB} \in \mathbb {R}^{w \times h \times (w \times h) }\).
Regression layer is the final stage of the geometry estimation network consisting of a series of convolutional layers and a fully connected layer that finally outputs the parameters of the geometric transformation. Two types of geometric transformations are considered, affine and thine-plate spline (tps). The transformation estimates are computed in an iterative manner. In the first iteration the network outputs affine parameters. Then, \(I_A\) is warped using the estimated transformation and feed-forwarded through the second geometry estimation network which outputs the tps parameters. The only difference between the networks in both iteration is in the final fully connected layer which outputs 6 estimates for affine transformation and 18 for thin-plate spline. The final transformation is a composition of the estimated transformations.
Loss function that the network parameters are optimized on is a novel grid loss unlike traditional approaches which directly minimizes the L2 error between the estimated and ground truth transformation parameters. A fixed grid of points \(G = \{g_i\}\), where \(g \in \mathbb {R}^2\) and \(N = |G|\), is defined on \(I_B\) is transformed using the estimated and ground truth transformations, \(\hat{\theta }\) and \(\theta \), to obtain \(\mathbb {T}_{\hat{\theta }}\) and \(\mathbb {T}_{\theta }\) respectively. The self-supervised grid loss is then computed as the L2 error between the transformed grid locations :
3.2 Semi-supervised Learning
We now proceed to show how the above self-supervised geometric transformation network can be trained in a semi-supervised manner. In semi-supervised learning, we assume to have a dataset of image pairs, \(D_l\) labeled with the information of corresponding keypoints. In addition, we have a unlabeled (or weakly labeled) dataset, \(D_{ul}\) of image pairs with only image level correspondence.
Supervised Learning. In [22], the authors propose a training procedure to learn relative camera pose and monocular depth estimation by providing supervision only at the meta-task of view synthesis. Thereby, without explicit supervision the network learns to solve intermediate tasks of relative camera pose and depth estimation. The only requirement is that the intermediate tasks should be differentiable w.r.t meta-task to allow back-propagation. Equation 1 has a similar formulation, where the network is forced to learn accurate geometric transformations to better solve the meta-task of minimizing the grid loss. Thereby in a supervised setting with ground-truth pixel correspondences \(P_A\) and \(P_B\) between \(I_A\) and \(I_B\), where \(P = \{p_j\}, p \in \mathbb {R}^{2 \times M}\) represents the M corresponding pixel locations in each image, Eq. 1 can be re-written as:
Although there are multiple geometric transformations that can fit to a given set of sparse correspondences, the intuition here is that the network will learn to generalize as it observes a diverse set of image pairs and pixel correspondences(e.g. image pairs arising from different object categories have a different distribution of keypoints in correspondence).
Unsupervised Learning. In order to learn geometric transformation from unlabeled image pairs, we propose a self-consistent grid loss. We consider an unlabeled image pair as a directed 2-cycle graph, where the edge represents forward flow/transformation. Cycle consistency [25] states that pixels or points defined in one image, when transferred through the composition of transformations along the edges, should have a zero net displacement as shown in Fig. 1. To accommodate this constraint, we compute both the forward and backward geometric transformations. Thereafter instead of computing the error in the space of the transformed grid positions, we compute the self-consistent grid loss that measures the loss in the original grid locations. Equation 1 can now be written for the unsupervised case as:
Converging to the true solution using the proposed loss is not trivial as an identity transformation completely satisfies the constraints of the proposed loss function. However, in the semi-supervised setting, this will not be the case as identity transformation will produce high loss for the labeled image pairs. The semi-supervised objective has the following formulation:
where \(\beta \) balances the supervised and unsupervised loss functions and is obtained using validation data.
4 Experimental Results
In this section we present the experimental settings to test the proposed method.
4.1 Datasets
In line with previous work [10,11,12, 15], we train the transformation estimation model using the PF-PASCAL dataset [15]. The dataset consists of 1400 image pairs with corresponding key-point annotations. Training,validation and test sets are obtained using the split proposed in [11] resulting in about 700, 300 and 300 image pairs respectively. The image pairs are also classified into 20 object categories.
Labeled Data. We increase the size of the training set to about 2500 using random flipping of image pairs and represent it by \(D_l\). This results in repetitions of image pairs. It is also ensured all test or validation image pairs are removed from \(D_l\).
Unlabeled Data. The total number of all possible image pairs that can be generated from PF-PASCAL dataset is around 33000. However, there is a class imbalance in terms of number of images per object category. This implies the number of possible pairs is also quadratically disproportionate across categories. In order to avoid this class imbalance, we upper bound the number of pairs per category to 100. To this set of image pairs we further add the labeled set \(D_l\), but, remove the correspondence information. The combined set forms our unlabeled set \(D_{ul}\) with 7400 image pairs.
Evaluation Criteria. We evaluate the proposed approach using the probability of correctly matched key-points (PCK) metric. This metric counts the number of key-points in the source image whose projection on the target image based on the correspondence prediction lies within a given threshold. As recommended in practice, the key-point coordinates are normalized in the range [0,1] using the respective image width and height. A distance threshold of 0.1 is used to count the correctly transferred keypoints.
4.2 Baselines
We compare our proposed semi-supervised method with the recent state-of-the-art methods: SCNet [12] and its variants,CNNGeo [10] and CNNGeo2 [11]. CNNGeo trained using the loss functions defined in Eqs. 2, 3 are termed CNNGeoS and CNNGeoU respectively. The combination CNNGeoS + CNNGeoU (Eq. 4) is the proposed method. In order to evaluate the performance of our semi-supervised model, we create a baseline model CNNGeoS + CNNGeo2. This baseline model, referred as CNNGeoS2, is also trained using Eq. 4 where the unsupervised loss is now driven by CNNGeo2 instead of the proposed CNNGeoU. We guide the reader to Table 1 for a more comprehensive understanding.
4.3 Implementation Details
Network Architecture. Recent methods [11, 27, 28] have shown that architectures from the ResNet family [29] are well suited to the task of estimating transformations. We proceed with the ResNet-101 architecture truncated at the conv4-23 layer. This forms the feature extraction layer (c.f. Sect. 3.1). The correlation and regression layer has the same architecture as in [10, 11]. The network is pre-trained end to end using [10].
Training Details. The network is based on PyTorch [30] framework and is trained and evaluated on PF-PASCAL dataset using the split detailed in Sect. 4.1. All training images are resized to 240 \(\times \) 240 resolution. Back-propagation is done using Adam [31] optimizer with a batch size of 16. These settings are shared by all the geometric transformation methods listed in Table 1. The learning rate is set to \(5.10^{-8}\) for CNNGeo, CNNGeo2 and CNNGeoS2. Particularly, for CNNGeoS2 increasing the learning rate led to drastic drop in keypoint transfer accuracy. CNNGeoS and our proposed model are trained with a higher learning rate of \(5.10^{-6}\) as it produced better results on the validation set. \(\beta \) is set to 1 for the proposed semi-supervised method and the baseline CNNGeoS2.
4.4 Results
We evaluated the baselines and existing methods on the PF-PASCAL test set and present our results in Table 2. Overall, the proposed semi-supervised approach outperforms the existing methods and the baseline geometric transformation models. The comparison with SCNet is not direct as we use ResNet-101 architecture which learns powerful representation than VGG-16 used by SCNet. However, the proposed approach and the baseline models (CNNGeo* in Table 2) were trained using a similar training setup and hence the comparison is fair and direct. Supervised model CNNGeoS clearly performs better than the self-supervised CNNGeo and unsupervised model CNNGeo2 as expected. This implies the model is able to learn valid geometric transformations using only sparse correspondences.
We now compare the baseline model CNNGeoS2 and our proposed method, which were trained in a semi-supervised framework. The proposed model sets the state-of-the-art in semantic matching across multiple object categories. This also shows that the proposed unsupervised loss (Eq. 3) is complementary to the supervised loss function. Also, both the supervised and unsupervised loss in our model operate in the space of normalized pixel space unlike CNNGeoS2 where the unsupervised loss operates directly on feature representations.
Figures 2, 3 and 4 shows some qualitative results where the source image is warped using the estimated transformations from the baselines (CNNGeoS and CNNGeoS2) and the proposed method respectively.
5 Conclusion
We presented a semi-supervised learning paradigm to address the problem of semantic matching. In particular, we demonstrated that cycle consistency can be integrated with supervised methods to learn correspondence from unlabeled data. Results show that our proposed approach outperforms the state-of-the-art semantic matching methods.
References
Fischer, P., et al.: FlowNet: Learning optical flow with convolutional networks. arXiv preprint arXiv:1504.06852 (2015)
Engel, J., Schöps, T., Cremers, D.: LSD-SLAM: large-scale direct monocular SLAM. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8690, pp. 834–849. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10605-2_54
Hosni, A., Rhemann, C., Bleyer, M., Rother, C., Gelautz, M.: Fast cost-volume filtering for visual correspondence and beyond. IEEE Trans. Pattern Anal. Mach. Intell. 35(2), 504–511 (2013)
Bristow, H., Valmadre, J., Lucey, S.: Dense semantic correspondence where every pixel is a classifier. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 4024–4031 (2015)
Hur, J., Lim, H., Park, C., Ahn, S.C.: Generalized deformable spatial pyramid: geometry-preserving dense correspondence estimation. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
Kim, J., Liu, C., Sha, F., Grauman, K.: Deformable spatial pyramid matching for fast dense correspondences. In: Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition, pp. 2307–2314. IEEE Computer Society (2013)
Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 60(2), 91–110 (2004)
Tola, E., Lepetit, V., Fua, P.: DAISY: an efficient dense descriptor applied to wide-baseline stereo. IEEE Trans. Pattern Anal. Mach. Intell. 32(5), 815–830 (2010)
Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2005, vol. 1, pp. 886–893. IEEE (2005)
Rocco, I., Arandjelović, R., Sivic, J.: Convolutional neural network architecture for geometric matching. In: CVPR 2017-IEEE Conference on Computer Vision and Pattern Recognition (2017)
Rocco, I., Arandjelović, R., Sivic, J.: End-to-end weakly-supervised semantic alignment. arXiv preprint arXiv:1712.06861 (2017)
Han, K., et al.: SCNet: learning semantic correspondence. In: International Conference on Computer Vision (2017)
Thewlis, J., Zheng, S., Torr, P.H., Vedaldi, A.: Fully-trainable deep matching. arXiv preprint arXiv:1609.03532 (2016)
Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: CVPR 2009 (2009)
Ham, B., Cho, M., Schmid, C., Ponce, J.: Proposal flow. In: CVPR 2016-IEEE Conference on Computer Vision and Pattern Recognition, pp. 3475–3484. IEEE (2016)
Novotný, D., Albanie, S., Larlus, D., Vedaldi, A.: Self-supervised learning of geometrically stable features through probabilistic introspection. CoRR (2018)
Liu, C., Yuen, J., Torralba, A., Sivic, J., Freeman, W.T.: SIFT flow: dense correspondence across different scenes. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008. LNCS, vol. 5304, pp. 28–42. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-88690-7_3
Taniai, T., Sinha, S.N., Sato, Y.: Joint recovery of dense correspondence and cosegmentation in two images. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4246–4255 (2016)
Choy, C.B., Gwak, J., Savarese, S., Chandraker, M.: Universal correspondence network. In: Advances in Neural Information Processing Systems, pp. 2414–2422 (2016)
Kim, S., Min, D., Ham, B., Jeon, S., Lin, S., Sohn, K.: FCSS: fully convolutional self-similarity for dense semantic correspondence. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017)
Zhou, T., Krahenbuhl, P., Aubry, M., Huang, Q., Efros, A.A.: Learning dense correspondence via 3D-guided cycle consistency. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 117–126 (2016)
Zhou, T., Brown, M., Snavely, N., Lowe, D.G.: Unsupervised learning of depth and ego-motion from video. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6612–6619. IEEE (2017)
Nguyen, T., Chen, S.W., Skandan, S., Taylor, C.J., Kumar, V.: Unsupervised deep homography: a fast and robust homography estimation model. IEEE Robot. Autom. Lett. (2018)
Lai, W.S., Huang, J.B., Yang, M.H.: Semi-supervised learning for optical flow with generative adversarial networks. In: Advances in Neural Information Processing Systems, pp. 353–363 (2017)
Zhou, T., Jae Lee, Y., Yu, S.X., Efros, A.A.: FlowWeb: joint image set alignment by weaving consistent, pixel-wise correspondences. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1191–1200 (2015)
Zhou, X., Zhu, M., Daniilidis, K.: Multi-image matching via fast alternating minimization. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 4032–4040 (2015)
Laskar, Z., Melekhov, I., Kalia, S., Kannala, J.: Camera relocalization by computing pairwise relative poses using convolutional neural network
Melekhov, I., Ylioinas, J., Kannala, J., Rahtu, E.: Image-based localization using hourglass networks
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
Paszke, A., et al.: Automatic differentiation in PyTorch. In: NIPS-W (2017)
Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Laskar, Z., Kannala, J. (2019). Semi-supervised Semantic Matching. In: Leal-Taixé, L., Roth, S. (eds) Computer Vision – ECCV 2018 Workshops. ECCV 2018. Lecture Notes in Computer Science(), vol 11131. Springer, Cham. https://doi.org/10.1007/978-3-030-11015-4_32
Download citation
DOI: https://doi.org/10.1007/978-3-030-11015-4_32
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-11014-7
Online ISBN: 978-3-030-11015-4
eBook Packages: Computer ScienceComputer Science (R0)