Abstract
The concept of geo-localization broadly refers to the process of determining an entity’s geographical location, typically in the form of Global Positioning System (GPS) coordinates. The entity of interest may be an image, a sequence of images, a video, a satellite image, or even objects visible within the image. Recently, massive datasets of GPS-tagged media have become available due to smartphones and the internet, and deep learning has risen to prominence and enhanced the performance capabilities of machine learning models. These developments have enabled the rise of image and object geo-localization, which has impacted a wide range of applications such as augmented reality, robotics, self-driving vehicles, road maintenance, and 3D reconstruction. This paper provides a comprehensive survey of visual geo-localization, which may involve either determining the location at which an image has been captured (image geo-localization) or geolocating objects within an image (object geo-localization). We will provide an in-depth study of visual geo-localization including a summary of popular algorithms, a description of proposed datasets, and an analysis of performance results to illustrate the current state of the field.
Similar content being viewed by others
Notes
References
Agarwal, S., Furukawa, Y., Snavely, N., Simon, I., Curless, B., Seitz, S. M., & Szeliski, R. (2011). Building rome in a day. Communications of the ACM, 54(10), 105–112. https://doi.org/10.1145/2001269.2001293
Almutairy, F., Alshaabi, T., Nelson, J., & Wshah, S. (2021). Arts: Automotive repository of traffic signs for the united states. Institute of Electrical and Electronics Engineers (IEEE) Transactions on Intelligent Transportation Systems, 22(1), 457–465. https://doi.org/10.1109/TITS.2019.2958486
Anguelov, D., Dulong, C., Filip, D., Frueh, C., Lafon, S., Lyon, R., & Weaver, J. (2010). Google street view: Capturing the world at street level. Institute of Electrical and Electronics Engineers (IEEE) Computer, 43(6), 32–38.
Ankerst, M., Breunig, M. M., Kriegel, H.-P., & Sander, J. (1999). Optics: Ordering points to identify the clustering structure. Proceedings of the 1999 ACM Sigmod International Conference on Management of Data (p. 49–60). Association for Computing Machinery. https://doi.org/10.1145/304182.304187
Arandjelović, R., Gronat, P., Torii, A., Pajdla, T., & Sivic, J. (2018). Netvlad: CNN architecture for weakly supervised place recognition. Institute of Electrical and Electronics Engineers (IEEE) Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 40(6), 1437–1451. https://doi.org/10.1109/TPAMI.2017.2711011
Baatz, G., Saurer, O., Köser, K., & Pollefeys, M. (2012). Large scale visual geo-localization of images in mountainous terrain. In Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y. & Schmid, C. (Eds.) Computer Vision—ECCV 2012 (pp. 517–530). Springer.
Baatz, G., Saurer, O., Köser, K., & Pollefeys, M. (2012). Leveraging topographic maps for image to terrain alignment (p. 487-492). https://doi.org/10.1109/3DIMPVT.2012.33
Bansal, M., & Daniilidis, K. (2014). Geometric urban geo-localization. In Institute of electrical and electronics engineers (ieee) conference on computer vision and pattern recognition (CVPR) (p. 3978–3985). https://doi.org/10.1109/CVPR.2014.508
Benbihi, A., Arravechia, S., Geist, M., & Pradalier, C. (2020). Image-based place recognition on bucolic environment across seasons from semantic edge description (pp. 3032–3038). https://doi.org/10.1109/ICRA40945.2020.9197529
Brejcha, J., & Cadik, M. (2017). Geopose3k: Mountain landscape dataset for camera pose estimation in outdoor environments. Image and Vision Computing, 66, 1. https://doi.org/10.1016/j.imavis.2017.05.009
Brejcha, J., & Čadík, M. (2017). State-of-the-art in visual geo-localization. Pattern Analysis and Applications, 20(3), 613–637.
Brejcha, J., Lukác, M., Chen, Z., DiVerdi, S., & Cadík, M. (2018). Immersive trip reports. In Proceedings of the 31st Annual ACM symposium on user interface software and technology (pp. 389–401). Association for Computing Machinery. https://doi.org/10.1145/3242587.3242653
Brejcha, J., Lukáč, M., Hold-Geoffroy, Y., Wang, O., & Cadik, M. (2020). Landscapear: Large scale outdoor augmented reality by matching photographs with terrain models using learned descriptors (pp. 295–312). https://doi.org/10.1007/978-3-030-58526-6_18
Brock, A., Donahue, J., & Simonyan, K. (2019). Large scale GAN training for high fidelity natural image synthesis. International conference on learning representations (ICLR).
Bromley, J., Bentz, J. W., Bottou, L., Guyon, I., LeCun, Y., Moore, C., & Shah, R. (1993). Signature verification using a “siamese’’ time delay neural network. International Journal of Pattern Recognition and Artificial Intelligence (IJPRAI), 7(04), 669–688.
Caesar, H., Bankiti, V., Lang, A.H., Vora, S., Liong, V.E., Xu, Q., & Beijbom, O. (2020). nuscenes: A multimodal dataset for autonomous driving. Institute of Electrical and Electronics Engineers (IEEE)/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 11618–11628).
Cai, S., Guo, Y., Khan, S., Hu, J., & Wen, G. (2019). Ground-to-aerial image geolocalization with a hard exemplar reweighting triplet loss. In Proceedings of the institute of electrical and electronics engineers (IEEE)/cvf international conference on computer vision (ICCV).
Castaldo, F., Zamir, A., Angst, R., Palmieri, F., & Savarese, S. (2015). Semantic crossview matching. In Proceedings of the institute of electrical and electronics engineers (IEEE) international conference on computer vision (ICCV) workshops.
Chaabane, M., Gueguen, L., Trabelsi, A., Beveridge, R., & O’Hara, S. (2021). End-to-end learning improves static object geo-localization from video. In Proceedings of the IEEE/CVF winter conference on applications of computer vision (WACV) (pp. 2063–2072).
Chen, D.M., Baatz, G., Köser, K., Tsai, S.S., Vedantham, R., Pylvänäinen, T., & Grzeszczuk, R. (2011). City-scale landmark identification on mobile devices. Computer vision and pattern recognition (CVPR) (pp. 737–744). https://doi.org/10.1109/CVPR.2011.5995610
Chen, W., Liu, Y., Wang, W., Bakker, E., Georgiou, T., Fieguth, P., & Lew, M. (2021). Deep image retrieval: A survey.
Chen, Y., Qian, G., Gunda, K., Gupta, H., & Shafique, K. (2015). Camera geolocation from mountain images. In 18th International Conference on Information Fusion (Fusion) (pp. 1587–1596).
Chopra, S., Hadsell, R., & LeCun, Y. (2005). Learning a similarity metric discriminatively, with application to face verification. In Institute of electrical and electronics engineers (IEEE) computer society conference on computer vision and pattern recognition (CVPR) (Vol. 1, pp. 539–546).
Clark, B., Kerrigan, A., Kulkarni, P., Cepeda, V., & Shah, M. (2023). Where we are and what we’re looking at: Query based worldwide image geo-localization using hierarchies and scenes. https://doi.org/10.48550/arXiv.2303.04249
Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., & Schiele, B. (2016). The cityscapes dataset for semantic urban scene understanding. In Proceedings of the institute of electrical and electronics engineers (IEEE) conference on computer vision and pattern recognition (CVPR).
Costea, D., & Leordeanu, M. (2016). Aerial image geolocalization from recognition and matching of roads and intersections. Richard, E. R. H., Wilson, C., & Smith, W. A. P. (Eds.) Proceedings of the british machine vision conference (bmvc) (pp. 118.1–118.12). BMVA Press. https://doi.org/10.5244/C.30.118
Cuturi, M. (2013). Sinkhorn distances: Lightspeed computation of optimal transport. Advances in Neural Information Processing Systems (NeurIPS), 26, 2292–2300.
Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. Institute of electrical and electronics engineers (IEEE) computer society conference on computer vision and pattern recognition (CVPR) (Vol. 1, pp. 886–893).
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). Imagenet: A large-scale hierarchical image database. Institute of electrical and electronics engineers (ieee) conference on computer vision and pattern recognition (cvpr) (pp. 248–255).
Dünser, A., Billinghurst, M., Wen, J., Lehtinen, V., & Nurminen, A. (2012). Exploring the use of handheld AR for outdoor navigation. Computers & Graphics, 36(8), 1084–1095.
Fu, C., Xiang, C., Wang, C., & Cai, D. (2019). Fast approximate nearest neighbor search with the navigating spreading-out graph. Proceedings of the VLDB Endowment, 12(5), 461–474. https://doi.org/10.14778/3303753.3303754
Gao, X., Shen, S., Hu, Z., & Wang, Z. (2019). Ground and aerial meta-data integration for localization and reconstruction: A review. Pattern Recognition Letters, 127, 202–214. https://doi.org/10.1016/j.patrec.2018.07.036
Geiger, A., Lenz, P., Stiller, C., & Urtasun, R. (2013). Vision meets robotics: The kitti dataset. International Journal of Robotics Research (IJRR).
Girshick, R. (2015). Fast r-cnn. Proceedings of the institute of electrical and electronics engineers (IEEE) international conference on computer vision (iccv) (pp. 1440–1448).
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., & Bengio, Y. (2014). Generative adversarial nets. Advances in Neural Information Processing Systems (NeurIPS), 27, 1.
Gu, Y., Wang, Y., & Li, Y. (2019). A survey on deep learning-driven remote sensing image scene understanding: Scene classification, scene retrieval and scene-guided object detection. https://doi.org/10.3390/app9102110
Haas, L., Alberti, S., & Skreta, M. (2023). Pigeon: Predicting image geolocations.
Hadsell, R., Chopra, S., & LeCun, Y. (2006). Dimensionality reduction by learning an invariant mapping. Institute of electrical and electronics engineers (IEEE) computer society conference on computer vision and pattern recognition (CVPR) (Vol. 2, pp. 1735–1742).
Hakeem, A., Vezzani, R., Shah, M., & Cucchiara, R. (2006). Estimating geospatial trajectory of a moving camera. In 18th International conference on pattern recognition (ICPR) (Vol. 2, pp. 82–87). https://doi.org/10.1109/ICPR.2006.499
Hartley, R., & Zisserman, A. (2003). Multiple view geometry in computer vision (2nd ed.). New York: Cambridge University Press.
Hartley, R. I., & Sturm, P. (1997). Triangulation. Computer Vision and Image Understanding, 68(2), 146–157. https://doi.org/10.1006/cviu.1997.0547
Hays, J., & Efros, A. (2015). Large-scale image geolocalization. Multimodal Location Estimation of Videos and Images, 1, 41–62. https://doi.org/10.1007/978-3-319-09861-6_3
Hays, J., & Efros, A. A. (2008). im2gps: Estimating geographic information from a single image. In Proceedings of the institute of electrical and electronics engineers (ieee) conference on computer vision and pattern recognition (cvpr).
Hu, S., Feng, M., Nguyen, R. M., & Lee, G. H. (2018). CVM-net: Cross-view matching network for image-based ground-to-aerial geo-localization. Proceedings of the institute of electrical and electronics engineers (IEEE) conference on computer vision and pattern recognition (cvpr) (pp. 7258–7267).
Isola, P., Zhu, J.-Y., Zhou, T., & Efros, A. A. (2017). Image-to-image translation with conditional adversarial networks. Proceedings of the institute of electrical and electronics engineers (IEEE) conference on computer vision and pattern recognition (cvpr).
Jégou, H., Douze, M., Schmid, C., & Pérez, P. (2010). Aggregating local descriptors into a compact image representation. Institute of electrical and electronics engineers (IEEE) computer society conference on computer vision and pattern recognition (cvpr) (pp. 3304–3311). https://doi.org/10.1109/CVPR.2010.5540039
Kalogerakis, E., Vesselova, O., Hays, J., Efros, A. A., & Hertzmann, A. (2009). Image sequence geolocation with human travel priors. In Institute of Electrical and Electronics Engineers (IEEE) 12th International Conference on Computer Vision (ICCV) (pp. 253–260).
Karras, T., Laine, S., & Aila, T. (2019). A style-based generator architecture for generative adversarial networks. Proceedings of the institute of electrical and electronics engineers (ieee)/cvf conference on computer vision and pattern recognition (cvpr).
Kendall, A., & Cipolla, R. (2016). Modelling uncertainty in deep learning for camera relocalization. Institute of electrical and electronics engineers (IEEE) international conference on robotics and automation (ICRA) (pp. 4762–4769). https://doi.org/10.1109/ICRA.2016.7487679
Kim, D.-K., & Walter, M. R. (2017). Satellite image-based localization via learned embeddings. In Institute of electrical and electronics engineers (IEEE) international conference on robotics and automation (ICRA) (pp. 2073–2080). https://doi.org/10.1109/ICRA.2017.7989239
Kim, H. J., Dunn, E., & Frahm, J.-M. (2015). Predicting good features for image geo-localization using per-bundle vlad. Institute of electrical and electronics engineers (IEEE) international conference on computer vision (ICCV) (pp. 1170–1178). https://doi.org/10.1109/ICCV.2015.139
Kim, H. J., Dunn, E., & Frahm, J.-M. (2017). Learned contextual feature reweighting for image geolocalization. In Institute of electrical and electronics engineers (IEEE) conference on computer vision and pattern recognition (CVPR) (pp. 3251–3260). https://doi.org/10.1109/CVPR.2017.346
Kim, J., Lee, J. K., & Lee, K. M. (2016). Accurate image super-resolution using very deep convolutional networks. 2016 IEEE conference on computer vision and pattern recognition (CVPR) (pp. 1646–1654). https://doi.org/10.1109/CVPR.2016.182
Knight, P. A. (2008). The Sinkhorn–Knopp algorithm: Convergence and applications. SIAM Journal on Matrix Analysis and Applications, 30(1), 261–275.
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems (NeurIPS), 25, 1097–1105.
Krylov, V. A., Kenny, E., & Dahyot, R. (2018). Automatic discovery and geotagging of objects from street view imagery. Remote Sensing, 10(5), 1. https://doi.org/10.3390/rs10050661
Lam, D., Kuzma, R., McGee, K., Dooley, S., Laielli, M., Klaric, M. K., & McCord, B. (2018). xview: Objects in context in overhead imagery. ArXiv arXiv:1802.07856.
Lazebnik, S., Schmid, C., & Ponce, J. (2006). Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Institute of electrical and electronics engineers (IEEE) computer society conference on computer vision and pattern recognition (CVPR) (Vol. 2, pp. 2169–2178). https://doi.org/10.1109/CVPR.2006.68
Lecun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324. https://doi.org/10.1109/5.726791
Ledig, C., Theis, L., Huszar, F., Caballero, J., Cunningham, A., Acosta, A., & Shi, W. (2017). Photo-realistic single image superresolution using a generative adversarial network. Proceedings of the institute of electrical and electronics engineers (IEEE) conference on computer vision and pattern recognition (cvpr).
Lin, T.-Y., Belongie, S., & Hays, J. (2013). Crossview image geolocalization. In Proceedings of the institute of electrical and electronics engineers (IEEE) conference on computer vision and pattern recognition (CVPR).
Lin, T.-Y., Cui, Y., Belongie, S., & Hays, J. (2015). Learning deep representations for ground-toaerial geolocalization. In Proceedings of the institute of electrical and electronics engineers (IEEE) conference on computer vision and pattern recognition (CVPR).
Lin, T.-Y., Goyal, P., Girshick, R., He, K., & Dollár, P. (2020). Focal loss for dense object detection. Institute of Electrical and Electronics Engineers (IEEE) Transactions on Pattern Analysis and Machine Intelligence (PAMI), 42(2), 318–327. https://doi.org/10.1109/TPAMI.2018.2858826
Liu, L., & Li, H. (2019). Lending orientation to neural networks for cross-view geolocalization. In Proceedings of the institute of electrical and electronics engineers (IEEE)/cvf conference on computer vision and pattern recognition (CVPR).
Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision (IJCV), 60(2), 91–110. https://doi.org/10.1023/B:VISI.0000029664.99615.94
Lu, X., Li, Z., Cui, Z., Oswald, M.R., Pollefeys, M., & Qin, R. (2020). Geometry-aware satellite-to-ground image synthesis for urban areas. Proceedings of the institute of electrical and electronics engineers (IEEE)/cvf conference on computer vision and pattern recognition (CVPR).
Martinson, E., Furlong, B., & Gillies, A. (2021). Training rare object detection in satellite imagery with synthetic gan images. In 2021 institute of electrical and electronics engineers (IEEE)/cvf conference on computer vision and pattern recognition workshops (cvprw) (pp. 2763–2770). https://doi.org/10.1109/CVPRW53098.2021.00311
Masone, C., & Caputo, B. (2021). A survey on deep visual place recognition. IEEE Access, 9, 19516–19547. https://doi.org/10.1109/ACCESS.2021.3054937
Matas, J., Chum, O., Urban, M., & Pajdla, T. (2004). Robust wide-baseline stereo from maximally stable extremal regions. Image and Vision Computing, 22(10), 761–767. https://doi.org/10.1016/j.imavis.2004.02.006
McManus, C., Churchill, W., Maddern, W., Stewart, A. D., & Newman, P. (2014). Shady dealings: Robust, long-term visual localisation using illumination invariance. Institute of electrical and electronics engineers (IEEE) international conference on robotics and automation (ICRA) (pp. 901–906). https://doi.org/10.1109/ICRA.2014.6906961
Mertan, A., Duff, D. J., & Unal, G. (2021). Single image depth estimation: An overview. ArXiv arXiv:2104.06456.
Middelberg, S., Sattler, T., Untzelmann, O., & Kobbelt, L. (2014). Scalable 6-dof localization on mobile devices. In Fleet, D., Pajdla, T., Schiele, B., & T. Tuytelaars (Eds.) European conference on computer vision (eccv) (pp. 268–283). Springer.
Mirza, M., & Osindero, S. (2014). Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784.
Muller-Budack, E., Pustu-Iren, K., & Ewerth, R. (2018). Geolocation estimation of photos using a hierarchical model and scene classification. Proceedings of the European conference on computer vision (ECCV).
Narzt, W., Pomberger, G., Ferscha, A., Kolb, D., Müller, R., Wieghardt, J., & Lindinger, C. (2006). Augmented reality navigation systems. Universal Access in the Information Society (UAIS), 4(3), 177–187.
Nassar, A. S., D’Aronco, S., Lefèvre, S., Wegner, J. D. (2020). Geograph: graph-based multi-view object detection with geometric cues end-toend. Vedaldi, A., Bischof, H., Brox, T., & Frahm, J.-M. (Eds.) European conference on computer vision (eccv) (pp. 488–504). Springer.
Nassar, A. S., Lefevre, S., Wegner, & J. D. (2019). Simultaneous multi-view instance detection with learned geometric soft-constraints. Proceedings of the IEEE/CVF international conference on computer vision (ICCV).
Neuhold, G., Ollmann, T., Bulò, S. R., & Kontschieder, P. (2017). The mapillary vistas dataset for semantic understanding of street scenes. In Institute of electrical and electronics engineers (IEEE) international conference on computer vision (ICCV) (pp. 5000–5009). https://doi.org/10.1109/ICCV.2017.534
Oliva, A., & Torralba, A. (2001). Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision (IJCV), 42(3), 145–175.
Pavan, M., & Pelillo, M. (2003). A new graphtheoretic approach to clustering and segmentation. Institute of electrical and electronics engineers (IEEE) computer society conference on computer vision and pattern recognition (CVPR) (Vol. 1, pp. I-I). https://doi.org/10.1109/CVPR.2003.1211348
Pavan, M., & Pelillo, M. (2007). Dominant sets and pairwise clustering. Institute of Electrical and Electronics Engineers (IEEE) Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 29(1), 167–172. https://doi.org/10.1109/TPAMI.2007.250608
Pearson, K. (1901). Liii. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11), 559–572. https://doi.org/10.1080/14786440109462720
Piasco, N., Sidibé, D., Demonceaux, C., & Gouet- Brunet, V. (2018). A survey on visual-based localization: On the benefit of heterogeneous data. Pattern Recognition, 74, 90–109. https://doi.org/10.1016/j.patcog.2017.09.013
Pramanick, S., Nowara, E.M., Gleason, J., Castillo, C.D., & Chellappa, R. (2022). Where in the world is this image? Transformer-based geo-localization in the wild. Avidan, S., Brostow, G., Cissé, M., Farinella, G. M. & Hassner, T. (Eds.) Computer vision—ECCV 2022 (pp. 196–215). Springer.
Pumarola, A., Agudo, A., Martinez, A. M., Sanfeliu, A., & Moreno-Noguer, F. (2018). Ganimation: Anatomically-aware facial animation from a single image. Proceedings of the european conference on computer vision (eccv) (pp. 818–833).
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., & Sutskever, I. (2021). Learning transferable visual models from natural language supervision. In Meila, M., & Zhang, T. (Eds.) Proceedings of the 38th international conference on machine learning, ICML 2021, 18–24 July 2021, virtual event (Vol. 139, pp. 8748–8763). PMLR. http://proceedings.mlr.press/v139/radford21a.html
Regmi, K., & Borji, A. (2018). Cross-view image synthesis using conditional gans. Proceedings of the institute of electrical and electronics engineers (IEEE) conference on computer vision and pattern recognition (CVPR).
Regmi, K., & Shah, M. (2019). Bridging the domain gap for ground-to-aerial image matching. In Proceedings of the institute of electrical and electronics engineers (IEEE)/CVF international conference on computer vision (ICCV).
Ren, X., Bo, L., & Fox, D. (2012). Rgb-(d) scene labeling: Features and algorithms. Institute of electrical and electronics engineers (IEEE) conference on computer vision and pattern recognition (CVPR) (pp. 2759–2766).
Rodrigues, R., & Tani, M. (2021). Are these from the same place? seeing the unseen in crossview image geo-localization. In Proceedings of the institute of electrical and electronics engineers (IEEE)/CVF winter conference on applications of computer vision (WACV) (pp. 3753–3761).
Roshan Zamir, A., Ardeshir, S., & Shah, M. (2014). Gps-tag refinement using random walks with an adaptive damping factor. In Proceedings of the institute of electrical and electronics engineers (IEEE) conference on computer vision and pattern recognition (CVPR).
Santana, L.V., Brandao, A.S., & Sarcinelli-Filho, M. (2015). Outdoor waypoint navigation with the ar. drone quadrotor. International conference on unmanned aircraft systems (ICUAS) (pp. 303–311).
Saputra, M. R. U., Markham, A., & Trigoni, N. (2018). Visual slam and structure from motion in dynamic environments. ACM Computing Surveys (CSUR), 51, 1–36.
Saurer, O., Baatz, G., Köser, K., Ladický, L., & Pollefeys, M. (2015). Image based geolocalization in the Alps. International Journal of Computer Vision, 116, 1. https://doi.org/10.1007/s11263-015-0830-0
Schroff, F., Kalenichenko, D., & Philbin, J. (2015). Facenet: A unified embedding for face recognition and clustering. Proceedings of the institute of electrical and electronics engineers (IEEE) conference on computer vision and pattern recognition (CVPR).
Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., & Batra, D. (2017). Gradcam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the institute of electrical and electronics engineers (IEEE) international conference on computer vision (ICCV).
Seo, P. H., Weyand, T., Sim, J., & Han, B. (2018). Cplanet: Enhancing image geolocalization by combinatorial partitioning of maps. In Ferrari, V., Hebert, M., Sminchisescu, C., & Weiss, Y. (Eds.) European conference on computer vision (ECCV) (pp. 544–560). Springer.
Shechtman, E., & Irani, M. (2007). Matching local self-similarities across images and videos. Institute of electrical and electronics engineers (IEEE) conference on computer vision and pattern recognition (CVPR) (pp. 1–8). https://doi.org/10.1109/CVPR.2007.383198
Shermeyer, J., & Etten, A. V. (2019). The effects of super-resolution on object detection performance in satellite imagery. IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2019, 1432–1441.
Shi, Y., Campbell, D., Yu, X., & Li, H. (2021). Geometry-guided street-view panorama synthesis from satellite imagery. arXiv preprint arXiv:2103.01623.
Shi, Y., Liu, L., Yu, X., & Li, H. (2019). Spatial-aware feature aggregation for image based cross-view geo-localization. Advances in Neural Information Processing Systems (NeurIPS), 32, 10090–10100.
Shi, Y., Yu, X., Campbell, D., & Li, H. (2020, June). Where am i looking at? Joint location and orientation estimation by cross-view matching. In Proceedings of the institute of electrical and electronics engineers (IEEE)/CVF conference on computer vision and pattern recognition (CVPR).
Shi, Y., Yu, X., Liu, L., Zhang, T., & Li, H. (2020). Optimal feature transport for cross-view image geo-localization. Proceedings of the Association for the Advancement of Artificial Intelligence (AAAI) Conference on Artificial Intelligence, 34(07), 11990–11997. https://doi.org/10.1609/aaai.v34i07.6875
Shi, Y., Yu, X., Wang, S., & Li, H. (2022). Cvlnet: Cross-view semantic correspondence learning for video-based camera localization. arXiv preprint arXiv:2208.03660.
Shrivastava, A., Malisiewicz, T., Gupta, A., & Efros, A. A. (2011). Data-driven visual similarity for cross-domain image matching. In Proceedings of the 2011 Siggraph Asia Conference. Association for Computing Machinery (ACM). https://doi.org/10.1145/2024156.2024188
Sinkhorn, R., & Knopp, P. (1967). Concerning nonnegative matrices and doubly stochastic matrices. Pacific Journal of Mathematics, 21(2), 343–348.
Suenderhauf, N., Shirazi, S., Jacobson, A., Dayoub, F., Pepperell, E., Upcroft, B., & Milford, M. (2015). Place recognition with convnet landmarks: Viewpoint-robust, condition-robust, training-free. Hsu, D. (Ed.) Robotics: Science and systems xi (pp. 1–10). Robotics: Science and Systems Conference.
Tang, H., Liu, H., Xu, D., Torr, P. H., & Sebe, N. (2021). Attentiongan: Unpaired image-to-image translation using attention-guided generative adversarial networks. Institute of Electrical and Electronics Engineers (IEEE) Transactions on Neural Networks and Learning Systems (TNNLS).
Tang, H., Xu, D., Sebe, N.,Wang, Y., Corso, J. J., & Yan, Y. (2019). Multi-channel attention selection gan with cascaded semantic guidance for cross-view image translation. Proceedings of the institute of electrical and electronics engineers (IEEE)/CVF conference on computer vision and pattern recognition (CVPR).
Thomee, B., Shamma, D. A., Friedland, G., Elizalde, B., Ni, K., Poland, D., & Li, L.-J. (2016). Yfcc100m: The new data in multimedia research. Commun. ACM, 59(2), 64–73. https://doi.org/10.1145/2812802
Tian, Y., Chen, C., & Shah, M. (2017). Cross-view image matching for geo-localization in urban environments. Proceedings of the institute of electrical and electronics engineers (IEEE) conference on computer vision and pattern recognition (CVPR).
Toker, A., Zhou, Q., Maximov, M., & Leal-Taixe, L. (2021). Coming down to earth: Satelliteto- street view synthesis for geo-localization. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (cvpr) (pp. 6488–6497).
Tomešek, J., Čadík, M., & Brejcha, J. (2022). Crosslocate: Cross-modal large-scale visual geolocalization in natural environments using rendered modalities. In 2022 IEEE/CVF winter conference on applications of computer vision (WACV) (pp. 2193–2202). https://doi.org/10.1109/WACV51458.2022.00225
Torii, A., Arandjelović, R., Sivic, J., Okutomi, M., & Pajdla, T. (2015). 24/7 place recognition by view synthesis. In Institute of electrical and electronics engineers (IEEE) conference on computer vision and pattern recognition (CVPR) (pp. 1808–1817). https://doi.org/10.1109/CVPR.2015.7298790
Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., & Jegou, H. (2021). Training data efficient image transformers distillation through attention. International Conference on Machine Learning, 139, 10347–10357.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., & Polosukhin, I. (2017). Attention is all you need. In Guyon, I. et al. (Eds.) Advances in neural information processing systems (Vol. 30). Curran Associates, Inc. https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
Verde, S., Resek, T., Milani, S., & Rocha, A. (2020). Ground-to-aerial viewpoint localization via landmark graphs matching. Institute of Electrical and Electronics Engineers (IEEE) Signal Processing Letters, 27, 1490–1494. https://doi.org/10.1109/LSP.2020.3017380
Vishal, K., Jawahar, C. V., & Chari, V. (2015). Accurate localization by fusing images and GPS signals. Proceedings of the institute of electrical and electronics engineers (IEEE) conference on computer vision and pattern recognition (CVPR) workshops.
Vo, N., & Hays, J. (2016). Localizing and orienting street views using overhead imagery. Leibe, B., Matas, J., Sebe, N., & Welling, M. (Eds.) European conference on computer vision (ECCV) (pp. 494–509). Springer.
Vo, N., Jacobs, N., & Hays, J. (2017). Revisiting im2gps in the deep learning era. In Proceedings of the institute of electrical and electronics engineers (IEEE) international conference on computer vision (ICCV).
Vyas, S., Chen, C., & Shah, M. (2022). Gama: Cross-view video geo-localization. Avidan, S., Brostow, G., Cissé, M., Farinella, G. M., & Hassner, T (Eds.) Computer vision—ECCV 2022 (pp. 440–456). Springer.
Wang, T., Zheng, Z., Yan, C., Zhang, J., Sun, Y., Zheng, B., & Yang, Y. (2021). Each part matters: Local patterns facilitate cross-view geo-localization. Institute of Electrical and Electronics Engineers (IEEE) Transactions on Circuits and Systems for Video Technology (TCSVT), 1-1. https://doi.org/10.1109/TCSVT.2021.3061265
Wang, X., Yu, K., Wu, S., Gu, J., Liu, Y., Dong, C., & Change Loy, C. (2018). Esrgan: Enhanced super-resolution generative adversarial networks. Proceedings of the European conference on computer vision (ECCV) workshops
Weyand, T., Kostrikov, I., & Philbin, J. (2016). Planet—photo geolocation with convolutional neural networks. In Leibe, B., Matas, J., Sebe, N., & Welling, W. (Eds.) European conference on computer vision (eccv) (pp. 37–55). Springer.
Wilson, D., Alshaabi, T., Oort, C. M. V., Zhang, X., Nelson, J., & Wshah, S. (2021). Object tracking and geo-localization from street images. CoRR arXiv:2107.06257.
Woo, S., Park, J., Lee, J.-Y., & Kweon, I.S. (2018). Cbam: Convolutional block attention module. In Proceedings of the European conference on computer vision (ECCV).
Workman, S., Souvenir, R., & Jacobs, N. (2015). Wide-area image geolocalization with aerial reference imagery. In Proceedings of the institute of electrical and electronics engineers (IEEE) international conference on computer vision (ICCV).
Xia, G.-S., Bai, X., Ding, J., Zhu, Z., Belongie, S., Luo, J., & Zhang, L. (2018). Dota: A large-scale dataset for object detection in aerial images. Proceedings of the institute of electrical and electronics engineers (IEEE) conference on computer vision and pattern recognition (CVPR) (pp. 3974–3983).
Xia, H., Zhao, H., & Ding, Z. (2021). Adaptive adversarial network for source-free domain adaptation. Proceedings of the IEEE/CVF international conference on computer vision (ICCV) (pp. 9010–9019).
Xiao, J., Hays, J., Ehinger, K.A., Oliva, A., & Torralba, A. (2010). Sun database: Large-scale scene recognition from abbey to zoo. Institute of electrical and electronics engineers (IEEE) computer society conference on computer vision and pattern recognition (CVPR) (p. 3485–3492). https://doi.org/10.1109/CVPR.2010.5539970
Yi, Z., Zhang, H., Tan, P., & Gong, M. (2017). Dualgan: Unsupervised dual learning for image-toimage translation. In Proceedings of the institute of electrical and electronics engineers (IEEE) international conference on computer vision (ICCV).
You, K., Long, M., Cao, Z., Wang, J., & Jordan, M. I. (2019). Universal domain adaptation. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (cvpr).
Zamir, A. R., & Shah, M. (2010). Accurate image localization based on google maps street view. In Daniilidis, K., Maragos, P., & Paragios, N. (Eds.) European conference on computer vision (eccv) (pp. 255–268). Springer.
Zamir, A. R., & Shah, M. (2014). Image geolocalization based on multiple nearest neighbor feature matching using generalized graphs. Institute of Electrical and Electronics Engineers (IEEE) Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 36(8), 1546–1558. https://doi.org/10.1109/TPAMI.2014.2299799
Zhai, M., Bessinger, Z., Workman, S., & Jacobs, N. (2017). Predicting ground-level scene layout from aerial imagery. In Proceedings of the ieee conference on computer vision and pattern recognition (cvpr).
Zhang, H., Berg, A., Maire, M., & Malik, J. (2006). Svm-knn: Discriminative nearest neighbor classification for visual category recognition. Institute of electrical and electronics engineers (IEEE) computer society conference on computer vision and pattern recognition (CVPR) (Vol. 2, pp. 2126–2136). https://doi.org/10.1109/CVPR.2006.301
Zhang, X., Li, X., Sultani, W., Zhou, Y., & Wshah, S. (2023). Cross-view geo-localization via learning disentangled geometric layout correspondence. In Proceedings of the AAAI Conference on Artificial Intelligence, 37(3), 3480–3488. https://doi.org/10.1609/aaai.v37i3.25457
Zhang, X., Sultani, W., & Wshah, S. (2023). Cross-view image sequence geo-localization. Proceedings of the IEEE/CVF winter conference on applications of computer vision (WACV) (pp. 2914–2923).
Zheng, L., Yang, Y., & Tian, Q. (2016). Sift meets CNN: A decade survey of instance retrieval. In IEEE Transactions on Pattern Analysis and Machine Intelligence. https://doi.org/10.1109/TPAMI.2017.2709749
Zheng, Z., Wei, Y., & Yang, Y. (2020). University- 1652: A multi-view multi-source benchmark for drone-based geo-localization. In Proceedings of the 28th acm international conference on multimedia (p. 1395–1403). Association for Computing Machinery. https://doi.org/10.1145/3394171.3413896
Zhou, B., Lapedriza, A., Xiao, J., Torralba, A., & Oliva, A. (2014). Learning deep features for scene recognition using places database. In Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N., & Weinberger, K. Q. (Eds.) Advances in neural information processing systems (neurips) (Vol. 27). Curran Associates, Inc.
Zhou, B., Liu, L., Oliva, A., & Torralba, A. (2014). Recognizing city identity via attribute analysis of geo-tagged images. In Fleet, D., Pajdla, T., Schiele, B., & Tuytelaars, T. (Eds.) European conference on computer vision (eccv) (pp. 519–534). Springer.
Zhu, J.-Y., Park, T., Isola, P., & Efros, A. A. (2017). Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the institute of electrical and electronics engineers (IEEE) international conference on computer vision (ICCV).
Zhu, J.-Y., Zhang, R., Pathak, D., Darrell, T., Efros, A. A., Wang, O., & Shechtman, E. (2017). Toward multimodal image-to-image translation. In Guyon, I. et al. (Eds.) Advances in neural information processing systems (Vol. 30, pp. 465–476). Curran Associates, Inc.
Zhu, S., Shah, M., & Chen, C. (2022). Transgeo: Transformer is all you need for cross view image geo-localization. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 1162–1171).
Zhu, S., Yang, T., & Chen, C. (2021a). Revisiting street-to-aerial view image geo-localization and orientation estimation. In Proceedings of the institute of electrical and electronics engineers (IEEE)/cvf winter conference on applications of computer vision (wacv) (pp. 756–765).
Zhu, S., Yang, T., & Chen, C. (2021b). Vigor: Cross-view image geo-localization beyond oneto- one retrieval. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 3640–3649).
Author information
Authors and Affiliations
Corresponding authors
Ethics declarations
Conflict of interest
The authors have no competing interests to declare that are relevant to the content of this article.
Additional information
Communicated by Ondra Chum.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
We have obtained the copyright for all figures used in this paper by purchasing all the applicable rights from their publishers.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Wilson, D., Zhang, X., Sultani, W. et al. Image and Object Geo-Localization. Int J Comput Vis 132, 1350–1392 (2024). https://doi.org/10.1007/s11263-023-01942-3
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11263-023-01942-3