Abstract
Visual localization is a critical technology in visual SLAM systems: it determines the relative position and motion trajectory by tracking feature points. In recent years, deep learning has been widely applied to visual localization. Deep-learning-based methods can overcome the limitations of traditional hand-crafted feature extraction and achieve high-precision visual localization in complex scenes, advancing the goal of lifelong SLAM. The MLP model is flexible and adaptable. Mixer-WMLP exchanges token information between spatial positions by evenly dividing the feature map into non-overlapping windows, giving it an approximately global receptive field. Compared with CNNs and Transformers, Mixer MLPs offer higher computational efficiency and robustness. In this paper, we use the Mixer MLP structure to design a deep-learning-based visual odometry system called MAIM-VO, which achieves high-quality matching even in complex scenes with low-texture areas. After the matched point pairs are obtained, the camera pose is solved by minimizing the reprojection error of the feature points. Experiments on multiple datasets and in real-world environments demonstrate that MAIM-VO achieves higher robustness and relative localization accuracy than currently popular visual SLAM systems.
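The window-based token mixing mentioned above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the window size, channel count, and MLP widths are hypothetical, and the MLP weights are random placeholders. It only shows the core idea of partitioning a feature map into non-overlapping windows and mixing information across the spatial positions within each window.

```python
import numpy as np

def window_partition(feat, win):
    """Split an (H, W, C) feature map into non-overlapping win x win windows.
    Returns (num_windows, win*win, C): spatial tokens grouped per window."""
    H, W, C = feat.shape
    assert H % win == 0 and W % win == 0, "feature map must tile evenly"
    x = feat.reshape(H // win, win, W // win, win, C)
    x = x.transpose(0, 2, 1, 3, 4)           # (H/win, W/win, win, win, C)
    return x.reshape(-1, win * win, C)

def token_mix(windows, W1, b1, W2, b2):
    """Token-mixing MLP: mixes information across the win*win spatial
    positions inside each window, applied independently per channel."""
    x = windows.transpose(0, 2, 1)           # (N, C, tokens)
    x = np.maximum(x @ W1 + b1, 0.0)         # hidden layer with ReLU
    x = x @ W2 + b2                          # project back to token dimension
    return x.transpose(0, 2, 1)              # (N, tokens, C)

# Toy example: an 8x8 map with 8 channels, 4x4 windows -> 4 windows of 16 tokens.
rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 8, 8))
wins = window_partition(feat, 4)
tokens = wins.shape[1]                       # 16 tokens per window
W1 = rng.standard_normal((tokens, 32)); b1 = np.zeros(32)
W2 = rng.standard_normal((32, tokens)); b2 = np.zeros(tokens)
mixed = token_mix(wins, W1, b1, W2, b2)
print(wins.shape, mixed.shape)               # (4, 16, 8) (4, 16, 8)
```

Because every token inside a window attends to every other token in that window through the shared MLP, stacking such layers (with shifted or alternating partitions) lets information propagate across the whole feature map, approximating a global receptive field at MLP cost.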
Acknowledgments
This work was supported by the Institute of Robotics and Intelligent Manufacturing Innovation, Chinese Academy of Sciences (Grant number: C2021002).
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Shen, Z., Kong, B. (2023). MAIM-VO: A Robust Visual Odometry with Mixed MLP for Weak Textured Environment. In: Yongtian, W., Lifang, W. (eds) Image and Graphics Technologies and Applications. IGTA 2023. Communications in Computer and Information Science, vol 1910. Springer, Singapore. https://doi.org/10.1007/978-981-99-7549-5_6
DOI: https://doi.org/10.1007/978-981-99-7549-5_6
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-7548-8
Online ISBN: 978-981-99-7549-5