
UniMod1K: Towards a More Universal Large-Scale Dataset and Benchmark for Multi-modal Learning

Published in: International Journal of Computer Vision

Abstract

The emergence of large-scale, high-quality datasets has stimulated the rapid development of deep learning in recent years. However, most computer vision tasks focus on the visual modality only, resulting in a large imbalance in the amount of annotated data available for other modalities. While several multi-modal datasets have been made available, the majority are confined to only two modalities and serve a single specific computer vision task. To redress this data deficiency for multi-modal learning and applications, a new dataset named UniMod1K is presented in this work. UniMod1K covers three data modalities: vision, depth, and language. For the vision and depth modalities, the dataset contains 1050 RGB-D sequences, comprising approximately 2.5 million frames in total. For the language modality, it includes 1050 sentences, one describing the target object in each video. To demonstrate the advantages of training on a larger multi-modal dataset such as UniMod1K, and to stimulate research enabled by the dataset, we address several multi-modal tasks, namely multi-modal object tracking and monocular depth estimation. To establish performance baselines, we propose novel baseline methods for RGB-D object tracking, vision-language tracking, and vision-depth-language tracking, and we conduct comprehensive experiments for each of these tasks. The results highlight the potential of UniMod1K to improve the performance of multi-modal approaches. The dataset and code are available at https://github.com/xuefeng-zhu5/UniMod1K.
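To give a concrete picture of how the three modalities pair up within a sequence, the following minimal Python sketch iterates over the colour frames, depth maps, and language description of a single video. The directory layout (color/, depth/, nlp.txt), the file formats, and the sequence name used here are assumptions made purely for illustration; consult the GitHub repository linked above for the actual structure and loading tools.

    from pathlib import Path

    import cv2  # pip install opencv-python


    def load_sequence(seq_dir):
        """Yield (rgb, depth, description) triplets for one sequence (hypothetical layout)."""
        seq = Path(seq_dir)
        # One natural-language sentence describing the target object in this video.
        description = (seq / "nlp.txt").read_text().strip()
        color_frames = sorted((seq / "color").glob("*.jpg"))
        depth_frames = sorted((seq / "depth").glob("*.png"))
        for rgb_path, depth_path in zip(color_frames, depth_frames):
            rgb = cv2.imread(str(rgb_path), cv2.IMREAD_COLOR)
            depth = cv2.imread(str(depth_path), cv2.IMREAD_UNCHANGED)  # 16-bit depth map assumed
            yield rgb, depth, description


    if __name__ == "__main__":
        for rgb, depth, text in load_sequence("UniMod1K/sequence_0001"):
            print(rgb.shape, depth.shape, text)
            break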


Data Availability

The datasets generated and/or analysed as part of the reported study are available from the corresponding author on request.


Acknowledgements

This work was supported by the National Key Research and Development Program of China under Grants 2023YFF1105102 and 2023YFF1105105, by the National Natural Science Foundation of China (62020106012, 62332008, 62336004, 62106089, U1836218), the 111 Project of Ministry of Education of China (B12018), and in part by EPSRC/dstl/MURI project EP/R018456/1, and EPSRC grants MVSE (EP/V002856/1) and JADE2 (EP/T022205/1).

Author information

Corresponding author

Correspondence to Xiao-Jun Wu.

Additional information

Communicated by Massimiliano Mancini.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Zhu, XF., Xu, T., Liu, Z. et al. UniMod1K: Towards a More Universal Large-Scale Dataset and Benchmark for Multi-modal Learning. Int J Comput Vis (2024). https://doi.org/10.1007/s11263-024-01999-8


  • DOI: https://doi.org/10.1007/s11263-024-01999-8
