Cross-domain structural model for video event annotation via web images

Wang, Han; Liu, Xiabi; Wu, Xinxiao; Jia, Yunde

doi:10.1007/s11042-014-2175-z

Published: 30 July 2014

Volume 74, pages 10439–10456, (2015)

Multimedia Tools and Applications


Abstract

Annotating events in uncontrolled videos is a challenging task. Most previous work focuses on learning concepts from large numbers of labeled videos, but collecting enough labeled videos to model events under varied circumstances is extremely time-consuming and labor-intensive. In this paper, we learn models for video event annotation by leveraging abundant Web images, which provide a rich source of information: they depict many events captured under various conditions and are already roughly annotated. Our method is based on a new discriminative structural model, the Cross-Domain Structural Model (CDSM), which transfers knowledge from Web images (source domain) to consumer videos (target domain) by jointly modeling the interaction between videos and images. Specifically, under this framework we build a common feature subspace to deal with the feature distribution mismatch between the video domain and the image domain. Further, we propose to use weak semantic attributes to describe events, which can be obtained with little or no labor. Experimental results on challenging video datasets demonstrate the effectiveness of our transfer learning method.
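
The abstract does not say how the common feature subspace is constructed, so the following is only a minimal sketch of the general idea, not the CDSM method itself. It assumes that image and video features share the same dimensionality, and it swaps in a simple stand-in technique: each domain is centered on its own mean, and both are projected onto the principal directions of the pooled data. The function name `common_subspace` and the synthetic feature arrays are hypothetical.

```python
import numpy as np

def common_subspace(source, target, dim):
    """Project two feature sets into one shared PCA subspace.

    Each domain is centered on its own mean first, which removes the
    first-order (mean) shift between the two feature distributions.
    This is an illustrative stand-in, not the method from the paper.
    """
    src_centered = source - source.mean(axis=0)
    tgt_centered = target - target.mean(axis=0)
    pooled = np.vstack([src_centered, tgt_centered])
    # Rows of vt are the principal directions of the pooled, centered data.
    _, _, vt = np.linalg.svd(pooled, full_matrices=False)
    basis = vt[:dim].T  # shape: (n_features, dim), orthonormal columns
    return src_centered @ basis, tgt_centered @ basis, basis

# Synthetic data standing in for real descriptors (assumption).
rng = np.random.default_rng(0)
image_feats = rng.normal(loc=0.0, size=(200, 16))  # hypothetical Web-image features
video_feats = rng.normal(loc=3.0, size=(50, 16))   # hypothetical video features, mean-shifted
img_z, vid_z, basis = common_subspace(image_feats, video_feats, dim=4)
```

After projection, both domains live in the same 4-dimensional space with zero mean, so a classifier trained on `img_z` can at least be applied to `vid_z`. Aligning only the means is a deliberately crude choice; it leaves higher-order differences between the domains untouched.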



Acknowledgments

This work was partially supported by the National Natural Science Foundation of China (Grant nos. 60973059, 81171407) and the Program for New Century Excellent Talents in University of China (Grant no. NCET-10-0044).

Author information

Authors and Affiliations

Beijing Laboratory of Intelligent Information Technology, School of Computer Science, Beijing Institute of Technology, Beijing, 100081, China
Han Wang, Xiabi Liu, Xinxiao Wu & Yunde Jia


Corresponding author

Correspondence to Xiabi Liu.


About this article

Cite this article

Wang, H., Liu, X., Wu, X. et al. Cross-domain structural model for video event annotation via web images. Multimed Tools Appl 74, 10439–10456 (2015). https://doi.org/10.1007/s11042-014-2175-z

Received: 26 November 2013
Revised: 03 June 2014
Accepted: 30 June 2014
Published: 30 July 2014
Issue Date: December 2015
DOI: https://doi.org/10.1007/s11042-014-2175-z
