Cross-domain structural model for video event annotation via web images

Multimedia Tools and Applications

Abstract

Annotating events in uncontrolled videos is a challenging task. Most previous work focuses on learning event concepts from numerous labeled videos, but collecting the large amount of labeled video required to model events under various circumstances is extremely time-consuming and labor-intensive. In this paper, we learn models for video event annotation by leveraging abundant Web images, which provide a rich source of information: they cover many events captured under diverse conditions and are already roughly annotated. Our method is based on a new discriminative structural model, the Cross-Domain Structural Model (CDSM), which transfers knowledge from Web images (the source domain) to consumer videos (the target domain) by jointly modeling the interaction between videos and images. Specifically, under this framework we build a common feature subspace to handle the feature distribution mismatch between the video domain and the image domain. Further, we propose to describe events with weak semantic attributes, which can be obtained with little or no manual labor. Experimental results on challenging video datasets demonstrate the effectiveness of our transfer learning method.
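
The full CDSM formulation is not reproduced on this page, but the abstract's core idea of projecting both domains into a common feature subspace and then reusing image-trained classifiers on videos can be illustrated with a minimal sketch. The sketch below substitutes off-the-shelf CCA and a linear SVM on synthetic data; the pairing of image and video features, the descriptor dimensions, and the classifier choice are all illustrative assumptions and not the authors' actual model.

```python
# Illustrative sketch only: the paper's CDSM jointly models videos and images
# with a discriminative structural model; here the "common feature subspace"
# idea is approximated with CCA plus a linear SVM. All data is synthetic and
# every dimension/parameter below is an assumption for illustration.
import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

n_pairs = 200          # hypothetical image/video feature pairs used to fit the subspace
d_img, d_vid = 500, 300  # assumed descriptor dimensions (e.g., bag-of-words histograms)
k = 20                   # dimensionality of the shared subspace

# Source domain: loosely labeled Web-image features; target domain: video features.
X_img = rng.normal(size=(n_pairs, d_img))
X_vid = rng.normal(size=(n_pairs, d_vid))
y_img = rng.integers(0, 2, size=n_pairs)   # weak event labels on the Web images

# 1) Learn projections that map both domains into a common subspace.
cca = CCA(n_components=k)
cca.fit(X_img, X_vid)
Z_img, Z_vid = cca.transform(X_img, X_vid)

# 2) Train an event classifier on projected source-domain (image) features.
clf = LinearSVC(C=1.0).fit(Z_img, y_img)

# 3) Annotate target-domain videos with the classifier learned from images.
video_event_scores = clf.decision_function(Z_vid)
print(video_event_scores[:5])
```

In the paper itself the common subspace and the event model are learned jointly within the structural framework, and events are further described by weak semantic attributes; those components are omitted here for brevity.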

Acknowledgments

This work was partially supported by the National Natural Science Foundation of China (Grant nos. 60973059 and 81171407) and the Program for New Century Excellent Talents in University of China (Grant no. NCET-10-0044).

Author information

Correspondence to Xiabi Liu.

About this article

Cite this article

Wang, H., Liu, X., Wu, X. et al. Cross-domain structural model for video event annotation via web images. Multimed Tools Appl 74, 10439–10456 (2015). https://doi.org/10.1007/s11042-014-2175-z

