Abstract
Annotating events in uncontrolled videos is a challenging task. Most previous work focuses on learning concepts from large numbers of labeled videos. However, collecting enough labeled videos to model events under various circumstances is extremely time-consuming and labor-intensive. In this paper, we aim to learn models for video event annotation by leveraging abundant Web images, which are a rich source of information: they cover many events captured under various conditions and are already roughly annotated. Our method is based on a new discriminative structural model, the Cross-Domain Structural Model (CDSM), which transfers knowledge from Web images (source domain) to consumer videos (target domain) by jointly modeling the interaction between videos and images. Specifically, under this framework we build a common feature subspace to handle the mismatch in feature distributions between the video domain and the image domain. Further, we propose using weak semantic attributes to describe events, which can be obtained with little or no labor. Experimental results on challenging video datasets demonstrate the effectiveness of our transfer learning method.
Acknowledgments
This work was partially supported by the National Natural Science Foundation of China (Grant nos. 60973059 and 81171407) and the Program for New Century Excellent Talents in University of China (Grant no. NCET-10-0044).
Cite this article
Wang, H., Liu, X., Wu, X. et al. Cross-domain structural model for video event annotation via web images. Multimed Tools Appl 74, 10439–10456 (2015). https://doi.org/10.1007/s11042-014-2175-z