Multimedia Tools and Applications, Volume 74, Issue 2, pp 523–542

Evaluation of semi-supervised learning method on action recognition

  • Haoquan Shen
  • Yan Yan
  • Shicheng Xu
  • Nicolas Ballas
  • Wenzhi Chen


Action recognition is one of the most difficult problems in computer vision and multimedia, since both spatial information and spatiotemporal semantics must be taken into account. Noisy and weakly annotated data make the task even harder. In recent years, many approaches that go beyond traditional features and classifiers have made action recognition more promising. However, no existing work compares different combinations of pooling and semi-supervised learning methods under the same experimental setting. We therefore apply such combinations to both synthetic and realistic action recognition datasets to see which combination or method performs better. Our experiments support the following conclusions. First, Second Order Pooling (Carreira et al. 2012) performs worse overall than the traditional Bag of Words (Schmid and Mohr 1997; Dance et al. 2004) on some datasets, but it is a good way to speed up the coding stage of video classification with little sacrifice of performance. Second, the semi-supervised Hierarchical Regression Algorithm (MLHR) and Manifold Regularized Least Square Regression (MRLS) (Belkin et al. 2006) outperform some supervised learning methods (χ²-SVM, SVM-2K (Farquhar et al. 2006)) on real-world action recognition problems, where little annotated information is available. Third, for the KTH, UCF50 and HMDB datasets, late fusion does not necessarily improve performance. By comparison, MLHR, SVM-2K and Multi-kernel Learning are more natural ways to deal with multi-feature problems.
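To make the pooling comparison concrete, the following is a minimal sketch, not the authors' implementation, of the two video-level encodings the abstract contrasts. It assumes local descriptors are already extracted as rows of a NumPy array. The function names, the `eps` floor on the eigenvalues, and the use of a matrix logarithm followed by upper-triangle flattening (in the spirit of the second-order average pooling of Carreira et al. 2012) are illustrative assumptions.

```python
import numpy as np


def bag_of_words(descriptors, codebook):
    """Hypothetical helper: L1-normalized histogram of nearest-codeword counts."""
    # Squared Euclidean distance from every descriptor to every codeword.
    diff = descriptors[:, None, :] - codebook[None, :, :]
    d2 = (diff ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)
    hist = np.bincount(nearest, minlength=codebook.shape[0]).astype(float)
    return hist / hist.sum()


def second_order_avg_pooling(descriptors, eps=1e-6):
    """Hypothetical helper: average outer product, log-mapped and flattened."""
    n, d = descriptors.shape
    g = descriptors.T @ descriptors / n      # d x d symmetric PSD matrix
    w, v = np.linalg.eigh(g)                 # eigendecomposition of g
    w = np.maximum(w, eps)                   # assumed floor so log is defined
    log_g = (v * np.log(w)) @ v.T            # matrix logarithm of g
    return log_g[np.triu_indices(d)]         # keep the upper triangle only


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(200, 8))        # toy descriptors, not real data
    words = rng.normal(size=(16, 8))         # toy codebook, normally from k-means
    print(bag_of_words(feats, words).shape)
    print(second_order_avg_pooling(feats).shape)
```

In this sketch, Bag of Words needs a learned codebook and a nearest-codeword search for every descriptor, while second-order pooling needs only one matrix product and one eigendecomposition per video. That difference is one plausible reading of why the abstract reports a faster coding stage.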


Keywords: Semi-supervised learning; Multi-feature fusion; Second order pooling; Video concept annotation; Action recognition


  1. Belkin M, Niyogi P, Sindhwani V (2006) Manifold regularization: a geometric framework for learning from labeled and unlabeled examples. J Mach Learn Res 7:2399–2434
  2. Carreira J, Caseiro R, Batista J, Sminchisescu C (2012) Semantic segmentation with second-order pooling. In: ECCV
  3. Chen M, Hauptmann A (2009) Mosift, recognizing human actions in surveillance videos
  4. Dance C, Willamowski J, Fan L, Bray C, Csurka G (2004) Visual categorization with bags of keypoints. In: ECCV SLCV workshop
  5. Farquhar JDR, Meng H, Szedmak S, Hardoon DR, Shawe-Taylor J (2006) Two view learning: svm-2k, theory and practice. In: Advances in neural information processing systems. MIT Press
  6. Han Y, Xu Z, Ma Z, Huang Z (2013) Image classification with manifold learning for out-of-sample data. Signal Process 93(8):2169–2177
  7. Han Y, Yang Y, Ma Z, Shen H, Sebe N, Zhou X (2014) Image attribute adaptation. IEEE Trans Multimed. doi:10.1109/TMM.2014.2306092
  8. Han Y, Zhang J, Xu Z, Yu S (2013) Discriminative multi-task feature selection. In: AAAI
  9. Hotelling H (1936) Relations between two sets of variates. Biometrika 28(3):321–377
  10. Kuehne H, Jhuang H, Garrote E, Poggio T, Serre T (2011) Hmdb: a large video database for human motion recognition. In: ICCV
  11. Lan Z, Bao L, Yu S, Liu W, Hauptmann A (2012) Double fusion for multimedia event detection. In: ACM MM
  12. Lew M, Sebe N, Djeraba C, Jain R (2006) Content-based multimedia information retrieval: state-of-the-art and challenges. ACM Trans Multimed Comput Commun Appl 2(1):1–19
  13. Ma Z, Nie F, Yang Y, Uijlings J, Sebe N, Hauptmann AG (2012) Discriminating joint feature analysis for multimedia data understanding. IEEE Trans Multimed 14(6):1662–1672
  14. Ma Z, Yang Y, Cai Y, Sebe N, Hauptmann A (2012) Transfer knowledge adaptation for ad hoc multimedia event detection with few exemplars. In: ACM MM
  15. Reddy K, Shah M (2012) Recognizing 50 human action categories of web videos. In: MVAP
  16. Schmid C, Mohr R (1997) Local grayvalue invariants for image retrieval. In: TPAMI
  17. Schuldt C, Laptev I, Caputo B (2004) Recognizing human actions: a local svm approach. In: ICPR
  18. Snoek C, Worring M, Smeulders A (2005) Early versus late fusion in semantic video analysis. In: ACM MM
  19. Sonnenburg S, Rätsch G, Schäfer C, Schölkopf B (2006) Large scale multiple kernel learning. J Mach Learn Res 7:1531–1565
  20. Vinokourov A, Shawe-Taylor J, Cristianini N (2002) Inferring a semantic representation of text via cross-language correlation analysis
  21. Wang H, Kläser A, Schmid C, Liu C (2011) Action recognition by dense trajectories. In: CVPR
  22. Xu Z, Yang Y, Tsang I, Sebe N, Hauptmann A (2013) Feature weighting via optimal thresholding for video analysis. In: ICCV
  23. Yan R (2006) Probabilistic latent query analysis for combining multiple retrieval sources. In: Proceedings of the 29th international ACM SIGIR conference. ACM Press, pp 324–331
  24. Yan Y, Xu Z, Liu G, Ma Z, Sebe N (2013) Glocal structural feature selection with sparsity for multimedia data understanding. In: ACM MM
  25. Yang Y, Ma Z, Hauptmann A, Sebe N (2013) Feature selection for multimedia analysis by sharing information among multiple tasks. IEEE Trans Multimed 15(3):321–377
  26. Yang Y, Nie F, Xu D, Luo J, Zhuang Y, Pan Y (2012) A multimedia retrieval framework based on semi-supervised ranking and relevance feedback. IEEE Trans Pattern Anal Mach Intell 34(4):723–742
  27. Yang Y, Song J, Huang Z, Ma Z, Sebe N, Hauptmann A (2013) Multi-feature fusion via hierarchical regression for multimedia analysis. IEEE Trans Multimed 15(3):572–581
  28. Yang Y, Xu D, Nie F, Luo J, Zhuang Y (2009) Ranking with local regression and global alignment for cross media retrieval. In: ACM MM
  29. Yang Y, Zhuang Y, Wu F, Pan Y (2008) Harmonizing hierarchical manifolds for multimedia document semantics understanding and cross-media retrieval. IEEE Trans Multimed 10(3):437–446
  30. Zhan Y, Sun J, Niu D, Mao Q, Fan J (2014) A semi-supervised incremental learning method based on adaptive probabilistic hypergraph for video semantic detection. Multimed Tools Appl
  31. Zhou D, Bousquet O, Lal TN, Weston J, Schölkopf B (2004) Learning with local and global consistency. In: Advances in neural information processing systems, vol 16. MIT Press, pp 321–328

Copyright information

© Springer Science+Business Media New York 2014

Authors and Affiliations

  • Haoquan Shen (1)
  • Yan Yan (2)
  • Shicheng Xu (1)
  • Nicolas Ballas (3)
  • Wenzhi Chen (1)

  1. Department of Computer Science, Zhejiang University, Zhejiang, China
  2. Department of Information Engineering and Computer Science, University of Trento, Trento, Italy
  3. CEA and Mines-ParisTech, Paris, France
