
Multimedia Tools and Applications, Volume 77, Issue 3, pp 3209–3227

Complex event detection via attention-based video representation and classification

Abstract

As an important task in managing unconstrained web videos, multimedia event detection (MED) has recently attracted wide attention. However, MED remains quite challenging due to complexities such as the high abstraction level of events, varied scenes, and frequent interactions among individuals. In this paper, we propose a novel MED algorithm based on attention-based video representation and classification. First, inspired by the human selective attention mechanism, an attention-based saliency localization network (ASLN) is constructed to quickly predict the semantically salient objects in video frames. Second, to represent the salient objects and their surroundings in a complementary way, two Convolutional Neural Network (CNN) features, a local saliency feature and a global feature, are extracted from the salient objects and the whole feature map, respectively. Third, the two features are concatenated and encoded into the video representation using the Vector of Locally Aggregated Descriptors (VLAD). Finally, linear Support Vector Machine (SVM) classifiers are trained for event classification. We extensively evaluate the performance on the TRECVID MED14_10Ex, MED14_100Ex, and Columbia Consumer Video (CCV) datasets. Experimental results show that the proposed single model outperforms state-of-the-art approaches on all three real-world video datasets, demonstrating its effectiveness.
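
To make the encoding and classification stages above concrete, the following Python sketch shows how per-frame descriptors could be VLAD-encoded into a single video-level vector and fed to a linear SVM. This is a minimal sketch, not the authors' implementation: the ASLN saliency network and the CNN feature extractor are stubbed out with random descriptors, and the codebook size K and the k-means codebook learning are illustrative assumptions.

```python
# Minimal sketch of the encoding/classification stages: VLAD-pool per-frame
# CNN descriptors into one video-level vector, then train a linear SVM.
# The extraction stage (ASLN + CNN) is replaced by random data below.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC


def vlad_encode(descriptors, codebook):
    """Aggregate N x D descriptors into a K*D VLAD vector.

    Each descriptor's residual to its nearest codeword is accumulated,
    followed by signed square-root (power) and L2 normalisation.
    """
    K, D = codebook.shape
    # Hard-assign each descriptor to its nearest codeword.
    dists = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    assignments = dists.argmin(axis=1)
    vlad = np.zeros((K, D))
    for k in range(K):
        members = descriptors[assignments == k]
        if len(members):
            vlad[k] = (members - codebook[k]).sum(axis=0)
    vlad = vlad.ravel()
    vlad = np.sign(vlad) * np.sqrt(np.abs(vlad))   # power normalisation
    norm = np.linalg.norm(vlad)
    return vlad / norm if norm > 0 else vlad


# Toy usage: random 64-d "frame descriptors" stand in for CNN features.
rng = np.random.default_rng(0)
K = 16                                             # codebook size (assumed)
videos = [rng.normal(size=(30, 64)) for _ in range(20)]
labels = rng.integers(0, 2, size=20)               # binary event labels

# Learn the VLAD codebook with k-means over all training descriptors.
codebook = KMeans(n_clusters=K, n_init=4, random_state=0).fit(
    np.vstack(videos)).cluster_centers_
X = np.stack([vlad_encode(v, codebook) for v in videos])
clf = LinearSVC(C=1.0).fit(X, labels)              # linear SVM classifier
```

In the paper's pipeline, each descriptor would be the concatenation of the local saliency feature and the global CNN feature of one frame, and one binary SVM would be trained per event class; the random data above exercises only the encoding and training path.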

Keywords

Multimedia event detection · Visual attention · Salient object · VLAD

Acknowledgements

This work is supported by the National Natural Science Foundation of China under Grants 61471049, 61372169 and 61532018.


Copyright information

© Springer Science+Business Media, LLC 2017

Authors and Affiliations

  1. School of Information and Communication Engineering, Beijing University of Posts and Telecommunications, Beijing, China
  2. Beijing Key Laboratory of Network System and Network Culture, Beijing University of Posts and Telecommunications, Beijing, China
