Video Activity Recognition Using Sequence Kernel Based Support Vector Machines

  • Sony S. Allappa
  • Veena Thenkanidiyoor
  • Dileep Aroor Dinesh
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11351)


This paper addresses issues in performing video activity recognition using support vector machines (SVMs). A video comprises a sequence of sub-activities, where each sub-activity corresponds to a segment of the video. To build an activity recognizer, each segment is encoded into a feature vector, so a video is represented as a sequence of feature vectors. In this work, we propose to explore a GMM-based encoding scheme to encode a video segment into a bag-of-visual-words vector representation. We also propose to use the Fisher score vector as an encoded representation for a video segment. Building an SVM-based activity recognizer requires suitable kernels that match sequences of feature vectors. Such kernels are called sequence kernels. In this work, we propose different sequence kernels, such as the modified time flexible kernel, segment level pyramid match kernel, segment level probability sequence kernel and segment level Fisher kernel, for matching videos when segments are represented using an encoded feature vector representation. The effectiveness of the proposed sequence kernels in SVM-based activity recognition is studied using benchmark datasets.
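The segment encoding step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a GMM with diagonal covariances whose parameters have already been estimated, it interprets the GMM-based bag-of-visual-words encoding as the average soft assignment (posterior probability) of a segment's local descriptors to the GMM components, and it assumes fixed-length segments. The function names `gmm_responsibilities`, `encode_segment` and `encode_video`, and the parameter `segment_len`, are hypothetical.

```python
import numpy as np


def gmm_responsibilities(X, weights, means, variances):
    """Posterior probability of each GMM component for each row of X.

    X: (N, d) local descriptors; weights: (Q,); means, variances: (Q, d).
    Diagonal covariances are an assumption of this sketch.
    """
    diff = X[:, None, :] - means[None, :, :]                      # (N, Q, d)
    log_pdf = -0.5 * (
        np.sum(diff ** 2 / variances[None, :, :], axis=2)
        + np.sum(np.log(2.0 * np.pi * variances), axis=1)[None, :]
    )                                                             # (N, Q)
    log_joint = np.log(weights)[None, :] + log_pdf
    log_joint -= log_joint.max(axis=1, keepdims=True)             # numerical stability
    resp = np.exp(log_joint)
    return resp / resp.sum(axis=1, keepdims=True)


def encode_segment(X, weights, means, variances):
    """Soft-assignment bag-of-visual-words vector for one segment (sums to 1)."""
    resp = gmm_responsibilities(X, weights, means, variances)
    return resp.sum(axis=0) / X.shape[0]


def encode_video(descriptors, segment_len, weights, means, variances):
    """Split a video's descriptors into fixed-length segments and encode each.

    Returns an array of shape (num_segments, Q): the sequence of feature vectors.
    """
    segments = [
        descriptors[i:i + segment_len]
        for i in range(0, len(descriptors), segment_len)
    ]
    return np.stack(
        [encode_segment(s, weights, means, variances) for s in segments]
    )


# Usage with synthetic data: 10 descriptors of dimension 3, a 2-component GMM.
rng = np.random.default_rng(0)
descriptors = rng.normal(size=(10, 3))
weights = np.array([0.4, 0.6])
means = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])
variances = np.ones((2, 3))
video_encoding = encode_video(descriptors, 4, weights, means, variances)
```

Each row of `video_encoding` is one encoded segment, so videos of different lengths yield sequences with different numbers of rows. This is the setting in which a sequence kernel, rather than a kernel on fixed-length vectors, is needed.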


Keywords: Video activity recognition; Gaussian mixture model based encoding; Fisher score vector; Support vector machine; Time flexible kernel; Modified time flexible kernel; Segment level pyramid match kernel; Segment level probability sequence kernel; Segment level Fisher kernel



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Sony S. Allappa (1)
  • Veena Thenkanidiyoor (1), corresponding author
  • Dileep Aroor Dinesh (2)

  1. National Institute of Technology Goa, Farmagudi, India
  2. Indian Institute of Technology Mandi, Mandi, India
