The Visual Computer, Volume 34, Issue 3, pp 391–403

Motion keypoint trajectory and covariance descriptor for human action recognition

Original Article


Abstract

Human action recognition from videos is a challenging task in computer vision. In recent years, histogram-based descriptors calculated along dense trajectories have shown promising results for human action recognition, but they usually ignore the motion information of the tracked points, and the relationships between different motion variables are not well utilized. To address these issues, we propose a motion keypoint trajectory (MKT) approach and a trajectory-based covariance (TBC) descriptor, which is calculated along the motion keypoint trajectories. The proposed MKT approach tracks motion keypoints at multiple spatial scales and employs an optical flow rectification algorithm to reduce the influence of camera motion, and thus achieves better performance than the well-known improved dense trajectory (IDT) approach. In particular, MKT is faster than IDT, because it does not require human detection and extracts fewer trajectories. Furthermore, the TBC descriptor outperforms classical histogram-based descriptors such as the Histogram of Oriented Gradients (HOG), Histogram of Optical Flow (HOF) and Motion Boundary Histogram (MBH). Experimental results on three challenging datasets (Olympic Sports, HMDB51 and UCF50) demonstrate that our approach achieves better recognition performance than a number of state-of-the-art approaches.
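As a rough illustration of the kind of trajectory-based covariance descriptor the abstract describes, the sketch below computes a covariance matrix over per-point motion features sampled along a single trajectory and flattens it with a log-Euclidean mapping. The feature layout, the regularization constant and the function name are assumptions for illustration only, not the paper's actual implementation.

```python
import numpy as np

def tbc_descriptor(features, eps=1e-5):
    """Covariance descriptor of motion features sampled along one trajectory.

    features : (T, d) array -- one d-dimensional feature vector per tracked
               point (e.g. position, optical-flow components, gradients);
               the exact feature set is a hypothetical choice.
    Returns the upper-triangular entries of log(C) as a d*(d+1)/2 vector.
    """
    T, d = features.shape
    centered = features - features.mean(axis=0)
    cov = centered.T @ centered / max(T - 1, 1)
    cov += eps * np.eye(d)  # regularize so cov is strictly positive definite
    # Log-Euclidean mapping: the matrix logarithm maps the SPD manifold to a
    # vector space, so descriptors can be compared with Euclidean distance.
    w, V = np.linalg.eigh(cov)
    log_cov = (V * np.log(w)) @ V.T
    iu = np.triu_indices(d)
    return log_cov[iu]

traj_features = np.random.default_rng(0).standard_normal((15, 4))
descriptor = tbc_descriptor(traj_features)  # 4*(4+1)/2 = 10 values
```

Unlike a histogram, the covariance matrix captures pairwise correlations between motion variables; the log mapping is needed because SPD matrices do not form a vector space under ordinary addition.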


Keywords: Human action recognition · Motion keypoint trajectory · Optical flow rectification · Trajectory-based covariance descriptor



This work was supported in part by the National Natural Science Foundation of China under Grants 61472281 and 61622115, the Program for Professor of Special Appointment (Eastern Scholar) at Shanghai Institutions of Higher Learning (No. GZ2015005), and the NSF of Jiangxi Province under Grant 20161BAB202069.



Copyright information

© Springer-Verlag Berlin Heidelberg 2017

Authors and Affiliations

  1. Department of Computer Science and Technology, Tongji University, Shanghai, People's Republic of China
  2. Key Laboratory of Embedded System and Service Computing, Ministry of Education, Tongji University, Shanghai, People's Republic of China
  3. Department of Mathematics and Computer Science, Gannan Normal University, Ganzhou, People's Republic of China
