Pyramidal Zernike Over Time: A Spatiotemporal Feature Descriptor Based on Zernike Moments

  • Igor L. O. Bastos
  • Larissa Rocha Soares
  • William Robson Schwartz
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10657)


This paper presents an approach to recognize human activities in videos through the application of Zernike invariant moments. Instead of computing regular Zernike moments on single frames, our technique, named Pyramidal Zernike Over Time (PZOT), builds a pyramidal structure and uses the Zernike responses at its different levels to associate subsequent frames, thereby adding temporal information. Finally, the feature responses are combined with Gabor filters to generate video descriptions. To evaluate the proposed approach, experiments were performed on the UCF Sports dataset using a standard protocol, achieving an accuracy of 86.05%, comparable to results achieved by other widely employed spatiotemporal feature descriptors in the literature.
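The building block of the approach is the Zernike moment of an image patch. The full PZOT pipeline (spatial pyramids per frame, temporal association across frames, and the Gabor-filter stage) is not reproduced here, but the underlying moment computation can be sketched as follows. This is a minimal NumPy implementation of the standard definition over the unit disk, not the authors' code; the function names are illustrative:

```python
import math
import numpy as np

def radial_poly(n, m, rho):
    """Zernike radial polynomial R_nm evaluated on an array of radii."""
    m = abs(m)
    out = np.zeros_like(rho)
    for s in range((n - m) // 2 + 1):
        c = ((-1) ** s * math.factorial(n - s)
             / (math.factorial(s)
                * math.factorial((n + m) // 2 - s)
                * math.factorial((n - m) // 2 - s)))
        out += c * rho ** (n - 2 * s)
    return out

def zernike_moment(img, n, m):
    """Magnitude of the Zernike moment Z_nm of a square grayscale patch,
    integrated over the unit disk inscribed in the patch. The magnitude
    is invariant to rotations of the patch, which is what makes Zernike
    moments attractive as shape descriptors."""
    h, w = img.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Map pixel coordinates to the square [-1, 1] x [-1, 1].
    x = (2 * xs - (w - 1)) / (w - 1)
    y = (2 * ys - (h - 1)) / (h - 1)
    rho = np.sqrt(x ** 2 + y ** 2)
    theta = np.arctan2(y, x)
    mask = rho <= 1.0                      # keep only the unit disk
    basis = radial_poly(n, m, rho) * np.exp(1j * m * theta)
    dA = 4.0 / ((w - 1) * (h - 1))         # area of one pixel cell
    z = (n + 1) / np.pi * np.sum(img[mask] * np.conj(basis[mask])) * dA
    return abs(z)
```

In a pyramidal scheme such as PZOT, moments like these would be computed at several resolutions of each frame and compared across subsequent frames; that temporal-association step and the final Gabor filtering are beyond this sketch.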


Keywords: Activity recognition · Feature extraction · Zernike moments



The authors would like to thank the Brazilian National Research Council – CNPq, the Minas Gerais Research Foundation – FAPEMIG (Grants APQ-00567-14 and PPM-00540-17) and the Coordination for the Improvement of Higher Education Personnel – CAPES (DeepEyes Project).



Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  1. Smart Surveillance Interest Group, Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
  2. Reuse in Software Engineering, Universidade Federal da Bahia, Salvador, Brazil
