Spatio-temporal SIFT and Its Application to Human Action Classification

  • Manal Al Ghamdi
  • Lei Zhang
  • Yoshihiko Gotoh
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7583)


This paper presents a space-time extension of the scale-invariant feature transform (SIFT), which was originally developed for 2-dimensional (2D) images. Most previous extensions handled 3-dimensional (3D) spatial information, combining a 2D detector with a 3D descriptor for applications such as medical image analysis. In this work we build a spatio-temporal difference-of-Gaussian (DoG) pyramid to detect local extrema, aiming at processing video streams. Interest points are extracted not only from the spatial plane (xy) but also from the planes along the time axis (xt and yt). The space-time extension was evaluated on the human action classification task. Experiments with the KTH and the UCF sports datasets show that the approach produces results comparable to the state of the art.
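The abstract's detection step (a spatio-temporal DoG pyramid whose local extrema become interest points) can be illustrated with a minimal sketch. This is not the authors' implementation: it approximates the pyramid with a single isotropic 3D Gaussian scale space over (t, y, x) rather than treating the xy, xt and yt planes separately, and the function name, scale values and threshold are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter, minimum_filter

def spatiotemporal_dog_extrema(video, sigmas=(1.0, 1.6, 2.56, 4.1), thresh=0.02):
    """Detect local extrema in a spatio-temporal DoG stack.

    video: float array of shape (t, y, x).
    Hypothetical sketch: one isotropic 3D Gaussian scale space stands in
    for the paper's per-plane (xy, xt, yt) pyramid construction.
    """
    video = video.astype(np.float64)
    # Gaussian scale space: smooth jointly over (t, y, x) at each sigma
    blurred = [gaussian_filter(video, sigma=s) for s in sigmas]
    # Difference-of-Gaussian stack between adjacent scales
    dog = np.stack([blurred[i + 1] - blurred[i] for i in range(len(sigmas) - 1)])
    # A voxel is a candidate if it is an extremum within its 3x3x3x3
    # neighbourhood across scale, time, y and x, and exceeds the threshold
    footprint = np.ones((3, 3, 3, 3))
    maxima = (dog == maximum_filter(dog, footprint=footprint)) & (dog > thresh)
    minima = (dog == minimum_filter(dog, footprint=footprint)) & (dog < -thresh)
    # Return (scale, t, y, x) coordinates of the surviving extrema
    return np.argwhere(maxima | minima)
```

A transient bright spot in an otherwise static clip, for example, produces a strong DoG response localised in both space and time, which this routine reports as an interest point.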


Action Recognition, Interest Point, Scale Invariant Feature Transform, Human Action Recognition, Gaussian Pyramid



Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Manal Al Ghamdi (1)
  • Lei Zhang (2)
  • Yoshihiko Gotoh (1)
  1. University of Sheffield, UK
  2. Harbin Engineering University, PRC
