Systematic Evaluation of Spatio-Temporal Features on Comparative Video Challenges

  • Julian Stöttinger
  • Bogdan Tudor Goras
  • Thomas Pönitz
  • Allan Hanbury
  • Nicu Sebe
  • Theo Gevers
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6468)


The last decade has seen great interest in the evaluation of local visual features in the image domain. The aim is to guide researchers in selecting the best approaches for new applications and datasets. Most state-of-the-art features have been extended to the temporal domain to allow for video retrieval and categorization using techniques similar to those used for images. However, there is no comprehensive evaluation of these extensions. We provide the first comparative evaluation based on isolated and well-defined alterations of video data. We select the three most promising detectors, namely Harris3D, Hessian3D, and Gabor, and the three most promising descriptors, namely HOG/HOF, SURF3D, and HOG3D. To evaluate the detectors, we measure their repeatability on the challenges, treating the videos as 3D volumes. To evaluate the robustness of the spatio-temporal descriptors, we propose a principled classification pipeline in which the increasingly altered videos form a set of queries. This allows for an in-depth analysis of local detectors and descriptors and their combinations.
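The idea of measuring detector repeatability on videos treated as 3D volumes can be illustrated with a minimal sketch. The abstract does not state the exact matching criterion, so everything here is an assumption made for illustration: detections are modelled as axis-aligned cuboids (centre plus half-extent in x, y, and t), two detections correspond when their volumetric intersection-over-union reaches a hypothetical threshold `min_overlap`, and the detections from the altered video are assumed to be already mapped back into the coordinate frame of the original video. The names `Cuboid`, `overlap_3d`, and `repeatability` are not taken from the paper.

```python
from typing import List, Tuple

# Assumed representation of one detection in the video volume:
# (x, y, t, sx, sy, st) = centre and half-extent along each axis.
Cuboid = Tuple[float, float, float, float, float, float]


def overlap_3d(a: Cuboid, b: Cuboid) -> float:
    """Intersection-over-union of two axis-aligned cuboids."""
    inter = 1.0
    for i in range(3):
        lo = max(a[i] - a[i + 3], b[i] - b[i + 3])
        hi = min(a[i] + a[i + 3], b[i] + b[i + 3])
        if hi <= lo:
            return 0.0  # no overlap along this axis
        inter *= hi - lo
    vol_a = 8.0 * a[3] * a[4] * a[5]
    vol_b = 8.0 * b[3] * b[4] * b[5]
    return inter / (vol_a + vol_b - inter)


def repeatability(original: List[Cuboid],
                  altered: List[Cuboid],
                  min_overlap: float = 0.6) -> float:
    """Fraction of detections that reappear after the alteration.

    Greedy one-to-one matching; the score is normalised by the smaller
    of the two detection counts (an assumed convention).
    """
    if not original or not altered:
        return 0.0
    unused = list(altered)
    matches = 0
    for det in original:
        best = max(unused, key=lambda c: overlap_3d(det, c), default=None)
        if best is not None and overlap_3d(det, best) >= min_overlap:
            matches += 1
            unused.remove(best)  # each altered detection is matched once
    return matches / min(len(original), len(altered))
```

A detector could then be scored per challenge by calling `repeatability(dets_original, dets_altered)` for each alteration level and plotting the score against the strength of the alteration.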


Interest Point, Query Image, Structure Tensor, Original Video, Video Retrieval
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Julian Stöttinger (1, 3)
  • Bogdan Tudor Goras (2)
  • Thomas Pönitz (3)
  • Allan Hanbury (4)
  • Nicu Sebe (5)
  • Theo Gevers (6)

  1. CVL, Institute for Computer-Aided Automation, TU Vienna, Austria
  2. Faculty of Electronics, Telecommunication and Informatics, Tech. University of Iasi, Romania
  3. CogVis Ltd., Vienna, Austria
  4. IR Facility, Vienna, Austria
  5. Dept. of Information Eng. and Computer Science, University of Trento, Italy
  6. Faculty of Science, University of Amsterdam, The Netherlands
