Tracklet Descriptors for Action Modeling and Video Analysis

  • Michalis Raptis
  • Stefano Soatto
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6311)


We present spatio-temporal feature descriptors that can be inferred from video and used as building blocks in action recognition systems. They capture the evolution of “elementary action elements” under a set of assumptions on the image-formation model and are designed to be insensitive to nuisance variability (absolute position, contrast) while retaining discriminative statistics due to the fine-scale motion and the local shape in compact regions of the image. Despite their simplicity, these descriptors, used in conjunction with basic classifiers, attain state-of-the-art performance in recognizing actions on benchmark datasets.


Recognition Rate; Base Region; Action Recognition; Interest Point; Dynamic Time Warping
These keywords were added by machine and not by the authors.



Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Michalis Raptis (1)
  • Stefano Soatto (1)
  1. University of California, Los Angeles
