Automated Textual Descriptions for a Wide Range of Video Events with 48 Human Actions

P. Hanckmann, K. Schutte, G.J. Burghouts

Abstract
We present a hybrid method to generate textual descriptions of video based on actions. The method consists of an action classifier and a description generator. The action classifier detects and classifies the actions in the video so that they can serve as verbs for the description generator. The description generator (1) finds the actors (objects or persons) in the video and connects them correctly to the verbs, so that they represent the subject and the direct and indirect objects, and (2) generates a sentence from the verb, subject, and direct and indirect objects. The novelty of our method is that it combines the discriminative power of a bag-of-features action detector with the generative power of a rule-based action descriptor. We show that this approach outperforms a homogeneous setup that uses the rule-based method for both action detection and description.
This work was sponsored by the DARPA Mind's Eye program.
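To make the two-component pipeline concrete, the following Python sketch outlines how an action classifier and a description generator could interact. It is a minimal illustration under assumed naming, not the paper's implementation: all identifiers (ActionClassifier, DescriptionGenerator, bind_roles, describe, and so on) are hypothetical, and the canned detector output and naive role-binding rules stand in for the actual bag-of-features models and grammar rules.

# Minimal sketch of the hybrid pipeline described above. All identifiers
# are illustrative placeholders, not names from the paper or its code.
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class DetectedAction:
    verb: str          # action label used as the verb, e.g. "give"
    confidence: float  # detector score

@dataclass
class Actor:
    label: str     # e.g. "person", "box"
    track_id: int  # identity of the tracked object

class ActionClassifier:
    """Stand-in for the discriminative bag-of-features action detector."""
    def classify(self, video) -> List[DetectedAction]:
        # A real detector would quantize space-time features against a
        # codebook and score one model per action class; here we return
        # a canned result so the sketch runs end to end.
        return [DetectedAction(verb="give", confidence=0.87)]

class DescriptionGenerator:
    """Stand-in for the generative rule-based description generator."""
    def bind_roles(self, verb: str, actors: List[Actor]
                   ) -> Tuple[Optional[Actor], Optional[Actor], Optional[Actor]]:
        # Rule-based role assignment (subject, direct object, indirect
        # object); this toy rule just takes actors in order of appearance.
        subject = actors[0] if actors else None
        direct = actors[1] if len(actors) > 1 else None
        indirect = actors[2] if len(actors) > 2 else None
        return subject, direct, indirect

    def realize(self, verb: str, subject: Optional[Actor],
                direct: Optional[Actor], indirect: Optional[Actor]) -> str:
        # Naive surface realization; third-person singular via "+s".
        parts = [subject.label if subject else "someone", verb + "s"]
        if direct:
            parts.append("the " + direct.label)
        if indirect:
            parts.append("to the " + indirect.label)
        return " ".join(parts) + "."

def describe(video, actors: List[Actor]) -> List[str]:
    # Detected actions supply the verbs; the generator binds actors to
    # grammatical roles and realizes one sentence per action.
    classifier = ActionClassifier()
    generator = DescriptionGenerator()
    sentences = []
    for action in classifier.classify(video):
        subject, direct, indirect = generator.bind_roles(action.verb, actors)
        sentences.append(generator.realize(action.verb, subject, direct, indirect))
    return sentences

if __name__ == "__main__":
    actors = [Actor("person", 1), Actor("box", 2), Actor("person", 3)]
    print(describe(video=None, actors=actors))
    # -> ['person gives the box to the person.']

Separating detection from realization in this way is what allows the discriminative detector and the generative descriptor to be developed, evaluated, and swapped independently.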