Automated Textual Descriptions for a Wide Range of Video Events with 48 Human Actions

P. Hanckmann, K. Schutte, G.J. Burghouts

Abstract
We present a hybrid method to generate textual descriptions of video based on actions. The method consists of an action classifier and a description generator. The action classifier detects and classifies the actions in the video so that they can serve as verbs for the description generator. The description generator (1) finds the actors (objects or persons) in the video and connects them correctly to the verbs, so that they represent the subject and the direct and indirect objects, and (2) generates a sentence from the verb, subject, and direct and indirect objects. The novelty of our method is that it combines the discriminative power of a bag-of-features action detector with the generative power of a rule-based action descriptor. We show that this approach outperforms a homogeneous setup that uses the rule-based method for both action detection and description.
This work was sponsored by the DARPA Mind's Eye program.
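To make the two-component pipeline concrete, the following Python sketch outlines how an action classifier and a description generator could interact. It is a minimal illustration under assumed naming, not the paper's implementation: all identifiers (ActionClassifier, DescriptionGenerator, bind_roles, describe, and so on) are hypothetical, and the canned detector output and naive role-binding rules stand in for the actual bag-of-features models and grammar rules.

# Minimal sketch of the hybrid pipeline described above. All identifiers
# are illustrative placeholders, not names from the paper or its code.
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class DetectedAction:
    verb: str          # action label used as the verb, e.g. "give"
    confidence: float  # detector score

@dataclass
class Actor:
    label: str     # e.g. "person", "box"
    track_id: int  # identity of the tracked object

class ActionClassifier:
    """Stand-in for the discriminative bag-of-features action detector."""
    def classify(self, video) -> List[DetectedAction]:
        # A real detector would quantize space-time features against a
        # codebook and score one model per action class; here we return
        # a canned result so the sketch runs end to end.
        return [DetectedAction(verb="give", confidence=0.87)]

class DescriptionGenerator:
    """Stand-in for the generative rule-based description generator."""
    def bind_roles(self, verb: str, actors: List[Actor]
                   ) -> Tuple[Optional[Actor], Optional[Actor], Optional[Actor]]:
        # Rule-based role assignment (subject, direct object, indirect
        # object); this toy rule just takes actors in order of appearance.
        subject = actors[0] if actors else None
        direct = actors[1] if len(actors) > 1 else None
        indirect = actors[2] if len(actors) > 2 else None
        return subject, direct, indirect

    def realize(self, verb: str, subject: Optional[Actor],
                direct: Optional[Actor], indirect: Optional[Actor]) -> str:
        # Naive surface realization; third-person singular via "+s".
        parts = [subject.label if subject else "someone", verb + "s"]
        if direct:
            parts.append("the " + direct.label)
        if indirect:
            parts.append("to the " + indirect.label)
        return " ".join(parts) + "."

def describe(video, actors: List[Actor]) -> List[str]:
    # Detected actions supply the verbs; the generator binds actors to
    # grammatical roles and realizes one sentence per action.
    classifier = ActionClassifier()
    generator = DescriptionGenerator()
    sentences = []
    for action in classifier.classify(video):
        subject, direct, indirect = generator.bind_roles(action.verb, actors)
        sentences.append(generator.realize(action.verb, subject, direct, indirect))
    return sentences

if __name__ == "__main__":
    actors = [Actor("person", 1), Actor("box", 2), Actor("person", 3)]
    print(describe(video=None, actors=actors))
    # -> ['person gives the box to the person.']

Separating detection from realization in this way is what allows the discriminative detector and the generative descriptor to be developed, evaluated, and swapped independently.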