Multi-label Discriminative Weakly-Supervised Human Activity Recognition and Localization

Adeli Mosabbeb, Ehsan; Cabral, Ricardo; De la Torre, Fernando; Fathy, Mahmood

doi:10.1007/978-3-319-16814-2_16

Ehsan Adeli Mosabbeb¹⁷,
Ricardo Cabral¹⁸,
Fernando De la Torre¹⁸ &
…
Mahmood Fathy¹⁷

Part of the book series: Lecture Notes in Computer Science ((LNIP,volume 9007))

Included in the following conference series:

Asian Conference on Computer Vision

1840 Accesses
4 Citations

Abstract

Activity recognition in video has become increasingly important due to its many applications ranging from in-home elder care, surveillance, human computer interaction to automatic sports commentary. To date, most approaches to video rely on fully supervised settings that require time consuming and error prone manual labeling. Moreover, existing supervised approaches are typically tailored for classification, not detection problems (the spatial and temporal support of the action has to be detected). Recently, weakly-supervised learning (WSL) approaches were able to learn discriminative classifiers while localizing the action in space and/or time using weak labels. However, existing approaches for WSL provide coarse localization in terms of spatial regions or spatio-temporal volumes. Moreover, it is unclear how to extend current approaches to the multi-label case that is common in practical applications. This paper proposes a matrix completion approach to the problem of WSL for multi-label learning for video. Our approach localizes non-rectangular spatio-temporal discriminative regions that are inferred by clustering regions of common texture and motion features. We illustrate how our approach improves existing WSL and supervised learning techniques in three standard databases: Hollywood, UCF sports, and MSR-II.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 39.99; Price excludes VAT (USA)

Softcover Book: USD 54.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Notes

1.
Bold capital letters denote matrices (e.g., D). All non-bold letters denote scalar variables. \(d_{ij}\) denotes the scalar in the row i and column j of D.\(\langle \mathbf{d}_{1},\mathbf{d}_{2}\rangle \) denotes the inner product between two vectors \(\mathbf{d}_{1}\) and \(\mathbf{d}_{2}.\) \(\Vert \mathbf{d} \Vert _{2}^{2}=\langle \mathbf{d},\mathbf{d}\rangle =\Sigma _{i}d_{i}^{2}\) denotes the squared Euclidean Norm of\(\mathbf{d}. \Vert \mathbf{A}\Vert _{*}\) designates the nuclear norm (sum of singular values) of A.

References

Laptev, I., Marszałek, M., Schmid, C., Rozenfeld, B.: Learning realistic human actions from movies. In: CVPR (2008)
Google Scholar
Ryoo, M.S., Aggarwal, J.K.: Spatio-temporal relationship match: video structure comparison for recognition of complex human activities. In: ICCV (2009)
Google Scholar
Rodriguez, M.D., Ahmed, J., Shah, M.: Action mach: a spatio-temporal maximum average correlation height filter for action recognition. In: CVPR (2008)
Google Scholar
Sheikh, Y., Sheikh, M., Shah, M.: Exploring the space of a human action. In: ICCV (2005)
Google Scholar
Wang, H., Kläser, A., Schmid, C., Cheng-Lin, L.: Action recognition by dense trajectories. In: CVPR (2011)
Google Scholar
Cao, L., Liu, Z., Huang, T.S.: Cross-dataset action detection. In: CVPR (2010)
Google Scholar
Raptis, M., Kokkinos, I., Soatto, S.: Discovering discriminative action parts from mid-level video representations. In: CVPR (2012)
Google Scholar
Lan, T., Wang, Y., Mori, G.: Discriminative figure-centric models for joint action localization and recognition. In: ICCV (2011)
Google Scholar
Nguyen, M.H., Torresani, L., De la Torre, F., Rother, C.: Weakly-supervised discriminative localization and classification: a joint learning process. In: ICCV (2009)
Google Scholar
Siva, P., Xiang, T.: Weakly-supervised action detection. In: BMVC (2011)
Google Scholar
Zhou, Z., Zhang, M.: Multi-instance multi-label learning with application to scene classification. In: NIPS (2006)
Google Scholar
Tian, Y., Sukthankar, R., Shah, M.: Spatiotemporal deformable part models for action detection. In: CVPR (2013)
Google Scholar
Wang, L., Qiao, Y., Tang, X.: Motionlets: mid-level 3D parts for human motion recognition. In: CVPR, pp. 2674–2681 (2013)
Google Scholar
Liu, G., Lin, Z., Yu, Y.: Robust subspace segmentation by low-rank representation. In: ICML (2010)
Google Scholar
Cheng, B., Liu, G., Wang, J., Huang, Z., Yan, S.: Multi-task low-rank affinity pursuit for image segmentation. In: ICCV (2011)
Google Scholar
Elhamifar, E., Vidal, R.: Sparse subspace clustering. In: CVPR (2009)
Google Scholar
Tang, K., Sukthankar, R., Yagnik, J., Fei-Fei, L.: Discriminative segment annotation in weakly labeled video. In: CVPR (2013)
Google Scholar
Cabral, R.S., De la Torre, F., Costeira, J.P., Bernardino, A.: Matrix completion for multi-label image classification. In: NIPS (2011)
Google Scholar
Wang, H., Ullah, M.M., Kläser, A., Laptev, I., Schmid, C.: Evaluation of local spatio-temporal features for action recognition. In: BMVC (2009)
Google Scholar
Kovashka, A., Grauman, K.: Learning a hierarchy of discriminative space-time neighborhood features for human action recognition. In: CVPR (2010)
Google Scholar
Hoai, M., Lan, Z., De la Torre, F.: Joint segmentation and classification of human actions in video. In: CVPR (2011)
Google Scholar
Shapovalova, N., Vahdat, A., Cannons, K., Lan, T., Mori, G.: Similarity constrained latent support vector machine: an application to weakly supervised action classification. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012, Part VII. LNCS, vol. 7578, pp. 55–68. Springer, Heidelberg (2012)
Chapter Google Scholar
Tang, K., Fei-Fei, L., Koller, D.: Learning latent temporal structure for complex event detection. In: CVPR (2012)
Google Scholar
Chen, C.Y., Grauman, K.: Efficient Activity Detection with max-subgraph Search. In: CVPR (2012)
Google Scholar
Tran, D., Yuan, J.: Max-margin structured output regression for spatio-temporal action localization. In: NIPS (2012)
Google Scholar
Duchenne, O., Laptev, I., Sivic, J., Bach, F., Ponce, J.: Automatic annotation of human actions in video. In: ICCV (2009)
Google Scholar
Tran, D., Yuan, J., Forsyth, D.: Video event detection: from subvolume localization to spatio-temporal path search. IEEE Trans. Pattern Anal. Mach. Intell. 36(2), 404–416 (2014)
Article Google Scholar
Kumar, B.G.V., Patras, I.: Supervised dictionary learning for action localization. In: FG (2013)
Google Scholar
Gaidon, A., Harchaoui, Z., Schmid, C.: Temporal localization of actions with actoms. IEEE Trans. Pattern Anal. Mach. Intell. 35, 2782–2795 (2013)
Article Google Scholar
Ma, S., Zhang, J., Ikizler-Cinbis, N., Sclaroff, S.: Action recognition and localization by hierarchical space-time segments. In: ICCV (2013)
Google Scholar
Dietterich, T.G., Lathrop, R.H., Lozano-Pérez, T.: Solving the multiple instance problem with axis-parallel rectangles. Artif. Intell. 89, 31–71 (1997)
Article MATH Google Scholar
Andrews, S., Tsochantaridis, I., Hofmann, T.: Support vector machines for multiple-instance learning. In: NIPS (2003)
Google Scholar
Prest, A., Leistner, C., Civera, J., Schmid, C., Ferrari, V.: Learning object class detectors from weakly annotated video. In: CVPR (2012)
Google Scholar
Hartmann, G., Grundmann, M., Hoffman, J., Tsai, D., Kwatra, V., Madani, O., Vijayanarasimhan, S., Essa, I., Rehg, J., Sukthankar, R.: Weakly supervised learning of object segmentations from web-scale video. In: Fusiello, A., Murino, V., Cucchiara, R. (eds.) ECCV 2012 Ws/Demos, Part I. LNCS, vol. 7583, pp. 198–208. Springer, Heidelberg (2012)
Google Scholar
Li, F., Sminchisescu, C.: Convex multiple-instance learning by estimating likelihood ratio. In: NIPS (2010)
Google Scholar
Joulin, A., Bach, F.: A convex relaxation for weakly-supervised classifiers. In: ICML (2012)
Google Scholar
Lin, Z., Chen, M., Wu, L., Ma, Y.: The Augmented Lagrange Multiplier Method for Exact Recovery of Corrupted Low-Rank Matrices. UIUC Technical report 2215 (2009)
Google Scholar
Tron, R., Vidal, R.: Distributed computer vision algorithms through distributed averaging. In: CVPR (2011)
Google Scholar
Boutell, M.R., Luo, J., Shen, X., Brown, C.M.: Learning multi-label scene classification. Pattern Recognit. 37, 1757–1771 (2004)
Article Google Scholar
Zhang, M.L., Zhou, Z.H.: A review on multi-label learning algorithms. IEEE Trans. Knowl. Data Eng. 26(8), 1819–1837 (2014)
Article Google Scholar
Goldberg, A.B., Zhu, X., Recht, B., Xu, J.M., Nowak, R.D.: Transduction with matrix completion: three birds with one stone. In: NIPS (2010)
Google Scholar
Tian, Y., Cao, L., Liu, Z., Zhang, Z.: Hierarchical filtered motion for action recognition in crowded videos. IEEE Trans. Sys. Man. Cyb. Part C 42, 313–323 (2012)
Article Google Scholar
Yuan, J., Liu, Z., Wu, Y.: Discriminative video pattern search for efficient action detection. IEEE Trans. Pattern Anal. Mach. Intell. 33, 1728–1743 (2011)
Article Google Scholar
Matikainen, P., Hebert, M., Sukthankar, R.: Trajectons: action recognition through the motion analysis of tracked features. In: ICCV (2009)
Google Scholar
Klaeser, A., Marszalek, M., Schmid, C.: A spatio-temporal descriptor based on 3D-gradients. In: BMVC (2008)
Google Scholar
Wu, S., Oreifej, O., Shah, M.: Action recognition in videos acquired by a moving camera using motion decomposition of Lagrangian particle trajectories. In: ICCV (2011)
Google Scholar
Le, Q.V., Zou, W.Y., Yeung, S.Y., Ng, A.Y.: Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis. In: CVPR (2011)
Google Scholar
Wang, H., Kläser, A., Schmid, C., Liu, C.L.: Dense trajectories and motion boundary descriptors for action recognition. Int. J. Comput. Vis. 103, 60–79 (2013)
Article MathSciNet Google Scholar

Download references

Author information

Authors and Affiliations

Iran University of Science and Technology, Tehran, Iran
Ehsan Adeli Mosabbeb & Mahmood Fathy
Robotics Institute, Carnegie Mellon University, Pittsburgh, PA, USA
Ricardo Cabral & Fernando De la Torre

Authors

Ehsan Adeli Mosabbeb
View author publications
You can also search for this author in PubMed Google Scholar
Ricardo Cabral
View author publications
You can also search for this author in PubMed Google Scholar
Fernando De la Torre
View author publications
You can also search for this author in PubMed Google Scholar
Mahmood Fathy
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Ehsan Adeli Mosabbeb .

Editor information

Editors and Affiliations

Technische Universität München, Garching, Germany
Daniel Cremers
University of Adelaide, Adelaide, South Australia, Australia
Ian Reid
Keio University, Yokohama, Kanagawa, Japan
Hideo Saito
University of California at Merced, Merced, California, USA
Ming-Hsuan Yang

1 Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material (zip 12,063 KB)

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Adeli Mosabbeb, E., Cabral, R., De la Torre, F., Fathy, M. (2015). Multi-label Discriminative Weakly-Supervised Human Activity Recognition and Localization. In: Cremers, D., Reid, I., Saito, H., Yang, MH. (eds) Computer Vision -- ACCV 2014. ACCV 2014. Lecture Notes in Computer Science(), vol 9007. Springer, Cham. https://doi.org/10.1007/978-3-319-16814-2_16

Download citation

DOI: https://doi.org/10.1007/978-3-319-16814-2_16
Published: 17 April 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-16813-5
Online ISBN: 978-3-319-16814-2
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics