
Discovering Groups of People in Images

  • Wongun Choi
  • Yu-Wei Chao
  • Caroline Pantofaru
  • Silvio Savarese
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8692)

Abstract

Understanding group activities from images is an important yet challenging task. This is because there is an exponentially large number of semantic and geometrical relationships among individuals that one must model in order to effectively recognize and localize the group activities. Rather than focusing on directly recognizing group activities as most previous work does, we advocate the importance of introducing an intermediate representation for modeling groups of humans which we call structured groups. Such groups define the way people spatially interact with each other. People might be facing each other to talk, while others sit on a bench side by side, and some might stand alone. In this paper we contribute a method for identifying and localizing these structured groups in a single image despite their varying viewpoints, number of participants, and occlusions. We propose to learn an ensemble of discriminative interaction patterns to encode the relationships between people in 3D and introduce a novel efficient iterative augmentation algorithm for solving this complex inference problem. A nice byproduct of the inference scheme is an approximate 3D layout estimate of the structured groups in the scene. Finally, we contribute an extremely challenging new dataset that contains images each showing multiple people performing multiple activities. Extensive evaluation confirms our theoretical findings.
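To make the grouping idea concrete, the following is a minimal toy sketch of greedily growing groups from pairwise interaction scores. This is not the paper's actual inference algorithm (which learns discriminative 3D interaction patterns and uses an iterative augmentation scheme); the score dictionary, averaging rule, and threshold here are all invented purely for illustration.

```python
# Toy sketch: greedily merge people into groups whenever the average
# pairwise interaction score between two candidate groups is high.
# All names and the threshold are hypothetical illustration choices.

def discover_groups(scores, threshold=0.5):
    """scores: dict mapping frozenset({i, j}) -> pairwise interaction score."""
    people = sorted({p for pair in scores for p in pair})
    groups = [{p} for p in people]          # start from singleton groups
    changed = True
    while changed:                          # iterate until no merge improves
        changed = False
        for a in range(len(groups)):
            for b in range(a + 1, len(groups)):
                # average pairwise score across the two candidate groups
                pairs = [frozenset({i, j})
                         for i in groups[a] for j in groups[b]]
                avg = sum(scores.get(p, 0.0) for p in pairs) / len(pairs)
                if avg > threshold:         # merge if interaction is strong
                    groups[a] |= groups[b]
                    del groups[b]
                    changed = True
                    break
            if changed:
                break
    return groups
```

For example, with strong scores among people 0–2 and between 3 and 4, the sketch returns the two groups {0, 1, 2} and {3, 4}; people with no scored interactions remain singletons.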

Keywords

Group discovery · Social interaction · Activity recognition

Supplementary material

Electronic Supplementary Material: 978-3-319-10593-2_28_MOESM1_ESM.pdf (8.3 MB)


Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Wongun Choi (1)
  • Yu-Wei Chao (2)
  • Caroline Pantofaru (3)
  • Silvio Savarese (4)
  1. NEC Laboratories, USA
  2. University of Michigan, Ann Arbor, USA
  3. Google, Inc., USA
  4. Stanford University, USA
