Joint Person Segmentation and Identification in Synchronized First- and Third-Person Videos

  • Mingze Xu
  • Chenyou Fan
  • Yuchen Wang
  • Michael S. Ryoo
  • David J. Crandall
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11205)


In a world of pervasive cameras, public spaces are often captured from multiple perspectives by cameras of different types, both fixed and mobile. An important problem is to organize these heterogeneous collections of videos by finding connections between them, such as identifying correspondences between the people appearing in the videos and the people holding or wearing the cameras. In this paper, we address two specific problems: (1) given two or more synchronized third-person videos of a scene, produce a pixel-level segmentation of each visible person and identify corresponding people across different views (i.e., determine who in camera A corresponds to whom in camera B), and (2) given one or more synchronized third-person videos as well as a first-person video taken by a mobile or wearable camera, segment and identify the camera wearer in the third-person videos. Unlike previous work, which requires ground-truth bounding boxes to estimate the correspondences, we perform person segmentation and identification jointly. We find that solving these two problems simultaneously is mutually beneficial: finer-grained segmentation improves matching across views, and information from multiple views improves segmentation accuracy. We evaluate our approach on two challenging datasets of interacting people captured from multiple wearable cameras, and show that our proposed method performs significantly better than the state of the art on both person segmentation and identification.


Keywords: Synchronized first- and third-person cameras



This work was supported by the National Science Foundation (CAREER IIS-1253549), and the IU Office of the Vice Provost for Research, the College of Arts and Sciences, and the School of Informatics, Computing, and Engineering through the Emerging Areas of Research Project “Learning: Brains, Machines, and Children.” We would like to thank Sven Bambach for assisting with dataset collection, and Katherine Spoon and Anthony Tai for suggestions on our paper draft.



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Mingze Xu (1, corresponding author)
  • Chenyou Fan (1)
  • Yuchen Wang (1)
  • Michael S. Ryoo (1)
  • David J. Crandall (1)
  1. School of Informatics, Computing, and Engineering, Indiana University, Bloomington, USA
