Video Object Discovery and Co-segmentation with Extremely Weak Supervision

  • Le Wang
  • Gang Hua
  • Rahul Sukthankar
  • Jianru Xue
  • Nanning Zheng
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8692)


Video object co-segmentation refers to the problem of simultaneously segmenting a common category of objects from multiple videos. Most existing video co-segmentation methods assume that all frames from all videos contain the target objects. Unfortunately, this assumption rarely holds in practice, particularly for large video sets, and existing methods perform poorly when it is violated. Hence, any practical video object co-segmentation algorithm needs to identify the relevant frames containing the target object from all videos, and then co-segment the object only from these relevant frames. We present a spatiotemporal energy minimization formulation for simultaneous video object discovery and co-segmentation across multiple videos. Our formulation incorporates a spatiotemporal auto-context model, which is combined with appearance modeling for superpixel labeling. The superpixel-level labels are propagated to the frame level through a multiple instance boosting algorithm with spatial reasoning (Spatial-MILBoosting), based on which frames containing the video object are identified. Our method needs to be bootstrapped only with frame-level labels for a few video frames (typically 1 to 3) that indicate whether they contain the target objects. Experiments on three datasets validate the efficacy of our proposed method, which compares favorably with the state of the art.
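
As a rough illustration of how superpixel-level labels can be propagated to the frame level, the Python sketch below uses the noisy-OR rule that is common in multiple instance boosting: a frame (the bag) is considered to contain the object if at least one of its superpixels (the instances) does. This is only a sketch under stated assumptions. It is assumed, not confirmed by the text above, that Spatial-MILBoosting aggregates in a similar way; the spatial reasoning component is omitted, and the function names, the 0.5 threshold, and the example probabilities are hypothetical.

```python
def frame_probability(superpixel_probs):
    """Noisy-OR aggregation: probability that at least one superpixel is object.

    superpixel_probs: per-superpixel object probabilities in [0, 1]
    (hypothetical outputs of a superpixel-level classifier).
    """
    prob_all_background = 1.0
    for p in superpixel_probs:
        prob_all_background *= (1.0 - p)
    return 1.0 - prob_all_background


def select_relevant_frames(frames, threshold=0.5):
    """Return indices of frames whose noisy-OR probability exceeds threshold.

    The threshold value is an assumption chosen for illustration only.
    """
    return [
        i for i, probs in enumerate(frames)
        if frame_probability(probs) > threshold
    ]


# Three hypothetical frames: weak evidence, one confident superpixel, no superpixels.
frames = [[0.1, 0.2], [0.9, 0.1], []]
relevant = select_relevant_frames(frames)
```

Under this rule a single confident superpixel is enough to mark a frame as relevant, while a frame with only weak evidence is left out of the co-segmentation step.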


video object discovery, video object co-segmentation, spatiotemporal auto-context model, Spatial-MILBoosting

Supplementary material

978-3-319-10593-2_42_MOESM1_ESM.pdf: Electronic Supplementary Material (569 KB)



Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Le Wang (1)
  • Gang Hua (2)
  • Rahul Sukthankar (3)
  • Jianru Xue (1)
  • Nanning Zheng (1)
  1. Xi’an Jiaotong University, China
  2. Stevens Institute of Technology, USA
  3. Google Research, USA
