Video Object Segmentation with Language Referring Expressions

  • Anna Khoreva
  • Anna Rohrbach
  • Bernt Schiele
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11364)


Most state-of-the-art semi-supervised video object segmentation methods rely on a pixel-accurate mask of the target object provided for the first frame of a video. However, obtaining a detailed segmentation mask is expensive and time-consuming. In this work we explore an alternative way of identifying a target object, namely by employing language referring expressions. Besides being a more practical and natural way of pointing out a target object, using language specifications can help to avoid drift and make the system more robust to complex dynamics and appearance variations. Leveraging recent advances in language grounding models designed for images, we propose an approach that extends them to video data, ensuring temporally coherent predictions. To evaluate our approach we augment the popular video object segmentation benchmarks, \({\text {DAVIS}}_{{16}}\) and \({\text {DAVIS}}_{{17}}\), with language descriptions of target objects. We show that our language-supervised approach performs on par with methods that have access to a pixel-level mask of the target object on \({\text {DAVIS}}_{{16}}\), and is competitive with methods using scribbles on the challenging \({\text {DAVIS}}_{{17}}\) dataset.

Supplementary material

Supplementary material 1: 484519_1_En_8_MOESM1_ESM.pdf (PDF, 7.3 MB)



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Max Planck Institute for Informatics, Saarbrücken, Germany
  2. Bosch Center for Artificial Intelligence, Renningen, Germany
  3. University of California, Berkeley, Berkeley, USA
