Video Instance-Level Human Parsing

  • Liang Lin
  • Dongyu Zhang
  • Ping Luo
  • Wangmeng Zuo


This chapter introduces the Adaptive Temporal Encoding Network (ATEN), which alternates between temporal encoding among key frames and flow-guided feature propagation to the consecutive frames that lie between two key frames. Specifically, ATEN first applies a Parsing-RCNN, which integrates global human parsing and instance-level human segmentation into a unified model, to produce the instance-level parsing result for each key frame. To balance accuracy and efficiency, flow-guided feature propagation is used to parse the remaining consecutive frames directly, according to their identified temporal consistency with the key frames. In addition, ATEN leverages convolutional gated recurrent units (convGRU) to exploit temporal changes over a series of key frames, and the encoded features are then used to facilitate instance-level parsing of each frame. By alternating between direct feature propagation among consistent frames and temporal encoding among key frames, ATEN achieves a good balance between frame-level accuracy and time efficiency, a common and crucial problem in video object segmentation research.
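The alternation described above can be illustrated with a small, dependency-free Python sketch. It is not the chapter's implementation and rests on several assumptions: key frames are picked at a fixed interval (`key_interval`, a hypothetical parameter), the feature extractor is passed in as a placeholder callable (`extract`), the optical flow is supplied rather than estimated, and the recurrent update is an element-wise GRU-style gate with scalar weights that stands in for a real convGRU. All function names are illustrative.

```python
import math


def warp_bilinear(feat, flow):
    """Warp a 2-D feature map (list of rows) with a per-pixel flow field.

    flow[y][x] = (dx, dy) is the offset from position (x, y) in the current
    frame to the position in the key-frame feature map to sample from.
    Sample positions outside the map are clamped to the border.
    """
    h, w = len(feat), len(feat[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx, dy = flow[y][x]
            sx = min(max(x + dx, 0.0), w - 1.0)
            sy = min(max(y + dy, 0.0), h - 1.0)
            x0, y0 = int(math.floor(sx)), int(math.floor(sy))
            x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
            ax, ay = sx - x0, sy - y0
            top = (1 - ax) * feat[y0][x0] + ax * feat[y0][x1]
            bot = (1 - ax) * feat[y1][x0] + ax * feat[y1][x1]
            out[y][x] = (1 - ay) * top + ay * bot
    return out


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def gru_update(h_prev, x, wz=1.0, wr=1.0, wh=1.0):
    """Element-wise GRU-style update (toy stand-in for convGRU).

    A real convGRU replaces the scalar weights with learned convolutions;
    the scalars here are assumptions made to keep the sketch short.
    """
    out = []
    for row_h, row_x in zip(h_prev, x):
        new_row = []
        for hp, xv in zip(row_h, row_x):
            z = sigmoid(wz * (xv + hp))           # update gate
            r = sigmoid(wr * (xv + hp))           # reset gate
            cand = math.tanh(wh * (xv + r * hp))  # candidate state
            new_row.append((1 - z) * hp + z * cand)
        out.append(new_row)
    return out


def process_video(frames, flows_to_key, key_interval, extract):
    """Alternate between key-frame encoding and flow-guided propagation.

    frames:       list of 2-D frames
    flows_to_key: flows_to_key[i] is the flow from frame i to its preceding
                  key frame (unused for key frames)
    key_interval: fixed key-frame spacing (a simplifying assumption)
    extract:      placeholder feature extractor, frame -> 2-D feature map
    """
    hidden = None
    outputs = []
    for i, frame in enumerate(frames):
        if i % key_interval == 0:
            feat = extract(frame)
            # Temporal encoding over the sequence of key frames.
            hidden = feat if hidden is None else gru_update(hidden, feat)
            outputs.append(("key", hidden))
        else:
            # Cheap path: reuse the key-frame features, warped by the flow.
            outputs.append(("propagated", warp_bilinear(hidden, flows_to_key[i])))
    return outputs
```

Under these assumptions, only the key frames pay for feature extraction and recurrent encoding, while every other frame costs a single warp. That cost split is the source of the accuracy and efficiency trade-off the abstract refers to.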



Copyright information

© Springer Nature Singapore Pte Ltd. 2020

Authors and Affiliations

  • Liang Lin (1)
  • Dongyu Zhang (1)
  • Ping Luo (2)
  • Wangmeng Zuo (3)

  1. School of Data and Computer Science, Sun Yat-sen University, Guangzhou, China
  2. School of Information Engineering, The Chinese University of Hong Kong, Hong Kong, Hong Kong
  3. School of Computer Science, Harbin Institute of Technology, Harbin, China
