Zoom Better to See Clearer: Human and Object Parsing with Hierarchical Auto-Zoom Net

  • Fangting Xia
  • Peng Wang
  • Liang-Chieh Chen
  • Alan L. Yuille
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9909)


Parsing articulated objects, e.g. humans and animals, into semantic parts (e.g. head, body, and arms) from natural images is a challenging and fundamental problem in computer vision. A major difficulty is the large variability in scale and location of objects and their corresponding parts. Even small errors in estimating scale and location degrade the parsing output and cause errors in boundary details. To tackle this difficulty, we propose a “Hierarchical Auto-Zoom Net” (HAZN) for object part parsing that adapts to the local scales of objects and parts. HAZN is a sequence of two “Auto-Zoom Nets” (AZNs), each employing fully convolutional networks to perform two tasks: (1) predict the locations and scales of object instances (the first AZN) or their parts (the second AZN); (2) estimate the part scores for the predicted object instance or part regions. Our model adaptively “zooms” (resizes) predicted image regions to their proper scales to refine the parsing. We conduct extensive experiments on the PASCAL part datasets for humans, horses, and cows. In all three categories, our approach significantly outperforms state-of-the-art alternatives by more than 5% mIOU and is especially better at segmenting small instances and small parts. In summary, our strategy of first zooming into objects and then zooming into parts is very effective. It also lets us process different regions of the image at different scales adaptively, so we do not need to waste computational resources scaling the entire image.
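The “zoom” step described above can be illustrated with a minimal sketch: crop a predicted region, resize it by its estimated scale, parse it at that scale, and write the labels back at the original resolution. This is not the authors' implementation. The names `Region`, `resize_nearest`, and `zoom_parse_paste` are hypothetical, the region and its zoom factor are supplied by hand rather than predicted by an AZN, and `parse_fn` is a stand-in for a fully convolutional parsing network.

```python
from dataclasses import dataclass
from typing import Callable, List

Grid = List[List[int]]  # toy single-channel image or label map


@dataclass
class Region:
    """Hypothetical predicted box (half-open pixel bounds) plus a zoom factor."""
    top: int
    left: int
    bottom: int
    right: int
    zoom: float  # > 1 enlarges a small instance, < 1 shrinks a large one


def crop(grid: Grid, r: Region) -> Grid:
    """Cut the region out of the grid."""
    return [row[r.left:r.right] for row in grid[r.top:r.bottom]]


def resize_nearest(grid: Grid, out_h: int, out_w: int) -> Grid:
    """Nearest-neighbour resize; it never blends values, so it suits label maps."""
    in_h, in_w = len(grid), len(grid[0])
    return [
        [grid[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]


def zoom_parse_paste(image: Grid, canvas: Grid, r: Region,
                     parse_fn: Callable[[Grid], Grid]) -> None:
    """Zoom into r, parse at the zoomed scale, and paste labels into canvas."""
    patch = crop(image, r)
    h, w = len(patch), len(patch[0])
    zoomed_h = max(1, round(h * r.zoom))
    zoomed_w = max(1, round(w * r.zoom))
    zoomed = resize_nearest(patch, zoomed_h, zoomed_w)
    labels = parse_fn(zoomed)                 # stand-in for the parsing network
    restored = resize_nearest(labels, h, w)   # back to the original resolution
    for dy, row in enumerate(restored):
        canvas[r.top + dy][r.left:r.right] = row


# Toy usage: a 2x2 "instance" inside a 4x4 image, parsed by a threshold rule.
image = [[0, 0, 0, 0],
         [0, 5, 7, 0],
         [0, 6, 8, 0],
         [0, 0, 0, 0]]
canvas = [[0] * 4 for _ in range(4)]
threshold = lambda g: [[1 if v > 6 else 0 for v in row] for row in g]
zoom_parse_paste(image, canvas, Region(1, 1, 3, 3, zoom=2.0), threshold)
```

Only the selected region is resized here, which mirrors the abstract's point that the whole image need not be rescaled. In a hierarchical setting the same routine would presumably be applied twice, first to object regions and then to part regions inside them.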


Keywords: Human parsing · Part segmentation · Multi-scale modeling



We gratefully acknowledge support from NSF award CCF-1317376 and NSF STC award CCF-1231216. We also thank NVIDIA for providing the free GPUs used to train the deep models. Additionally, many thanks to Lingxi Xie, Zhou Ren, and Xianjie Chen for proofreading this paper and giving suggestions.




Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  • Fangting Xia (1)
  • Peng Wang (1)
  • Liang-Chieh Chen (1)
  • Alan L. Yuille (1)
  1. University of California, Los Angeles, USA
