Multimedia Tools and Applications, Volume 78, Issue 6, pp 7341–7363

Combining fractal hourglass network and skeleton joints pairwise affinity for multi-person pose estimation

  • Yanmin Luo
  • Zhitong Xu
  • Peizhong Liu
  • Yongzhao Du
  • Jingming Guo


Human pose estimation, especially multi-person pose estimation, is vital for understanding abnormal human behavior. In this paper, we develop a fractal hourglass model to automatically regress human body joints, and propose a layered double-way inference algorithm to calculate the affinity between neighboring skeleton joints. First, we replace the residual unit of the original hourglass network and describe the heatmap regression process for candidate skeleton joint locations. We then determine the specific body joint locations and optimize the regression results. Next, we define the double-way conditional probabilities between adjacent joints as the joints pairwise affinity and use it to match adjacent human body parts. In addition, we adopt a spatial distance constraint to refine the joint matching results. Finally, we connect the best-matching joint pair and iterate the process until all candidate joints are assigned to individuals. Extensive experiments on the MPII multi-person subset and the COCO 2016 keypoints challenge show the effectiveness of our method, which outperforms the second-best method (Associative Embedding) by 0.45 and 1.20%, respectively.
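The matching step summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the two conditional probabilities are combined, so the product used here is an assumption, as are the function name `match_adjacent_joints`, the hard distance threshold `max_dist`, and the toy numbers. The sketch pairs candidates of two adjacent joint types by repeatedly connecting the highest-affinity pair that satisfies the distance constraint.

```python
import numpy as np


def match_adjacent_joints(pos_a, pos_b, p_b_given_a, p_a_given_b, max_dist):
    """Greedily pair candidates of two adjacent joint types (hypothetical helper).

    pos_a: (Na, 2) candidate locations of joint type A
    pos_b: (Nb, 2) candidate locations of joint type B
    p_b_given_a: (Na, Nb) array, entry [i, j] approximates P(b_j | a_i)
    p_a_given_b: (Nb, Na) array, entry [j, i] approximates P(a_i | b_j)
    max_dist: pairs farther apart than this are never connected
    """
    pos_a = np.asarray(pos_a, dtype=float)
    pos_b = np.asarray(pos_b, dtype=float)

    # Double-way affinity. ASSUMPTION: the two directions are combined
    # by an element-wise product; the abstract does not give the formula.
    affinity = np.asarray(p_b_given_a, dtype=float) * np.asarray(p_a_given_b, dtype=float).T

    # Spatial distance constraint: rule out pairs that are too far apart.
    dist = np.linalg.norm(pos_a[:, None, :] - pos_b[None, :, :], axis=2)
    affinity = np.where(dist <= max_dist, affinity, -np.inf)

    pairs = []
    # Connect the best remaining pair until no admissible pair is left.
    while affinity.size and np.isfinite(affinity.max()):
        i, j = np.unravel_index(np.argmax(affinity), affinity.shape)
        pairs.append((int(i), int(j)))
        affinity[i, :] = -np.inf  # candidate a_i is now assigned
        affinity[:, j] = -np.inf  # candidate b_j is now assigned
    return pairs


# Toy example: two candidates of each joint type (made-up values).
necks = [[0, 0], [10, 0]]
shoulders = [[0, 5], [10, 5]]
p_s_given_n = [[0.9, 0.1], [0.2, 0.8]]
p_n_given_s = [[0.8, 0.2], [0.1, 0.7]]
pairs = match_adjacent_joints(necks, shoulders, p_s_given_n, p_n_given_s, max_dist=8.0)
print(pairs)  # [(0, 0), (1, 1)]
```

A greedy loop is used here only because the abstract describes connecting the best-matching pair and iterating; running the same routine over every pair of adjacent joint types would then assemble candidates into individual skeletons.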


  • Fractal hourglass network
  • Joints location heatmap regression
  • Skeleton joints pairwise affinity
  • Layered double-way inference
  • Multi-person pose estimation



We would like to thank the authors of the MPII human pose dataset and the team members of the COCO 2016 Keypoint Challenge. We also thank the members of our laboratory for their assistance.


This work was supported by grants from the National Natural Science Foundation of China (Grant No. 61605048), the Talent Project of Huaqiao University (Grant No. 14BS215), and the Quanzhou Scientific and Technological Planning Projects of Fujian, China (Grant No. 2015Z120).



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  • Yanmin Luo (1, 2)
  • Zhitong Xu (1, 2), corresponding author
  • Peizhong Liu (3)
  • Yongzhao Du (3)
  • Jingming Guo (4)

  1. College of Computer Science and Technology, Huaqiao University, Xiamen, China
  2. Key Laboratory for Computer Vision and Pattern Recognition of Xiamen City, Huaqiao University, Xiamen, China
  3. College of Engineering, Huaqiao University, Quanzhou, China
  4. Department of Electrical Engineering, National Taiwan University of Science and Technology, Taipei, Taiwan
