Multimedia Tools and Applications

Volume 78, Issue 10, pp 13131–13148

Joint face alignment and segmentation via deep multi-task learning

  • Yucheng Zhao
  • Fan Tang
  • Weiming Dong (corresponding author)
  • Feiyue Huang
  • Xiaopeng Zhang


Face alignment and face segmentation are challenging problems that have been studied extensively in the field of multimedia. The two tasks are closely related, and their learning processes are expected to benefit each other. We therefore present a joint multi-task learning algorithm for face alignment and segmentation based on a deep convolutional neural network (CNN). The proposed multi-task approach allows the CNN model to share visual knowledge between the two tasks during training. A carefully designed refinement residual module fuses cross-layer features in a collaborative manner. To the best of our knowledge, this is the first time that face alignment and segmentation have been learned together via deep multi-task learning. Our experiments show that learning these two related tasks simultaneously builds a synergy between them, improves the performance of each individual task, and achieves results competitive with recent approaches. Furthermore, we demonstrate the effectiveness of our model in two practical applications: virtual makeup and face swap.
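The abstract does not state the training objective. A common way to train two tasks on shared features is to minimize a weighted sum of the per-task losses, and the following is a minimal sketch of that idea only, not the authors' formulation. It assumes a mean squared error on landmark coordinates and a per-pixel cross-entropy on segmentation labels; the function names and the `seg_weight` parameter are hypothetical and do not come from the paper.

```python
import math


def landmark_loss(pred_pts, true_pts):
    """Mean squared error over (x, y) landmark coordinates (assumed alignment loss)."""
    total = 0.0
    for (px, py), (tx, ty) in zip(pred_pts, true_pts):
        total += (px - tx) ** 2 + (py - ty) ** 2
    return total / len(true_pts)


def segmentation_loss(pixel_probs, pixel_labels):
    """Mean per-pixel cross-entropy (assumed segmentation loss).

    pixel_probs[i] is a class-probability list for pixel i,
    pixel_labels[i] is the index of the true class for pixel i.
    """
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for probs, label in zip(pixel_probs, pixel_labels):
        total -= math.log(probs[label] + eps)
    return total / len(pixel_labels)


def joint_loss(pred_pts, true_pts, pixel_probs, pixel_labels, seg_weight=1.0):
    """Weighted sum of both task losses; seg_weight is a hypothetical balance factor."""
    return (landmark_loss(pred_pts, true_pts)
            + seg_weight * segmentation_loss(pixel_probs, pixel_labels))


# Toy example: two landmarks and two pixels with two classes each.
example = joint_loss(
    pred_pts=[(0.0, 0.0), (1.0, 1.0)],
    true_pts=[(0.0, 0.0), (1.0, 2.0)],
    pixel_probs=[[0.5, 0.5], [1.0, 0.0]],
    pixel_labels=[0, 0],
)
```

In a shared-backbone network, gradients from both terms flow into the common layers, which is one way the two tasks can inform each other. The relative weight would normally be tuned on validation data.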


Keywords: Face alignment, Face segmentation, Multi-task learning, Virtual makeup, Face swap



The Titan X used for this research was donated by the NVIDIA Corporation.



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. NLPR-LIAMA, Institute of Automation, Chinese Academy of Sciences, Beijing, China
  2. University of Chinese Academy of Sciences, Beijing, China
  3. YouTu Lab, Tencent, Shanghai, China
