Deep Deformation Network for Object Landmark Localization

  • Xiang Yu
  • Feng Zhou
  • Manmohan Chandraker
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9909)


We propose a novel cascaded framework, namely deep deformation network (DDN), for localizing landmarks in non-rigid objects. The hallmarks of DDN are its incorporation of geometric constraints within a convolutional neural network (CNN) framework, ease and efficiency of training, and generality of application. A novel shape basis network (SBN) forms the first stage of the cascade, whereby landmarks are initialized by combining the benefits of CNN features and a learned shape basis to reduce the complexity of the highly nonlinear pose manifold. In the second stage, a point transformer network (PTN) estimates local deformation, parameterized as a thin-plate spline transformation, for finer refinement. Our framework incorporates neither handcrafted features nor part connectivity, which enables an end-to-end shape prediction pipeline during both training and testing. In contrast to prior cascaded networks for landmark localization that learn a mapping from feature space to landmark locations, we demonstrate that the regularization induced through geometric priors in the DDN makes it easier to train, yet produces superior results. The efficacy and generality of the architecture are demonstrated through state-of-the-art performance on several benchmarks for multiple tasks such as facial landmark localization, human body pose estimation and bird part localization.
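The two geometric models named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: in DDN the shape-basis coefficients and the thin-plate spline parameters are predicted by networks, whereas here the coefficients are given by hand and the spline is fitted by solving the standard interpolation system. The function names (compose_shape, fit_tps, apply_tps, solve) and all numeric values are hypothetical.

```python
import math


def compose_shape(mean_shape, basis, coeffs):
    """Landmarks as a mean shape plus a weighted sum of basis shapes."""
    shape = [list(p) for p in mean_shape]
    for weight, component in zip(coeffs, basis):
        for point, offset in zip(shape, component):
            point[0] += weight * offset[0]
            point[1] += weight * offset[1]
    return shape


def tps_kernel(p, q):
    """Thin-plate spline radial basis U(r) = r^2 log r, with U(0) = 0."""
    r2 = (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
    return 0.0 if r2 == 0.0 else 0.5 * r2 * math.log(r2)


def solve(a, b):
    """Solve a x = b (b may have several columns) by Gauss-Jordan
    elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + rhs[:] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [[v / m[i][i] for v in m[i][n:]] for i in range(n)]


def fit_tps(control, target):
    """Fit a 2D thin-plate spline that maps each control point to its target.
    Returns n kernel weights followed by 3 affine rows, each with (x, y) columns.
    Assumes the control points are distinct and not all collinear."""
    n = len(control)
    a = [[0.0] * (n + 3) for _ in range(n + 3)]
    b = [[0.0, 0.0] for _ in range(n + 3)]
    for i, ci in enumerate(control):
        for j, cj in enumerate(control):
            a[i][j] = tps_kernel(ci, cj)
        affine = (1.0, ci[0], ci[1])
        for k in range(3):
            a[i][n + k] = affine[k]
            a[n + k][i] = affine[k]
        b[i] = list(target[i])
    return solve(a, b)


def apply_tps(params, control, point):
    """Warp one point with the fitted spline."""
    terms = [tps_kernel(point, c) for c in control] + [1.0, point[0], point[1]]
    return [sum(t * params[i][d] for i, t in enumerate(terms)) for d in range(2)]


if __name__ == "__main__":
    # Stage 1 (illustrative): coarse landmarks from a toy shape basis.
    mean_shape = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
    basis = [[(0.1, 0.0), (0.0, 0.1), (-0.1, 0.0), (0.0, -0.1)]]
    coarse = compose_shape(mean_shape, basis, [0.5])
    # Stage 2 (illustrative): a smooth local deformation applied to them.
    control = mean_shape
    target = [(0.05, 0.0), (1.0, 0.1), (0.0, 1.0), (0.95, 1.05)]
    params = fit_tps(control, target)
    print([apply_tps(params, control, p) for p in coarse])
```

The low-dimensional coefficient vector is what constrains the first stage to plausible shapes, and the spline adds a smooth, non-rigid correction on top of it; both properties are independent of how the parameters are obtained.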


Landmark localization · Convolutional neural network · Non-rigid shape analysis



Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  1. Department of Media Analytics, NEC Laboratories America, Cupertino, USA