1 Introduction

Each year, nearly one million children are born with a genetic condition. Phenotypic variability among genetic syndromes, and across populations with different ages and ethnic backgrounds, often causes delays and errors in identification and diagnosis, which can translate into irreversible injury and even death. The reported average accuracy of a trained pediatrician in detecting one of the most studied genetic syndromes (Down syndrome) is as low as 64% [1], so methods for early detection are critical [2].

New developments in the analysis of facial dysmorphology from photographic data have shown promising results in genetic syndrome detection [3, 4]. However, two-dimensional (2D) photography only provides a projection of the patient’s face onto one plane, and therefore quantification of dysmorphology from 2D photography is sensitive to the orientation of the patient’s face with respect to the camera. To overcome these limitations, some works [5, 6] have explored the use of three-dimensional (3D) photography to quantify facial dysmorphology. However, 3D photography is not practical for screening children in routine clinics because it requires a dedicated area and costly equipment, and access to it is limited in developing countries.

To address this challenge, we propose a novel method that estimates the 3D shape of the face from three unconstrained 2D photographs: one frontal and two profile (left and right) views, acquired with an uncalibrated smartphone camera.

Recent works on 3D face shape estimation from 2D pictures use a variety of techniques, including landmark-based [7], shape-from-shading-based [8], and learning-based [9, 10] methods. Although these methods have revolutionized 3D face reconstruction from a single image, they struggle to accurately locate feature points at the face boundaries and the ears. The work in [11] tried to mitigate this problem by using large data collections with multiple images acquired at different poses, but it only focused on the frontal part of the face and optimized each picture independently.

In this paper, we estimate the 3D face shape by integrating information from three views of the same subject. First, we use a unified 3D morphable model (3DMM) [12] to estimate the 3D locations of a set of landmarks from the 2D images by minimizing the difference between the observed positions of the landmarks in the 2D images and the projections of their corresponding predicted 3D positions. Then, from the reconstructed 3D face, we calculate a set of geometric features, and we use them together with the texture information around those landmarks to train a classifier to quantify facial dysmorphology and to detect genetic syndromes.

2 Methods

2.1 Generic Face Model Estimation

To reconstruct the 3D face shape of a subject from different 2D pictures, we used the 3DMM Basel Face Model (BFM) [12], which was built from 3D scans of 100 male and 100 female faces using principal component analysis. We selected a set of vertices on the 3DMM corresponding to the landmarks defined on the 2D face images, as shown in Fig. 1. In addition to the 68 landmarks automatically detected in the frontal images based on [13], we incorporated a set of 8 manually placed landmarks to better describe the nose region. We also placed 25 landmarks on each profile image.

Fig. 1. Workflow of the proposed method to identify facial dysmorphology associated with genetic syndromes from unconstrained frontal and profile photographs of a patient. Note the landmarks used on the frontal and profile photographs. The pose parameters \( {\mathbb{P}}^{l}, l \in \left\{ {1,2,3} \right\} \), for the \( l \)th 2D photograph, and the shape coefficients b are iteratively optimized.

We used a scaled orthographic projection to fit the 3DMM to the 2D pictures, similar to the approach presented in [7] for a single image. With this approach, the 2D projections of the 3D vertices do not depend on the distance from the camera, but only on a uniform scale \( s \in {\mathbb{R}}^{ + } \). That scale is given by the ratio of the focal length of the camera to the mean distance from the camera to the object. Thus, the projected 2D position of a 3D point \( v = \left( {x,y,z} \right)^{T} \) from the 3DMM is

$$ p = s\left( \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix} R_{rot}\, v + t \right), $$
(1)

where \( R_{rot} \in {\mathbb{R}}^{3 \times 3} \) is the 3D rotation matrix and \( t \in {\mathbb{R}}^{2} \) is the 2D translation. The coordinates of vertex v in the 3DMM can be expressed as \( v = Pb + \bar{u} \), where \( b \in {\mathbb{R}}^{S} \) are the shape parameters, \( \bar{u} \in {\mathbb{R}}^{3n} \) is the mean shape with n vertices, and \( P \in {\mathbb{R}}^{3n \times S} \) are the S principal components.
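As a concrete illustration of this projection model, the following NumPy sketch implements Eq. 1 for a batch of vertices. This is our own illustrative code under the definitions above, not the authors' implementation; function and variable names are ours.

```python
import numpy as np

def scaled_orthographic_projection(v, R_rot, t, s):
    """Project 3D points to 2D with the scaled orthographic model of Eq. 1.

    v     : (n, 3) array of 3D vertex coordinates.
    R_rot : (3, 3) rotation matrix.
    t     : (2,)  2D translation.
    s     : positive scalar (focal length / mean camera-to-object distance).
    """
    # Orthographic projection keeps only the first two rows of the
    # rotated coordinates, i.e., it drops the depth (z) component.
    P_ortho = np.array([[1.0, 0.0, 0.0],
                        [0.0, 1.0, 0.0]])
    return s * ((P_ortho @ R_rot @ v.T).T + t)  # (n, 2) projected points
```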

The 3DMM was fitted to each 2D image l by minimizing the projection error (\( E_{l} \)),

$$ E_{l} = \frac{1}{n}\sum\limits_{i = 1}^{n} \left\| q_{i}^{l} - s^{l} \left( R^{l} v_{i}^{l} + t^{l} \right) \right\|_{F}^{2}, $$
(2)

where \( l \in \left\{ {1,2,3} \right\} \) indexes the frontal (index 1) and the two profile (indices 2 and 3) 2D images, \( \left\| \cdot \right\|_{F} \) is the Frobenius norm, \( q_{i}^{l} \) are the observed 2D landmarks in the image, \( v_{i}^{l} = P_{i}^{l} b + \bar{u}_{i}^{l} \) are the corresponding selected vertices of the 3DMM, \( R^{l} \) is the matrix formed by the first two rows of the rotation \( R_{rot} \) in Eq. 1, and \( t^{l} \) and \( s^{l} \) are the translation and scaling of the \( l \)th image, respectively.
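The projection error of Eq. 2 can then be evaluated per image. A minimal sketch, assuming \( R^{l} \) is stored as the \( 2 \times 3 \) matrix described above:

```python
import numpy as np

def projection_error(q, v, R, t, s):
    """Projection error E_l of Eq. 2 for one image.

    q : (n, 2) observed 2D landmarks q_i^l.
    v : (n, 3) corresponding 3DMM vertices v_i^l = P_i^l b + u_bar_i^l.
    R : (2, 3) first two rows of the rotation matrix.
    """
    residuals = q - s * (v @ R.T + t)             # (n, 2) per-landmark residuals
    return np.mean(np.sum(residuals ** 2, axis=1))
```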

Since the optimization of Eq. 2 for the three images is not a convex problem, we solved it in three steps: (A) first we estimated the pose parameters (\( R^{l} ,t^{l} ,s^{l} \)) for each 2D image; (B) then we estimated the shape coefficients (\( b \)) as a linear least squares problem; and (C) we refined the pose parameters and shape coefficients simultaneously as a nonlinear least squares problem.

(A) Pose Estimation

We made an initial estimation of the pose parameters \( R^{l} \), \( t^{l} \), and \( s^{l} \) using the constrained pose from orthography and scaling method [7]. With this approach, we approximated the perspective projection with a scaled orthographic projection (Eq. 1) by solving the following linear system:

$$ \mathop{\arg\min}\limits_{R^{l},\,t^{l},\,s^{l}} \frac{1}{2}\left\| C\phi - {\mathcal{H}} \right\|_{2}^{2}, $$
(3)

where \( C = s^{l} R^{l} P_{i}^{l} \) projects the selected 3DMM vertices in homogeneous coordinates, \( {\mathcal{H}} = q_{i}^{l} - s^{l} \left( {R^{l} \bar{u}_{i}^{l} + t^{l} } \right) \) is the concatenation, over the \( n \) landmarks \( q_{i}^{l} = \left( {x_{i}, y_{i}} \right)^{T} \) observed on the \( l \)th 2D image, of the residuals with respect to the projected mean-shape vertices \( \bar{u}_{i}^{l} \), and \( \phi \) contains the estimated coefficients, from which we extract the pose parameters \( R^{l}, t^{l}, s^{l} \). This model allows for 6 degrees of freedom: 3 coefficients for the 3D rotation, 2 for the translation in the 2D projection plane, and 1 for isotropic scaling.

Unlike our formulation in Eq. 2, in Eq. 3 we represent the rotation about each axis as a separate scalar angle, instead of a single matrix representing all rotations. We used singular value decomposition to ensure that the estimated \( R^{l} \) was a valid rotation matrix. After the initial pose estimation using Eq. 3, we refined the pose parameters by minimizing the projection error \( E_{l} \) in Eq. 2 with respect to those parameters using the trust-region reflective algorithm [14].
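The following is a minimal sketch of this initialization, in the spirit of pose from orthography and scaling: a linear least-squares fit of an affine camera followed by SVD orthogonalization. It assumes landmarks and mean-shape vertices as NumPy arrays; the exact constrained solver of [7] may differ in its parameterization.

```python
import numpy as np

def estimate_pose(q, u_bar):
    """Initial scaled-orthographic pose from 2D-3D landmark correspondences.

    q     : (n, 2) observed 2D landmarks.
    u_bar : (n, 3) corresponding mean-shape vertices.
    Returns R (2x3 rotation rows), t (2,), and the isotropic scale s.
    """
    n = q.shape[0]
    # Each 2D coordinate is a linear function of the 3D vertex (homogeneous).
    A = np.hstack([u_bar, np.ones((n, 1))])       # (n, 4) design matrix
    M = np.linalg.lstsq(A, q, rcond=None)[0].T    # (2, 4) affine camera [sR | st]
    sR, st = M[:, :3], M[:, 3]
    # Project sR onto the nearest scaled rotation with an SVD, as in the text.
    U, sigma, Vt = np.linalg.svd(sR)
    s = sigma.mean()                              # isotropic scale
    R = U @ np.array([[1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0]]) @ Vt      # orthonormal 2x3 rotation rows
    t = st / s
    return R, t, s
```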

(B) Shape Estimation

Once the pose parameters were calculated, we estimated the shape coefficients b by concatenating the locations of the observed landmarks in the 2D images of the 3 views of a subject, and iteratively minimizing \( \mathop \sum \limits_{l = 1}^{3} E_{l} \) with respect to b, i.e., the difference between these locations and the 2D projections of their corresponding vertices in the 3DMM. During the optimization, each shape parameter was constrained to the range \( \left[ { - 3\lambda ,3\lambda } \right] \) to ensure a plausible shape, where \( \lambda \) is the eigenvalue associated with its principal component in the 3DMM. The 2D projections for the \( l \)th image were computed using that image's own pose parameters (\( R^{l} \), \( t^{l} \), and \( s^{l} \)), while a single set of shape coefficients was estimated simultaneously from the 3 images.
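With the poses fixed, this step is a bound-constrained linear least-squares problem, and can be sketched, e.g., with SciPy's lsq_linear. The data layout and names below are illustrative assumptions, not the authors' code.

```python
import numpy as np
from scipy.optimize import lsq_linear

def estimate_shape(views, eigvals):
    """Estimate the shared shape coefficients b from all views (step B).

    views   : one dict per image with keys 'q' (n, 2) landmarks,
              'P' (n, 3, S) principal-component slices, 'u' (n, 3) mean-shape
              vertices, and pose 'R' (2, 3), 't' (2,), 's' (scalar).
    eigvals : (S,) eigenvalues of the 3DMM principal components.
    """
    A_rows, r_rows = [], []
    for view in views:
        R, t, s = view['R'], view['t'], view['s']
        for q_i, P_i, u_i in zip(view['q'], view['P'], view['u']):
            A_rows.append(s * (R @ P_i))               # (2, S) term linear in b
            r_rows.append(q_i - s * (R @ u_i + t))     # (2,) residual target
    A, r = np.vstack(A_rows), np.concatenate(r_rows)
    bound = 3.0 * eigvals                              # [-3*lambda, 3*lambda]
    return lsq_linear(A, r, bounds=(-bound, bound)).x
```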

(C) Global Refinement

Since different pose parameters were optimized for the different 2D images, we performed a bundle adjustment to iteratively align the 3 views (frontal and two profile images). We used the trust-region reflective algorithm to solve the following non-linear optimization:

$$ \mathop{\arg\min}\limits_{b,\,R^{l},\,t^{l},\,s^{l}} \left( \sum\limits_{l = 1}^{3} w_{l} E_{l} + \delta \sum\limits_{i = 1}^{k} \left( \frac{b_{i}}{\sqrt{\lambda_{i}}} \right)^{2} \right), $$
(4)

where \( \sum\nolimits_{i = 1}^{k} {\left( {b_{i} /\sqrt {\lambda_{i} } } \right)^{2} } \) is the shape prior adopted from [7] to ensure the plausibility of the solution, k is the number of principal components of the 3DMM, \( \lambda_{i} \) is the \( i \)th eigenvalue of the 3DMM, \( w_{l} \) is the weight of the \( l \)th image, calculated as a function of the number of landmarks in that image as in [7], and \( \delta \) is the weight of the shape prior, also as in [7]. Both the pose parameters and the shape coefficients were estimated simultaneously using Eq. 4, thus obtaining the final face shape estimation given by the shape parameters \( b. \)
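The sketch below shows how such a refinement could be set up with SciPy's trust-region reflective solver. Here `unpack` and `projection_residuals` are hypothetical helpers standing in for the parameter bookkeeping and the per-view residuals of Eq. 2; the residual layout is our own illustrative choice.

```python
import numpy as np
from scipy.optimize import least_squares

def refine_all(theta0, views, eigvals, weights, delta):
    """Joint refinement of shape b and per-view pose (Eq. 4, step C).

    `unpack` splits the parameter vector into b and the three poses;
    `projection_residuals` returns the (n, 2) per-landmark residuals of
    Eq. 2 for one view. Both are hypothetical helpers, not library calls.
    """
    def residuals(theta):
        b, poses = unpack(theta)
        res = []
        for w, view, pose in zip(weights, views, poses):
            r = projection_residuals(b, view, pose)    # (n, 2)
            # sqrt(w / n) so the summed squares reproduce w_l * E_l.
            res.append(np.sqrt(w / len(r)) * r.ravel())
        # Shape prior of Eq. 4: delta * sum_i (b_i / sqrt(lambda_i))^2.
        res.append(np.sqrt(delta) * (b / np.sqrt(eigvals)))
        return np.concatenate(res)

    # method='trf' is SciPy's trust-region reflective algorithm.
    return least_squares(residuals, theta0, method='trf')
```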

2.2 Identification of Dysmorphology Associated with Genetic Syndromes

Once we estimated the 3D shape of the face, our goal was to detect facial dysmorphology associated with genetic syndromes. To that end, we first computed the set of 24 geometric facial features shown in Fig. 2, which have been shown to be relevant for identifying genetic syndromes [3, 4]. Unlike these previous works, our approach used the estimated 3D geometric measurements instead of their 2D projections.

Fig. 2. Geometric measurements used to identify facial dysmorphology. \( d_{horizontal} \) and \( d_{vertical} \) were used to normalize horizontal and vertical distances, respectively.

As presented in [3, 4], appearance information around each landmark provides meaningful cues for detecting genetic syndromes. For that reason, we followed the approach described in [4] to quantify the texture around each landmark in the 2D photographs. In summary, we calculated the local binary pattern (LBP) of a patch around each landmark. Then, we used a 2D extension of linear discriminant analysis [4] to convert each LBP into a single score per landmark (Fig. 2, yellow points), which describes how likely the local appearance is to indicate dysmorphology.
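A minimal sketch of the LBP computation around one landmark using scikit-image is shown below. The patch size, LBP parameters, and histogram binning are illustrative choices, not those of [4].

```python
import numpy as np
from skimage.feature import local_binary_pattern

def landmark_lbp_histogram(gray, x, y, half=16, n_points=8, radius=1):
    """LBP histogram of the patch around one landmark.

    gray : 2D grayscale image; (x, y) is the landmark position in pixels.
    """
    patch = gray[y - half:y + half, x - half:x + half]
    codes = local_binary_pattern(patch, n_points, radius, method='uniform')
    n_bins = n_points + 2                  # number of 'uniform' LBP codes
    hist, _ = np.histogram(codes, bins=n_bins, range=(0, n_bins), density=True)
    return hist  # input to the 2D-LDA projection that yields one score
```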

From all the geometric and texture features, we first selected the most discriminative ones using recursive feature elimination: we trained a linear support vector machine classifier and recursively eliminated the features with the lowest weights. Then, we evaluated the accuracy of our approach in identifying facial dysmorphology associated with genetic syndromes using leave-one-out cross-validation.
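A sketch of this selection and evaluation pipeline with scikit-learn follows. The number of retained features is a free parameter not specified in the text, and the random placeholder data stand in for the real feature matrix and labels.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Placeholder data standing in for the real features and labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(48, 40))        # geometric + texture features per subject
y = rng.integers(0, 2, size=48)      # 1 = genetic syndrome, 0 = healthy control

# Recursive feature elimination driven by linear-SVM weights, then a final
# linear SVM on the retained features; n_features_to_select is an assumption.
model = make_pipeline(
    StandardScaler(),
    RFE(LinearSVC(dual=False), n_features_to_select=10),
    LinearSVC(dual=False),
)
accuracy = cross_val_score(model, X, y, cv=LeaveOneOut()).mean()
print(f"Leave-one-out accuracy: {accuracy:.2f}")
```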

2.3 Datasets

We collected three 2D photographs (frontal, left profile, and right profile) from a group of 48 subjects (22 male and 26 female; average age 4 ± 3 years; age range 1 month to 12 years) of diverse ancestry, using an in-house smartphone app. Twenty-four subjects presented genetic syndromes (including Down, Noonan, Turner, and Wolf-Hirschhorn syndromes), and the other 24 were healthy controls. The two groups were matched by age, ethnicity, and gender.

3 Experimental Results and Discussion

To evaluate the accuracy of the 3D face shape estimation, we computed the point-to-point root mean square error (RMSE) and standard deviation (SD) between the projected 2D positions of the vertices of the estimated 3D face shape and their corresponding locations observed in the 2D images. We normalized all differences by the face size, similar to [3, 13].
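For clarity, this metric can be written as the short sketch below; the exact face-size normalization follows [3, 13] only in spirit.

```python
import numpy as np

def normalized_rmse(projected, observed, face_size):
    """Point-to-point RMSE between projected and observed (n, 2) landmarks,
    expressed as a percentage of the face size."""
    d = np.linalg.norm(projected - observed, axis=1)   # per-landmark distances
    return 100.0 * np.sqrt(np.mean(d ** 2)) / face_size
```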

Table 1 shows the RMSE of the face shape estimated using one, two, or three photographs. We obtained an average reconstruction error of 2.66 ± 0.43% using the 3 photographs simultaneously, improving by 44%, 49%, and 48% the results obtained over all 3 views when using only the frontal, right profile, and left profile photographs, respectively. These improvements were statistically significant (p < 0.001 for all), as determined by the Wilcoxon signed-rank test. As may be expected, the lowest error at each individual view (frontal or profile) was obtained when using only the photograph of that view: fitting all 3 views simultaneously makes the result at any single view slightly worse, but it substantially decreases the standard deviation, which indicates better stability of the method.

Table 1. Errors obtained when estimating the 3D face using different combinations of the frontal (F), left profile (L), and right profile (R) images. Lower values are better.

Furthermore, we compared the faces estimated with our proposed method against those obtained using state-of-the-art methods [7, 10]. Since those methods were designed to work with single images, for a fair comparison only the frontal image of each subject was used. In addition, the method of Bas et al. [7] was adapted to use our landmark correspondences. As shown in Table 2, our method outperforms the state-of-the-art methods. An example of the landmarks estimated with the proposed method is shown in Fig. 3, where the differences between the estimated landmark positions projected onto the 2D photographs and their true locations are small. These results show that the proposed method provides a face shape reconstruction that is closer to the observations from the 2D photographs.

Table 2. Comparison of RMSE (%) between the proposed method and state-of-the-art methods.

Fig. 3. Faces reconstructed using the different methods. The right column shows the acquired 2D photographs. Top row: the projected 2D locations (red) of the vertices of the estimated 3D face shape and the ground truth (green) in the 2D photographs. Bottom row: the estimated 3D face shapes; the red dots indicate the vertices corresponding to the landmarks in the 2D photographs.

Finally, cross-validation of the classifier trained on the geometric measurements estimated from our reconstructed 3D face shapes yielded an accuracy of 73%, compared to the 58% obtained using the geometric measurements from the 2D photographs (p < 0.001). The accuracy increased to 96% (sensitivity 96%, specificity 100%) when we combined the estimated 3D measurements with the local texture information.

A potential limitation is the use of a statistical model built from a population older than our pediatric cohort, which can easily be addressed as more data become available. However, the innovation of our method and formulation is independent of the statistical model used. Even with this limitation, our method outperformed state-of-the-art approaches.

4 Conclusions

We presented a method for the accurate reconstruction of the 3D shape of the face from unconstrained 2D photographs using a statistical 3D morphable model. Our method achieved the lowest reconstruction error compared with other state-of-the-art approaches on single photographs. Moreover, we showed that the 3D measurements estimated with our framework outperformed the 2D measurements for the quantification of facial features used to assess dysmorphology associated with genetic syndromes. Importantly, the proposed framework does not require camera calibration, which allowed us to acquire the pictures using a standard mobile phone. This makes our technology easily translatable to the clinic, with the potential to assist in earlier detection of genetic syndromes.