
1 Introduction

Three-dimensional (3D) face models have recently been employed to assist pose- or expression-invariant face recognition [3, 14, 42], achieving state-of-the-art performance. A crucial step in these 3D face-assisted face recognition methods is reconstructing the 3D face model from a two-dimensional (2D) face image. Beyond face recognition, 3D face reconstruction is also useful in other face-related tasks, such as facial expression analysis [7, 36] and facial animation [4, 5]. While many 3D face reconstruction methods are available, they require landmarks on the face image as input and have difficulty handling large-pose faces whose landmarks are invisible due to self-occlusion.

Existing studies tackle the problems of face alignment (or facial landmark localization) and 3D face reconstruction separately. However, the two problems are chicken-and-egg. On one hand, 2D face images are projections of 3D faces onto the 2D plane; knowing a 3D face and a 3D-to-2D mapping function, it is easy to compute the visibility and position of the 2D landmarks. On the other hand, landmarks provide rich information about facial geometry, which is the basis of 3D face reconstruction. Figure 1 illustrates the correlation between 2D landmarks and the 3D face: the visibility and position of landmarks in the projected 2D image are determined by three factors, namely the 3D face shape, the 3D deformation due to expression and pose, and the camera projection parameters. Denote a 3D face shape as S and its 2D landmarks as U. The formation of 2D landmarks from the 3D face can then be represented by \(U = f_{C} \circ f_{P} \circ f_{E}(S)\), where \(f_{C}\) is the camera projection, and \(f_{P}\) and \(f_{E}\) are the deformations caused by pose and expression, respectively. Given such a clear correlation between 2D landmarks U and 3D shape S, the two should ideally be solved jointly, rather than separately as in prior works; this joint solution is the core of this work.

Fig. 1. We view 2D landmarks as generated from a 3D face through 3D expression (\(f_E\)) and pose (\(f_P\)) deformations and camera projection (\(f_C\)) (top row). While conventional face alignment and 3D face reconstruction are two separate tasks, with the latter requiring the former as input, this paper performs the two tasks jointly, i.e., reconstructing a pose-expression-normalized (PEN) 3D face and estimating visible/invisible landmarks (green/red points) from a 2D face image with arbitrary pose and expression. (Color figure online)

Motivated by this observation, this paper proposes to simultaneously solve the two problems of face alignment and 3D face shape reconstruction in one unified framework. To this end, two sets of regressors are jointly learned from a training set of paired annotated 2D face images and 3D face shapes. The two sets of regressors are alternately applied to locate the landmarks on an input 2D image and, meanwhile, reconstruct its pose-expression-normalized (PEN) 3D face shape. Note that most single-image 3D face reconstruction algorithms aim to assist face recognition. For this purpose, we argue that reconstructing the PEN 3D shape is more useful than reconstructing a 3D shape that has the same pose and expression as the input 2D face [23, 28, 31].

The rest of this paper is organized as follows. Section 2 briefly reviews related work in the literature. Section 3 introduces in detail the proposed joint face alignment and 3D face reconstruction method. Section 4 reports experimental results. Section 5 concludes the paper.

2 Related Work

Face Alignment. Classical face alignment methods, such as the Active Shape Model (ASM) [9, 11] and the Active Appearance Model (AAM) [8, 25], search for landmarks based on global shape models and generative texture models. The Constrained Local Model (CLM) [10] also utilizes global shape models to regularize landmark locations, but employs discriminative local texture models. Regression-based methods [6, 27, 35, 39] have recently been proposed to directly estimate landmark locations by applying cascaded regressors to an input 2D face image. These methods mostly do not consider the visibility of facial landmarks under different view angles. Consequently, their performance degrades substantially for non-frontal faces, and their detected landmarks can be ambiguous because the anatomically correct landmarks might be invisible due to self-occlusion (see Fig. 1).

A few methods focus on large-pose face alignment; they can be roughly divided into two categories: multi-view based and 3D model based. Multi-view methods [37, 40] define different sets of landmarks as templates, one for each view range. Given an input image, they fit the multi-view templates to it and choose the best-fitted one as the final result. These methods are usually complicated to apply, and cannot detect invisible self-occluded landmarks. 3D model based methods, in contrast, handle self-occluded landmarks better with the assistance of 3D face models. Their basic idea is to fit a 3D face model to the input image to recover the 3D landmark locations. Most of these methods [17, 18, 41] use 3D morphable models (3DMM) [2], either a simplified one with a sparse set of landmarks [18, 41] or a relatively dense one [17]. They estimate the 3DMM parameters by using cascaded regressors with texture features as input. In [18], the visibility of landmarks is explicitly computed, and the method can cope with face images of yaw angles ranging from \(-90^{\circ }\) to \(90^{\circ }\), whereas the method in [17] does not work properly for faces of yaw angles beyond \(60^{\circ }\). In [33], Tulyakov and Sebe propose to directly estimate the 3D landmark locations via texture-feature-based regressors for faces of yaw angles up to \(50^{\circ }\).

These existing 3D model based methods establish regressions between 2D image features and 3D landmark locations (or, indirectly, 3DMM parameters). While our proposed approach is also based on a 3D model, unlike existing methods it carries out regressions both on 2D images and in the 3D space: regressions on 2D images predict 2D landmarks, while regressions in the 3D space predict 3D landmark locations. By integrating both regressions, our method locates landmarks more accurately and better handles self-occluded landmarks. It thus works well for images of arbitrary view angles in \([-90^{\circ }, 90^{\circ }]\).

3D Face Reconstruction. Estimating 3D face geometry from a single 2D image is an ill-posed problem. Existing methods, such as Shape from Shading (SFS) and 3DMM, thus depend heavily on priors or constraints. SFS-based methods [20, 31] usually utilize an average 3D face model as a reference, and assume a Lambertian lighting model for the 3D face surface. One limitation of SFS methods lies in the assumed connection between 2D texture cues and 3D shape, which is too weak to discriminate among different individuals. 3DMM [2, 3, 28] establishes statistical parametric models for both texture and shape, and represents a 3D face as a linear combination of basis shapes and textures. To recover the 3D face from a 2D image, 3DMM-based methods estimate the combination coefficients by minimizing the discrepancy between the input 2D face image and the image rendered from the reconstructed 3D face. They cope better with 2D face images of varying illumination and pose. However, they still suffer from invisible facial landmarks when the input face has large pose angles. To deal with extreme poses, Lee et al. [22], Qu et al. [26] and Liu et al. [23] propose to discard the self-occluded landmarks or treat them as missing data. All these existing 3D face reconstruction methods require landmarks as input; consequently, they either mark the landmarks manually or employ standalone face alignment methods to locate them automatically. Moreover, existing methods always generate 3D faces that have the same pose and expression as the input image, which may not be desired in face recognition due to the challenge of matching 3D faces with expressions [12]. In this paper, we improve 3D face reconstruction in two aspects: (i) integrating the face alignment step into the 3D face reconstruction procedure, and (ii) reconstructing PEN 3D faces, which we believe are more useful for face recognition.

3 Proposed Method

3.1 Overview

We denote an n-vertex 3D face shape of neutral expression and frontal pose as,

$$\begin{aligned} S = \begin{pmatrix} x_{1} &{}\quad x_{2} &{}\quad \cdots &{}\quad x_{n} \\ y_{1} &{}\quad y_{2} &{}\quad \cdots &{}\quad y_{n}\\ z_{1} &{}\quad z_{2} &{}\quad \cdots &{}\quad z_{n}\\ 1 &{}\quad 1 &{}\quad \cdots &{}\quad 1 \end{pmatrix}, \end{aligned}$$
(1)

and a subset of S with columns corresponding to l landmarks as \({S}_{L}\). The projections of these landmarks on the 2D face image \(\mathbf {I}\) are represented by

$$\begin{aligned} U = \begin{pmatrix} u_{1}&{}\quad u_{2} &{}\quad \cdots &{}\quad u_{l}\\ v_{1}&{}\quad v_{2} &{}\quad \cdots &{}\quad v_{l} \end{pmatrix} = f_{C}\circ f_{P} \circ f_{E} ({S}_{L}) \approx \mathbf {M}\times {S}_{L}. \end{aligned}$$
(2)

Here, we use a 3D-to-2D mapping matrix \(\mathbf {M}\) to approximate the composite effect of the expression- and pose-induced deformations and the camera projection. Given an input 2D face image \(\mathbf {I}\), our goal is to simultaneously locate its landmarks U and reconstruct its 3D face shape S. Note that, in some contexts, we also write the 3D face shape and the landmarks as column vectors: \(\mathbf {S}=(x_{1}, y_{1}, z_{1}, x_{2}, y_{2}, z_{2}, \cdots , x_{n}, y_{n}, z_{n})^{\mathsf {T}}\) and \(\mathbf {U}=(u_{1}, v_{1}, u_{2}, v_{2}, \cdots , u_{l}, v_{l})^{\mathsf {T}}\), where '\(\mathsf {T}\)' denotes the transpose.
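To make the notation concrete, the following is a minimal NumPy sketch (not part of the original method) of the shape matrix of Eq. (1), the landmark projection of Eq. (2), and the column-vector form; the index array `landmark_idx`, holding the l landmark vertex indices, is an illustrative assumption.

```python
import numpy as np

def project_landmarks(S, M, landmark_idx):
    """Project the 3D landmark subset S_L onto the image plane: U ~= M @ S_L.

    S            : (4, n) 3D shape in homogeneous coordinates, Eq. (1)
    M            : (2, 4) 3D-to-2D mapping matrix approximating f_C . f_P . f_E
    landmark_idx : (l,)  vertex indices of the l landmarks
    Returns U    : (2, l) landmark positions in the image, Eq. (2)
    """
    S_L = S[:, landmark_idx]        # (4, l) columns of S at the landmarks
    return M @ S_L                  # (2, l)

def vectorize_shape(S):
    """Column-vector form S = (x1, y1, z1, ..., xn, yn, zn)^T of Sect. 3.1."""
    return S[:3, :].T.reshape(-1)   # interleave per-vertex (x, y, z)
```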

Fig. 2. Flowchart of the proposed joint face alignment and 3D face reconstruction method.

Figure 2 shows the flowchart of the proposed method. For an input 2D face image \(\mathbf {I}\), the 3D face shape \(\mathbf {S}\) is initialized as the mean 3D shape of the training faces. The landmarks \(\mathbf {U}\) are initialized by fitting the mean landmarks of the frontal training faces into the face region specified by a bounding box in \(\mathbf {I}\) via a similarity transform. \(\mathbf {U}\) and \(\mathbf {S}\) are then iteratively updated by applying a series of regressors. Each iteration contains three main steps: (i) updating landmarks, (ii) updating the 3D face shape, and (iii) refining landmarks.
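As a concrete illustration of the initialization, here is a minimal sketch assuming NumPy; `mean_landmarks` and `bbox` are illustrative names, and the similarity transform degenerates to uniform scale plus translation because the mean shape is already upright.

```python
import numpy as np

def init_landmarks(mean_landmarks, bbox):
    """Fit the (2, l) mean landmarks into the face bounding box
    bbox = (x, y, w, h) via uniform scaling and translation."""
    x, y, w, h = bbox
    mins = mean_landmarks.min(axis=1, keepdims=True)      # (2, 1)
    span = mean_landmarks.max(axis=1, keepdims=True) - mins
    scale = min(w / span[0, 0], h / span[1, 0])           # fit inside the box
    U0 = (mean_landmarks - mins) * scale
    U0[0] += x                                            # translate into the
    U0[1] += y                                            # bounding box
    return U0
```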

Updating landmarks. This step updates the landmark locations from \(\mathbf {U}^{k-1}\) to \(\hat{\mathbf {U}}^{k}\) based on the texture features in the input 2D image, similar to conventional cascaded-regressor based 2D face alignment [35]. The adjustment to the landmark locations in the \(k^{\texttt {th}}\) iteration, \(\varDelta {\mathbf {U}^{k}}\), is determined by the local texture features around \(\mathbf {U}^{k-1}\) via a regressor,

$$\begin{aligned} \varDelta {\mathbf {U}}^{k} = R_{U}^{k}(h(\mathbf {I}, \mathbf {U}^{k-1})), \end{aligned}$$
(3)

where \(h(\mathbf {I}, \mathbf {U})\) denotes the texture features extracted around the landmarks \(\mathbf {U}\) in the image \(\mathbf {I}\), and \(R_{U}^{k}\) is a regression function. The landmarks can then be updated by \(\hat{\mathbf {U}}^{k} = \mathbf {U}^{k-1} + \varDelta {\mathbf {U}}^{k}\). The method for learning these landmark regressors is introduced in Sect. 3.3.

Updating 3D face shape. In this step, the landmark adjustment obtained above is used to estimate the adjustment of the 3D face shape. Specifically, a regression function \(R_{S}^{k}\) models the correlation between the landmark adjustment \(\varDelta {\mathbf {U}}^{k}\) and the expected adjustment to the 3D shape \(\varDelta {\mathbf {S}}^{k}\), i.e.,

$$\begin{aligned} \varDelta {\mathbf {S}}^{k} = R_{S}^{k}(\varDelta {\mathbf {U}}^{k}). \end{aligned}$$
(4)

The 3D shape can then be updated by \(\mathbf {S}^{k} = \mathbf {S}^{k-1} + \varDelta {\mathbf {S}}^{k}\). The method for learning these shape regressors is given in Sect. 3.4.

Refining landmarks. Once a new estimate of the 3D shape is obtained, the landmarks can be refined accordingly. For this purpose, the 3D-to-2D mapping matrix is needed; hence, we estimate \(\mathbf {M}^{k}\) from \(\mathbf {S}^{k}\) and \(\hat{\mathbf {U}}^{k}\). The refined landmarks \(\mathbf {U}^{k}\) are then obtained by projecting \(\mathbf {S}^{k}\) onto the image via \(\mathbf {M}^{k}\) according to Eq. (2). During this process, the visibility of the landmarks is also re-computed. Details of this step are given in Sect. 3.5.
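Putting the three steps together, one iteration of the inference stage might look like the following sketch. It assumes NumPy, the helper functions sketched in Sects. 3.3-3.5 below (`extract_features`, `fit_mapping`, `landmark_visibility`), and a hypothetical `avg_normals` helper returning the per-landmark average surface normals; all names are illustrative.

```python
import numpy as np

def joint_iteration(I, U_prev, S_prev, R_U_k, R_S_k, landmark_idx, vis):
    """One iteration: U_prev (2l,), S_prev (3n,) in the column-vector forms
    of Sect. 3.1; R_U_k, R_S_k are the learned linear regressor matrices."""
    # (i) update landmarks from local texture features, Eq. (3)
    dU = R_U_k @ extract_features(I, U_prev.reshape(2, -1, order='F'), vis)
    U_hat = U_prev + dU
    # (ii) update the 3D shape from the landmark adjustment, Eq. (4)
    S = S_prev + R_S_k @ dU
    # (iii) refine landmarks: estimate M^k by Eq. (8), re-project by Eq. (2)
    S_L = np.vstack([S.reshape(-1, 3).T[:, landmark_idx],
                     np.ones(len(landmark_idx))])      # (4, l) homogeneous
    M = fit_mapping(U_hat.reshape(2, -1, order='F'), S_L)
    U = (M @ S_L).T.reshape(-1)                        # refined (2l,) vector
    vis = np.array([landmark_visibility(n, M)          # Eq. (9) per landmark
                    for n in avg_normals(S, landmark_idx)])
    return U, S, vis
```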

3.2 Training Data Preparation

Before detailing the three steps, we first introduce the training data needed for learning the landmark and 3D shape regressors. Since the purpose of these regressors is to gradually adjust the estimated landmarks and 3D shape towards their true values, we need a sufficient number of data triplets \(\{(\mathbf {I}_{i}, \mathbf {S}^{*}_{i}, \mathbf {U}^{*}_{i})\vert i=1,2,\cdots ,N\}\), where \(\mathbf {S}^{*}_{i}\) and \(\mathbf {U}^{*}_{i}\) are, respectively, the ground truth 3D shape and landmarks for the image \(\mathbf {I}_{i}\), and N is the total number of training samples. Dense correspondence has been established among the vertices of all the 3D face shapes; in other words, they have the same number of vertices, and vertices with the same index have the same semantic meaning. Moreover, both visible and invisible landmarks in \(\mathbf {I}_{i}\) are annotated and included in \(\mathbf {U}^{*}_{i}\); for invisible landmarks, the annotated positions should be the anatomically correct positions (e.g., red points in Fig. 1).

Obviously, to make the regressors robust to expression and pose variations, the training data should contain 2D face images of varying expressions and poses. The 3D shape \(\mathbf {S}^{*}_{i}\) corresponding to \(\mathbf {I}_{i}\) can either have the same expression and pose as \(\mathbf {I}_{i}\), or have neutral expression and frontal pose regardless of the expression and pose of \(\mathbf {I}_{i}\). In the former case, the learned regressors output 3D face shapes with the same expression and pose as the input images; in the latter, they generate neutral, frontal 3D shapes for any input image. In either case, dense registration among all 3D shapes \(\mathbf {S}_{i}^{*}\) is needed for regressor learning. In this paper, we follow the latter for two reasons: (i) dense registration of 3D face shapes with different expressions is difficult, and (ii) the reconstructed PEN 3D shapes are preferred for use in 3D face recognition.

It is, however, difficult to find in the public domain such data sets of 3D face shapes and corresponding annotated 2D images with various expressions and poses. We therefore construct two training sets ourselves: one based on BU3DFE [36], and the other based on LFW [16]. The BU3DFE database contains 3D face scans of 56 males and 44 females, acquired in neutral plus six basic expressions (happiness, disgust, fear, anger, surprise and sadness); the basic expressions are acquired at four levels of intensity. These 3D face scans have been manually annotated with 84 landmarks (83 landmarks provided by the database and one nose tip marked by ourselves). For each of the 100 subjects, we select one scan of neutral expression as the ground truth 3D shape. For the remaining six expressions, we choose the scans of the first intensity level, and project them to 2D images with recorded landmark locations. From each of the seven scans, 19 face images are generated with different poses (\(-90^{\circ }\) to \(90^{\circ }\) yaw at a \(10^{\circ }\) interval). As a result, each 3D shape has 133 images of different poses and expressions. We use the method of [13] to establish dense correspondence among the BU3DFE neutral scans.

The LFW database contains 13,233 images of 5,749 subjects. We select 150 subjects, each having at least 10 images, and use the 68 landmarks on these face images provided by the work of [41]. From the neutral frontal image of each subject, we employ the method in [23] to reconstruct a densely registered 3D shape. In total, we obtain 4,149 images of 150 subjects and their corresponding neutral 3D face shapes.

The resulting 3D shapes have \(n=9,677\) vertices for BU3DFE and \(n=53,215\) for LFW. Figure 3 shows some example 2D face images and corresponding 3D faces from the two databases. The 3D shapes in BU3DFE consist of a sparser set of vertices, and consequently look slightly blurry in Fig. 3.

Fig. 3. Example 2D face images with annotated landmarks and corresponding neutral 3D shapes from the BU3DFE and LFW databases.

3.3 Learning Landmark Regressors

According to Eq. (3), the landmark regressors estimate the adjustment to \(\mathbf {U}^{k-1}\) such that the updated landmarks \(\hat{\mathbf {U}}^{k}\) get closer to the true positions. In the training phase, the true positions and visibility of the landmarks are given by the ground truth \(\mathbf {U}^{*}\). The objective of the landmark regressor \(R^{k}_{U}\) is therefore to predict the difference between \(\mathbf {U}^{k-1}\) and \(\mathbf {U}^{*}\). In this paper, we employ linear regressors as the landmark regressors, and learn them by solving the following optimization:

$$\begin{aligned} R^{k}_{U} = \mathop {\arg \min } \limits _{R^{k}_{U}}\sum _{i=1}^N\parallel \left( \mathbf {U}^{*}_{i} - \mathbf {U}^{k-1}_{i}\right) - R^{k}_{U}(h(\mathbf {I}_{i}, \mathbf {U}_{i}^{k-1}))\parallel _2^2, \end{aligned}$$
(5)

which has a closed-form least-squares solution. Note that other regression schemes, such as CNNs [19], can be easily adopted in our framework.

We use 128-dimensional SIFT descriptors [24] as the local feature. The feature vector h is a concatenation of the SIFT descriptors at all l landmarks, i.e., a 128l-dimensional vector. If a landmark is invisible, no feature is extracted and its corresponding entries in h are set to zero. It is worth noting that the regressors estimate the semantic positions of all landmarks, including invisible ones.
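A sketch of the feature mapping \(h(\mathbf{I}, \mathbf{U})\) is given below, using OpenCV's SIFT as an assumed stand-in for the descriptor implementation of [24]; `vis` is the per-landmark visibility from Eq. (9).

```python
import cv2
import numpy as np

def extract_features(I_gray, U, vis, patch_size=32):
    """h(I, U): concatenate the 128-dim SIFT descriptors at the l landmarks
    into a 128l-dim vector; entries of invisible landmarks stay zero."""
    l = U.shape[1]
    h = np.zeros(128 * l, dtype=np.float32)
    visible = [j for j in range(l) if vis[j] > 0]
    keypoints = [cv2.KeyPoint(float(U[0, j]), float(U[1, j]), float(patch_size))
                 for j in visible]
    if keypoints:
        # compute() may drop keypoints too close to the image border; a full
        # implementation would track this, which the sketch skips for brevity
        _, desc = cv2.SIFT_create().compute(I_gray, keypoints)
        for row, j in enumerate(visible):
            h[128 * j:128 * (j + 1)] = desc[row]
    return h
```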

3.4 Learning 3D Shape Regressors

The landmark adjustment \(\varDelta {\mathbf {U}}^{k}\) is also used as the input to the 3D shape regressor \(R^{k}_{S}\). The objective of \(R^{k}_{S}\) is to compute an update to the estimated 3D shape \(\mathbf {S}^{k-1}\) in the \(k^{\texttt {th}}\) iteration that minimizes the difference between the updated 3D shape and the ground truth. Using similar linear regressors, the 3D shape regressors can be learned by solving the following optimization via least squares:

$$\begin{aligned} R^{k}_{S} = \mathop {\arg \min } \limits _{{R^{k}_S}}\sum _{i=1}^N\parallel (\mathbf {S}^{*}_{i} - \mathbf {S}^{k-1}_{i}) - R^{k}_{S}\left( \varDelta {\mathbf {U}}^{k}_{i}\right) \parallel _2^2, \end{aligned}$$
(6)

with its closed-form solution as

$$\begin{aligned} R^{k}_{S}=\varDelta \mathbb {S}^{k}(\varDelta \mathbb {U}^{k})^{\mathsf {T}}(\varDelta \mathbb {U}^{k}(\varDelta \mathbb {U}^{k})^{\mathsf {T}})^{-1}, \end{aligned}$$
(7)

where \(\varDelta \mathbb {S}^{k} = \mathbb {S}^{*}-\mathbb {S}^{k-1}\) and \(\varDelta \mathbb {U}^{k}\) are, respectively, the ensembles of 3D shape and landmark adjustments. \(\mathbb {S}\in \mathbb {R}^{3n\times N}\) and \(\mathbb {U}\in \mathbb {R}^{2l\times N}\) denote, respectively, the ensembles of 3D face shapes and 2D landmarks of all training samples, with each column corresponding to one sample. It can be shown that N should be larger than 2l so that \(\varDelta \mathbb {U}^{k}(\varDelta \mathbb {U}^{k})^{\mathsf {T}}\) is invertible. Fortunately, since the set of landmarks used is usually sparse, this requirement is easily satisfied in real-world applications.
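In code, the closed-form solution of Eq. (7) is essentially a one-liner. The sketch below assumes NumPy and adds a small ridge term for numerical stability (an implementation choice, not part of Eq. (7)); the same routine also yields the landmark regressors of Eq. (5), with the feature ensemble in place of \(\varDelta \mathbb {U}^{k}\).

```python
import numpy as np

def learn_shape_regressor(dS, dU, eps=1e-6):
    """dS : (3n, N) target shape adjustments, one column per training sample
       dU : (2l, N) landmark adjustments predicted in step (i)
       Returns R_S^k ~ dS @ dU^T (dU @ dU^T)^{-1} of shape (3n, 2l)."""
    A = dU @ dU.T + eps * np.eye(dU.shape[0])   # (2l, 2l); needs N > 2l to be
    return dS @ dU.T @ np.linalg.inv(A)         # well conditioned without eps
```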

3.5 Estimating 3D-to-2D Mapping and Landmark Visibility

To refine the landmarks with the updated 3D face shape, we have to project the 3D shape onto the 2D image with a 3D-to-2D mapping matrix. In this paper, we dynamically estimate the mapping matrix from \(\mathbf {S}^{k}\) and \(\hat{\mathbf {U}}^{k}\). As discussed in Sect. 3.1, the mapping matrix captures the composite effect of the expression- and pose-induced deformations and the camera projection. Here, we assume a weak perspective camera projection, as in prior work [18, 38], and further assume that the expression- and pose-induced deformation can be approximated by a linear transform. As a result, the mapping matrix \(\mathbf {M}^{k}\) is a \(2\times 4\) matrix, and can be estimated as the least-squares solution to the following fitting problem:

$$\begin{aligned} \mathbf {M}^{k} = \mathop {\arg \min } \limits _{\mathbf {M}^{k}} \parallel \hat{{U}}^{k} - \mathbf {M}^{k}\times {S}^{k}_{L}\parallel _2^2. \end{aligned}$$
(8)

Once the new mapping matrix is computed, the landmarks can be refined as \({U}^{k} = \mathbf {M}^{k} \times {S}^{k}_{L}\).
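A sketch of this mapping estimation, assuming NumPy; `U_hat` is the \(2\times l\) matrix of updated landmarks and `S_L` the \(4\times l\) homogeneous landmark subset of \(\mathbf{S}^{k}\).

```python
import numpy as np

def fit_mapping(U_hat, S_L):
    """Solve min_M ||U_hat - M @ S_L||^2 via least squares, Eq. (8)."""
    # lstsq solves S_L^T @ M^T = U_hat^T, i.e. one 4-vector per image row
    M_T, *_ = np.linalg.lstsq(S_L.T, U_hat.T, rcond=None)
    return M_T.T                                 # (2, 4) mapping matrix
```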

The visibility of the landmarks can then be computed from the mapping matrix \(\mathbf {M}\) using the method in [18]. Suppose the average surface normal around a landmark of the 3D face shape \(\mathbf {S}\) is \(\overrightarrow{\mathbf{n }}\). Its visibility \(\mathbf {v}\) can be measured by

$$\begin{aligned} \mathbf v = \frac{1}{2}\left( 1 + sgn\left( \overrightarrow{\mathbf{n }} \cdot \left( \frac{\mathbf {M}_{1}}{\left\| \mathbf {M}_{1} \right\| }\times \frac{\mathbf {M}_{2}}{\left\| \mathbf {M}_{2} \right\| } \right) \right) \right) , \end{aligned}$$
(9)

where sgn(\(\cdot\)) is the sign function, '\(\cdot \)' denotes the dot product and '\(\times \)' the cross product, and \(\mathbf {M}_{1}\) and \(\mathbf {M}_{2}\) are the left-most three elements of the first and second rows of the mapping matrix \(\mathbf {M}\). This essentially rotates the surface normal and checks whether it points toward the camera.
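Eq. (9) translates directly into code; the sketch below assumes NumPy, with `normal` the average surface normal around one landmark.

```python
import numpy as np

def landmark_visibility(normal, M):
    """Return 1.0 if the rotated normal points toward the camera, else 0.0."""
    m1 = M[0, :3] / np.linalg.norm(M[0, :3])   # left-most 3 entries, row 1
    m2 = M[1, :3] / np.linalg.norm(M[1, :3])   # left-most 3 entries, row 2
    view_dir = np.cross(m1, m2)                # camera viewing axis
    return 0.5 * (1.0 + np.sign(normal @ view_dir))
```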

The whole process of learning the cascaded coupled landmark and 3D shape regressors is summarized in Algorithm 1.

Algorithm 1. Learning the cascaded coupled landmark and 3D shape regressors.
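The following is a minimal sketch of this training procedure, assuming NumPy and the helpers sketched above (`extract_features` for h, `learn_shape_regressor` for the least-squares fits of Eqs. (5) and (7)); the step-(iii) refinement of Sect. 3.5 is abstracted as an assumed callable `refine_all`, and all names are illustrative.

```python
import numpy as np

def train_cascade(images, U_star, S_star, U0, S0, vis0, refine_all, K=5):
    """U_star (2l, N), S_star (3n, N): ground-truth landmark / shape ensembles;
    U0, S0, vis0: initial estimates; returns the K pairs of regressors."""
    U, S, vis = U0.copy(), S0.copy(), vis0.copy()
    R_U, R_S = [], []
    for k in range(K):
        # features of all N samples, one 128l-dim column per sample
        H = np.stack([extract_features(im, U[:, i].reshape(2, -1, order='F'),
                                       vis[:, i])
                      for i, im in enumerate(images)], axis=1)
        R_U_k = learn_shape_regressor(U_star - U, H)    # Eq. (5), same LS form
        dU = R_U_k @ H                                  # step (i)
        R_S_k = learn_shape_regressor(S_star - S, dU)   # Eqs. (6)-(7)
        S = S + R_S_k @ dU                              # step (ii)
        U, vis = refine_all(U + dU, S)                  # step (iii), Sect. 3.5
        R_U.append(R_U_k); R_S.append(R_S_k)
    return R_U, R_S
```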

4 Experiments

4.1 Protocols

We conduct three sets of experiments to evaluate the proposed method: 3D shape reconstruction accuracy, face alignment accuracy, and the benefit to face recognition.

Datasets. The training data are constructed from two public face databases, BU3DFE and LFW, as detailed in Sect. 3.2. Two models are trained, one on each of the two training sets. Our test sets include BU3DFE and AFW (Annotated Faces in-the-Wild) [40]. To evaluate 3D shape reconstruction accuracy, 10-fold cross validation is applied to split the BU3DFE data into training and testing subsets, resulting in 11,970 training samples and 1,330 testing samples per fold. To evaluate face alignment accuracy, the AFW database [40] is tested using the LFW-trained model. AFW is a widely used benchmark in the face alignment literature; it contains 205 images with 468 faces of different poses within \(\pm 90^\circ \). In [30], 337 of these faces were manually annotated with face bounding boxes and 68 landmarks, and we use these annotations in our experiments.

Experiment setup. During training and testing, each image is associated with a bounding box specifying the face region. To initialize the landmarks, the mean of the landmarks over all neutral frontal training images is fitted to the face region via a similarity transform. We set the number of iterations to \(K=5\) (a discussion of convergence is provided in the supplementary material). SIFT descriptors are computed on \(32 \times 32\) local patches around the landmarks, using the implementation of [35].

Fig. 4. MAE of the proposed method on BU3DFE (a) under different yaw angles and (b) under different expressions, i.e., neutral (NE), happy (HA), fear (FE), sad (SA), angry (AN), disgust (DI) and surprise (SU).

Evaluation metrics. Two metrics are used to evaluate 3D face shape reconstruction accuracy: Mean Absolute Error (MAE) and Normalized Per-vertex Depth Error (NPDE). MAE is defined as \(\texttt {MAE} = \frac{1}{N_{T}}\sum _{i=1}^{N_{T}}(\Vert \mathbf {S}^{*}_{i}-\hat{\mathbf {S}}_{i} \Vert /n)\), where \(N_{T}\) is the total number of testing samples, and \(\mathbf {S}^{*}_{i}\) and \(\hat{\mathbf {S}}_{i}\) are the ground truth and reconstructed 3D face shapes of the \(i^{\texttt {th}}\) testing sample. NPDE measures the depth error at the \(j^{\texttt {th}}\) vertex of a testing sample as \(\texttt {NPDE}(x_{j}, y_{j}) = \left( |z^{*}_{j} - \hat{z}_{j}|\right) /\left( z^{*}_{max} - z^{*}_{min}\right) \), where \(z^{*}_{max}\) and \(z^{*}_{min}\) are the maximum and minimum depth values in the ground truth 3D shape of the testing sample, and \(z^{*}_{j}\) and \(\hat{z}_{j}\) are the ground truth and reconstructed depth values at the \(j^{\texttt {th}}\) vertex.
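Both reconstruction metrics translate directly into code; a sketch assuming NumPy, where each shape is a \(3\times n\) array and the overall shape norm in MAE is read as the Frobenius norm:

```python
import numpy as np

def mae(S_gt_list, S_rec_list):
    """Mean Absolute Error over all test samples."""
    n = S_gt_list[0].shape[1]
    return np.mean([np.linalg.norm(S_gt - S_rec) / n
                    for S_gt, S_rec in zip(S_gt_list, S_rec_list)])

def npde(S_gt, S_rec):
    """Per-vertex depth error, normalized by the ground-truth depth range."""
    z_gt, z_rec = S_gt[2], S_rec[2]
    return np.abs(z_gt - z_rec) / (z_gt.max() - z_gt.min())
```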

Face alignment accuracy is measured by the Normalized Mean Error (NME), defined as the mean normalized estimation error over the visible landmarks of all testing samples:

$$\begin{aligned} \texttt {NME} = \frac{1}{N_{T}}\sum _{i=1}^{N_{T}}{\left( \frac{1}{d_{i}}\frac{1}{N^\mathbf{v }_{i}} \sum _{j=1}^{l}\mathbf{v }_{ij}||(\hat{u}_{ij}, \hat{v}_{ij}) - (u^{*}_{ij}, v^{*}_{ij})|| \right) }, \end{aligned}$$
(10)

where \(d_{i}\) is the square root of the face bounding box area of the \(i^{\texttt {th}}\) testing sample, \(N^\mathbf{v }_{i}\) is its number of visible landmarks, and \((u^{*}_{ij}, v^{*}_{ij})\) and \((\hat{u}_{ij}, \hat{v}_{ij})\) are, respectively, the ground truth and estimated coordinates of its \(j^{\texttt {th}}\) landmark.
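For one testing sample, the inner term of Eq. (10) can be sketched as follows, assuming NumPy; the reported NME is the mean of this quantity over all \(N_T\) samples.

```python
import numpy as np

def sample_nme(U_gt, U_est, vis, d):
    """U_gt, U_est: (2, l) landmark matrices; vis: (l,) visibility mask;
    d: square root of the face bounding box area."""
    errs = np.linalg.norm(U_est - U_gt, axis=0)   # per-landmark distances
    return errs[vis > 0].mean() / d               # average over visible ones
```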

4.2 3D Face Reconstruction Accuracy

Reconstruction accuracy across poses. Figure 4(a) shows the average MAE of the proposed method for input 2D images under different pose angles. For a fair comparison with the method in [23], we compute the reconstruction error only on neutral testing images, after rotating the reconstructed 3D faces to frontal view. As can be seen, the average MAE of our method is lower than that of the baseline. Moreover, the error does not increase substantially as the pose angle grows, which demonstrates the effectiveness of the proposed method in handling face images of arbitrary views. Figure 5 shows the reconstruction and face alignment results for one subject.

Reconstruction accuracy across expressions. Figure 4(b) shows the average MAE of the proposed method across expressions. Although the error increases as expressions become more intense, the maximum increment (i.e., SU vs. NE) is below \(7\,\%\). This demonstrates the robustness of the proposed method in normalizing expressions while preserving the individuality of the reconstructed models. Figure 6 shows the reconstruction and face alignment results of a subject under the seven expressions.

Fig. 5. Reconstruction results for a BU3DFE subject at three different pose angles. Column 1 shows the input images. Columns 2 and 3 show the reconstructed ('REC') 3D faces from two views. Column 4 shows the NPDE between the ground truth ('GT') and REC 3D faces. The detected landmarks are shown in Column 5. The last column shows the GT 3D face of this subject, the initial ('INIT') 3D face, and the NPDE between them. NPDE increases as the color changes from blue to red. The average and standard deviation are given below each NPDE map. Note that the same INIT 3D face is used for all input images. (Color figure online)

Fig. 6. Face alignment and reconstruction results for a BU3DFE subject with different expressions. Row 1 shows the input images. Row 2 shows the estimated 3D shapes, and Row 3 shows the NPDE maps with the average and standard deviation. The last row shows the detected landmarks.

Table 1. NME of the proposed method and two baseline methods on AFW.

4.3 Face Alignment Accuracy

For the face alignment evaluation on AFW, we select two recent works as baselines: (1) CDM [37], the first method claimed to perform pose-free face alignment, and (2) PIFA [18], a regression-based method that can predict the anatomically correct locations of landmarks for arbitrary-view face images. We use the executable code of CDM and PIFA to compute their performance on our test set. The CDM code integrates face detection, and it successfully detects and aligns 268 of the 337 testing faces; to compare with CDM, we therefore evaluate the NME on those 268 faces. For PIFA and the proposed method, the face bounding boxes provided by [30] are used. Note that CDM detects 66 landmarks and PIFA detects 21. For a fair comparison, we evaluate the NME on the 18 landmarks in the intersection of the three landmark sets. As shown in Table 1, our accuracy is better than that of both baselines. Figure 7 shows some face alignment results.

Fig. 7. The 18 landmarks detected by the proposed method on AFW images.

4.4 Application to Face Recognition

While there are many recent face alignment and reconstruction works [1, 15, 21, 29, 32, 34], few take a step further and evaluate the contribution of alignment or reconstruction to subsequent tasks. In contrast, we quantitatively evaluate the effect of the reconstructed PEN 3D face shapes on face recognition by performing direct 3D-to-3D shape matching and fusing it with conventional 2D face recognition. Specifically, we choose 70 subjects in BU3DFE to train the proposed regressors, and use the remaining 30 subjects for testing. The neutral frontal face images of the testing subjects compose the gallery, and their faces under 19 poses and 7 expressions (3,990 images in total) are the probes. We use a commercial off-the-shelf (COTS) 2D face matcher (footnote 1) as the baseline. The iterative closest points (ICP) algorithm is applied to match the reconstructed normalized 3D face shapes: it aligns the 3D shapes reconstructed from probe and gallery images and computes the distances between them, which are converted to similarity scores by subtracting them from the maximum distance. These scores are then normalized to the range [0, 1] and fused with the scores of the COTS matcher (which also lie in [0, 1]) by a sum rule. The recognition result for a probe is the subject whose gallery sample has the highest fused score, and the recognition rate is the percentage of correctly recognized probes. Figure 8 shows the recognition rates. The reconstructed normalized 3D face shapes clearly help improve face recognition accuracy, especially for face images of large pose angles, and for all types of expressions. Interestingly, despite the relatively robust 2D face recognition performance w.r.t. expressions, the fusion with 3D matching still improves the performance across all expressions, a strong testimony to the discriminative capability of the expression-normalized 3D face shape.
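For clarity, the matching-and-fusion pipeline can be sketched as follows, assuming NumPy; `icp_distance` stands in for an ICP implementation returning the post-alignment distance between two reconstructed shapes, and all names are illustrative.

```python
import numpy as np

def fused_ranking(probe_shape, gallery_shapes, cots_scores, icp_distance):
    """Rank gallery subjects for one probe by fusing ICP-based 3D shape
    similarity with the COTS 2D matcher scores (already in [0, 1])."""
    d = np.array([icp_distance(probe_shape, g) for g in gallery_shapes])
    sim = d.max() - d                           # distance -> similarity
    sim = (sim - sim.min()) / (sim.max() - sim.min() + 1e-12)  # to [0, 1]
    fused = sim + np.asarray(cots_scores)       # sum-rule fusion
    return int(np.argmax(fused))                # recognized gallery subject
```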

Fig. 8. Face recognition results of a COTS matcher and its fusion with the proposed reconstructed-3D-face based matcher under varying (a) poses and (b) expressions.

4.5 Computational Efficiency

According to our experiments on a PC with an i7-4710 CPU and 8 GB memory, the Matlab implementation of the proposed method runs at \(\sim 26\) FPS (\(K=5\) and \(n=9,677\)). Hence, it can detect landmarks and reconstruct the 3D face shape in real time.

5 Conclusions

In this paper, we have presented a novel regression-based method for simultaneous face alignment and 3D face reconstruction from 2D images of arbitrary poses and expressions. It uses landmarks on a 2D face image as clues for reconstructing 3D shapes, and uses the reconstructed 3D shapes to refine landmarks. By alternately applying cascaded landmark regressors and 3D shape regressors, the proposed method accomplishes the two tasks simultaneously and in real time. Unlike existing 3D face reconstruction methods, the proposed method does not require a standalone face alignment method, but fully automatically reconstructs normalized 3D shapes from a single face image of arbitrary pose and expression. Compared with existing face alignment methods, it effectively handles invisible landmarks with the assistance of 3D face models. Extensive experiments, with comparisons to state-of-the-art methods, demonstrate the effectiveness of the proposed method in both face alignment and 3D face shape reconstruction, as well as in facilitating face recognition.