
1 Introduction

Three-dimensional (3D) face models have recently been employed to assist pose- or expression-invariant face recognition [3, 14, 42], achieving state-of-the-art performance. A crucial step in these 3D face-assisted face recognition methods is reconstructing the 3D face model from a two-dimensional (2D) face image. Beyond face recognition, 3D face reconstruction is also useful in other face-related tasks, such as facial expression analysis [7, 36] and facial animation [4, 5]. While many 3D face reconstruction methods are available, they require landmarks on the face image as input and have difficulty handling large-pose faces whose landmarks are invisible due to self-occlusion.

Existing studies tackle the problems of face alignment (or facial landmark localization) and 3D face reconstruction separately. However, the two problems are chicken-and-egg. On one hand, 2D face images are projections of 3D faces onto the 2D plane; knowing a 3D face and a 3D-to-2D mapping function, it is easy to compute the visibility and position of the 2D landmarks. On the other hand, landmarks provide rich information about facial geometry, which is the basis of 3D face reconstruction. Figure 1 illustrates the correlation between 2D landmarks and the 3D face: the visibility and position of landmarks in the projected 2D image are determined by three factors, namely the 3D face shape, the 3D deformation due to expression and pose, and the camera projection parameters. Denote a 3D face shape as S and its 2D landmarks as U. The formation of 2D landmarks from the 3D face can then be represented by \(U = f_{C} \circ f_{P} \circ f_{E}(S)\), where \(f_{C}\) is the camera projection, and \(f_{P}\) and \(f_{E}\) are the deformations caused by pose and expression, respectively. Given such a clear correlation between 2D landmarks U and 3D shape S, the two should ideally be solved jointly, rather than separately as in prior works; this joint solution is the core of this work.

Fig. 1. We view 2D landmarks as generated from a 3D face through 3D expression (\(f_E\)) and pose (\(f_P\)) deformations and camera projection (\(f_C\)) (top row). While conventional face alignment and 3D face reconstruction are two separate tasks, with the latter requiring the former as input, this paper performs the two tasks jointly, i.e., reconstructing a pose-expression-normalized (PEN) 3D face and estimating visible/invisible landmarks (green/red points) from a 2D face image with arbitrary pose and expression. (Color figure online)

Motivated by this observation, this paper proposes to simultaneously solve the two problems of face alignment and 3D face shape reconstruction in one unified framework. To this end, two sets of regressors are jointly learned from a training set of paired annotated 2D face images and 3D face shapes. The two sets of regressors are alternately applied to locate the landmarks on an input 2D image and, meanwhile, reconstruct its pose-expression-normalized (PEN) 3D face shape. Note that most single-image 3D face reconstruction algorithms aim to assist face recognition. For this purpose, we argue that reconstructing the PEN 3D shape is more useful than reconstructing a 3D shape that has the same pose and expression as the input 2D face [23, 28, 31].

The rest of this paper is organized as follows. Section 2 briefly reviews related work in the literature. Section 3 introduces in detail the proposed joint face alignment and 3D face reconstruction method. Section 4 reports experimental results. Section 5 concludes the paper.

2 Related Work

Face Alignment. Classical face alignment methods, such as the Active Shape Model (ASM) [9, 11] and the Active Appearance Model (AAM) [8, 25], search for landmarks based on global shape models and generative texture models. The Constrained Local Model (CLM) [10] also utilizes global shape models to regularize landmark locations, but employs discriminative local texture models. Regression-based methods [6, 27, 35, 39] have recently been proposed to directly estimate landmark locations by applying cascaded regressors to an input 2D face image. These methods mostly do not consider the visibility of facial landmarks under different view angles. Consequently, their performance degrades substantially for non-frontal faces, and their detected landmarks can be ambiguous because the anatomically correct landmarks might be invisible due to self-occlusion (see Fig. 1).

A few methods focus on large-pose face alignment; they can be roughly divided into two categories: multi-view based and 3D model based. Multi-view methods [37, 40] define different sets of landmarks as templates, one for each view range. Given an input image, they fit the multi-view templates to it and choose the best-fitted one as the final result. These methods are usually complicated to apply, and cannot detect invisible self-occluded landmarks. 3D model based methods, in contrast, handle self-occluded landmarks better with the assistance of 3D face models. Their basic idea is to fit a 3D face model to the input image to recover the 3D landmark locations. Most of these methods [17, 18, 41] use 3D morphable models (3DMM) [2], either a simplified one with a sparse set of landmarks [18, 41] or a relatively dense one [17]. They estimate the 3DMM parameters by using cascaded regressors with texture features as input. In [18], the visibility of landmarks is explicitly computed, and the method can cope with face images of yaw angles ranging from \(-90^{\circ }\) to \(90^{\circ }\), whereas the method in [17] does not work properly for faces of yaw angles beyond \(60^{\circ }\). In [33], Tulyakov and Sebe propose to directly estimate the 3D landmark locations via texture-feature-based regressors for faces of yaw angles up to \(50^{\circ }\).

These existing 3D model based methods establish regressions between 2D image features and 3D landmark locations (or, indirectly, 3DMM parameters). While our proposed approach is also based on a 3D model, unlike existing methods it carries out regressions both on 2D images and in the 3D space: regressions on 2D images predict 2D landmarks, while regressions in the 3D space predict 3D landmark locations. By integrating both regressions, our method locates landmarks more accurately and better handles self-occluded landmarks. It thus works well for images of arbitrary view angles in \([-90^{\circ }, 90^{\circ }]\).

3D Face Reconstruction. Estimating 3D face geometry from a single 2D image is an ill-posed problem. Existing methods, such as Shape from Shading (SFS) and 3DMM, thus depend heavily on priors or constraints. SFS-based methods [20, 31] usually utilize an average 3D face model as a reference, and assume a Lambertian lighting model for the 3D face surface. One limitation of SFS methods lies in the assumed connection between 2D texture cues and 3D shape, which is too weak to discriminate among different individuals. 3DMM [2, 3, 28] establishes statistical parametric models for both texture and shape, and represents a 3D face as a linear combination of basis shapes and textures. To recover the 3D face from a 2D image, 3DMM-based methods estimate the combination coefficients by minimizing the discrepancy between the input 2D face image and the image rendered from the reconstructed 3D face. They cope better with 2D face images of varying illumination and pose. However, they still suffer from invisible facial landmarks when the input face has large pose angles. To deal with extreme poses, Lee et al. [22], Qu et al. [26] and Liu et al. [23] propose to discard the self-occluded landmarks or treat them as missing data. All these existing 3D face reconstruction methods require landmarks as input; consequently, they either mark the landmarks manually or employ standalone face alignment methods to locate them automatically. Moreover, existing methods always generate 3D faces that have the same pose and expression as the input image, which may not be desired in face recognition due to the challenge of matching 3D faces with expressions [12]. In this paper, we improve 3D face reconstruction in two aspects: (i) integrating the face alignment step into the 3D face reconstruction procedure, and (ii) reconstructing PEN 3D faces, which we believe are more useful for face recognition.

3 Proposed Method

3.1 Overview

We denote an n-vertex 3D face shape of neutral expression and frontal pose as,

$$\begin{aligned} S = \begin{pmatrix} x_{1} &{}\quad x_{2} &{}\quad \cdots &{}\quad x_{n} \\ y_{1} &{}\quad y_{2} &{}\quad \cdots &{}\quad y_{n}\\ z_{1} &{}\quad z_{2} &{}\quad \cdots &{}\quad z_{n}\\ 1 &{}\quad 1 &{}\quad \cdots &{}\quad 1 \end{pmatrix}, \end{aligned}$$
(1)

and a subset of S with columns corresponding to l landmarks as \({S}_{L}\). The projections of these landmarks on the 2D face image \(\mathbf {I}\) are represented by

$$\begin{aligned} U = \begin{pmatrix} u_{1}&{}\quad u_{2} &{}\quad \cdots &{}\quad u_{l}\\ v_{1}&{}\quad v_{2} &{}\quad \cdots &{}\quad v_{l} \end{pmatrix} = f_{C}\circ f_{P} \circ f_{E} ({S}_{L}) \approx \mathbf {M}\times {S}_{L}. \end{aligned}$$
(2)

Here, we use a 3D-to-2D mapping matrix \(\mathbf {M}\) to approximate the composite effect of the expression- and pose-induced deformations and the camera projection. Given an input 2D face image \(\mathbf {I}\), our goal is to simultaneously locate its landmarks U and reconstruct its 3D face shape S. Note that, in some contexts, we also write the 3D face shape and the landmarks as column vectors: \(\mathbf {S}=(x_{1}, y_{1}, z_{1}, x_{2}, y_{2}, z_{2}, \cdots , x_{n}, y_{n}, z_{n})^{\mathsf {T}}\) and \(\mathbf {U}=(u_{1}, v_{1}, u_{2}, v_{2}, \cdots , u_{l}, v_{l})^{\mathsf {T}}\), where '\(\mathsf {T}\)' denotes the transpose.
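To make the notation concrete, the following is a minimal NumPy sketch (not part of the original method) of the shape matrix of Eq. (1), the landmark projection of Eq. (2), and the column-vector form; the index array `landmark_idx`, holding the l landmark vertex indices, is an illustrative assumption.

```python
import numpy as np

def project_landmarks(S, M, landmark_idx):
    """Project the 3D landmark subset S_L onto the image plane: U ~= M @ S_L.

    S            : (4, n) 3D shape in homogeneous coordinates, Eq. (1)
    M            : (2, 4) 3D-to-2D mapping matrix approximating f_C . f_P . f_E
    landmark_idx : (l,)  vertex indices of the l landmarks
    Returns U    : (2, l) landmark positions in the image, Eq. (2)
    """
    S_L = S[:, landmark_idx]        # (4, l) columns of S at the landmarks
    return M @ S_L                  # (2, l)

def vectorize_shape(S):
    """Column-vector form S = (x1, y1, z1, ..., xn, yn, zn)^T of Sect. 3.1."""
    return S[:3, :].T.reshape(-1)   # interleave per-vertex (x, y, z)
```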

Fig. 2. Flowchart of the proposed joint face alignment and 3D face reconstruction method.

Figure 2 shows the flowchart of the proposed method. For an input 2D face image \(\mathbf {I}\), the 3D face shape \(\mathbf {S}\) is initialized as the mean 3D shape of the training faces. The landmarks \(\mathbf {U}\) are initialized by fitting the mean landmarks of the frontal training faces into the face region specified by a bounding box in \(\mathbf {I}\) via a similarity transform. \(\mathbf {U}\) and \(\mathbf {S}\) are then iteratively updated by applying a series of regressors. Each iteration contains three main steps: (i) updating landmarks, (ii) updating the 3D face shape, and (iii) refining landmarks.
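As a concrete illustration of the initialization, here is a minimal sketch assuming NumPy; `mean_landmarks` and `bbox` are illustrative names, and the similarity transform degenerates to uniform scale plus translation because the mean shape is already upright.

```python
import numpy as np

def init_landmarks(mean_landmarks, bbox):
    """Fit the (2, l) mean landmarks into the face bounding box
    bbox = (x, y, w, h) via uniform scaling and translation."""
    x, y, w, h = bbox
    mins = mean_landmarks.min(axis=1, keepdims=True)      # (2, 1)
    span = mean_landmarks.max(axis=1, keepdims=True) - mins
    scale = min(w / span[0, 0], h / span[1, 0])           # fit inside the box
    U0 = (mean_landmarks - mins) * scale
    U0[0] += x                                            # translate into the
    U0[1] += y                                            # bounding box
    return U0
```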

Updating landmarks. This step updates the landmark locations from \(\mathbf {U}^{k-1}\) to \(\hat{\mathbf {U}}^{k}\) based on the texture features in the input 2D image, similar to conventional cascaded-regressor based 2D face alignment [35]. The adjustment to the landmark locations in the \(k^{\texttt {th}}\) iteration, \(\varDelta {\mathbf {U}^{k}}\), is determined by the local texture features around \(\mathbf {U}^{k-1}\) via a regressor,

$$\begin{aligned} \varDelta {\mathbf {U}}^{k} = R_{U}^{k}(h(\mathbf {I}, \mathbf {U}^{k-1})), \end{aligned}$$
(3)

where \(h(\mathbf {I}, \mathbf {U})\) denotes the texture features extracted around the landmarks \(\mathbf {U}\) in the image \(\mathbf {I}\), and \(R_{U}^{k}\) is a regression function. The landmarks can then be updated by \(\hat{\mathbf {U}}^{k} = \mathbf {U}^{k-1} + \varDelta {\mathbf {U}}^{k}\). The method for learning these landmark regressors is introduced in Sect. 3.3.

Updating 3D face shape. In this step, the landmark adjustment obtained above is used to estimate the adjustment of the 3D face shape. Specifically, a regression function \(R_{S}^{k}\) models the correlation between the landmark adjustment \(\varDelta {\mathbf {U}}^{k}\) and the expected adjustment to the 3D shape \(\varDelta {\mathbf {S}}^{k}\), i.e.,

$$\begin{aligned} \varDelta {\mathbf {S}}^{k} = R_{S}^{k}(\varDelta {\mathbf {U}}^{k}). \end{aligned}$$
(4)

The 3D shape can then be updated by \(\mathbf {S}^{k} = \mathbf {S}^{k-1} + \varDelta {\mathbf {S}}^{k}\). The method for learning these shape regressors is given in Sect. 3.4.

Refining landmarks. Once a new estimate of the 3D shape is obtained, the landmarks can be refined accordingly. For this purpose, the 3D-to-2D mapping matrix is needed; hence, we estimate \(\mathbf {M}^{k}\) from \(\mathbf {S}^{k}\) and \(\hat{\mathbf {U}}^{k}\). The refined landmarks \(\mathbf {U}^{k}\) are then obtained by projecting \(\mathbf {S}^{k}\) onto the image via \(\mathbf {M}^{k}\) according to Eq. (2). During this process, the visibility of the landmarks is also re-computed. Details of this step are given in Sect. 3.5.
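Putting the three steps together, one iteration of the inference stage might look like the following sketch. It assumes NumPy, the helper functions sketched in Sects. 3.3-3.5 below (`extract_features`, `fit_mapping`, `landmark_visibility`), and a hypothetical `avg_normals` helper returning the per-landmark average surface normals; all names are illustrative.

```python
import numpy as np

def joint_iteration(I, U_prev, S_prev, R_U_k, R_S_k, landmark_idx, vis):
    """One iteration: U_prev (2l,), S_prev (3n,) in the column-vector forms
    of Sect. 3.1; R_U_k, R_S_k are the learned linear regressor matrices."""
    # (i) update landmarks from local texture features, Eq. (3)
    dU = R_U_k @ extract_features(I, U_prev.reshape(2, -1, order='F'), vis)
    U_hat = U_prev + dU
    # (ii) update the 3D shape from the landmark adjustment, Eq. (4)
    S = S_prev + R_S_k @ dU
    # (iii) refine landmarks: estimate M^k by Eq. (8), re-project by Eq. (2)
    S_L = np.vstack([S.reshape(-1, 3).T[:, landmark_idx],
                     np.ones(len(landmark_idx))])      # (4, l) homogeneous
    M = fit_mapping(U_hat.reshape(2, -1, order='F'), S_L)
    U = (M @ S_L).T.reshape(-1)                        # refined (2l,) vector
    vis = np.array([landmark_visibility(n, M)          # Eq. (9) per landmark
                    for n in avg_normals(S, landmark_idx)])
    return U, S, vis
```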

3.2 Training Data Preparation

Before detailing the three steps, we first introduce the training data needed for learning the landmark and 3D shape regressors. Since the purpose of these regressors is to gradually adjust the estimated landmarks and 3D shape towards their true values, we need a sufficient number of data triplets \(\{(\mathbf {I}_{i}, \mathbf {S}^{*}_{i}, \mathbf {U}^{*}_{i})\vert i=1,2,\cdots ,N\}\), where \(\mathbf {S}^{*}_{i}\) and \(\mathbf {U}^{*}_{i}\) are, respectively, the ground truth 3D shape and landmarks for the image \(\mathbf {I}_{i}\), and N is the total number of training samples. Dense correspondence has been established among the vertices of all the 3D face shapes; in other words, they have the same number of vertices, and vertices with the same index have the same semantic meaning. Moreover, both visible and invisible landmarks in \(\mathbf {I}_{i}\) are annotated and included in \(\mathbf {U}^{*}_{i}\); for invisible landmarks, the annotated positions should be the anatomically correct positions (e.g., red points in Fig. 1).

Obviously, to make the regressors robust to expression and pose variations, the training data should contain 2D face images of varying expressions and poses. The 3D shape \(\mathbf {S}^{*}_{i}\) corresponding to \(\mathbf {I}_{i}\) can either have the same expression and pose as \(\mathbf {I}_{i}\), or have neutral expression and frontal pose regardless of the expression and pose of \(\mathbf {I}_{i}\). In the former case, the learned regressors output 3D face shapes with the same expression and pose as the input images; in the latter, they generate neutral, frontal 3D shapes for any input image. In either case, dense registration among all 3D shapes \(\mathbf {S}_{i}^{*}\) is needed for regressor learning. In this paper, we follow the latter for two reasons: (i) dense registration of 3D face shapes with different expressions is difficult, and (ii) the reconstructed PEN 3D shapes are preferred for use in 3D face recognition.

It is, however, difficult to find in the public domain such data sets of 3D face shapes and corresponding annotated 2D images with various expressions and poses. We therefore construct two training sets ourselves: one based on BU3DFE [36], and the other based on LFW [16]. The BU3DFE database contains 3D face scans of 56 males and 44 females, acquired in neutral plus six basic expressions (happiness, disgust, fear, anger, surprise and sadness); the basic expressions are acquired at four levels of intensity. These 3D face scans have been manually annotated with 84 landmarks (83 landmarks provided by the database and one nose tip marked by ourselves). For each of the 100 subjects, we select one scan of neutral expression as the ground truth 3D shape. For the remaining six expressions, we choose the scans of the first intensity level, and project them to 2D images with recorded landmark locations. From each of the seven scans, 19 face images are generated with different poses (\(-90^{\circ }\) to \(90^{\circ }\) yaw at a \(10^{\circ }\) interval). As a result, each 3D shape has 133 images of different poses and expressions. We use the method of [13] to establish dense correspondence among the BU3DFE neutral scans.

The LFW database contains 13,233 images of 5,749 subjects. We select 150 subjects, each having at least 10 images, and use the 68 landmarks on these face images provided by the work of [41]. From the neutral frontal image of each subject, we employ the method in [23] to reconstruct a densely registered 3D shape. In total, we obtain 4,149 images of 150 subjects and their corresponding neutral 3D face shapes.

The resulting 3D shapes have \(n=9,677\) vertices for BU3DFE and \(n=53,215\) for LFW. Figure 3 shows some example 2D face images and corresponding 3D faces from the two databases. The 3D shapes in BU3DFE consist of a sparser set of vertices, and consequently look slightly blurry in Fig. 3.

Fig. 3. Example 2D face images with annotated landmarks and corresponding neutral 3D shapes from the BU3DFE and LFW databases.

3.3 Learning Landmark Regressors

According to Eq. (3), the landmark regressors estimate the adjustment to \(\mathbf {U}^{k-1}\) such that the updated landmarks \(\hat{\mathbf {U}}^{k}\) get closer to the true positions. In the training phase, the true positions and visibility of the landmarks are given by the ground truth \(\mathbf {U}^{*}\). The objective of the landmark regressor \(R^{k}_{U}\) is therefore to predict the difference between \(\mathbf {U}^{k-1}\) and \(\mathbf {U}^{*}\). In this paper, we employ linear regressors as the landmark regressors, and learn them by solving the following optimization:

$$\begin{aligned} R^{k}_{U} = \mathop {\arg \min } \limits _{R^{k}_{U}}\sum _{i=1}^N\parallel \left( \mathbf {U}^{*}_{i} - \mathbf {U}^{k-1}_{i}\right) - R^{k}_{U}(h(\mathbf {I}_{i}, \mathbf {U}_{i}^{k-1}))\parallel _2^2, \end{aligned}$$
(5)

which has a closed-form least-squares solution. Note that other regression schemes, such as CNNs [19], can be easily adopted in our framework.

We use 128-dimensional SIFT descriptors [24] as the local feature. The feature vector h is a concatenation of the SIFT descriptors at all l landmarks, i.e., a 128l-dimensional vector. If a landmark is invisible, no feature is extracted and its corresponding entries in h are set to zero. It is worth noting that the regressors estimate the semantic positions of all landmarks, including invisible ones.
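A sketch of the feature mapping \(h(\mathbf{I}, \mathbf{U})\) is given below, using OpenCV's SIFT as an assumed stand-in for the descriptor implementation of [24]; `vis` is the per-landmark visibility from Eq. (9).

```python
import cv2
import numpy as np

def extract_features(I_gray, U, vis, patch_size=32):
    """h(I, U): concatenate the 128-dim SIFT descriptors at the l landmarks
    into a 128l-dim vector; entries of invisible landmarks stay zero."""
    l = U.shape[1]
    h = np.zeros(128 * l, dtype=np.float32)
    visible = [j for j in range(l) if vis[j] > 0]
    keypoints = [cv2.KeyPoint(float(U[0, j]), float(U[1, j]), float(patch_size))
                 for j in visible]
    if keypoints:
        # compute() may drop keypoints too close to the image border; a full
        # implementation would track this, which the sketch skips for brevity
        _, desc = cv2.SIFT_create().compute(I_gray, keypoints)
        for row, j in enumerate(visible):
            h[128 * j:128 * (j + 1)] = desc[row]
    return h
```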

3.4 Learning 3D Shape Regressors

The landmark adjustment \(\varDelta {\mathbf {U}}^{k}\) is also used as the input to the 3D shape regressor \(R^{k}_{S}\). The objective of \(R^{k}_{S}\) is to compute an update to the estimated 3D shape \(\mathbf {S}^{k-1}\) in the \(k^{\texttt {th}}\) iteration that minimizes the difference between the updated 3D shape and the ground truth. Using similar linear regressors, the 3D shape regressors can be learned by solving the following optimization via least squares:

$$\begin{aligned} R^{k}_{S} = \mathop {\arg \min } \limits _{{R^{k}_S}}\sum _{i=1}^N\parallel (\mathbf {S}^{*}_{i} - \mathbf {S}^{k-1}_{i}) - R^{k}_{S}\left( \varDelta {\mathbf {U}}^{k}_{i}\right) \parallel _2^2, \end{aligned}$$
(6)

with its closed-form solution as

$$\begin{aligned} R^{k}_{S}=\varDelta \mathbb {S}^{k}(\varDelta \mathbb {U}^{k})^{\mathsf {T}}(\varDelta \mathbb {U}^{k}(\varDelta \mathbb {U}^{k})^{\mathsf {T}})^{-1}, \end{aligned}$$
(7)

where \(\varDelta \mathbb {S}^{k} = \mathbb {S}^{*}-\mathbb {S}^{k-1}\) and \(\varDelta \mathbb {U}^{k}\) are, respectively, the ensembles of 3D shape and landmark adjustments. \(\mathbb {S}\in \mathbb {R}^{3n\times N}\) and \(\mathbb {U}\in \mathbb {R}^{2l\times N}\) denote, respectively, the ensembles of 3D face shapes and 2D landmarks of all training samples, with each column corresponding to one sample. It can be shown that N should be larger than 2l so that \(\varDelta \mathbb {U}^{k}(\varDelta \mathbb {U}^{k})^{\mathsf {T}}\) is invertible. Fortunately, since the set of landmarks used is usually sparse, this requirement is easily satisfied in real-world applications.
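In code, the closed-form solution of Eq. (7) is essentially a one-liner. The sketch below assumes NumPy and adds a small ridge term for numerical stability (an implementation choice, not part of Eq. (7)); the same routine also yields the landmark regressors of Eq. (5), with the feature ensemble in place of \(\varDelta \mathbb {U}^{k}\).

```python
import numpy as np

def learn_shape_regressor(dS, dU, eps=1e-6):
    """dS : (3n, N) target shape adjustments, one column per training sample
       dU : (2l, N) landmark adjustments predicted in step (i)
       Returns R_S^k ~ dS @ dU^T (dU @ dU^T)^{-1} of shape (3n, 2l)."""
    A = dU @ dU.T + eps * np.eye(dU.shape[0])   # (2l, 2l); needs N > 2l to be
    return dS @ dU.T @ np.linalg.inv(A)         # well conditioned without eps
```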

3.5 Estimating 3D-to-2D Mapping and Landmark Visibility

To refine the landmarks with the updated 3D face shape, we have to project the 3D shape onto the 2D image with a 3D-to-2D mapping matrix. In this paper, we dynamically estimate the mapping matrix from \(\mathbf {S}^{k}\) and \(\hat{\mathbf {U}}^{k}\). As discussed in Sect. 3.1, the mapping matrix captures the composite effect of the expression- and pose-induced deformations and the camera projection. Here, we assume a weak perspective camera projection, as in prior work [18, 38], and further assume that the expression- and pose-induced deformation can be approximated by a linear transform. As a result, the mapping matrix \(\mathbf {M}^{k}\) is a \(2\times 4\) matrix, and can be estimated as the least-squares solution to the following fitting problem:

$$\begin{aligned} \mathbf {M}^{k} = \mathop {\arg \min } \limits _{\mathbf {M}^{k}} \parallel \hat{{U}}^{k} - \mathbf {M}^{k}\times {S}^{k}_{L}\parallel _2^2. \end{aligned}$$
(8)

Once the new mapping matrix is computed, the landmarks can be refined as \({U}^{k} = \mathbf {M}^{k} \times {S}^{k}_{L}\).
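A sketch of this mapping estimation, assuming NumPy; `U_hat` is the \(2\times l\) matrix of updated landmarks and `S_L` the \(4\times l\) homogeneous landmark subset of \(\mathbf{S}^{k}\).

```python
import numpy as np

def fit_mapping(U_hat, S_L):
    """Solve min_M ||U_hat - M @ S_L||^2 via least squares, Eq. (8)."""
    # lstsq solves S_L^T @ M^T = U_hat^T, i.e. one 4-vector per image row
    M_T, *_ = np.linalg.lstsq(S_L.T, U_hat.T, rcond=None)
    return M_T.T                                 # (2, 4) mapping matrix
```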

The visibility of the landmarks can then be computed from the mapping matrix \(\mathbf {M}\) using the method in [18]. Suppose the average surface normal around a landmark of the 3D face shape \(\mathbf {S}\) is \(\overrightarrow{\mathbf{n }}\). Its visibility \(\mathbf {v}\) can be measured by

$$\begin{aligned} \mathbf v = \frac{1}{2}\left( 1 + sgn\left( \overrightarrow{\mathbf{n }} \cdot \left( \frac{\mathbf {M}_{1}}{\left\| \mathbf {M}_{1} \right\| }\times \frac{\mathbf {M}_{2}}{\left\| \mathbf {M}_{2} \right\| } \right) \right) \right) , \end{aligned}$$
(9)

where sgn(\(\cdot\)) is the sign function, '\(\cdot \)' denotes the dot product and '\(\times \)' the cross product, and \(\mathbf {M}_{1}\) and \(\mathbf {M}_{2}\) are the left-most three elements of the first and second rows of the mapping matrix \(\mathbf {M}\). This essentially rotates the surface normal and checks whether it points toward the camera.
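Eq. (9) translates directly into code; the sketch below assumes NumPy, with `normal` the average surface normal around one landmark.

```python
import numpy as np

def landmark_visibility(normal, M):
    """Return 1.0 if the rotated normal points toward the camera, else 0.0."""
    m1 = M[0, :3] / np.linalg.norm(M[0, :3])   # left-most 3 entries, row 1
    m2 = M[1, :3] / np.linalg.norm(M[1, :3])   # left-most 3 entries, row 2
    view_dir = np.cross(m1, m2)                # camera viewing axis
    return 0.5 * (1.0 + np.sign(normal @ view_dir))
```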

The whole process of learning the cascaded coupled landmark and 3D shape regressors is summarized in Algorithm 1.

Algorithm 1. Learning the cascaded coupled landmark and 3D shape regressors.
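The following is a minimal sketch of this training procedure, assuming NumPy and the helpers sketched above (`extract_features` for h, `learn_shape_regressor` for the least-squares fits of Eqs. (5) and (7)); the step-(iii) refinement of Sect. 3.5 is abstracted as an assumed callable `refine_all`, and all names are illustrative.

```python
import numpy as np

def train_cascade(images, U_star, S_star, U0, S0, vis0, refine_all, K=5):
    """U_star (2l, N), S_star (3n, N): ground-truth landmark / shape ensembles;
    U0, S0, vis0: initial estimates; returns the K pairs of regressors."""
    U, S, vis = U0.copy(), S0.copy(), vis0.copy()
    R_U, R_S = [], []
    for k in range(K):
        # features of all N samples, one 128l-dim column per sample
        H = np.stack([extract_features(im, U[:, i].reshape(2, -1, order='F'),
                                       vis[:, i])
                      for i, im in enumerate(images)], axis=1)
        R_U_k = learn_shape_regressor(U_star - U, H)    # Eq. (5), same LS form
        dU = R_U_k @ H                                  # step (i)
        R_S_k = learn_shape_regressor(S_star - S, dU)   # Eqs. (6)-(7)
        S = S + R_S_k @ dU                              # step (ii)
        U, vis = refine_all(U + dU, S)                  # step (iii), Sect. 3.5
        R_U.append(R_U_k); R_S.append(R_S_k)
    return R_U, R_S
```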

4 Experiments

4.1 Protocols

We conduct three sets of experiments to evaluate the proposed method: 3D shape reconstruction accuracy, face alignment accuracy, and the benefit to face recognition.

Datasets. The training data are constructed from two public face databases, BU3DFE and LFW, as detailed in Sect. 3.2. Two models are trained, one on each of the two training sets. Our test sets include BU3DFE and AFW (Annotated Faces in-the-Wild) [40]. To evaluate 3D shape reconstruction accuracy, 10-fold cross validation is applied to split the BU3DFE data into training and testing subsets, resulting in 11,970 training samples and 1,330 testing samples per fold. To evaluate face alignment accuracy, the AFW database [40] is tested using the LFW-trained model. AFW is a widely used benchmark in the face alignment literature; it contains 205 images with 468 faces of different poses within \(\pm 90^\circ \). In [30], 337 of these faces were manually annotated with face bounding boxes and 68 landmarks, and we use these annotations in our experiments.

Experiment setup. During training and testing, each image is associated with a bounding box specifying the face region. To initialize the landmarks, the mean of the landmarks over all neutral frontal training images is fitted to the face region via a similarity transform. We set the number of iterations to \(K=5\) (a discussion of convergence is provided in the supplementary material). SIFT descriptors are computed on \(32 \times 32\) local patches around the landmarks, using the implementation of [35].

Fig. 4. MAE of the proposed method on BU3DFE (a) under different yaw angles and (b) under different expressions, i.e., neutral (NE), happy (HA), fear (FE), sad (SA), angry (AN), disgust (DI) and surprise (SU).

Evaluation metrics. Two metrics are used to evaluate 3D face shape reconstruction accuracy: Mean Absolute Error (MAE) and Normalized Per-vertex Depth Error (NPDE). MAE is defined as \(\texttt {MAE} = \frac{1}{N_{T}}\sum _{i=1}^{N_{T}}(\Vert \mathbf {S}^{*}_{i}-\hat{\mathbf {S}}_{i} \Vert /n)\), where \(N_{T}\) is the total number of testing samples, and \(\mathbf {S}^{*}_{i}\) and \(\hat{\mathbf {S}}_{i}\) are the ground truth and reconstructed 3D face shapes of the \(i^{\texttt {th}}\) testing sample. NPDE measures the depth error at the \(j^{\texttt {th}}\) vertex of a testing sample as \(\texttt {NPDE}(x_{j}, y_{j}) = \left( |z^{*}_{j} - \hat{z}_{j}|\right) /\left( z^{*}_{max} - z^{*}_{min}\right) \), where \(z^{*}_{max}\) and \(z^{*}_{min}\) are the maximum and minimum depth values in the ground truth 3D shape of the testing sample, and \(z^{*}_{j}\) and \(\hat{z}_{j}\) are the ground truth and reconstructed depth values at the \(j^{\texttt {th}}\) vertex.
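Both reconstruction metrics translate directly into code; a sketch assuming NumPy, where each shape is a \(3\times n\) array and the overall shape norm in MAE is read as the Frobenius norm:

```python
import numpy as np

def mae(S_gt_list, S_rec_list):
    """Mean Absolute Error over all test samples."""
    n = S_gt_list[0].shape[1]
    return np.mean([np.linalg.norm(S_gt - S_rec) / n
                    for S_gt, S_rec in zip(S_gt_list, S_rec_list)])

def npde(S_gt, S_rec):
    """Per-vertex depth error, normalized by the ground-truth depth range."""
    z_gt, z_rec = S_gt[2], S_rec[2]
    return np.abs(z_gt - z_rec) / (z_gt.max() - z_gt.min())
```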

Face alignment accuracy is measured by the Normalized Mean Error (NME), defined as the mean normalized estimation error over the visible landmarks of all testing samples:

$$\begin{aligned} \texttt {NME} = \frac{1}{N_{T}}\sum _{i=1}^{N_{T}}{\left( \frac{1}{d_{i}}\frac{1}{N^\mathbf{v }_{i}} \sum _{j=1}^{l}\mathbf{v }_{ij}||(\hat{u}_{ij}, \hat{v}_{ij}) - (u^{*}_{ij}, v^{*}_{ij})|| \right) }, \end{aligned}$$
(10)

where \(d_{i}\) is the square root of the face bounding box area of the \(i^{\texttt {th}}\) testing sample, \(N^\mathbf{v }_{i}\) is its number of visible landmarks, and \((u^{*}_{ij}, v^{*}_{ij})\) and \((\hat{u}_{ij}, \hat{v}_{ij})\) are, respectively, the ground truth and estimated coordinates of its \(j^{\texttt {th}}\) landmark.
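For one testing sample, the inner term of Eq. (10) can be sketched as follows, assuming NumPy; the reported NME is the mean of this quantity over all \(N_T\) samples.

```python
import numpy as np

def sample_nme(U_gt, U_est, vis, d):
    """U_gt, U_est: (2, l) landmark matrices; vis: (l,) visibility mask;
    d: square root of the face bounding box area."""
    errs = np.linalg.norm(U_est - U_gt, axis=0)   # per-landmark distances
    return errs[vis > 0].mean() / d               # average over visible ones
```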

4.2 3D Face Reconstruction Accuracy

Reconstruction accuracy across poses. Figure 4(a) shows the average MAE of the proposed method for input 2D images under different pose angles. For a fair comparison with the method in [23], we compute the reconstruction error only on neutral testing images, after rotating the reconstructed 3D faces to frontal view. As can be seen, the average MAE of our method is lower than that of the baseline. Moreover, the error does not increase substantially as the pose angle grows, which demonstrates the effectiveness of the proposed method in handling face images of arbitrary views. Figure 5 shows the reconstruction and face alignment results for one subject.

Reconstruction accuracy across expressions. Figure 4(b) shows the average MAE of the proposed method across expressions. Although the error increases as expressions become more intense, the maximum increment (i.e., SU vs. NE) is below \(7\,\%\). This demonstrates the robustness of the proposed method in normalizing expressions while preserving the individuality of the reconstructed models. Figure 6 shows the reconstruction and face alignment results of a subject under the seven expressions.

Fig. 5. Reconstruction results for a BU3DFE subject at three different pose angles. Column 1 shows the input images. Columns 2 and 3 show the reconstructed ('REC') 3D faces from two views. Column 4 shows the NPDE between the ground truth ('GT') and REC 3D faces. The detected landmarks are shown in Column 5. The last column shows the GT 3D face of this subject, the initial ('INIT') 3D face, and the NPDE between them. NPDE increases as the color changes from blue to red. The average and standard deviation are given below each NPDE map. Note that the same INIT 3D face is used for all input images. (Color figure online)

Fig. 6. Face alignment and reconstruction results for a BU3DFE subject with different expressions. Row 1 shows the input images. Row 2 shows the estimated 3D shapes, and Row 3 shows the NPDE maps with the average and standard deviation. The last row shows the detected landmarks.

Table 1. NME of the proposed method and two baseline methods on AFW.

4.3 Face Alignment Accuracy

For the face alignment evaluation on AFW, we select two recent works as baselines: (1) CDM [37], the first method claimed to perform pose-free face alignment, and (2) PIFA [18], a regression-based method that can predict the anatomically correct locations of landmarks for arbitrary-view face images. We use the executable code of CDM and PIFA to compute their performance on our test set. The CDM code integrates face detection, and it successfully detects and aligns 268 of the 337 testing faces; to compare with CDM, we therefore evaluate the NME on those 268 faces. For PIFA and the proposed method, the face bounding boxes provided by [30] are used. Note that CDM detects 66 landmarks and PIFA detects 21. For a fair comparison, we evaluate the NME on the 18 landmarks in the intersection of the three landmark sets. As shown in Table 1, our accuracy is better than that of both baselines. Figure 7 shows some face alignment results.

Fig. 7. The 18 landmarks detected by the proposed method on AFW images.

4.4 Application to Face Recognition

While there are many recent face alignment and reconstruction works [1, 15, 21, 29, 32, 34], few take a step further and evaluate the contribution of alignment or reconstruction to subsequent tasks. In contrast, we quantitatively evaluate the effect of the reconstructed PEN 3D face shapes on face recognition by performing direct 3D-to-3D shape matching and fusing it with conventional 2D face recognition. Specifically, we choose 70 subjects in BU3DFE to train the proposed regressors, and use the remaining 30 subjects for testing. The neutral frontal face images of the testing subjects compose the gallery, and their faces under 19 poses and 7 expressions (3,990 images in total) are the probes. We use a commercial off-the-shelf (COTS) 2D face matcher (footnote 1) as the baseline. The iterative closest points (ICP) algorithm is applied to match the reconstructed normalized 3D face shapes: it aligns the 3D shapes reconstructed from probe and gallery images and computes the distances between them, which are converted to similarity scores by subtracting them from the maximum distance. These scores are then normalized to the range [0, 1] and fused with the scores of the COTS matcher (which also lie in [0, 1]) by a sum rule. The recognition result for a probe is the subject whose gallery sample has the highest fused score, and the recognition rate is the percentage of correctly recognized probes. Figure 8 shows the recognition rates. The reconstructed normalized 3D face shapes clearly help improve face recognition accuracy, especially for face images of large pose angles, and for all types of expressions. Interestingly, despite the relatively robust 2D face recognition performance w.r.t. expressions, the fusion with 3D matching still improves the performance across all expressions, a strong testimony to the discriminative capability of the expression-normalized 3D face shape.
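For clarity, the matching-and-fusion pipeline can be sketched as follows, assuming NumPy; `icp_distance` stands in for an ICP implementation returning the post-alignment distance between two reconstructed shapes, and all names are illustrative.

```python
import numpy as np

def fused_ranking(probe_shape, gallery_shapes, cots_scores, icp_distance):
    """Rank gallery subjects for one probe by fusing ICP-based 3D shape
    similarity with the COTS 2D matcher scores (already in [0, 1])."""
    d = np.array([icp_distance(probe_shape, g) for g in gallery_shapes])
    sim = d.max() - d                           # distance -> similarity
    sim = (sim - sim.min()) / (sim.max() - sim.min() + 1e-12)  # to [0, 1]
    fused = sim + np.asarray(cots_scores)       # sum-rule fusion
    return int(np.argmax(fused))                # recognized gallery subject
```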

Fig. 8. Face recognition results of a COTS matcher and its fusion with the proposed reconstructed-3D-face based matcher under varying (a) poses and (b) expressions.

4.5 Computational Efficiency

According to our experiments on a PC with an i7-4710 CPU and 8 GB memory, the Matlab implementation of the proposed method runs at \(\sim 26\) FPS (\(K=5\) and \(n=9,677\)). Hence, it can detect landmarks and reconstruct the 3D face shape in real time.

5 Conclusions

In this paper, we have presented a novel regression-based method for simultaneous face alignment and 3D face reconstruction from 2D images of arbitrary poses and expressions. It uses landmarks on a 2D face image as clues for reconstructing 3D shapes, and uses the reconstructed 3D shapes to refine landmarks. By alternately applying cascaded landmark regressors and 3D shape regressors, the proposed method accomplishes the two tasks simultaneously and in real time. Unlike existing 3D face reconstruction methods, the proposed method does not require a standalone face alignment method, but fully automatically reconstructs normalized 3D shapes from a single face image of arbitrary pose and expression. Compared with existing face alignment methods, it effectively handles invisible landmarks with the assistance of 3D face models. Extensive experiments, with comparisons to state-of-the-art methods, demonstrate the effectiveness of the proposed method in both face alignment and 3D face shape reconstruction, as well as in facilitating face recognition.