1 Introduction

Non-rigid face tracking is an important topic that has received considerable attention over the last decades. It is useful in many domains such as video surveillance, human-computer interaction, and biometrics. The problem becomes much more challenging under out-of-plane rotation, illumination changes, the presence of several people, or occlusions. In our study, we propose an approach to track a non-rigid face under out-of-plane rotation, including the profile view. In other words, our method simultaneously estimates the six rigid face parameters, namely the 3D translation and the three axial rotations (Footnote 1), and the non-rigid parameters.

For non-rigid face tracking, a set of landmarks is considered as the face shape model. Since the pioneering work of [1], it is well known that the Active Appearance Model (AAM) provides an efficient way to represent and track frontal faces. Many works [2–4] have suggested improvements in terms of fitting accuracy or profile-view tracking. The Constrained Local Model (CLM), proposed by [5], consists of an exhaustive local search around landmarks constrained by a shape model. [6, 7] both improved this method in terms of accuracy and speed; more specifically, [7] can track a single face with vertical rotation up to \(90^{\circ }\) in a well-controlled environment. Cascaded Pose Regression (CPR), first proposed by [8], has recently shown remarkable performance [9, 10]. This method achieves high accuracy at real-time speed, but it is restricted to near-frontal face tracking. Most methods work only on constrained views for two reasons: (i) the acquisition of ground truth for unconstrained views is expensive in practice, and (ii) handling the hidden landmarks on the invisible side of the face is difficult.

The literature also mentions other face models such as the cylinder [11–13], ellipsoid [14] or mesh [15]. Most of these methods can estimate the three rotations even at large angles and on the profile view, but it is worth noting that they handle rigid motion rather than non-rigid facial expression. On the other hand, the popular 3D Candide-3 model is defined to manage both rigid and non-rigid parameters. [16] applied a Kalman Filter to interest points in a video sequence based on adaptively rendered keyframes; this work is semi-automatic and cannot cope with fast movement. [17] used Mahalanobis distances of local features, constrained by the face model, to capture both rigid and non-rigid head motions. [18] learned a linear model between model parameters and the face's appearance. These methods work poorly on profile views. [19] extended the Candide model to work with the profile, but their objective function, combining structure and appearance features with dynamic modeling, appears to converge slowly because of its high dimensionality. [20] proposed an adaptive Bayesian approach to track principal components of landmark appearance. Their algorithm appears to be robust for tracking landmarks, but unable to recover when tracking is lost. Let us note that these methods use synthetic databases to train their tracking models. The pose estimation performance of the mentioned methods can be further improved by integrating Kalman Filtering [21] or Particle Filtering [22].

A face tracking framework is robust if it can operate under a wide range of poses, facial expressions, environmental changes and occlusions, and can also recover from failures. In [11, 12], the authors utilized dynamic templates based on a cylinder model in order to handle lighting changes and self-occlusion. Local features can be considered [7, 10], since local descriptors are not much affected by facial expressions and self-occlusion. In order to provide recovery capability, tracking-by-detection or wide-baseline matching [15, 23, 24] have been applied. The primary idea is to match the current frame with previously stored keyframes. Matching is robust to fast movements and illumination changes and is able to recover lost tracking. However, matching is only suitable for rigid parameters; moreover, these methods degrade when too few keypoints are detected on the face. Recently, [25] proposed combining traditional tracking techniques with deep learning to obtain proficient pose tracking. Commercial products also exist, e.g. [26], which shows effective results in pose and face animation tracking, but needs controlled illumination and movement. In addition, it has to wait for a frontal view to re-initialize the model when the face is lost.

In this paper, our contribution is twofold: (i) using a large offline synthetic database to train tracking models, and (ii) proposing a two-step approach to track a non-rigid face. These points are introduced in more detail below. Firstly, a large synthesized database is built to avoid expensive and time-consuming manual annotation. To the best of our knowledge, although some papers have worked with synthetic data [18, 27, 28], ours is the first study that investigates a large offline synthetic dataset for free-pose tracking of a non-rigid face. Secondly, the tracking approach consists of two steps: (a) the first step uses 2D SIFT matching between the current frame and previously stored keyframes to estimate only the rigid parameters; in this way, our method withstands fast movement and can recover from lost tracking. (b) The second step obtains the whole set of parameters (rigid and non-rigid) with a heuristic method using pose-wise SVMs; this allows the 3D model to be aligned to a profile face as efficiently as to a frontal face. A combination of three descriptors is also considered for a better local representation.

The remainder of this paper is organized as follows: Sect. 2 describes the face model and the descriptors used. Section 3 presents the pipeline of the proposed framework. Experimental results and analysis are presented in Sect. 4. Finally, Sect. 5 provides some conclusions and further perspectives.

2 Face Representation

2.1 Shape Representation

Candide-3, initially proposed by [29], is a popular face model managing both facial shape and animation. It consists of \(N=113\) vertices representing 168 surfaces. If \(\text {g}\in R^{3N}\) denotes the vector of dimension 3N obtained by concatenating the three components of the N vertices, the model is written as:

$$\begin{aligned} \text {g}(\sigma ,\alpha )=\overline{\text {g}}+\text {S}\sigma +\text {A}\alpha \end{aligned}$$
(1)

where \(\overline{\text {g}}\) denotes the mean value of \(\text {g}\). The known matrices \(\text {S} \in R^{3N\times 14}\) and \(\text {A} \in R^{3N\times 65}\) are the Shape and Animation Units that control shape and animation through the \(\sigma \) and \(\alpha \) parameters respectively. Among the 65 components of the animation control \(\alpha \), 11 are used to track eyebrows, eyes and lips. Rotation and translation also need to be estimated during tracking. Therefore, the full model parameter, denoted \(\varTheta \), has 17 dimensions: 3 dimensions of rotation \((r_{x},r_{y},r_{z})\), 3 dimensions of translation \((t_{x},t_{y},t_{z})\) and 11 dimensions of animation \(r_{a}\): \(\varTheta = [r_{x}\, r_{y}\, r_{z}\, t_{x}\, t_{y}\, t_{z}\, r_{a}]^{T}\). Notice that both \(\sigma \) and \(\varTheta \) are estimated at the first frame, but only \(\varTheta \) is estimated in subsequent frames because we assume that the shape parameters do not change. In the following sections, \(\varTheta (t)\) denotes the model parameters at time t.
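For illustration only, a minimal numpy sketch of Eq. (1) could look as follows; the array contents are placeholders and all variable names are hypothetical, but the shapes match the dimensions stated above.

```python
import numpy as np

N = 113                      # number of Candide-3 vertices
g_bar = np.zeros(3 * N)      # mean shape (placeholder values)
S = np.zeros((3 * N, 14))    # Shape Units
A = np.zeros((3 * N, 65))    # Animation Units

def candide_shape(sigma, alpha):
    """Eq. (1): g(sigma, alpha) = g_bar + S @ sigma + A @ alpha."""
    g = g_bar + S @ sigma + A @ alpha
    return g.reshape(N, 3)   # one 3D point per vertex

# sigma is estimated once at the first frame and then kept fixed;
# alpha (and the rigid pose) are re-estimated in subsequent frames.
sigma = np.zeros(14)
alpha = np.zeros(65)
vertices = candide_shape(sigma, alpha)   # (113, 3) array of 3D vertices
```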

Fig. 1.
figure 1

(a) The Candide-3 model with the facial points used in our method. (b) How the response map at the mouth corner is computed from three descriptors via SVM weights.

2.2 Projection

We assume a perspective projection, in which the camera calibration has been obtained from empirical experiments. In our case, the intrinsic camera matrix is written as follows:

$$\begin{aligned} \left[ \begin{array}{ccc} f_{x} & 0 & c_{x}\\ 0 & f_{y} & c_{y}\\ 0 & 0 & 1 \end{array}\right] \end{aligned}$$
(2)

where the focal length of the camera is \(f_{x} = f_{y} = 1000\) pixels and the principal point \((c_{x},c_{y})\) is the center of the 2D video frame. This focal length can be fixed because it is shown in [30] that the focal length does not need to be known accurately when the distance between the 3D object and the camera is much larger than the depth of the 3D object. Notice that, because of the perspective projection assumption, the depth \(t_z\) is directly related to the scale parameter.
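As a rough sketch (not the authors' implementation), the projection of 3D model points with this intrinsic matrix could be written as below; the frame size and pose values are made-up examples.

```python
import numpy as np

def intrinsic_matrix(frame_w, frame_h, f=1000.0):
    """Eq. (2): fixed focal length, principal point at the frame center."""
    return np.array([[f, 0.0, frame_w / 2.0],
                     [0.0, f, frame_h / 2.0],
                     [0.0, 0.0, 1.0]])

def project(points_3d, R, t, K):
    """Perspective projection of (n, 3) model points with rotation R and translation t."""
    cam = points_3d @ R.T + t          # model points in camera coordinates
    uvw = cam @ K.T                    # apply the intrinsics
    return uvw[:, :2] / uvw[:, 2:3]    # divide by depth -> 2D pixel coordinates

K = intrinsic_matrix(320, 240)
R = np.eye(3)                              # example pose: no rotation
t = np.array([0.0, 0.0, 500.0])            # t_z acts as the scale factor
pts_3d = np.random.rand(113, 3) - 0.5      # placeholder for the Candide-3 vertices
pts_2d = project(pts_3d, R, t, K)          # (113, 2) projected points
```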

2.3 Appearance Representation

The facial appearance is represented by a set of \(N_p\) = 30 landmarks (Fig. 1). The local patch around a landmark is described by three local descriptors: intensity, gradient and Local Binary Patterns (LBP) [31], because the combination of multiple descriptors is more discriminative and robust. This combination remains fast when linear SVMs are used, as in [6]. The patch size is 15 \(\times \) 15 in our study.
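As a hedged illustration of how such descriptors could be extracted from a 15 \(\times \) 15 patch (using OpenCV and scikit-image; the LBP parameters here are illustrative, not necessarily the paper's settings):

```python
import cv2
import numpy as np
from skimage.feature import local_binary_pattern

def extract_descriptors(gray_frame, x, y, half=7):
    """Return intensity, gradient-magnitude and LBP descriptors of a 15x15 patch."""
    patch = gray_frame[y - half:y + half + 1, x - half:x + half + 1]
    intensity = patch.astype(np.float32).ravel()
    gx = cv2.Sobel(patch, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(patch, cv2.CV_32F, 0, 1)
    gradient = np.hypot(gx, gy).ravel()
    lbp = local_binary_pattern(patch, P=8, R=1, method='uniform').ravel()
    return intensity, gradient, lbp

frame = np.random.randint(0, 256, (240, 320), dtype=np.uint8)   # toy grayscale frame
descs = extract_descriptors(frame, x=100, y=120)                 # three 225-D vectors
```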

3 Our Method

We present the framework in the following sub-sections: (i) the model training from a synthesized dataset (Sect. 3.1), (ii) the robust initialization using wide-baseline matching and keyframes (Sects. 3.2 and 3.3), and (iii) the fitting strategy using pose-wise SVMs (Sect. 3.4).

3.1 Model Training from Synthesized Data

We use synthesized data for training for several reasons: (i) Most available datasets were built for frontal face alignment [32]. Others contain profile information, such as AFLW [32] and Multi-PIE [33], but their ranges of Pitch or Roll are restricted; in addition, the numbers of landmarks annotated on frontal and profile faces differ, which leaves a gap when tracking from frontal to profile faces. (ii) The ground-truth annotation campaign for building a new dataset is very expensive. (iii) Hidden landmarks can be localized in a synthesized dataset, so the gap between frontal and profile tracking can be bridged.

Fig. 2.
figure 2

From left to right, the training process: 143 frontal images, landmark annotation and 3D model alignment, synthesized image rendering, and pose-wise SVM training.

The training process is shown in Fig. 2. First, we select 143 frontal images (143 different subjects) from Multi-PIE. We then align the 3D face model to the known landmarks of each image by POSIT [34] and warp the image texture onto the model. Afterwards, rendering is used to generate a set of synthesized images at different poses. Finally, all synthesized images are clustered into pose-wise groups before extracting local features and training landmark models with linear SVM classifiers. For rendering, we only consider the three rotations to generate the synthesized data; indeed, we can assume that the translation parameters do not considerably affect the facial appearance. Because of storage and computational constraints, the data are rendered in the following ranges: 15 values of \(Yaw \in [-70:10:70]\), 11 values of \(Pitch \in [-50:10:50]\), and 7 values of \(Roll \in [-30:10:30]\). Empirical experiments show that these ranges are sufficient for robust tracking.

Linear SVMs are used for landmark model training because of their computational efficiency [6], and the combination of three descriptors (Sect. 2.3) makes the response maps more robust. Because of the large pose variation of the dataset, pose-wise SVMs are trained as follows. The rendered images are split into 1155 (\(15(Yaw)\times 11(Pitch)\times 7(Roll)\)) pose-wise groups (143 images per group). Each group is used to train 90 pose-wise linear SVMs (30 landmarks \(\times \) 3 descriptors) in a similar manner to [6]. In total, 103,950 classifiers (denoted \(\zeta \)) need to be trained. In other words, let \(C_{x,y,z} \in \zeta \) denote the classifier trained for a specific pose x (among the 1155 poses), landmark id y (\(\in [1,..,30]\)) and descriptor type z (\(\in \) {intensity, gradient, LBP}). Given the descriptor \(\phi _{z}\) of a local region, \(C_{x,y,z}(\phi _{z})\) returns a map of confidence levels, called the response map. This map indicates how confidently the landmark can be localized at each position; see Fig. 1. The number of classifiers may seem very large, but it is manageable in practice because the training is done once offline and only a few classifiers are used at each time step during tracking. To train such a large number of linear SVM classifiers, a very fast linear SVM library, libLinear [35], is a suitable tool.
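A minimal sketch of this pose-wise training loop is given below, using scikit-learn's liblinear-backed LinearSVC as a stand-in for libLinear. The patch sampling is replaced by random placeholders, since the rendered data are not reproduced here; everything except the pose grid and the counts (30 landmarks, 3 descriptors) is an assumption.

```python
import itertools
import numpy as np
from sklearn.svm import LinearSVC

yaw_bins   = range(-70, 71, 10)    # 15 values
pitch_bins = range(-50, 51, 10)    # 11 values
roll_bins  = range(-30, 31, 10)    # 7 values
DESCRIPTORS = ['intensity', 'gradient', 'lbp']

classifiers = {}   # the set "zeta", keyed by (pose, landmark id, descriptor)

for pose in itertools.product(yaw_bins, pitch_bins, roll_bins):   # 1155 pose bins
    for lm_id in range(30):
        for desc in DESCRIPTORS:
            # In the real pipeline, X and y would come from patches of the 143
            # rendered images of this pose bin (positives centered on the
            # landmark, negatives around it); random placeholders are used here.
            X = np.random.randn(200, 225)        # 200 descriptors of 15x15 patches
            y = np.random.randint(0, 2, 200)     # 1 = patch centered on the landmark
            clf = LinearSVC(C=1.0)               # liblinear-backed linear SVM
            clf.fit(X, y)
            classifiers[(pose, lm_id, desc)] = clf
```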

3.2 Robust Initialization

In non-rigid face tracking, the aligned model from the previous frame is usually used as the initialization for the current frame. This initialization does not work robustly under fast motion. Others, e.g. [7] (in its implementation), adaptively localize the current face position via the maximum response of template matching [36]. However, false positive detections can occur, and recovery from lost tracking is impossible unless a face detector is involved. In fact, information from several previous frames can provide a more robust initialization. [24] showed impressive pose tracking results obtained by matching via keypoint learning. We propose a simpler strategy for initialization: our method uses SIFT matching as in [23] and estimates the rigid parameters similarly to [15]. It withstands fast motion and provides accurate recovery before the face model is fitted with pose-wise SVMs in the next step.

First of all, 2D SIFT points are detected. We rely on the projections of the 3D model of keyframe k onto the 2D current frame t to estimate the rotation and translation (rigid parameters). Let \(n_k\) and \(n_t\) denote the numbers of SIFT points detected on a keyframe k and a frame \(t>k\) respectively, and

$$\begin{aligned} \quad l_k = \left\{ l_k^0, l_k^1, ..., l_k^{n_k}\right\} \quad \texttt {and} \quad l_t = \left\{ l_t^0, l_t^1, ..., l_t^{n_t}\right\} \end{aligned}$$
(3)

are their respective locations. Let the 3D points \(L_k\), associated with the 2D points \(l_{k}\), be the intersections between the 3D model and the straight lines passing through the projection center of the camera and the 2D locations \(l_{k}\). Because some points can be invisible (and are ignored), the number m of points in \(L_{k}\) can differ from \(n_k\). If \(R_{k,t}\) and \(T_{k,t}\) denote respectively the rotation and translation from frame k to frame t, then, for \(i=1\) to m, the predicted i-th point at frame t can be written:

$$\begin{aligned} \widehat{l}_t^{i} = K \varPhi (L_{k}^{i}) \quad \text {where} \quad \varPhi (L_{k}^{i})=(R_{k,t}\circ T_{k,t})L_k^{i} \end{aligned}$$
(4)

where \(\circ \) denotes the composition operator and K is the intrinsic camera matrix. To determine \(R_{k,t}\) and \(T_{k,t}\), we solve the following least-squares problem:

$$\begin{aligned} \{\hat{R}_{k,t},\hat{T}_{k,t}\} = \arg \min _{R_{k,t},T_{k,t}}\sum _{l_k^j \leftrightarrows l_t^i}\left( l_t^{i} - K\varPhi (L_{k}^{j}) \right) ^{2} \end{aligned}$$
(5)

where \(\sum _{l_k^j \leftrightarrows l_t^i}\) denotes the sum over the pairs (i, j) obtained by the RANSAC-based matching of [37] between the keyframe k and the current frame t; this correspondence is denoted \(l_k^j \leftrightarrows l_t^i\). Before applying RANSAC, we run the FLANN matcher in both directions (from the keyframe k to the current frame t and vice versa) and keep the intersection of the two match sets. Finally, the minimization of expression (5) is carried out numerically via the Levenberg-Marquardt algorithm.
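A rough OpenCV/SciPy sketch of this step (not the authors' implementation) is given below: SIFT descriptors are matched in both directions with FLANN, outliers are removed with RANSAC, and the pose of Eq. (5) is refined with Levenberg-Marquardt. The 3D points `L_k` back-projected from the keyframe, the 2D locations `l_t`, the intrinsic matrix `K` and an initial pose are assumed to be given as numpy float arrays.

```python
import cv2
import numpy as np
from scipy.optimize import least_squares

sift = cv2.SIFT_create()
flann = cv2.FlannBasedMatcher()

def mutual_matches(desc_k, desc_t):
    """Keep only matches found in both directions (keyframe -> frame and back)."""
    kt = {(m.queryIdx, m.trainIdx) for m in flann.match(desc_k, desc_t)}
    tk = {(m.trainIdx, m.queryIdx) for m in flann.match(desc_t, desc_k)}
    return list(kt & tk)

def estimate_rigid(L_k, l_t, K, rvec0, tvec0):
    """Eq. (5): refine R, T so that the projected 3D keyframe points match l_t."""
    # RANSAC removes the outlier correspondences first
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(L_k, l_t, K, None, rvec0, tvec0,
                                                 useExtrinsicGuess=True)
    L_in, l_in = L_k[inliers[:, 0]], l_t[inliers[:, 0]]

    def residual(p):
        proj, _ = cv2.projectPoints(L_in, p[:3], p[3:], K, None)
        return (proj.reshape(-1, 2) - l_in).ravel()

    sol = least_squares(residual, np.hstack([rvec.ravel(), tvec.ravel()]), method='lm')
    return sol.x[:3], sol.x[3:]    # refined rotation (Rodrigues vector) and translation
```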

Fig. 3.
figure 3

Our two-step approach from frame t to frame \(t+1\). The first step uses SIFT matching to estimate the rigid parameters. The second step uses pose-wise SVMs to re-estimate the rigid and non-rigid parameters. The aligned current frames are stored as keyframes if they satisfy some given conditions.

3.3 Matching Strategy by Keyframes

Wide-baseline matching via SIFT is deployed to estimate the rigid parameters used as the initialization for the next step. After detecting the landmarks in the first frame with the landmark detector, we align the 3D face model to the landmarks, estimate the rigid parameters \(\check{\varTheta }(1)\) of \(\varTheta (1)\), and then use the pose-wise models (Sect. 3.4) to estimate the rigid and non-rigid parameters simultaneously. The information of this first frame, such as the 2D face region and its rigid parameters, the corresponding 2D and 3D points, and the value of the objective function of Sect. 3.4, is saved as a keyframe. To find the rigid parameters \(\check{\varTheta }(2)\) of the second frame, the first keyframe is matched with this frame using the method reported in Sect. 3.2. If the number of matching points is less than a given threshold \(T_p\) = 25, the Harris detector [38] and KLT [39] are used instead of SIFT matching for the rigid estimation. The method reported in Sect. 3.4 is then applied to re-estimate all parameters. The same strategy is applied to the following frames. To estimate the rigid parameters of frame t, it is matched against all previously stored keyframes \(\mathscr {K}_t\) to select a candidate keyframe k. The candidate keyframe is the keyframe that has the maximum number of matching points with the current frame (after removing the outliers with RANSAC). This number should be larger than \(T_p\); otherwise, we estimate the parameters using Harris points tracked by KLT from the previous frame. After model alignment, the current frame is registered as a new keyframe if its rotation values (Yaw, Pitch or Roll) are outside the ranges covered by the whole set of preceding keyframes. The first keyframe is fixed, while the other keyframes can be updated: a keyframe is updated if it is the candidate keyframe and its objective-function value (Sect. 3.4) is larger than that of the current frame. To make sure that bad keyframes are not registered, we detect the face position in parallel with a face detector and compute the distance between this position and the position obtained by matching. The keyframe used for matching (the candidate keyframe) is withdrawn from the set of keyframes if this distance is too large. Our method is fully automatic, and no keyframe is selected manually before tracking the video sequence. The tracking strategy is summarized in Algorithm 1.

figure a
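One piece of this strategy, deciding whether the current frame should be registered as a new keyframe, can be sketched as a small self-contained example (the keyframe structure and field names are hypothetical):

```python
import numpy as np

def should_register_keyframe(rotation, keyframes):
    """Register the current frame as a keyframe if its (Yaw, Pitch, Roll) falls
    outside the rotation ranges already covered by the stored keyframes."""
    if not keyframes:
        return True
    poses = np.array([kf['rotation'] for kf in keyframes])   # (n_keyframes, 3)
    lo, hi = poses.min(axis=0), poses.max(axis=0)
    rot = np.asarray(rotation)
    return bool(np.any(rot < lo) or np.any(rot > hi))

# toy usage: with only a frontal keyframe stored, a 40-degree yaw frame is registered
keyframes = [{'rotation': (0.0, 0.0, 0.0)}]
print(should_register_keyframe((40.0, 5.0, 0.0), keyframes))   # True
```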

3.4 Fitting via Pose-Wise Classifiers

The previous step provides a precise initial pose of the face model. This pose determines which pose-wise SVMs among the set \(\zeta \) should be chosen for fitting. For simplicity, \(\check{\varTheta }(t)\) and \(\varTheta (t)\) are written \(\check{\varTheta }\) and \(\varTheta \). As above, \(\check{\varTheta }\) denotes the rotation components of the current model parameters \(\varTheta \) estimated after the initialization step. v groups of SVMs (\(C_{\check{\varTheta }_{i},y,z}\), \(i=1,...,v\)), where the \(\check{\varTheta }_i\) are the v poses nearest to \(\check{\varTheta }\), are chosen for fitting; \(v=4\) gives the best performance in our empirical experiments. Given the parameter \(\varTheta \) of the 3D face model, \(x_k(\varTheta )\) denotes the projection of the k-th landmark onto the current frame. The response map of \(x_k(\varTheta )\) is computed independently by each group as follows. Three local descriptors \(\phi _{z}\), \(z \in \) {intensity (gray), gradient (grad), LBP (lbp)}, are extracted around the k-th landmark. The combined response map is the element-by-element multiplication of the response maps (normalized into [0, 1]) computed independently for each descriptor: \(w = C_{\varTheta _i,k,gray}(\phi _{gray}).*C_{\varTheta _i,k,grad}(\phi _{grad}).*C_{\varTheta _i,k,lbp}(\phi _{lbp})\), see Fig. 1. This final combined response map is used to detect candidate landmark locations. The same procedure is applied to all landmarks. It is worth noting that the face is normalized to a reference face before the feature descriptors are extracted.
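A minimal numpy sketch of this combination, and of the candidate extraction used in the next paragraph (local peaks above 70% of the maximum), could look as follows; the response maps are toy random data here.

```python
import numpy as np
from scipy.ndimage import maximum_filter

def normalize01(m):
    """Scale a response map to [0, 1]."""
    m = m - m.min()
    return m / (m.max() + 1e-12)

def combine_maps(map_gray, map_grad, map_lbp):
    """Element-wise product of the three normalized per-descriptor response maps."""
    return normalize01(map_gray) * normalize01(map_grad) * normalize01(map_lbp)

def candidate_peaks(w, ratio=0.7):
    """Keep every local peak whose score exceeds `ratio` of the global maximum."""
    local_max = (w == maximum_filter(w, size=3))
    ys, xs = np.nonzero(local_max & (w > ratio * w.max()))
    return np.stack([xs, ys], axis=1), w[ys, xs]   # candidate positions and weights w_k

# toy example: three 11x11 response maps around one landmark
w = combine_maps(*(np.random.rand(11, 11) for _ in range(3)))
cands, weights = candidate_peaks(w)
```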

If the highest-scoring position were simply picked as the candidate of the k-th landmark, v candidates would have to be considered (one per pose-wise SVM group \(C_{\varTheta _i,k,z}\), \(i=1,...,v\)). However, we observe that the highest score is not always the best one, and other peaks may be good candidates as well. Therefore, we keep more than one candidate (every local peak whose score is larger than \(70\,\%\) of the highest one) before determining the best one with shape constraints. The candidates of the k-th landmark detected by the v classifier groups \(C_{\varTheta _i,k,z}\) are merged together; let \(\varOmega _{k}\) denote this merged set of candidates. The rigid and non-rigid parameters can then be estimated via the objective function:

$$\begin{aligned} \widehat{\varTheta } = \arg \min _{p_{k}\in \varOmega _{k},\varTheta } \sum _{k=1}^{n} w_{k} \left\| x_{k}(\varTheta ) - p_{k} \right\| _2^2 \end{aligned}$$
(6)

where \(x_k(\varTheta )\) is the projection of the k-th landmark corresponding to \(\varTheta \), and \(p_{k} \in \varOmega _k\) is a candidate position of the k-th landmark with confidence \(w_k\) (taken from its response map). The optimization problem in Eq. 6 is combinatorial. We propose a heuristic method, based on the ICP (Iterative Closest Point) [40] algorithm, to find a solution. The approach alternates two sub-steps: (i) finding the candidate \(p_{k}\) closest to \(x_k(\varTheta )\), and (ii) estimating the update \(\varDelta \varTheta \) with a gradient method, as summarized in Algorithm 2. The update \(\varDelta \varTheta \) is computed via a first-order Taylor expansion, similarly to [7], as in Eq. 7, where \(J_{k}\) is the Jacobian matrix of the k-th landmark projection.

figure b
$$\begin{aligned} \varDelta \varTheta =\left( \sum _{k=1}^{n}w_{k}J_{k}^{T}J_{k}\right) ^{-1}\left( \sum _{k=1}^{n}w_{k}J_{k}^{T}(p_{k}-x_{k}(\varTheta ))\right) \end{aligned}$$
(7)
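For illustration, one iteration of this ICP-like update (candidate selection followed by Eq. 7) could be sketched as follows; the projections, Jacobians, candidate sets and weights are passed in as numpy arrays, and the toy call at the end only checks that the shapes are consistent.

```python
import numpy as np

def icp_like_update(x_proj, jacobians, candidate_sets, candidate_weights):
    """One iteration of the heuristic fitting.

    x_proj:            (n, 2) current projections x_k(Theta)
    jacobians:         list of (2, 17) Jacobian matrices J_k
    candidate_sets:    list of (m_k, 2) arrays, the merged candidates Omega_k
    candidate_weights: list of (m_k,) confidence arrays w_k
    """
    H = np.zeros((17, 17))
    b = np.zeros(17)
    for xk, Jk, cands, ws in zip(x_proj, jacobians, candidate_sets, candidate_weights):
        # sub-step (i): closest candidate to the current projection
        idx = np.argmin(np.linalg.norm(cands - xk, axis=1))
        pk, wk = cands[idx], ws[idx]
        # sub-step (ii): accumulate the weighted normal equations of Eq. (7)
        H += wk * Jk.T @ Jk
        b += wk * Jk.T @ (pk - xk)
    return np.linalg.solve(H, b)    # delta Theta, the 17-dimensional update

# toy call with random data for the 30 landmarks
rng = np.random.default_rng(0)
d_theta = icp_like_update(rng.normal(size=(30, 2)),
                          [rng.normal(size=(2, 17)) for _ in range(30)],
                          [rng.normal(size=(5, 2)) for _ in range(30)],
                          [rng.random(5) for _ in range(30)])
```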

4 Experimental Results

The Boston University Face Tracking (BUFT) database of [11] and the Talking Face video (Footnote 2) are used to evaluate the precision of pose estimation and landmark tracking, respectively. The VidTimid videos of [41] and the Honda/UCSD dataset of [42] are also used to investigate the profile-face tracking capability.

BUFT: The pose ground truth is captured by the "Flock of Birds" magnetic sensor with an accuracy below \(1^{\circ }\). The uniform-light set, which is used for the evaluation, contains 45 video sequences (320\(\times \)240 resolution) of 5 subjects (9 videos per subject) with ground truth for Yaw (or Pan), Pitch (or Tilt) and Roll. The precision is measured by the Mean Absolute Error (MAE) of the three angles between the estimation and the ground truth over the tracked frames: \(E_{yaw},E_{pitch},E_{roll}\) and \(E_{m}=\frac{1}{3}\left( E_{yaw}+E_{pitch}+E_{roll}\right) \), where \(E_{yaw}=\frac{1}{N_{s}}\sum |\varTheta _{yaw}^{i}-\hat{\varTheta }_{yaw}^{i}|\) (and similarly for Pitch and Roll). \(N_{s}\) is the number of frames, and \(\varTheta _{yaw}^{i}, \hat{\varTheta }_{yaw}^{i}\) are the estimated value and the ground truth of Yaw respectively.

Table 1. Pose precision of our method and of state-of-the-art methods on the uniform-light set of the BUFT dataset.

Although BUFT videos have a low resolution and the number of SIFT points is often insufficient for matching, our result (Table 1) is still comparable to state-of-the-art methods. Our method achieves the same mean error \(E_{m}\) as [7, 13, 20], but a worse one than [12, 19, 23–25]. Given that it relies only on offline training with synthesized data, this result is promising. The algorithm is better than [20] for Yaw and Roll precision and better than [7] for Pitch and Roll precision. Fully automatic methods are marked (*) in Table 1; the others are manual. Methods able to track non-rigid parameters in addition to the rigid ones, such as ours, are marked (+) in Table 1. The methods with better results than ours either estimate only the rigid parameters or are manual. In contrast, our method estimates both rigid and non-rigid parameters and recovers from lost tracking, while its training data are purely synthetic.

The Talking Face Video: this is a freely available 5000-frame video sequence of a talking face with ground truth for 68 facial points over the whole video. The Root-Mean-Squared (RMS) error is used to measure the landmark tracking (non-rigid) precision. Although the number of landmarks differs between methods, the same evaluation scheme can still be applied to a common subset of landmarks; twelve landmarks at the corners of the eyes, the nose and the mouth are chosen. Figure 4 shows the RMS error of our method (red curve) and of FaceTracker [7] (blue curve) on the Talking Face video; the vertical axis is the RMS error (in pixels) and the horizontal axis is the frame number. The result shows that, even though our method is trained only on synthesized data, it is comparable to the state-of-the-art method, and even more robust. Over the entire video, the average error of our method is 5.8 pixels, against 6.8 pixels for FaceTracker.

Fig. 4.
figure 4

The RMS error of our framework (red curve) and of FaceTracker [7] (blue curve). The vertical axis is the RMS error (in pixels) and the horizontal axis is the frame number (Color figure online).

Fig. 5.
figure 5

Our tracking method on some sample videos of VidTimid and Honda/UCSD.

VidTimid and Honda/UCSD: VidTimid is captured at a resolution of 512\(\times \)384 pixels in a good office environment. The Honda/UCSD dataset, at a resolution of 640\(\times \)480, is much more challenging than VidTimid: it provides a wide range of poses under different conditions such as partial face occlusion, scale changes, illumination, etc. Ground truth for pose or landmarks is not available for these databases; hence they are used to visually assess profile tracking. Our framework again demonstrates its capability, even under more complex head movements. In fact, our method is more robust than FaceTracker at keeping track of the face, and it can recover the face quickly without waiting for a frontal-view reset as FaceTracker does; see Fig. 5. Some full videos from the paper can be found here (Footnote 3), including one video of our own recorded for evaluation; our method is again more robust than FaceTracker on this video.

Real-time computation is not yet achieved (about 5 s/frame on a 3.1 GHz desktop with 8 GB RAM) because of the Matlab implementation; the first step alone takes about 3 s/frame because of the SIFT matching. A C/C++ implementation and the replacement of SIFT by a faster descriptor are possible future work. In addition, our method is not robust to complex backgrounds because no background is included in our synthetic training data; taking the background into account in the training process may be a possible solution.

5 Conclusions

We presented a robust framework for wide-rotation tracking of a non-rigid face. Our method uses a large synthesized dataset rendered from a small set of annotated frontal views. This dataset is divided into pose-wise groups to train linear SVM classifiers. The response map of a landmark is the combination of the response maps of three descriptors: intensity, gradient and LBP. By keeping several candidates from each combined response map, we apply a heuristic method to choose the best one under the constraint of the 3D shape model. In addition, SIFT matching makes our method robust to fast movements and provides good initial rigid parameters. Thanks to keyframes, our method can recover from lost tracking quickly without waiting for a frontal-view reset. Our method is more robust than a state-of-the-art method for profile tracking and comparable for landmark tracking. However, it is still limited in the presence of complex backgrounds, because no complex background is included in the synthesized training images; it could be more effective if the backgrounds of the synthesized images were more varied. In addition, the SIFT matching is slow and needs to be improved to reach real-time performance, which is a direction for future work.