1 Introduction

Non-rigid face tracking is an important topic that has received considerable attention over the last decades. It is useful in many domains such as video surveillance, human-computer interaction, and biometrics. The problem becomes much more challenging under out-of-plane rotation, illumination changes, the presence of several people, or occlusions. In our study, we propose an approach to track a non-rigid face under out-of-plane rotation, including the profile view. In other words, our method simultaneously estimates the six rigid face parameters, namely the 3D translation and the three axial rotations (Footnote 1), and the non-rigid parameters.

For non-rigid face tracking, a set of landmarks is considered as the face shape model. Since the pioneering work of [1], it is well known that the Active Appearance Model (AAM) provides an efficient way to represent and track frontal faces. Many works [2–4] have suggested improvements in terms of fitting accuracy or profile-view tracking. The Constrained Local Model (CLM), proposed by [5], consists of an exhaustive local search around landmarks constrained by a shape model. [6, 7] both improved this method in terms of accuracy and speed; more specifically, [7] can track a single face with vertical rotation up to \(90^{\circ }\) in a well-controlled environment. Cascaded Pose Regression (CPR), first proposed by [8], has recently shown remarkable performance [9, 10]. This method achieves high accuracy at real-time speed, but it is restricted to near-frontal face tracking. Most methods work only on constrained views for two reasons: (i) the acquisition of ground truth for unconstrained views is expensive in practice, and (ii) handling the hidden landmarks on the invisible side of the face is difficult.

The literature also mentions other face models such as the cylinder [11–13], ellipsoid [14] or mesh [15]. Most of these methods can estimate the three rotations even at large angles and on the profile view, but it is worth noting that they handle rigid motion rather than non-rigid facial expression. On the other hand, the popular 3D Candide-3 model is defined to manage both rigid and non-rigid parameters. [16] applied a Kalman Filter to interest points in a video sequence based on adaptively rendered keyframes; this work is semi-automatic and cannot cope with fast movement. [17] used Mahalanobis distances of local features, constrained by the face model, to capture both rigid and non-rigid head motions. [18] learned a linear model between model parameters and the face's appearance. These methods work poorly on profile views. [19] extended the Candide model to work with the profile, but their objective function, combining structure and appearance features with dynamic modeling, appears to converge slowly because of its high dimensionality. [20] proposed an adaptive Bayesian approach to track principal components of landmark appearance. Their algorithm appears to be robust for tracking landmarks, but unable to recover when tracking is lost. Let us note that these methods use synthetic databases to train their tracking models. The pose estimation performance of the mentioned methods can be further improved by integrating Kalman Filtering [21] or Particle Filtering [22].

A face tracking framework is robust if it can operate under a wide range of poses, facial expressions, environmental changes and occlusions, and can also recover from failures. In [11, 12], the authors utilized dynamic templates based on a cylinder model in order to handle lighting changes and self-occlusion. Local features can be considered [7, 10], since local descriptors are not much affected by facial expressions and self-occlusion. In order to provide recovery capability, tracking-by-detection or wide-baseline matching [15, 23, 24] have been applied. The primary idea is to match the current frame with previously stored keyframes. Matching is robust to fast movements and illumination changes and is able to recover lost tracking. However, matching is only suitable for rigid parameters; moreover, these methods degrade when too few keypoints are detected on the face. Recently, [25] proposed combining traditional tracking techniques with deep learning to obtain proficient pose tracking. Commercial products also exist, e.g. [26], which shows effective results in pose and face animation tracking, but needs controlled illumination and movement. In addition, it has to wait for a frontal view to re-initialize the model when the face is lost.

In this paper, our contribution is twofold: (i) using a large offline synthetic database to train tracking models, and (ii) proposing a two-step approach to track a non-rigid face. These points are introduced in more detail below. Firstly, a large synthesized database is built to avoid expensive and time-consuming manual annotation. To the best of our knowledge, although some papers have worked with synthetic data [18, 27, 28], ours is the first study that investigates a large offline synthetic dataset for free-pose tracking of a non-rigid face. Secondly, the tracking approach consists of two steps: (a) the first step uses 2D SIFT matching between the current frame and previously stored keyframes to estimate only the rigid parameters; in this way, our method withstands fast movement and can recover from lost tracking. (b) The second step obtains the whole set of parameters (rigid and non-rigid) with a heuristic method using pose-wise SVMs; this allows the 3D model to be aligned to a profile face as efficiently as to a frontal face. A combination of three descriptors is also considered for a better local representation.

The remainder of this paper is organized as follows: Sect. 2 describes the face model and the descriptors used. Section 3 presents the pipeline of the proposed framework. Experimental results and analysis are presented in Sect. 4. Finally, Sect. 5 provides some conclusions and further perspectives.

2 Face Representation

2.1 Shape Representation

Candide-3, initially proposed by [29], is a popular face model managing both facial shape and animation. It consists of \(N=113\) vertices representing 168 surfaces. If \(\text {g}\in R^{3N}\) denotes the vector of dimension 3N obtained by concatenating the three components of the N vertices, the model is written as:

$$\begin{aligned} \text {g}(\sigma ,\alpha )=\overline{\text {g}}+\text {S}\sigma +\text {A}\alpha \end{aligned}$$
(1)

where \(\overline{\text {g}}\) denotes the mean value of \(\text {g}\). The known matrices \(\text {S} \in R^{3N\times 14}\) and \(\text {A} \in R^{3N\times 65}\) are the Shape and Animation Units that control shape and animation through the \(\sigma \) and \(\alpha \) parameters respectively. Among the 65 components of the animation control \(\alpha \), 11 are used to track eyebrows, eyes and lips. Rotation and translation also need to be estimated during tracking. Therefore, the full model parameter, denoted \(\varTheta \), has 17 dimensions: 3 dimensions of rotation \((r_{x},r_{y},r_{z})\), 3 dimensions of translation \((t_{x},t_{y},t_{z})\) and 11 dimensions of animation \(r_{a}\): \(\varTheta = [r_{x}\, r_{y}\, r_{z}\, t_{x}\, t_{y}\, t_{z}\, r_{a}]^{T}\). Notice that both \(\sigma \) and \(\varTheta \) are estimated at the first frame, but only \(\varTheta \) is estimated in subsequent frames because we assume that the shape parameters do not change. In the following sections, \(\varTheta (t)\) denotes the model parameters at time t.
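For illustration only, a minimal numpy sketch of Eq. (1) could look as follows; the array contents are placeholders and all variable names are hypothetical, but the shapes match the dimensions stated above.

```python
import numpy as np

N = 113                      # number of Candide-3 vertices
g_bar = np.zeros(3 * N)      # mean shape (placeholder values)
S = np.zeros((3 * N, 14))    # Shape Units
A = np.zeros((3 * N, 65))    # Animation Units

def candide_shape(sigma, alpha):
    """Eq. (1): g(sigma, alpha) = g_bar + S @ sigma + A @ alpha."""
    g = g_bar + S @ sigma + A @ alpha
    return g.reshape(N, 3)   # one 3D point per vertex

# sigma is estimated once at the first frame and then kept fixed;
# alpha (and the rigid pose) are re-estimated in subsequent frames.
sigma = np.zeros(14)
alpha = np.zeros(65)
vertices = candide_shape(sigma, alpha)   # (113, 3) array of 3D vertices
```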

Fig. 1.
figure 1

(a) The Candide-3 model with the facial points used in our method. (b) How the response map at the mouth corner is computed from three descriptors via SVM weights.

2.2 Projection

We assume a perspective projection, in which the camera calibration has been obtained from empirical experiments. In our case, the intrinsic camera matrix is written as follows:

$$\begin{aligned} \left[ \begin{array}{ccc} f_{x} & 0 & c_{x}\\ 0 & f_{y} & c_{y}\\ 0 & 0 & 1 \end{array}\right] \end{aligned}$$
(2)

where the focal length of the camera is \(f_{x} = f_{y} = 1000\) pixels and the principal point \((c_{x},c_{y})\) is the center of the 2D video frame. This focal length can be fixed because it is shown in [30] that the focal length does not need to be known accurately when the distance between the 3D object and the camera is much larger than the depth of the 3D object. Notice that, because of the perspective projection assumption, the depth \(t_z\) is directly related to the scale parameter.
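As a rough sketch (not the authors' implementation), the projection of 3D model points with this intrinsic matrix could be written as below; the frame size and pose values are made-up examples.

```python
import numpy as np

def intrinsic_matrix(frame_w, frame_h, f=1000.0):
    """Eq. (2): fixed focal length, principal point at the frame center."""
    return np.array([[f, 0.0, frame_w / 2.0],
                     [0.0, f, frame_h / 2.0],
                     [0.0, 0.0, 1.0]])

def project(points_3d, R, t, K):
    """Perspective projection of (n, 3) model points with rotation R and translation t."""
    cam = points_3d @ R.T + t          # model points in camera coordinates
    uvw = cam @ K.T                    # apply the intrinsics
    return uvw[:, :2] / uvw[:, 2:3]    # divide by depth -> 2D pixel coordinates

K = intrinsic_matrix(320, 240)
R = np.eye(3)                              # example pose: no rotation
t = np.array([0.0, 0.0, 500.0])            # t_z acts as the scale factor
pts_3d = np.random.rand(113, 3) - 0.5      # placeholder for the Candide-3 vertices
pts_2d = project(pts_3d, R, t, K)          # (113, 2) projected points
```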

2.3 Appearance Representation

The facial appearance is represented by a set of \(N_p\) = 30 landmarks (Fig. 1). The local patch around a landmark is described by three local descriptors: intensity, gradient and Local Binary Patterns (LBP) [31], because the combination of multiple descriptors is more discriminative and robust. This combination remains fast when linear SVMs are used, as in [6]. The patch size is 15 \(\times \) 15 in our study.
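As a hedged illustration of how such descriptors could be extracted from a 15 \(\times \) 15 patch (using OpenCV and scikit-image; the LBP parameters here are illustrative, not necessarily the paper's settings):

```python
import cv2
import numpy as np
from skimage.feature import local_binary_pattern

def extract_descriptors(gray_frame, x, y, half=7):
    """Return intensity, gradient-magnitude and LBP descriptors of a 15x15 patch."""
    patch = gray_frame[y - half:y + half + 1, x - half:x + half + 1]
    intensity = patch.astype(np.float32).ravel()
    gx = cv2.Sobel(patch, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(patch, cv2.CV_32F, 0, 1)
    gradient = np.hypot(gx, gy).ravel()
    lbp = local_binary_pattern(patch, P=8, R=1, method='uniform').ravel()
    return intensity, gradient, lbp

frame = np.random.randint(0, 256, (240, 320), dtype=np.uint8)   # toy grayscale frame
descs = extract_descriptors(frame, x=100, y=120)                 # three 225-D vectors
```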

3 Our Method

We present the framework in the following sub-sections: (i) the model training from a synthesized dataset (Sect. 3.1), (ii) the robust initialization using wide-baseline matching and keyframes (Sects. 3.2 and 3.3), and (iii) the fitting strategy using pose-wise SVMs (Sect. 3.4).

3.1 Model Training from Synthesized Data

We use synthesized data for training for several reasons: (i) Most available datasets were built for frontal face alignment [32]. Others contain profile information, such as AFLW [32] and Multi-PIE [33], but their ranges of Pitch or Roll are restricted; in addition, the numbers of landmarks annotated on frontal and profile faces differ, which leaves a gap when tracking from frontal to profile faces. (ii) The ground-truth annotation campaign for building a new dataset is very expensive. (iii) Hidden landmarks can be localized in a synthesized dataset, so the gap between frontal and profile tracking can be bridged.

Fig. 2.
figure 2

From left to right, the training process: 143 frontal images, landmark annotation and 3D model alignment, synthesized image rendering, and pose-wise SVM training.

The training process is shown in Fig. 2. First, we select 143 frontal images (143 different subjects) from Multi-PIE. We then align the 3D face model to the known landmarks of each image by POSIT [34] and warp the image texture onto the model. Afterwards, rendering is used to generate a set of synthesized images at different poses. Finally, all synthesized images are clustered into pose-wise groups before extracting local features and training landmark models with linear SVM classifiers. For rendering, we only consider the three rotations to generate the synthesized data; indeed, we can assume that the translation parameters do not considerably affect the facial appearance. Because of storage and computational constraints, the data are rendered in the following ranges: 15 values of \(Yaw \in [-70:10:70]\), 11 values of \(Pitch \in [-50:10:50]\), and 7 values of \(Roll \in [-30:10:30]\). Empirical experiments show that these ranges are sufficient for robust tracking.

Linear SVMs are used for landmark model training because of their computational efficiency [6], and the combination of three descriptors (Sect. 2.3) makes the response maps more robust. Because of the large pose variation of the dataset, pose-wise SVMs are trained as follows. The rendered images are split into 1155 (\(15(Yaw)\times 11(Pitch)\times 7(Roll)\)) pose-wise groups (143 images per group). Each group is used to train 90 pose-wise linear SVMs (30 landmarks \(\times \) 3 descriptors) in a similar manner to [6]. In total, 103,950 classifiers (denoted \(\zeta \)) need to be trained. In other words, let \(C_{x,y,z} \in \zeta \) denote the classifier trained for a specific pose x (among the 1155 poses), landmark id y (\(\in [1,..,30]\)) and descriptor type z (\(\in \) {intensity, gradient, LBP}). Given the descriptor \(\phi _{z}\) of a local region, \(C_{x,y,z}(\phi _{z})\) returns a map of confidence levels, called the response map. This map indicates how confidently the landmark can be localized at each position; see Fig. 1. The number of classifiers may seem very large, but it is manageable in practice because the training is done once offline and only a few classifiers are used at each time step during tracking. To train such a large number of linear SVM classifiers, a very fast linear SVM library, libLinear [35], is a suitable tool.
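A minimal sketch of this pose-wise training loop is given below, using scikit-learn's liblinear-backed LinearSVC as a stand-in for libLinear. The patch sampling is replaced by random placeholders, since the rendered data are not reproduced here; everything except the pose grid and the counts (30 landmarks, 3 descriptors) is an assumption.

```python
import itertools
import numpy as np
from sklearn.svm import LinearSVC

yaw_bins   = range(-70, 71, 10)    # 15 values
pitch_bins = range(-50, 51, 10)    # 11 values
roll_bins  = range(-30, 31, 10)    # 7 values
DESCRIPTORS = ['intensity', 'gradient', 'lbp']

classifiers = {}   # the set "zeta", keyed by (pose, landmark id, descriptor)

for pose in itertools.product(yaw_bins, pitch_bins, roll_bins):   # 1155 pose bins
    for lm_id in range(30):
        for desc in DESCRIPTORS:
            # In the real pipeline, X and y would come from patches of the 143
            # rendered images of this pose bin (positives centered on the
            # landmark, negatives around it); random placeholders are used here.
            X = np.random.randn(200, 225)        # 200 descriptors of 15x15 patches
            y = np.random.randint(0, 2, 200)     # 1 = patch centered on the landmark
            clf = LinearSVC(C=1.0)               # liblinear-backed linear SVM
            clf.fit(X, y)
            classifiers[(pose, lm_id, desc)] = clf
```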

3.2 Robust Initialization

In non-rigid face tracking, the aligned model from the previous frame is usually used as the initialization for the current frame. This initialization does not work robustly under fast motion. Others, e.g. [7] (in its implementation), adaptively localize the current face position via the maximum response of template matching [36]. However, false positive detections can occur, and recovery from lost tracking is impossible unless a face detector is involved. In fact, information from several previous frames can provide a more robust initialization. [24] showed impressive pose tracking results obtained by matching via keypoint learning. We propose a simpler strategy for initialization: our method uses SIFT matching as in [23] and estimates the rigid parameters similarly to [15]. It withstands fast motion and provides accurate recovery before the face model is fitted with pose-wise SVMs in the next step.

First of all, 2D SIFT points are detected. We rely on the projections of the 3D model of keyframe k onto the 2D current frame t to estimate the rotation and translation (rigid parameters). Let \(n_k\) and \(n_t\) denote the numbers of SIFT points detected on a keyframe k and a frame \(t>k\) respectively, and

$$\begin{aligned} \quad l_k = \left\{ l_k^0, l_k^1, ..., l_k^{n_k}\right\} \quad \texttt {and} \quad l_t = \left\{ l_t^0, l_t^1, ..., l_t^{n_t}\right\} \end{aligned}$$
(3)

are their respective locations. Let the 3D points \(L_k\), associated with the 2D points \(l_{k}\), be the intersections between the 3D model and the straight lines passing through the projection center of the camera and the 2D locations \(l_{k}\). Because some points can be invisible (and are ignored), the number m of points in \(L_{k}\) can differ from \(n_k\). If \(R_{k,t}\) and \(T_{k,t}\) denote respectively the rotation and translation from frame k to frame t, then, for \(i=1\) to m, the predicted i-th point at frame t can be written:

$$\begin{aligned} \widehat{l}_t^{i} = K \varPhi (L_{k}^{i}) \quad \text {where} \quad \varPhi (L_{k}^{i})=(R_{k,t}\circ T_{k,t})L_k^{i} \end{aligned}$$
(4)

where \(\circ \) denotes the composition operator and K is the intrinsic camera matrix. To determine \(R_{k,t}\) and \(T_{k,t}\), we solve the following least-squares problem:

$$\begin{aligned} \{\hat{R}_{k,t},\hat{T}_{k,t}\} = \arg \min _{R_{k,t},T_{k,t}}\sum _{l_k^j \leftrightarrows l_t^i}\left( l_t^{i} - K\varPhi (L_{k}^{j}) \right) ^{2} \end{aligned}$$
(5)

where \(\sum _{l_k^j \leftrightarrows l_t^i}\) denotes the sum over the pairs (i, j) obtained by the RANSAC-based matching of [37] between the keyframe k and the current frame t; this correspondence is denoted \(l_k^j \leftrightarrows l_t^i\). Before applying RANSAC, we run the FLANN matcher in both directions (from the keyframe k to the current frame t and vice versa) and keep the intersection of the two match sets. Finally, the minimization of expression (5) is carried out numerically via the Levenberg-Marquardt algorithm.
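A rough OpenCV/SciPy sketch of this step (not the authors' implementation) is given below: SIFT descriptors are matched in both directions with FLANN, outliers are removed with RANSAC, and the pose of Eq. (5) is refined with Levenberg-Marquardt. The 3D points `L_k` back-projected from the keyframe, the 2D locations `l_t`, the intrinsic matrix `K` and an initial pose are assumed to be given as numpy float arrays.

```python
import cv2
import numpy as np
from scipy.optimize import least_squares

sift = cv2.SIFT_create()
flann = cv2.FlannBasedMatcher()

def mutual_matches(desc_k, desc_t):
    """Keep only matches found in both directions (keyframe -> frame and back)."""
    kt = {(m.queryIdx, m.trainIdx) for m in flann.match(desc_k, desc_t)}
    tk = {(m.trainIdx, m.queryIdx) for m in flann.match(desc_t, desc_k)}
    return list(kt & tk)

def estimate_rigid(L_k, l_t, K, rvec0, tvec0):
    """Eq. (5): refine R, T so that the projected 3D keyframe points match l_t."""
    # RANSAC removes the outlier correspondences first
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(L_k, l_t, K, None, rvec0, tvec0,
                                                 useExtrinsicGuess=True)
    L_in, l_in = L_k[inliers[:, 0]], l_t[inliers[:, 0]]

    def residual(p):
        proj, _ = cv2.projectPoints(L_in, p[:3], p[3:], K, None)
        return (proj.reshape(-1, 2) - l_in).ravel()

    sol = least_squares(residual, np.hstack([rvec.ravel(), tvec.ravel()]), method='lm')
    return sol.x[:3], sol.x[3:]    # refined rotation (Rodrigues vector) and translation
```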

Fig. 3.
figure 3

Our two-step approach from frame t to frame \(t+1\). The first step uses SIFT matching to estimate the rigid parameters. The second step uses pose-wise SVMs to re-estimate the rigid and non-rigid parameters. The aligned current frames are stored as keyframes if they satisfy some given conditions.

3.3 Matching Strategy by Keyframes

Wide-baseline matching via SIFT is deployed to estimate the rigid parameters used as the initialization for the next step. After detecting the landmarks in the first frame with the landmark detector, we align the 3D face model to the landmarks, estimate the rigid parameters \(\check{\varTheta }(1)\) of \(\varTheta (1)\), and then use the pose-wise models (Sect. 3.4) to estimate the rigid and non-rigid parameters simultaneously. The information of this first frame, such as the 2D face region and its rigid parameters, the corresponding 2D and 3D points, and the value of the objective function of Sect. 3.4, is saved as a keyframe. To find the rigid parameters \(\check{\varTheta }(2)\) of the second frame, the first keyframe is matched with this frame using the method reported in Sect. 3.2. If the number of matching points is less than a given threshold \(T_p\) = 25, the Harris detector [38] and KLT [39] are used instead of SIFT matching for the rigid estimation. The method reported in Sect. 3.4 is then applied to re-estimate all parameters. The same strategy is applied to the following frames. To estimate the rigid parameters of frame t, it is matched against all previously stored keyframes \(\mathscr {K}_t\) to select a candidate keyframe k. The candidate keyframe is the keyframe that has the maximum number of matching points with the current frame (after removing the outliers with RANSAC). This number should be larger than \(T_p\); otherwise, we estimate the parameters using Harris points tracked by KLT from the previous frame. After model alignment, the current frame is registered as a new keyframe if its rotation values (Yaw, Pitch or Roll) are outside the ranges covered by the whole set of preceding keyframes. The first keyframe is fixed, while the other keyframes can be updated: a keyframe is updated if it is the candidate keyframe and its objective-function value (Sect. 3.4) is larger than that of the current frame. To make sure that bad keyframes are not registered, we detect the face position in parallel with a face detector and compute the distance between this position and the position obtained by matching. The keyframe used for matching (the candidate keyframe) is withdrawn from the set of keyframes if this distance is too large. Our method is fully automatic, and no keyframe is selected manually before tracking the video sequence. The tracking strategy is summarized in Algorithm 1.

figure a
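One piece of this strategy, deciding whether the current frame should be registered as a new keyframe, can be sketched as a small self-contained example (the keyframe structure and field names are hypothetical):

```python
import numpy as np

def should_register_keyframe(rotation, keyframes):
    """Register the current frame as a keyframe if its (Yaw, Pitch, Roll) falls
    outside the rotation ranges already covered by the stored keyframes."""
    if not keyframes:
        return True
    poses = np.array([kf['rotation'] for kf in keyframes])   # (n_keyframes, 3)
    lo, hi = poses.min(axis=0), poses.max(axis=0)
    rot = np.asarray(rotation)
    return bool(np.any(rot < lo) or np.any(rot > hi))

# toy usage: with only a frontal keyframe stored, a 40-degree yaw frame is registered
keyframes = [{'rotation': (0.0, 0.0, 0.0)}]
print(should_register_keyframe((40.0, 5.0, 0.0), keyframes))   # True
```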

3.4 Fitting via Pose-Wise Classifiers

The previous step provides a precise initial pose of the face model. This pose determines which pose-wise SVMs among the set \(\zeta \) should be chosen for fitting. For simplicity, \(\check{\varTheta }(t)\) and \(\varTheta (t)\) are written \(\check{\varTheta }\) and \(\varTheta \). As above, \(\check{\varTheta }\) denotes the rotation components of the current model parameters \(\varTheta \) estimated after the initialization step. v groups of SVMs (\(C_{\check{\varTheta }_{i},y,z}\), \(i=1,...,v\)), where the \(\check{\varTheta }_i\) are the v poses nearest to \(\check{\varTheta }\), are chosen for fitting; \(v=4\) gives the best performance in our empirical experiments. Given the parameter \(\varTheta \) of the 3D face model, \(x_k(\varTheta )\) denotes the projection of the k-th landmark onto the current frame. The response map of \(x_k(\varTheta )\) is computed independently by each group as follows. Three local descriptors \(\phi _{z}\), \(z \in \) {intensity (gray), gradient (grad), LBP (lbp)}, are extracted around the k-th landmark. The combined response map is the element-by-element multiplication of the response maps (normalized into [0, 1]) computed independently for each descriptor: \(w = C_{\varTheta _i,k,gray}(\phi _{gray}).*C_{\varTheta _i,k,grad}(\phi _{grad}).*C_{\varTheta _i,k,lbp}(\phi _{lbp})\), see Fig. 1. This final combined response map is used to detect candidate landmark locations. The same procedure is applied to all landmarks. It is worth noting that the face is normalized to a reference face before the feature descriptors are extracted.
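A minimal numpy sketch of this combination, and of the candidate extraction used in the next paragraph (local peaks above 70% of the maximum), could look as follows; the response maps are toy random data here.

```python
import numpy as np
from scipy.ndimage import maximum_filter

def normalize01(m):
    """Scale a response map to [0, 1]."""
    m = m - m.min()
    return m / (m.max() + 1e-12)

def combine_maps(map_gray, map_grad, map_lbp):
    """Element-wise product of the three normalized per-descriptor response maps."""
    return normalize01(map_gray) * normalize01(map_grad) * normalize01(map_lbp)

def candidate_peaks(w, ratio=0.7):
    """Keep every local peak whose score exceeds `ratio` of the global maximum."""
    local_max = (w == maximum_filter(w, size=3))
    ys, xs = np.nonzero(local_max & (w > ratio * w.max()))
    return np.stack([xs, ys], axis=1), w[ys, xs]   # candidate positions and weights w_k

# toy example: three 11x11 response maps around one landmark
w = combine_maps(*(np.random.rand(11, 11) for _ in range(3)))
cands, weights = candidate_peaks(w)
```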

If the highest-scoring position were simply picked as the candidate of the k-th landmark, v candidates would have to be considered (one per pose-wise SVM group \(C_{\varTheta _i,k,z}\), \(i=1,...,v\)). However, we observe that the highest score is not always the best one, and other peaks may be good candidates as well. Therefore, we keep more than one candidate (every local peak whose score is larger than \(70\,\%\) of the highest one) before determining the best one with shape constraints. The candidates of the k-th landmark detected by the v classifier groups \(C_{\varTheta _i,k,z}\) are merged together; let \(\varOmega _{k}\) denote this merged set of candidates. The rigid and non-rigid parameters can then be estimated via the objective function:

$$\begin{aligned} \widehat{\varTheta } = \arg \min _{p_{k}\in \varOmega _{k},\varTheta } \sum _{k=1}^{n} w_{k} \left\| x_{k}(\varTheta ) - p_{k} \right\| _2^2 \end{aligned}$$
(6)

where \(x_k(\varTheta )\) is the projection of the k-th landmark corresponding to \(\varTheta \), and \(p_{k} \in \varOmega _k\) is a candidate position of the k-th landmark with confidence \(w_k\) (taken from its response map). The optimization problem in Eq. 6 is combinatorial. We propose a heuristic method, based on the ICP (Iterative Closest Point) [40] algorithm, to find a solution. The approach alternates two sub-steps: (i) finding the candidate \(p_{k}\) closest to \(x_k(\varTheta )\), and (ii) estimating the update \(\varDelta \varTheta \) with a gradient method, as summarized in Algorithm 2. The update \(\varDelta \varTheta \) is computed via a first-order Taylor expansion, similarly to [7], as in Eq. 7, where \(J_{k}\) is the Jacobian matrix of the k-th landmark projection.

figure b
$$\begin{aligned} \varDelta \varTheta =\left( \sum _{k=1}^{n}w_{k}J_{k}^{T}J_{k}\right) ^{-1}\left( \sum _{k=1}^{n}w_{k}J_{k}^{T}(p_{k}-x_{k}(\varTheta ))\right) \end{aligned}$$
(7)
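For illustration, one iteration of this ICP-like update (candidate selection followed by Eq. 7) could be sketched as follows; the projections, Jacobians, candidate sets and weights are passed in as numpy arrays, and the toy call at the end only checks that the shapes are consistent.

```python
import numpy as np

def icp_like_update(x_proj, jacobians, candidate_sets, candidate_weights):
    """One iteration of the heuristic fitting.

    x_proj:            (n, 2) current projections x_k(Theta)
    jacobians:         list of (2, 17) Jacobian matrices J_k
    candidate_sets:    list of (m_k, 2) arrays, the merged candidates Omega_k
    candidate_weights: list of (m_k,) confidence arrays w_k
    """
    H = np.zeros((17, 17))
    b = np.zeros(17)
    for xk, Jk, cands, ws in zip(x_proj, jacobians, candidate_sets, candidate_weights):
        # sub-step (i): closest candidate to the current projection
        idx = np.argmin(np.linalg.norm(cands - xk, axis=1))
        pk, wk = cands[idx], ws[idx]
        # sub-step (ii): accumulate the weighted normal equations of Eq. (7)
        H += wk * Jk.T @ Jk
        b += wk * Jk.T @ (pk - xk)
    return np.linalg.solve(H, b)    # delta Theta, the 17-dimensional update

# toy call with random data for the 30 landmarks
rng = np.random.default_rng(0)
d_theta = icp_like_update(rng.normal(size=(30, 2)),
                          [rng.normal(size=(2, 17)) for _ in range(30)],
                          [rng.normal(size=(5, 2)) for _ in range(30)],
                          [rng.random(5) for _ in range(30)])
```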

4 Experimental Results

The Boston University Face Tracking (BUFT) database of [11] and the Talking Face video (Footnote 2) are used to evaluate the precision of pose estimation and landmark tracking, respectively. The VidTimid videos of [41] and the Honda/UCSD dataset of [42] are also used to investigate the profile-face tracking capability.

BUFT: The pose ground truth is captured by the "Flock of Birds" magnetic sensor with an accuracy below \(1^{\circ }\). The uniform-light set, which is used for the evaluation, contains 45 video sequences (320\(\times \)240 resolution) of 5 subjects (9 videos per subject) with ground truth for Yaw (or Pan), Pitch (or Tilt) and Roll. The precision is measured by the Mean Absolute Error (MAE) of the three angles between the estimation and the ground truth over the tracked frames: \(E_{yaw},E_{pitch},E_{roll}\) and \(E_{m}=\frac{1}{3}\left( E_{yaw}+E_{pitch}+E_{roll}\right) \), where \(E_{yaw}=\frac{1}{N_{s}}\sum |\varTheta _{yaw}^{i}-\hat{\varTheta }_{yaw}^{i}|\) (and similarly for Pitch and Roll). \(N_{s}\) is the number of frames, and \(\varTheta _{yaw}^{i}, \hat{\varTheta }_{yaw}^{i}\) are the estimated value and the ground truth of Yaw respectively.

Table 1. Pose precision of our method and of state-of-the-art methods on the uniform-light set of the BUFT dataset.

Although BUFT videos have a low resolution and the number of SIFT points is often insufficient for matching, our result (Table 1) is still comparable to state-of-the-art methods. Our method achieves the same mean error \(E_{m}\) as [7, 13, 20], but a worse one than [12, 19, 23–25]. Given that it relies only on offline training with synthesized data, this result is promising. The algorithm is better than [20] for Yaw and Roll precision and better than [7] for Pitch and Roll precision. Fully automatic methods are marked (*) in Table 1; the others are manual. Methods able to track non-rigid parameters in addition to the rigid ones, such as ours, are marked (+) in Table 1. The methods with better results than ours either estimate only the rigid parameters or are manual. In contrast, our method estimates both rigid and non-rigid parameters and recovers from lost tracking, while its training data are purely synthetic.

The Talking Face Video: this is a freely available 5000-frame video sequence of a talking face with ground truth for 68 facial points over the whole video. The Root-Mean-Squared (RMS) error is used to measure the landmark tracking (non-rigid) precision. Although the number of landmarks differs between methods, the same evaluation scheme can still be applied to a common subset of landmarks; twelve landmarks at the corners of the eyes, the nose and the mouth are chosen. Figure 4 shows the RMS error of our method (red curve) and of FaceTracker [7] (blue curve) on the Talking Face video; the vertical axis is the RMS error (in pixels) and the horizontal axis is the frame number. The result shows that, even though our method is trained only on synthesized data, it is comparable to the state-of-the-art method, and even more robust. Over the entire video, the average error of our method is 5.8 pixels, against 6.8 pixels for FaceTracker.

Fig. 4.
figure 4

The RMS error of our framework (red curve) and of FaceTracker [7] (blue curve). The vertical axis is the RMS error (in pixels) and the horizontal axis is the frame number (Color figure online).

Fig. 5.
figure 5

Our tracking method on some sample videos of VidTimid and Honda/UCSD.

VidTimid and Honda/UCSD: VidTimid is captured at a resolution of 512\(\times \)384 pixels in a good office environment. The Honda/UCSD dataset, at a resolution of 640\(\times \)480, is much more challenging than VidTimid: it provides a wide range of poses under different conditions such as partial face occlusion, scale changes, illumination, etc. Ground truth for pose or landmarks is not available for these databases; hence they are used to visually assess profile tracking. Our framework again demonstrates its capability, even under more complex head movements. In fact, our method is more robust than FaceTracker at keeping track of the face, and it can recover the face quickly without waiting for a frontal-view reset as FaceTracker does; see Fig. 5. Some full videos from the paper can be found here (Footnote 3), including one video of our own recorded for evaluation; our method is again more robust than FaceTracker on this video.

Real-time computation is not yet achieved (about 5 s/frame on a 3.1 GHz desktop with 8 GB RAM) because of the Matlab implementation; the first step alone takes about 3 s/frame because of the SIFT matching. A C/C++ implementation and the replacement of SIFT by a faster descriptor are possible future work. In addition, our method is not robust to complex backgrounds because no background is included in our synthetic training data; taking the background into account in the training process may be a possible solution.

5 Conclusions

We presented a robust framework for wide-rotation tracking of a non-rigid face. Our method uses a large synthesized dataset rendered from a small set of annotated frontal views. This dataset is divided into pose-wise groups to train linear SVM classifiers. The response map of a landmark is the combination of the response maps of three descriptors: intensity, gradient and LBP. By keeping several candidates from each combined response map, we apply a heuristic method to choose the best one under the constraint of the 3D shape model. In addition, SIFT matching makes our method robust to fast movements and provides good initial rigid parameters. Thanks to keyframes, our method can recover from lost tracking quickly without waiting for a frontal-view reset. Our method is more robust than a state-of-the-art method for profile tracking and comparable for landmark tracking. However, it is still limited in the presence of complex backgrounds, because no complex background is included in the synthesized training images; it could be more effective if the backgrounds of the synthesized images were more varied. In addition, the SIFT matching is slow and needs to be improved to reach real-time performance, which is a direction for future work.