Abstract
In this paper, we present a novel method for localizing facial feature points with generalization ability, based on a data-driven semi-supervised learning approach. Although a powerful facial feature detector can be built from a large number of human-annotated training samples, the collection process is time-consuming and often impractical due to the high cost and error-prone nature of manual annotation. The proposed method exploits data-driven semi-supervised learning that optimizes a hybrid detector by interacting with a hierarchical data model to suppress and regularize noisy outliers. Competitive performance compared with other state-of-the-art methods is demonstrated on the Bosphorus and BioID benchmark datasets.
1 Introduction
Facial feature localization has much impact on face-image based applications such as animation, expression recognition, and face registration [1]. Facial feature detectors can be categorized by the type of information they extract from face images. Local detectors usually employ texture descriptors and rely on sliding-window based search; commonly used texture descriptors are SIFT [2] and the histogram of gradients (HOG) [3]. Global detectors focus more on global models such as geometrical distributions and structural relationships between facial feature points. Facial features are characterized by shape distribution models [4] or by the probability distribution of a feature point conditioned on other locations [5].
Local feature detectors can obtain precise localization, but they are sensitive to even small amounts of noise and are prone to frequent false alarms. Global detectors avoid such local noise sensitivity; however, their localization accuracy generally falls short of real-world requirements. To resolve this dilemma, many facial feature detectors combine a local detector and a global detector. Zhu and Ramanan [3] represent patches using HOG features and employ a quadratic spring scheme as a global shape model. A RANSAC-like approach was proposed by Belhumeur et al. [6], in which a Bayesian formulation integrates local detectors into a global structure. A robust detector with generalized performance can be acquired using a large amount of correctly labeled data. However, in many real-world applications, collecting a large volume of well-labeled data is not easy due to the high cost and error-prone nature of manual annotation. We propose an outlier-aware hybrid detector based on a data-driven semi-supervised learning approach. Generalization ability can be achieved using both labeled and unlabeled data [7] through semi-supervised learning. Recently, Tong et al. [8] presented an automatic method to avoid the labor-intensive and error-prone manual processes of feature annotation: in their experiment, only a small portion of faces was manually labeled, and the remaining images were automatically annotated. However, their approach is a pioneering study on automatic facial feature annotation for easily obtaining training data; it does not aim at robust facial feature localization with generalization ability for noisy and uncertain real-world environments.
The interactive data-driven semi-supervised learning framework consists of the hybrid detector (abbreviated as H-DTR) and the hierarchical data model (abbreviated as HDM). We explore adaptive outlier suppression in the H-DTR, outlier regularization of noisy or contaminated data in the HDM, and interactive updates of the H-DTR and the HDM for better generalization ability. The HDM is constructed using the hierarchical outlier-aware soft K-means clustering algorithm [9]. In Sect. 2, an overview of our approach is given. We discuss the HDM and the formulation of the proposed H-DTR in Sect. 3 and Sect. 4, respectively. In Sect. 5, the data-driven semi-supervised learning algorithm using the HDM is discussed. Section 6 experimentally compares the proposed method with other state-of-the-art localization methods. Finally, conclusions are given in Sect. 7.
2 Overview of the Approach
In general, better generalization performance can be obtained with more correctly labeled training data; however, accumulating a large amount of labeled data usually requires heavy, labor-intensive processes and is very often error-prone. In this section, we outline a data-driven semi-supervised approach combining a hybrid detector and a hierarchical data model that can take advantage of both labeled and unlabeled data. The proposed method provides robust, generalizable performance in real-life noisy environments, and it is fully automatic, requiring no human supervision beyond the initial annotation of a labeled dataset that is much smaller than the unlabeled dataset.
The novelty of this paper is the effective combination of the HDM and the H-DTR in an interactive manner (Fig. 1). The H-DTR consists of a global detector and local detectors. Given an image, the global detector locates the face region and facial component regions using the Haar-like feature–based boosting method [10], and the detected face region is scale-normalized to reduce the effects of scale uncertainty. Facial feature distributions are initialized from pre-compiled positions. The global detector produces the confidence search areas of the individual feature points, which are constrained by Procrustes analysis [11]. The local detectors localize the feature points using k-NN regression over SIFT descriptors [2]. Both the global and local detectors use outlier detection mechanisms based on the HDM to obtain robust performance. The HDM is a two-level cluster tree of heterogeneous data models: the regularized global structure models in the 1st level and the local appearance models in the 2nd level (Fig. 2). The HDM is built from a semi-labeled image dataset, which includes both image data annotated by hand and data annotated by the H-DTR during interactive/incremental learning.
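For illustration, the detection pipeline described above can be sketched as follows. This is a minimal sketch, not the paper's implementation; `global_detector`, `local_detectors`, and the `hdm` outlier tests are hypothetical stand-ins for the components named in the text:

```python
from dataclasses import dataclass

@dataclass
class Detection:
    label: int          # facial feature label l
    point: tuple        # localized (x, y) position
    is_outlier: bool    # flagged by the HDM's local outlier test

def hybrid_detect(image, global_detector, local_detectors, hdm):
    """Global step constrains per-feature search areas; local detectors
    then refine each feature point inside its area."""
    structure = global_detector(image)            # coarse feature layout
    if hdm.is_global_outlier(structure):
        return None                               # global localization error
    results = []
    for label, detector in enumerate(local_detectors):
        area = structure.search_area(label)       # Procrustes-constrained area A(l)
        point = detector(image, area)             # e.g. k-NN over SIFT descriptors
        results.append(Detection(label, point,
                                 hdm.is_local_outlier(label, point)))
    return results
```

The key design point is that the local detectors never search the whole image: each one only sees the area its label was assigned by the global step.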
We construct the 1st generation HDM from a given labeled dataset (<1>, <2-1>, and <2-2> in Fig. 1) in the initial data-driven learning step. The 1st level of the HDM comprises clusters of the global model, which represent the global structures of the feature point distributions in terms of Hausdorff vectors; the 2nd level comprises clusters of the local SIFT features for each facial feature point. We then build the 1st generation H-DTR from the 1st generation HDM (<3-1> and <3-2>), and partial image data are randomly selected from the unlabeled image dataset (<4>). The 1st generation H-DTR produces localized feature point sets, which are merged with the labeled image dataset to form the semi-labeled dataset (<5>). Next, we perform the 2nd generation semi-supervised learning: the hierarchical soft K-means algorithm takes the image data from the semi-labeled dataset produced in the 1st generation (<1>) and regularizes and constructs the 1st level (<2-1>) and the 2nd level (<2-2>) of the 2nd generation HDM. The 2nd generation H-DTR is then built from the 2nd generation HDM (<3-1> and <3-2>), again randomly selects partial data from the unlabeled dataset, and so on, until a target performance is reached. Note that careful management of the semi-labeled image dataset and the regularized data models is essential for incrementally robust performance with generalization ability.
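The generational loop above can be sketched as follows; all function arguments (`build_hdm`, `build_detector`, `evaluate`) are hypothetical placeholders for the paper's components, and at least one generation is assumed:

```python
import random

def semi_supervised_train(labeled, unlabeled, build_hdm, build_detector,
                          sample_size, max_generations, target_score,
                          evaluate, rng=None):
    """One generation = build HDM -> build H-DTR -> machine-annotate a
    random unlabeled batch -> grow the semi-labeled dataset."""
    rng = rng or random.Random()
    semi_labeled = list(labeled)          # starts as the human-annotated data
    detector = None
    for _ in range(max_generations):
        hdm = build_hdm(semi_labeled)     # hierarchical soft K-means clustering
        detector = build_detector(hdm)    # H-DTR constrained by this HDM
        batch = rng.sample(unlabeled, min(sample_size, len(unlabeled)))
        # keep only machine annotations that pass the detector (None = rejected)
        semi_labeled += [(img, pts) for img, pts in
                         ((img, detector(img)) for img in batch)
                         if pts is not None]
        if evaluate(detector) >= target_score:
            break                         # target performance reached
    return detector, semi_labeled
```

Each generation's detector is discarded once it has annotated its batch; only the growing semi-labeled dataset carries information forward.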
3 Hierarchical Data Model
The proposed framework employs the HDM (Hierarchical Data Model), which is initially established using the labeled training dataset and then updated interactively by the H-DTR using randomly selected samples from the unlabeled dataset, until it converges or reaches a predefined iteration limit. The HDM consists of two levels of clusters generated by the soft K-means algorithm. In this paper, the global information represents the geometrical distribution of the feature points and the structural relationships between them, and the local information represents the appearances of individual feature points using SIFT descriptors (Fig. 2). We first build the 1st level of the HDM (the "global structure models") from the global information, and then the 2nd level (the "local appearance models") for each 1st-level cluster from the local information.
The global structure models are represented by the cluster centroids of the global Hausdorff vectors, which consist of the positions of all feature points and the Hausdorff distances between them. The local appearance models are represented by the cluster centroids of the SIFT vectors of the individual feature points.
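The soft K-means step underlying both levels can be sketched as a minimal generic implementation; this is not the outlier-aware variant of [9], and the deterministic initialization (first k points) is for brevity only. In the HDM, it would be applied first to the global Hausdorff vectors (1st level), then to the SIFT vectors within each 1st-level cluster (2nd level):

```python
import math

def soft_kmeans(points, k, beta=2.0, iters=50):
    """Minimal soft K-means: each point takes a soft responsibility for
    every centroid, and centroids are responsibility-weighted means.
    A real implementation would randomize the initialization."""
    cents = [tuple(p) for p in points[:k]]
    dim = len(points[0])
    for _ in range(iters):
        # E-step: responsibilities r_ij proportional to exp(-beta * ||x_i - c_j||^2)
        resp = []
        for p in points:
            w = [math.exp(-beta * sum((a - b) ** 2 for a, b in zip(p, c)))
                 for c in cents]
            s = sum(w) or 1.0
            resp.append([v / s for v in w])
        # M-step: centroids as responsibility-weighted means of all points
        for j in range(k):
            tot = sum(r[j] for r in resp) or 1.0
            cents[j] = tuple(sum(r[j] * p[d] for p, r in zip(points, resp)) / tot
                             for d in range(dim))
    return cents
```

The stiffness `beta` controls how soft the assignments are: large values approach hard K-means, small values let every centroid feel every point.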
4 Hybrid Detector
The H-DTR has a relatively simple but flexible architecture, so it can be applied efficiently in real-life environments. It consists of a global detector, which localizes the global structure of the feature points, and local detectors, which localize the individual feature points more precisely.
4.1 Outlier-Aware Hybrid Detector
Let \( \mathbf{X} = \{\mathbf{X}^{1}, \ldots, \mathbf{X}^{L}\} \) be a set of random variable spaces, where \( \mathbf{X}^{l} \) is the space of the facial feature point labeled \( l \in \{1, \ldots, L\} \). The localization of facial feature points is formalized as an \( L \)-class classification problem as follows. Let \( \mathrm{x} = [x, y]^{T} \) denote a facial feature point. If \( \mathrm{x} \) belongs to \( \mathbf{X}^{l} \), it is denoted by \( (\mathrm{x}, l) \) or \( \mathrm{x}^{l} \), meaning that the feature label \( l \) is assigned to the point \( \mathrm{x} \). Feature localization assigns a label \( l \) to the best feature point \( \mathrm{x} \) estimated by a classifier. Given a face image \( I \), the H-DTR finds the best facial feature vector \( \mathbf{X} = \{\mathrm{x}^{1}, \ldots, \mathrm{x}^{L}\} \) with \( \mathrm{x}^{l} \in \mathbf{X}^{l} \) using both the global and local detectors. The global detector is based on Procrustes analysis, constrained by the global structure models of the HDM, and determines the search areas of the facial feature points. The search area of feature label \( l \), denoted by \( \mathbf{A}(l) \), \( l \in \{1, \ldots, L\} \), is the area within which the best point for label \( l \) can be found with high probability; it is determined empirically, and details can be found in [12]. The local detector for a feature point performs a more precise localization using SIFT descriptors [2], constrained by the local appearance models of the HDM and local Hausdorff distances. Conditioned on the global detector's output, the local detectors are conditionally independent of each other. Given a face image \( I \) and a training dataset \( D \), and assuming that a prior distribution over \( \mathbf{X} \) exists, \( \mathbf{X} \) can be treated as a random variable in Bayesian statistics. The posterior distribution of \( \mathbf{X} \) is represented by:
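A plausible form of Eq. 1, consistent with the Bayesian formulation described above, is:

```latex
p(\mathbf{X} \mid I, D)
  \;=\; \frac{p(I \mid \mathbf{X}, D)\, p(\mathbf{X} \mid D)}{p(I \mid D)}
  \;\propto\; p(I \mid \mathbf{X}, D)\, p(\mathbf{X} \mid D)
  \tag{1}
```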
In this paper, the prior of the local detector for the feature point labeled \( l \) is given by the search area \( \mathbf{A}(l) \). Since the H-DTR is divided into the global detector and the local detectors, Eq. 1 is rewritten as
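Consistent with the split into global and local detectors, Eq. 2 plausibly factorizes as:

```latex
p(\mathbf{X}, \mathbf{X}_G \mid I, D)
  \;\propto\; p(I \mid \mathbf{X}, \mathbf{X}_G, D)\,
              p(\mathbf{X} \mid \mathbf{X}_G, D)\,
              p(\mathbf{X}_G \mid D)
  \tag{2}
```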
where \( \mathbf{X}_G \) and \( \mathbf{X} \) are the feature points localized in the global and local steps, respectively. The H-DTR searches for the optimal facial feature vectors \( \mathbf{X}_G \) and \( \mathbf{X} \) satisfying
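Eq. 3 then plausibly reads as the joint maximum a posteriori estimate:

```latex
(\mathbf{X}^{*}, \mathbf{X}_G^{*})
  \;=\; \operatorname*{arg\,max}_{\mathbf{X},\, \mathbf{X}_G}\;
        p(I \mid \mathbf{X}, \mathbf{X}_G, D)\,
        p(\mathbf{X} \mid \mathbf{X}_G, D)\,
        p(\mathbf{X}_G \mid D)
  \tag{3}
```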
where \( p(I \mid \mathbf{X}, \mathbf{X}_G, D) = \prod_{l=1}^{L} p(I \mid \mathrm{x}^{l}) \) and \( p(\mathbf{X} \mid \mathbf{X}_G, D) = \prod_{l=1}^{L} p(\mathrm{x}^{l}) \); since the global structure \( \mathbf{X}_G \) constrains the feature points through the search areas, the feature points \( \mathrm{x}^{l} \) can be treated as conditionally independent of each other. Note that \( p(I \mid \mathbf{X}_G, D) \) and \( p(I \mid \mathrm{x}^{l}, \mathrm{x}^{l} \in \mathbf{A}(l)) \) are the likelihood functions of \( \mathbf{X}_G \) and \( \mathrm{x}^{l} \in \mathbf{X} \) \( (l = 1, \ldots, L) \), and are estimated by the global and local detectors, respectively. Finally, we minimize the negative logarithm of the posterior rather than maximizing it (Eq. 3).
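Matching the three terms described below (global detector error, local detector error, regularization), Eq. 4 plausibly takes the form:

```latex
(\mathbf{X}^{*}, \mathbf{X}_G^{*})
  = \operatorname*{arg\,min}_{\mathbf{X},\, \mathbf{X}_G}
    \Big[ -\log p(I \mid \mathbf{X}_G, D)
          \;-\; \sum_{l=1}^{L} \log p\big(I \mid \mathrm{x}^{l},
                \mathrm{x}^{l} \in \mathbf{A}(l)\big)
          \;+\; R(\mathbf{X}_G, \mathrm{x}^{1}, \ldots, \mathrm{x}^{L}) \Big]
  \tag{4}
```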
where \( \{\breve{\mathrm{x}}^{l}\} = \{\mathrm{x}^{i}\}_{i=1}^{L} \setminus \{\mathrm{x}^{l}\} \) and \( R(\mathbf{X}_G, \mathrm{x}^{1}, \ldots, \mathrm{x}^{L}) = R(\mathbf{X}_G) + \sum_{i=1}^{L} R(\mathrm{x}^{i}) \). The first term is the global detector error, which encourages a good intermediate feature point localization based on Procrustes analysis; the second term is the local detector error for more precise localization based on the local appearance models and the local Hausdorff constraints with neighboring feature points. The third term is a regularization controlling global and local sparsity using the priors of the HDM, where \( R(\mathbf{X}_G) \) and \( R(\mathrm{x}^{i}) \) are the global and local regularizers, respectively.
4.2 Hierarchical Outlier Suppression
We carry out two types of outlier suppression: global and local. The global detector determines the global structure \( \mathbf{X}_G \) based on Procrustes analysis and tests whether \( \mathbf{X}_G \) is a global outlier. If \( \mathbf{X}_G \) is not an outlier, the global detector decides the search area of each feature point; otherwise, a global localization error is declared. Each local detector explores its search area to find the feature point with the highest probability. The localized feature point is then tested for being a local outlier. If it is not an outlier, the localization is reported as a success; otherwise, a local localization error is declared. The outlier constraints are measured against the global structure and local appearance data models in the HDM, and are used to prevent global and local outliers, respectively.
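Both tests can be sketched with a common distance-to-nearest-centroid rule; the Euclidean distance and the threshold `tau` are illustrative assumptions standing in for the paper's similarity measures:

```python
def nearest_model_distance(vec, models):
    """Distance from a descriptor to its best-matching HDM centroid."""
    return min(sum((a - b) ** 2 for a, b in zip(vec, m)) ** 0.5
               for m in models)

def is_outlier(vec, models, tau):
    """Suppress a candidate whose nearest centroid is farther than tau.
    The same rule applies at both levels: a global structure (Hausdorff)
    vector against the global structure models, and a SIFT vector
    against the local appearance models of its feature label."""
    return nearest_model_distance(vec, models) > tau
```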
4.3 Optimization
Problem (Eq. 4) is nonconvex, but it is convex with respect to each of the optimization variables \( \mathrm{x}^{1}, \ldots, \mathrm{x}^{L} \) and \( \mathbf{X}_G \). We develop a locally optimal algorithm based on the block-coordinate descent method [13], which minimizes (Eq. 4) iteratively with respect to each variable while the other variables are fixed. Algorithm 1 presents our optimization procedure. Points localized with maximum likelihood are not always correct facial feature points, nor are they always consistent with the other feature points. Many factors can make the detectors unstable, such as noise, pose variation, cluttered backgrounds, and illumination changes. In this context, Sect. 5 introduces a data-driven semi-supervised framework that learns incrementally from both labeled and unlabeled datasets, minimizing the effects of troublesome patterns and preventing outliers in cooperation with the HDM.
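A generic block-coordinate descent skeleton of this procedure is sketched below; `minimize_block` is a hypothetical callback standing in for the per-variable minimizations of Algorithm 1:

```python
def block_coordinate_descent(objective, blocks, minimize_block,
                             iters=20, tol=1e-8):
    """Cycle over the variable blocks (x^1, ..., x^L, X_G), minimizing
    the objective w.r.t. one block while the others stay fixed."""
    prev = objective(blocks)
    for _ in range(iters):
        for i in range(len(blocks)):
            blocks[i] = minimize_block(objective, blocks, i)
        cur = objective(blocks)
        if prev - cur < tol:      # no sufficient decrease: stop
            break
        prev = cur
    return blocks
```

Because each subproblem is convex in its own block, every sweep is guaranteed not to increase the objective, which is why this reaches a local optimum of the nonconvex joint problem.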
5 Data-Driven Semi-supervised Learning
In the proposed method, the semi-supervised learning steps iteratively construct the HDMs and H-DTRs using image data randomly selected from the unlabeled dataset. We initialize the 1st generation HDM and H-DTR from a labeled image dataset. Recall that the semi-labeled image dataset contains both the human-annotated data given initially and machine-annotated data, which may be erroneous, produced during the semi-supervised learning steps. The data-driven semi-supervised learning is controlled by the hierarchical soft K-means clustering algorithm. The machine-annotated data are produced by the current H-DTR and used to construct the next generation HDM.
5.1 Semi-Labeled Dataset Update
The newly labeled data are merged into the semi-labeled dataset to constitute the next semi-labeled dataset. The H-DTR finds the best-matching global structure model for \( \mathbf{X}_G \) based on the similarity of the global Hausdorff vectors. If no matching global structure satisfies the global outlier constraint, the sample is rejected. Once the global structure \( \mathbf{X}_G \) is determined, the local detector finds the best-matching local appearance model for each feature point \( \mathrm{x}^{l} \in \mathbf{A}(l) \). In the local detection phase, the current H-DTR performs local matching using k-NN regression. Algorithm 2 summarizes the update process of the semi-labeled dataset in the proposed learning framework.
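The acceptance logic of this update can be sketched as follows; `detect`, `global_ok`, and `local_ok` are hypothetical stand-ins for the H-DTR and the HDM outlier constraints:

```python
def update_semi_labeled(semi_labeled, unlabeled_batch, detect,
                        global_ok, local_ok):
    """Machine-annotated samples join the semi-labeled dataset only if
    they pass both the global and the local outlier constraints."""
    for img in unlabeled_batch:
        result = detect(img)                  # (structure, points) or None
        if result is None:
            continue
        structure, points = result
        if not global_ok(structure):
            continue                          # no matching global model: reject
        if all(local_ok(l, p) for l, p in enumerate(points)):
            semi_labeled.append((img, points))
    return semi_labeled
```

Rejected samples are simply dropped rather than corrected, which keeps wrong machine annotations from polluting the next generation's HDM.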
6 Experimental Results
The localization performance was evaluated from several points of view using the popular Bosphorus [14] and BioID [15] face datasets. Our method is compared with the state-of-the-art methods reported in [16]. The experiments were performed on an Intel Core(TM)2 Quad CPU Q8400 at 2.66 GHz, implemented in C++.
6.1 Performance Evaluation
Our localization method is compared with STASM V.4 [17] and 3-Level IMoFA [16]. We used 200 labeled and 500 unlabeled samples for the semi-supervised learning of the local detectors of our method, and 500 samples from Bosphorus for testing. The comparison results are given in Table 1. Our method shows better average localization accuracy and mean error than 3-Level IMoFA and STASM V.4.
In Fig. 3, the cumulative correct localization rates of the proposed method are compared with the other state-of-the-art methods reported by Dibeklioglu et al. [16]. The performance of our method is comparable to the other approaches [16], such as 3-Level IMoFA, Generative, Sliwiga, AAM, CLM, and BoRMaN.
Semi-supervised learning was carried out simultaneously with testing, using the unlabeled test samples, to ensure a fair comparison.
7 Conclusion
Most state-of-the-art facial feature detectors rely only on labeled training data, and thus have difficulty achieving robust performance when the variability of the images can hardly be predicted in advance. Instead of relying on a large amount of labeled training data, we employ unlabeled data, which can be gathered easily, to improve the generalization ability of feature localization. We presented an iterative algorithm for robust facial feature localization, in which the H-DTR improves the localization performance of facial feature points with incremental generalization ability.
References
Celiktutan, O., et al.: A comparative study of face landmarking techniques. EURASIP J. Image Video Process. (2013)
Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vision 60(2), 91–110 (2004)
Zhu, X., Ramanan, D.: Face detection, pose estimation and landmark estimation in the wild. In: IEEE Conference on Computer Vision and Pattern Recognition (2012)
Cristinacce, D., Cootes, T.: Feature detection and tracking with constrained local models. In: Proceedings BMVC, pp. 929–938 (2006)
Cristinacce, D., Cootes, T., Scott, I.: A multi-stage approach to facial feature detection. Proc. British Mach. Vis. Conf. 1, 231–240 (2004)
Belhumeur, P.N., Jacobs, D.W., Kriegman, D.J., Kumar, N.: Localizing parts of faces using a consensus of exemplars. In: CVPR (2011)
Blum, A., Mitchell, T.: Combining labeled and unlabeled data with co-training. In: Proceedings of the 11th Annual Conference on Computational Learning Theory, pp. 92–100 (1998)
Tong, Y., Liu, X., Wheeler, F.W., Tu, P.: Semi-supervised facial landmark annotation. Comput. Vis. Image Underst. (CVIU) 116(8), 922–935 (2012)
Forero, P.A.: Robust clustering using outlier-sparsity regularization. IEEE Trans. Sig. Process. 60(8), 4163–4177 (2012)
Viola, P., Jones, M.J.: Robust real-time face detection. Int. J. Comput. Vis. 57(2), 137–154 (2004)
Goodall, C.: Procrustes methods in the statistical analysis of shape. J. Roy. Statist. Soc. Ser. B 53(2), 285–339 (1991)
Hong, S., Khim, S., Lee, P.K.: Efficient face landmark localization using spatial-context adaboost algorithm. J. Vis. Commun. Image Represent. (2013)
Mareček, J., Richtárik, P., Takáč, M.: Distributed block coordinate descent for minimizing partially separable functions. arXiv preprint, math.OC (2014)
Savran, A., Alyüz, N., Dibeklioğlu, H., Çeliktutan, O., Gökberk, B., Sankur, B., Akarun, L.: Bosphorus database for 3D face analysis. In: Schouten, B., Juul, N.C., Drygajlo, A., Tistarelli, M. (eds.) BIOID 2008. LNCS, vol. 5372, pp. 47–56. Springer, Heidelberg (2008)
Dibeklioglu, H., Salah, A.A., Gevers, T.: A statistical method for 2-D facial landmarking. IEEE Trans. Image Process. 21(2), 844–858 (2012)
Milborrow, S., Nicolls, F.: Active shape models with SIFT descriptors and MARS. In: VISAPP (2014)
Copyright information
© 2015 Springer International Publishing Switzerland
Kim, Y.Y., Hong, S.J., Rhee, J.H., Nam, M.Y., Rhee, P.K. (2015). Robust Facial Feature Localization using Data-Driven Semi-supervised Learning Approach. In: Nalpantidis, L., Krüger, V., Eklundh, JO., Gasteratos, A. (eds) Computer Vision Systems. ICVS 2015. Lecture Notes in Computer Science(), vol 9163. Springer, Cham. https://doi.org/10.1007/978-3-319-20904-3_15