1 Introduction

Facial feature localization has a strong impact on face-image-based applications such as animation, expression recognition, and face registration [1]. Facial feature detectors can be categorized by the type of information they extract from face images. Local detectors usually employ texture descriptors and rely on a sliding-window search; commonly used descriptors are SIFT [2] and the histogram of oriented gradients (HOG) [3]. Global detectors focus on global models such as the geometrical distribution of, and structural relationships between, facial feature points. Facial features are characterized by shape distribution models [4] or by the probability distribution of a feature point conditioned on other locations [5].

Local feature detectors can achieve precise localization, but they are sensitive to even small amounts of noise and are prone to frequent false alarms. Global detectors can avoid such local noise sensitivity, but their localization accuracy generally falls short of real-world requirements. To resolve this dilemma, many facial feature detectors combine a local detector with a global detector. Zhu and Ramanan [3] represent patches using HOG features and employ a quadratic spring scheme as a global shape model. A RANSAC-like approach was proposed by Belhumeur et al. [6], where a Bayesian formulation allows local detectors to be integrated into a global structure. A robust detector with good generalization can be obtained using a large amount of correctly labeled data. However, in many real-world applications, collecting a large volume of well-labeled data is difficult because manual annotation is costly and error-prone. We propose an outlier-aware hybrid detector based on a data-driven semi-supervised learning approach. Semi-supervised learning can achieve generalization ability by using both labeled and unlabeled data [7]. Recently, Tong et al. [8] presented an automatic method to avoid labor-intensive and error-prone manual feature annotation: in their experiments, only a small portion of the faces was manually labeled, and the remaining images were annotated automatically. However, their approach is a pioneering study on automatic facial feature annotation for easily obtaining training data; it does not aim at robust facial feature localization with generalization ability in noisy and uncertain real-world environments.

The interactive data-driven semi-supervised learning framework consists of the hybrid detector (abbreviated H-DTR) and the hierarchical data model (abbreviated HDM). We explore adaptive outlier suppression in the H-DTR, outlier regularization of noisy or contaminated data in the HDM, and interactive updates of the H-DTR and the HDM for better generalization ability. The HDM is constructed with the hierarchical outlier-aware soft K-means clustering algorithm [9]. In Sect. 2, an overview of our approach is given. We discuss the HDM and the formulation of the proposed H-DTR in Sect. 3 and Sect. 4, respectively. In Sect. 5, the data-driven semi-supervised learning algorithm using the HDM is described. Section 6 demonstrates by experiments that the proposed method outperforms other state-of-the-art localization techniques. Finally, conclusions are given in Sect. 7.

2 Overview of the Approach

In general, better generalization performance can be obtained with more correctly labeled training data; however, accumulating a large amount of labeled data usually requires heavy manual labor and is often error-prone. In this section, we outline a data-driven semi-supervised approach that combines a hybrid detector with a hierarchical data model and can take advantage of both labeled and unlabeled data. The proposed method provides robust performance with generalization ability in real-life noisy environments, and it is fully automatic, requiring no human supervision beyond the initial annotation of a labeled dataset that is much smaller than the unlabeled dataset.

The novelty of this paper is the effective combination of the HDM and the H-DTR in an interactive manner (Fig. 1). The H-DTR consists of the global detector and the local detectors. Given an image, the global detector locates the face region and facial component regions using the Haar-like feature-based boosting method [10], and the detected face region is normalized in scale to reduce the effects of scale uncertainty. Facial feature distributions are initialized from pre-compiled positions. The global detector produces confidence search areas for the individual feature points, constrained by Procrustes analysis [11]. The local detectors localize the feature points with k-NN regression over SIFT descriptors [2]. Both the global and local detectors have outlier detection mechanisms that use the HDM to obtain robust performance. The HDM is a two-level cluster tree consisting of heterogeneous data models: the regularized global structure models in the 1st level and the local appearance models in the 2nd level (Fig. 2). The HDM is built from a semi-labeled image dataset, which includes both image data annotated by hand and data annotated by the H-DTR during the interactive/incremental learning.
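The following is a minimal Python sketch of one H-DTR localization pass as just described. It is not the authors' implementation: the detector components are passed in as callables because their concrete forms (Haar-like boosting [10], Procrustes-constrained search areas [11], SIFT-based k-NN regression [2], and the HDM outlier tests) are specified in Sects. 3 and 4, and all parameter names here are illustrative.

```python
# Minimal sketch of one H-DTR localization pass (not the authors' code).
# All callables are hypothetical stand-ins for the components in Sects. 3-4.
def hdtr_localize(image, detect_face, search_areas, localize_local,
                  is_global_outlier, is_local_outlier):
    face = detect_face(image)             # global step: face/component regions, scale-normalized
    if face is None:
        return None
    areas = search_areas(face)            # Procrustes-constrained search area A(l) per label l
    if is_global_outlier(areas):          # global outlier suppression against 1st-level HDM
        return None
    points = []
    for l, area in enumerate(areas):
        p = localize_local(face, l, area) # k-NN regression over SIFT descriptors within A(l)
        if is_local_outlier(l, p):        # local outlier suppression against 2nd-level HDM
            return None
        points.append(p)
    return points
```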

Fig. 1.
figure 1

The data-driven semi-supervised learning framework, in which robust localization performance with generalization ability is obtained through iterative construction of the HDM and the H-DTR using both the labeled and the unlabeled image datasets.

Fig. 2.
figure 2

An illustration of the HDM: \( \mathrm{H}^{j} \) denotes the j-th global structure model in the 1st level (\( j \in \{1,\ldots,J\} \)); \( \mathrm{S}^{(j,k,l)} \) indicates the k-th local appearance model of the l-th feature point label \( \mathrm{S}^{(j,l)} \) of the j-th global structure model \( \mathrm{H}^{j} \) in the 2nd level (\( l \in \{1,\ldots,L\} \)).

We construct the 1st-generation HDM from the given labeled dataset (<1>, <2-1>, and <2-2> in Fig. 1) in the initial data-driven learning step. The 1st level of the HDM contains clusters of the global model, which represent the global structures of feature point distributions in terms of Hausdorff vectors, and the 2nd level contains clusters of the local SIFT features for each facial feature point. We establish the 1st-generation H-DTR based on the 1st-generation HDM (<3-1> and <3-2>), and partial image data are randomly selected from the unlabeled image dataset (<4>). The 1st-generation H-DTR produces localized feature point sets, which are merged with the labeled image dataset to constitute the semi-labeled dataset (<5>). Next, we perform the 2nd-generation semi-supervised learning, where the hierarchical soft K-means algorithm takes the image data from the semi-labeled dataset produced in the 1st-generation learning (<1>) and regularizes and constructs the 1st level (<2-1>) and the 2nd level (<2-2>) of the 2nd-generation HDM. The 2nd-generation H-DTR is then built using the 2nd-generation HDM (<3-1> and <3-2>), randomly selects partial data from the unlabeled dataset, and so on until a target performance is reached. Note that careful management of the semi-labeled image dataset and the regularized data models is essential for obtaining incrementally robust performance with generalization ability.
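The generation-by-generation procedure described above can be summarized by the following loop. This is a sketch only: build_hdm, build_hdtr, and target_reached are placeholders for the constructions of Sects. 3-5, the returned detector is assumed to expose a localize method that yields None when an outlier is declared, and the batch size and generation limit are assumed parameters.

```python
# Sketch of the interactive learning loop of Fig. 1 (steps <1>-<5>).
import random

def interactive_learning(labeled, unlabeled, build_hdm, build_hdtr, target_reached,
                         batch_size=500, max_generations=10):
    semi_labeled = list(labeled)                      # <1> initial semi-labeled dataset
    hdtr = None
    for generation in range(max_generations):
        hdm = build_hdm(semi_labeled)                 # <2-1>/<2-2> hierarchical soft K-means
        hdtr = build_hdtr(hdm)                        # <3-1>/<3-2> next-generation H-DTR
        batch = random.sample(unlabeled, min(batch_size, len(unlabeled)))  # <4>
        for image in batch:
            annotation = hdtr.localize(image)         # machine annotation
            if annotation is not None:                # outliers are rejected (Sect. 4.2)
                semi_labeled.append((image, annotation))  # <5> merge into semi-labeled set
        if target_reached(hdtr):
            break
    return hdtr
```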

3 Hierarchical Data Model

The proposed framework employs the HDM (hierarchical data model), which is initially established from the labeled training dataset and then interactively updated by the H-DTR using randomly selected samples from the unlabeled dataset until it converges or a predefined iteration limit is reached. The HDM is a two-level cluster tree generated by the soft K-means algorithm. In this paper, the global information represents the geometrical distribution of feature points and the structural relationships between them, and the local information represents the appearances of individual feature points in terms of SIFT descriptors (Fig. 2). We first build the 1st level of the HDM (the “global structure models”) using the global information, and then the 2nd level (the “local appearance models”) for each 1st-level cluster using the local information.

The global structure models are represented by the cluster centroids of the global Hausdorff vectors, which consist of the positions of all feature points and the Hausdorff distances between them. The local appearance models are represented by the cluster centroids of the SIFT vectors of the individual feature points.
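A sketch of this two-level construction is given below, with several assumptions: the global Hausdorff vector is approximated by the point positions plus all pairwise point distances, soft_kmeans is a placeholder for the outlier-aware soft K-means of [9] (one possible form is sketched in Sect. 5), and each sample is assumed to expose its feature points and per-point SIFT descriptors.

```python
# Sketch of building the two-level HDM from a semi-labeled dataset.
# Samples are assumed to expose .points (L x 2) and .sift[l] (one SIFT
# descriptor per feature label l); soft_kmeans is a hypothetical helper.
import numpy as np
from itertools import combinations

def hausdorff_vector(points):
    """Positions of all feature points plus the distances between them."""
    dists = [np.linalg.norm(points[i] - points[j])
             for i, j in combinations(range(len(points)), 2)]
    return np.concatenate([points.ravel(), np.asarray(dists)])

def build_hdm(samples, J=8, K=4):
    # 1st level: J global structure models (centroids of global Hausdorff vectors).
    gvecs = np.stack([hausdorff_vector(s.points) for s in samples])
    global_centroids, assign = soft_kmeans(gvecs, J)
    # 2nd level: local appearance models (SIFT centroids) per global cluster
    # and per feature point label.
    L = samples[0].points.shape[0]
    local_centroids = {}
    for j in range(J):
        members = [s for s, a in zip(samples, assign) if a == j]
        if not members:
            continue
        for l in range(L):
            descs = np.stack([s.sift[l] for s in members])
            local_centroids[(j, l)], _ = soft_kmeans(descs, min(K, len(descs)))
    return global_centroids, local_centroids
```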

4 Hybrid Detector

The H-DTR is relatively simple, but its flexible architecture allows it to be applied efficiently in real-life environments. The H-DTR consists of a global detector, which localizes the global structure of the feature points, and local detectors, which perform more precise localization of the individual feature points.

4.1 Outlier-Aware Hybrid Detector

Let \( \mathbf{X} = \{\mathrm{X}^{1},\ldots,\mathrm{X}^{L}\} \) be a set of random variable spaces, where \( \mathrm{X}^{l} \) is the space of a facial feature point labeled by \( l \in \{1,\ldots,L\} \). The localization of facial feature points is formalized as a multiclass (\( L \)-class) classification problem as follows. Let \( \mathrm{x} \) (= [x, y]\(^{T}\)) denote a facial feature point. If \( \mathrm{x} \) belongs to \( \mathrm{X}^{l} \), then \( \mathrm{x} \) is denoted by (\( \mathrm{x} \), l) or \( \mathrm{x}^{l} \), meaning that the facial feature label l is assigned to the feature point \( \mathrm{x} \). Feature localization assigns a label l to the best feature point \( \mathrm{x} \) estimated by a classifier. Given a face image I, the H-DTR finds the best facial feature vector \( \mathbf{X} = \{\mathrm{x}^{1},\ldots,\mathrm{x}^{L}\} \) with \( \mathrm{x}^{l} \in \mathrm{X}^{l} \) using both the global and the local detectors. The global detector is based on Procrustes analysis, is constrained by the global structure models of the HDM, and decides the search areas of the facial feature points. The search area of a feature label l, denoted by \( \mathrm{A}(l) \), \( l \in \{1,\ldots,L\} \), is the area within which the best point of feature label l can be found with high probability; it is determined empirically, and the details can be found in [12]. The local detector for a feature point carries out a more precise localization using SIFT descriptors [2], constrained by the local appearance models of the HDM and by local Hausdorff distances. The global detector allows the local detectors to be treated as conditionally independent of each other. Given a face image I and a training dataset \( \mathrm{D} \), and assuming that a prior distribution over \( \mathbf{X} \) exists, \( \mathbf{X} \) can be treated as a random variable in Bayesian statistics. The posterior distribution of \( \mathbf{X} \) is given by

$$ p(\mathbf{X}|I,\mathrm{D}) = \frac{p(I|\mathbf{X},\mathrm{D})\,p(\mathbf{X}|\mathrm{D})}{\int p(I|\mathbf{X},\mathrm{D})\,p(\mathbf{X}|\mathrm{D})\,d\mathbf{X}} $$
(1)

In this paper, the prior of the local detector for the feature point labeled l is the search area \( \mathrm{A}(l) \). Since the H-DTR is divided into a global detector and local detectors, Eq. 1 can be rewritten as

$$ \begin{aligned} p(\mathbf{X}_{G},\mathbf{X}|I,\mathrm{D}) &= p(\mathbf{X}_{G}|I,\mathrm{D})\,p(\mathbf{X}|\mathbf{X}_{G},I,\mathrm{D}) \\ &= \frac{p(I|\mathbf{X}_{G},\mathrm{D})\,p(\mathbf{X}_{G}|\mathrm{D})}{\int p(I|\mathbf{X}_{G},\mathrm{D})\,p(\mathbf{X}_{G}|\mathrm{D})\,d\mathbf{X}_{G}} \cdot \frac{p(I|\mathbf{X},\mathbf{X}_{G},\mathrm{D})\,p(\mathbf{X}|\mathbf{X}_{G},\mathrm{D})}{\int p(I|\mathbf{X},\mathbf{X}_{G},\mathrm{D})\,p(\mathbf{X}|\mathbf{X}_{G},\mathrm{D})\,d\mathbf{X}}, \end{aligned} $$
(2)

where \( \mathbf{X}_{G} \) and \( \mathbf{X} \) are the feature points localized in the global and local steps, respectively. The H-DTR seeks optimal facial feature vectors \( \mathbf{X}_{G} \) and \( \mathbf{X} \) satisfying

$$ (\hat{\mathbf{X}}_{G},\hat{\mathbf{X}})_{\mathrm{MAP}}(I,\mathrm{D}) = \mathop{\arg\max}\limits_{\mathbf{X}_{G},\mathbf{X}} \Big[\, p(I|\mathbf{X}_{G},\mathrm{D})\,p(\mathbf{X}_{G}|\mathrm{D}) \prod_{l=1}^{L} \big\{ p(I|\mathrm{x}^{l},\mathrm{x}^{l}\in\mathrm{A}(l))\,p(\mathrm{x}^{l},\mathrm{x}^{l}\in\mathrm{A}(l)) \big\} \Big] $$
(3)

where \( p(I|\mathbf{X},\mathbf{X}_{G},\mathrm{D}) = \prod_{l=1}^{L} p(I|\mathrm{x}^{l}) \) and \( p(\mathbf{X}|\mathbf{X}_{G},\mathrm{D}) = \prod_{l=1}^{L} p(\mathrm{x}^{l}) \); since the global structure \( \mathbf{X}_{G} \) constrains the feature points through the search areas, the feature points \( \mathrm{x}^{l} \) can be treated as conditionally independent of each other. Note that \( p(I|\mathbf{X}_{G},\mathrm{D}) \) and \( p(I|\mathrm{x}^{l},\mathrm{x}^{l}\in\mathrm{A}(l)) \) are the likelihood functions of \( \mathbf{X}_{G} \) and of \( \mathrm{x}^{l}\in\mathbf{X} \) (\( l=1,\ldots,L \)), estimated by the global detector and the local detectors, respectively. Finally, we minimize the negative logarithm of the posterior rather than maximizing Eq. 3:

$$ \begin{aligned} (\hat{\mathbf{X}}_{G},\hat{\mathrm{x}}^{1},\ldots,\hat{\mathrm{x}}^{L}|I,\mathrm{D}) = \mathop{\arg\min}\limits_{\mathbf{X}_{G}\in\mathbf{X}} f(\mathbf{X}_{G};\mathrm{x}^{1},\ldots,\mathrm{x}^{L},I,\mathrm{D}) + \sum_{l=1}^{L} \mathop{\arg\min}\limits_{\mathrm{x}^{l}\in\mathrm{X}^{l}} f(\mathrm{x}^{l};\mathbf{X}_{G},\breve{\mathrm{x}}^{l},I,\mathrm{D}) + \mathrm{R}(\mathbf{X}_{G},\mathrm{x}^{1},\ldots,\mathrm{x}^{L}|I,\mathrm{D}) \end{aligned} $$
(4)

where \( \breve{\mathrm{x}}^{l} = \{\mathrm{x}^{i}\}_{i=1}^{L} \setminus \{\mathrm{x}^{l}\} \) and \( \mathrm{R}(\mathbf{X}_{G},\mathrm{x}^{1},\ldots,\mathrm{x}^{L}) = \mathrm{R}(\mathbf{X}_{G}) + \sum_{i=1}^{L} \mathrm{R}(\mathrm{x}^{i}) \). The first term is the global detector error, which encourages a good intermediate localization of the feature points based on Procrustes analysis, and the second term is the local detector error for more precise localization based on the local appearance models and the local Hausdorff constraints with neighboring feature points. The third term is the regularization that controls global and local sparsity using the priors of the HDM, where \( \mathrm{R}(\mathbf{X}_{G}) \) and \( \mathrm{R}(\mathrm{x}^{i}) \) are the global and local regularization terms, respectively.
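To make the relation between Eq. 3 and Eq. 4 explicit, taking the negative logarithm of the MAP objective turns the product of likelihoods and priors into a sum of terms; under our reading (an interpretation consistent with the text, not stated explicitly above) that each f is a negative log-likelihood and R collects the negative log-priors, the correspondence is

$$ \begin{aligned} -\log\Big[\, p(I|\mathbf{X}_{G},\mathrm{D})\,p(\mathbf{X}_{G}|\mathrm{D}) \prod_{l=1}^{L} p(I|\mathrm{x}^{l},\mathrm{x}^{l}\in\mathrm{A}(l))\,p(\mathrm{x}^{l},\mathrm{x}^{l}\in\mathrm{A}(l)) \Big] =\; & \underbrace{-\log p(I|\mathbf{X}_{G},\mathrm{D})}_{f(\mathbf{X}_{G};\,\cdot\,)} \;+\; \sum_{l=1}^{L}\underbrace{-\log p(I|\mathrm{x}^{l},\mathrm{x}^{l}\in\mathrm{A}(l))}_{f(\mathrm{x}^{l};\,\cdot\,)} \\ & \underbrace{-\log p(\mathbf{X}_{G}|\mathrm{D}) \;-\; \sum_{l=1}^{L}\log p(\mathrm{x}^{l},\mathrm{x}^{l}\in\mathrm{A}(l))}_{\mathrm{R}(\mathbf{X}_{G},\mathrm{x}^{1},\ldots,\mathrm{x}^{L}|I,\mathrm{D})}. \end{aligned} $$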

4.2 Hierarchical Outlier Suppression

We carry out two types of outlier suppression: global and local. The global detector determines the global structure \( \mathbf{X}_{G} \) based on Procrustes analysis and checks whether \( \mathbf{X}_{G} \) is a global outlier. If \( \mathbf{X}_{G} \) is not an outlier, the global detector decides the search area of each feature point; otherwise, a global localization error is declared. The local detector explores the search area to find the feature point with the highest probability. The localized feature point is then checked for being a local outlier. If it is not an outlier, the localization is reported as a success; otherwise, a local localization error is declared. The outlier constraints are measured against the global structure and local appearance data models of the HDM and are used to prevent global and local outliers, respectively.
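A sketch of these two tests is shown below; the use of a nearest-centroid distance as the outlier measure and the thresholds tau_g and tau_l are assumptions of this illustration, not the paper's exact constraints.

```python
# Sketch of the hierarchical outlier tests against the HDM (assumed measure:
# distance to the nearest cluster centroid, compared to assumed thresholds).
import numpy as np

def nearest_centroid_distance(vector, centroids):
    return min(np.linalg.norm(vector - c) for c in centroids)

def is_global_outlier(global_vector, global_centroids, tau_g):
    """X_G is declared an outlier if its Hausdorff vector is far from every
    1st-level global structure model."""
    return nearest_centroid_distance(global_vector, global_centroids) > tau_g

def is_local_outlier(sift_descriptor, local_centroids, tau_l):
    """A localized point is declared an outlier if its SIFT descriptor is far
    from every 2nd-level local appearance model of its feature label."""
    return nearest_centroid_distance(sift_descriptor, local_centroids) > tau_l
```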

4.3 Optimization

The minimization problem in Eq. 4 is nonconvex, but it is convex with respect to each of the optimization variables \( \mathrm{x}^{1},\ldots,\mathrm{x}^{L} \) and \( \mathbf{X}_{G} \). We develop a locally optimal algorithm based on the block-coordinate descent method [13], which minimizes Eq. 4 iteratively with respect to each variable while the other variables are fixed. Algorithm 1 summarizes our optimization procedure. Localized points with the maximum likelihood are not always correct facial feature points and may not be consistent with the other feature points. Many factors can make the detectors unstable, such as noise, pose variations, cluttered backgrounds, and illumination changes. In this context, we introduce a data-driven semi-supervised framework that learns incrementally from both labeled and unlabeled datasets to minimize the effects of troublesome patterns and to prevent outliers in cooperation with the HDM, as described in Sect. 5.
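The alternating structure of Algorithm 1 can be sketched as follows; argmin_global, argmin_local, and objective stand in for the detector error and regularization terms of Eq. 4, and the iteration limit and tolerance are assumed parameters.

```python
# Sketch of the block-coordinate descent over Eq. 4 (Algorithm 1 is the
# authoritative procedure; the callables here are hypothetical stand-ins).
def block_coordinate_descent(image, hdm, init_global, init_points,
                             argmin_global, argmin_local, objective,
                             max_iter=20, tol=1e-3):
    X_g, x = init_global, list(init_points)
    prev = float("inf")
    for _ in range(max_iter):
        # Global block: minimize f(X_G; x^1..x^L, I, D) + R(X_G) with x fixed.
        X_g = argmin_global(image, x, hdm)
        # Local blocks: minimize f(x^l; X_G, x\{x^l}, I, D) + R(x^l) one at a time.
        for l in range(len(x)):
            others = x[:l] + x[l + 1:]
            x[l] = argmin_local(image, l, X_g, others, hdm)
        cost = objective(image, X_g, x, hdm)
        if prev - cost < tol:           # stop when Eq. 4 stops decreasing
            break
        prev = cost
    return X_g, x
```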

5 Data-Driven Semi-supervised Learning

In the proposed method, the semi-supervised learning steps iteratively construct the HDMs and H-DTRs using image data randomly selected from the unlabeled dataset. We initialize the 1st-generation HDM and H-DTR using the labeled image dataset. Recall that the semi-labeled image dataset contains both the human-annotated data given initially and the machine-annotated data, which may contain errors, produced during the semi-supervised learning steps. The data-driven semi-supervised learning is controlled by the hierarchical soft K-means clustering algorithm. The machine-annotated data are produced by the current H-DTR and are used to construct the next-generation HDM.
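For concreteness, a plain soft K-means step with a crude responsibility-based outlier cutoff is sketched below; the clustering actually used is the hierarchical outlier-aware soft K-means of [9], and the stiffness beta, the cutoff, and the random initialization are assumptions of this sketch.

```python
# Plain soft K-means with an ad-hoc outlier cutoff (an approximation of the
# outlier-aware soft K-means of [9]; beta, cutoff, and seeding are assumptions).
import numpy as np

def soft_kmeans(data, k, beta=1.0, cutoff=1e-3, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(len(data), size=k, replace=False)].astype(float)
    resp = np.zeros((len(data), k))
    for _ in range(n_iter):
        # E-step: affinities from negative squared distances to centroids.
        d2 = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        affinity = np.exp(-beta * d2)
        # Crude outlier handling: samples far from every centroid get zero weight.
        weights = (affinity.max(axis=1) > cutoff).astype(float)[:, None]
        resp = affinity / (affinity.sum(axis=1, keepdims=True) + 1e-12)
        # M-step: responsibility-weighted centroid update, skipping outliers.
        w = resp * weights
        centroids = (w.T @ data) / (w.sum(axis=0)[:, None] + 1e-12)
    return centroids, resp.argmax(axis=1)
```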

5.1 Semi-Labeled Dataset Update

The newly labeled data are merged into the semi-labeled dataset and constitute the next semi-labeled dataset. The H-DTR finds the best matching global structure model for \( \mathbf{X}_{G} \) based on a similarity measure over global Hausdorff vectors. If no matching global structure satisfies the global outlier constraint, the sample is rejected. Once the global structure \( \mathbf{X}_{G} \) is determined, the local detector finds the best matching local appearance model for each feature point \( \mathrm{x}^{l}\in\mathrm{A}(l) \). In the local detection phase, the current H-DTR performs local matching with k-NN regression. Algorithm 2 summarizes the update process of the semi-labeled dataset in the proposed learning framework.
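A sketch of this update is given below; Algorithm 2 in the paper is the authoritative procedure. The matching and rejection steps here simply reuse the hypothetical HDM outlier checks sketched in Sect. 4.2, and hdtr.localize is assumed to return the localized points, their SIFT descriptors, and the index of the best-matched global structure model.

```python
# Sketch of the semi-labeled dataset update (not the paper's Algorithm 2).
def update_semi_labeled(semi_labeled, unlabeled_batch, hdtr, hdm, tau_g, tau_l):
    for image in unlabeled_batch:
        result = hdtr.localize(image)
        if result is None:
            continue                                     # detector already declared an error
        # Global check: the localized structure must match some 1st-level model.
        if is_global_outlier(hausdorff_vector(result.points),
                             hdm.global_centroids, tau_g):
            continue
        # Local check: every feature point must match some 2nd-level model.
        j = result.global_model_index                    # best-matched global cluster
        ok = all(not is_local_outlier(result.sift[l],
                                      hdm.local_centroids[(j, l)], tau_l)
                 for l in range(len(result.points)))
        if ok:
            semi_labeled.append((image, result.points))  # merge machine-annotated sample
    return semi_labeled
```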

6 Experimental Results

The localization performance was evaluated from several points of view using popular face datasets, namely Bosphorus [14] and BioID [15]. Our method is compared with the state-of-the-art technologies reported in [16]. The experiments were performed in C++ on an Intel Core(TM)2 Quad CPU Q8400 at 2.66 GHz.

6.1 Performance Evaluation

Our localization method is compared with STASM V.4 [17] and 3-Level IMoFA [16]. We used 200 labeled and 500 unlabeled samples for the semi-supervised learning of the local detectors of our method, and 500 samples from Bosphorus were used for testing. The comparison results are given in Table 1. Our method achieves better average localization accuracy and lower mean errors than 3-Level IMoFA and STASM V.4.

Table 1. The localization accuracies of our method compared with those of STASM V.4 and 3-Level IMoFA on Bosphorus. Facial feature points: OEC (outer eye corners), IEC (inner eye corners), NT (nose tip), MC (mouth corners), OE (outer eyebrows), IE (inner eyebrows), PC (pupil centers), NS (nose saddles), and LOM (lip center of mouth).

In Fig. 3, the cumulative correct localization rate of the proposed method is compared with those of the state-of-the-art methods reported by Dibeklioglu et al. [16]. The performance of our method is comparable to that of other approaches [16] such as 3-Level IMoFA, Generative, Sliwiga, AAM, CLM, and BorMaN.

Fig. 3.
figure 3

Cumulative correct localization rate with respect to the overall feature localization error \( m_{e} \) for our method and other state-of-the-art technologies [16] on the BioID dataset.

For a fair comparison, semi-supervised learning is carried out simultaneously with testing using the unlabeled test samples.

7 Conclusion

Most state-of-the-art facial feature detectors rely only on labeled training data and thus have difficulty achieving robust performance when the variability of the images can hardly be predicted in advance. Instead of relying on a large amount of labeled training data, we employ unlabeled data, which can be gathered easily, to improve the generalization ability of feature localization. We presented an iterative algorithm for robust facial feature localization, in which the H-DTR improves the localization of facial feature points with incremental generalization ability.