1 Introduction

Facial feature localization has a strong impact on face-image-based applications such as animation, expression recognition, and face registration [1]. Facial feature detectors can be categorized by the type of information they extract from face images. Local detectors usually employ texture descriptors and rely on a sliding-window search; commonly used descriptors are SIFT [2] and the histogram of oriented gradients (HOG) [3]. Global detectors focus on global models such as the geometrical distribution of, and structural relationships between, facial feature points. Facial features are characterized by shape distribution models [4] or by the probability distribution of a feature point conditioned on other locations [5].

Local feature detectors can achieve precise localization, but they are sensitive to even small amounts of noise and are prone to frequent false alarms. Global detectors can avoid such local noise sensitivity, but their localization accuracy generally falls short of real-world requirements. To resolve this dilemma, many facial feature detectors combine a local detector with a global detector. Zhu and Ramanan [3] represent patches using HOG features and employ a quadratic spring scheme as a global shape model. A RANSAC-like approach was proposed by Belhumeur et al. [6], where a Bayesian formulation allows local detectors to be integrated into a global structure. A robust detector with good generalization can be obtained using a large amount of correctly labeled data. However, in many real-world applications, collecting a large volume of well-labeled data is difficult because manual annotation is costly and error-prone. We propose an outlier-aware hybrid detector based on a data-driven semi-supervised learning approach. Semi-supervised learning can achieve generalization ability by using both labeled and unlabeled data [7]. Recently, Tong et al. [8] presented an automatic method to avoid labor-intensive and error-prone manual feature annotation: in their experiments, only a small portion of the faces was manually labeled, and the remaining images were annotated automatically. However, their approach is a pioneering study on automatic facial feature annotation for easily obtaining training data; it does not aim at robust facial feature localization with generalization ability in noisy and uncertain real-world environments.

The interactive data-driven semi-supervised learning framework consists of the hybrid detector (abbreviated H-DTR) and the hierarchical data model (abbreviated HDM). We explore adaptive outlier suppression in the H-DTR, outlier regularization of noisy or contaminated data in the HDM, and interactive updates of the H-DTR and the HDM for better generalization ability. The HDM is constructed with the hierarchical outlier-aware soft K-means clustering algorithm [9]. In Sect. 2, an overview of our approach is given. We discuss the HDM and the formulation of the proposed H-DTR in Sect. 3 and Sect. 4, respectively. In Sect. 5, the data-driven semi-supervised learning algorithm using the HDM is described. Section 6 demonstrates by experiments that the proposed method outperforms other state-of-the-art localization techniques. Finally, conclusions are given in Sect. 7.

2 Overview of the Approach

In general, better generalization performance can be obtained with more correctly labeled training data; however, accumulating a large amount of labeled data usually requires heavy manual labor and is often error-prone. In this section, we outline a data-driven semi-supervised approach that combines a hybrid detector with a hierarchical data model and can take advantage of both labeled and unlabeled data. The proposed method provides robust performance with generalization ability in real-life noisy environments, and it is fully automatic, requiring no human supervision beyond the initial annotation of a labeled dataset that is much smaller than the unlabeled dataset.

The novelty of this paper is the effective combination of the HDM and the H-DTR in an interactive manner (Fig. 1). The H-DTR consists of the global detector and the local detectors. Given an image, the global detector locates the face region and facial component regions using the Haar-like feature-based boosting method [10], and the detected face region is normalized in scale to reduce the effects of scale uncertainty. Facial feature distributions are initialized from pre-compiled positions. The global detector produces confidence search areas for the individual feature points, constrained by Procrustes analysis [11]. The local detectors localize the feature points with k-NN regression over SIFT descriptors [2]. Both the global and local detectors have outlier detection mechanisms that use the HDM to obtain robust performance. The HDM is a two-level cluster tree consisting of heterogeneous data models: the regularized global structure models in the 1st level and the local appearance models in the 2nd level (Fig. 2). The HDM is built from a semi-labeled image dataset, which includes both image data annotated by hand and data annotated by the H-DTR during the interactive/incremental learning.
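The following is a minimal Python sketch of one H-DTR localization pass as just described. It is not the authors' implementation: the detector components are passed in as callables because their concrete forms (Haar-like boosting [10], Procrustes-constrained search areas [11], SIFT-based k-NN regression [2], and the HDM outlier tests) are specified in Sects. 3 and 4, and all parameter names here are illustrative.

```python
# Minimal sketch of one H-DTR localization pass (not the authors' code).
# All callables are hypothetical stand-ins for the components in Sects. 3-4.
def hdtr_localize(image, detect_face, search_areas, localize_local,
                  is_global_outlier, is_local_outlier):
    face = detect_face(image)             # global step: face/component regions, scale-normalized
    if face is None:
        return None
    areas = search_areas(face)            # Procrustes-constrained search area A(l) per label l
    if is_global_outlier(areas):          # global outlier suppression against 1st-level HDM
        return None
    points = []
    for l, area in enumerate(areas):
        p = localize_local(face, l, area) # k-NN regression over SIFT descriptors within A(l)
        if is_local_outlier(l, p):        # local outlier suppression against 2nd-level HDM
            return None
        points.append(p)
    return points
```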

Fig. 1.
figure 1

The data-driven semi-supervised learning framework, in which robust localization performance with generalization ability is obtained through iterative construction of the HDM and the H-DTR using both the labeled and the unlabeled image datasets.

Fig. 2.
figure 2

An illustration of the HDM: \( \mathrm{H}^{j} \) denotes the j-th global structure model in the 1st level (\( j \in \{1,\ldots,J\} \)); \( \mathrm{S}^{(j,k,l)} \) indicates the k-th local appearance model of the l-th feature point label \( \mathrm{S}^{(j,l)} \) of the j-th global structure model \( \mathrm{H}^{j} \) in the 2nd level (\( l \in \{1,\ldots,L\} \)).

We construct the 1st-generation HDM from the given labeled dataset (<1>, <2-1>, and <2-2> in Fig. 1) in the initial data-driven learning step. The 1st level of the HDM contains clusters of the global model, which represent the global structures of feature point distributions in terms of Hausdorff vectors, and the 2nd level contains clusters of the local SIFT features for each facial feature point. We establish the 1st-generation H-DTR based on the 1st-generation HDM (<3-1> and <3-2>), and partial image data are randomly selected from the unlabeled image dataset (<4>). The 1st-generation H-DTR produces localized feature point sets, which are merged with the labeled image dataset to constitute the semi-labeled dataset (<5>). Next, we perform the 2nd-generation semi-supervised learning, where the hierarchical soft K-means algorithm takes the image data from the semi-labeled dataset produced in the 1st-generation learning (<1>) and regularizes and constructs the 1st level (<2-1>) and the 2nd level (<2-2>) of the 2nd-generation HDM. The 2nd-generation H-DTR is then built using the 2nd-generation HDM (<3-1> and <3-2>), randomly selects partial data from the unlabeled dataset, and so on until a target performance is reached. Note that careful management of the semi-labeled image dataset and the regularized data models is essential for obtaining incrementally robust performance with generalization ability.
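The generation-by-generation procedure described above can be summarized by the following loop. This is a sketch only: build_hdm, build_hdtr, and target_reached are placeholders for the constructions of Sects. 3-5, the returned detector is assumed to expose a localize method that yields None when an outlier is declared, and the batch size and generation limit are assumed parameters.

```python
# Sketch of the interactive learning loop of Fig. 1 (steps <1>-<5>).
import random

def interactive_learning(labeled, unlabeled, build_hdm, build_hdtr, target_reached,
                         batch_size=500, max_generations=10):
    semi_labeled = list(labeled)                      # <1> initial semi-labeled dataset
    hdtr = None
    for generation in range(max_generations):
        hdm = build_hdm(semi_labeled)                 # <2-1>/<2-2> hierarchical soft K-means
        hdtr = build_hdtr(hdm)                        # <3-1>/<3-2> next-generation H-DTR
        batch = random.sample(unlabeled, min(batch_size, len(unlabeled)))  # <4>
        for image in batch:
            annotation = hdtr.localize(image)         # machine annotation
            if annotation is not None:                # outliers are rejected (Sect. 4.2)
                semi_labeled.append((image, annotation))  # <5> merge into semi-labeled set
        if target_reached(hdtr):
            break
    return hdtr
```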

3 Hierarchical Data Model

The proposed framework employs the HDM (hierarchical data model), which is initially established from the labeled training dataset and then interactively updated by the H-DTR using randomly selected samples from the unlabeled dataset until it converges or a predefined iteration limit is reached. The HDM is a two-level cluster tree generated by the soft K-means algorithm. In this paper, the global information represents the geometrical distribution of feature points and the structural relationships between them, and the local information represents the appearances of individual feature points in terms of SIFT descriptors (Fig. 2). We first build the 1st level of the HDM (the “global structure models”) using the global information, and then the 2nd level (the “local appearance models”) for each 1st-level cluster using the local information.

The global structure models are represented by the cluster centroids of the global Hausdorff vectors, which consist of the positions of all feature points and the Hausdorff distances between them. The local appearance models are represented by the cluster centroids of the SIFT vectors of the individual feature points.
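A sketch of this two-level construction is given below, with several assumptions: the global Hausdorff vector is approximated by the point positions plus all pairwise point distances, soft_kmeans is a placeholder for the outlier-aware soft K-means of [9] (one possible form is sketched in Sect. 5), and each sample is assumed to expose its feature points and per-point SIFT descriptors.

```python
# Sketch of building the two-level HDM from a semi-labeled dataset.
# Samples are assumed to expose .points (L x 2) and .sift[l] (one SIFT
# descriptor per feature label l); soft_kmeans is a hypothetical helper.
import numpy as np
from itertools import combinations

def hausdorff_vector(points):
    """Positions of all feature points plus the distances between them."""
    dists = [np.linalg.norm(points[i] - points[j])
             for i, j in combinations(range(len(points)), 2)]
    return np.concatenate([points.ravel(), np.asarray(dists)])

def build_hdm(samples, J=8, K=4):
    # 1st level: J global structure models (centroids of global Hausdorff vectors).
    gvecs = np.stack([hausdorff_vector(s.points) for s in samples])
    global_centroids, assign = soft_kmeans(gvecs, J)
    # 2nd level: local appearance models (SIFT centroids) per global cluster
    # and per feature point label.
    L = samples[0].points.shape[0]
    local_centroids = {}
    for j in range(J):
        members = [s for s, a in zip(samples, assign) if a == j]
        if not members:
            continue
        for l in range(L):
            descs = np.stack([s.sift[l] for s in members])
            local_centroids[(j, l)], _ = soft_kmeans(descs, min(K, len(descs)))
    return global_centroids, local_centroids
```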

4 Hybrid Detector

The H-DTR is relatively simple, but its flexible architecture allows it to be applied efficiently in real-life environments. The H-DTR consists of a global detector, which localizes the global structure of the feature points, and local detectors, which perform more precise localization of the individual feature points.

4.1 Outlier-Aware Hybrid Detector

Let \( \mathbf{X} = \{\mathrm{X}^{1},\ldots,\mathrm{X}^{L}\} \) be a set of random variable spaces, where \( \mathrm{X}^{l} \) is the space of a facial feature point labeled by \( l \in \{1,\ldots,L\} \). The localization of facial feature points is formalized as a multiclass (\( L \)-class) classification problem as follows. Let \( \mathrm{x} \) (= [x, y]\(^{T}\)) denote a facial feature point. If \( \mathrm{x} \) belongs to \( \mathrm{X}^{l} \), then \( \mathrm{x} \) is denoted by (\( \mathrm{x} \), l) or \( \mathrm{x}^{l} \), meaning that the facial feature label l is assigned to the feature point \( \mathrm{x} \). Feature localization assigns a label l to the best feature point \( \mathrm{x} \) estimated by a classifier. Given a face image I, the H-DTR finds the best facial feature vector \( \mathbf{X} = \{\mathrm{x}^{1},\ldots,\mathrm{x}^{L}\} \) with \( \mathrm{x}^{l} \in \mathrm{X}^{l} \) using both the global and the local detectors. The global detector is based on Procrustes analysis, is constrained by the global structure models of the HDM, and decides the search areas of the facial feature points. The search area of a feature label l, denoted by \( \mathrm{A}(l) \), \( l \in \{1,\ldots,L\} \), is the area within which the best point of feature label l can be found with high probability; it is determined empirically, and the details can be found in [12]. The local detector for a feature point carries out a more precise localization using SIFT descriptors [2], constrained by the local appearance models of the HDM and by local Hausdorff distances. The global detector allows the local detectors to be treated as conditionally independent of each other. Given a face image I and a training dataset \( \mathrm{D} \), and assuming that a prior distribution over \( \mathbf{X} \) exists, \( \mathbf{X} \) can be treated as a random variable in Bayesian statistics. The posterior distribution of \( \mathbf{X} \) is given by

$$ p(\mathbf{X}|I,\mathrm{D}) = \frac{p(I|\mathbf{X},\mathrm{D})\,p(\mathbf{X}|\mathrm{D})}{\int p(I|\mathbf{X},\mathrm{D})\,p(\mathbf{X}|\mathrm{D})\,d\mathbf{X}} $$
(1)

In this paper, the prior of the local detector for the feature point labeled l is the search area \( \mathrm{A}(l) \). Since the H-DTR is divided into a global detector and local detectors, Eq. 1 can be rewritten as

$$ \begin{aligned} p(\mathbf{X}_{G},\mathbf{X}|I,\mathrm{D}) &= p(\mathbf{X}_{G}|I,\mathrm{D})\,p(\mathbf{X}|\mathbf{X}_{G},I,\mathrm{D}) \\ &= \frac{p(I|\mathbf{X}_{G},\mathrm{D})\,p(\mathbf{X}_{G}|\mathrm{D})}{\int p(I|\mathbf{X}_{G},\mathrm{D})\,p(\mathbf{X}_{G}|\mathrm{D})\,d\mathbf{X}_{G}} \cdot \frac{p(I|\mathbf{X},\mathbf{X}_{G},\mathrm{D})\,p(\mathbf{X}|\mathbf{X}_{G},\mathrm{D})}{\int p(I|\mathbf{X},\mathbf{X}_{G},\mathrm{D})\,p(\mathbf{X}|\mathbf{X}_{G},\mathrm{D})\,d\mathbf{X}}, \end{aligned} $$
(2)

where \( \mathbf{X}_{G} \) and \( \mathbf{X} \) are the feature points localized in the global and local steps, respectively. The H-DTR seeks optimal facial feature vectors \( \mathbf{X}_{G} \) and \( \mathbf{X} \) satisfying

$$ (\hat{\mathbf{X}}_{G},\hat{\mathbf{X}})_{\mathrm{MAP}}(I,\mathrm{D}) = \mathop{\arg\max}\limits_{\mathbf{X}_{G},\mathbf{X}} \Big[\, p(I|\mathbf{X}_{G},\mathrm{D})\,p(\mathbf{X}_{G}|\mathrm{D}) \prod_{l=1}^{L} \big\{ p(I|\mathrm{x}^{l},\mathrm{x}^{l}\in\mathrm{A}(l))\,p(\mathrm{x}^{l},\mathrm{x}^{l}\in\mathrm{A}(l)) \big\} \Big] $$
(3)

where \( p(I|\mathbf{X},\mathbf{X}_{G},\mathrm{D}) = \prod_{l=1}^{L} p(I|\mathrm{x}^{l}) \) and \( p(\mathbf{X}|\mathbf{X}_{G},\mathrm{D}) = \prod_{l=1}^{L} p(\mathrm{x}^{l}) \); since the global structure \( \mathbf{X}_{G} \) constrains the feature points through the search areas, the feature points \( \mathrm{x}^{l} \) can be treated as conditionally independent of each other. Note that \( p(I|\mathbf{X}_{G},\mathrm{D}) \) and \( p(I|\mathrm{x}^{l},\mathrm{x}^{l}\in\mathrm{A}(l)) \) are the likelihood functions of \( \mathbf{X}_{G} \) and of \( \mathrm{x}^{l}\in\mathbf{X} \) (\( l=1,\ldots,L \)), estimated by the global detector and the local detectors, respectively. Finally, we minimize the negative logarithm of the posterior rather than maximizing Eq. 3:

$$ \begin{aligned} (\hat{\mathbf{X}}_{G},\hat{\mathrm{x}}^{1},\ldots,\hat{\mathrm{x}}^{L}|I,\mathrm{D}) = \mathop{\arg\min}\limits_{\mathbf{X}_{G}\in\mathbf{X}} f(\mathbf{X}_{G};\mathrm{x}^{1},\ldots,\mathrm{x}^{L},I,\mathrm{D}) + \sum_{l=1}^{L} \mathop{\arg\min}\limits_{\mathrm{x}^{l}\in\mathrm{X}^{l}} f(\mathrm{x}^{l};\mathbf{X}_{G},\breve{\mathrm{x}}^{l},I,\mathrm{D}) + \mathrm{R}(\mathbf{X}_{G},\mathrm{x}^{1},\ldots,\mathrm{x}^{L}|I,\mathrm{D}) \end{aligned} $$
(4)

where \( \breve{\mathrm{x}}^{l} = \{\mathrm{x}^{i}\}_{i=1}^{L} \setminus \{\mathrm{x}^{l}\} \) and \( \mathrm{R}(\mathbf{X}_{G},\mathrm{x}^{1},\ldots,\mathrm{x}^{L}) = \mathrm{R}(\mathbf{X}_{G}) + \sum_{i=1}^{L} \mathrm{R}(\mathrm{x}^{i}) \). The first term is the global detector error, which encourages a good intermediate localization of the feature points based on Procrustes analysis, and the second term is the local detector error for more precise localization based on the local appearance models and the local Hausdorff constraints with neighboring feature points. The third term is the regularization that controls global and local sparsity using the priors of the HDM, where \( \mathrm{R}(\mathbf{X}_{G}) \) and \( \mathrm{R}(\mathrm{x}^{i}) \) are the global and local regularization terms, respectively.
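To make the relation between Eq. 3 and Eq. 4 explicit, taking the negative logarithm of the MAP objective turns the product of likelihoods and priors into a sum of terms; under our reading (an interpretation consistent with the text, not stated explicitly above) that each f is a negative log-likelihood and R collects the negative log-priors, the correspondence is

$$ \begin{aligned} -\log\Big[\, p(I|\mathbf{X}_{G},\mathrm{D})\,p(\mathbf{X}_{G}|\mathrm{D}) \prod_{l=1}^{L} p(I|\mathrm{x}^{l},\mathrm{x}^{l}\in\mathrm{A}(l))\,p(\mathrm{x}^{l},\mathrm{x}^{l}\in\mathrm{A}(l)) \Big] =\; & \underbrace{-\log p(I|\mathbf{X}_{G},\mathrm{D})}_{f(\mathbf{X}_{G};\,\cdot\,)} \;+\; \sum_{l=1}^{L}\underbrace{-\log p(I|\mathrm{x}^{l},\mathrm{x}^{l}\in\mathrm{A}(l))}_{f(\mathrm{x}^{l};\,\cdot\,)} \\ & \underbrace{-\log p(\mathbf{X}_{G}|\mathrm{D}) \;-\; \sum_{l=1}^{L}\log p(\mathrm{x}^{l},\mathrm{x}^{l}\in\mathrm{A}(l))}_{\mathrm{R}(\mathbf{X}_{G},\mathrm{x}^{1},\ldots,\mathrm{x}^{L}|I,\mathrm{D})}. \end{aligned} $$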

4.2 Hierarchical Outlier Suppression

We carry out two types of outlier suppression: global and local. The global detector determines the global structure \( \mathbf{X}_{G} \) based on Procrustes analysis and checks whether \( \mathbf{X}_{G} \) is a global outlier. If \( \mathbf{X}_{G} \) is not an outlier, the global detector decides the search area of each feature point; otherwise, a global localization error is declared. The local detector explores the search area to find the feature point with the highest probability. The localized feature point is then checked for being a local outlier. If it is not an outlier, the localization is reported as a success; otherwise, a local localization error is declared. The outlier constraints are measured against the global structure and local appearance data models of the HDM and are used to prevent global and local outliers, respectively.
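A sketch of these two tests is shown below; the use of a nearest-centroid distance as the outlier measure and the thresholds tau_g and tau_l are assumptions of this illustration, not the paper's exact constraints.

```python
# Sketch of the hierarchical outlier tests against the HDM (assumed measure:
# distance to the nearest cluster centroid, compared to assumed thresholds).
import numpy as np

def nearest_centroid_distance(vector, centroids):
    return min(np.linalg.norm(vector - c) for c in centroids)

def is_global_outlier(global_vector, global_centroids, tau_g):
    """X_G is declared an outlier if its Hausdorff vector is far from every
    1st-level global structure model."""
    return nearest_centroid_distance(global_vector, global_centroids) > tau_g

def is_local_outlier(sift_descriptor, local_centroids, tau_l):
    """A localized point is declared an outlier if its SIFT descriptor is far
    from every 2nd-level local appearance model of its feature label."""
    return nearest_centroid_distance(sift_descriptor, local_centroids) > tau_l
```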

4.3 Optimization

The minimization problem in Eq. 4 is nonconvex, but it is convex with respect to each of the optimization variables \( \mathrm{x}^{1},\ldots,\mathrm{x}^{L} \) and \( \mathbf{X}_{G} \). We develop a locally optimal algorithm based on the block-coordinate descent method [13], which minimizes Eq. 4 iteratively with respect to each variable while the other variables are fixed. Algorithm 1 summarizes our optimization procedure. Localized points with the maximum likelihood are not always correct facial feature points and may not be consistent with the other feature points. Many factors can make the detectors unstable, such as noise, pose variations, cluttered backgrounds, and illumination changes. In this context, we introduce a data-driven semi-supervised framework that learns incrementally from both labeled and unlabeled datasets to minimize the effects of troublesome patterns and to prevent outliers in cooperation with the HDM, as described in Sect. 5.
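The alternating structure of Algorithm 1 can be sketched as follows; argmin_global, argmin_local, and objective stand in for the detector error and regularization terms of Eq. 4, and the iteration limit and tolerance are assumed parameters.

```python
# Sketch of the block-coordinate descent over Eq. 4 (Algorithm 1 is the
# authoritative procedure; the callables here are hypothetical stand-ins).
def block_coordinate_descent(image, hdm, init_global, init_points,
                             argmin_global, argmin_local, objective,
                             max_iter=20, tol=1e-3):
    X_g, x = init_global, list(init_points)
    prev = float("inf")
    for _ in range(max_iter):
        # Global block: minimize f(X_G; x^1..x^L, I, D) + R(X_G) with x fixed.
        X_g = argmin_global(image, x, hdm)
        # Local blocks: minimize f(x^l; X_G, x\{x^l}, I, D) + R(x^l) one at a time.
        for l in range(len(x)):
            others = x[:l] + x[l + 1:]
            x[l] = argmin_local(image, l, X_g, others, hdm)
        cost = objective(image, X_g, x, hdm)
        if prev - cost < tol:           # stop when Eq. 4 stops decreasing
            break
        prev = cost
    return X_g, x
```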

5 Data-Driven Semi-supervised Learning

In the proposed method, the semi-supervised learning steps iteratively construct the HDMs and H-DTRs using image data randomly selected from the unlabeled dataset. We initialize the 1st-generation HDM and H-DTR using the labeled image dataset. Recall that the semi-labeled image dataset contains both the human-annotated data given initially and the machine-annotated data, which may contain errors, produced during the semi-supervised learning steps. The data-driven semi-supervised learning is controlled by the hierarchical soft K-means clustering algorithm. The machine-annotated data are produced by the current H-DTR and are used to construct the next-generation HDM.
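For concreteness, a plain soft K-means step with a crude responsibility-based outlier cutoff is sketched below; the clustering actually used is the hierarchical outlier-aware soft K-means of [9], and the stiffness beta, the cutoff, and the random initialization are assumptions of this sketch.

```python
# Plain soft K-means with an ad-hoc outlier cutoff (an approximation of the
# outlier-aware soft K-means of [9]; beta, cutoff, and seeding are assumptions).
import numpy as np

def soft_kmeans(data, k, beta=1.0, cutoff=1e-3, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(len(data), size=k, replace=False)].astype(float)
    resp = np.zeros((len(data), k))
    for _ in range(n_iter):
        # E-step: affinities from negative squared distances to centroids.
        d2 = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        affinity = np.exp(-beta * d2)
        # Crude outlier handling: samples far from every centroid get zero weight.
        weights = (affinity.max(axis=1) > cutoff).astype(float)[:, None]
        resp = affinity / (affinity.sum(axis=1, keepdims=True) + 1e-12)
        # M-step: responsibility-weighted centroid update, skipping outliers.
        w = resp * weights
        centroids = (w.T @ data) / (w.sum(axis=0)[:, None] + 1e-12)
    return centroids, resp.argmax(axis=1)
```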

5.1 Semi-Labeled Dataset Update

The newly labeled data are merged into the semi-labeled dataset and constitute the next semi-labeled dataset. The H-DTR finds the best matching global structure model for \( \mathbf{X}_{G} \) based on a similarity measure over global Hausdorff vectors. If no matching global structure satisfies the global outlier constraint, the sample is rejected. Once the global structure \( \mathbf{X}_{G} \) is determined, the local detector finds the best matching local appearance model for each feature point \( \mathrm{x}^{l}\in\mathrm{A}(l) \). In the local detection phase, the current H-DTR performs local matching with k-NN regression. Algorithm 2 summarizes the update process of the semi-labeled dataset in the proposed learning framework.
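A sketch of this update is given below; Algorithm 2 in the paper is the authoritative procedure. The matching and rejection steps here simply reuse the hypothetical HDM outlier checks sketched in Sect. 4.2, and hdtr.localize is assumed to return the localized points, their SIFT descriptors, and the index of the best-matched global structure model.

```python
# Sketch of the semi-labeled dataset update (not the paper's Algorithm 2).
def update_semi_labeled(semi_labeled, unlabeled_batch, hdtr, hdm, tau_g, tau_l):
    for image in unlabeled_batch:
        result = hdtr.localize(image)
        if result is None:
            continue                                     # detector already declared an error
        # Global check: the localized structure must match some 1st-level model.
        if is_global_outlier(hausdorff_vector(result.points),
                             hdm.global_centroids, tau_g):
            continue
        # Local check: every feature point must match some 2nd-level model.
        j = result.global_model_index                    # best-matched global cluster
        ok = all(not is_local_outlier(result.sift[l],
                                      hdm.local_centroids[(j, l)], tau_l)
                 for l in range(len(result.points)))
        if ok:
            semi_labeled.append((image, result.points))  # merge machine-annotated sample
    return semi_labeled
```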

6 Experimental Results

The localization performance was evaluated from several points of view using popular face datasets, namely Bosphorus [14] and BioID [15]. Our method is compared with the state-of-the-art technologies reported in [16]. The experiments were performed in C++ on an Intel Core(TM)2 Quad CPU Q8400 at 2.66 GHz.

6.1 Performance Evaluation

Our localization method is compared with STASM V.4 [17] and 3-Level IMoFA [16]. We used 200 labeled and 500 unlabeled samples for the semi-supervised learning of the local detectors of our method, and 500 samples from Bosphorus were used for testing. The comparison results are given in Table 1. Our method achieves better average localization accuracy and lower mean errors than 3-Level IMoFA and STASM V.4.

Table 1. The localization accuracies of our method compared with those of STASM V.4 and 3-Level IMoFA on Bosphorus. Facial feature points: OEC (outer eye corners), IEC (inner eye corners), NT (nose tip), MC (mouth corners), OE (outer eyebrows), IE (inner eyebrows), PC (pupil centers), NS (nose saddles), and LOM (lip center of mouth).

In Fig. 3, the cumulative correct localization rate of the proposed method is compared with those of the state-of-the-art methods reported by Dibeklioglu et al. [16]. The performance of our method is comparable to that of other approaches [16] such as 3-Level IMoFA, Generative, Sliwiga, AAM, CLM, and BorMaN.

Fig. 3.
figure 3

Cumulative correct localization rate with respect to the overall feature localization error \( m_{e} \) for our method and other state-of-the-art technologies [16] on the BioID dataset.

For a fair comparison, semi-supervised learning is carried out simultaneously with testing using the unlabeled test samples.

7 Conclusion

Most state-of-the-art facial feature detectors rely only on labeled training data and thus have difficulty achieving robust performance when the variability of the images can hardly be predicted in advance. Instead of relying on a large amount of labeled training data, we employ unlabeled data, which can be gathered easily, to improve the generalization ability of feature localization. We presented an iterative algorithm for robust facial feature localization, in which the H-DTR improves the localization of facial feature points with incremental generalization ability.