1 Introduction

Analyzing facial expressions is one of the most important approaches to human emotion recognition. Facial expressions are defined as the facial changes that occur in response to a person’s inner emotional state and intentions [1]. Automatic facial expression recognition (FER) now has a wide range of applications, such as affective computing, interactive games, social psychology, synthetic animation, and intelligent robots [2].

Automatic FER systems can be divided into two categories: those based on static images and those based on dynamic image sequences [3]. Static methods use only the information in the current input image, whereas sequence-based methods can exploit temporal information from multiple frames to identify the expression. In both cases, the system receives static images or dynamic sequences as input and outputs the corresponding expression category. This work focuses on methods based on key frames extracted from dynamic image sequences.

In the past two decades, many attempts have been made to recognize facial expressions, and their effectiveness depends largely on the size of the labeled training set. A large-scale training set better reflects the real distribution of samples and hence yields a lower generalization error. However, manual annotation is demanding, time-consuming and expensive [4]. A semi-supervised method can use labeled and unlabeled data simultaneously to improve classification performance when few labeled samples are available, reduce the workload of manual labeling and enhance the practicability of FER [5].

There have been few attempts to recognize facial expressions using a semi-supervised method. Existing methods can be roughly divided into two categories: semi-supervised learning (SSL) [6,7,8] and semi-supervised clustering [9,10,11]. SSL exploits the distribution of the unlabeled data to enhance training. Semi-supervised clustering uses the labeled data to set pairwise constraints for cluster analysis. In 2004, Cohen et al. [6] were the first to apply SSL to facial expression recognition. They trained probabilistic classifiers based on Bayesian networks with labeled and unlabeled data and achieved an average recognition accuracy of 74.8% on the Cohn-Kanade dataset. Hady et al. [7] proposed a learning framework that exploits the unlabeled data by combining Co-Training with the one-against-one output-space decomposition approach, using Tri-Class SVMs as binary classifiers. The average recognition accuracy on the four basic expressions of the Cohn-Kanade dataset was 86.95%. Jiang et al. [8] focused on the problem of multi-pose facial expression recognition by bringing transfer learning into SSL. Liu et al. [9] addressed expression recognition in the wild with a semi-supervised framework that combined reference manifold learning with Semi-Supervised Non-negative Matrix Factorization to select discriminant unlabeled data for enhanced training. Liliana et al. [10] proposed a semi-supervised clustering method based on Fuzzy C-means (FCM) that takes the level of ambiguity of facial expressions into account. Araujo et al. [11] proposed a semi-supervised temporal clustering method and applied it to the complex problem of facial emotion categorization.

Although unlabeled samples can help to construct a more accurate model for facial expression classification, experiments show that some SSL methods perform even worse than supervised methods trained on the labeled samples alone [12, 13]. To address this problem, Li and Zhou presented the safe semi-supervised support vector machine (S4VM) [14], which explores multiple candidate low-density separators, estimates the decision boundary closest to the real situation and aims for the best classification effect. The authors describe S4VM as a safe semi-supervised classifier whose performance does not degrade when unlabeled data are used.

Inspired by Li and Zhou, this work proposes a semi-supervised learning method based on the DPND feature. The DPND feature proposed in our previous work [15] extracts the deep representations of the peak (fully expressive) frame and the neutral frame, and uses the difference between them to represent the facial expression. In this paper, to further improve robustness, a set of DPND features is extracted from each facial expression sequence by selecting the key frames nearest to the cluster centroids. Then, a cascaded semi-supervised classifier is constructed to classify facial expressions using both labeled and unlabeled samples. The final classification result for each sequence is decided by a vote over all key-frame pairs.

The rest of this paper is organized as follows. The details of the semi-supervised FER method are presented in Sect. 2. The experimental setup is described in detail, and the experiment results are given in Sect. 3. Section 4 concludes the paper.

2 The Proposed Method

This section describes the proposed semi-supervised FER approach in detail. The method consists of two main parts: (1) extraction of multiple DPND features from expression sequences and (2) construction of a cascaded semi-supervised classifier for FER.

2.1 Multiple DPND Feature Extraction

To address the FER problem, researchers have proposed many elaborate features to represent facial expressions during the past decades [16]. However, recent works show that features learned from millions of training samples by deep learning outperform manually designed features in face-related tasks, such as face detection [17] and face recognition [18]. Encouraged by these advances, the popular VGG-16 [19] is adopted as the network architecture for deep representation extraction in this study. The VGG-16 is pre-trained on the VGG face dataset, which contains 2.6 M face images from 2,622 subjects. When a face image is fed into the VGG-16, the neuron responses of one of its intermediate layers can be extracted as the deep representation of the image. In this paper, as in our previous work [15], the DPND feature is employed to describe the change between the neutral frame and the peak frame:

$$ f_{DPND} = \left( {f^{P} - f^{N} } \right)/N $$
(1)

where \( f^{P} \) and \( f^{N} \) are the deep representation features extracted from the peak frame and the neutral frame, respectively, and N is the normalization factor. The DPND feature can effectively retain facial expression information while suppressing individual differences and environmental noise.
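As an illustration of Eq. (1), the following minimal sketch computes a DPND vector from two feature vectors. The function name `dpnd_feature` and the toy inputs are hypothetical, and the choice of N as the L2 norm of the difference is an assumption, since the text only describes N as a normalization factor.

```python
import math

def dpnd_feature(peak, neutral):
    """Eq. (1): (f^P - f^N) / N for two equally long feature vectors.

    Assumption: N is the L2 norm of the difference vector; the paper
    only states that N is a normalization factor.
    """
    if len(peak) != len(neutral):
        raise ValueError("peak and neutral features must have equal length")
    diff = [p - n for p, n in zip(peak, neutral)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        return diff  # identical frames: return the zero vector unchanged
    return [d / norm for d in diff]

# Toy stand-ins for deep representations taken from a VGG-16 layer.
f_peak = [0.9, 0.1, 0.5]
f_neutral = [0.3, 0.1, 0.1]
f_dpnd = dpnd_feature(f_peak, f_neutral)
```

Components that do not change between the two frames (the second entry here) become zero, which is the sense in which the difference suppresses identity-specific information.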

For some standard facial expression datasets, such as CK+ [20], in which each sequence begins with the neutral expression and ends with the peak expression, the DPND feature can be easily obtained by the deep representation feature of the beginning frame and the end frame. However, the neutral frame and the peak frame of an expression sequence are not directly available in some datasets, such as the BU-4DFE [21]. To extract the DPND feature from expression sequences, a joint method of K-means clustering and rank-SVM is presented.

However, using a single DPND feature [15] from each sequence to represent the facial expression has two limitations: first, the extraction of key frames has a certain randomness due to the random initialization of the cluster centroids; second, the extracted key frames only approximate the neutral and peak frames. To further improve robustness, in this work a set of DPND features is extracted from each facial expression sequence by selecting the key frames nearest to the cluster centroids obtained using K-means. The final classification result for each sequence is decided by a vote over all key-frame pairs. In this way, the multiple DPND feature mitigates the problem caused by inaccurate selection of key frames. The experiments in Sect. 3 show that, compared to the single DPND feature, the multiple DPND feature indeed improves the accuracy of FER.
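The key-frame selection and voting steps can be sketched as follows. This is only an illustration under several assumptions: the two centroids are taken as given (in the paper they come from K-means, with rank-SVM used in the extraction pipeline), every near-neutral key frame is paired with every near-peak key frame, and all function names are hypothetical.

```python
import math
from collections import Counter

def nearest_frames(features, centroid, k):
    """Indices of the k frames whose features lie closest to a centroid."""
    order = sorted(range(len(features)),
                   key=lambda i: math.dist(features[i], centroid))
    return order[:k]

def key_frame_pairs(features, neutral_centroid, peak_centroid, k):
    """Pair each near-neutral key frame with each near-peak key frame.

    Assumption: pairs are formed exhaustively; the paper does not
    specify how key frames are combined into pairs.
    """
    neutral_idx = nearest_frames(features, neutral_centroid, k)
    peak_idx = nearest_frames(features, peak_centroid, k)
    return [(n, p) for n in neutral_idx for p in peak_idx]

def vote(labels):
    """Majority vote over per-pair predictions; ties go to the first label seen."""
    return Counter(labels).most_common(1)[0][0]

# Toy one-dimensional frame features for a five-frame sequence.
features = [[0.0], [0.1], [0.5], [0.9], [1.0]]
pairs = key_frame_pairs(features, [0.04], [0.96], k=2)
sequence_label = vote(["happy", "sad", "happy", "happy"])
```

Each index pair would yield one DPND feature, each feature one predicted label, and `vote` would reduce those labels to a single decision for the sequence.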

2.2 Construct a Cascaded Multi-class Classifier for FER

In this subsection, a cascaded structure is applied to S4VM to recognize the six basic facial expressions using the proposed DPND feature. The original S4VM proposed by Li and Zhou [14] is an inductive binary classifier. To apply it to FER tasks, a set of S4VMs is combined in a cascaded structure, in which each S4VM separates one facial expression class from the given dataset. A brief introduction to S4VM is given first.

Safe Semi-Supervised Support Vector Machine (S4VM).

Let \( {\mathcal{X}} \) be the input space and \( {\mathcal{Y}} = \left\{ { \pm 1} \right\} \) be the label space. A set of labeled data \( \left\{ {x_{i} ,y_{i} } \right\}_{i = 1}^{l} \) and a set of unlabeled data \( \left\{ {\hat{x}_{j} } \right\}_{j = 1}^{u} \) are given. The semi-supervised SVM (S3VM) aims to find a decision function \( f:{\mathcal{X}} \to \left\{ { \pm 1} \right\} \) and a label assignment on the unlabeled instances \( \hat{y} = \left\{ {\hat{y}_{1} , \ldots ,\hat{y}_{u} } \right\} \in {\mathcal{B}} \) such that the following objective function is minimized, where \( l\left( { \cdot , \cdot } \right) \) is a loss function and \( C_{1} \) and \( C_{2} \) are regularization parameters that weight the losses on the labeled and unlabeled data, respectively:

$$ h\left( {f,\hat{y}} \right) = \frac{{\parallel f\parallel_{H} }}{2} + C_{1} \mathop \sum \limits_{i = 1}^{l} l(y_{i} ,f\left( {x_{i} } \right)) + C_{2} \mathop \sum \limits_{j = 1}^{u} l(\hat{y}_{j} ,f\left( {\hat{x}_{j} } \right)) $$
(2)
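To make the three terms of Eq. (2) concrete, the sketch below evaluates the objective for a fixed linear decision function and a fixed label assignment. It rests on two assumptions that the text does not state: the loss is taken to be the hinge loss, and f is taken to be linear, f(x) = w·x + b, so that the norm of f reduces to the Euclidean norm of w. The function names and toy values are hypothetical.

```python
import math

def hinge(y, score):
    """Hinge loss max(0, 1 - y * f(x)); an assumed choice for l in Eq. (2)."""
    return max(0.0, 1.0 - y * score)

def s3vm_objective(w, b, labeled, unlabeled, y_hat, c1, c2):
    """Value of h(f, y_hat) in Eq. (2) for a linear f(x) = w.x + b."""
    def f(x):
        return sum(wi * xi for wi, xi in zip(w, x)) + b

    norm_term = math.sqrt(sum(wi * wi for wi in w)) / 2.0
    labeled_loss = sum(hinge(y, f(x)) for x, y in labeled)
    unlabeled_loss = sum(hinge(yh, f(x)) for x, yh in zip(unlabeled, y_hat))
    return norm_term + c1 * labeled_loss + c2 * unlabeled_loss

labeled = [([2.0, 0.0], +1), ([-0.5, 0.0], -1)]   # (x_i, y_i) pairs
unlabeled = [[0.25, 0.0]]                          # x_hat_j
value = s3vm_objective([1.0, 0.0], 0.0, labeled, unlabeled,
                       y_hat=[+1], c1=1.0, c2=0.1)
```

An S3VM solver would search over both the decision function and the assignment `y_hat`; this snippet only scores one candidate.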

S4VM focuses on the safeness of SSL algorithms. Its main idea is to generate multiple low-density separators to approximate the ground-truth decision boundary and to maximize the improvement in performance over the inductive SVM for any candidate separator. To generate a pool of diverse separators \( \left\{ {f_{t} } \right\}_{t = 1}^{T} \), the following function is minimized:

$$ \mathop {\min }\limits_{{\left\{ {f_{t} ,\hat{y}_{t} \in {\mathcal{B}}} \right\}_{t = 1}^{T} }} \sum\limits_{t = 1}^{T} {h\left( {f_{t} ,\hat{y}_{t} } \right)} + M\varOmega \left( {\left\{ {\hat{y}_{t} } \right\}_{t = 1}^{T} } \right), $$
(3)

where T is the number of separators, \( \varOmega \) is a penalty term that quantifies the diversity of the separators, and M is a large constant that enforces diversity. A variety of methods can be adopted to solve this optimization problem, such as global simulated annealing search and representative sampling.

To learn a label assignment \( {\mathbf{y}} \) that improves on the prediction of the inductive SVM, \( \varvec{y}^{svm} \), the worst-case improvement over the inductive SVM is maximized, and the optimal solution is denoted by \( {\bar{\mathbf{y}}} \):

$$ \bar{\varvec{y}} = \mathop {\arg \max }\limits_{{\varvec{y}}} \mathop {\min }\limits_{{\hat{\varvec{y}} \in \left\{ {\hat{y}_{t} } \right\}_{t = 1}^{T} }} \left[ {gain\left( {\varvec{y},\hat{\varvec{y}},\varvec{y}^{{\varvec{svm}}} } \right) - loss\left( {\varvec{y},\hat{\varvec{y}},\varvec{y}^{{\varvec{svm}}} } \right)} \right] $$
(4)

where \( gain\left( {\varvec{y},\hat{\varvec{y}},\varvec{y}^{svm} } \right) \) and \( loss\left( {\varvec{y},\hat{\varvec{y}},\varvec{y}^{svm} } \right) \) are the gained and lost accuracies compared to the inductive SVM, respectively. Li and Zhou [14] show that, under the assumption that the ground-truth label assignment is realized by one of the candidate separators, the accuracy of \( {\bar{\mathbf{y}}} \) is never worse than that of \( \varvec{y}^{{\varvec{svm}}} \) and achieves the maximal performance improvement over that of \( \varvec{y}^{{\varvec{svm}}} \) in the worst case.
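The max-min selection in Eq. (4) can be illustrated on a toy problem by exhaustive search. This is a sketch only: gain and loss are assumed to be counts of unlabeled instances (gained when the assignment agrees with a candidate where the SVM does not, lost in the opposite case), the enumeration over all assignments is feasible only for a handful of unlabeled points, and it is not the optimization procedure used by S4VM. All names are hypothetical.

```python
from itertools import product

def gain(y, y_hat, y_svm):
    """Instances where y agrees with the candidate y_hat but the SVM does not."""
    return sum(1 for a, t, s in zip(y, y_hat, y_svm) if a == t and s != t)

def loss(y, y_hat, y_svm):
    """Instances where the SVM agrees with the candidate y_hat but y does not."""
    return sum(1 for a, t, s in zip(y, y_hat, y_svm) if a != t and s == t)

def worst_case(y, candidates, y_svm):
    """Inner minimum of Eq. (4): the least favorable candidate for y."""
    return min(gain(y, c, y_svm) - loss(y, c, y_svm) for c in candidates)

def safest_assignment(candidates, y_svm):
    """Outer argmax of Eq. (4) by brute force over all {-1, +1} assignments."""
    best, best_score = None, None
    for y in product((-1, 1), repeat=len(y_svm)):
        score = worst_case(y, candidates, y_svm)
        if best_score is None or score > best_score:
            best, best_score = list(y), score
    return best, best_score

y_svm = [1, 1, -1, -1]                       # inductive SVM predictions
candidates = [[1, -1, -1, -1],               # labelings from two separators
              [1, -1, -1, 1]]
y_bar, score = safest_assignment(candidates, y_svm)
```

Choosing the SVM prediction itself always scores zero against every candidate, so the maximizer can never have a negative worst-case score; this is the intuition behind the safeness property.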

Multi-class Classification with the Cascaded S4VM.

The original S4VM is designed for binary classification problems; thus, it must be extended into a multi-class classifier for FER. The most common strategies are one-against-one and one-against-all. However, S4VM, as an inductive method, cannot use one-against-one to construct a multi-class classifier, while one-against-all is inefficient because every binary classifier is trained on the same large training set.

This paper constructs a multi-class classifier based on a cascaded structure [22, 23], which remains inductive and makes effective use of the unlabeled data. In detail, the training set, which contains labeled and unlabeled data, is fed into the cascaded classifier, and each S4VM classifier picks out the samples of its specified class. The identified unlabeled data and the corresponding labeled data are removed from the training set, while the remaining samples are passed to the next S4VM classifier.

It is worth noting that the performance of the multi-class classifier varies widely with the cascade order. To design a more effective cascaded classifier, the order of the S4VM classifiers is determined according to a discriminant measure computed on the labeled data. The ratio of the intra-class distance to the inter-class distance is defined as the separability measure:

$$ {\text{S}}_{p} = \frac{{D_{pp} }}{{\mathop \sum \nolimits_{q \ne p} D_{pq} }} $$
(5)

where \( D_{pq} = \frac{1}{\left| p \right|\left| q \right|}\mathop \sum \limits_{i \in p,j \in q} d_{ij} \) is the average distance between samples of class p and samples of class q, and \( d_{ij} \) is the distance between samples i and j. A small \( {\text{S}}_{p} \) indicates a class that is compact and far from the other classes, so the classes are separated from the training set in ascending order of \( {\text{S}}_{p} \). The correspondingly sorted classes are denoted \( p_{1} ,p_{2} , \ldots ,p_{m} \). Then, a classifier with a cascaded structure is constructed, such as that shown in Fig. 1.

Fig. 1. Multi-class classification based on a cascaded structure.
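The ordering rule of Eq. (5) can be sketched as follows. The Euclidean distance is an assumed choice for the pairwise distance, the intra-class average follows the formula literally (including the zero self-distances), and the function names and toy data are hypothetical.

```python
import math

def mean_distance(class_a, class_b):
    """D_pq: average pairwise distance between the samples of two classes."""
    total = sum(math.dist(a, b) for a in class_a for b in class_b)
    return total / (len(class_a) * len(class_b))

def cascade_order(samples_by_class):
    """Sort classes by S_p = D_pp / sum_{q != p} D_pq, in ascending order."""
    scores = {}
    for p, xs in samples_by_class.items():
        d_pp = mean_distance(xs, xs)
        d_rest = sum(mean_distance(xs, ys)
                     for q, ys in samples_by_class.items() if q != p)
        scores[p] = d_pp / d_rest
    return sorted(scores, key=scores.get), scores

# Toy labeled samples: two tight classes and one spread-out class.
toy = {
    "happy": [[0.0, 0.0], [0.0, 1.0]],
    "sad": [[5.0, 0.0], [5.0, 1.0]],
    "fear": [[2.0, 0.0], [2.0, 4.0]],
}
order, scores = cascade_order(toy)
```

In this toy example the spread-out class receives the largest score and is therefore handled last in the cascade.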

Samples of class \( p_{1} \) are assigned to the positive category, and samples of the remaining classes are assigned to the negative category; then, the first binary sub-classifier S4VM1 is trained. After that, samples of class \( p_{1} \) are removed from the training set. Similarly, samples of class \( p_{2 } \) are assigned to the positive category, and the rest of the samples are assigned to the negative category; then, the second binary sub-classifier S4VM2 is trained. This procedure is repeated until all the sub-classifiers are trained. Finally, a cascaded S4VM is obtained.
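The cascaded training and prediction procedure can be sketched as below. To keep the example self-contained, each stage uses a simple nearest-centroid rule as a hypothetical stand-in for an S4VM, trained on labeled samples only; the handling of unlabeled data is therefore omitted. Training m-1 stages for m classes, with the last class as the fallback, is also an assumption of this sketch.

```python
import math

class NearestCentroid:
    """Hypothetical stand-in for one S4VM stage (supervised, labeled data only)."""

    def fit(self, xs, ys):
        self.centroids = {}
        for label in (+1, -1):
            members = [x for x, y in zip(xs, ys) if y == label]
            self.centroids[label] = [sum(col) / len(members)
                                     for col in zip(*members)]
        return self

    def predict(self, x):
        return min(self.centroids,
                   key=lambda lab: math.dist(x, self.centroids[lab]))

def train_cascade(xs, labels, order, make_stage=NearestCentroid):
    """Train one class-vs-rest stage per class, removing each class afterwards."""
    stages = []
    remaining = list(zip(xs, labels))
    for cls in order[:-1]:  # the last class is what remains after all stages
        feats = [x for x, _ in remaining]
        targets = [+1 if lab == cls else -1 for _, lab in remaining]
        stages.append((cls, make_stage().fit(feats, targets)))
        remaining = [(x, lab) for x, lab in remaining if lab != cls]
    return stages, order[-1]

def predict_cascade(stages, fallback, x):
    """Return the class of the first stage that accepts x, else the fallback."""
    for cls, stage in stages:
        if stage.predict(x) == +1:
            return cls
    return fallback

xs = [[0, 0], [0, 1], [5, 0], [5, 1], [2, 8], [3, 8]]
labels = ["happy", "happy", "sad", "sad", "fear", "fear"]
stages, fallback = train_cascade(xs, labels, ["sad", "happy", "fear"])
predictions = [predict_cascade(stages, fallback, x)
               for x in ([4.8, 0.4], [0.2, 0.5], [2.4, 7.5])]
```

Because each stage only sees the samples that earlier stages did not claim, later binary problems are smaller than in a one-against-all scheme.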

3 Experiments

3.1 Experimental Protocol

To evaluate the effectiveness of the proposed algorithm, two public sequence-based datasets, CK+ [20] and BU-4DFE [21], were chosen for the experiments; the CK+ dataset was also used in [10]. The details of these two datasets are listed in Table 1. In our experiments, only the six basic expressions (angry, disgust, fear, happy, sad and surprise) were considered, and we extracted a subset of 53 subjects from CK+ and a subset of 64 subjects from BU-4DFE. Some samples from the two datasets are shown in Fig. 2. For the CK+ dataset, the DPND feature is the difference between the deep representation features of the first frame and the last frame; for BU-4DFE, the DPND feature is extracted directly from the facial sequences by our proposed method.

Table 1. Details of the CK+ and BU-4DFE datasets.

Fig. 2. Exemplar expression images in the CK+ and BU-4DFE datasets.

3.2 Comparison Among the Multiple DPND, the Single DPND and the DPR Feature

To show the effectiveness of the DPND feature, we compared it with a static feature, namely the deep representation of the peak frame (DPR feature) extracted from the VGG-16 network. The proposed cascaded S4VM was then employed to evaluate the effects of the different features. For BU-4DFE, the multiple DPND feature was extracted from a set of key-frame pairs nearest to the cluster centroids. It is noteworthy that the labeled samples accounted for only 10% of the training set in this experiment. The average accuracies of the different features are listed in Table 2. The results indicate that the accuracy of the single DPND feature on CK+ and BU-4DFE is 8.5% and 21% higher, respectively, than that of the DPR feature, and that the accuracy of the multiple DPND feature is 3.4% higher than that of the single DPND feature on BU-4DFE. These results support the effectiveness of the DPND feature, especially the multiple DPND feature.

Table 2. Average accuracy of the DPND and DPR features.

3.3 Comparisons with the State-of-the-Art Method

In this subsection, we compare the proposed method (the cascaded S4VM with the DPND feature) with the current state-of-the-art method [10] on the CK+ dataset. The method in [10] is based on an SSL algorithm. It first employs an Active Appearance Model to detect facial points for feature extraction and then uses semi-supervised Fuzzy C-Means as the classifier; we refer to this method as SSFCM. It selected 329 images of eight emotions from the CK+ dataset, of which 63% were used as the training set and the remaining samples were used for testing. The average accuracies of the proposed method and the SSFCM method are shown in Table 3. The proposed method outperforms the SSFCM method [10] even though SSFCM selected the peak frames from the sequences manually and used more labeled data than our method.

Table 3. Average accuracies of the proposed method and the current state-of-the-art method on the CK+.

3.4 Comparison with the Supervised Classification

In this subsection, we use the CK+ and BU-4DFE datasets to evaluate the capability of the SSL method for FER. To this end, the proposed cascaded S4VM and SVM were used as expression classifiers, with SVM as the baseline because it has been demonstrated to be a successful approach for FER tasks. The performance of the cascaded S4VM was calculated based on its outputs, including the list of generated labels for the unlabeled data. Using the same data, SVM was applied as the fully supervised counterpart of the cascaded S4VM (see Table 4) to compare semi-supervised and supervised learning. The results demonstrate that although only a small proportion of each dataset was labeled (10%), the accuracy of the cascaded S4VM on CK+ and BU-4DFE is 5% and 12% higher, respectively, than that of SVM.

Table 4. Accuracy of the cascaded S4VM compared to SVM.

For further evaluation, the accuracy of the cascaded S4VM was measured with different proportions of labeled data (10%, 12.5%, 17%, 20%, 25% and 50%), as shown in Fig. 3. In all these experiments, the cascaded S4VM achieved better accuracy than SVM, especially when few labeled data were available, which confirms the effectiveness of the cascaded S4VM. The results illustrate that, by combining information from labeled and unlabeled samples, the cascaded S4VM can estimate the distribution of the data more reasonably and adjust the decision boundary accordingly to improve the classification accuracy. Figure 3 also shows that as the proportion of labeled data increases, the accuracies of the cascaded S4VM and SVM both increase and converge.

Fig. 3. Accuracy with different percentages of labeled data.

4 Conclusion

In this paper, we propose a semi-supervised method based on the multiple DPND feature for FER. The DPND feature tends to emphasize the facial parts that change in the transition from the neutral to the expressive face and to suppress differences in individual face identity and environmental noise. In this work, multiple DPND features are extracted from each sequence to improve the robustness of the feature representation. Then, a cascaded semi-supervised classifier is constructed to recognize the six basic facial expressions using both labeled and unlabeled data. The proposed method achieves an accuracy of 89.4% on the CK+ dataset and an accuracy of 71.8% on the BU-4DFE dataset when only 10% of the training samples are labeled. These encouraging results on public datasets suggest that our method has strong potential for recognizing facial expressions in real-world applications.