
1 Introduction

Over the last two decades, smart devices and applications have entered our daily lives, and robots are beginning to be used in shops [5], schools, and hospitals [22]. However, many of them lose their appeal once the novelty effect wears off. Haag et al. [10] argued that communication between humans and systems can be improved by treating emotion as an additional interaction modality. Meanwhile, researchers have shown that systems which recognize and respond to human emotions are perceived as more caring, likable, supportive, and trustworthy [2]. Hence, recognizing human emotions has become an important research topic.

Emotion detection is the ability to recognize another's affective state, which typically involves the integration and analysis of expressions across different modalities, such as facial expressions, speech, body movements, and gestures [3]. Since 55% of human emotional content is conveyed by facial expression [17], Facial Emotion Recognition (FER) is the most investigated approach to human emotion recognition.

FER comprises two main parts, facial expression analysis and facial behavior analysis, as shown in Fig. 1. Facial expression analysis is carried out via two main approaches: feature extraction and Action Unit (AU) detection. Feature extraction approaches proceed by detecting the face region and facial components, e.g., eyebrows, eyes, nose, and mouth, in an input image. Two types of features are then extracted: geometric and appearance features. Geometric features represent the positions of salient points of the face, e.g., the corners of the eyes, the tip of the nose, the mouth, and the shapes of the facial components, while appearance features represent texture variations of the face, e.g., color, edge density, crinkles, and wrinkles [28]. Finally, a pre-trained machine learning classifier attempts to classify the given face as portraying one emotion [12].
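As an illustration of the feature extraction route, the short sketch below computes a few geometric features as Euclidean distances between landmark positions; the landmark indices and the chosen distances are placeholders for illustration, not tied to any specific detector.

```python
import numpy as np

def geometric_features(landmarks):
    """Toy geometric features from 2-D facial landmarks (one (x, y) row per point).

    Indices 0-5 are assumed to be: left/right outer eye corners, left/right
    mouth corners, nose tip, and chin; a real detector uses its own numbering.
    """
    lm = np.asarray(landmarks, dtype=float)
    inter_ocular = np.linalg.norm(lm[0] - lm[1])   # distance between eye corners
    mouth_width = np.linalg.norm(lm[2] - lm[3])    # distance between mouth corners
    lower_face = np.linalg.norm(lm[4] - lm[5])     # nose tip to chin
    return np.array([inter_ocular, mouth_width, lower_face])
```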

The AU detection methods, however, are independent of facial appearance and analyse facial muscle movements by tracking AUs. Each AU corresponds to the fundamental movement of a single muscle or a group of muscles. During the facial expression of different emotions, different combinations of AUs are activated. Ekman [9] defined the Facial Action Coding System (FACS), which encodes the movements of AUs to describe human facial movements and maps the detected AUs to the corresponding emotion. An important advantage of AU detection methods is that they remove the need to analyse complex, high-dimensional features [24].
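To make the FACS idea concrete, the sketch below encodes commonly cited prototypical AU combinations for the six basic emotions and looks up which prototypes are fully covered by a set of detected AUs. These combinations come from the general FACS literature, not from this paper's Table 1, and the exact sets vary between sources, which is precisely the ambiguity discussed later.

```python
# Commonly cited prototypical AU combinations (exact sets differ between sources).
PROTOTYPICAL_AUS = {
    "happiness": {6, 12},
    "sadness":   {1, 4, 15},
    "surprise":  {1, 2, 5, 26},
    "fear":      {1, 2, 4, 5, 7, 20, 26},
    "anger":     {4, 5, 7, 23},
    "disgust":   {9, 15, 16},
}

def match_emotions(active_aus):
    """Return every emotion whose prototypical AU set is fully activated."""
    active = set(active_aus)
    return [emotion for emotion, aus in PROTOTYPICAL_AUS.items() if aus <= active]

print(match_emotions({1, 2, 5, 26}))   # ['surprise']
```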

Fig. 1. General Facial Emotion Recognition approaches.

Facial behavior analysis is the other way to perform FER. Cohn et al. [4] proposed two conceptual approaches for studying facial behavior: the “message-based” approach and the “sign-based” approach. Message-based approaches categorize facial behaviors according to the meaning of the expressions and are widely used by psychologists. They can be divided into discrete categorical and continuous dimensional methods. Discrete categorical methods assign an expression to one of several pre-defined prototypical categories, such as the six basic emotions proposed by Ekman [8]: anger, disgust, fear, happiness, sadness, and surprise. Continuous dimensional methods instead describe each facial expression along continuous axes, such as arousal and valence [30].

Sign-based approaches, however, describe facial actions regardless of their meaning, and different expressions are classified based on the activated AUs [19]. Indeed, sign-based approaches are similar to AU detection approaches.

Since sign-based algorithms recognize emotion by detecting the activated AUs in a given image or video, the sign-based FER problem can be transformed into the problem of detecting activated AUs [25]. Hence, using a suitable toolkit such as OpenFace [1], the activation values of facial AUs can be obtained and used to train a model for emotion detection. However, as Du et al. [6] showed, determining the exact combination of activated AUs for each emotion is difficult. The main contribution of this study is therefore to find the most pivotal activated AUs for each emotion. To this end, we developed a Stacked Auto Encoder (SAE) deep network that takes the status of 15 facial AUs as input and extracts high-order features of the input data that humans cannot easily craft. Given these automatically extracted features, we added a Softmax layer to fulfill the classification task.

The remainder of this paper is structured as follows: Sect. 2 reviews previous work. The proposed model is described in Sect. 3. Section 4 presents the experimental results. Finally, Sect. 5 concludes the paper.

2 Related Work

Originally, classical machine learning algorithms such as Bayesian Networks [18], Gaussian Mixture Models [26], Hidden Markov Models [23], and Neural Networks [29] were applied to detect expressed facial emotions. The quality of the training data, e.g., image resolution, face view angle, and the way emotions are labeled, strongly influences the results of the training algorithm and is the main obstacle for classical FER algorithms.

In contrast, the promising results of neural network and deep learning (DL) based approaches in comparison with classical machine learning algorithms have led the research community to propose numerous DL-based FER methods. The emergence of deep learning as a general end-to-end learning approach also removes the need for handcrafted feature detection [7].

There are two approaches in FER: one that does not use the input's temporal information, called frame-based, and one that uses the temporal information of image sequences, known as sequence-based. The input of frame-based approaches is a single image without a reference frame, while the input of sequence-based approaches is a sequence of one or more frames [13]. Since our proposed model is frame-based, this section focuses on state-of-the-art frame-based methods.

Pitaloka et al. [21] used a Convolutional Neural Network (CNN) based method to recognize the six basic emotions. The proposed method comprises five layers: two convolution layers, two max-pooling layers, and a fully connected layer for classification. After pre-processing, the input image is fed to the first convolution layer to extract features such as edges, corners, and shapes. The output is then passed to the first max-pooling layer to reduce the image size. The compact representation is sent to the second convolution layer to obtain higher-order features and afterwards passed to the second max-pooling layer to reduce the final output size. The fully connected layer at the end classifies the output into one of the six basic emotions. However, the performance of the algorithm decreases as the image dimensions increase, owing to the complexity of high-dimensional images.
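For illustration, the following is a minimal Keras sketch with the layer layout described above (two convolution layers, two max-pooling layers, and a fully connected classifier); the input size, filter counts, and kernel sizes are assumptions, not the configuration used by Pitaloka et al.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_small_cnn(input_shape=(48, 48, 1), num_classes=6):
    # Two conv + max-pool stages followed by a fully connected Softmax classifier.
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, (3, 3), activation="relu"),   # edges, corners, shapes
        layers.MaxPooling2D((2, 2)),                    # reduce spatial size
        layers.Conv2D(64, (3, 3), activation="relu"),   # higher-order features
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```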

Liu et al. [14] proposed a sign-based deep neural network architecture called AU-aware Deep Networks (AUDN) to investigate the effect of AUs on emotion recognition. The proposed AUDN includes three sequential modules. In the first module, a convolution layer stacked with a max-pooling layer generates an over-complete representation of all expression-specific appearance variations. In the second module, an AU-aware receptive field layer searches subsets of the over-complete representation to best simulate combinations of AUs. The third module consists of multilayer Restricted Boltzmann Machines (RBMs) that learn hierarchical features. Once the features are obtained, a linear SVM classifier is applied to recognize the six basic emotions. However, the AU-aware layers in the second module are not able to detect all FACS action units in images.

Although various state-of-the-art algorithms have been proposed in the field of FER, emotion detection remains a challenging problem in computer vision. In this study, we propose a new SAE-based model that addresses the FER challenge in two steps. In the first step, the proposed SAE extracts the most pivotal AUs, and in the second step these extracted features are fed to a categorical Softmax classifier to detect the six basic emotions. The next section details the proposed model.

3 Proposed Model

According to sign-based FER approaches, one way to recognize facial emotion expression is to detect the status of all individual AUs and then analyze combinations of activated AUs. For example, if a face has been analyzed as having activated AU5 and AU26, a properly trained algorithm should classify it as expressing “surprise”. However, Du et al. [6] showed that encoding the activated AUs into a specific emotion is difficult when the expressed emotion is a mixture of several emotions. For instance, when someone is surprised by good news, all AUs related to both happiness and surprise can be activated, whereas if they are shocked by an online scam, the AUs related to sadness, anger, and surprise can be activated at the same time. This ambiguity of emotion expression makes FER a challenging task.

The SAE is able to extract higher-order features and detect relations between AUs that human experts and conventional machine learning techniques cannot easily find. We therefore used a SAE deep network to extract the most effective combinations of AUs for each emotion and used them as the feature set to train our classifier. Figure 2 shows the overall scheme of our proposed model for the emotion type detection task, and the list of the applied AUs is shown in Table 1. The next subsections explain the principles of the SAE and the architecture and methodology of the developed deep SAE for emotion type detection.

Table 1. The list of applied Action Units and related emotions. The pivotal AUs of each emotional state obtained by the proposed model are indicated by \(\varvec{*}\), and the pivotal AUs obtained by [6] are indicated by +.
Fig. 2. OpenFace reads both images and videos and returns the activation values of different AUs. Passing the AU values to the SAE yields abstracted features, which are then fed to a Softmax classifier to obtain the type of emotion that the AUs express.

3.1 Principles of the Stacked Auto Encoder

A SAE is a deep neural network consisting of several hidden layers in which the output of each layer is fed as input to the next layer. The inner layers produce higher-order features, i.e., features that are not easy for humans to craft by hand. Equation 1 gives the encoding step for the \(k^{th}\) layer:

$$\begin{aligned} a^{k+1}=F(\omega ^k a^k + b^k), \end{aligned}$$
(1)

where F is the activation function, e.g., sigmoid or Rectified Linear Unit (ReLU), and \(\omega ^k\) and \(b^k\) are the weight matrix and bias vector of the units of the \(k^{th}\) layer. The decoding step is given by running the decoding stack of each AE in reverse order, as shown in Eq. 2:

$$\begin{aligned} a^{n+k+1}=F(\omega ^{n-k} a^{n+k}+b^{n-k}), \end{aligned}$$
(2)

where \(a^n\), the activation of the deepest hidden layer, contains the information of interest. By using the input values as target output values, the SAE learns a high-order, i.e., low-dimensional, representation of the input at layer n. This vector represents the input in terms of higher-order features and can be used for classification by feeding \(a^n\) to a Softmax classifier. After training the SAE, the encoder part of the network is kept and the activation values of its last layer are fed to the classification layer, which uses a Softmax activation function to handle more than two classes.
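A minimal NumPy sketch of Eqs. 1 and 2 is shown below, assuming ReLU as F and assuming the decoder reuses the encoder weights in reverse order (transposed so that the dimensions match); the layer widths are placeholders.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def sae_forward(a0, weights, enc_biases, dec_biases):
    """Encode with Eq. (1), then decode with Eq. (2) by running the stack in reverse."""
    a = a0
    for w, b in zip(weights, enc_biases):          # a^{k+1} = F(w^k a^k + b^k)
        a = relu(w @ a + b)
    code = a                                       # a^n: low-dimensional features
    for w, b in zip(reversed(weights), reversed(dec_biases)):
        a = relu(w.T @ a + b)                      # decoder reuses w^{n-k} (transposed)
    return code, a                                 # features and reconstruction

# Example: a 30 -> 20 -> 10 encoder applied to a random AU vector.
rng = np.random.default_rng(0)
W = [rng.normal(size=(20, 30)), rng.normal(size=(10, 20))]
b_enc = [np.zeros(20), np.zeros(10)]
b_dec = [np.zeros(30), np.zeros(20)]
code, recon = sae_forward(rng.random(30), W, b_enc, b_dec)
```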

Table 2. The architecture of the applied SAE neural network for the six-class classification task.

3.2 SAE Architecture for Emotion Type Detection

Table 2 shows the architecture of the proposed SAE for emotion detection. We used OpenFace to extract the AU values of the training data. The activation values of 15 different AUs, in both regression and binary scale, are represented by a \(30\times 1\) vector and fed as input to the developed SAE, where the regression values of the AUs are normalized between 0 and 1. The features extracted by the SAE form a \(10\times 1\) vector, which is fed to the Softmax layer. Given this abstracted feature vector, the Softmax layer classifies the original input into one of the six basic emotion classes.
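A sketch of how such an input vector could be assembled from OpenFace output is given below, assuming one CSV row per frame with intensity columns named like AU01_r (0-5 scale) and presence columns named like AU01_c (0/1), and using a placeholder list of 15 AUs; the actual AUs are those in Table 1, and the ordering and normalization here are assumptions.

```python
import numpy as np
import pandas as pd

# Placeholder AU subset; the 15 AUs actually used are listed in Table 1.
AUS = ["AU01", "AU02", "AU04", "AU05", "AU06", "AU07", "AU09", "AU10",
       "AU12", "AU15", "AU17", "AU20", "AU23", "AU25", "AU26"]

def build_input_vector(openface_csv, frame=0):
    """Build the 30x1 vector: 15 normalized intensities followed by 15 binary flags."""
    df = pd.read_csv(openface_csv)
    df.columns = df.columns.str.strip()              # OpenFace pads some column names
    row = df.iloc[frame]
    intensities = np.array([row[f"{au}_r"] for au in AUS]) / 5.0   # map 0-5 to 0-1
    presence = np.array([row[f"{au}_c"] for au in AUS])            # already 0/1
    return np.concatenate([intensities, presence]).reshape(30, 1)
```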

The hyper-parameters of the SAE, e.g., the learning rate and dropout, are selected by a 2D grid search during the learning process. The number of epochs and the batch size are both set to 200; in the fully connected layer, “Adam” is used as the optimizer and “Softmax” as the supervised categorical classifier. Reconstructing the features obtained from the SAE revealed the most pivotal AUs for each emotion, as shown in Table 1.
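The sketch below illustrates the two-stage setup in Keras: an auto-encoder is first trained to reconstruct the 30-dimensional AU vector, then its encoder is reused under a Softmax layer. The hidden-layer widths, dropout placement, and the fixed learning-rate and dropout values (which the paper selects via the 2D grid search) are assumptions; X_train and y_train are assumed to hold the AU vectors and one-hot emotion labels.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Stage 1: unsupervised pre-training, input reproduced as output (trained
# jointly here for brevity; a SAE can also be pre-trained layer by layer).
inp = layers.Input(shape=(30,))
h = layers.Dense(20, activation="relu")(inp)                  # width is a placeholder
code = layers.Dense(10, activation="relu", name="code")(h)    # 10-d abstracted features
h_dec = layers.Dense(20, activation="relu")(code)
out = layers.Dense(30, activation="sigmoid")(h_dec)
autoencoder = models.Model(inp, out)
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X_train, X_train, epochs=200, batch_size=200, verbose=0)

# Stage 2: keep the encoder and attach the supervised Softmax classifier.
encoder = models.Model(inp, code)
clf = models.Sequential([
    encoder,
    layers.Dropout(0.3),                                      # value chosen by grid search
    layers.Dense(6, activation="softmax"),
])
clf.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
            loss="categorical_crossentropy", metrics=["accuracy"])
clf.fit(X_train, y_train, epochs=200, batch_size=200, verbose=0)
```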

Table 3. State-of-the-art algorithms in FER over CK+ and MMI datasets.
Table 4. Number of samples for each emotion class in two datasets.

4 Verification and Results

To verify the accuracy of the proposed model, we applied it to three well-known datasets and compared the results with those of two state-of-the-art methods that have shown convincing performance on these datasets. Table 3 summarizes the two baselines. One baseline uses a convolutional neural network, while the other uses an SVM. Since the training and testing protocols and the datasets used differ between the baselines, we defined separate experiments to align with each compared method. As both baselines report their accuracy with confusion matrices, we also report the accuracy of our model with confusion matrices. In the following, we first review the datasets, then the comparison between the proposed model and the baselines is discussed through the different experiments. For ease of reading, in the next subsections happiness, sadness, fear, anger, disgust, and surprise are denoted by H, Sa, F, A, D, and Su, respectively.
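For reference, a row-normalized confusion matrix of the kind reported in the following tables can be computed as in this short sketch, assuming a trained classifier clf (e.g., from the previous sketch) and held-out X_test and one-hot y_test.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

class_names = ["A", "D", "F", "H", "Sa", "Su"]        # assumed label order
y_true = y_test.argmax(axis=1)
y_pred = clf.predict(X_test).argmax(axis=1)

cm = confusion_matrix(y_true, y_pred)
cm_pct = 100.0 * cm / cm.sum(axis=1, keepdims=True)   # per-class recognition rates
print("overall accuracy: %.2f%%" % (100 * accuracy_score(y_true, y_pred)))
print(dict(zip(class_names, np.diag(cm_pct).round(2))))
```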

4.1 Datasets

The extended Cohn-Kanade database (CK+) [16] contains 593 sequences of frontal-face images from 123 subjects ranging from 18 to 50 years old; however, only 327 sequences from 118 subjects are labeled. MMI [20] contains 203 video sequences, including different head poses and subtle expressions, of 19 participants with ages ranging from 19 to 62 years old. The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) [15] contains frontal-face videos of 12 female and 12 male North American actors expressing the six basic emotions, plus calm and neutral. Since the sequences of these datasets start with a neutral frame and end at the apex of the target emotion, [31] removed the beginning frames and [11] labeled them as neutral; we likewise removed the beginning frames of both datasets. Table 4 shows the number of samples per expression class in the CK+ and MMI datasets used in our experiments.

4.2 Experiment A: Recognition Rate on CK+ Dataset for 6 Emotion Classes

The first experiment is conducted on the CK+ dataset. The best accuracy on CK+ for six-emotion recognition was presented by Zhao et al. [31], who used a leave-one-out cross-validation strategy. Hence, we also applied the leave-one-out method to verify the accuracy of the proposed model. Table 5 compares the confusion matrix of the proposed model with the results reported in [31] on the CK+ dataset.
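A sketch of this evaluation protocol is shown below, reading "leave-one-out" as leave-one-subject-out (an assumption); a LogisticRegression stands in for the SAE + Softmax model purely to illustrate the splitting scheme, with X, y, and subjects assumed to hold the AU vectors, integer labels, and subject IDs.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

logo = LeaveOneGroupOut()                       # one fold per held-out subject
scores = cross_val_score(LogisticRegression(max_iter=1000),
                         X, y, groups=subjects, cv=logo)
print("mean leave-one-subject-out accuracy: %.2f%%" % (100 * scores.mean()))
```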

The proposed model outperforms the baseline for three of the six classes, i.e., anger, sadness, and surprise, while the baseline achieves higher accuracy for the disgust class. Our model detects all samples of happiness and surprise, i.e., 100% accuracy. The lowest accuracy, 88%, is obtained for the fear class, with one sample misclassified as disgust and one as surprise. Overall, the average accuracy of the proposed model is higher than that of the baseline: 95.63% compared to 93.9%.

Table 5. Experiment A, confusion matrices for six emotion classification over the CK+ dataset, validated by leave-one-out technique.

4.3 Experiment B: Recognition Rate on MMI Dataset for 6 Emotion Classes

The second experiment is performed on the MMI dataset, on which [11] reported the best performance. Since Hasani et al. [11] used 5-fold cross-validation and Zhao et al. [31] used 10-fold cross-validation, we applied 10-fold cross-validation to report our results. Table 6a shows that the proposed model outperforms both baselines significantly, i.e., 95.6% compared to 78.67% and 71.92%.

Since the confusion matrices of the baselines are not provided in the original references, we compare the overall accuracy; the confusion matrix of the proposed model is shown in Table 6b. According to Table 6b, the best accuracy, 100%, is obtained for the happiness and sadness classes, and the lowest accuracy, 90%, for the fear class.

Table 6. Experiment B, six-emotion classification over the MMI dataset.

4.4 Experiment C: Recognition Rate on RAVDESS Dataset for 6 Emotion Classes

For further validation, we tested the proposed model on the RAVDESS dataset [15]. Although RAVDESS contains both facial and speech data, it is mostly used for speech emotion recognition and, to the best of our knowledge, has not been used for FER. To contextualize our results, we therefore used Weka [27] to obtain the accuracy of four well-known classical machine learning models: K-nearest neighbors (1NN and 2NN), a Multilayer Perceptron (MLP) with a learning rate of 0.3, and a decision tree (M5P). The batch size for all models was set to 200.

We designed a subject-independent experiment, i.e., the dataset is partitioned into training and test subsets such that 18 subjects (9 female and 9 male), i.e., 75% of the dataset, form the training set and the remaining 6 subjects (3 female and 3 male), i.e., 25% of the dataset, form the test set.
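A sketch of such a split is given below, assuming subjects holds the RAVDESS actor ID of each sample; the particular held-out actors are placeholders (even IDs are female and odd IDs are male in RAVDESS).

```python
import numpy as np

# Hold out 6 of the 24 actors (3 female, 3 male) as a subject-independent test set.
test_actors = {2, 4, 6, 19, 21, 23}                 # placeholder choice of actors
test_mask = np.isin(subjects, list(test_actors))

X_train, y_train = X[~test_mask], y[~test_mask]     # 18 actors, ~75% of the data
X_test, y_test = X[test_mask], y[test_mask]         # 6 actors, ~25% of the data
```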

Table 7a compares the proposed model (SAE) with the four classical machine learning approaches. The proposed model outperforms the baselines in three of the six classes: anger, fear, and sadness. The best performance for happiness is achieved by 1NN, and for disgust and surprise by the MLP. The overall average accuracy of the proposed SAE-based model is 84.91%, which outperforms all baselines. The confusion matrix of the proposed model on the test set is shown in Table 7b, and the confusion matrices of the baselines on RAVDESS are shown in Table 8.

Table 7. Experiment C, confusion matrices for six emotions over the RAVDESS dataset.
Table 8. Experiment C, confusion matrices for six emotion classification over the RAVDESS dataset for 4 different baselines.

5 Conclusion

Since one facial expression may be ambiguous or similar to other basic emotions, precise Facial Emotion Recognition is a challenging task. To find the best features for recognizing different emotions, we used a Stacked Auto Encoder, which is able to find high-order features that humans cannot craft by hand. The raw input to the SAE is the activation values of the AUs, and the resulting feature set is the combination of the most pivotal AUs for each basic emotion. This feature set is then fed to a Softmax classifier layer to detect the six basic emotions.

The proposed method is compared with several key methods, and the experimental results show that it outperforms all rival methods, achieving average accuracies of 95.63%, 95.55%, and 84.91% on the CK+, MMI, and RAVDESS datasets, respectively. Overall, the best accuracy is obtained for classifying happiness, while the worst is obtained for classifying fear.

In future work, we will apply the proposed method to classify more emotion classes. We will also investigate additional features, such as head pose and gaze direction, to improve the accuracy of the model. Furthermore, we will apply the proposed SAE to estimate the intensity of the detected emotion.