Abstract
Accounting for human emotion in applications and systems has received substantial attention over the last three decades. The traditional approach to emotion detection first extracts a set of features and then applies a classifier, such as an SVM, to predict the emotion class. However, recently proposed deep learning based models outperform traditional machine learning approaches without requiring a separate feature extraction phase.
This paper proposes a novel deep learning based facial emotion detection model that uses facial muscle activity as raw input to recognize the type of the expressed emotion in real time. To this end, we first use OpenFace to extract the activation values of the facial muscles, which are then presented to a Stacked Auto Encoder (SAE) as the feature set. The SAE returns the combination of muscles that best describes a particular emotion, and these extracted features are finally passed to a Softmax layer to perform the multi-class classification task.
The proposed model has been applied to the CK+, MMI and RAVDESS datasets and achieved average accuracies of 95.63%, 95.58%, and 84.91%, respectively, for emotion type detection in six classes, which outperforms state-of-the-art algorithms.
1 Introduction
Over the last two decades we have started to use various smart devices and applications in our daily lives, and robots are being deployed in shops [5], schools and hospitals [22]. However, many of them lose their appeal once the novelty effect wears off. Haag et al. [10] argued that communication between humans and systems can be improved by considering emotions as an additional interaction modality. Meanwhile, researchers have shown that systems which recognize and respond to human emotions are perceived as more caring, likable, supportive and trustworthy [2]. Hence, recognizing human emotion has become an important research topic.
Emotion detection is the ability to recognize another’s affective state, which typically involves integrating and analysing expressions across different modalities, such as facial expression, speech, body movements and gestures [3]. Since 55% of human emotions are conveyed by facial expression [17], Facial Emotion Recognition (FER) is the most investigated method for the human emotion recognition task.
FER contains two main parts, facial expression analysis and facial behavior analysis, as shown in Fig. 1. Facial expression analysis is carried out via two main approaches: feature extraction and Action Unit (AU) detection. Feature extraction approaches start by detecting the face region and facial components, e.g., eyebrows, eyes, nose and mouth, in an input image. Two types of features are then extracted: geometric and appearance features. Geometric features represent the positions of salient points of the face, e.g., the corners of the eyes, the tip of the nose and the mouth, together with the shape of the facial components, while appearance features represent the texture variations of the face, e.g., color, edge density, crinkles, and wrinkles [28]. Finally, a pre-trained machine learning classifier attempts to classify the given face as portraying one emotion [12].
The AU detection methods, however, are independent of facial appearance and analyse facial muscle movements by tracking AUs. Each AU indicates the fundamental movement of a single muscle or a group of muscles. When different emotions are expressed, different combinations of AUs are activated. Ekman [9] defined the Facial Action Coding System (FACS), which encodes the movements of AUs to describe human facial movements and converts the detected AUs to the corresponding emotion. An important advantage of the AU detection methods is that they remove the need to analyse complex high-dimensional features [24].
Facial behavior analysis is the other part of FER. Cohn et al. [4] proposed two conceptual approaches for studying facial behavior: the “message-based” approach and the “sign-based” approach. Message-based approaches categorize facial behaviors according to the meaning of expressions and are widely used by psychologists. Message-based methods can be divided into discrete categorical and continuous dimensional methods. Discrete categorical methods assign an expression to one of a set of pre-defined prototypical categories, such as the six basic emotions proposed by Ekman [8], i.e., anger, disgust, fear, happiness, sadness, and surprise, while continuous dimensional methods describe each facial expression along continuous axes, such as arousal and valence [30].
Sign-based approaches, however, describe facial actions regardless of their meaning, and different expressions are classified based on the activated AUs [19]. Indeed, sign-based approaches are similar to AU detection approaches.
Since sign-based algorithms are trained to detect activated AUs in a given image or video in order to recognize the emotion, sign-based FER problems can be transformed into the problem of activated AU detection [25]. Hence, with a suitable toolkit such as OpenFace [1], the activation values of facial AUs can be obtained and used to train a model for emotion detection. However, as Du et al. [6] showed, determining the exact combination of activated AUs in each emotion is difficult. The main contribution of this study is therefore to find the most pivotal activated AUs in each emotion. To this end, we developed a Stacked Auto Encoder (SAE) deep network on the statuses of 15 facial AUs to extract high-order features of the input data that humans cannot obtain by hand. Given the automatically extracted features, we added a Softmax layer to fulfill the classification task.
The remainder of this paper is structured as follows: Sect. 2 presents a review of previous work. The proposed model is described in Sect. 3. Section 4 presents the experimental results. Finally, Sect. 5 concludes the paper.
2 Related Work
Originally, classical machine learning algorithms such as Bayesian Networks [18], Gaussian Mixture Models [26], Hidden Markov Models [23], and Neural Networks [29] were applied to detect expressed facial emotions. The quality of the training data, e.g., image resolution, face view angle and the way emotions are labeled, strongly influences the results of the training algorithm and is the main obstacle for classical FER algorithms.
In contrast, the promising results of neural network and deep learning (DL) based approaches in comparison with classical machine learning algorithms have led the research community to propose numerous DL based FER methods. The emergence of deep learning as a general end-to-end learning approach also removes the need for handcrafted feature detection [7].
There are two approaches in FER: one that does not use the temporal information of the input, called frame-based, and one that uses the temporal information of the images, known as sequence-based. The input in frame-based approaches is an image without a reference frame, while the input in sequence-based approaches is a sequence of one or more frames [13]. Since our proposed model is frame-based, in this section we focus on state-of-the-art frame-based methods.
Pitaloka et al. [21] used a Convolutional Neural Network (CNN) based method to recognize the 6 basic emotions. The method comprises 5 layers: two convolution layers, two max-pooling layers and a fully connected layer for classification. After pre-processing, the input image is fed to the first convolution layer to extract features such as edges, corners and shapes. The output is then passed to the first max-pooling layer to reduce the image size. The compact image is sent to the second convolution layer to obtain higher-order features and afterward passed to the second max-pooling layer to reduce the final output size. The fully connected layer at the end classifies the output into one of the six basic emotions. However, the performance of this method degrades as the image dimensions increase, owing to the complexity of high-dimensional images.
Liu et al. [14] proposed a sign-based deep neural network architecture called AU-aware Deep Networks (AUDN) in order to investigate the effect of AUs in emotion recognition. The AUDN includes three sequential modules. In the first module, a convolution layer followed by a max-pooling layer generates an over-complete representation of all expression-specific appearance variations. In the second module, an AU-aware receptive field layer searches subsets of the over-complete representation to find those that best simulate combinations of AUs. The third module consists of multilayer Restricted Boltzmann Machines (RBM) that learn hierarchical features. Once the features are obtained, a linear SVM classifier is applied to recognize the six basic emotions. However, the AU-aware layers in the second module are not able to detect all FACS AUs in images.
Although various state-of-the-art algorithms have been proposed in the field of FER, emotion detection remains a challenging problem in computer vision. In this study, we propose a new SAE-based model that addresses the FER challenge in two steps. In the first step, the proposed SAE extracts the most pivotal AUs, and in the second step these extracted AUs are passed to a categorical Softmax classifier to detect the six basic emotions. The next section details the proposed model.
3 Proposed Model
According to the sign-based FER approaches, one way to recognize a facial emotion expression is to detect the status of all individual AUs and then analyse the combinations of activated AUs. For example, if a face has been analysed as having AU5 and AU26 activated, a properly trained algorithm should classify it as expressing “surprise”. However, Du et al. [6] showed that mapping the activated AUs to a specific emotion is difficult if the expressed emotion is a mixture of several emotions. For instance, when someone is surprised by good news, the AUs related to both happiness and surprise can be activated, whereas when someone is shocked by an online scam, the AUs related to sadness, anger and surprise can be activated at the same time. This ambiguity of emotion expression makes FER a challenging task.
The SAE is able to extract higher-order features and detect relations between AUs, which is not possible for human experts or conventional machine learning techniques. We therefore used an SAE deep network to extract the most effective combinations of AUs in each emotion and used them as the feature set to train our classifier. Figure 2 shows the overall scheme of our proposed model for the emotion type detection task, and the list of the applied AUs is shown in Table 1. The next subsections explain the principles of the SAE and the architecture and methodology of the developed deep SAE for emotion type detection.
3.1 Principles of the Stacked Auto Encoder
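The sign-based idea described above can be sketched as a lookup from activated AUs to emotion labels. This is an illustrative sketch only, not the authors' method: the single rule (AU5 and AU26 for surprise) is the example given in the text, while the helper name `match_emotions` and the subset-matching logic are assumptions introduced here.

```python
from typing import Dict, FrozenSet, List, Set

# Rule table: emotion label -> AUs that must all be active.
# Only the "surprise" rule comes from the text; any further rules would
# have to be taken from a FACS reference, so none are invented here.
RULES: Dict[str, FrozenSet[int]] = {
    "surprise": frozenset({5, 26}),
}


def match_emotions(
    active_aus: Set[int], rules: Dict[str, FrozenSet[int]] = RULES
) -> List[str]:
    """Return every emotion whose required AUs are all active (hypothetical helper)."""
    return sorted(label for label, required in rules.items() if required <= active_aus)
```

When the rule table contains several emotions, a mixed expression can satisfy more than one rule at once, so the function returns several labels. That is the ambiguity that motivates learning the AU combinations instead of hand-coding them.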
An SAE is a deep neural network consisting of several hidden layers in which the output of each layer is fed as input to the next layer. The inner layers yield higher-order features, i.e., features that humans cannot easily craft by hand. Equation 1 gives the encoding step for the \(k^{th}\) layer.
where F is the activation function, e.g., sigmoid or Rectified Linear Unit (ReLU), and \(\omega \) and b are the weight vector and bias value corresponding to the units of the \(k^{th}\) layer. The decoding step is given by running the decoding stack of each AE in reverse order, as shown in Eq. 2.
where \(a^n\) contains the information of interest and is the activation of the deepest layer of hidden units. By using the input values as the target output values, the SAE learns high-order, i.e., low-dimensional, features of the input at layer n. This vector gives a representation of the input in terms of higher-order features, which can be used for classification problems by feeding \(a^n\) to a Softmax classifier. After training the SAE, the encoder part of the network is saved and the activation values of the last layer are passed to the classification layer, which uses a Softmax activation function to handle more than two classes.
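Equations 1 and 2 are not reproduced in this text. As a hedged reconstruction, a common formulation of the stacked autoencoder that is consistent with the symbols defined above (F, \(\omega \), b and \(a^n\)) is given below. The index conventions, the input notation \(a^{0} = x\), and the separate decoder parameters \(\tilde{\omega }\) and \(\tilde{b}\) are assumptions, not necessarily the authors' exact notation.

```latex
% Encoding step for layer k (cf. Eq. 1), with a^{0} = x the input vector
a^{k} = F\left(\omega^{k} a^{k-1} + b^{k}\right), \qquad k = 1, \dots, n

% Decoding step (cf. Eq. 2): the decoder stack is applied in reverse order,
% starting from the deepest activation \hat{a}^{n} = a^{n}
\hat{a}^{k-1} = F\left(\tilde{\omega}^{k} \hat{a}^{k} + \tilde{b}^{k}\right), \qquad k = n, \dots, 1
```

Under this formulation the reconstruction of the input is \(\hat{a}^{0}\), and training minimizes the difference between \(\hat{a}^{0}\) and \(a^{0}\).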
3.2 SAE Architecture for Emotion Type Detection
Table 2 shows the architecture of the proposed SAE for emotion detection. We used OpenFace to extract the AU values of the training data. The activation values of 15 different AUs, on both the regression and the binary scale, are represented by a \(30\times 1\) vector and fed as input to the developed SAE, where the regression values of the AUs are normalized to the range 0 to 1. The features extracted by the SAE take the form of a \(10\times 1\) vector, which is passed to the Softmax layer. Feeding this abstract feature vector into the Softmax layer classifies the original input into one of the six basic emotion classes.
The hyperparameters of the SAE, e.g., learning rate and dropout, are selected by a 2D grid search during the learning process. The number of epochs and the batch size are both set to 200. In the fully connected layer, “Adam” is used as the optimizer and “Softmax” as the supervised categorical classifier. Reconstructing the features obtained from the SAE revealed the most pivotal AUs in each emotion, as shown in Table 1.
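The data flow described above (30 inputs, 10 encoded features, 6 Softmax outputs) can be sketched in plain Python. This is a shape-level illustration under stated assumptions, not the authors' implementation: the weights are random and untrained, the intermediate layer of 20 units is hypothetical because Table 2 is not reproduced here, ReLU is only one of the activations mentioned in the text, and dividing intensities by 5 assumes OpenFace's 0 to 5 intensity scale as the normalization.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple

NUM_AUS = 15          # number of AUs used by the model (from the text)
MAX_INTENSITY = 5.0   # assumption: AU intensity reported on a 0-5 scale

Layer = Tuple[List[List[float]], List[float]]  # (weights, biases)


def build_input(intensity: Sequence[float], presence: Sequence[int]) -> List[float]:
    """Concatenate normalized AU intensities and binary AU presences (30 values)."""
    if len(intensity) != NUM_AUS or len(presence) != NUM_AUS:
        raise ValueError("expected 15 intensity and 15 presence values")
    scaled = [min(max(v / MAX_INTENSITY, 0.0), 1.0) for v in intensity]
    return scaled + [float(p) for p in presence]


def dense(x: Sequence[float], layer: Layer, act: Callable[[float], float]) -> List[float]:
    """One fully connected layer: act(W x + b)."""
    weights, biases = layer
    return [act(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, biases)]


def softmax(z: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def random_layer(n_out: int, n_in: int, rng: random.Random) -> Layer:
    """Untrained layer with small random weights and zero biases."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def forward(x: Sequence[float], encoder: List[Layer], classifier: Layer):
    """Run the encoder stack, then the Softmax layer; return (features, probabilities)."""
    a = list(x)
    for layer in encoder:
        a = dense(a, layer, lambda v: max(0.0, v))  # ReLU
    return a, softmax(dense(a, classifier, lambda v: v))


rng = random.Random(0)
encoder = [random_layer(20, 30, rng), random_layer(10, 20, rng)]  # 20 is hypothetical
classifier = random_layer(6, 10, rng)
```

Calling `forward(build_input(intensities, presences), encoder, classifier)` returns the 10-value feature vector and six class probabilities. With untrained weights the probabilities carry no meaning; the sketch only shows how the vectors move through the network.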
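A 2D grid search over learning rate and dropout can be sketched as an exhaustive loop over all pairs. The function name `grid_search_2d` and the `evaluate` callback are assumptions made for illustration; in practice `evaluate` would train the SAE with the given pair and return a validation score, and the candidate values would be chosen by the experimenter, since the text does not list them.

```python
from itertools import product
from typing import Callable, Iterable, Tuple


def grid_search_2d(
    learning_rates: Iterable[float],
    dropouts: Iterable[float],
    evaluate: Callable[[float, float], float],
) -> Tuple[float, float, float]:
    """Return (best_score, learning_rate, dropout) over the full 2D grid."""
    best = None
    for lr, dr in product(learning_rates, dropouts):
        score = evaluate(lr, dr)  # higher is better, e.g. validation accuracy
        if best is None or score > best[0]:
            best = (score, lr, dr)
    if best is None:
        raise ValueError("empty grid")
    return best
```

The cost grows with the product of the two candidate lists, which is why grid search is usually limited to a small number of hyperparameters.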
4 Verification and Results
To verify the accuracy of the proposed model we applied it to three well-known datasets and compared the results with those of two state-of-the-art methods that showed convincing performance on these datasets. Table 3 summarizes the two baselines. One of the baseline methods used a convolutional neural network, while the other used an SVM. Since the baselines differ in their training and testing procedures and in the datasets used, we defined separate experiments to align with each compared method. As both baselines reported the accuracy of their models with confusion matrices, we also report the accuracy of our model with confusion matrices. In the following, we first review the datasets and then discuss the comparison between the proposed model and the baselines through the different experiments. For ease of reading, in the next subsections happiness, sadness, fear, anger, disgust, and surprise are denoted by H, Sa, F, A, D, and Su, respectively.
4.1 Databases
The extended Cohn-Kanade database (CK+) [16] contains 593 frontal-face image sequences of 123 subjects ranging from 18 to 50 years old. However, only 327 sequences from 118 subjects have labels. MMI contains 203 video sequences, including different head poses and subtle expressions, of 19 participants with ages ranging from 19 to 62 years old [20]. The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) [15] contains frontal-face videos of 12 female and 12 male North American actors expressing the six basic emotions, calm, and neutral. The sequences of all datasets start with a neutral frame and end at the apex of the target emotion. Zhao et al. [31] removed the beginning frames, whereas Hasani et al. [11] labeled them as neutral; we likewise removed the beginning frames of both datasets. Table 4 shows the number of samples of each expression class in the CK+ and MMI datasets in our experiments.
4.2 Experiment A: Recognition Rate on CK+ Dataset for 6 Emotion Classes
The first experiment is conducted on the CK+ dataset. The best accuracy on CK+ for six-emotion recognition was presented by Zhao et al. [31] and was obtained with a leave-one-out cross validation strategy. Hence, we also applied the leave-one-out method to verify the accuracy of the proposed model. Table 5 compares the confusion matrix of the proposed model with the results reported in [31] on the CK+ dataset.
The proposed model outperforms the baseline for 3 of the 6 classes, i.e., anger, sadness, and surprise, while the baseline has a higher accuracy for detecting disgust. The model detected all samples of happiness and surprise, i.e., 100% accuracy. The lowest accuracy, 88%, is obtained for the fear class, with one sample misclassified as disgust and one as surprise. Overall, the average accuracy of the proposed model is higher than that of the baseline, i.e., 95.63% compared to 93.9%.
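The per-class and average accuracies discussed above can be derived from a confusion matrix as sketched below. The class abbreviations follow the text; treating the average accuracy as the unweighted mean of the per-class accuracies is an assumption, since the text does not state how the average is computed, and the function names are illustrative.

```python
from typing import List, Sequence

LABELS = ["H", "Sa", "F", "A", "D", "Su"]  # abbreviations used in the text


def confusion_matrix(
    y_true: Sequence[str], y_pred: Sequence[str], labels: Sequence[str] = LABELS
) -> List[List[int]]:
    """Count samples per (true class, predicted class); rows are true classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def per_class_accuracy(matrix: List[List[int]]) -> List[float]:
    """Diagonal count divided by row total; 0.0 for classes without samples."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(matrix)]


def average_accuracy(matrix: List[List[int]]) -> float:
    """Unweighted mean of per-class accuracies over classes that have samples."""
    rates = [row[i] / sum(row) for i, row in enumerate(matrix) if sum(row)]
    if not rates:
        raise ValueError("confusion matrix contains no samples")
    return sum(rates) / len(rates)
```

An unweighted mean gives every emotion the same influence regardless of how many samples it has, which matters for datasets where classes such as fear are under-represented.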
4.3 Experiment B: Recognition Rate on MMI Dataset for 6 Emotion Classes
The second experiment is performed on the MMI dataset, for which Hasani et al. [11] obtained the best performance. Hasani et al. [11] used 5-fold cross validation and Zhao et al. [31] used 10-fold cross validation for verification; we applied 10-fold cross validation to report our results. Table 6a shows that the proposed model outperforms both baselines significantly, i.e., 95.6% compared to 78.67% and 71.92%.
Since the confusion matrices of the baselines are not provided in the original references, we compared the overall accuracies. The confusion matrix of the proposed model is shown in Table 6b. According to Table 6b, the best accuracy, 100%, is obtained for happiness and sadness, and the lowest accuracy, 90%, is obtained for the fear class.
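The cross validation schemes used in Experiments A and B can be sketched with one index generator: k = 10 gives 10-fold cross validation, and setting k to the number of samples gives leave-one-out. The function name is an assumption, and the sketch splits by sample index without shuffling; whether the original experiments split by sample or by subject is not stated in the text.

```python
from typing import Iterator, List, Tuple


def k_fold_indices(n_samples: int, k: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for k contiguous, near-equal folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)  # first `extra` folds get one more
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size
```

Each sample appears in exactly one test fold, so averaging the fold accuracies uses every sample for testing once. In practice the sample order would be shuffled once before splitting.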
4.4 Experiment C: Recognition Rate on RAVDESS Dataset for 6 Emotion Classes
For further validation, we tested the proposed model on the RAVDESS dataset [15]. Although RAVDESS contains both facial and speech data, it is mostly used for speech emotion recognition and, to the best of our knowledge, has not been used for FER. Hence, to confirm our results, we used Weka [27] to obtain the accuracy of four well-known classical machine learning models: K-nearest neighbors with K = 1 and K = 2 (1NN and 2NN), a Multilayer Perceptron (MLP) with a learning rate of 0.3, and a decision tree (M5P). The batch size for all models is set to 200.
We designed a subject-independent experiment, i.e., the dataset is partitioned into training and test subsets by subject: 18 subjects (9 female and 9 male), i.e., 75% of the total dataset, form the training set and the other 6 subjects (3 female and 3 male), i.e., 25% of the total dataset, form the test set.
Table 7a compares the proposed model (SAE) with the four classical machine learning approaches. The proposed model outperforms the baselines in three of the six classes: anger, fear and sadness. The best performance for happiness is achieved by 1NN, and for disgust and surprise by MLP. The overall average accuracy of the proposed SAE-based model is 84.91%, which is higher than that of all baselines. The confusion matrix of the proposed model on the test set is shown in Table 7b, and the confusion matrices of the baselines on RAVDESS are shown in Table 8.
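A subject-independent, gender-balanced split such as the one described above can be sketched as follows. The function name, the subject-to-gender mapping and the random selection of test subjects are assumptions for illustration; the text does not say which subjects were held out or how they were chosen.

```python
import random
from typing import Dict, List, Tuple


def subject_independent_split(
    subjects: Dict[int, str], n_test_per_gender: int = 3, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Split subject IDs so that no subject appears in both train and test sets."""
    rng = random.Random(seed)
    test: List[int] = []
    for gender in sorted(set(subjects.values())):
        ids = sorted(sid for sid, g in subjects.items() if g == gender)
        if len(ids) < n_test_per_gender:
            raise ValueError(f"not enough subjects with gender {gender!r}")
        test.extend(rng.sample(ids, n_test_per_gender))
    held_out = set(test)
    train = sorted(sid for sid in subjects if sid not in held_out)
    return train, sorted(test)
```

All recordings of a held-out subject then go to the test set, so the classifier is evaluated only on faces it has never seen during training.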
5 Conclusion
Since one facial expression may be ambiguous or similar to other basic emotions, precise Facial Emotion Recognition is a challenging task. To find the best features for recognizing different emotions, we used a Stacked Auto Encoder, which is able to find high-order features that humans cannot craft by hand. The raw input to the SAE is the activation values of the AUs, and the final output, i.e., the feature set, is the combination of the most pivotal AUs for each basic emotion. The obtained feature set is then passed to a Softmax classifier layer to detect the 6 basic emotions.
The proposed method is compared with several key methods, and the experimental results show that it outperforms all rival methods. The proposed method achieves average accuracies of 95.63%, 95.55% and 84.91% on the CK+, MMI and RAVDESS datasets, respectively. Overall, the best accuracy is obtained for classifying happiness, while the worst result is obtained for classifying fear.
In future work, we will apply the proposed method to classify more emotion classes. Other features, such as head pose and gaze direction, will also be investigated to improve the accuracy of the proposed model. Furthermore, we will apply the proposed SAE to estimate the intensity of the detected emotion.
References
1. Baltrusaitis, T., Zadeh, A., Lim, Y.C., Morency, L.P.: OpenFace 2.0: facial behavior analysis toolkit. In: 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition, FG 2018, pp. 59–66. IEEE (2018)
2. Brave, S., Nass, C., Hutchinson, K.: Computers that care: investigating the effects of orientation of emotion exhibited by an embodied computer agent. Int. J. Hum.-Comput. Stud. 62(2), 161–178 (2005)
3. Caridakis, G., Castellano, G., Kessous, L., Raouzaiou, A., Malatesta, L., Asteriadis, S., Karpouzis, K.: Multimodal emotion recognition from expressive faces, body gestures and speech. In: IFIP International Conference on Artificial Intelligence Applications and Innovations, pp. 375–388. Springer (2007)
4. Cohn, J.F., Ambadar, Z., Ekman, P.: Observer-based measurement of facial expression with the facial action coding system. In: The Handbook of Emotion Elicitation and Assessment, pp. 203–221 (2007)
5. De Gauquier, L., Cao, H.L., Gomez Esteban, P., De Beir, A., van de Sanden, S., Willems, K., Brengman, M., Vanderborght, B.: Humanoid robot pepper at a Belgian chocolate shop. In: Companion of the 2018 ACM/IEEE International Conference on Human-Robot Interaction, pp. 373–373. ACM (2018)
6. Du, S., Tao, Y., Martinez, A.M.: Compound facial expressions of emotion. In: Proceedings of the National Academy of Sciences, p. 201322355 (2014)
7. Ebrahimi Kahou, S., Michalski, V., Konda, K., Memisevic, R., Pal, C.: Recurrent neural networks for emotion recognition in video. In: Proceedings of the 2015 ACM on International Conference on Multimodal Interaction, pp. 467–474. ACM (2015)
8. Ekman, P.: Strong evidence for universals in facial expressions: a reply to Russell’s mistaken critique (1994)
9. Ekman, P.: Facial Action Coding System (FACS). A human face (2002)
10. Haag, A., Goronzy, S., Schaich, P., Williams, J.: Emotion recognition using bio-sensors: first steps towards an automatic system. In: Tutorial and Research Workshop on Affective Dialogue Systems, pp. 36–48. Springer (2004)
11. Hasani, B., Mahoor, M.H.: Spatio-temporal facial expression recognition using convolutional neural networks and conditional random fields. In: 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition, FG 2017, pp. 790–795. IEEE (2017)
12. Ko, B.: A brief review of facial emotion recognition based on visual information. Sensors 18(2), 401 (2018)
13. Liliana, D.Y., Basaruddin, T.: Review of automatic emotion recognition through facial expression analysis. In: 2018 International Conference on Electrical Engineering and Computer Science (ICECOS), pp. 231–236. IEEE (2018)
14. Liu, M., Li, S., Shan, S., Chen, X.: AU-aware deep networks for facial expression recognition. In: 2013 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), pp. 1–6. IEEE (2013)
15. Livingstone, S.R., Russo, F.A.: The Ryerson audio-visual database of emotional speech and song (RAVDESS): a dynamic, multimodal set of facial and vocal expressions in North American English. PloS One 13(5), e0196391 (2018)
16. Lucey, P., Cohn, J.F., Kanade, T., Saragih, J., Ambadar, Z., Matthews, I.: The extended Cohn-Kanade dataset (CK+): a complete dataset for action unit and emotion-specified expression. In: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition-Workshops, pp. 94–101. IEEE (2010)
17. Mehrabian, A.: Nonverbal Communication. Routledge, Abingdon (2017)
18. Miyakoshi, Y., Kato, S.: Facial emotion detection considering partial occlusion of face using Bayesian network. In: 2011 IEEE Symposium on Computers & Informatics (ISCI), pp. 96–101. IEEE (2011)
19. Mollahosseini, A., Chan, D., Mahoor, M.H.: Going deeper in facial expression recognition using deep neural networks. In: 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1–10. IEEE (2016)
20. Pantic, M., Valstar, M., Rademaker, R., Maat, L.: Web-based database for facial expression analysis. In: 2005 IEEE International Conference on Multimedia and Expo, pp. 5–pp. IEEE (2005)
21. Pitaloka, D.A., Wulandari, A., Basaruddin, T., Liliana, D.Y.: Enhancing CNN with preprocessing stage in automatic emotion recognition. Proc. Comput. Sci. 116, 523–529 (2017)
22. Pop, C.A., Simut, R., Pintea, S., Saldien, J., Rusu, A., David, D., Vanderfaeillie, J., Lefeber, D., Vanderborght, B.: Can the social robot probo help children with autism to identify situation-based emotions? A series of single case experiments. Int. J. Humanoid Rob. 10(03), 1350025 (2013)
23. Schuller, B., Rigoll, G., Lang, M.: Hidden Markov model-based speech emotion recognition. In: Proceedings of 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2003, vol. 2, pp. II–1. IEEE (2003)
24. Tong, Y., Liao, W., Ji, Q.: Facial action unit recognition by exploiting their dynamic and semantic relationships. IEEE Trans. Pattern Anal. Mach. Intell. 29(10), 1683–1699 (2007)
25. De la Torre, F., Cohn, J.F.: Facial expression analysis. In: Visual Analysis of Humans, pp. 377–409. Springer (2011)
26. Vydana, H.K., Kumar, P.P., Krishna, K.S.R., Vuppala, A.K.: Improved emotion recognition using GMM-UBMs. In: 2015 International Conference on Signal Processing and Communication Engineering Systems (SPACES), pp. 53–57. IEEE (2015)
27. Witten, I.H., Frank, E., Hall, M.A., Pal, C.J.: Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, Burlington (2016)
28. Xiaoxi, M., Weisi, L., Dongyan, H., Minghui, D., Li, H.: Facial emotion recognition. In: 2017 IEEE 2nd International Conference on Signal and Image Processing (ICSIP), pp. 77–81. IEEE (2017)
29. Zeng, Z., Pantic, M., Roisman, G.I., Huang, T.S.: A survey of affect recognition methods: audio, visual, and spontaneous expressions. IEEE Trans. Pattern Anal. Mach. Intell. 31(1), 39–58 (2009)
30. Zhang, L., Verma, B., Tjondronegoro, D., Chandran, V.: Facial expression analysis under partial occlusion: a survey. ACM Comput. Surv. (CSUR) 51(2), 25 (2018)
31. Zhao, L., Wang, Z., Zhang, G.: Facial expression recognition from video sequences based on spatial-temporal motion local binary pattern and Gabor multiorientation fusion histogram. Math. Probl. Eng. 2017, 12 (2017)
Acknowledgment
The work leading to these results has received funding from the European Commission 7th Framework Program as a part of the DREAM project, grant no. 611391 and the ICON project ROBO-CURE.
© 2020 Springer Nature Switzerland AG
Cite this paper
Bagheri, E., Bagheri, A., Esteban, P.G., Vanderborgth, B. (2020). A Novel Model for Emotion Detection from Facial Muscles Activity. In: Silva, M., Luís Lima, J., Reis, L., Sanfeliu, A., Tardioli, D. (eds) Robot 2019: Fourth Iberian Robotics Conference. ROBOT 2019. Advances in Intelligent Systems and Computing, vol 1093. Springer, Cham. https://doi.org/10.1007/978-3-030-36150-1_20
DOI: https://doi.org/10.1007/978-3-030-36150-1_20
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-36149-5
Online ISBN: 978-3-030-36150-1
eBook Packages: Intelligent Technologies and Robotics (R0)