Abstract
This paper introduces a multimodal emotion recognition system based on two modalities: affective speech and facial expression. For affective speech, common low-level descriptors comprising prosodic and spectral audio features (energy, zero-crossing rate, MFCC, LPC, PLP, and their temporal derivatives) are extracted, whereas a novel visual feature extraction method is proposed for facial expression. This method exploits the displacement of specific facial landmarks across consecutive frames of an utterance. The time series of temporal variations of each landmark is analyzed individually to extract primary visual features, and the extracted features of all landmarks are then concatenated to construct the final feature vector. The displacement signal of each landmark is analyzed with the discrete wavelet transform, a mathematical transform widely used in signal processing applications. To reduce the complexity of the derived models and improve efficiency, a variety of dimensionality-reduction schemes are applied. Furthermore, to exploit the advantages of multimodal emotion recognition systems, feature-level fusion of the audio features and the proposed visual features is examined. Experiments conducted on three databases (SAVEE, RML, and eNTERFACE05) demonstrate the efficiency of the proposed visual feature extraction method in terms of the performance criteria.
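The visual pipeline described above can be sketched in code: compute the frame-to-frame displacement of each landmark, decompose each displacement signal with a discrete wavelet transform, summarize the sub-bands, and concatenate across landmarks. The sketch below is a minimal illustration, not the paper's implementation; the Haar wavelet, the two-level decomposition, and the mean/std/max sub-band statistics are all assumptions chosen for simplicity.

```python
import numpy as np

def haar_dwt(signal, level=2):
    """Multi-level discrete wavelet decomposition with the Haar wavelet.

    Returns [approximation, detail_level, ..., detail_1] sub-bands.
    (The Haar basis is an assumption; the paper only specifies a DWT.)
    """
    coeffs, approx = [], np.asarray(signal, dtype=float)
    for _ in range(level):
        if len(approx) % 2:           # trim to even length before pairing
            approx = approx[:-1]
        even, odd = approx[::2], approx[1::2]
        coeffs.append((even - odd) / np.sqrt(2.0))   # detail sub-band
        approx = (even + odd) / np.sqrt(2.0)         # approximation
    coeffs.append(approx)
    return coeffs[::-1]

def landmark_dwt_features(landmarks, level=2):
    """Concatenated DWT features from per-frame landmark positions.

    landmarks: array of shape (n_frames, n_landmarks, 2) holding the
    (x, y) position of each tracked landmark in each frame.
    """
    # Displacement magnitude of each landmark between consecutive frames
    disp = np.linalg.norm(np.diff(landmarks, axis=0), axis=2)
    feats = []
    for k in range(disp.shape[1]):            # one time series per landmark
        for band in haar_dwt(disp[:, k], level=level):
            # Summarize each sub-band with simple statistics (an assumption)
            feats.extend([band.mean(), band.std(), np.abs(band).max()])
    return np.asarray(feats)
```

With `level=2`, each landmark contributes `(level + 1) * 3 = 9` statistics, so a 5-landmark utterance yields a 45-dimensional vector; the resulting vector could then feed the dimensionality-reduction and feature-level-fusion stages the abstract mentions.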
Acknowledgment
The authors gratefully acknowledge the financial support provided by the Institute of Science and High Technology and Environmental Sciences, Graduate University of Advanced Technology, Kerman, Iran, under Contract Number 3165.
Rahdari, F., Rashedi, E. & Eftekhari, M. A Multimodal Emotion Recognition System Using Facial Landmark Analysis. Iran J Sci Technol Trans Electr Eng 43 (Suppl 1), 171–189 (2019). https://doi.org/10.1007/s40998-018-0142-9