A Multimodal Emotion Recognition System Using Facial Landmark Analysis

  • Research paper
Iranian Journal of Science and Technology, Transactions of Electrical Engineering

Abstract

This paper introduces a multimodal emotion recognition system based on two modalities: affective speech and facial expression. For affective speech, common low-level descriptors comprising prosodic and spectral audio features (energy, zero-crossing rate, MFCC, LPC, PLP and their temporal derivatives) are extracted, whereas a novel visual feature extraction method is proposed for facial expression. This method exploits the displacement of specific facial landmarks across consecutive frames of an utterance. To this end, the time series of temporal variations of each landmark is analyzed individually to extract primary visual features, and the features of all landmarks are then concatenated to construct the final feature vector. The displacement signal of each landmark is analyzed with the discrete wavelet transform, a mathematical transform widely used in signal processing applications. To reduce the complexity of the derived models and improve efficiency, a variety of dimensionality-reduction schemes are applied. Furthermore, to exploit the advantages of multimodal emotion recognition systems, feature-level fusion of the audio features and the proposed visual features is examined. Experiments conducted on three databases (SAVEE, RML and eNTERFACE05) demonstrate the efficiency of the proposed visual feature extraction method in terms of the performance criteria.
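The pipeline described in the abstract — decompose each landmark's displacement time series with a discrete wavelet transform, summarize the sub-bands, concatenate across landmarks, then fuse with audio features — can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: a plain Haar decomposition stands in for whichever wavelet the paper uses, and the landmark count (68), frame count (120), per-band statistics, and 39-dimensional audio vector are all assumed for illustration.

```python
import numpy as np

def haar_dwt(signal, level=3):
    """Simple multilevel Haar decomposition (illustrative stand-in for a DWT library)."""
    coeffs = []
    approx = np.asarray(signal, dtype=float)
    for _ in range(level):
        if approx.size % 2:                      # pad to even length
            approx = np.append(approx, approx[-1])
        pairs = approx.reshape(-1, 2)
        coeffs.append((pairs[:, 0] - pairs[:, 1]) / np.sqrt(2.0))   # detail band
        approx = (pairs[:, 0] + pairs[:, 1]) / np.sqrt(2.0)         # next approximation
    coeffs.append(approx)                        # final approximation band
    return coeffs

def landmark_features(displacement, level=3):
    """Primary visual features for one landmark: statistics of each wavelet sub-band."""
    feats = []
    for band in haar_dwt(displacement, level):
        feats.extend([band.mean(), band.std(), np.abs(band).max()])
    return np.array(feats)

# Synthetic stand-ins: 68 landmark displacement signals over 120 frames,
# and a 39-dim audio vector (e.g. 13 MFCCs plus deltas and delta-deltas).
rng = np.random.default_rng(0)
landmarks = rng.normal(size=(68, 120))
visual = np.concatenate([landmark_features(d) for d in landmarks])  # 68 * 12 = 816 dims
audio = rng.normal(size=39)
fused = np.concatenate([visual, audio])          # feature-level fusion: 855 dims
```

In practice the fused vector would then pass through one of the dimensionality-reduction schemes the paper mentions before classification.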



Acknowledgment

The authors gratefully acknowledge the financial support provided by Institute of Science and High Technology and Environmental Sciences, Graduate University of Advanced Technology, Kerman, Iran, under Contract Number 3165.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Farhad Rahdari.

About this article

Cite this article

Rahdari, F., Rashedi, E. & Eftekhari, M. A Multimodal Emotion Recognition System Using Facial Landmark Analysis. Iran J Sci Technol Trans Electr Eng 43 (Suppl 1), 171–189 (2019). https://doi.org/10.1007/s40998-018-0142-9

