Multimedia Systems, Volume 25, Issue 5, pp 451–461

Multimodal shared features learning for emotion recognition by enhanced sparse local discriminative canonical correlation analysis

  • Jiamin Fu
  • Qirong Mao (Email author)
  • Juanjuan Tu
  • Yongzhao Zhan
Special Issue Paper


Multimodal emotion recognition is a challenging research topic that has recently begun to attract the attention of the research community. To better recognize the emotions of video users, research on multimodal emotion recognition based on audio and video is essential. The performance of multimodal emotion recognition depends heavily on finding a good shared feature representation. A good shared representation must satisfy two requirements: (1) it retains the characteristics of each modality, and (2) it balances the effect of the different modalities so that the final decision is optimal. In light of these requirements, we propose a novel Enhanced Sparse Local Discriminative Canonical Correlation Analysis (En-SLDCCA) approach to learn the multimodal shared feature representation. Learning the shared feature representation involves two stages. In the first stage, we pretrain a Sparse Auto-Encoder on each modality separately (video or audio) to obtain the hidden feature representations of video and audio. In the second stage, we obtain the correlation coefficients of video and audio using our En-SLDCCA approach, and then use these coefficients to fuse the video and audio features into the shared feature representation. We evaluate our method on the challenging multimodal eNTERFACE’05 database. Experimental results show that our method outperforms unimodal video (or audio) recognition and significantly improves multimodal emotion recognition performance compared with the current state of the art.
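The second stage can be illustrated with a minimal sketch, under clearly stated assumptions. It uses plain linear canonical correlation analysis as a stand-in, not the En-SLDCCA method itself, which adds sparsity, locality and discriminative terms that the abstract does not specify. The function names `linear_cca` and `fuse`, the regularisation value `reg`, the synthetic input data, and the choice to fuse by summing the two projections are all hypothetical choices made for illustration.

```python
import numpy as np


def linear_cca(X, Y, k, reg=1e-3):
    """Plain linear CCA (a stand-in, NOT En-SLDCCA).

    X: (n, p) features of one modality, Y: (n, q) features of the other.
    Returns Wx (p, k), Wy (q, k) and the top-k canonical correlations.
    """
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Regularised covariance estimates; `reg` is an illustrative value.
    Sxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / (n - 1)
    # Whiten each view with its Cholesky factor, then take the SVD of
    # the whitened cross-covariance; singular values are the correlations.
    Lx = np.linalg.cholesky(Sxx)
    Ly = np.linalg.cholesky(Syy)
    T = np.linalg.solve(Lx, Sxy) @ np.linalg.inv(Ly).T
    U, s, Vt = np.linalg.svd(T, full_matrices=False)
    # Map the whitened directions back to the original feature spaces.
    Wx = np.linalg.solve(Lx.T, U[:, :k])
    Wy = np.linalg.solve(Ly.T, Vt[:k].T)
    return Wx, Wy, s[:k]


def fuse(X, Y, Wx, Wy):
    """Fuse two views by summing their projections (an assumed fusion rule)."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    return Xc @ Wx + Yc @ Wy


# Synthetic stand-ins for the hidden features that the first stage
# (one sparse auto-encoder per modality) would produce.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 2))
video_hidden = latent @ rng.normal(size=(2, 6)) + 0.1 * rng.normal(size=(500, 6))
audio_hidden = latent @ rng.normal(size=(2, 4)) + 0.1 * rng.normal(size=(500, 4))

Wx, Wy, corrs = linear_cca(video_hidden, audio_hidden, k=2)
shared = fuse(video_hidden, audio_hidden, Wx, Wy)
print(shared.shape, corrs)
```

In this sketch the fused matrix `shared` would then be passed to a classifier. Summation keeps the shared representation at `k` dimensions; concatenating the two projections is an equally plausible alternative.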


Multimodal emotion recognition; Multimodal shared feature learning; Multimodal information fusion; Canonical correlation analysis



This work is supported by the National Natural Science Foundation of China (Nos. 61272211 and 61672267) and the General Financial Grant from the China Postdoctoral Science Foundation (No. 2015M570413).



Copyright information

© Springer-Verlag Berlin Heidelberg 2017

Authors and Affiliations

  • Jiamin Fu (1)
  • Qirong Mao (1), Email author
  • Juanjuan Tu (2)
  • Yongzhao Zhan (1)
  1. School of Computer Science and Communication Engineering, Jiangsu University, Jiangsu, China
  2. School of Computer Science and Engineering, Jiangsu University of Science and Technology, Jiangsu, China
