
An Effective Discriminative Learning Approach for Emotion-Specific Features Using Deep Neural Networks

  • Shuiyang Mao
  • Pak-Chung Ching
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11304)

Abstract

Speech contains rich yet entangled information, ranging from phonetic to emotional components. Because these components are mixed together, many downstream tasks are prevented from achieving better performance, and automatically learning a representation that disentangles them is non-trivial. In this paper, we propose a hierarchical method that extracts utterance-level features from frame-level acoustic features using deep neural networks (DNNs). Moreover, inspired by recent progress in face recognition, we introduce centre loss as a supervision signal complementary to the traditional softmax loss, encouraging intra-class compactness in the learned features. Under the joint supervision of these two loss functions, the DNNs learn separable and discriminative emotion-specific features. Experiments on the CASIA corpus, the Emo-DB corpus and the SAVEE database show results comparable with those of state-of-the-art approaches.
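To make the joint supervision concrete, below is a minimal sketch of combining the softmax (cross-entropy) loss with centre loss, in the spirit of the formulation of Wen et al. It assumes a PyTorch implementation; the names and values used here (lam for the balancing weight, feat_dim, num_emotions, the batch size) are illustrative and are not the configuration reported in the paper.

    # Minimal sketch: joint softmax + centre loss supervision (assumed PyTorch).
    import torch
    import torch.nn as nn

    class CenterLoss(nn.Module):
        """Centre loss: mean squared distance between each deep feature and
        the learnable centre of its class, promoting intra-class compactness."""
        def __init__(self, num_classes, feat_dim):
            super().__init__()
            # one learnable centre per emotion class, trained jointly with the network
            self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))

        def forward(self, features, labels):
            centers_batch = self.centers[labels]              # (batch, feat_dim)
            return 0.5 * ((features - centers_batch) ** 2).sum(dim=1).mean()

    # Illustrative sizes: 6 emotion classes, 128-dim utterance-level features,
    # and a balancing weight lam (the lambda in L = L_softmax + lambda * L_centre).
    num_emotions, feat_dim, lam = 6, 128, 0.01
    classifier = nn.Linear(feat_dim, num_emotions)   # softmax classifier head
    center_loss = CenterLoss(num_emotions, feat_dim)
    ce_loss = nn.CrossEntropyLoss()                  # softmax (cross-entropy) loss

    # Stand-in for utterance-level features produced by the hierarchical DNN.
    features = torch.randn(32, feat_dim)
    labels = torch.randint(0, num_emotions, (32,))

    # Joint supervision: both the classifier and the class centres receive gradients.
    loss = ce_loss(classifier(features), labels) + lam * center_loss(features, labels)
    loss.backward()

In this sketch the class centres are simply learnable parameters updated by back-propagation; lam trades off inter-class separability (softmax loss) against intra-class compactness (centre loss).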

Keywords

Speech emotion recognition · Deep neural networks · Hierarchical method · Centre loss

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. Department of Electronic Engineering, The Chinese University of Hong Kong, Hong Kong SAR, China
