Skip to main content
Log in

Dimensionality reduction-based spoken emotion recognition

  • Published:
Multimedia Tools and Applications Aims and scope Submit manuscript

Abstract

To improve effectively the performance on spoken emotion recognition, it is needed to perform nonlinear dimensionality reduction for speech data lying on a nonlinear manifold embedded in a high-dimensional acoustic space. In this paper, a new supervised manifold learning algorithm for nonlinear dimensionality reduction, called modified supervised locally linear embedding algorithm (MSLLE) is proposed for spoken emotion recognition. MSLLE aims at enlarging the interclass distance while shrinking the intraclass distance in an effort to promote the discriminating power and generalization ability of low-dimensional embedded data representations. To compare the performance of MSLLE, not only three unsupervised dimensionality reduction methods, i.e., principal component analysis (PCA), locally linear embedding (LLE) and isometric mapping (Isomap), but also five supervised dimensionality reduction methods, i.e., linear discriminant analysis (LDA), supervised locally linear embedding (SLLE), local Fisher discriminant analysis (LFDA), neighborhood component analysis (NCA) and maximally collapsing metric learning (MCML), are used to perform dimensionality reduction on spoken emotion recognition tasks. Experimental results on two emotional speech databases, i.e. the spontaneous Chinese database and the acted Berlin database, confirm the validity and promising performance of the proposed method.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9
Fig. 10
Fig. 11
Fig. 12
Fig. 13

Similar content being viewed by others

References

  1. Ang J, Dhillon R, Krupski A, Shriberg E, Stolcke A (2002) Prosody-based automatic detection of annoyance and frustration in human-computer dialog. In: 7th International Conference on Spoken Language Processing (ICSLP’02), Denver, Colorado, pp. 2037–2040

  2. Banse R, Scherer KR (1996) Acoustic profiles in vocal emotion expression. J Pers Soc Psychol 70:614–636. doi:10.1037/0022-3514.70.3.614

    Article  Google Scholar 

  3. Batliner A, Buckow A, Niemann H, Noth E, Warnke V (2000) The prosody module. VERBMOBIL: foundations of speech-to-speech translations: 106–121

  4. Batliner A, Steidl S, Schuller B, Seppi D, Vogt T, Wagner J, Devillers L, Vidrascu L, Amir N, Kessous L, Aharonson V (2011) Whodunnit–searching for the most important feature types signalling emotion-related user states in speech. Comput Speech Lang 25(1):4–28. doi:10.1016/j.csl.2009.12.003

    Article  Google Scholar 

  5. Bengio Y, Paiement J, Vincent P, Delalleau O, Le Roux N, Ouimet M (2004) Out-of-sample extensions for lle, isomap, mds, eigenmaps, and spectral clustering. In: Advances in Neural Information Processing Systems, vol 16. MIT Press, Cambridge, MA, USA

  6. Boersma P, Weenink D (2009) Praat: doing phonetics by computer (version 5.1.05) [computer program]. Retrieved May 1, 2009, from http://www.praat.org/

  7. Burkhardt F, Paeschke A, Rolfes M, Sendlmeier W, Weiss B (2005) A database of German emotional speech. In: Interspeech-2005, Lisbon, Portugal, pp. 1–4

  8. Carletta J (1996) Assessing agreement on classification tasks: the kappa statistic. Comput Ling 22(2):249–254

    Google Scholar 

  9. Chang Y, Hu C, Feris R, Turk M (2006) Manifold based analysis of facial expression. Image Vis Comput 24(6):605–614. doi:10.1016/j.imavis.2005.08.006

    Article  Google Scholar 

  10. Chang C, Lin C (2001) LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/cjlin/libsvm

  11. Cowie R, Cornelius R (2003) Describing the emotional states that are expressed in speech. Speech Comm 40(1–2):5–32. doi:10.1016/S0167-6393(02)00071-7

    Article  MATH  Google Scholar 

  12. Cowie R, Douglas-Cowie E, Tsapatsoulis N, Votsis G, Kollias S, Fellenz W, Taylor JG (2001) Emotion recognition in human-computer interaction. IEEE Signal Process Mag 18(1):32–80. doi:10.1109/79.911197

    Article  Google Scholar 

  13. Daza-Santacoloma G, Acosta-Medina C, Castellanos-Domínguez G (2010) Regularization parameter choice in locally linear embedding. Neurocomputing 73(10–12):1595–1605. doi:10.1016/j.neucom.2009.11.038

    Article  Google Scholar 

  14. de Ridder D, Duin R (2002) Locally linear embedding for classification. Pattern Recognition Group, Dept of Imaging Science & Technology, Delft University of Technology, Delft, The Netherlands, Tech Rep PH-2002-01

  15. de Ridder D, Kouropteva O, Okun O, Pietikainen M, Duin R (2003) Supervised locally linear embedding. In: Artificial Neural Networks and Neural Information Processing-ICANN/ICONIP 2003, Lecture Notes in Computer Science 2714, vol 2714. Springer, pp 333–341

  16. Dellaert F, Polzin T, Waibel A (1996) Recognizing emotion in speech. In: 4th International Conference on Spoken Language Processing (ICSLP’96), Philadelphia, PA, USA, pp. 1970–1973

  17. Ekman P (1992) An argument for basic emotions. Cognit Emot 6(3):169–200. doi:10.1080/02699939208411068

    Article  Google Scholar 

  18. Errity A, McKenna J (2006) An investigation of manifold learning for speech analysis. In: Ninth International Conference on Spoken Language Processing (ICSLP’06), Pittsburgh, PA, USA, pp. 2506–2509

  19. Fernandez R, Picard R (2003) Modeling drivers’ speech under stress. Speech Comm 40(1–2):145–159. doi:10.1016/S0167-6393(02)00080-8

    Article  MATH  Google Scholar 

  20. Fisher R (1936) The use of multiple measures in taxonomic problems. Ann Eugenics 7:179–188

    Article  Google Scholar 

  21. Fukunaga K (1990) Introduction to statistical pattern recognition, 2nd edn. Academic, Boston

    MATH  Google Scholar 

  22. Globerson A, Roweis S (2006) Metric learning by collapsing classes. In: Advances in neural information processing systems, vol 18. MIT Press, Cambridge, MA, pp 451–458

  23. Gobl C, Ni Chasaide A (2003) The role of voice quality in communicating emotion, mood and attitude. Speech Comm 40(1–2):189–212. doi:10.1016/S0167-6393(02)00082-1

    Article  MATH  Google Scholar 

  24. Goddard J, Schlotthauer G, Torres M, Rufiner H (2009) Dimensionality reduction for visualization of normal and pathological speech data. Biomed Signal Process Contr 4(3):194–201. doi:10.1016/j.bspc.2009.01.001

    Article  Google Scholar 

  25. Goldberger J, Roweis S, Hinton G, Salakhutdinov R (2005) Neighbourhood components analysis. In: Advances in Neural Information Processing Systems (NIPS), vol 17. MIT Press, Cambridge, MA, pp 513–520

  26. He X, Niyogi P (2003) Locality preserving projections. In: Advances in neural information processing systems (NIPS), vol 16. MIT Press, Cambridge, MA, pp 153–160

  27. Hozjan V, Kacic Z (2003) Improved emotion recognition with large set of statistical features. In: EUROSPEECH-2003, Geneva, pp. 133–136

  28. Hsu C, Chang C, Lin C (2003) A practical guide to support vector classification. Tech. Rep. Taipei

  29. Iliev A, Scordilis M, Papa J, Falcao A (2010) Spoken emotion recognition through optimum-path forest classification using glottal features. Comput Speech Lang 24(3):445–460. doi:10.1016/j.csl.2009.02.005

    Article  Google Scholar 

  30. Iliev A, Zhang Y, Scordilis M (2007) Spoken emotion classification using ToBI features and GMM. In: IEEE 6th EURASIP Conference Focused on Speech and Image Processing, Maribor, Slovenia, pp. 495–498

  31. Jain V, Saul L (2004) Exploratory analysis and visualization of speech and music by locally linear embedding. In: IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP’04), Montreal, Canada, pp. 984–987

  32. Jansen A, Niyogi P (2005) A geometric perspective on speech sounds. University of Chicago, Tech Rep

  33. Jansen A, Niyogi P (2006) Intrinsic fourier analysis on the manifold of speech sounds. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’06), Toulouse, France, pp. 241–244

  34. Johnstone T, Scherer K (1999) The effects of emotions on voice quality. In: XIVth International Congress of Phonetic Science, San Francisco, pp. 2029–2032

  35. Jolliffe IT (1986) Principal component analysis, 2nd edn. Springer, New York

    Book  Google Scholar 

  36. Kayo O, Design C, Ahonen R (2006) Locally linear embedding algorithm extensions and applications. Faculty of Technology, University of Oulu

  37. Kim J, Lee S, Narayanan S (2010) An exploratory study of manifolds of emotional speech In: 2010 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP’2010), Dallas, Texas, USA, pp. 5142–5145

  38. Kouropteva O, Okun O, Pietikainen M (2003) Classification of handwritten digits using supervised locally linear embedding algorithm and support vector machine. In: 11th European Symposium on Artificial Neural Networks Bruges, Belgium, pp. 229–234

  39. Kwon O, Chan K, Hao J, Lee T (2003) Emotion recognition by speech signals. In: EUROSPEECH-2003, Geneva, Switzerland, pp. 125–128

  40. Lee CM, Narayanan SS (2005) Toward detecting emotions in spoken dialogs. IEEE Trans Audio Speech Lang Process 13(2):293–303. doi:10.1109/TSA.2004.838534

    Article  Google Scholar 

  41. Lee CM, Narayanan SS, Pieraccini R (2001) Recognition of negative emotions from the speech signal. In: IEEE Workshop Automatic Speech Recognition and Understanding (ASRU), Trento, pp. 240–243

  42. Lee C, Narayanan S, Pieraccini R (2002) Combining acoustic and language information for emotion recognition. In: 7th International Conference on Spoken Language Processing (ICSLP’02), Denver, Colorado, USA, pp. 873–876

  43. Lee C, Yildirim S, Bulut M, Kazemzadeh A, Busso C, Deng Z, Lee S, Narayanan S (2004) Emotion recognition based on phoneme classes. In: International Conference on Spoken Language Processing (ICSLP’04), Jeju, Korea, pp. 889–892

  44. Li B, Zheng C-H, Huang D-S (2008) Locally linear discriminant embedding: an efficient method for face recognition. Pattern Recogn 42(12):3813–3821. doi:10.1016/j.patcog.2008.05.027

    Article  Google Scholar 

  45. Liang D, Yang J, Zheng Z, Chang Y (2005) A facial expression recognition system based on supervised locally linear embedding. Pattern Recogn Lett 26(15):2374–2389. doi:10.1016/j.patrec.2005.04.011

    Article  Google Scholar 

  46. Monzo C, Alías F, Iriondo I, Gonzalvo X, Planet S (2007) Discriminating expressive speech styles by voice quality parameterization. In: 16th International Congress of Phonetic Sciences, Saarbruken, Germany, pp. 2081–2084

  47. Morrison D, Wang R, De Silva LC (2007) Ensemble methods for spoken emotion recognition in call-centres. Speech Comm 49(2):98–112. doi:10.1016/j.specom.2006.11.004

    Article  Google Scholar 

  48. Nicholson J, Takahashi K, Nakatsu R (2000) Emotion recognition in speech using neural networks. Neural Comput Appl 9(4):290–296. doi:10.1007/s005210070006

    Article  MATH  Google Scholar 

  49. Nwe TL, Foo SW, De Silva LC (2003) Speech emotion recognition using hidden Markov models. Speech Comm 41(4):603–623. doi:10.1016/s01167-6393(03)00099-2

    Article  Google Scholar 

  50. Osgood C, May W, Miron M (1975) Cross-cultural universals of affective meaning. University of Illinois Press

  51. Pao T, Chen Y, Yeh J, Liao W (2005) Combining acoustic features for improved emotion recognition in Mandarin speech. In: Affective Computing and Intelligent Interaction. pp 279–285

  52. Pearson K (1901) On lines and planes of closest fit to systems of points in space. Phil Mag 2(6):559–572

    Google Scholar 

  53. Petrushin V (1999) Emotion in speech: recognition and application to call centers. In: Proc. 1999 Artificial Neural Networks in Engineering (ANNIE ’99), New York, pp. 7–10

  54. Petrushin V (2000) Emotion recognition in speech signal: experimental study, development, and application. In: 6th International Conference on Spoken Language Processing (ICSLP’00), Beijing, China, pp. 222–225

  55. Picard R (1997) Affective computing. MIT, Cambridge

    Google Scholar 

  56. Picard R (2001) Affective medicine: technology with emotional intelligence. Future of health technology. OIS, Cambridge, pp 69–85

    Google Scholar 

  57. Picard R, Klein J (2002) Computers that recognise and respond to user emotion: theoretical and practical implications. Interact Comput 14(2):141–169. doi:10.1016/S0953-5438(01)00055-8

    Article  Google Scholar 

  58. Platt J (1999) Fast training of support vector machines using sequential minimal optimization. In: Advances in kernel methods: support vector learning. MIT press, Cambridge, MA, USA, pp 185–208

  59. Rong J, Li G, Chen Y-PP (2009) Acoustic feature selection for automatic emotion recognition from speech. Inform Process Manag 45(3):315–328

    Article  Google Scholar 

  60. Roweis ST, Saul LK (2000) Nonlinear dimensionality reduction by locally linear embedding. Science 290(5500):2323–2326. doi:10.1126/science.290.5500.2323

    Article  Google Scholar 

  61. Saul LK, Roweis ST (2003) Think globally, fit locally: unsupervised learning of low dimensional manifolds. J Mach Learn Res 4:119–155

    MathSciNet  Google Scholar 

  62. Scherer K (2003) Vocal communication of emotion: a review of research paradigms. Speech Comm 40(1–2):227–256. doi:10.1016/S0167-6393(02)00084-5

    Article  MATH  Google Scholar 

  63. Scherer S, Schwenker F, Palm G (2009) Classifier fusion for emotion recognition from speech. In: Advanced Intelligent Environments. Springer, pp 95–117

  64. Schuller B, Batliner A, Seppi D, Steidl S, Vogt T, Wagner J, Devillers L, Vidrascu L, Amir N, Kessous L (2007) The relevance of feature type for the automatic classification of emotional user states: low level descriptors and functionals. In: INTERSPEECH-2007, Antwerp, Belgium, pp. 2253–2256

  65. Schuller B, Rigoll G, Lang M (2004) Speech emotion recognition combining acoustic features and linguistic information in a hybrid support vector machine-belief network architecture. In: IEEE International Conference on Acoustics, Speech, and Signal Processing, Montreal, Quebec, Canada, pp. 577–580

  66. Schuller B, Seppi D, Batliner A, Maier A, Steidl S (2007) Towards more reality in the recognition of emotional speech. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’07), Honolulu, Hawai’i, USA, pp. 941–944

  67. Shami M, Verhelst W (2007) An evaluation of the robustness of existing supervised machine learning approaches to the classification of emotions in speech. Speech Comm 49(3):201–212. doi:10.1016/j.specom.2007.01.006

    Article  Google Scholar 

  68. Sugiyama M (2007) Dimensionality reduction of multimodal labeled data by local fisher discriminant analysis. J Mach Learn Res 8:1027–1061

    MATH  Google Scholar 

  69. Tenenbaum JB, Silva VD, Langford JC (2000) A global geometric framework for nonlinear dimensionality reduction. Science 290(5500):2319–2323. doi:10.1126/science.290.5500.2319

    Article  Google Scholar 

  70. Valencia-Aguirre J, Álvarez-Mesa A, Daza-Santacoloma G, Castellanos-Domínguez G (2009) Automatic choice of the number of nearest neighbors in locally linear embedding. Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications: 77–84

  71. van der Maaten L, Hinton G (2008) Visualizing data using t-SNE. J Mach Learn Res 9:2579–2605

    MATH  Google Scholar 

  72. Van der Maaten L, Postma E, Van den Herik H (2009) Dimensionality reduction: a comparative review. TiCC TR 2009–005

  73. Vapnik V (2000) The nature of statistical learning theory. Springer-Verlag, New York

    MATH  Google Scholar 

  74. Ververidis D, Kotropoulos C (2005) Emotional speech classification using Gaussian mixture models. In: IEEE International Conference on Multimedia and Expo (ICME’05), Amsterdam, The Netherlands, pp. 2871–2874

  75. Ververidis D, Kotropoulos C (2006) Emotional speech recognition: resources, features, and methods. Speech Comm 48(9):1162–1181. doi:10.1016/j.specom.2006.04.003

    Article  Google Scholar 

  76. Ververidis D, Kotropoulos C (2008) Fast and accurate sequential floating forward feature selection with the Bayes classifier applied to speech emotion recognition. Signal Process 88(12):2956–2970. doi:10.1016/j.sigpro.2008.07.001

    Article  MATH  Google Scholar 

  77. Ververidis D, Kotropoulos C, Pitas I (2004) Automatic emotional speech classification. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’04), Montreal, Quebec, Canada, pp. 593–596

  78. Wang Y, Guan L (2004) An investigation of speech-based human emotion recognition. In: IEEE 6th Workshop on Multimedia Signal Processing, Siena, Italy pp. 15–18

  79. Wang M, Yang J, Xu Z, Chou K (2005) SLLE for predicting membrane protein types. J Theor Biol 232(1):7–15. doi:10.1016/j.jtbi.2004.07.023

    Article  MathSciNet  Google Scholar 

  80. Xiao Z, Dellandrea E, Dou W, Chen L (2010) Multi-stage classification of emotional speech motivated by a dimensional emotion model. Multimed Tool Appl 46(1):119–145. doi:10.1007/s11042-009-0319-3

    Article  Google Scholar 

  81. Yildirim S, Narayanan S, Potamianos A (2011) Detecting emotional state of a child in a conversational computer game. Comput Speech Lang 25(1):29–44. doi:10.1016/j.csl.2009.12.004

    Article  Google Scholar 

  82. You M, Chen C, Bu J, Liu J, Tao J (2006) Emotional speech analysis on nonlinear manifold. In: 18th International Conference on Pattern Recognition (ICPR 2006), Hong Kong, pp. 91–94

  83. You M, Chen C, Bu J, Liu J, Tao J (2007) Manifolds based emotion recognition in speech. Comput Ling Chin Lang Process 12(1):49–64

    Google Scholar 

  84. Zhang S (2008) Emotion recognition in Chinese natural speech by combining prosody and voice quality features. In: Advances in Neural Networks–ISNN 2008, Lecture Notes in Computer Science 5264, vol 5264. Springer, pp 457–464

  85. Zhao L, Zhang Z (2009) Supervised locally linear embedding with probability-based distance for classification. Comput Math Appl 57(6):919–926. doi:10.1016/j.camwa.2008.10.055

    Article  MATH  Google Scholar 

Download references

Acknowledgements

The authors would like to thank all the anonymous reviewers and editors for their helpful comments and suggestions about the improvement of this paper. This work is supported by Zhejiang Provincial Natural Science Foundation of China under Grant No. Z1101048 and No. Y1111058.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Shiqing Zhang.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Zhang, S., Zhao, X. Dimensionality reduction-based spoken emotion recognition. Multimed Tools Appl 63, 615–646 (2013). https://doi.org/10.1007/s11042-011-0887-x

Download citation

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s11042-011-0887-x

Keywords

Navigation