
International Journal of Speech Technology, Volume 16, Issue 2, pp 181–201

Characterization and recognition of emotions from speech using excitation source information

  • Sreenivasa Rao Krothapalli
  • Shashidhar G. Koolagudi

Abstract

This paper explores excitation source features of the speech production mechanism for characterizing and recognizing emotions from the speech signal. The excitation source signal is obtained from the speech signal using linear prediction (LP) analysis, and is also known as the LP residual. The glottal volume velocity (GVV) signal, derived from the LP residual, is also used to represent the excitation source. The speech signal has a high signal-to-noise ratio around the instants of glottal closure (GC), which are also known as epochs. In this paper, the following excitation source features are proposed for characterizing and recognizing emotions: the sequence of LP residual samples and their phase information, the parameters of epochs and their dynamics at the syllable and utterance levels, and the samples of the GVV signal and its parameters. Auto-associative neural networks (AANN) and support vector machines (SVM) are used to develop the emotion recognition models. The Telugu and Berlin emotional speech corpora are used to evaluate the developed models. Anger, disgust, fear, happiness, neutral and sadness are the six emotions considered in this study. Average emotion recognition performance of about 42 % to 63 % is observed using the different excitation source features. Further, combining the excitation source features with spectral features is shown to improve the emotion recognition performance to up to 84 %.
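The LP residual mentioned above is obtained by inverse-filtering the speech signal with its own linear prediction coefficients, so that the vocal-tract (system) contribution is suppressed and mainly the excitation component remains. The following is a minimal illustrative sketch of that step in Python, not the authors' exact implementation; the use of librosa, the 12th-order model, and the frame sizes are assumptions typical for speech sampled at 8–16 kHz.

```python
# Minimal sketch of LP-residual extraction by inverse filtering.
# Assumes librosa and scipy are installed; parameters are illustrative.
import numpy as np
import librosa
import scipy.signal


def lp_residual(frame: np.ndarray, order: int = 12) -> np.ndarray:
    """Inverse-filter one speech frame with its own LP coefficients.

    The LP filter models the vocal-tract (system) component, so the
    inverse-filtered output approximates the excitation source signal.
    """
    a = librosa.lpc(frame.astype(float), order=order)  # a = [1, a1, ..., ap]
    return scipy.signal.lfilter(a, [1.0], frame)       # apply A(z) to get e[n]


# Hypothetical usage on 25 ms frames (400 samples) of a 16 kHz signal
# `speech`, with a 10 ms (160-sample) hop:
# residuals = [lp_residual(speech[i:i + 400])
#              for i in range(0, len(speech) - 400, 160)]
```

Epoch locations and GVV parameters, as used in the paper, would then be derived from this residual signal rather than from the raw waveform.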

Keywords

Auto-associative neural networks · Epoch parameters · Glottal volume velocity signal · Linear prediction (LP) residual · Source features · Support vector machines

Supplementary material

10772_2012_9175_MOESM1_ESM.pdf (PDF, 147 kB)


Copyright information

© Springer Science+Business Media, LLC 2012

Authors and Affiliations

  • Sreenivasa Rao Krothapalli (1)
  • Shashidhar G. Koolagudi (1)

  1. School of Information Technology, Indian Institute of Technology Kharagpur, Kharagpur, India
