International Journal of Speech Technology, Volume 21, Issue 3, pp 521–532

Prosody modification for speech recognition in emotionally mismatched conditions

  • Vishnu Vidyadhara Raju Vegesna
  • Krishna Gurugubelli
  • Anil kumar Vuppala


The performance of automatic speech recognition (ASR) systems degrades when training and testing conditions are mismatched. One cause of this degradation is the presence of emotion in speech. The main objective of this work is to improve ASR performance on emotional speech using prosody modification, exploiting the influence of different emotions on prosody parameters. Emotion conversion methods are employed to generate word-level, non-uniform prosody-modified speech, using modification factors for the prosodic components pitch, duration and energy. Prosody modification is applied in two ways. First, emotion conversion is performed at the testing stage to generate neutral speech from emotional speech. Second, the ASR system is trained with emotional speech generated from neutral speech. The effect of emotion in speech is studied for Telugu ASR systems. A new database, the IIIT-H Telugu speech corpus, is collected to build a large-vocabulary neutral-speech Telugu ASR system, and emotional speech samples from the IITKGP-SESC Telugu corpus are used to test it. The emotions anger, happiness and compassion are considered in the evaluation. ASR performance improves when prosody-modified speech is used.
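The idea of word-level, non-uniform modification factors can be illustrated with a small sketch. It operates only on per-word prosody parameters (mean pitch, duration, energy), not on the waveform, and it is not the authors' implementation. The factor values, the grouping of words by position in the utterance, and all names (`WordProsody`, `Factors`, `ANGER_TO_NEUTRAL`, `modify_prosody`) are assumptions made for illustration; the abstract does not give them.

```python
from dataclasses import dataclass


@dataclass
class WordProsody:
    word: str
    f0_hz: float       # mean pitch of the word
    duration_s: float  # word duration in seconds
    energy: float      # mean energy, arbitrary linear units


@dataclass
class Factors:
    pitch: float
    duration: float
    energy: float


# Hypothetical factors for moving angry speech toward neutral.
# "Non-uniform" here means the factors differ by word position;
# the values are placeholders, not taken from the paper.
ANGER_TO_NEUTRAL = {
    "initial": Factors(pitch=0.85, duration=1.10, energy=0.80),
    "medial": Factors(pitch=0.90, duration=1.05, energy=0.85),
    "final": Factors(pitch=0.80, duration=1.15, energy=0.75),
}


def word_position(index: int, count: int) -> str:
    """Classify a word as initial, medial or final in its utterance."""
    if index == 0:
        return "initial"
    if index == count - 1:
        return "final"
    return "medial"


def modify_prosody(words, table):
    """Scale each word's pitch, duration and energy by its position's factors."""
    count = len(words)
    modified = []
    for index, w in enumerate(words):
        f = table[word_position(index, count)]
        modified.append(
            WordProsody(
                word=w.word,
                f0_hz=w.f0_hz * f.pitch,
                duration_s=w.duration_s * f.duration,
                energy=w.energy * f.energy,
            )
        )
    return modified
```

In a full system the resulting target values would drive a signal-level prosody modification method; the sketch stops at computing the targets. The same function covers the second use described in the abstract, generating emotional speech from neutral speech for training, by passing a table of neutral-to-emotion factors instead.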


Keywords: Prosody, Automatic speech recognition, Pitch, Duration, Energy, Emotion conversion, Non-uniform prosody modification



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  • Vishnu Vidyadhara Raju Vegesna (1)
  • Krishna Gurugubelli (1)
  • Anil kumar Vuppala (1)

  1. Speech Processing Lab, KCIS, International Institute of Information Technology, Hyderabad (IIIT-H), Hyderabad, India
