Application of Emotion Recognition and Modification for Emotional Telugu Speech Recognition
Abstract
The majority of automatic speech recognition (ASR) systems are trained on neutral speech, and their performance degrades in the presence of emotional content in the input. Recognizing these emotions in human speech is therefore a crucial aspect of human-machine interaction. In the first stage, combined spectral and differenced prosody features are used for emotion recognition. This emotion recognition task does not serve solely to improve ASR performance: based on the emotion recognized from the input speech, the corresponding adapted emotive ASR model is selected for evaluation in the second stage. Each adapted emotive ASR model is built from the existing neutral speech together with emotive speech generated synthetically by prosody modification. In this work, the importance of an emotion recognition block at the front-end, along with emotive speech adaptation of the ASR models, was studied. Speech samples from the IIIT-H Telugu speech corpus were used to build the large-vocabulary ASR systems, and emotional speech samples from the IITKGP-SESC Telugu corpus were used for evaluation. The adapted emotive speech models yielded better performance than the existing neutral speech models.
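As a rough sketch, the two-stage scheme outlined in the abstract might be organized as below. The concrete feature choices (utterance-level MFCC means, first-order differences of the F0 and energy contours as "differenced prosody"), the stub classifier, and the per-emotion model dictionary are illustrative assumptions, not the authors' actual implementation:

```python
# Hedged sketch of the two-stage pipeline: stage 1 recognizes the emotion,
# stage 2 decodes with the ASR model adapted to that emotion.
import numpy as np

def differenced_prosody(f0, energy):
    # First-order frame differences of the F0 and energy contours --
    # one plausible reading of "differenced prosody features".
    return np.concatenate([np.diff(f0), np.diff(energy)])

def extract_features(mfcc, f0, energy):
    # Combined spectral + differenced-prosody vector: utterance-level
    # MFCC means plus summary statistics of the prosody differences.
    d = differenced_prosody(f0, energy)
    return np.concatenate([mfcc.mean(axis=0), [d.mean(), d.std()]])

class TwoStageEmotiveASR:
    def __init__(self, emotion_classifier, asr_models):
        self.clf = emotion_classifier  # stage 1 (e.g. an SVM -- assumption)
        self.asr = asr_models          # stage 2: one adapted model per emotion

    def transcribe(self, mfcc, f0, energy):
        emotion = self.clf.predict(extract_features(mfcc, f0, energy))
        return self.asr[emotion].decode(mfcc), emotion

# Tiny stand-ins so the sketch runs end to end (hypothetical, not real models).
class StubClassifier:
    def predict(self, feats):
        return "anger"

class StubASR:
    def __init__(self, tag):
        self.tag = tag
    def decode(self, mfcc):
        return f"<decoded with {self.tag} model>"

system = TwoStageEmotiveASR(
    StubClassifier(),
    {e: StubASR(e) for e in ("neutral", "anger", "happiness", "sadness")})
text, emotion = system.transcribe(
    mfcc=np.random.randn(50, 13),
    f0=np.random.rand(50),
    energy=np.random.rand(50))
```

In a real system, the stub classifier would be replaced by a trained emotion recognizer and each `StubASR` by an acoustic model adapted with prosody-modified emotive speech; the point of the structure is only that the front-end's decision routes the utterance to the matching adapted model.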
Keywords
ASR · Emotion recognition · Emotive speech