
GFCC based discriminatively trained noise robust continuous ASR system for Hindi language

  • Mohit Dua
  • Rajesh Kumar Aggarwal
  • Mantosh Biswas
Original Research

Abstract

A statistically designed Automatic Speech Recognition (ASR) system extracts features from speech signals using feature extraction methods, links the extracted features with the expected phonetics of the hypothesis using acoustic models, and uses a language model to add prior information about the structure of the target language. For many years, Mel-Frequency Cepstral Coefficients (MFCC), n-gram models, and Hidden Markov Models (HMM) have been used predominantly for feature extraction, language modeling, and acoustic modeling, respectively. However, the performance degradation of MFCC in noisy conditions and the inaccuracy of HMMs in handling large vocabularies have led researchers to propose more efficient methods. The proposed work uses the noise-robust Gammatone Frequency Cepstral Coefficients (GFCC) method for feature extraction, trigram language modeling, and HMM-Gaussian Mixture Model (GMM) based acoustic modeling to implement a continuous Hindi-language ASR system. It also applies the Differential Evolution (DE) technique to refine the GFCC features and discriminative techniques to enhance the performance of the acoustic model. The performance of the implemented system has been evaluated using different feature extraction methods, variants of n-gram language modeling, and different discriminative techniques in clean as well as noisy conditions. First, the results reveal that DE-optimized GFCC with HMM-GMM acoustic modeling performs better than the MFCC, PLP, and MF-PLP feature extraction methods. Second, the experimental results show that Minimum Phone Error (MPE) training outperforms Maximum Mutual Information (MMI) and Maximum Likelihood Estimation (MLE), and that trigram language modeling gives more accurate results than unigram and bigram modeling. Finally, it is concluded that the continuous Hindi-language ASR system implemented using the DE-refined GFCC feature extraction method with MPE discriminative training and trigram language modeling gives better accuracy in clean as well as noisy environments.
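GFCC extraction replaces the mel filterbank of MFCC with a gammatone filterbank whose center frequencies are spaced on the ERB scale, and typically applies cubic-root rather than log compression before the DCT. The paper does not publish its implementation; the following NumPy sketch is only an illustration of the general pipeline, with all function names and parameter values chosen here (and the filterbank approximated by sampling a 4th-order gammatone magnitude response on the FFT grid rather than by time-domain filtering):

```python
import numpy as np

def erb_center_freqs(low_hz, high_hz, n_filters):
    """Center frequencies equally spaced on the ERB-rate scale (Glasberg & Moore)."""
    ear_q, min_bw = 9.26449, 24.7
    c = ear_q * min_bw
    return -c + np.exp(
        np.arange(1, n_filters + 1)
        * (np.log(low_hz + c) - np.log(high_hz + c)) / n_filters
    ) * (high_hz + c)

def gammatone_weights(n_filters, n_fft, sr, low_hz=50, high_hz=8000):
    """Magnitude-squared 4th-order gammatone responses sampled on the FFT grid."""
    cfs = erb_center_freqs(low_hz, high_hz, n_filters)
    freqs = np.linspace(0, sr / 2, n_fft // 2 + 1)
    erb = 24.7 * (4.37e-3 * cfs + 1.0)          # ERB bandwidth of each channel
    b = 1.019 * erb                             # gammatone bandwidth factor
    resp = (1.0 + ((freqs[None, :] - cfs[:, None]) / b[:, None]) ** 2) ** -4
    return resp / resp.sum(axis=1, keepdims=True)

def gfcc(signal, sr=16000, n_filters=64, n_ceps=13, frame_len=400, hop=160):
    """Frame, window, apply the gammatone bank, cubic-root compress, then DCT."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i*hop : i*hop + frame_len] for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)
    spec = np.abs(np.fft.rfft(frames, n=512)) ** 2
    energies = spec @ gammatone_weights(n_filters, 512, sr).T
    compressed = np.cbrt(energies)              # cubic-root compression (log in MFCC)
    k = np.arange(n_filters)                    # DCT-II over channels, keep n_ceps
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2*k + 1) / (2*n_filters))
    return compressed @ dct.T
```

A production front end would additionally use true time-domain gammatone filtering, pre-emphasis, and delta/acceleration coefficients before HMM-GMM training.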
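The abstract states that DE refines the GFCC features but does not specify the objective function or parameter encoding. As a generic illustration of the optimizer itself, a standard DE/rand/1/bin loop can be sketched as follows (population size, scale factor F, and crossover rate CR are conventional defaults, not values from the paper):

```python
import numpy as np

def differential_evolution(objective, bounds, pop_size=20, n_gen=100,
                           f=0.8, cr=0.9, seed=0):
    """Classic DE/rand/1/bin minimizer over a box-bounded search space."""
    rng = np.random.default_rng(seed)
    dim = len(bounds)
    lo, hi = np.array(bounds, dtype=float).T
    pop = rng.uniform(lo, hi, size=(pop_size, dim))
    fit = np.array([objective(x) for x in pop])
    for _ in range(n_gen):
        for i in range(pop_size):
            # mutation: combine three distinct members other than i
            a, b, c = rng.choice([j for j in range(pop_size) if j != i],
                                 size=3, replace=False)
            mutant = np.clip(pop[a] + f * (pop[b] - pop[c]), lo, hi)
            # binomial crossover with one guaranteed mutant gene
            cross = rng.random(dim) < cr
            cross[rng.integers(dim)] = True
            trial = np.where(cross, mutant, pop[i])
            # greedy selection: keep the trial if it is no worse
            f_trial = objective(trial)
            if f_trial <= fit[i]:
                pop[i], fit[i] = trial, f_trial
    best = np.argmin(fit)
    return pop[best], fit[best]
```

In the paper's setting the objective would presumably score a candidate feature parameterization, e.g. by recognition error on development data, which is far costlier per evaluation than an analytic test function.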

Keywords

Automatic speech recognition · MFCC · GFCC · Discriminative training · MPE


Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2018

Authors and Affiliations

  1. Department of Computer Engineering, National Institute of Technology, Kurukshetra, India
