Abstract
Automatic Speech Recognition (ASR) systems have recently gained popularity for low-resource languages. India has 22 official languages and more than two thousand other regional languages, most of which are low-resource; standard resources are limited even for Hindi. In this paper, a continuous Hindi ASR system is implemented using Time Delay Neural Network (TDNN) based acoustic modeling, which improves the performance of a baseline Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) Hindi ASR system by up to 11%. Further improvements of 3% and 2% are obtained by applying i-vector adaptation and interpolated language modeling, respectively.
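Two of the techniques named in the abstract can be illustrated with a minimal sketch: the subsampled frame splicing that lets a TDNN cover a long temporal context cheaply, and linear interpolation of language models. This is not the authors' implementation; the layer offsets are in the style of Peddinti et al. (2015) but are an assumption here, and the interpolation is shown over toy unigram models rather than the n-gram models a real system would mix.

```python
from collections import Counter

def receptive_field(layer_contexts):
    """Total left/right temporal context seen by a stack of TDNN layers.

    Each layer splices a few frame offsets (e.g. [-1, 2]); the offsets
    add up across layers, so a deep TDNN covers a wide window of
    acoustic frames with few parameters.
    """
    left = sum(-min(c) for c in layer_contexts)
    right = sum(max(c) for c in layer_contexts)
    return left, right

# Offsets in the style of Peddinti et al. (2015); the exact values used
# in this paper are an assumption.
contexts = [[-2, -1, 0, 1, 2], [-1, 2], [-3, 3], [-7, 2]]
print(receptive_field(contexts))  # (13, 9): the top layer sees frames t-13 .. t+9

def unigram_lm(tokens):
    """Maximum-likelihood unigram model: word -> probability."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def interpolate(p_in, p_bg, lam):
    """Linear interpolation: P(w) = lam * P_in(w) + (1 - lam) * P_bg(w)."""
    vocab = set(p_in) | set(p_bg)
    return {w: lam * p_in.get(w, 0.0) + (1 - lam) * p_bg.get(w, 0.0)
            for w in vocab}

in_domain = unigram_lm("नमस्ते दुनिया नमस्ते".split())
background = unigram_lm("शब्द वाक्य वाक्य वाक्य".split())
mixed = interpolate(in_domain, background, lam=0.7)
assert abs(sum(mixed.values()) - 1.0) < 1e-9  # still a valid distribution
```

The interpolation weight (0.7 here) would in practice be tuned on held-out data, e.g. to minimize perplexity.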
© 2020 Springer Nature Singapore Pte Ltd.
Cite this paper
Kumar, A., Aggarwal, R.K. (2020). A Time Delay Neural Network Acoustic Modeling for Hindi Speech Recognition. In: Kolhe, M., Tiwari, S., Trivedi, M., Mishra, K. (eds) Advances in Data and Information Sciences. Lecture Notes in Networks and Systems, vol 94. Springer, Singapore. https://doi.org/10.1007/978-981-15-0694-9_40
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-0693-2
Online ISBN: 978-981-15-0694-9