Continuous Hindi Speech Recognition Using Kaldi ASR Based on Deep Neural Network

  • Prashant Upadhyaya
  • Sanjeev Kumar Mittal
  • Omar Farooq
  • Yash Vardhan Varshney
  • Musiur Raza Abidi
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 748)


Deep learning is now among the most reliable approaches for building accurate speech recognition and natural language processing (NLP) models. In this paper, we propose context-dependent deep neural network hidden Markov models (CD-DNN-HMMs) for large-vocabulary Hindi speech recognition using the Kaldi automatic speech recognition toolkit. Experiments on the AMUAV database demonstrate that CD-DNN-HMMs outperform the conventional CD-GMM-HMM model, improving the word error rate by 3.1% over the conventional triphone model.
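For readers unfamiliar with the metric, word error rate (WER) is conventionally defined as the number of word substitutions, deletions, and insertions needed to turn the recognizer output into the reference transcript, divided by the number of reference words. The sketch below illustrates that standard definition only; it is not the scoring procedure used in the paper, and the function name `word_error_rate` and the placeholder tokens are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = (substitutions + deletions + insertions) / reference length.

    Illustrative helper (hypothetical name), not the paper's scoring script.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Placeholder tokens stand in for real transcripts.
# One substitution (w2 -> wX) and one deletion (w4) over four reference words.
example_wer = word_error_rate("w1 w2 w3 w4", "w1 wX w3")
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.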


Deep neural network (DNN), Hidden Markov model (HMM), Speech recognition, Kaldi, Hindi language



The authors would like to acknowledge the Institution of Electronics and Telecommunication Engineers (IETE) for sponsoring the research fellowship during the period of this research.



Copyright information

© Springer Nature Singapore Pte Ltd. 2019

Authors and Affiliations

  • Prashant Upadhyaya (1, corresponding author)
  • Sanjeev Kumar Mittal (2)
  • Omar Farooq (1)
  • Yash Vardhan Varshney (1)
  • Musiur Raza Abidi (1)

  1. Department of Electronics, Aligarh Muslim University, Aligarh, India
  2. Electrical Engineering, Indian Institute of Science Bangalore, Bengaluru, India
