Abstract
This paper presents a deep-learning-based method for extracting bottleneck features for Vietnamese large vocabulary speech recognition. Deep bottleneck features (DBNFs) achieve significant improvements over a number of previously reported baseline bottleneck features. The experiments are carried out on a dataset of speech from the Voice of Vietnam (VOV) channel. The results show that adding tonal features as input to the network yields a relative improvement of around 20% in recognition performance. DBNF extraction for Vietnamese recognition decreases the error rate by 51% compared to the MFCC baseline.
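The percentages above are easiest to read as relative changes against a baseline. The short Python sketch below assumes the conventional definition of relative error-rate reduction, (baseline - new) / baseline; the abstract does not spell out its formula, so this reading is an assumption. The function name and the two error rates are hypothetical values chosen only so that the arithmetic gives 51%. They are not results from the paper.

```python
def relative_reduction(baseline_error: float, new_error: float) -> float:
    """Return the error-rate reduction as a fraction of the baseline error."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - new_error) / baseline_error


# Hypothetical error rates in percent; NOT figures reported in the paper.
baseline_wer = 20.0  # assumed MFCC baseline
dbnf_wer = 9.8       # assumed DBNF system

reduction = relative_reduction(baseline_wer, dbnf_wer)
print(f"relative error-rate reduction: {reduction:.0%}")
```

Under this definition, a 51% relative reduction means the new error rate is a little under half of the baseline rate, not that it fell by 51 percentage points.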
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Nguyen, Q.B., Vu, T.T., Luong, C.M. (2015). Improving Acoustic Model for Vietnamese Large Vocabulary Continuous Speech Recognition System Using Deep Bottleneck Features. In: Nguyen, VH., Le, AC., Huynh, VN. (eds) Knowledge and Systems Engineering. Advances in Intelligent Systems and Computing, vol 326. Springer, Cham. https://doi.org/10.1007/978-3-319-11680-8_5
DOI: https://doi.org/10.1007/978-3-319-11680-8_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-11679-2
Online ISBN: 978-3-319-11680-8
eBook Packages: Engineering, Engineering (R0)