Abstract
In this paper we investigate the performance of multitask learning (MTL) in the combined Convolutional, Long Short-Term Memory and Deep Neural Network (CLDNN) model for low-resource speech recognition tasks. We train a multilingual CNN model followed by MTL in the DNN layers. In the MTL framework, grapheme models are used alongside phone models in the shared hidden layers of the deep neural network to compute the state probabilities. We experiment with a universal phone set (UPS) and a universal grapheme set (UGS) in the DNN framework, as well as a combination of both, to further improve the accuracy of the overall system. The combined model is implemented on top of the Prediction and Correction (PAC) model, yielding a multilingual PAC-MTL-CLDNN architecture. We evaluate the improvements on the AP16-OLR task: the proposed model obtains a 1.8% improvement on Vietnamese and a 2.5% improvement on Uyghur over the baseline PAC model and MDNN system. We also show that the additional grapheme-modelling task remains effective with only one hour of training data, giving a 2.1% improvement on Uyghur over the baseline MDNN system and making it highly beneficial for zero-resource languages.
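The core MTL idea in the abstract — shared hidden layers feeding two task-specific output heads, one over a universal phone set and one over a universal grapheme set — can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the class name `MTLNet`, the layer sizes, and the use of plain feed-forward ReLU layers (in place of the full CLDNN stack) are all illustrative assumptions; only the shared-trunk/two-head structure reflects the described architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

class MTLNet:
    """Illustrative MTL network: shared hidden layers with two
    task-specific heads, one over a universal phone set (UPS)
    and one over a universal grapheme set (UGS)."""

    def __init__(self, n_in, n_hid, n_phones, n_graphemes):
        # Shared trunk (stands in for the CLDNN's DNN layers).
        self.W1 = rng.standard_normal((n_in, n_hid)) * 0.1
        self.b1 = np.zeros(n_hid)
        self.W2 = rng.standard_normal((n_hid, n_hid)) * 0.1
        self.b2 = np.zeros(n_hid)
        # Task-specific output heads.
        self.Wp = rng.standard_normal((n_hid, n_phones)) * 0.1
        self.bp = np.zeros(n_phones)
        self.Wg = rng.standard_normal((n_hid, n_graphemes)) * 0.1
        self.bg = np.zeros(n_graphemes)

    def forward(self, x):
        h = np.maximum(0.0, x @ self.W1 + self.b1)  # shared layer 1 (ReLU)
        h = np.maximum(0.0, h @ self.W2 + self.b2)  # shared layer 2 (ReLU)
        # Both heads consume the same shared representation, so
        # gradients from the grapheme task would regularize the
        # phone task (and vice versa) during training.
        p_phone = softmax(h @ self.Wp + self.bp)
        p_graph = softmax(h @ self.Wg + self.bg)
        return p_phone, p_graph

# A batch of 8 acoustic frames with 40 illustrative features each.
net = MTLNet(n_in=40, n_hid=64, n_phones=120, n_graphemes=50)
x = rng.standard_normal((8, 40))
p_phone, p_graph = net.forward(x)
```

At training time each head would receive its own cross-entropy loss, with the shared trunk updated by the sum of the two; at test time only the phone head is needed, which is why the extra grapheme task costs nothing at decoding.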
Acknowledgements
This work is supported by the National Natural Science Foundation of China (NSFC) (No. 61403386, No. 61273288, No. 61233009), and the Major Program for the National Social Science Fund of China (13&ZD189).
© 2018 Springer Nature Singapore Pte Ltd.
Cite this paper
Bukhari, D., Yi, J., Wen, Z., Liu, B., Tao, J. (2018). Multi-task Learning in Prediction and Correction for Low Resource Speech Recognition. In: Tao, J., Zheng, T., Bao, C., Wang, D., Li, Y. (eds) Man-Machine Speech Communication. NCMMSC 2017. Communications in Computer and Information Science, vol 807. Springer, Singapore. https://doi.org/10.1007/978-981-10-8111-8_8
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-8110-1
Online ISBN: 978-981-10-8111-8