Multi-task Learning in Prediction and Correction for Low Resource Speech Recognition

  • Danish Bukhari
  • Jiangyan Yi
  • Zhengqi Wen
  • Bin Liu
  • Jianhua Tao
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 807)

Abstract

In this paper we investigate the performance of multi-task learning (MTL) for a combined Convolutional, Long Short-Term Memory and Deep Neural Network (CLDNN) model on low resource speech recognition tasks. We first train a multilingual CNN model and then apply MTL in the DNN layers. In the MTL framework, grapheme models are used alongside phone models in the shared hidden layers of the deep neural network to compute the state probabilities. We experiment with a universal phone set (UPS) and a universal grapheme set (UGS) in the DNN framework, as well as a combination of both, to further improve the accuracy of the overall system. The combined model is implemented on the Prediction and Correction (PAC) model, yielding a multilingual PAC-MTL-CLDNN architecture. We evaluate the improvements on the AP16-OLR task: the proposed model achieves a 1.8% improvement on Vietnamese and a 2.5% improvement on Uyghur over the baseline PAC model and MDNN system. We also show that the extra grapheme modeling task remains effective with only one hour of training data, giving a 2.1% improvement on Uyghur over the baseline MDNN system, which makes it highly beneficial for zero resource languages.
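The shared-hidden-layer MTL idea described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's actual configuration: the layer sizes, two-layer depth, and output-set sizes are assumptions, and the CLDNN front end (CNN and LSTM layers) is omitted so the sketch focuses only on the shared DNN layers feeding two task-specific softmax heads, one for universal phone set (UPS) states and one for universal grapheme set (UGS) states.

```python
import numpy as np

# Hypothetical dimensions chosen for illustration; the paper does not
# specify these here.
FEAT_DIM, HIDDEN, N_PHONE, N_GRAPH = 40, 128, 120, 60

rng = np.random.default_rng(0)


def softmax(x):
    """Row-wise softmax producing state posterior probabilities."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


class SharedMTLHead:
    """Shared hidden layers with two task-specific softmax heads:
    a phone-state (UPS) head and a grapheme-state (UGS) head.
    In MTL training, gradients from both tasks update the shared weights."""

    def __init__(self):
        self.W1 = rng.standard_normal((FEAT_DIM, HIDDEN)) * 0.1
        self.W2 = rng.standard_normal((HIDDEN, HIDDEN)) * 0.1
        self.Wp = rng.standard_normal((HIDDEN, N_PHONE)) * 0.1
        self.Wg = rng.standard_normal((HIDDEN, N_GRAPH)) * 0.1

    def forward(self, x):
        h = np.tanh(x @ self.W1)            # shared hidden layer 1
        h = np.tanh(h @ self.W2)            # shared hidden layer 2
        phone_post = softmax(h @ self.Wp)   # phone-state posteriors (UPS)
        graph_post = softmax(h @ self.Wg)   # grapheme-state posteriors (UGS)
        return phone_post, graph_post


model = SharedMTLHead()
x = rng.standard_normal((5, FEAT_DIM))      # 5 frames of acoustic features
p, g = model.forward(x)
```

Because both heads read the same shared representation, the grapheme task acts as a regularizer on the phone task, which is one intuition for why it helps in low and zero resource conditions.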

Keywords

MTL · Multilingual speech recognition · Human-computer interaction · Uyghur

Acknowledgements

This work is supported by the National Natural Science Foundation of China (NSFC) (No. 61403386, No. 61273288, No. 61233009), and the Major Program for the National Social Science Fund of China (13&ZD189).

Copyright information

© Springer Nature Singapore Pte Ltd. 2018

Authors and Affiliations

  • Danish Bukhari (1)
  • Jiangyan Yi (1)
  • Zhengqi Wen (1)
  • Bin Liu (1)
  • Jianhua Tao (1)
  1. Institute of Automation, Chinese Academy of Sciences, Beijing, China