Language Adaptive Multilingual CTC Speech Recognition

  • Conference paper
  • First Online:
Speech and Computer (SPECOM 2017)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10458)

Abstract

Recently, it has been demonstrated that speech recognition systems can achieve human parity. While much research has been done for resource-rich languages like English, there exists a long tail of languages for which no speech recognition systems yet exist. The major obstacle to building systems for new languages is the lack of available resources. In the past, several methods have been proposed for building systems under low-resource conditions by using data from additional source languages during training. While it has been shown that DNN/HMM hybrid setups trained under low-resource conditions benefit from additional data, we propose a similar technique using sequence-based neural network acoustic models with the Connectionist Temporal Classification (CTC) loss function. We demonstrate that setups with multilingual phone sets benefit from the addition of Language Feature Vectors (LFVs).
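As background for readers unfamiliar with the CTC objective named in the abstract, the forward (alpha) recursion it is built on can be sketched in plain Python. This is an illustrative toy implementation, not the warp-ctc/PyTorch code used in the paper; the function name `ctc_forward` and the tiny vocabulary in the usage example are our own choices for the sketch.

```python
import math

def ctc_forward(log_probs, labels, blank=0):
    """Return -log P(labels | log_probs) via the CTC forward (alpha) recursion.

    log_probs: per-frame log-posteriors, shape [T][V] (lists of floats).
    labels: target label sequence (ints, no blanks), must be non-empty.
    """
    # Extend the label sequence with blanks: [blank, l1, blank, l2, ..., blank]
    ext = [blank]
    for l in labels:
        ext += [l, blank]
    S, T = len(ext), len(log_probs)
    NEG = float("-inf")

    def logsumexp(*xs):
        m = max(xs)
        if m == NEG:
            return NEG
        return m + math.log(sum(math.exp(x - m) for x in xs))

    # alpha[t][s]: log-prob of all alignments consuming frames 0..t
    # and ending in extended-label state s.
    alpha = [[NEG] * S for _ in range(T)]
    alpha[0][0] = log_probs[0][ext[0]]   # start with leading blank
    alpha[0][1] = log_probs[0][ext[1]]   # or with the first label
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1][s]                       # stay in state s
            if s > 0:
                a = logsumexp(a, alpha[t - 1][s - 1])  # advance by one
            # Skip the blank between two *different* labels.
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                a = logsumexp(a, alpha[t - 1][s - 2])
            alpha[t][s] = a + log_probs[t][ext[s]]
    # Valid alignments end in the last label or the final blank.
    return -logsumexp(alpha[T - 1][S - 1], alpha[T - 1][S - 2])
```

For example, with two frames, a two-symbol vocabulary (blank plus one label), and uniform per-frame posteriors of 0.5, the three alignments collapsing to the single label have total probability 0.75, so the loss is -log(0.75). Extending such a model with LFVs, as proposed here, amounts to concatenating a language embedding to each frame's acoustic features before the recurrent layers.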


Notes

  1. see: https://github.com/baidu-research/warp-ctc, accessed 2017-04-13.


Author information

Corresponding author

Correspondence to Markus Müller.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Müller, M., Stüker, S., Waibel, A. (2017). Language Adaptive Multilingual CTC Speech Recognition. In: Karpov, A., Potapova, R., Mporas, I. (eds) Speech and Computer. SPECOM 2017. Lecture Notes in Computer Science, vol 10458. Springer, Cham. https://doi.org/10.1007/978-3-319-66429-3_47

  • DOI: https://doi.org/10.1007/978-3-319-66429-3_47

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-66428-6

  • Online ISBN: 978-3-319-66429-3

  • eBook Packages: Computer Science, Computer Science (R0)
