Skip to main content

A Comparison of Modeling Units in Sequence-to-Sequence Speech Recognition with the Transformer on Mandarin Chinese

  • Conference paper
  • First Online:
Neural Information Processing (ICONIP 2018)

Part of the book series: Lecture Notes in Computer Science ((LNTCS,volume 11305))

Included in the following conference series:

Abstract

The choice of modeling units is critical to automatic speech recognition (ASR) tasks. Conventional ASR systems typically choose context-dependent states (CD-states) or context-dependent phonemes (CD-phonemes) as their modeling units. However, it has been challenged by sequence-to-sequence attention-based models. On English ASR tasks, previous attempts have already shown that the modeling unit of graphemes can outperform that of phonemes by sequence-to-sequence attention-based model. In this paper, we are concerned with modeling units on Mandarin Chinese ASR tasks using sequence-to-sequence attention-based models with the Transformer. Five modeling units are explored including context-independent phonemes (CI-phonemes), syllables, words, sub-words and characters. Experiments on HKUST datasets demonstrate that the lexicon free modeling units can outperform lexicon related modeling units in terms of character error rate (CER). Among five modeling units, character based model performs best and establishes a new state-of-the-art CER of \(26.64\%\) on HKUST datasets.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 39.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Notes

  1. 1.

    https://github.com/rsennrich/subword-nmt.

  2. 2.

    ‘@@’ is a special end-of-word symbol to connect sub-words.

  3. 3.

    We manually delete two tokens \(\cdot \) and \(+\), which are not Mandarin Chinese characters.

  4. 4.

    Experiment code: https://github.com/shiyuzh2007/ASR/tree/master/transformer.

References

  1. Chan, W., Lane, I.: On online attention-based speech recognition and joint Mandarin character-pinyin training. In: Interspeech, pp. 3404–3408 (2016)

    Google Scholar 

  2. Chiu, C.C., et al.: State-of-the-art speech recognition with sequence-to-sequence models. arXiv preprint arXiv:1712.01769 (2017)

  3. Dahl, G.E., Yu, D., Deng, L., Acero, A.: Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Trans. Audio Speech Lang. Process. 20(1), 30–42 (2012)

    Article  Google Scholar 

  4. Hori, T., Watanabe, S., Zhang, Y., Chan, W.: Advances in joint CTC-attention based end-to-end speech recognition with a deep CNN encoder and RNN-LM. arXiv preprint arXiv:1706.02737 (2017)

  5. Kannan, A., Wu, Y., Nguyen, P., Sainath, T.N., Chen, Z., Prabhavalkar, R.: An analysis of incorporating an external language model into a sequence-to-sequence model. arXiv preprint arXiv:1712.01996 (2017)

  6. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)

  7. Dong, L., Xu, S., Xu, B.: Speech-transformer: a no-recurrence sequence-to-sequence model for speech recognition. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, pp. 5884–5888. IEEE (2018)

    Google Scholar 

  8. Liu, Y., Fung, P., Yang, Y., Cieri, C., Huang, S., Graff, D.: HKUST/MTS: a very large scale Mandarin telephone speech corpus. In: Huo, Q., Ma, B., Chng, E.-S., Li, H. (eds.) ISCSLP 2006. LNCS, vol. 4274, pp. 724–735. Springer, Heidelberg (2006). https://doi.org/10.1007/11939993_73

    Chapter  Google Scholar 

  9. Prabhavalkar, R., Sainath, T.N., Li, B., Rao, K., Jaitly, N.: An analysis of attention in sequence-to-sequence models. In: Proceedings of Interspeech (2017)

    Google Scholar 

  10. Sainath, T.N., et al.: No need for a lexicon? Evaluating the value of the pronunciation lexica in end-to-end models. arXiv preprint arXiv:1712.01864 (2017)

  11. Sak, H., Senior, A., Beaufays, F.: Long short-term memory recurrent neural network architectures for large scale acoustic modeling. In: Fifteenth Annual Conference of the International Speech Communication Association (2014)

    Google Scholar 

  12. Sak, H., Senior, A., Rao, K., Beaufays, F.: Fast and accurate recurrent neural network acoustic models for speech recognition. arXiv preprint arXiv:1507.06947 (2015)

  13. Senior, A., Sak, H., Shafran, I.: Context dependent phone models for LSTM RNN acoustic modelling. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, pp. 4585–4589. IEEE (2015)

    Google Scholar 

  14. Sennrich, R., Haddow, B., Birch, A.: Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909 (2015)

  15. Shan, C., Zhang, J., Wang, Y., Xie, L.: Attention-based end-to-end speech recognition on voice search. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4764–4768 (2018)

    Google Scholar 

  16. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826 (2016)

    Google Scholar 

  17. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, pp. 6000–6010 (2017)

    Google Scholar 

  18. Zhao, Y., Xu, S., Xu, B.: Multidimensional residual learning based on recurrent neural networks for acoustic modeling. In: Interspeech, pp. 3419–3423 (2016)

    Google Scholar 

  19. Zhou, S., Dong, L., Xu, S., Xu, B.: Syllable-based sequence-to-sequence speech recognition with the transformer in Mandarin Chinese. ArXiv e-prints, April 2018

    Google Scholar 

  20. Zou, W., Jiang, D., Zhao, S., Li, X.: A comparable study of modeling units for end-to-end Mandarin speech recognition. arXiv preprint arXiv:1805.03832 (2018)

Download references

Acknowledgments

The research work is supported by the National Key Research and Development Program of China under Grant No. 2016YFB1001404.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Shiyu Zhou .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Check for updates. Verify currency and authenticity via CrossMark

Cite this paper

Zhou, S., Dong, L., Xu, S., Xu, B. (2018). A Comparison of Modeling Units in Sequence-to-Sequence Speech Recognition with the Transformer on Mandarin Chinese. In: Cheng, L., Leung, A., Ozawa, S. (eds) Neural Information Processing. ICONIP 2018. Lecture Notes in Computer Science(), vol 11305. Springer, Cham. https://doi.org/10.1007/978-3-030-04221-9_19

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-04221-9_19

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-04220-2

  • Online ISBN: 978-3-030-04221-9

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics