Mandarin Chinese Mispronunciation Detection and Diagnosis Leveraging Deep Neural Network Based Acoustic Modeling and Training Techniques

  • Berlin Chen
  • Yao-Chi Hsu
Chapter
Part of the Chinese Language Learning Sciences book series (CLLS)

Abstract

Automatic mispronunciation detection and diagnosis are two critical and integral components of a computer-assisted pronunciation training (CAPT) system. Together they help second-language (L2) learners pinpoint erroneous pronunciations in a given utterance and thereby improve their spoken proficiency. In this chapter, we first briefly introduce the latest trends and developments in mispronunciation detection and diagnosis with state-of-the-art automatic speech recognition (ASR) methodologies, especially those using deep neural network based acoustic models. We then present an effective training approach that estimates the deep neural network based acoustic models involved in the mispronunciation detection process by optimizing an objective directly linked to the ultimate performance evaluation metric. We also investigate the extent to which the subsequent mispronunciation diagnosis process can benefit from these specifically trained acoustic models. For this purpose, we recast mispronunciation diagnosis as a classification problem and derive a set of indicative features. A series of experiments on a Mandarin Chinese mispronunciation detection and diagnosis task is conducted to evaluate the performance merits of this approach.

Notes

Acknowledgements

This research is supported in part by the “Aim for the Top University Project” of National Taiwan Normal University (NTNU), sponsored by the Ministry of Education, Taiwan, and by the Ministry of Science and Technology, Taiwan, under Grants MOST 107-2634-F-008-004, MOST 105-2221-E-003-018-MY3, MOST 104-2221-E-003-018-MY3, and MOST 104-2911-I-003-301.

Copyright information

© Springer Nature Singapore Pte Ltd. 2019

Authors and Affiliations

  1. National Taiwan Normal University, Taipei, Taiwan
  2. Delta Research Center, Taipei, Taiwan
