The Impact of Inaccurate Phonetic Annotations on Speech Recognition Performance
This paper examines how phonetic inaccuracies in acoustic training data affect the performance of an automatic speech recognition (ASR) system. The question is especially important when the training data is created in an automated way, because such data often contains errors in the form of wrong phonetic transcriptions. A series of experiments simulating various common errors in phonetic transcriptions is conducted on parts of the GlobalPhone data set (Croatian, Czech and Russian). These experiments show the influence of the various errors across languages and acoustic models (Gaussian mixture models and deep neural networks). The impact of errors is also shown for real data obtained by our automated ASR creation process for Belarusian. The results show that the best performance is achieved with the most accurate data; however, a certain amount of errors (up to 5%) has a relatively small impact on speech recognition accuracy.
Keywords: Speech recognition, Gaussian mixture models, Deep neural networks, Phonetic annotations, Phoneme corruption
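The abstract mentions simulating errors in phonetic transcriptions at a controlled rate (for example, up to 5%). The sketch below illustrates one way such corruption could be simulated. It is not the authors' procedure: the three error types (substitution, deletion, insertion), the uniform choice among them, the function name `corrupt_phonemes`, and the toy phoneme inventory are all assumptions made for illustration.

```python
import random


def corrupt_phonemes(phonemes, inventory, error_rate, rng):
    """Return a copy of `phonemes` in which each symbol is altered
    with probability `error_rate`.

    Hypothetical helper: the error types and their uniform selection
    are assumptions, not taken from the paper. `inventory` must hold
    at least two distinct symbols so a substitution is always possible.
    """
    corrupted = []
    for phoneme in phonemes:
        if rng.random() >= error_rate:
            # Keep the symbol unchanged.
            corrupted.append(phoneme)
            continue
        kind = rng.choice(("substitute", "delete", "insert"))
        if kind == "substitute":
            # Replace with a different symbol from the inventory.
            alternatives = [p for p in inventory if p != phoneme]
            corrupted.append(rng.choice(alternatives))
        elif kind == "insert":
            # Keep the symbol and add a spurious one after it.
            corrupted.append(phoneme)
            corrupted.append(rng.choice(inventory))
        # "delete": the symbol is dropped, so nothing is appended.
    return corrupted


# Toy inventory and reference transcription (both made up).
rng = random.Random(0)
inventory = ["a", "e", "i", "o", "u", "k", "t", "s", "r", "n"]
reference = [rng.choice(inventory) for _ in range(1000)]
noisy = corrupt_phonemes(reference, inventory, 0.05, rng)
```

Passing a seeded `random.Random` instance keeps the corruption reproducible, so the same noisy transcription can be regenerated when comparing acoustic models trained at different error rates.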
This work was supported by the Technology Agency of the Czech Republic (Project No. TA04010199) and by the Student Grant Scheme 2017 of the Technical University in Liberec.