The Impact of Inaccurate Phonetic Annotations on Speech Recognition Performance

  • Radek Safarik
  • Lukas Mateju
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10415)


This paper focuses on the impact of phonetic inaccuracies in acoustic training data on the performance of automatic speech recognition (ASR) systems. This is especially important when the training data is created in an automated way, as such data often contains errors in the form of incorrect phonetic transcriptions. A series of experiments simulating various common errors in phonetic transcriptions is conducted on parts of the GlobalPhone data set (for Croatian, Czech, and Russian). These experiments show the influence of the various error types on different languages and acoustic models (Gaussian mixture models, deep neural networks). The impact of errors is also demonstrated on real data obtained by our automated ASR creation process for Belarusian. The results show that the best performance is achieved with the most accurate data; however, a small proportion of errors (up to 5%) has relatively little impact on speech recognition accuracy.
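The corruption experiments described above can be sketched in code. The snippet below simulates one common error type, random phoneme substitution at a controlled rate; the paper simulates several error kinds, and the function name `corrupt_transcription` and the mini phoneme inventory here are illustrative assumptions, not the authors' actual setup.

```python
import random

# Hypothetical mini phoneme inventory; a real system would use the
# language-specific phone set of the training lexicon.
PHONEMES = ["a", "e", "i", "o", "u", "p", "t", "k", "s", "m", "n", "r"]

def corrupt_transcription(phonemes, error_rate, rng=None):
    """Replace each phoneme with a different, randomly chosen one
    with probability `error_rate`, simulating annotation errors."""
    rng = rng or random.Random()
    corrupted = []
    for ph in phonemes:
        if rng.random() < error_rate:
            # Substitute with any phoneme other than the original.
            corrupted.append(rng.choice([p for p in PHONEMES if p != ph]))
        else:
            corrupted.append(ph)
    return corrupted

# Corrupt 5% of phonemes, matching the error level the paper reports
# as having relatively little impact on recognition accuracy.
noisy = corrupt_transcription(["a", "t", "o", "m"], error_rate=0.05)
```

Training acoustic models on transcriptions corrupted at increasing rates, then comparing recognition accuracy, yields the kind of degradation curves the experiments measure.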


Keywords: Speech recognition · Gaussian mixture models · Deep neural networks · Phonetic annotations · Phoneme corruption



This work was supported by the Technology Agency of the Czech Republic (Project No. TA04010199) and by the Student Grant Scheme 2017 of the Technical University of Liberec.



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. Institute of Information Technology and Electronics, Technical University of Liberec, Liberec, Czech Republic
