Using Auto-Encoder BiLSTM Neural Network for Czech Grapheme-to-Phoneme Conversion

Jůzová, Markéta; Vít, Jakub

doi:10.1007/978-3-030-27947-9_8

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11697)

Included in the following conference series:

International Conference on Text, Speech, and Dialogue

Abstract

A crucial part of almost all current TTS systems is grapheme-to-phoneme (G2P) conversion, i.e. the transcription of an input grapheme sequence into the correct sequence of phonemes in the given language. Unfortunately, preparing transcription rules and pronunciation dictionaries for a new language in a TTS system is not an easy process. For that reason, in the presented paper, we focus on the creation of an automatic G2P model based on neural networks (NN). In contrast to the majority of related work in the G2P field, which uses only isolated words as input, we take a whole phrase as the input of our proposed NN model. This approach should, in our opinion, lead to a more precise phonetic transcription because the pronunciation of a word can depend on the surrounding words. The results of the trained G2P model are presented for the Czech language, where cross-word-boundary phenomena occur quite often, and they are compared to a rule-based approach.
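To make the cross-word-boundary point concrete, the following is a minimal sketch of one well-known Czech phenomenon, regressive voicing assimilation between adjacent words. It is only an illustration of why phrase-level input carries information that isolated words lack; it is not the rule-based system or the neural model discussed in the paper. The function name `transcribe_phrase`, the small symbol tables, and the use of orthographic letters as stand-ins for phonemes are all assumptions made for this example.

```python
# Toy illustration only: a tiny hand-written subset of Czech voicing
# assimilation, NOT the rule set or the neural model from the paper.
# Orthographic letters stand in for phonemes (an assumption for brevity).

DEVOICE = {"b": "p", "d": "t", "g": "k", "v": "f", "z": "s"}
VOICE = {"p": "b", "t": "d", "k": "g", "s": "z"}
VOICELESS = set(DEVOICE.values())
# In Czech, "v" can itself be devoiced but does not voice a preceding consonant.
VOICING_TRIGGERS = set(DEVOICE) - {"v"}


def transcribe_phrase(words):
    """Adjust each word-final obstruent using the first symbol of the next word."""
    result = []
    for i, word in enumerate(words):
        if not word:
            result.append(word)
            continue
        # First symbol of the following word, or "" at the end of the phrase.
        nxt = words[i + 1][:1] if i + 1 < len(words) else ""
        last = word[-1]
        if nxt in VOICING_TRIGGERS and last in VOICE:
            word = word[:-1] + VOICE[last]      # e.g. "s" before "b..." -> "z"
        elif (nxt == "" or nxt in VOICELESS) and last in DEVOICE:
            word = word[:-1] + DEVOICE[last]    # e.g. "v" before "p..." -> "f"
        result.append(word)
    return result


same_word_voiced = transcribe_phrase(["s", "babičkou"])
same_word_plain = transcribe_phrase(["s", "tebou"])
devoiced = transcribe_phrase(["v", "praze"])
```

In this sketch the preposition "s" comes out as "z" before "babičkou" but stays "s" before "tebou", so its transcription cannot be decided from the word alone. Words followed by a vowel or sonorant are left unchanged here, which is a deliberate simplification.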

Notes

1. In SAMPA [25], the symbol J corresponds to the palatal nasal.
2. Note: This accuracy was counted on the phoneme level and also includes the padding symbols “-” and the <break> words.
3. This time, the accuracies were counted only for regular phonemes and words, without the padding symbol “-” and without phrase-break words.
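The difference between the two ways of counting accuracy in notes 2 and 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the evaluation code used in the paper: it assumes that reference and predicted sequences are already aligned position by position, that padding is the single token "-", and that a phrase break appears as the single token "<break>". The function name `phoneme_accuracy` and the example sequences are hypothetical.

```python
PAD = "-"          # padding symbol mentioned in the notes
BREAK = "<break>"  # phrase-break token (single-token form is an assumption)


def phoneme_accuracy(reference, predicted, skip=()):
    """Share of aligned positions where the prediction matches the reference.

    Positions whose reference symbol is listed in `skip` are not counted.
    """
    if len(reference) != len(predicted):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(r, p) for r, p in zip(reference, predicted) if r not in skip]
    if not pairs:
        return 0.0
    return sum(r == p for r, p in pairs) / len(pairs)


# Hypothetical aligned sequences with one phoneme error in first position.
reference = ["f", "p", "r", "a", "z", "e", BREAK, PAD, PAD]
predicted = ["v", "p", "r", "a", "z", "e", BREAK, PAD, PAD]

acc_with_padding = phoneme_accuracy(reference, predicted)
acc_regular_only = phoneme_accuracy(reference, predicted, skip=(PAD, BREAK))
```

With these sequences, the first figure is 8/9 and the second 5/6: padding and break positions are trivially easy to predict, so including them raises the reported accuracy. Word-level accuracy is not covered by this sketch.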

References

Bičan, A.: Distribution and combinations of Czech consonants. Zeitschrift für Slawistik 56, 153–171 (2011)
Bisani, M., Ney, H.: Joint-sequence models for grapheme-to-phoneme conversion. Speech Commun. 50(5), 434–451 (2008)
Cho, K., et al.: Learning phrase representations using RNN encoder-decoder for statistical machine translation. In: Moschitti, A., Pang, B., Daelemans, W. (eds.) EMNLP, pp. 1724–1734. ACL (2014)
Hanzlíček, Z., Vít, J., Tihelka, D.: LSTM-based speech segmentation for TTS synthesis. In: Ekštein, K. (ed.) TSD 2019. LNAI, vol. 11697, pp. 361–372. Springer, Heidelberg (2019)
Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
Jiampojamarn, S., Cherry, C., Kondrak, G.: Joint processing and discriminative training for letter-to-phoneme conversion. In: Proceedings of ACL-08: HLT, pp. 905–913. Association for Computational Linguistics, Columbus (2008)
Kučera, H.: The phonology of Czech, Slavistic printings and reprintings, vol. 30, ’s-Gravenhage, Mouton (1961)
Machač, P., Skarnitzl, R.: Principles of phonetic segmentation. Edition erudica, Epocha (2009)
Matoušek, J.: Building a New Czech text-to-speech system using triphone-based speech units. In: Sojka, P., Kopeček, I., Pala, K. (eds.) TSD 2000. LNCS (LNAI), vol. 1902, pp. 223–228. Springer, Heidelberg (2000). https://doi.org/10.1007/3-540-45323-7_38
Matoušek, J., Tihelka, D.: Annotation errors detection in TTS corpora. In: Proceedings of INTERSPEECH 2013, Lyon, France, pp. 1511–1515 (2013)
Matoušek, J., Tihelka, D., Šmídl, L.: On the impact of annotation errors on unit-selection speech synthesis. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds.) TSD 2012. LNCS (LNAI), vol. 7499, pp. 456–463. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-32790-2_55
Matoušek, J., Tihelka, D., Psutka, J.: Experiments with automatic segmentation for Czech speech synthesis. In: Matoušek, V., Mautner, P. (eds.) TSD 2003. LNCS (LNAI), vol. 2807, pp. 287–294. Springer, Heidelberg (2003). https://doi.org/10.1007/978-3-540-39398-6_41
Matoušek, J., Tihelka, D., Romportl, J., Psutka, J.: Slovak unit-selection speech synthesis: creating a new Slovak voice within a Czech TTS system ARTIC. IAENG Int. J. Comput. Sci. 39, 147–154 (2012)
Matoušek, J., Kala, J.: On modelling glottal stop in Czech text-to-speech synthesis. In: Matoušek, V., Mautner, P., Pavelka, T. (eds.) TSD 2005. LNCS (LNAI), vol. 3658, pp. 257–264. Springer, Heidelberg (2005). https://doi.org/10.1007/11551874_33
Matoušek, J., Psutka, J.: ARTIC: a new Czech text-to-speech system using statistical approach to speech segment database construction. In: Interspeech 2000 - ICSLP, Beijing, China, vol. 4, pp. 612–615 (2000)
Matoušek, J., Tihelka, D.: Slovak text-to-speech synthesis in ARTIC system. In: Sojka, P., Kopeček, I., Pala, K. (eds.) TSD 2004. LNCS (LNAI), vol. 3206, pp. 155–162. Springer, Heidelberg (2004). https://doi.org/10.1007/978-3-540-30120-2_20
Novak, J.R., Minematsu, N., Hirose, K.: Phonetisaurus: exploring grapheme-to-phoneme conversion with joint n-gram models in the WFST framework. Natural Lang. Eng. 22(6), 907–938 (2016)
Palková, Z.: Fonetika a fonologie češtiny [Phonetics and phonology of Czech], 1st edn. Univerzita Karlova, Nakladatelství Karolinum, Praha (1994)
Psutka, J., Müller, L., Matoušek, J., Radová, V.: Mluvíme s počítačem česky [Talking with Computer in Czech]. Academia, Praha (2006)
Rao, K., Peng, F., Sak, H., Beaufays, F.: Grapheme-to-phoneme conversion using long short-term memory recurrent neural networks. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4225–4229 (2015)
Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. In: Proceedings NIPS, Montreal, Canada, pp. 3104–3112 (2014)
Tihelka, D., Hanzlíček, Z., Jůzová, M., Vít, J., Matoušek, J., Grůber, M.: Current state of text-to-speech system ARTIC: a decade of research on the field of speech technologies. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds.) TSD 2018. LNCS (LNAI), vol. 11107, pp. 369–378. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-00794-2_40
Wang, D., King, S.: Letter-to-sound pronunciation prediction using conditional random fields. IEEE Signal Process. Lett. 18(2), 122–125 (2011)
Wang, Y., et al.: Tacotron: towards end-to-end speech synthesis (2017). https://arxiv.org/abs/1703.10135
Wells, J.C.: SAMPA computer readable phonetic alphabet. In: Gibbon, D., Moore, R., Winski, R. (eds.) Handbook of Standards and Resources for Spoken Language Systems. Mouton de Gruyter, Berlin (1997)
Wu, Y., et al.: Google’s neural machine translation system: bridging the gap between human and machine translation. CoRR abs/1609.08144 (2016). http://arxiv.org/abs/1609.08144
Yao, K., Zweig, G.: Sequence-to-sequence neural net models for grapheme-to-phoneme conversion. CoRR abs/1506.00196 (2015)

Acknowledgement

This research was supported by the Czech Science Foundation (GA CR), project No. GA19-19324S, and by the grant of the University of West Bohemia, project No. SGS-2019-027.

Author information

Authors and Affiliations

Department of Cybernetics and New Technologies for the Information Society, Faculty of Applied Sciences, University of West Bohemia, Pilsen, Czech Republic
Markéta Jůzová & Jakub Vít

Corresponding author

Correspondence to Markéta Jůzová.

Editor information

Editors and Affiliations

University of West Bohemia, Pilsen, Czech Republic
Kamil Ekštein

About this paper

Cite this paper

Jůzová, M., Vít, J. (2019). Using Auto-Encoder BiLSTM Neural Network for Czech Grapheme-to-Phoneme Conversion. In: Ekštein, K. (ed.) Text, Speech, and Dialogue. TSD 2019. Lecture Notes in Computer Science, vol 11697. Springer, Cham. https://doi.org/10.1007/978-3-030-27947-9_8

DOI: https://doi.org/10.1007/978-3-030-27947-9_8
Published: 06 August 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-27946-2
Online ISBN: 978-3-030-27947-9
eBook Packages: Computer Science, Computer Science (R0)
