Abstract
The crucial part of almost all current TTS systems is a grapheme-to-phoneme (G2P) conversion, i.e. the transcription of any input grapheme sequence into the correct sequence of phonemes in the given language. Unfortunately, the preparation of transcription rules and pronunciation dictionaries is not an easy process for new languages in TTS systems. For that reason, in the presented paper, we focus on the creation of an automatic G2P model, based on neural networks (NN). But, contrary to the majority of related works in G2P field, using only separate words as an input, we consider a whole phrase the input of our proposed NN model. That approach should, in our opinion, lead to more precise phonetic transcription output because the pronunciation of a word can depend on the surrounding words. The results of the trained G2P model are presented on the Czech language where the cross-word-boundary phenomena occur quite often, and they are compared to the rule-based approach.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Notes
- 1.
In SAMPA [25], the symbol J corresponds to the palatal nasal.
- 2.
Note: This accuracy was counted on phoneme level and includes the padding symbols ’-’ and the \(\texttt {<break>}\) words, too.
- 3.
This time, the accuracies were counted only for regular phonemes and words, without padding symbol “-” and without phrase-break words.
References
Bičan, A.: Distribution and combinations of Czech consonants. Zeitschrift für Slawistik 56, 153–171 (2011)
Bisani, M., Ney, H.: Joint-sequence models for grapheme-to-phoneme conversion. Speech Commun. 50(5), 434–451 (2008)
Cho, K., et al.: Learning phrase representations using RNN encoder-decoder for statistical machine translation. In: Moschitti, A., Pang, B., Daelemans, W. (eds.) EMNLP, pp. 1724–1734. ACL (2014)
Hanzlíček, Z., Vít, J., Tihelka, D.: LSTM-based speech segmentation for TTS synthesis. In: Ekštein, K. (ed.) TSD 2019. LNAI, vol. 11697, pp. 361–372. Springer, Heidelberg (2019)
Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
Jiampojamarn, S., Cherry, C., Kondrak, G.: Joint processing and discriminative training for letter-to-phoneme conversion. In: Proceedings of ACL-08: HLT, pp. 905–913. Association for Computational Linguistics, Columbus (2008)
Kučera, H.: The phonology of Czech, Slavistic printings and reprintings, vol. 30, ’s-Gravenhage, Mouton (1961)
Machač, P., Skarnitzl, R.: Principles of phonetic segmentation. Edition erudica, Epocha (2009)
Matoušek, J.: Building a New Czech text-to-speech system using triphone-based speech units. In: Sojka, P., Kopeček, I., Pala, K. (eds.) TSD 2000. LNCS (LNAI), vol. 1902, pp. 223–228. Springer, Heidelberg (2000). https://doi.org/10.1007/3-540-45323-7_38
Matoušek, J., Tihelka, D.: Annotation errors detection in TTS corpora. In: Proceedings of INTERSPEECH 2013, Lyon, France, pp. 1511–1515 (2013)
Matoušek, J., Tihelka, D., Šmídl, L.: On the impact of annotation errors on unit-selection speech synthesis. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds.) TSD 2012. LNCS (LNAI), vol. 7499, pp. 456–463. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-32790-2_55
Matoušek, J., Tihelka, D., Psutka, J.: Experiments with automatic segmentation for Czech speech synthesis. In: Matoušek, V., Mautner, P. (eds.) TSD 2003. LNCS (LNAI), vol. 2807, pp. 287–294. Springer, Heidelberg (2003). https://doi.org/10.1007/978-3-540-39398-6_41
Matoušek, J., Tihelka, D., Romportl, J., Psutka, J.: Slovak unit-selection speech synthesis: creating a new Slovak voice within a Czech TTS system ARTIC. IAENG Int. J. Comput. Sci. 39, 147–154 (2012)
Matoušek, J., Kala, J.: On modelling glottal stop in Czech text-to-speech synthesis. In: Matoušek, V., Mautner, P., Pavelka, T. (eds.) TSD 2005. LNCS (LNAI), vol. 3658, pp. 257–264. Springer, Heidelberg (2005). https://doi.org/10.1007/11551874_33
Matoušek, J., Psutka, J.: ARTIC: a new czech text-to-speech system using statistical approach to speech segment database construction. In: Interspeech 2000 - ICSLP, Beijing, China, vol. 4, pp. 612–615 (2000)
Matoušek, J., Tihelka, D.: Slovak text-to-speech synthesis in ARTIC system. In: Sojka, P., Kopeček, I., Pala, K. (eds.) TSD 2004. LNCS (LNAI), vol. 3206, pp. 155–162. Springer, Heidelberg (2004). https://doi.org/10.1007/978-3-540-30120-2_20
Novak, J.R., Minamatsu, N., Hirose, K.: Phonetisaurus: exploring grapheme-to-phoneme conversion with joint n-gram models in the WFST framework. Natural Lang. Eng. 22(6), 907–938 (2016)
Palková, Z.: Fonetika a fonologie češtiny [Phonetics and phonology of Czech], 1st edn. Univerzita Karlova, Nakladatelství Karolinum, Praha (1994)
Psutka, J., Müller, L., Matoušek, J., Radová, V.: Mluvíme s počítačem česky [Talking with Computer in Czech]. Academia, Praha (2006)
Rao, K., Peng, F., Sak, H., Beaufays, F.: Grapheme-to-phoneme conversion using long short-term memory recurrent neural networks. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4225–4229 (2015)
Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. In: Proceedings NIPS, Montreal, Canada, pp. 3104–3112 (2014)
Tihelka, D., Hanzlíček, Z., Jůzová, M., Vít, J., Matoušek, J., Grůber, M.: Current state of text-to-speech system ARTIC: a decade of research on the field of speech technologies. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds.) TSD 2018. LNCS (LNAI), vol. 11107, pp. 369–378. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-00794-2_40
Wang, D., King, S.: Letter-to-sound pronunciation prediction using conditional random fields. IEEE Signal Process. Lett. 18(2), 122–125 (2011)
Wang, Y., et al.: Tacotron: towards end-to-end speech synthesis (2017). https://arxiv.org/abs/1703.10135
Wells, J.C.: SAMPA computer readable phonetic alphabet. In: Gibbon, D., Moore, R., Winski, R. (eds.) Handbook of Standards and Resources for Spoken Language Systems. Mouton de Gruyter, Berlin (1997)
Wu, Y., et al.: Google’s neural machine translation system: bridging the gap between human and machine translation. CoRR abs/1609.08144 (2016). http://arxiv.org/abs/1609.08144
Yao, K., Zweig, G.: Sequence-to-sequence neural net models for grapheme-to-phoneme conversion. CoRR abs/1506.00196 (2015)
Acknowledgement
This research was supported by the Czech Science Foundation (GA CR), project No. GA19-19324S, and by the grant of the University of West Bohemia, project No. SGS-2019-027.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Jůzová, M., Vít, J. (2019). Using Auto-Encoder BiLSTM Neural Network for Czech Grapheme-to-Phoneme Conversion. In: Ekštein, K. (eds) Text, Speech, and Dialogue. TSD 2019. Lecture Notes in Computer Science(), vol 11697. Springer, Cham. https://doi.org/10.1007/978-3-030-27947-9_8
Download citation
DOI: https://doi.org/10.1007/978-3-030-27947-9_8
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-27946-2
Online ISBN: 978-3-030-27947-9
eBook Packages: Computer ScienceComputer Science (R0)