Abstract
Phrase break prediction is the first and most important component in increasing the naturalness and intelligibility of text-to-speech (TTS) systems. Most existing work relies on language-specific resources, large annotated corpora, and feature engineering to perform well. However, phrase break prediction from text for Mongolian speech synthesis remains a great challenge because of the data sparsity problem caused by the scarcity of resources. In this paper, we introduce a Bidirectional Long Short-Term Memory (BiLSTM) model with an attention mechanism that uses position-based enhanced phonological representations, word embeddings, and character embeddings to achieve state-of-the-art performance. The position-based enhanced phonological representations, derived from a separate BiLSTM model, comprise phoneme and syllable embeddings that carry position information. With the attention mechanism, the model can dynamically decide how much information to use from the word component or the phonological component. To handle the out-of-vocabulary (OOV) problem, we incorporate word, phonological, and character embeddings together as inputs to the model. Experimental results show that the proposed method significantly outperforms systems that use only word embeddings, by successfully leveraging position-based phonological information and the attention mechanism.
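The idea of letting attention decide how much to take from the word component and how much from the phonological component can be illustrated with a minimal sketch. It assumes a per-dimension sigmoid gate over two vectors of equal size; the abstract does not state the paper's actual attention formula, so the function `attention_combine`, the weight matrices, and the toy vectors below are all hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def attention_combine(word_vec, phon_vec, w_word, w_phon, bias):
    """Blend two same-sized vectors with a per-dimension gate z in (0, 1).

    Assumed form (not taken from the paper):
        z     = sigmoid(W_word @ word + W_phon @ phon + bias)
        mixed = z * word + (1 - z) * phon
    """
    a = matvec(w_word, word_vec)
    b = matvec(w_phon, phon_vec)
    gate = [sigmoid(x + y + c) for x, y, c in zip(a, b, bias)]
    mixed = [z * w + (1.0 - z) * p
             for z, w, p in zip(gate, word_vec, phon_vec)]
    return mixed, gate


def random_matrix(rows, cols, rng):
    """Small random weights standing in for trained parameters."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
dim = 4
word_vec = [0.2, -0.1, 0.7, 0.0]   # toy word embedding
phon_vec = [-0.3, 0.4, 0.1, 0.9]   # toy phonological representation
w_word = random_matrix(dim, dim, rng)
w_phon = random_matrix(dim, dim, rng)
bias = [0.0] * dim

mixed, gate = attention_combine(word_vec, phon_vec, w_word, w_phon, bias)
print("gate :", [round(z, 3) for z in gate])
print("mixed:", [round(m, 3) for m in mixed])
```

In this sketch a gate value near 1 keeps the word embedding for that dimension and a value near 0 falls back on the phonological representation. In a trained model the weights would be learned jointly with the tagger rather than drawn at random.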
This research was supported by the National Natural Science Foundation of China (No. 61563040, No. 61773224), the Inner Mongolia Natural Science Foundation (No. 2016ZD06), and the Enhancing Comprehensive Strength Foundation of Inner Mongolia University (No. 10000-16010109-23).
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Liu, R., Bao, F., Gao, G., Zhang, H., Wang, Y. (2018). Phonologically Aware BiLSTM Model for Mongolian Phrase Break Prediction with Attention Mechanism. In: Geng, X., Kang, BH. (eds) PRICAI 2018: Trends in Artificial Intelligence. PRICAI 2018. Lecture Notes in Computer Science, vol 11012. Springer, Cham. https://doi.org/10.1007/978-3-319-97304-3_17
DOI: https://doi.org/10.1007/978-3-319-97304-3_17
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-97303-6
Online ISBN: 978-3-319-97304-3
eBook Packages: Computer Science (R0)