Phonologically Aware BiLSTM Model for Mongolian Phrase Break Prediction with Attention Mechanism

Liu, Rui; Bao, FeiLong; Gao, Guanglai; Zhang, Hui; Wang, Yonghe

doi:10.1007/978-3-319-97304-3_17


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11012)

Included in the following conference series:

Pacific Rim International Conference on Artificial Intelligence

Abstract

Phrase break prediction is the first and most important component in increasing the naturalness and intelligibility of text-to-speech (TTS) systems. Most existing work relies on language-specific resources, large annotated corpora, and feature engineering to perform well. However, phrase break prediction from text for Mongolian speech synthesis remains a great challenge because of the data sparsity problem caused by the scarcity of resources. In this paper, we introduce a Bidirectional Long Short-Term Memory (BiLSTM) model with an attention mechanism that uses position-based enhanced phonological representations, word embeddings, and character embeddings to achieve state-of-the-art performance. The position-based enhanced phonological representations, derived from a separate BiLSTM model, comprise phoneme and syllable embeddings that carry position information. By using an attention mechanism, the model can dynamically decide how much information to use from a word or phonological component. To handle the out-of-vocabulary (OOV) problem, we incorporate word, phonological, and character embeddings together as inputs to the model. Experimental results show that the proposed method significantly outperforms systems that use only word embeddings by successfully leveraging position-based phonological information and the attention mechanism.
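The abstract does not give the equations behind the attention mechanism, so the following is only a minimal sketch of one common way such a mechanism can be built: a per-dimension sigmoid gate that blends a word vector with a phonological vector. The function name `gated_combine`, the single-layer gate, and the parameters `weights` and `bias` are illustrative assumptions, not the formulation confirmed by the paper.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_combine(word_vec, phon_vec, weights, bias):
    """Blend two equal-length vectors with a learned per-dimension gate.

    Hypothetical formulation (an assumption, not taken from the paper):
        gate  = sigmoid(W . [word_vec; phon_vec] + b)
        mixed = gate * word_vec + (1 - gate) * phon_vec
    `weights` has one row per output dimension; each row has
    len(word_vec) + len(phon_vec) entries.
    """
    if len(word_vec) != len(phon_vec):
        raise ValueError("word and phonological vectors must have equal length")
    if len(weights) != len(word_vec) or len(bias) != len(word_vec):
        raise ValueError("need one weight row and one bias per dimension")

    concat = list(word_vec) + list(phon_vec)
    # One gate value per dimension, computed from both inputs.
    gate = [
        sigmoid(sum(w * c for w, c in zip(row, concat)) + b)
        for row, b in zip(weights, bias)
    ]
    # A gate near 1 favours the word vector, near 0 the phonological vector.
    mixed = [g * w + (1.0 - g) * p for g, w, p in zip(gate, word_vec, phon_vec)]
    return mixed, gate


if __name__ == "__main__":
    word = [1.0, 2.0]
    phon = [3.0, 4.0]
    zero_weights = [[0.0] * 4, [0.0] * 4]
    mixed, gate = gated_combine(word, phon, zero_weights, [0.0, 0.0])
    print("gate:", gate)
    print("mixed:", mixed)
```

In a trained model the gate parameters would be learned jointly with the BiLSTM, so the balance between word-level and phonological information could differ for every word and every dimension. With all-zero parameters, as in the demo, the gate weighs both inputs equally.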

This research was supported by the National Natural Science Foundation of China (No. 61563040, No. 61773224), the Inner Mongolia Natural Science Foundation (No. 2016ZD06), and the Enhancing Comprehensive Strength Foundation of Inner Mongolia University (No. 10000-16010109-23).

Author information

Authors and Affiliations

Inner Mongolia Key Laboratory of Mongolian Information Processing Technology, College of Computer Science, Inner Mongolia University, Hohhot, 010021, China
Rui Liu, FeiLong Bao, Guanglai Gao, Hui Zhang & Yonghe Wang

Corresponding author

Correspondence to FeiLong Bao.

Editor information

Editors and Affiliations

Southeast University, Nanjing, China
Xin Geng
University of Tasmania, Hobart, Tasmania, Australia
Byeong-Ho Kang

About this paper

Cite this paper

Liu, R., Bao, F., Gao, G., Zhang, H., Wang, Y. (2018). Phonologically Aware BiLSTM Model for Mongolian Phrase Break Prediction with Attention Mechanism. In: Geng, X., Kang, BH. (eds) PRICAI 2018: Trends in Artificial Intelligence. PRICAI 2018. Lecture Notes in Computer Science, vol 11012. Springer, Cham. https://doi.org/10.1007/978-3-319-97304-3_17

DOI: https://doi.org/10.1007/978-3-319-97304-3_17
Published: 27 July 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-97303-6
Online ISBN: 978-3-319-97304-3
eBook Packages: Computer Science, Computer Science (R0)
