Phonologically Aware BiLSTM Model for Mongolian Phrase Break Prediction with Attention Mechanism

  • Conference paper
PRICAI 2018: Trends in Artificial Intelligence (PRICAI 2018)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11012)

Abstract

Phrase break prediction is the first and most important component in increasing the naturalness and intelligibility of text-to-speech (TTS) systems. Most existing work relies on language-specific resources, large annotated corpora and feature engineering to perform well. However, phrase break prediction from text for Mongolian speech synthesis remains a great challenge because of the data sparsity caused by the scarcity of resources. In this paper, we introduce a Bidirectional Long Short-Term Memory (BiLSTM) model with an attention mechanism that uses position-based enhanced phonological representations, word embeddings and character embeddings to achieve state-of-the-art performance. The position-based enhanced phonological representations, derived from a separate BiLSTM model, consist of phoneme and syllable embeddings that carry position information. By using an attention mechanism, the model is able to dynamically decide how much information to use from a word or phonological component. To handle the out-of-vocabulary (OOV) problem, we incorporate word, phonological and character embeddings together as inputs to the model. Experimental results show that the proposed method significantly outperforms systems that use only word embeddings, by successfully leveraging position-based phonological information and the attention mechanism.
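The abstract does not state the exact form of the attention mechanism, so the following is only a minimal sketch of one common way a model can "decide how much information to use" from two representations: a sigmoid gate computed from the concatenated word and phonological vectors. The gate formula, the function name `gated_combine`, the parameter names and the toy dimensions are all assumptions made for illustration and are not taken from the paper.

```python
import math
import random


def sigmoid(x: float) -> float:
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_combine(word_vec, phon_vec, weights, bias):
    """Blend two equal-length vectors with a per-dimension gate.

    Assumed gate form (an illustration, not the paper's definition):
        z   = sigmoid(W @ [word_vec; phon_vec] + b)
        out = z * word_vec + (1 - z) * phon_vec
    """
    concat = list(word_vec) + list(phon_vec)
    # One gate value per output dimension, each strictly between 0 and 1.
    gate = [
        sigmoid(sum(w * x for w, x in zip(row, concat)) + b)
        for row, b in zip(weights, bias)
    ]
    # Per-dimension convex combination of the two input vectors.
    mixed = [z * w + (1.0 - z) * p for z, w, p in zip(gate, word_vec, phon_vec)]
    return mixed, gate


# Toy example: 4-dimensional embeddings with random (untrained) gate parameters.
rng = random.Random(0)
dim = 4
word_vec = [rng.uniform(-1, 1) for _ in range(dim)]
phon_vec = [rng.uniform(-1, 1) for _ in range(dim)]
weights = [[rng.uniform(-0.5, 0.5) for _ in range(2 * dim)] for _ in range(dim)]
bias = [0.0] * dim

mixed, gate = gated_combine(word_vec, phon_vec, weights, bias)
```

In this sketch a gate value near 1 keeps mostly the word embedding for that dimension, while a value near 0 keeps mostly the phonological representation. In a trained model the gate parameters would be learned jointly with the tagger rather than sampled at random.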

This research was supported by the National Natural Science Foundation of China (No. 61563040, No. 61773224), the Inner Mongolia Natural Science Foundation (No. 2016ZD06) and the Enhancing Comprehensive Strength Foundation of Inner Mongolia University (No. 10000-16010109-23).

Author information

Corresponding author

Correspondence to FeiLong Bao.

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Cite this paper

Liu, R., Bao, F., Gao, G., Zhang, H., Wang, Y. (2018). Phonologically Aware BiLSTM Model for Mongolian Phrase Break Prediction with Attention Mechanism. In: Geng, X., Kang, BH. (eds) PRICAI 2018: Trends in Artificial Intelligence. PRICAI 2018. Lecture Notes in Computer Science, vol 11012. Springer, Cham. https://doi.org/10.1007/978-3-319-97304-3_17

  • DOI: https://doi.org/10.1007/978-3-319-97304-3_17

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-97303-6

  • Online ISBN: 978-3-319-97304-3

  • eBook Packages: Computer Science, Computer Science (R0)
