AAT: An Efficient Model for Synthesizing Long Sequences on a Small Dataset

  • Quan Anh Minh Nguyen
  • Quang Trinh Le (corresponding author)
  • Huu Quoc Van
  • Duc Dung Nguyen
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11909)

Abstract

This work presents an alternative model for speech synthesis that addresses some major disadvantages of current end-to-end models. Current state-of-the-art models still struggle with long sentences and depend on large training datasets. Our proposed Adaptive Alignment Tacotron (AAT) model, however, achieves impressive results on a very small dataset of the Vietnamese language. By leveraging the naturally diagonal alignment between phoneme and acoustic sequences, we address the issue with long sequences, and the proposed model can also be trained efficiently on a small dataset (5 h of recordings). The proposed model consists of the following components: a stacked convolutional encoder, a local diagonal attention module, a decoder with scheduled teacher forcing that produces a coarse mel-spectrogram prediction, and a converter that transforms the mel-spectrogram into a linear spectrogram. Experimental results show that the proposed model achieves faster convergence and higher stability than the baseline model, opening a feasible approach to speech synthesis for languages with small datasets.
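The paper's own implementation is not reproduced here, but the diagonal-alignment idea the abstract relies on can be sketched concretely. The NumPy snippet below biases raw attention scores toward the text-to-frame diagonal with a Gaussian prior, in the spirit of guided attention; the function names and the width parameter g are illustrative assumptions, not the authors' exact formulation of the local diagonal attention module.

    import numpy as np

    def diagonal_prior(n_text, n_frames, g=0.2):
        # Gaussian weight that is ~1 near the diagonal f/F ~ t/T and
        # decays with squared distance from it; g controls the width.
        # (Hypothetical helper, not from the paper.)
        t = np.arange(n_text)[:, None] / max(n_text - 1, 1)
        f = np.arange(n_frames)[None, :] / max(n_frames - 1, 1)
        return np.exp(-((f - t) ** 2) / (2.0 * g ** 2))

    def local_diagonal_attention(scores, g=0.2):
        # Bias raw (text x frame) attention scores toward the diagonal,
        # then softmax over the text axis for each output frame.
        prior = diagonal_prior(*scores.shape, g=g)
        biased = scores + np.log(prior + 1e-8)
        e = np.exp(biased - biased.max(axis=0, keepdims=True))
        return e / e.sum(axis=0, keepdims=True)

    # Example: 40 phonemes attending over 200 mel frames.
    rng = np.random.default_rng(0)
    attn = local_diagonal_attention(rng.normal(size=(40, 200)))
    assert np.allclose(attn.sum(axis=0), 1.0)

Because the prior suppresses attention far from the diagonal, the decoder cannot skip or repeat long stretches of text, which is the failure mode that typically appears on long sentences; this is one plausible reading of why the diagonal constraint helps both long sequences and small-data training.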

Keywords

Text-to-speech · Attention · Encoder-decoder · Sequence-to-sequence · Tacotron · Vietnamese speech synthesis


Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Quan Anh Minh Nguyen (1)
  • Quang Trinh Le (1, corresponding author)
  • Huu Quoc Van (1)
  • Duc Dung Nguyen (1)
  1. Ho Chi Minh City University of Technology, Ho Chi Minh City, Vietnam