DNN-Based Duration Modeling for Synthesizing Short Sentences

  • Conference paper
  • Published in: Speech and Computer (SPECOM 2016)
  • Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9811)
Abstract

Statistical parametric speech synthesis conventionally utilizes decision-tree-clustered context-dependent hidden Markov models (HMMs) to model speech parameters. However, decision trees cannot capture complex context dependencies and fail to model the interaction between linguistic features. Recently, deep neural networks (DNNs) have been applied in speech synthesis, and they can address some of these limitations. This paper focuses on predicting phone durations in Text-to-Speech (TTS) systems with feedforward DNNs in the case of short sentences (sentences containing only one, two, or three syllables). To achieve better prediction accuracy, hyperparameter optimization was carried out with manual grid search. Recordings from a male and a female speaker were used to train the systems, and the outputs of various configurations were compared against conventional HMM-based solutions and natural speech. Objective evaluation results show that DNNs can outperform previous state-of-the-art solutions in duration modeling.
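The approach described in the abstract — a feedforward network regressing phone durations from linguistic context features, tuned by manual grid search — can be sketched as follows. This is an illustrative toy, not the paper's actual system: the feature vectors are synthetic stand-ins, and the layer sizes, learning rates, and grid values are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 12 numeric "linguistic context" features per
# phone (e.g. phone identity, position in syllable) and a duration target.
X = rng.random((400, 12))
true_w = rng.normal(size=12)
y = X @ true_w + 0.1 * rng.normal(size=400)  # synthetic "durations"

def train_mlp(X, y, hidden, lr, epochs=300):
    """One-hidden-layer feedforward net trained with plain gradient descent."""
    n, d = X.shape
    W1 = rng.normal(scale=0.1, size=(d, hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(scale=0.1, size=hidden)
    b2 = 0.0
    for _ in range(epochs):
        h = np.maximum(0.0, X @ W1 + b1)          # ReLU hidden layer
        err = (h @ W2 + b2) - y                   # prediction error
        # Backpropagated mean-squared-error gradients
        gW2 = h.T @ err / n
        gb2 = err.mean()
        dh = np.outer(err, W2) * (h > 0)
        gW1 = X.T @ dh / n
        gb1 = dh.mean(axis=0)
        W1 -= lr * gW1; b1 -= lr * gb1
        W2 -= lr * gW2; b2 -= lr * gb2
    return W1, b1, W2, b2

def rmse(params, X, y):
    W1, b1, W2, b2 = params
    pred = np.maximum(0.0, X @ W1 + b1) @ W2 + b2
    return float(np.sqrt(np.mean((pred - y) ** 2)))

# Manual grid search over two hyperparameters, evaluated on a held-out set.
X_tr, y_tr, X_dev, y_dev = X[:300], y[:300], X[300:], y[300:]
best = min(
    (rmse(train_mlp(X_tr, y_tr, h, lr), X_dev, y_dev), h, lr)
    for h in (8, 32) for lr in (0.05, 0.1)
)
print(f"best dev RMSE {best[0]:.3f} with hidden={best[1]}, lr={best[2]}")
```

The grid here is deliberately tiny; the paper's search spanned more dimensions (network depth, width, learning rate, etc.), but the pattern — train each configuration, score it on held-out data, keep the best — is the same.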


Acknowledgments

This research is partially supported by the Swiss National Science Foundation via the joint research project (SCOPES scheme) SP2: SCOPES project on speech prosody (SNSF n° IZ73Z0_152495-1).

Author information

Correspondence to Péter Nagy.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Nagy, P., Németh, G. (2016). DNN-Based Duration Modeling for Synthesizing Short Sentences. In: Ronzhin, A., Potapova, R., Németh, G. (eds) Speech and Computer. SPECOM 2016. Lecture Notes in Computer Science, vol. 9811. Springer, Cham. https://doi.org/10.1007/978-3-319-43958-7_30

  • DOI: https://doi.org/10.1007/978-3-319-43958-7_30
  • Publisher Name: Springer, Cham
  • Print ISBN: 978-3-319-43957-0
  • Online ISBN: 978-3-319-43958-7
