Fine Vocoder Tuning for HMM-Based Speech Synthesis: Effect of the Analysis Window Length

  • Agustin Alonso
  • Daniel Erro
  • Eva Navas
  • Inma Hernaez
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8854)


This paper studies how the length of the window used during spectral envelope estimation influences the perceptual quality of HMM-based speech synthesis. We show that the acoustic differences due to variations in the window length are audible. The experiments reveal an overall preference towards short analysis windows, although longer windows seem to alleviate some artifacts related to training data scarcity.


Keywords: vocoder, statistical parametric speech synthesis, harmonic analysis, window length





Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Agustin Alonso (1)
  • Daniel Erro (1, 2)
  • Eva Navas (1)
  • Inma Hernaez (1)

  1. AHOLAB, University of the Basque Country (UPV/EHU), Bilbao, Spain
  2. IKERBASQUE, Basque Foundation for Science, Bilbao, Spain
