Last Syllable Unit Penalization in Unit Selection TTS

  • Markéta Jůzová
  • Daniel Tihelka
  • Radek Skarnitzl
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10415)


While unit selection speech synthesis tries to avoid speech modifications, it strongly depends on placing units in the correct positions. Usually, the position is tightly coupled with the distance from the beginning or end of a prosodic or rhythmic unit such as a phrase or a word. The present paper shows, however, that it is not necessary to follow position requirements when phonetic knowledge about the perception of prosodic patterns (mostly durational in our case) is taken into account. In particular, we focus on the effects of using word-final units in word-internal positions in synthesized speech, which listeners often perceive negatively because of disruptions in local timing.
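The idea of penalizing word-final units in word-internal positions can be illustrated with a minimal sketch. This is not the authors' implementation: the `Unit` structure, the function names, and the penalty value `LAST_SYLLABLE_PENALTY` are all hypothetical, and the asymmetric rule (penalize only word-final candidates used word-internally) is an assumption drawn from the framing of the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """A candidate unit from the speech corpus (hypothetical structure)."""
    name: str
    in_last_syllable: bool  # True if the unit was recorded in a word-final syllable


# Hypothetical penalty weight; the value is illustrative only.
LAST_SYLLABLE_PENALTY = 10.0


def position_penalty(candidate: Unit, target_in_last_syllable: bool) -> float:
    """Penalize a word-final candidate placed in a word-internal target position."""
    if candidate.in_last_syllable and not target_in_last_syllable:
        return LAST_SYLLABLE_PENALTY
    return 0.0


def target_cost(candidate: Unit, target_in_last_syllable: bool, base_cost: float) -> float:
    """Add the position penalty to whatever other target-cost terms produced base_cost."""
    return base_cost + position_penalty(candidate, target_in_last_syllable)


def best_candidate(candidates_with_cost, target_in_last_syllable: bool) -> Unit:
    """Pick the candidate with the lowest total target cost.

    candidates_with_cost is a list of (Unit, base_cost) pairs.
    """
    best = min(
        candidates_with_cost,
        key=lambda pair: target_cost(pair[0], target_in_last_syllable, pair[1]),
    )
    return best[0]
```

With such a term, a word-final candidate with an otherwise low cost loses to a word-internal candidate when the target position is word-internal, while nothing changes for word-final targets.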


Keywords: Speech synthesis, Unit selection, Target cost, Word final lengthening


Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Markéta Jůzová (1)
  • Daniel Tihelka (1, email author)
  • Radek Skarnitzl (2)

  1. New Technologies for the Information Society (NTIS) and Department of Cybernetics, Faculty of Applied Sciences, University of West Bohemia, Pilsen, Czech Republic
  2. Faculty of Arts, Institute of Phonetics, Charles University, Prague, Czech Republic
