Abstract
Speech synthesis systems have to generate natural-sounding speech output from text. One of the key aspects of speech is prosody, which must be both natural (i.e., sounding like a human) and meaningful (i.e., sounding like a human who understands the contents of the text). The computation of prosody from text can be divided into two steps: the computation of prosodic tags from text, and the computation of acoustic speech features from these tags. This chapter focuses on the second step. It first provides an overview of prosody in human-human communication, including the communicative functions of prosody and its acoustic correlates. It then surveys the methods that have been used for prosody generation in speech synthesis, both historical and current. Special attention is paid to prosody generation in unit selection synthesis. In these methods, large corpora are searched for fragments of speech that match the phonemes and prosodic tags computed from text and that optimize various cost functions; prosody is not modeled and speech is not modified. We conclude the chapter by advocating hybrid approaches in which the search capabilities of unit selection methods are combined with the speech modification methods of more traditional approaches.
Abbreviations
- CART: classification and regression tree
- FC: functional contour
- ML: maximum-likelihood
- RMSE: root-mean-square error
- TTS: text-to-speech
- ToBI: tone and break indices
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Santen, J.v., Mishra, T., Klabbers, E. (2008). Prosodic Processing. In: Benesty, J., Sondhi, M.M., Huang, Y.A. (eds) Springer Handbook of Speech Processing. Springer Handbooks. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-49127-9_23
DOI: https://doi.org/10.1007/978-3-540-49127-9_23
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-49125-5
Online ISBN: 978-3-540-49127-9
eBook Packages: Engineering, Engineering (R0)