Predicting the Intonation of Discourse Segments from Examples in Dialogue Speech
In speech synthesis it is already possible to generate understandable speech with discourse-neutral prosody for simple written texts. At ATR-ITL, however, we are researching speech synthesis techniques for use in a speech translation environment. Dialogues in such an environment involve much richer forms of prosodic variation than are required for reading texts. For our translations to sound natural, our synthesis system must offer a wide range of prosodic variability, which can be described at an appropriate level of abstraction. This paper describes a multi-level intonation system which generates a fundamental frequency (F0) contour from input labelled with high-level discourse information, including speech act type and focussing information, as well as part of speech and syntactic constituent structure. The system is rule-driven, but its rules (and parameters) are derived from naturally spoken dialogues. Two experiments that use this model and test its accuracy are described. First, results are given for a system that predicts ToBI intonation labels from discourse information using a CART decision tree. Second, a detailed investigation of the intonational variation of the word “okay” in different discourse contexts is presented.
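As a rough illustration of the first experiment's general idea, the sketch below grows a small CART-style classification tree (binary splits chosen by Gini impurity) that maps discourse features to a ToBI-style accent label. It is not the system described here: the feature inventory is reduced to three of the features named above, the training rows are invented purely for illustration, and the names `TOY_DATA`, `build_tree`, and `predict` are hypothetical.

```python
from collections import Counter

# Invented toy rows (NOT taken from any corpus):
# (speech_act, in_focus, part_of_speech) -> ToBI-style accent label
TOY_DATA = [
    (("inform", True, "noun"), "H*"),
    (("inform", False, "noun"), "none"),
    (("inform", True, "verb"), "H*"),
    (("question", True, "noun"), "L*"),
    (("question", False, "verb"), "none"),
    (("question", True, "verb"), "L*"),
    (("acknowledge", False, "interjection"), "none"),
    (("inform", False, "verb"), "none"),
]


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows):
    """Return the (feature_index, value) test that most reduces the
    weighted Gini impurity, or None if no test improves on the parent."""
    best, best_score = None, gini([y for _, y in rows])
    n = len(rows)
    for i in range(len(rows[0][0])):
        for v in sorted({x[i] for x, _ in rows}, key=repr):
            left = [y for x, y in rows if x[i] == v]
            right = [y for x, y in rows if x[i] != v]
            if not left or not right:
                continue  # a split must send rows to both branches
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score - 1e-12:
                best, best_score = (i, v), score
    return best


def build_tree(rows):
    """Recursively grow the tree. Internal nodes are tuples
    (feature_index, value, yes_subtree, no_subtree); leaves are labels."""
    split = best_split(rows)
    if split is None:
        return Counter(y for _, y in rows).most_common(1)[0][0]
    i, v = split
    yes = [r for r in rows if r[0][i] == v]
    no = [r for r in rows if r[0][i] != v]
    return (i, v, build_tree(yes), build_tree(no))


def predict(tree, features):
    """Follow the equality tests from the root down to a leaf label."""
    while isinstance(tree, tuple):
        i, v, yes, no = tree
        tree = yes if features[i] == v else no
    return tree


TREE = build_tree(TOY_DATA)
```

Calling `predict(TREE, ("inform", True, "noun"))` walks the learned tests and returns one of the labels in the toy data. A real predictor would be trained on a labelled dialogue corpus with a much richer feature set, and would normally be pruned or validated on held-out data, which this sketch omits.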
Keywords: Speech Synthesis, Pitch Accent, Prosodic Phrase, Speech Synthesis System, Accented Word