Abstract
Automatic speech understanding and speech synthesis, two major speech processing applications, impose strikingly different constraints and requirements on prosodic models. The prevalent models of prosody and intonation fail to offer a unified solution to these conflicting constraints. As a consequence, prosodic models have been applied only occasionally in end-to-end automatic speech understanding systems; in contrast, they have been applied extensively in speech synthesis systems. In this chapter we aim to make explicit the reasons for this state of affairs by reviewing the role of prosodic modelling in these two fields of speech technology. Subsequently, possible strategies to overcome the shortcomings of the use of prosodic modelling in automatic speech processing are discussed. In particular, the question is raised whether or not there is a common framework for prosodic modelling in automatic speech understanding and speech synthesis systems, and if so, whether any particular model or theory of prosody can serve as a common ground. Finally, a catalogue of tasks in prosody research is proposed that ought to be relevant to both automatic speech understanding and speech synthesis and that might stimulate joint research activities.
Copyright information
© 2005 Springer
About this chapter
Cite this chapter
Batliner, A., Möbius, B. (2005). Prosodic Models, Automatic Speech Understanding, and Speech Synthesis: Towards the Common Ground? In: Barry, W.J., van Dommelen, W.A. (eds) The Integration of Phonetic Knowledge in Speech Technology. Text, Speech and Language Technology, vol 25. Springer, Dordrecht. https://doi.org/10.1007/1-4020-2637-4_3
DOI: https://doi.org/10.1007/1-4020-2637-4_3
Publisher Name: Springer, Dordrecht
Print ISBN: 978-1-4020-2635-5
Online ISBN: 978-1-4020-2637-9
eBook Packages: Humanities, Social Sciences and Law; Social Sciences (R0)