Prosodic Models, Automatic Speech Understanding, and Speech Synthesis: Towards the Common Ground?

Batliner, Anton; Möbius, Bernd

doi:10.1007/1-4020-2637-4_3

Part of the book series: Text, Speech and Language Technology (TLTB, volume 25)

Abstract

Automatic speech understanding and speech synthesis, two major speech processing applications, impose strikingly different constraints and requirements on prosodic models. The prevalent models of prosody and intonation fail to offer a unified solution to these conflicting constraints. As a consequence, prosodic models have been applied only occasionally in end-to-end automatic speech understanding systems; in contrast, they have been applied extensively in speech synthesis systems. In this chapter we aim to make explicit the reasons for this state of affairs by reviewing the role of prosodic modelling in these two fields of speech technology. Subsequently, possible strategies to overcome the shortcomings of the use of prosodic modelling in automatic speech processing are discussed. In particular, the question is raised whether or not there is a common framework for prosodic modelling in automatic speech understanding and speech synthesis systems, and if so, whether any particular model or theory of prosody can serve as a common ground. Finally, a catalogue of tasks in prosody research is proposed that ought to be relevant to both automatic speech understanding and speech synthesis and that might stimulate joint research activities.

Author information

Authors and Affiliations

Chair for Pattern Recognition, University of Erlangen-Nuremberg, Germany
Anton Batliner
Institute of Natural Language Processing, University of Stuttgart, Germany
Bernd Möbius

Editor information

Editors and Affiliations

Universität des Saarlandes, Saarbrücken, Germany
William J. Barry
Norwegian University of Science and Technology, Trondheim, Norway
Wim A. van Dommelen

About this chapter

Cite this chapter

Batliner, A., Möbius, B. (2005). Prosodic Models, Automatic Speech Understanding, and Speech Synthesis: Towards the Common Ground? In: Barry, W.J., van Dommelen, W.A. (eds) The Integration of Phonetic Knowledge in Speech Technology. Text, Speech and Language Technology, vol 25. Springer, Dordrecht. https://doi.org/10.1007/1-4020-2637-4_3

DOI: https://doi.org/10.1007/1-4020-2637-4_3
Publisher Name: Springer, Dordrecht
Print ISBN: 978-1-4020-2635-5
Online ISBN: 978-1-4020-2637-9
eBook Packages: Humanities, Social Sciences and Law; Social Sciences (R0)
