Prosodic Models, Automatic Speech Understanding, and Speech Synthesis: Towards the Common Ground?

The Integration of Phonetic Knowledge in Speech Technology

Part of the book series: Text, Speech and Language Technology ((TLTB,volume 25))

Abstract

Automatic speech understanding and speech synthesis, two major speech processing applications, impose strikingly different constraints and requirements on prosodic models. The prevalent models of prosody and intonation fail to offer a unified solution to these conflicting constraints. As a consequence, prosodic models have been applied only occasionally in end-to-end automatic speech understanding systems; in contrast, they have been applied extensively in speech synthesis systems. In this chapter we aim to make explicit the reasons for this state of affairs by reviewing the role of prosodic modelling in these two fields of speech technology. Subsequently, possible strategies to overcome the shortcomings of the use of prosodic modelling in automatic speech processing are discussed. In particular, the question is raised whether or not there is a common framework for prosodic modelling in automatic speech understanding and speech synthesis systems, and if so, whether any particular model or theory of prosody can serve as a common ground. Finally, a catalogue of tasks in prosody research is proposed that ought to be relevant to both automatic speech understanding and speech synthesis and that might stimulate joint research activities.


Copyright information

© 2005 Springer

About this chapter

Cite this chapter

Batliner, A., Möbius, B. (2005). Prosodic Models, Automatic Speech Understanding, and Speech Synthesis: Towards the Common Ground?. In: Barry, W.J., van Dommelen, W.A. (eds) The Integration of Phonetic Knowledge in Speech Technology. Text, Speech and Language Technology, vol 25. Springer, Dordrecht. https://doi.org/10.1007/1-4020-2637-4_3
