Abstract
This paper summarizes statistical modeling approaches for the use of prosody (the rhythm and melody of speech) in automatic recognition and understanding of speech. We outline effective prosodic feature extraction, model architectures, and techniques to combine prosodic with lexical (word-based) information. We then survey a number of applications of the framework, and give results for automatic sentence segmentation and disfluency detection, topic segmentation, dialog act labeling, and word recognition.
The research was supported by NSF Grants IRI-9314967, IRI-9618926, and IRI-9619921, by DARPA contract no. N66001-97-C-8544, and by NASA contract no. NCC 2-1256. Additional support came from the sponsors of the 1997 CLSP Workshop [7],[11] and from the DARPA Communicator project at UW and ICSI [8]. The views herein are those of the authors and should not be interpreted as representing the policies of the funding agencies.
We thank our many colleagues at SRI, ICSI, University of Washington (formerly at Boston University), and the 1997 Johns Hopkins CLSP Summer Workshop, who were instrumental in much of the work reported here.
References
D. Baron, E. Shriberg, AND A. Stolcke, Automatic punctuation and disfluency detection in multi-party meetings using prosodic and lexical cues, in Proceedings of the International Conference on Spoken Language Processing, Denver, Sept. 2002.
A. Batliner, B. Möbius, G. Möhler, A. Schweitzer, AND E. Nöth, Prosodic models, automatic speech understanding, and speech synthesis: toward the common ground, in Proceedings of the 7th European Conference on Speech Communication and Technology, P. Dalsgaard, B. Lindberg, H. Benner, and Z. Tan, eds., Vol. 4, Aalborg, Denmark, Sept. 2001, pp. 2285–2288.
G. Doddington, The Topic Detection and Tracking Phase 2 (TDT2) evaluation plan, in Proceedings DARPA Broadcast News Transcription and Understanding Workshop, Lansdowne, VA, Feb. 1998, Morgan Kaufmann, pp. 223–229. Revised version available from http://www.nist.gov/speech/tests/tdt/tdt98/.
P. Heeman AND J. Allen, Intonational boundaries, speech repairs, and discourse markers: Modeling spoken dialog, in Proceedings of the 35th Annual Meeting and 8th Conference of the European Chapter, Madrid, July 1997, Association for Computational Linguistics.
J. Hirschberg AND C. Nakatani, Acoustic indicators of topic segmentation, in Proceedings of the International Conference on Spoken Language Processing, R.H. Mannell and J. Robert-Ribes, eds., Sydney, Dec. 1998, Australian Speech Science and Technology Association, pp. 976–979.
M. Mast, R. Kompe, S. Harbeck, A. Kiessling, H. Niemann, E. Nöth, E.G. Schukat-Talamazzini, AND V. Warnke, Dialog act classification with the help of prosody, in Proceedings of the International Conference on Spoken Language Processing, H.T. Bunnell and W. Idsardi, eds., Vol. 3, Philadelphia, Oct. 1996, pp. 1732–1735.
E. Shriberg, R. Bates, A. Stolcke, P. Taylor, D. Jurafsky, K. Ries, N. Coccaro, R. Martin, M. Meteer, AND C. Van Ess-Dykema, Can prosody aid the automatic classification of dialog acts in conversational speech?, Language and Speech, 41 (1998), pp. 439–487.
E. Shriberg, A. Stolcke, AND D. Baron, Can prosody aid the automatic processing of multi-party meetings? Evidence from predicting punctuation, disfluencies, and overlapping speech, in Proceedings ISCA Tutorial and Research Workshop on Prosody in Speech Recognition and Understanding, M. Bacchiani, J. Hirschberg, D. Litman, and M. Ostendorf, eds., Red Bank, NJ, Oct. 2001, pp. 139–146.
E. Shriberg, A. Stolcke, D. Hakkani-Tür, AND G. Tür, Prosody-based automatic segmentation of speech into sentences and topics, Speech Communication, 32 (2000), pp. 127–154. Special Issue on Accessing Information in Spoken Audio.
K. Sönmez, E. Shriberg, L. Heck, AND M. Weintraub, Modeling dynamic prosodic variation for speaker verification, in Proceedings of the International Conference on Spoken Language Processing, R.H. Mannell and J. Robert-Ribes, eds., Vol. 7, Sydney, Dec. 1998, Australian Speech Science and Technology Association, pp. 3189–3192.
A. Stolcke, K. Ries, N. Coccaro, E. Shriberg, D. Jurafsky, P. Taylor, R. Martin, C. Van Ess-Dykema, AND M. Meteer, Dialogue act modeling for automatic tagging and recognition of conversational speech, Computational Linguistics, 26 (2000), pp. 339–373.
A. Stolcke, E. Shriberg, R. Bates, M. Ostendorf, D. Hakkani, M. Plauché, G. Tür, AND Y. Lu, Automatic detection of sentence boundaries and disfluencies based on recognized words, in Proceedings of the International Conference on Spoken Language Processing, R.H. Mannell and J. Robert-Ribes, eds., Vol. 5, Sydney, Dec. 1998, Australian Speech Science and Technology Association, pp. 2247–2250.
A. Stolcke, E. Shriberg, D. Hakkani-Tür, AND G. Tür, Modeling the prosody of hidden events for improved word recognition, in Proceedings of the 6th European Conference on Speech Communication and Technology, Vol. 1, Budapest, Sept. 1999, pp. 307–310.
P. Taylor, S. King, S. Isard, AND H. Wright, Intonation and dialog context as constraints for speech recognition, Language and Speech, 41 (1998), pp. 489–508.
G. Tür, D. Hakkani-Tür, A. Stolcke, AND E. Shriberg, Integrating prosodic and lexical cues for automatic topic segmentation, Computational Linguistics, 27 (2001), pp. 31–57.
N.M. Veilleux AND M. Ostendorf, Prosody/parse scoring and its applications in ATIS, in Proceedings of the ARPA Workshop on Human Language Technology, Plainsboro, NJ, Mar. 1993, pp. 335–340.
J. Yamron, I. Carp, L. Gillick, S. Lowe, AND P. Van Mulbregt, A hidden Markov model approach to text segmentation and event tracking, in Proceedings of the IEEE Conference on Acoustics, Speech, and Signal Processing, Vol. I, Seattle, WA, May 1998, pp. 333–336.
© 2004 Springer Science+Business Media New York
About this paper
Cite this paper
Shriberg, E., Stolcke, A. (2004). Prosody Modeling for Automatic Speech Recognition and Understanding. In: Johnson, M., Khudanpur, S.P., Ostendorf, M., Rosenfeld, R. (eds) Mathematical Foundations of Speech and Language Processing. The IMA Volumes in Mathematics and its Applications, vol 138. Springer, New York, NY. https://doi.org/10.1007/978-1-4419-9017-4_5
DOI: https://doi.org/10.1007/978-1-4419-9017-4_5
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4612-6484-2
Online ISBN: 978-1-4419-9017-4