Abstract
An accurate estimation of sentence units (SUs) in spontaneous speech is important for (1) helping listeners better understand speech content and (2) supporting other natural language processing tasks that require sentence information. There has been much research on automatic SU detection; however, most previous studies have used only lexical and prosodic cues, not nonverbal cues such as gesture. Gestures play an important role in human conversations: they provide semantic content, express emotional status, and regulate conversational structure. Given the close relationship between gestures and speech, gestures may contribute additional information to automatic SU detection. In this paper, we investigate the use of gesture cues to enhance SU detection. In particular, we focus on: (1) collecting multimodal data resources involving gestures and SU events in human conversations, (2) analyzing the collected data sets to enrich our knowledge about the co-occurrence of gestures and SUs, and (3) building statistical models for detecting SUs using speech and gestural cues. Our data analyses suggest that some gesture patterns influence a word boundary’s probability of being an SU. On the basis of these analyses, we propose a set of novel gestural features for SU detection. A combination of speech and gestural features was found to provide more accurate SU predictions than speech features alone in discriminative models. These findings support the view that human conversations are processes involving multimodal cues, and so are more effectively modeled using information from both verbal and nonverbal channels.
Notes
This paper supersedes the work in [9, 10]. It differs substantially: much of the modeling work was redone, measurement studies were added, and the speech baseline was substantially improved. The two previous papers presented preliminary results for the first author’s Ph.D. thesis proposal, supervised by the second author. This paper reports on the work in the final thesis document.
The multimodal data used in this study was much more expensive to collect and annotate since it required processing both audio and video events.
If the interval from the current word boundary to the end of the following word is not more than 2.0 s, we use this interval; otherwise we use only the interval of 2.0 s after the current word boundary.
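As a minimal illustration, this clipping rule can be sketched in Python (the function name and signature are hypothetical, not from the paper):

```python
def feature_interval(boundary_time, next_word_end, max_len=2.0):
    """Interval after the current word boundary used for feature
    extraction: it extends to the end of the following word, but is
    clipped to at most max_len seconds (2.0 s in this paper)."""
    end = min(next_word_end, boundary_time + max_len)
    return (boundary_time, end)
```

For example, a boundary at 10.0 s with the following word ending at 13.0 s yields the clipped interval (10.0, 12.0).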
The window length was selected as the one providing the optimal information gain ratio (IGR) on the development set, from candidate lengths ranging from 0.1 to 0.5 s.
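This selection step can be sketched as a simple grid search; `igr_on_dev` is a placeholder for the actual feature-evaluation procedure on the development set, not a function from the paper:

```python
def select_window_length(igr_on_dev, candidates=None):
    """Choose the window length with the highest information gain
    ratio (IGR) on the development set. `igr_on_dev` maps a window
    length in seconds to its IGR score (hypothetical callable)."""
    if candidates is None:
        # candidate lengths 0.1, 0.2, ..., 0.5 s, as in the paper
        candidates = [round(0.1 * k, 1) for k in range(1, 6)]
    return max(candidates, key=igr_on_dev)
```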
In our experiment, we used as the speech-only SU model an HMM trained on a large audio corpus, which is described in Section 7.1.
References
Argyle M (1988) Bodily communication, 2nd edn. Methuen, London
Beeferman D, Berger A, Lafferty J (1998) Cyberpunc: a lightweight punctuation annotation system for speech. In: Proceedings of the international conference of acoustics, speech, and signal processing (ICASSP)
Berger A, Pietra S, Pietra V (1996) A maximum entropy approach to natural language processing. Comput Linguist 22:39–72
Breiman L (1996) Bagging predictors. Mach Learn 24(2):123–140
Buntine W (1992) Learning classification trees. Stat Comput 2:63–73
Cassell J, Stone M (1999) Living hand to mouth: psychological theories about speech and gesture in interactive dialogue systems. In: Proceedings of the AAAI conference on artificial intelligence
Chai J, Hong P, Zhou M (2004) A probabilistic approach to reference resolution in multimodal user interfaces. In: Proceedings of the conference on intelligent user interface (IUI). ACM Press, pp 70–77
Chen C (1999) Speech recognition with automatic punctuation. In: Proceedings of the European conference on speech processing (EuroSpeech)
Chen L, Harper M, Huang Z (2006) Using maximum entropy (ME) model to incorporate gesture cues for SU detection. In: Proceedings of the international conference on multimodal interface (ICMI), Banff, Canada
Chen L, Liu Y, Harper M, Shriberg E (2004) Multimodal model integration for sentence unit detection. In: Proceedings of the international conference on multimodal interface (ICMI), University Park, PA
Chen L, Rose T, Qiao Y, Kimbara I, Parrill F, Welji H, Xu T, Tu J, Huang Z, Harper M, Quek F, Xiong Y, McNeill D, Tuttle R, Huang TS (2005) VACE multimodal meeting corpus. In: Proceedings of the joint workshop on machine learning and multimodal interaction (MLMI)
Chen S, Rosenfeld R (1999) A Gaussian prior for smoothing maximum entropy models. Tech. rep., Carnegie Mellon University
EARS (2002) DARPA EARS Program. http://projects.ldc.upenn.edu/EARS/
Eisenstein J, Davis R (2005) Gestural cues for sentence segmentation. MIT AI Memo
Eisenstein J, Davis R (2006) Gesture improves coreference resolution. In: Proceedings of the conference of the North American chapter of the association for computational linguistics (NAACL)
Eisenstein J, Davis R (2007) Conditional modality fusion for coreference resolution. In: Proceedings of the annual meeting of the association for computational linguistics (ACL)
Ekman P (1965) Communication through nonverbal behavior: a source of information about an interpersonal relationship. In: Tomkinds SS, Izard CE (eds) Affect, cognition and personality. Springer, New York, pp 390–442
Esposito A, McCullough K, Quek F (2001) Disfluencies in gesture: gestural correlates to filled and silent speech pauses. In: Proceedings of the IEEE workshop on cues in communication, Kauai, Hawaii
Fayyad U, Irani K (1992) On the handling of continuous-valued attributes in decision tree generation. Mach Learn 8:87–102
Garofolo J, Laprum C, Michel M, Stanford V, Tabassi E (2004) The NIST meeting room pilot corpus. In: Proceedings of the conference on language resources and evaluations (LREC)
Gotoh Y, Renals S (2000) Sentence boundary detection in broadcast speech transcripts. In: Proceedings of the international speech communication association (ISCA) workshop: automatic speech recognition: challenges for the new millennium ASR-2000
Huang Z, Harper M (2005) Speech and non-speech detection in meeting audio for transcription. In: Proceedings of NIST RT-05 workshop
Huang Z, Chen L, Harper M (2006) An open source prosodic feature extraction tool. In: Proceedings of the conference on language resources and evaluations (LREC)
Huang Z, Harper M, Wang W (2007) Mandarin part-of-speech tagging and discriminative reranking. In: Proceedings of the empirical methods in natural language processing (EMNLP), Prague, Czech Republic
Kendon A (1974) Movement coordination in social interaction: some examples described. In: Weitz S (ed) Nonverbal communication. Oxford University Press, New York, pp 119–133
Lafferty J, McCallum A, Pereira F (2001) Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Proceedings of the international conference on machine learning (ICML)
Linguistic Data Consortium (LDC) (2004) Meeting recording quick transcription guidelines, 1st edn. http://www.nist.gov/speech/test_beds/mr_proj/meeting_corpus_1/documents/pdf/MeetingDataQTRSpec-V1.3.pdf
Linguistic Data Consortium (LDC) (2004) Simple metadata annotation specification version 6.2, 6th edn. http://projects.ldc.upenn.edu/MDE/Guidelines/SimpleMDE_V6.2.pdf
Lehmann EL (2005) Testing statistical hypotheses, 3rd edn. Springer, New York
Liu Y (2004) Structural event detection for rich transcription of speech. Ph.D. thesis, Purdue University
Liu Y, Chawla N, Shriberg E, Stolcke A, Harper M (2003) Resampling techniques for sentence boundary detection: a case study in machine learning from imbalanced data for spoken language processing. Tech. rep., International Computer Science Institute
Liu Y, Stolcke A, Shriberg E, Harper M (2004) Comparing and combining generative and posterior probability models: some advances in sentence boundary detection in speech. In: Proceedings of the empirical methods in natural language processing (EMNLP)
Liu Y, Shriberg E, Stolcke A, Harper M (2005) Comparing HMM, maximum entropy, and conditional random fields for disfluency detection. In: Proceedings of Interspeech, Lisbon
Liu Y, Shriberg E, Stolcke A, Peskin B, Ang J, Hillard D, Ostendorf M, Tomalin M, Woodland P, Harper M (2005) Structural metadata research in the EARS program. In: Proceedings of the international conference of acoustics, speech, and signal processing (ICASSP)
McCallum A (2005) Mallet: a machine learning toolkit for language. http://mallet.cs.umass.edu
McNeill D (1992) Hand and mind: what gestures reveal about thought. Univ. Chicago Press, Chicago
Mehrabian A (1972) Nonverbal communication. Aidine-Atherton, Chicago
Morency LP, Quattoni A, Darrell T (2007) Latent-dynamic discriminative models for continuous gesture recognition. In: Proceedings of the IEEE computer vision and pattern recognition (CVPR)
Morgan N, Baron D, Bhagat S, Carvey H, Dhillon R, Edwards J, Gelbart D, Janin A, Krupski A, Peskin B, Pfau T, Shriberg E, Stolcke A, Wooters C (2003) Meetings about meetings: research at ICSI on speech in multiparty conversations. In: Proceedings of the international conference of acoustics, speech, and signal processing (ICASSP), vol 4, Hong Kong, pp 740–743
Qu S, Chai J (2006) Salience modeling based on non-verbal modalities for spoken language understanding. In: Proceedings of the international conference on multimodal interface (ICMI), Banff, Canada
Quek F, McNeill D, Bryll R, Duncan S, Ma X, Kirbas C, McCullough KE, Ansari R (2002) Multimodal human discourse: gesture and speech. ACM Trans Comput-Hum Interact 9(3):171–193
Quek F et al (2002) KDI: cross-modal analysis of signal and sense, data and computational resources for gesture, speech and gaze research. http://vislab.cs.vt.edu/KDI
Quinlan JR (1993) C4.5: programs for machine learning. Morgan Kaufmann, San Francisco
Rabiner LR, Juang BH (1986) An introduction to hidden Markov models. IEEE ASSP Mag 3(1):4–16
Roark B, Liu Y, Harper M, Stewart R, Lease M, Snover M, Shafran I, Dorr B, Hale J, Krasnyanskaya A, Yung L (2006) Reranking for sentence boundary detection in conversational speech. In: Proceedings of the international conference of acoustics, speech, and signal processing (ICASSP)
Rose T, Quek F, Shi Y (2004) MacVissta: a system for multimodal analysis. In: Proceedings of the international conference on multimodal interface (ICMI)
Shriberg E, Stolcke A, Hakkani-Tur D, Tur G (2000) Prosody-based automatic segmentation of speech into sentences and topics. Speech Commun 32(1–2):127–154
Stevenson M, Gaizauskas R (2000) Experiments on sentence boundary detection. In: Proceedings of the conference of the North American chapter of the association for computational linguistics (NAACL)
Stolcke A (2002) SRILM: an extensible language modeling toolkit. In: Proceedings of the international conference on spoken language processing (ICSLP)
Strassel S (2003) Simple metadata annotation specification, 5th edn. Linguistic Data Consortium
Sundaram R, Ganapathiraju A, Hamaker J, Picone J (2001) ISIP 2000 conversational speech evaluation system. In: Proceedings of the speech transcription workshop, College Park, Maryland
Witten IH, Frank E (2005) Data mining: practical machine learning tools and techniques. Morgan Kaufmann, San Francisco
Xiong Y, Quek F (2005) Meeting room configuration and multiple camera calibration in meeting analysis. In: Proceedings of the international conference on multimodal interface (ICMI), Trento, Italy
Zhang L (2005) Maximum entropy modeling toolkit for Python and C++. http://homepages.inf.ed.ac.uk/s0450736/maxent_toolkit.html
Chen, L., Harper, M.P. Utilizing gestures to improve sentence boundary detection. Multimed Tools Appl 51, 1035–1067 (2011). https://doi.org/10.1007/s11042-009-0436-z