Abstract
An accurate estimation of sentence units (SUs) in spontaneous speech is important for (1) helping listeners better understand speech content and (2) supporting other natural language processing tasks that require sentence information. There has been much research on automatic SU detection; however, most previous studies have used only lexical and prosodic cues, not nonverbal cues such as gesture. Gestures play an important role in human conversations: they provide semantic content, express emotional status, and regulate conversational structure. Given the close relationship between gestures and speech, gestures may contribute additional information to automatic SU detection. In this paper, we investigate the use of gesture cues to enhance SU detection. In particular, we focus on: (1) collecting multimodal data resources involving gestures and SU events in human conversations, (2) analyzing the collected data sets to enrich our knowledge about the co-occurrence of gestures and SUs, and (3) building statistical models for detecting SUs using speech and gestural cues. Our data analyses suggest that some gesture patterns influence a word boundary’s probability of being an SU. On the basis of these analyses, we propose a set of novel gestural features for SU detection. A combination of speech and gestural features was found to provide more accurate SU predictions than speech features alone in discriminative models. These findings support the view that human conversations are processes involving multimodal cues, and so are more effectively modeled using information from both verbal and nonverbal channels.
Notes
This paper supersedes the work in [9, 10]. It differs substantially: much of the modeling work was redone, measurement studies were added, and the speech baseline was substantially improved. The two previous papers presented preliminary results for the first author’s Ph.D. thesis proposal, supervised by the second author. This paper reports on the work in the final thesis document.
The multimodal data used in this study was much more expensive to collect and annotate since it required processing both audio and video events.
If the interval from the current word boundary to the end of the following word is not more than 2.0 s, we use this interval; otherwise we use only the interval of 2.0 s after the current word boundary.
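As a minimal illustration, this clipping rule can be sketched in Python (the function name and signature are hypothetical, not from the paper):

```python
def feature_interval(boundary_time, next_word_end, max_len=2.0):
    """Interval after the current word boundary used for feature
    extraction: it extends to the end of the following word, but is
    clipped to at most max_len seconds (2.0 s in this paper)."""
    end = min(next_word_end, boundary_time + max_len)
    return (boundary_time, end)
```

For example, a boundary at 10.0 s with the following word ending at 13.0 s yields the clipped interval (10.0, 12.0).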
The window length was selected as the one providing the optimal information gain ratio (IGR) on the development set, from candidate lengths ranging from 0.1 to 0.5 s.
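This selection step can be sketched as a simple grid search; `igr_on_dev` is a placeholder for the actual feature-evaluation procedure on the development set, not a function from the paper:

```python
def select_window_length(igr_on_dev, candidates=None):
    """Choose the window length with the highest information gain
    ratio (IGR) on the development set. `igr_on_dev` maps a window
    length in seconds to its IGR score (hypothetical callable)."""
    if candidates is None:
        # candidate lengths 0.1, 0.2, ..., 0.5 s, as in the paper
        candidates = [round(0.1 * k, 1) for k in range(1, 6)]
    return max(candidates, key=igr_on_dev)
```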
In our experiment, we used as the speech-only SU model an HMM trained on a large audio corpus, which is described in Section 7.1.
References
Argyle M (1988) Bodily communication, 2nd edn. Methuen, London
Beeferman D, Berger A, Lafferty J (1998) Cyberpunc: a lightweight punctuation annotation system for speech. In: Proceedings of the international conference of acoustics, speech, and signal processing (ICASSP)
Berger A, Pietra S, Pietra V (1996) A maximum entropy approach to natural language processing. Comput Linguist 22:39–72
Breiman L (1996) Bagging predictors. Mach Learn 24(2):123–140
Buntine W (1992) Learning classification trees. Stat Comput 2:63–73
Cassell J, Stone M (1999) Living hand to mouth: psychological theories about speech and gesture in interactive dialogue systems. In: Proceedings of the AAAI conference on artificial intelligence
Chai J, Hong P, Zhou M (2004) A probabilistic approach to reference resolution in multimodal user interfaces. In: Proceedings of the conference on intelligent user interface (IUI). ACM Press, pp 70–77
Chen C (1999) Speech recognition with automatic punctuation. In: Proceedings of the European conference on speech processing (EuroSpeech)
Chen L, Harper M, Huang Z (2006) Using maximum entropy (ME) model to incorporate gesture cues for SU detection. In: Proceedings of the international conference on multimodal interface (ICMI), Banff, Canada
Chen L, Liu Y, Harper M, Shriberg E (2004) Multimodal model integration for sentence unit detection. In: Proceedings of the international conference on multimodal interface (ICMI), University Park, PA
Chen L, Rose T, Qiao Y, Kimbara I, Parrill F, Welji H, Xu T, Tu J, Huang Z, Harper M, Quek F, Xiong Y, McNeill D, Tuttle R, Huang TS (2005) VACE multimodal meeting corpus. In: Proceedings of the joint workshop on machine learning and multimodal interaction (MLMI)
Chen S, Rosenfeld R (1999) A Gaussian prior for smoothing maximum entropy models. Tech. rep., Carnegie Mellon University
EARS (2002) DARPA EARS Program. http://projects.ldc.upenn.edu/EARS/
Eisenstein J, Davis R (2005) Gestural cues for sentence segmentation. MIT AI Memo
Eisenstein J, Davis R (2006) Gesture improves coreference resolution. In: Proceedings of the conference of the North American chapter of the association for computational linguistics (NAACL)
Eisenstein J, Davis R (2007) Conditional modality fusion for coreference resolution. In: Proceedings of the annual meeting of the association for computational linguistics (ACL)
Ekman P (1965) Communication through nonverbal behavior: a source of information about an interpersonal relationship. In: Tomkinds SS, Izard CE (eds) Affect, cognition and personality. Springer, New York, pp 390–442
Esposito A, McCullough K, Quek F (2001) Disfluencies in gesture: gestural correlates to filled and silent speech pauses. In: Proceedings of the IEEE workshop on cues in communication, Kauai, Hawaii
Fayyad U, Irani K (1992) On the handling of continuous-valued attributes in decision tree generation. Mach Learn 8:87–102
Garofolo J, Laprum C, Michel M, Stanford V, Tabassi E (2004) The NIST meeting room pilot corpus. In: Proceedings of the conference on language resources and evaluations (LREC)
Gotoh Y, Renals S (2000) Sentence boundary detection in broadcast speech transcripts. In: Proceedings of the international speech communication association (ISCA) workshop: automatic speech recognition: challenges for the new millennium ASR-2000
Huang Z, Harper M (2005) Speech and non-speech detection in meeting audio for transcription. In: Proceedings of NIST RT-05 workshop
Huang Z, Chen L, Harper M (2006) An open source prosodic feature extraction tool. In: Proceedings of the conference on language resources and evaluations (LREC)
Huang Z, Harper M, Wang W (2007) Mandarin part-of-speech tagging and discriminative reranking. In: Proceedings of the empirical methods in natural language processing (EMNLP), Prague, Czech Republic
Kendon A (1974) Movement coordination in social interaction: some examples described. In: Weitz S (ed) Nonverbal communication. Oxford University Press, New York, pp 119–133
Lafferty J, McCallum A, Pereira F (2001) Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Proceedings of the international conference on machine learning (ICML)
Linguistic Data Consortium (LDC) (2004) Meeting recording quick transcription guidelines, 1st edn. http://www.nist.gov/speech/test_beds/mr_proj/meeting_corpus_1/documents/pdf/MeetingDataQTRSpec-V1.3.pdf
Linguistic Data Consortium (LDC) (2004) Simple metadata annotation specification version 6.2, 6th edn. http://projects.ldc.upenn.edu/MDE/Guidelines/SimpleMDE_V6.2.pdf
Lehmann EL (2005) Testing statistical hypotheses, 3rd edn. Springer, New York
Liu Y (2004) Structural event detection for rich transcription of speech. Ph.D. thesis, Purdue University
Liu Y, Chawla N, Shriberg E, Stolcke A, Harper M (2003) Resampling techniques for sentence boundary detection: a case study in machine learning from imbalanced data for spoken language processing. Tech. rep., International Computer Science Institute
Liu Y, Stolcke A, Shriberg E, Harper M (2004) Comparing and combining generative and posterior probability models: some advances in sentence boundary detection in speech. In: Proceedings of the empirical methods in natural language processing (EMNLP)
Liu Y, Shriberg E, Stolcke A, Harper M (2005) Comparing HMM, maximum entropy, and conditional random fields for disfluency detection. In: Proceedings of Interspeech, Lisbon
Liu Y, Shriberg E, Stolcke A, Peskin B, Ang J, Hillard D, Ostendorf M, Tomalin M, Woodland P, Harper M (2005) Structural metadata research in the EARS program. In: Proceedings of the international conference of acoustics, speech, and signal processing (ICASSP)
McCallum A (2005) Mallet: a machine learning toolkit for language. http://mallet.cs.umass.edu
McNeill D (1992) Hand and mind: what gestures reveal about thought. Univ. Chicago Press, Chicago
Mehrabian A (1972) Nonverbal communication. Aidine-Atherton, Chicago
Morency LP, Quattoni A, Darrell T (2007) Latent-dynamic discriminative models for continuous gesture recognition. In: Proceedings of the IEEE computer vision and pattern recognition (CVPR)
Morgan N, Baron D, Bhagat S, Carvey H, Dhillon R, Edwards J, Gelbart D, Janin A, Krupski A, Peskin B, Pfau T, Shriberg E, Stolcke A, Wooters C (2003) Meetings about meetings: research at ICSI on speech in multiparty conversations. In: Proceedings of the international conference of acoustics, speech, and signal processing (ICASSP), vol 4, Hong Kong, pp 740–743
Qu S, Chai J (2006) Salience modeling based on non-verbal modalities for spoken language understanding. In: Proceedings of the international conference on multimodal interface (ICMI), Banff, Canada
Quek F, McNeill D, Bryll R, Duncan S, Ma X, Kirbas C, McCullough KE, Ansari R (2002) Multimodal human discourse: gesture and speech. ACM Trans Comput-Hum Interact 9(3):171–193
Quek F et al (2002) KDI: cross-modal analysis of signal and sense, data and computational resources for gesture, speech and gaze research. http://vislab.cs.vt.edu/KDI
Quinlan JR (1993) C4.5: programs for machine learning. Morgan Kaufmann, San Francisco
Rabiner LR, Juang BH (1986) An introduction to hidden Markov models. IEEE ASSP Mag 3(1):4–16
Roark B, Liu Y, Harper M, Stewart R, Lease M, Snover M, Shafran I, Dorr B, Hale J, Krasnyanskaya A, Yung L (2006) Reranking for sentence boundary detection in conversational speech. In: Proceedings of the international conference of acoustics, speech, and signal processing (ICASSP)
Rose T, Quek F, Shi Y (2004) MacVissta: a system for multimodal analysis. In: Proceedings of the international conference on multimodal interface (ICMI)
Shriberg E, Stolcke A, Hakkani-Tur D, Tur G (2000) Prosody-based automatic segmentation of speech into sentences and topics. Speech Commun 32(1–2):127–154
Stevenson M, Gaizauskas R (2000) Experiments on sentence boundary detection. In: Proceedings of the conference of the North American chapter of the association for computational linguistics (NAACL)
Stolcke A (2002) SRILM: an extensible language modeling toolkit. In: Proceedings of the international conference on spoken language processing (ICSLP)
Strassel S (2003) Simple metadata annotation specification, 5th edn. Linguistic Data Consortium
Sundaram R, Ganapathiraju A, Hamaker J, Picone J (2001) ISIP 2000 conversational speech evaluation system. In: Proceedings of the speech transcription workshop, College Park, Maryland
Witten IH, Frank E (2005) Data mining: practical machine learning tools and techniques. Morgan Kaufmann, San Francisco
Xiong Y, Quek F (2005) Meeting room configuration and multiple camera calibration in meeting analysis. In: Proceedings of the international conference on multimodal interface (ICMI), Trento, Italy
Zhang L (2005) Maximum entropy modeling toolkit for Python and C++. http://homepages.inf.ed.ac.uk/s0450736/maxent_toolkit.html
Chen, L., Harper, M.P. Utilizing gestures to improve sentence boundary detection. Multimed Tools Appl 51, 1035–1067 (2011). https://doi.org/10.1007/s11042-009-0436-z