Utilizing gestures to improve sentence boundary detection

Abstract

An accurate estimation of sentence units (SUs) in spontaneous speech is important for (1) helping listeners to better understand speech content and (2) supporting other natural language processing tasks that require sentence information. There has been much research on automatic SU detection; however, most previous studies have used only lexical and prosodic cues, not nonverbal cues such as gesture. Gestures play an important role in human conversation, including providing semantic content, expressing emotional status, and regulating conversational structure. Given the close relationship between gestures and speech, gestures may provide additional contributions to automatic SU detection. In this paper, we investigate the use of gesture cues for enhancing SU detection. In particular, we focus on: (1) collecting multimodal data resources involving gestures and SU events in human conversations, (2) analyzing the collected data sets to enrich our knowledge about the co-occurrence of gestures and SUs, and (3) building statistical models for detecting SUs using speech and gestural cues. Our data analyses suggest that some gesture patterns influence a word boundary's probability of being an SU. On the basis of these analyses, we propose a set of novel gestural features for SU detection. In discriminative models, a combination of speech and gestural features was found to provide more accurate SU predictions than speech features alone. These findings support the view that human conversation is a process involving multimodal cues, and so is more effectively modeled using information from both verbal and nonverbal channels.
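
As a concrete illustration of the discriminative combination described above, the sketch below fuses speech and gestural features in a logistic-regression classifier, a stand-in for the paper's maximum-entropy models; every feature name and value is a hypothetical placeholder rather than the authors' actual feature set.

```python
# A minimal sketch of discriminative SU detection over word boundaries,
# fusing speech (lexical/prosodic) and gestural cues. Logistic regression
# stands in for the maximum-entropy model; all feature names and values
# are hypothetical placeholders, not the authors' feature set.
import numpy as np
from sklearn.linear_model import LogisticRegression

# One row per inter-word boundary.
# Speech cues: [pause_duration_s, pitch_reset_score]
X_speech = np.array([[0.05, 0.1],
                     [0.80, 0.9],
                     [0.10, 0.2],
                     [1.20, 0.8]])
# Gesture cues: [gesture_hold_overlap, hands_at_rest]
X_gesture = np.array([[0.0, 0.0],
                      [0.7, 1.0],
                      [0.1, 0.0],
                      [0.9, 1.0]])
y = np.array([0, 1, 0, 1])  # 1 = this boundary ends a sentence unit (SU)

# Early fusion: concatenate the two feature streams per boundary.
X = np.hstack([X_speech, X_gesture])

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba(X)[:, 1])  # posterior P(SU | speech, gesture)
```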

Notes

  1. This paper supersedes the work in [9, 10]. It differs substantially in that much of the modeling work was redone and measurement studies were added; the speech baseline, for example, was substantially improved. The two previous papers were preliminary results for the first author's Ph.D. thesis proposal, supervised by the second author. This paper reports on the work in the final thesis document.

  2. The multimodal data used in this study were much more expensive to collect and annotate, since both audio and video events had to be processed.

  3. If the interval from the current word boundary to the end of the following word is at most 2.0 s, we use this interval; otherwise we use only the 2.0 s interval after the current word boundary (see the windowing sketch following these notes).

  4. The window length was selected, from candidate lengths of 0.1 to 0.5 s, as the one providing the optimal IGR on the development set (also illustrated in the windowing sketch following these notes).

  5. In our experiment, we used an HMM trained on a large audio corpus, which will be described in Section 7.1, as the speech-only SU model (a simplified decoding sketch follows these notes).
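
A minimal sketch of the windowing rules in notes 3 and 4, with hypothetical helper names; the IGR scorer is a toy stand-in, not the measure actually computed on the development set.

```python
# Sketch of the windowing rules in notes 3 and 4. Helper names are
# hypothetical, and a toy curve stands in for the real IGR computation.

MAX_INTERVAL = 2.0  # seconds (note 3)

def feature_interval(boundary_time, next_word_end):
    """Length of the interval after the current word boundary used for
    gesture feature extraction: up to the end of the following word,
    but never more than 2.0 s past the boundary."""
    return min(next_word_end - boundary_time, MAX_INTERVAL)

def select_window_length(candidates, igr_on_dev):
    """Pick the window length with the best IGR on the development set
    (note 4). `igr_on_dev` is an assumed callable mapping a window
    length to its information gain ratio on the dev data."""
    return max(candidates, key=igr_on_dev)

# Candidate window lengths from 0.1 to 0.5 s in 0.1-s steps.
candidates = [round(0.1 * k, 1) for k in range(1, 6)]
# Toy stand-in IGR curve that peaks at 0.3 s.
print(select_window_length(candidates, lambda w: -(w - 0.3) ** 2))  # 0.3
print(feature_interval(boundary_time=10.0, next_word_end=12.8))     # 2.0
```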
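
For note 5, a much-simplified decoding sketch of a hidden-event-style HMM: a two-state Viterbi pass over word boundaries with illustrative scores, not the model trained on the large corpus.

```python
# Much-simplified decoding sketch for the speech-only HMM of note 5:
# a two-state Viterbi pass over word boundaries. The illustrative scores
# stand in for the trained hidden-event language model and prosody model.
import math

STATES = ("NO_SU", "SU")

def viterbi_su(emission_logp, transition_logp):
    """emission_logp: one {state: log-score} dict per word boundary.
    transition_logp: {(prev_state, state): log-score} pairs.
    Returns the highest-scoring state sequence."""
    delta = [{s: emission_logp[0][s] for s in STATES}]
    back = []
    for t in range(1, len(emission_logp)):
        delta.append({})
        back.append({})
        for s in STATES:
            prev = max(STATES, key=lambda p: delta[t - 1][p] + transition_logp[(p, s)])
            back[-1][s] = prev
            delta[t][s] = delta[t - 1][prev] + transition_logp[(prev, s)] + emission_logp[t][s]
    path = [max(STATES, key=lambda s: delta[-1][s])]
    for bp in reversed(back):      # trace back the best path
        path.append(bp[path[-1]])
    return list(reversed(path))

# Toy example: three boundaries; the second looks like an SU end.
em = [{"NO_SU": math.log(0.9), "SU": math.log(0.1)},
      {"NO_SU": math.log(0.2), "SU": math.log(0.8)},
      {"NO_SU": math.log(0.7), "SU": math.log(0.3)}]
tr = {(p, s): math.log(0.5) for p in STATES for s in STATES}  # flat transitions
print(viterbi_su(em, tr))  # ['NO_SU', 'SU', 'NO_SU']
```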

References

  1. Argyle M (1988) Bodily communication, 2nd edn. Methuen, London

  2. Beeferman D, Berger A, Lafferty J (1998) Cyberpunc: a lightweight punctuation annotation system for speech. In: Proceedings of the international conference on acoustics, speech, and signal processing (ICASSP)

  3. Berger A, Della Pietra S, Della Pietra V (1996) A maximum entropy approach to natural language processing. Comput Linguist 22:39–72

  4. Breiman L (1996) Bagging predictors. Mach Learn 24(2):123–140

  5. Buntine W (1992) Learning classification trees. Stat Comput 2:63–73

  6. Cassell J, Stone M (1999) Living hand to mouth: psychological theories about speech and gesture in interactive dialogue systems. In: Proceedings of the AAAI conference on artificial intelligence

  7. Chai J, Hong P, Zhou M (2004) A probabilistic approach to reference resolution in multimodal user interfaces. In: Proceedings of the conference on intelligent user interface (IUI). ACM Press, pp 70–77

  8. Chen C (1999) Speech recognition with automatic punctuation. In: Proceedings of the European conference on speech processing (EuroSpeech)

  9. Chen L, Harper M, Huang Z (2006) Using maximum entropy (ME) model to incorporate gesture cues for SU detection. In: Proceedings of the international conference on multimodal interface (ICMI), Banff, Canada

  10. Chen L, Liu Y, Harper M, Shriberg E (2004) Multimodal model integration for sentence unit detection. In: Proceedings of the international conference on multimodal interface (ICMI), University Park, PA

  11. Chen L, Rose T, Qiao Y, Kimbara I, Parrill F, Welji H, Xu T, Tu J, Huang Z, Harper M, Quek F, Xiong Y, McNeill D, Tuttle R, Huang TS (2005) VACE multimodal meeting corpus. In: Proceedings of the joint workshop on machine learning and multimodal interaction (MLMI)

  12. Chen S, Rosenfeld R (1999) A Gaussian prior for smoothing maximum entropy models. Tech. rep., Carnegie Mellon University

  13. EARS (2002) DARPA EARS Program. http://projects.ldc.upenn.edu/EARS/

  14. Eisenstein J, Davis R (2005) Gestural cues for sentence segmentation. MIT AI Memo

  15. Eisenstein J, Davis R (2006) Gesture improves coreference resolution. In: Proceedings of the conference of the North American chapter of the association for computational linguistics (NAACL)

  16. Eisenstein J, Davis R (2007) Conditional modality fusion for coreference resolution. In: Proceedings of the annual meeting of the association for computational linguistics (ACL)

  17. Ekman P (1965) Communication through nonverbal behavior: a source of information about an interpersonal relationship. In: Tomkins SS, Izard CE (eds) Affect, cognition and personality. Springer, New York, pp 390–442

  18. Esposito A, McCullough K, Quek F (2001) Disfluencies in gesture: gestural correlates to filled and silent speech pauses. In: Proceedings of the IEEE workshop on cues in communication, Kauai, Hawaii

  19. Fayyad U, Irani K (1992) On the handling of continuous-valued attributes in decision tree generation. Mach Learn 8:87–102

  20. Garofolo J, Laprun C, Michel M, Stanford V, Tabassi E (2004) The NIST meeting room pilot corpus. In: Proceedings of the conference on language resources and evaluation (LREC)

  21. Gotoh Y, Renals S (2000) Sentence boundary detection in broadcast speech transcripts. In: Proceedings of the ISCA workshop on automatic speech recognition: challenges for the new millennium (ASR-2000)

  22. Huang Z, Harper M (2005) Speech and non-speech detection in meeting audio for transcription. In: Proceedings of NIST RT-05 workshop

  23. Huang Z, Chen L, Harper M (2006) An open source prosodic feature extraction tool. In: Proceedings of the conference on language resources and evaluation (LREC)

  24. Huang Z, Harper M, Wang W (2007) Mandarin part-of-speech tagging and discriminative reranking. In: Proceedings of the conference on empirical methods in natural language processing (EMNLP), Prague, Czech Republic

  25. Kendon A (1974) Movement coordination in social interaction: some examples described. In: Weitz S (ed) Nonverbal communication. Oxford University Press, New York, pp 119–133

  26. Lafferty J, McCallum A, Pereira F (2001) Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Proceedings of the international conference on machine learning (ICML)

  27. Linguistic Data Consortium (LDC) (2004) Meeting recording quick transcription guidelines, 1st edn. http://www.nist.gov/speech/test_beds/mr_proj/meeting_corpus_1/documents/pdf/MeetingDataQTRSpec-V1.3.pdf

  28. Linguistic Data Consortium (LDC) (2004) Simple metadata annotation specification version 6.2, 6th edn. http://projects.ldc.upenn.edu/MDE/Guidelines/SimpleMDE_V6.2.pdf

  29. Lehmann EL (2005) Testing statistical hypotheses, 3rd edn. Springer, New York

  30. Liu Y (2004) Structural event detection for rich transcription of speech. Ph.D. thesis, Purdue University

  31. Liu Y, Chawla N, Shriberg E, Stolcke A, Harper M (2003) Resampling techniques for sentence boundary detection: a case study in machine learning from imbalanced data for spoken language processing. Tech. rep., International Computer Science Institute

  32. Liu Y, Stolcke A, Shriberg E, Harper M (2004) Comparing and combining generative and posterior probability models: some advances in sentence boundary detection in speech. In: Proceedings of the conference on empirical methods in natural language processing (EMNLP)

  33. Liu Y, Shriberg E, Stolcke A, Harper M (2005) Comparing HMM, maximum entropy, and conditional random fields for disfluency detection. In: Proceedings of Interspeech, Lisbon

  34. Liu Y, Shriberg E, Stolcke A, Peskin B, Ang J, Hillard D, Ostendorf M, Tomalin M, Woodland P, Harper M (2005) Structural metadata research in the EARS program. In: Proceedings of the international conference on acoustics, speech, and signal processing (ICASSP)

  35. McCallum A (2005) MALLET: a machine learning for language toolkit. http://mallet.cs.umass.edu

  36. McNeill D (1992) Hand and mind: what gestures reveal about thought. University of Chicago Press, Chicago

  37. Mehrabian A (1972) Nonverbal communication. Aldine-Atherton, Chicago

  38. Morency LP, Quattoni A, Darrell T (2007) Latent-dynamic discriminative models for continuous gesture recognition. In: Proceedings of the IEEE computer vision and pattern recognition (CVPR)

  39. Morgan N, Baron D, Bhagat S, Carvey H, Dhillon R, Edwards J, Gelbart D, Janin A, Krupski A, Peskin B, Pfau T, Shriberg E, Stolcke A, Wooters C (2003) Meetings about meetings: research at ICSI on speech in multiparty conversations. In: Proceedings of the international conference on acoustics, speech, and signal processing (ICASSP), vol 4. Hong Kong, pp 740–743

  40. Qu S, Chai J (2006) Salience modeling based on non-verbal modalities for spoken language understanding. In: Proceedings of the international conference on multimodal interface (ICMI), Banff, Canada

  41. Quek F, McNeill D, Bryll R, Duncan S, Ma X, Kirbas C, McCullough KE, Ansari R (2002) Multimodal human discourse: gesture and speech. ACM Trans Comput-Hum Interact 9(3):171–193

  42. Quek F et al (2002) KDI: cross-modal analysis of signal and sense: data and computational resources for gesture, speech and gaze research. http://vislab.cs.vt.edu/KDI

  43. Quinlan JR (1993) C4.5: programs for machine learning. Morgan Kaufmann, San Francisco

  44. Rabiner LR, Juang BH (1986) An introduction to hidden Markov models. IEEE ASSP Mag 3(1):4–16

  45. Roark B, Liu Y, Harper M, Stewart R, Lease M, Snover M, Shafran I, Dorr B, Hale J, Krasnyanskaya A, Yung L (2006) Reranking for sentence boundary detection in conversational speech. In: Proceedings of the international conference on acoustics, speech, and signal processing (ICASSP)

  46. Rose T, Quek F, Shi Y (2004) MacVisSTA: a system for multimodal analysis. In: Proceedings of the international conference on multimodal interface (ICMI)

  47. Shriberg E, Stolcke A, Hakkani-Tur D, Tur G (2000) Prosody-based automatic segmentation of speech into sentences and topics. Speech Commun 32(1–2):127–154

  48. Stevenson M, Gaizauskas R (2000) Experiments on sentence boundary detection. In: Proceedings of the conference of the North American chapter of the association for computational linguistics (NAACL)

  49. Stolcke A (2002) SRILM—an extensible language modeling toolkit. In: Proceedings of the international conference on spoken language processing (ICSLP)

  50. Strassel S (2003) Simple metadata annotation specification, 5th edn. Linguistic Data Consortium

  51. Sundaram R, Ganapathiraju A, Hamaker J, Picone J (2001) ISIP 2000 conversational speech evaluation system. In: Proceedings of the speech transcription workshop, College Park, Maryland

  52. Witten IH, Frank E (2005) Data mining: practical machine learning tools and techniques. Morgan Kaufmann, San Francisco

  53. Xiong Y, Quek F (2005) Meeting room configuration and multiple camera calibration in meeting analysis. In: Proceedings of the international conference on multimodal interface (ICMI), Trento, Italy

  54. Zhang L (2005) Maximum Entropy Modeling Toolkit for Python and C++. http://homepages.inf.ed.ac.uk/s0450736/maxent_toolkit.html

Author information

Corresponding author

Correspondence to Lei Chen.

About this article

Cite this article

Chen, L., Harper, M.P. Utilizing gestures to improve sentence boundary detection. Multimed Tools Appl 51, 1035–1067 (2011). https://doi.org/10.1007/s11042-009-0436-z
