Speech Activity Detection on Multichannels of Meeting Recordings

  • Conference paper

Part of the Lecture Notes in Computer Science book series (LNISA, volume 3869)

Abstract

The Purdue SAD system was originally designed to identify speech regions in multichannel meeting recordings with the goal of focusing transcription effort on regions containing speech. In the NIST RT-05S evaluation, this system was evaluated in the ihm condition of the speech activity detection task. The goal for this task condition is to separate the voice of the speaker on each channel from silence and crosstalk. Our system consists of several steps and does not require a training set. It starts with a simple silence detection algorithm that utilizes pitch and energy to roughly separate silence from speech and crosstalk. A global Bayesian Information Criterion (BIC) is integrated with a Viterbi segmentation algorithm that divides the concatenated stream of local speech and crosstalk into homogeneous portions, which allows an energy-based clustering process to then separate local speech and crosstalk. The second step makes use of the obtained segment information to iteratively train a Gaussian mixture model for each speech activity category and decode the whole sequence over an ergodic network to refine the segmentation. The final step first uses a cross-correlation analysis to eliminate crosstalk, and then applies a batch of post-processing operations to adjust the segments to the evaluation scenario. In this paper, we describe our system and discuss various issues related to its evaluation.
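To make the role of the BIC in segmentation more concrete, the following is a minimal sketch of the standard local ΔBIC test for a single candidate change point, with each segment modeled as one full-covariance Gaussian. It is not the paper's method: the abstract describes a global BIC integrated with Viterbi segmentation, whose details are not given here. The function name `delta_bic`, the `penalty_weight` parameter, and the small covariance regularization term are assumptions introduced only for illustration.

```python
import numpy as np

def delta_bic(frames, split, penalty_weight=1.0):
    """Local delta-BIC for splitting `frames` (n_frames, n_dims) at `split`.

    A positive value favors two separate Gaussians (a change point);
    a negative value favors a single Gaussian (a homogeneous segment).
    Illustrative sketch only, not the global BIC used in the paper.
    """
    frames = np.asarray(frames, dtype=float)
    n, d = frames.shape
    left, right = frames[:split], frames[split:]

    def logdet_cov(x):
        # Maximum-likelihood covariance; the small ridge is an assumption
        # added here to keep the determinant finite on short segments.
        cov = np.atleast_2d(np.cov(x, rowvar=False, bias=True))
        cov = cov + 1e-6 * np.eye(d)
        _, logdet = np.linalg.slogdet(cov)
        return logdet

    # Log-likelihood gain from modeling the two halves separately.
    gain = 0.5 * (n * logdet_cov(frames)
                  - len(left) * logdet_cov(left)
                  - len(right) * logdet_cov(right))

    # Model-complexity penalty: mean (d) plus covariance (d(d+1)/2) parameters.
    n_params = d + d * (d + 1) / 2
    penalty = 0.5 * penalty_weight * n_params * np.log(n)
    return gain - penalty
```

With synthetic two-dimensional features, a stream whose halves are drawn from clearly different distributions should give a positive score at the true boundary, while a stream drawn from a single distribution should give a negative one. Real features such as energy or cepstral coefficients would replace the synthetic data.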

Keywords

  • False Alarm
  • Gaussian Mixture Model
  • Speech Activity
  • Speech Segment
  • Speech Frame

These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

  • DOI: 10.1007/11677482_35
  • Chapter length: 13 pages
Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Huang, Z., Harper, M.P. (2006). Speech Activity Detection on Multichannels of Meeting Recordings. In: Renals, S., Bengio, S. (eds) Machine Learning for Multimodal Interaction. MLMI 2005. Lecture Notes in Computer Science, vol 3869. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11677482_35

  • DOI: https://doi.org/10.1007/11677482_35

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-32549-9

  • Online ISBN: 978-3-540-32550-5

  • eBook Packages: Computer Science, Computer Science (R0)