The 2006 Athens Information Technology Speech Activity Detection and Speaker Diarization Systems

  • Conference paper
Machine Learning for Multimodal Interaction (MLMI 2006)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4299)

Abstract

This paper describes the Speech Activity Detection (SAD) and Speaker Diarization (SPKR) systems developed by Athens Information Technology for the NIST RT-06S evaluations. The SAD system classifies recorded frames as speech or non-speech using Linear Discriminant Analysis (LDA). The SPKR system first segments recordings into speech intervals based on the Bayesian Information Criterion (BIC), and then applies a two-step clustering strategy to group segments from the same speaker. Following a discussion of the internals of the two systems, we report and comment on our results on the RT-06S corpus [20].
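As an illustration of the BIC-based segmentation step mentioned in the abstract, the following is a minimal sketch of a standard ΔBIC change-point test between one Gaussian fitted to a whole window and two Gaussians fitted to its halves. It is not the authors' implementation: the abstract does not give the features, covariance model, or penalty weight, so the diagonal covariance, the `penalty_weight` and `margin` parameters, and all function names here are assumptions chosen to keep the example dependency-free.

```python
import math


def diag_log_det(frames):
    """Log-determinant of the maximum-likelihood diagonal covariance of frames."""
    n = len(frames)
    dim = len(frames[0])
    total = 0.0
    for k in range(dim):
        mean = sum(f[k] for f in frames) / n
        var = sum((f[k] - mean) ** 2 for f in frames) / n
        total += math.log(max(var, 1e-10))  # floor avoids log(0)
    return total


def delta_bic(frames, i, penalty_weight=1.0):
    """Delta-BIC for splitting frames at index i; positive values favour a change."""
    n = len(frames)
    dim = len(frames[0])
    left, right = frames[:i], frames[i:]
    # Log-likelihood gain of two Gaussians over one.
    gain = 0.5 * (n * diag_log_det(frames)
                  - len(left) * diag_log_det(left)
                  - len(right) * diag_log_det(right))
    # Assumption: diagonal Gaussians, so the split adds 2*dim free parameters
    # (one extra mean vector and one extra variance vector).
    penalty = 0.5 * penalty_weight * (2 * dim) * math.log(n)
    return gain - penalty


def find_change_point(frames, margin=10, penalty_weight=1.0):
    """Return (index, score) of the best split, or (None, score) if none is positive."""
    best_i, best_score = None, float("-inf")
    for i in range(margin, len(frames) - margin + 1):
        score = delta_bic(frames, i, penalty_weight)
        if score > best_score:
            best_i, best_score = i, score
    if best_score <= 0.0:
        return None, best_score
    return best_i, best_score


if __name__ == "__main__":
    # Toy 1-D "features": 50 frames around 0 followed by 50 frames around 10.
    toy = ([[-1.0 if k % 2 == 0 else 1.0] for k in range(50)]
           + [[9.0 if k % 2 == 0 else 11.0] for k in range(50)])
    print(find_change_point(toy))
```

In a real system the frames would be cepstral feature vectors, full covariance matrices are common, and the penalty weight is usually tuned on development data; the test would also be applied over a sliding or growing window rather than once over a whole recording.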

References

  1. Weiser, M.: The Computer for the 21st Century. Scientific American 265(3), 66–75 (1991)

  2. Waibel, A., Steusloff, H., Stiefelhagen, R., et al.: CHIL: Computers in the Human Interaction Loop. In: 5th International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS), Lisbon, Portugal (April 2004)

  3. Pnevmatikakis, A., Talantzis, F., Soldatos, J., Polymenakos, L.: Robust Multimodal Audio-Visual Processing for Advanced Context Awareness in Smart Spaces. In: Maglogiannis, I., Karpouzis, K., Bramer, M. (eds.) Artificial Intelligence Applications and Innovations (AIAI 2006), pp. 290–301. Springer, Heidelberg (2006)

  4. http://www.clear-evaluation.org/

  5. Katsarakis, N., Souretis, G., Talantzis, F., Pnevmatikakis, A., Polymenakos, L.: 3D Audiovisual Person Tracking Using Kalman Filtering and Information Theory. In: Stiefelhagen, R., Garofolo, J.S. (eds.) CLEAR 2006. LNCS, vol. 4122, pp. 45–54. Springer, Heidelberg (2007)

  6. Stergiou, A., Pnevmatikakis, A., Polymenakos, L.: A Decision Fusion System across Time and Classifiers for Audio-visual Person Identification. In: Stiefelhagen, R., Garofolo, J.S. (eds.) CLEAR 2006. LNCS, vol. 4122. Springer, Heidelberg (2007)

  7. Stergiou, A., Pnevmatikakis, A., Polymenakos, L.: Enhancing the Performance of a GMM-based Speaker Identification System in a Multi-Microphone Setup. In: INTERSPEECH 2006, Pittsburgh (accepted, September 2006)

  8. Rabiner, L.R., Sambur, M.R.: An algorithm for determining the endpoints of isolated utterances. The Bell System Technical Journal 54, 297 (1975)

  9. Li, K., Swamy, M.N.S., Ahmad, M.O.: An Improved Voice Activity Detection Using Higher Order Statistics. IEEE Transactions on Speech and Audio Processing 13(5) (September 2005)

  10. Stegmann, J., Schroeder, G.: Robust Voice Activity Detection Based on the Wavelet Transform. In: Proc. IEEE Workshop on Speech Coding For Telecommunications, Pocono Manor, Pennsylvania, USA, pp. 99–100 (September 1997)

  11. Reynolds, D.A., Rose, R.C., Smith, M.J.T.: PC-Based TMS320C30 Implementation of the Gaussian Mixture Model Text-Independent Speaker Recognition System. In: International Conference on Signal Processing Applications and Technology, Hyatt Regency, Cambridge, Massachusetts, pp. 967–973 (November 1992)

  12. Martin, A., Charlet, C., Mauary, L.: Robust Speech/Non-Speech Detection Using LDA Applied to MFCC. In: IEEE International Conference on Acoustics, Speech, and Signal Processing, Salt Lake City (2001)

  13. Duda, R., Hart, P., Stork, D.: Pattern Classification. Wiley-Interscience, New York (2001)

  14. Rabiner, L., Schafer, R.: Digital Processing of Speech Signals. Prentice Hall Series in Signal Processing (September 1978)

  15. Wu, T.-Y., Lu, L., Chen, K., Zhang, H.-J.: Universal Background Models for Real-Time Speaker Change Detection. In: MMM 2003, pp. 135–149 (2003)

  16. Moraru, D., Meignier, S., Fredouille, C., Besacier, L., Bonastre, J.-F.: The ELISA consortium approaches in broadcast news speaker segmentation during the NIST 2003 rich transcription evaluation. In: Proceedings of International Conference on Acoustics Speech and Signal Processing (ICASSP 2004), Montreal, Canada (2004)

  17. Gauvain, J.L., Lamel, L., Adda, G.: Partitioning and transcription of broadcast news data. In: International Conference on Speech and Language Processing, Sydney, Australia, vol. 4, pp. 1335–1338 (December 1998)

  18. Tritschler, A., Gopinath, R.: Improved speaker segmentation and segments clustering using the Bayesian Information Criterion. In: Proc. of Eurospeech, pp. 679–682 (1999)

  19. Reynolds, D.A., Rose, R.C.: Robust text-independent speaker identification using Gaussian mixture speaker models. IEEE Transactions on Speech and Audio Processing 3(1), 72–83 (1995)

  20. Fiscus, J.: Spring 2006 (RT-06S) Rich Transcription Meeting Recognition Evaluation Plan (v2) (2006), http://www.nist.gov/speech/tests/rt/rt2006/spring/docs/rt06s-meeting-eval-plan-V2.pdf

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Rentzeperis, E., Stergiou, A., Boukis, C., Pnevmatikakis, A., Polymenakos, L.C. (2006). The 2006 Athens Information Technology Speech Activity Detection and Speaker Diarization Systems. In: Renals, S., Bengio, S., Fiscus, J.G. (eds) Machine Learning for Multimodal Interaction. MLMI 2006. Lecture Notes in Computer Science, vol 4299. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11965152_34

  • DOI: https://doi.org/10.1007/11965152_34

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-69267-6

  • Online ISBN: 978-3-540-69268-3

  • eBook Packages: Computer Science, Computer Science (R0)
