
The TNO Speaker Diarization System for NIST RT05s Meeting Data

  • David A. van Leeuwen
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3869)

Abstract

The TNO speaker diarization system is based on a standard BIC segmentation and clustering algorithm. Because correct speech detection appears to be essential for the NIST Rich Transcription speaker diarization evaluation measure, we have also developed a speech activity detector (SAD). The SAD decodes the speech signal using two Gaussian Mixture Models, one trained on silence and one trained on speech. It was trained only on AMI development test data and performed quite well in the evaluation across all 5 meeting locations, with a SAD error rate of 5.0 %. For the speaker clustering algorithm we optimized the BIC penalty parameter λ to 14, which is quite high compared with the theoretical value of 1. The final speaker diarization error rate was 35.1 %.
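
As an illustration of how the penalty parameter λ enters a BIC decision, the following is a minimal sketch of the commonly used ΔBIC test between two segments, each modelled by a single full-covariance Gaussian. The abstract does not give the system's exact formulation, so the function name `delta_bic`, the feature layout (frames by dimensions), and the sign convention (positive means "keep the segments separate") are assumptions made for this example only.

```python
import numpy as np

def delta_bic(x, y, lam=1.0):
    """Delta-BIC between segments x and y (arrays of shape frames x dims).

    Hypothetical helper: positive values favour two separate Gaussians
    (a speaker change), negative values favour merging the segments.
    """
    z = np.vstack([x, y])
    n1, n2, n = len(x), len(y), len(z)
    d = z.shape[1]

    def logdet(a):
        # Log-determinant of the maximum-likelihood covariance estimate.
        cov = np.atleast_2d(np.cov(a, rowvar=False, bias=True))
        _, value = np.linalg.slogdet(cov)
        return value

    # Log-likelihood gain from modelling the segments separately.
    gain = 0.5 * (n * logdet(z) - n1 * logdet(x) - n2 * logdet(y))
    # Model-complexity penalty: d mean terms plus d(d+1)/2 covariance terms,
    # scaled by lam (lam = 1 is the theoretical BIC value).
    penalty = 0.5 * lam * (d + d * (d + 1) / 2) * np.log(n)
    return gain - penalty
```

In this sketch, raising λ from 1 to 14 enlarges the penalty by the same factor, so fewer boundaries survive and more segments are merged, which is one way to read the high tuned value reported above.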

Keywords

Bayesian Information Criterion, Gaussian Mixture Model, Speech Recognition System, Universal Background Model, Speaker Diarization
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.



Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • David A. van Leeuwen
    • TNO Human Factors, Soesterberg, The Netherlands
