International Journal of Speech Technology, Volume 16, Issue 4, pp 513–523

A new approach of speaker clustering based on the stereophonic differential energy

  • S. Ouamour
  • H. Sayoud


In this paper, we present a new approach to speech clustering with respect to speaker identity. It consists of grouping the homogeneous speech segments obtained at the end of the segmentation process, using the spatial information provided by the stereophonic speech signals. The proposed method uses the differential energy of the two stereophonic signals, collected by two cardioid microphones, to cluster all the speech segments belonging to the same speaker. The total number of clusters obtained at the end should equal the real number of speakers present in the meeting room, and each cluster should contain the entire intervention of only one speaker. The proposed system is suitable for debates or multi-conferences in which the speakers are located at fixed positions.

Basically, our approach performs speaker localization relative to the positions of the microphones, which serve as a spatial reference. Based on this localization, the proposed method can recognize the speaker identity of any speech segment during the meeting. Thus, each speaker's intervention is automatically detected and assigned to that speaker by estimating his or her relative position.
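The core idea above, localizing the active speaker from the energy imbalance between the two microphone channels, can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the log-ratio feature, the 3 dB merging threshold, and the greedy one-dimensional clustering are all assumptions made for the example.

```python
import math

def differential_energy(left, right, eps=1e-12):
    """Log-ratio (in dB) of the short-term energies of the two
    stereo channels of one speech segment.

    A speaker seated nearer the left microphone yields a positive
    value, one nearer the right a negative value (illustrative
    feature, not the paper's exact formulation).
    """
    e_left = sum(x * x for x in left) + eps
    e_right = sum(x * x for x in right) + eps
    return 10.0 * math.log10(e_left / e_right)

def cluster_by_energy_difference(segments, threshold_db=3.0):
    """Greedy 1-D clustering of segments by differential energy.

    segments: list of (left_samples, right_samples) pairs.
    A segment joins the cluster whose centroid lies within
    `threshold_db`; otherwise it opens a new cluster, i.e. a new
    speaker position.
    """
    clusters = []  # each: {"centroid": dB value, "segments": [indices]}
    for idx, (left, right) in enumerate(segments):
        d = differential_energy(left, right)
        best = None
        for c in clusters:
            if abs(c["centroid"] - d) <= threshold_db:
                if best is None or abs(c["centroid"] - d) < abs(best["centroid"] - d):
                    best = c
        if best is None:
            clusters.append({"centroid": d, "segments": [idx]})
        else:
            best["segments"].append(idx)
            # keep the centroid as the running mean of member values
            best["centroid"] += (d - best["centroid"]) / len(best["segments"])
    return clusters
```

Because the feature is one-dimensional and tied to fixed speaker positions, even a very short segment yields a usable estimate, which is consistent with the robustness to short segments reported below.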

For comparison, two types of clustering methods have been implemented and evaluated: the new approach, which we call Energy Differential based Spatial Clustering (EDSC), and a classical statistical approach called Mono-Gaussian based Sequential Clustering (MGSC).
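The MGSC baseline is only named in this abstract, so the sketch below shows a generic mono-Gaussian sequential clustering of the kind the name refers to: each segment is modeled by a single Gaussian over a one-dimensional feature, and a segment joins an existing cluster when a symmetric Kullback-Leibler divergence to that cluster's model falls below a threshold. The feature, the divergence choice, and the threshold value are illustrative assumptions, not the paper's settings.

```python
def gaussian_params(values):
    """Mean and (floored) variance of a 1-D feature sequence."""
    n = len(values)
    mu = sum(values) / n
    var = sum((v - mu) ** 2 for v in values) / n
    return mu, max(var, 1e-12)  # floor avoids a degenerate variance

def sym_kl(p, q):
    """Symmetric Kullback-Leibler divergence between two 1-D Gaussians."""
    (m1, v1), (m2, v2) = p, q
    d = (m1 - m2) ** 2
    return (v1 + d) / (2 * v2) + (v2 + d) / (2 * v1) - 1.0

def sequential_clustering(feature_streams, threshold=2.0):
    """Assign each segment to the closest existing Gaussian model,
    or open a new cluster when no model is close enough."""
    models = []   # one (mu, var) model per cluster
    labels = []
    for stream in feature_streams:
        g = gaussian_params(stream)
        dists = [sym_kl(g, m) for m in models]
        if dists and min(dists) < threshold:
            labels.append(dists.index(min(dists)))
        else:
            models.append(g)
            labels.append(len(models) - 1)
    return labels
```

Unlike EDSC, such a statistical baseline must estimate a model from each segment's samples, which is exactly where very short segments give unreliable estimates.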

Speaker clustering experiments are carried out on a stereophonic speech corpus called DB15, composed of 15 stereophonic scenarios of about 3.5 minutes each. Each scenario corresponds to a free discussion between two or three speakers seated at fixed positions in the meeting room.

Results show the outstanding performance of the new approach in terms of precision and speed, especially for short speech segments, on which most clustering techniques fail badly.


Keywords: Speaker clustering · Speaker diarization · Spatial clustering · Spatial speaker localization · Speaker recognition · Stereophonic speech



Copyright information

© Springer Science+Business Media New York 2013

Authors and Affiliations

  1. Electronics and Computer Engineering Institute, USTHB University, Algiers, Algeria
