Abstract

Speaker diarization is the task of determining “Who spoke when?”: the objective is to annotate a continuous audio recording with speaker labels for the time regions in which each speaker spoke. The labels need not be the actual speaker identities (that would be speaker identification), as long as the same label is assigned to all regions uttered by the same speaker. These regions may overlap, since multiple speakers can talk simultaneously. Speaker diarization is thus essentially the combination of two processes: segmentation, in which speaker turns are detected, and unsupervised clustering, in which segments from the same speaker are grouped. Clustering is considered an unsupervised problem because there is no prior information about the number of speakers, their identities, or the acoustic conditions (Meignier et al., Comput Speech Lang 20(2–3):303–330, 2006; Zhou and Hansen, IEEE Trans Speech Audio Process 13(4):467–474, 2005). This chapter presents the fundamentals of speaker diarization and the most significant work on the topic in recent years.

References

  1. The ISL Meeting Corpus (2004), https://catalog.ldc.upenn.edu/LDC2004S05. Accessed 25 Aug 2014

  2. The ICSI Meeting Corpus (2004), https://catalog.ldc.upenn.edu/LDC2004S02. Accessed 24 Aug 2014

  3. NIST Meeting Room Pilot Corpus (2004), https://catalog.ldc.upenn.edu/LDC2004S09. Accessed 24 Aug 2014

  4. The AMI corpus (2007), http://groups.inf.ed.ac.uk/ami/download/. Accessed 25 Aug 2014

  5. A.G. Adami, S.S. Kajarekar, H. Hermansky, A new speaker change detection method for two-speaker segmentation, in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2002), vol. 4 (2002), pp. 3908–3911

  6. A.G. Adami, L. Burget, S. Dupont, H. Garudadri, F. Grezl, H. Hermansky, P. Jain, S.S. Kajarekar, N. Morgan, S. Sivadas, Qualcomm-ICSI-OGI features for ASR, in Interspeech (2002)

  7. J. Ajmera, H. Bourlard, I. Lapidot, I. McCowan, Unknown-multiple speaker clustering using HMM, in Interspeech (2002)

  8. J. Ajmera, I. McCowan, H. Bourlard, Robust speaker change detection. IEEE Signal Process. Lett. 11(8), 649–651 (2004)

  9. J. Ajmera, C. Wooters, A robust speaker clustering algorithm, in 2003 IEEE Workshop on Automatic Speech Recognition and Understanding, 2003 (ASRU’03) (2003), pp. 411–416

  10. J. Allen, How do humans process and recognize speech? IEEE Trans. Speech Audio Process. 2(4), 567–577 (1994)

  11. X. Anguera, BeamformIt acoustic beamformer (2009), http://www.xavieranguera.com/beamformit/. Accessed 24 Aug 2014

  12. X. Anguera, M. Aguilo, C. Wooters, C. Nadeu, J. Hernando, Hybrid speech/non-speech detector applied to speaker diarization of meetings, in IEEE Odyssey 2006: The Speaker and Language Recognition Workshop (2006), pp. 1–6

  13. X. Anguera, J. Hernando, Evolutive speaker segmentation using a repository system, in Proceedings of International Conference on Speech and Language Processing, Jeju Island, 2004

  14. X. Anguera, J. Hernando, XBIC: real-time cross probabilities measure for speaker segmentation. University of California Berkeley, ICSI Berkeley Technical Report (2005)

  15. X. Anguera, C. Wooters, J. Hernando, Automatic cluster complexity and quantity selection: towards robust speaker diarization, in Machine Learning for Multimodal Interaction (Springer, Berlin, 2006), pp. 248–256

  16. X. Anguera, C. Wooters, J. Pardo, Robust speaker diarization for meetings: ICSI RT06s evaluation system, in Ninth International Conference on Spoken Language Processing (ISCA, Pittsburgh, 2006)

  17. X. Anguera, C. Wooters, J. Pardo, J. Hernando, Automatic weighting for the combination of TDOA and acoustic features in speaker diarization for meetings, in Proceedings of ICASSP (2007), pp. 241–244

  18. X. Anguera, C. Wooters, B. Peskin, M. Aguiló, Robust speaker segmentation for meetings: the ICSI-SRI spring 2005 diarization system, in Machine Learning for Multimodal Interaction (Springer, Berlin, 2006), pp. 402–414

  19. C. Barras, X. Zhu, S. Meignier, J.L. Gauvain, Improving speaker diarization, in RT-04F Workshop (2004)

  20. M. Ben, M. Betser, F. Bimbot, G. Gravier, Speaker diarization using bottom-up clustering based on a parameter-derived distance between adapted GMMs, in Eighth International Conference on Spoken Language Processing (ISCA, Pittsburgh, 2004)

  21. F. Bimbot, L. Mathan, Text-free speaker recognition using an arithmetic-harmonic sphericity measure, in Third European Conference on Speech Communication and Technology (ISCA, Pittsburgh, 1993)

  22. J.F. Bonastre, P. Delacourt, C. Fredouille, T. Merlin, C. Wellekens, A speaker tracking system based on speaker turn detection for NIST evaluation, in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, 2000 (ICASSP’00), vol. 2 (2000), pp. 1177–1180

  23. S. Bozonnet, N. Evans, C. Fredouille, The LIA-EURECOM RT’09 speaker diarization system: enhancements in speaker modelling and cluster purification, in 2010 IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP) (2010), pp. 4958–4961. doi:10.1109/ICASSP.2010.5495088

  24. J. Campbell et al., Speaker recognition: a tutorial. Proc. IEEE 85(9), 1437–1462 (1997)

  25. W. Campbell, D. Sturim, D. Reynolds, Support vector machines using GMM supervectors for speaker verification. IEEE Signal Process. Lett. 13(5), 308–311 (2006). doi:10.1109/LSP.2006.870086

  26. G.C. Carter, A.H. Nuttall, P.G. Cable, The smoothed coherence transform. Proc. IEEE 61(10), 1497–1498 (1973)

  27. S. Cassidy, The Macquarie speaker diarization system for RT04s, in NIST 2004 Spring Rich Transcription Evaluation Workshop, Montreal, 2004

  28. M. Cettolo, M. Vescovi, Efficient audio segmentation algorithms based on the BIC, in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP’03), vol. 6 (2003)

  29. S. Chen, P. Gopalakrishnan, Speaker, environment and channel change detection and clustering via the Bayesian information criterion, in Proceedings of DARPA Broadcast News Transcription and Understanding Workshop (1998), pp. 127–132

  30. T. Cover, J. Thomas, Elements of Information Theory (Wiley-Interscience, London, 2006)

  31. S. Davis, P. Mermelstein, Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoust. Speech Signal Process. 28(4), 357–366 (1980) [see also IEEE Transactions on Signal Processing]

  32. N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, P. Ouellet, Front-end factor analysis for speaker verification. IEEE Trans. Audio Speech Lang. Process. 19(4), 788–798 (2011). doi:10.1109/TASL.2010.2064307

  33. P. Delacourt, D. Kryze, C. Wellekens, Detection of speaker changes in an audio document, in Sixth European Conference on Speech Communication and Technology (ISCA, Pittsburgh, 1999)

  34. P. Delacourt, C. Wellekens, DISTBIC: a speaker-based segmentation for audio data indexing. Speech Commun. 32(1–2), 111–126 (2000)

  35. A. Dempster, N. Laird, D. Rubin et al., Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. 39(1), 1–38 (1977)

  36. R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification (Wiley, London, 2012)

  37. C. Eckart, Optimal rectifier systems for the detection of steady signals, Scripps Institution of Oceanography, UC San Diego (1952). Retrieved from: http://escholarship.org/uc/item/3676p6rt

  38. E. El-Khoury, C. Senac, R. Andre-Obrecht, Speaker diarization: towards a more robust and portable system, in IEEE International Conference on Acoustics, Speech and Signal Processing, 2007 (ICASSP 2007), vol. 4 (2007), pp. 489–492. doi:10.1109/ICASSP.2007.366956

  39. D.P. Ellis, J.C. Liu, Speaker turn segmentation based on between-channel differences, in NIST ICASSP 2004 Meeting Recognition Workshop, Montreal, 2004, pp. 112–117

  40. T. Ferguson, A Bayesian analysis of some nonparametric problems. Ann. Stat. 1(2), 209–230 (1973)

  41. J.G. Fiscus, J. Ajot, J.S. Garofolo, The rich transcription 2007 meeting recognition evaluation, in Multimodal Technologies for Perception of Humans (Springer, Berlin, 2008), pp. 373–389

  42. J.G. Fiscus, J. Ajot, M. Michel, J.S. Garofolo, The Rich Transcription 2006 Spring Meeting Recognition Evaluation (Springer, Berlin, 2006)

  43. J.G. Fiscus, N. Radde, J.S. Garofolo, A. Le, J. Ajot, C. Laprun, The rich transcription 2005 spring meeting recognition evaluation, in Machine Learning for Multimodal Interaction (Springer, Berlin, 2006), pp. 369–389

  44. E. Fox, E. Sudderth, M. Jordan, A. Willsky, An HDP-HMM for systems with state persistence, in Proceedings of the 25th International Conference on Machine Learning (ACM, New York, 2008), pp. 312–319

  45. E.B. Fox, E.B. Sudderth, M.I. Jordan, A.S. Willsky, A sticky HDP-HMM with application to speaker diarization. Ann. Appl. Stat. 5(2A), 1020–1056 (2011)

  46. G. Friedland, O. Vinyals, Y. Huang, C. Muller, Fusing short term and long term features for improved speaker diarization, in IEEE International Conference on Acoustics, Speech and Signal Processing, 2009 (ICASSP 2009) (2009), pp. 4077–4080. doi:10.1109/ICASSP.2009.4960524

  47. G. Friedland, A. Janin, D. Imseng, X. Anguera Miro, L. Gottlieb, M. Huijbregts, M. Knox, O. Vinyals, The ICSI RT-09 speaker diarization system. IEEE Trans. Audio Speech Lang. Process. 20(2), 371–381 (2012). doi:10.1109/TASL.2011.2158419

  48. G. Friedland, O. Vinyals, Y. Huang, C. Muller, Prosodic and other long-term features for speaker diarization. IEEE Trans. Audio Speech Lang. Process. 17(5), 985–993 (2009). doi:10.1109/TASL.2009.2015089

  49. R. Gangadharaiah, B. Narayanaswamy, N. Balakrishnan, A novel method for two-speaker segmentation, in Interspeech (2004)

  50. J. Gauvain, C. Lee, Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains. IEEE Trans. Speech Audio Process. 2(2), 291–298 (1994)

  51. J.L. Gauvain, L. Lamel, G. Adda, Partitioning and transcription of broadcast news data, in ICSLP, vol. 98 (1998), pp. 1335–1338

  52. J.T. Geiger, F. Wallhoff, G. Rigoll, GMM-UBM based open-set online speaker diarization, in Interspeech (2010), pp. 2330–2333

  53. H. Gish, M.H. Siu, R. Rohlicek, Segregation of speakers for speech recognition and speaker identification, in International Conference on Acoustics, Speech, and Signal Processing, 1991 (ICASSP-91) (1991), pp. 873–876

  54. T. Hain, S. Johnson, A. Tuerk, P. Woodland, S. Young, Segment generation and clustering in the HTK broadcast news transcription system, in Proceedings of DARPA Broadcast News Transcription and Understanding Workshop, vol. 1998 (1998)

  55. J. Hansen, B. Zhou, M. Akbacak, R. Sarikaya, B. Pellom, Audio stream phrase recognition for a national gallery of the spoken word: “One Small Step”, in Sixth International Conference on Spoken Language Processing (ISCA, Pittsburgh, 2000)

  56. H. Hermansky, Perceptual linear predictive (PLP) analysis of speech. J. Acoust. Soc. Am. 87(4), 1738–1752 (1990)

  57. H. Hermansky, N. Morgan, A. Bayya, P. Kohn, RASTA-PLP speech analysis technique, in IEEE International Conference on Acoustics, Speech, and Signal Processing, 1992 (ICASSP-92), vol. 1 (1992), pp. 121–124

  58. M. Huijbregts, R. Ordelman, F. de Jong, Annotation of heterogeneous multimedia content using automatic speech recognition, in Semantic Multimedia. Lecture Notes in Computer Science, vol. 4816 (Springer, Berlin, Heidelberg, 2007), pp. 78–90

  59. D. Imseng, G. Friedland, An adaptive initialization method for speaker diarization based on prosodic features, in 2010 IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP) (2010), pp. 4946–4949

  60. D. Istrate, C. Fredouille, S. Meignier, L. Besacier, J.F. Bonastre, NIST RT’05S evaluation: pre-processing techniques and speaker diarization on multiple microphone meetings, in Machine Learning for Multimodal Interaction (Springer, Berlin, 2006), pp. 428–439

  61. H. Jin, F. Kubala, R. Schwartz, Automatic speaker clustering, in Proceedings of the DARPA Speech Recognition Workshop (1997), pp. 108–111

  62. Q. Jin, T. Schultz, Speaker segmentation and clustering in meetings, in Interspeech, vol. 4 (2004), pp. 597–600

  63. S. Johnson, Who spoke when?-automatic segmentation and clustering for determining speaker turns, in Sixth European Conference on Speech Communication and Technology (ISCA, Pittsburgh, 1999)

  64. S.E. Johnson, P.C. Woodland, Speaker clustering using direct maximisation of the MLLR-adapted likelihood, in Proceedings of ICSLP 98 (1998), pp. 1775–1779

  65. T. Kemp, M. Schmidt, M. Westphal, A. Waibel, Strategies for automatic segmentation of audio data, in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, 2000 (ICASSP’00), vol. 3 (2000), pp. 1423–1426

  66. P. Kenny, G. Boulianne, P. Dumouchel, Eigenvoice modeling with sparse training data. IEEE Trans. Speech Audio Process. 13(3), 345–354 (2005). doi:10.1109/TSA.2004.840940

  67. H. Kim, D. Ertelt, T. Sikora, Hybrid speaker-based segmentation system using model-level clustering, in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 1 (2005), pp. 745–748

  68. B.E. Kingsbury, N. Morgan, S. Greenberg, Robust speech recognition using the modulation spectrogram. Speech Commun. 25(1), 117–132 (1998)

  69. C. Knapp, G. Carter, The generalized correlation method for estimation of time delay. IEEE Trans. Acoust. Speech Signal Process. 24(4), 320–327 (1976)

  70. T. Koshinaka, K. Nagatomo, K. Shinoda, Online speaker clustering using incremental learning of an ergodic hidden Markov model, in IEEE International Conference on Acoustics, Speech and Signal Processing, 2009 (ICASSP 2009) (2009), pp. 4093–4096. doi:10.1109/ICASSP.2009.4960528

  71. R. Kuhn, J.C. Junqua, P. Nguyen, N. Niedzielski, Rapid speaker adaptation in eigenvoice space. IEEE Trans. Speech Audio Process. 8(6), 695–707 (2000)

  72. I. Lapidot, SOM as likelihood estimator for speaker clustering, in Eighth European Conference on Speech Communication and Technology (ISCA, Pittsburgh, 2003)

  73. K. Laskowski, C. Fugen, T. Schultz, Simultaneous multispeaker segmentation for automatic meeting recognition, in Proceedings of EUSIPCO, Poznan, 2007, pp. 1294–1298

  74. K. Laskowski, Q. Jin, T. Schultz, Crosscorrelation-based multispeaker speech activity detection, in Eighth International Conference on Spoken Language Processing (ISCA, Pittsburgh, 2004)

  75. K. Laskowski, T. Schultz, A geometric interpretation of non-target-normalized maximum cross-channel correlation for vocal activity detection in meetings, in Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Companion Volume, Short Papers (Association for Computational Linguistics, 2007), pp. 89–92

  76. K. Laskowski, T. Schultz, Unsupervised learning of overlapped speech model parameters for multichannel speech activity detection in meetings, in Proceedings of ICASSP (2006), pp. 993–996

  77. V.B. Le, O. Mella, D. Fohr, et al., Speaker diarization using normalized cross likelihood ratio, in Interspeech, vol. 7 (2007), pp. 1869–1872

  78. D.A. van Leeuwen, The TNO speaker diarization system for NIST RT05s meeting data, in Machine Learning for Multimodal Interaction (Springer, Berlin, 2006), pp. 440–449

  79. D.A. van Leeuwen, M. Konečný, Progress in the AMIDA speaker diarization system for meeting data, in Multimodal Technologies for Perception of Humans (Springer, Berlin, 2008), pp. 475–483

  80. D. Liu, F. Kubala, Online speaker clustering, in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, 2004 (ICASSP’04), vol. 1 (2004), pp. 333–336

  81. D. Liu, F. Kubala, Fast speaker change detection for broadcast news transcription and indexing, in Sixth European Conference on Speech Communication and Technology (1999)

  82. J. López, D. Ellis, Using acoustic condition clustering to improve acoustic change detection on broadcast news, in Sixth International Conference on Spoken Language Processing (ISCA, Pittsburgh, 2000)

  83. L. Lu, H. Zhang, Real-time unsupervised speaker change detection, in International Conference on Pattern Recognition, vol. 16 (2002), pp. 358–361

  84. J. Luque, C. Segura, J. Hernando, Clustering initialization based on spatial information for speaker diarization of meetings, in Interspeech (2008), pp. 383–386

  85. J. Makhoul, Linear prediction: a tutorial review. Proc. IEEE 63(4), 561–580 (1975)

  86. A. Malegaonkar, A. Ariyaeeinia, P. Sivakumaran, J. Fortuna, Unsupervised speaker change detection using probabilistic pattern matching. IEEE Signal Process. Lett. 13(8), 509–512 (2006)

  87. K. Markov, S. Nakamura, Never-ending learning system for on-line speaker diarization, in IEEE Workshop on Automatic Speech Recognition Understanding, 2007 (ASRU) (2007), pp. 699–704. doi:10.1109/ASRU.2007.4430197

  88. K. Markov, S. Nakamura, Improved novelty detection for online GMM based speaker diarization, in Interspeech (2008), pp. 363–366

  89. S. Meignier, J. Bonastre, S. Igounet, E-HMM approach for learning and adapting sound models for speaker indexing, in 2001: A Speaker Odyssey-The Speaker Recognition Workshop (ISCA, Pittsburgh, 2001)

  90. S. Meignier, D. Moraru, C. Fredouille, J.F. Bonastre, L. Besacier, Step-by-step and integrated approaches in broadcast news speaker diarization. Comput. Speech Lang. 20(2–3), 303–330 (2006). doi:10.1016/j.csl.2005.08.002. http://www.sciencedirect.com/science/article/pii/S0885230805000471

  91. X.A. Miró, Robust speaker diarization for meetings, Ph.D. thesis, Universitat Politècnica de Catalunya, Barcelona (2006)

  92. D. Moraru, S. Meignier, L. Besacier, J.F. Bonastre, I. Magrin-Chagnolleau, The ELISA consortium approaches in speaker segmentation during the NIST 2002 speaker recognition evaluation, in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003 (ICASSP’03), vol. 2 (2003), p. II-89

  93. D. Moraru, S. Meignier, C. Fredouille, L. Besacier, J.F. Bonastre, The ELISA consortium approaches in broadcast news speaker segmentation during the NIST 2003 rich transcription evaluation, in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, 2004 (ICASSP’04), vol. 1 (2004), p. I-373

  94. K. Mori, S. Nakagawa, Speaker change detection and speaker clustering using VQ distortion for broadcast news speech recognition, in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, 2001 (ICASSP’01), vol. 1 (2001)

  95. R.M. Neal, G.E. Hinton, A view of the EM algorithm that justifies incremental, sparse, and other variants, in Learning in Graphical Models (Springer, Berlin, 1998), pp. 355–368

  96. A.Y. Ng, M.I. Jordan, Y. Weiss et al., On spectral clustering: analysis and an algorithm. Adv. Neural Inf. Process. Syst. 2, 849–856 (2002)

  97. P. Nguyen, L. Rigazio, Y. Moh, J. Junqua, Rich transcription 2002 site report, Panasonic Speech Technology Laboratory (PSTL), in Proceedings of the 2002 Rich Transcription Workshop (2002)

  98. M. Nishida, T. Kawahara, Unsupervised speaker indexing using speaker model selection based on Bayesian information criterion, in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003 (ICASSP’03), vol. 1 (2003), pp. 172–175

  99. J.M. Pardo, X. Anguera, C. Wooters, Speaker diarization for multi-microphone meetings using only between-channel differences, in Machine Learning for Multimodal Interaction (Springer, Berlin, 2006), pp. 257–264

  100. J.M. Pardo, X. Anguera, C. Wooters, Speaker diarization for multiple distant microphone meetings: mixing acoustic features and inter-channel time differences, in Interspeech (2006)

  101. J.M. Pardo, R. Barra-Chicote, R. San-Segundo, R. de Córdoba, B. Martínez-González, Speaker diarization features: the UPM contribution to the RT09 evaluation. IEEE Trans. Audio Speech Lang. Process. 20(2), 426–435 (2012)

  102. J. Pelecanos, S. Sridharan, Feature warping for robust speaker verification, in 2001: A Speaker Odyssey-The Speaker Recognition Workshop (2001)

  103. L. Perez-Freire, C. Garcia-Mateo, A multimedia approach for audio segmentation in TV broadcast news, in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, 2004 (ICASSP’04), vol. 1 (2004)

  104. T. Pfau, D. Ellis, A. Stolcke, Multispeaker speech activity detection for the ICSI meeting recorder, in Proceedings of ASRU, vol. 1 (2001)

  105. L.R. Rabiner, A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 77(2), 257–286 (1989)

  106. W.M. Rand, Objective criteria for the evaluation of clustering methods. J. Am. Stat. Assoc. 66(336), 846–850 (1971)

  107. D. Reynolds, E. Singer, B. Carlson, G. O’Leary, J. McLaughlin, M. Zissman, Blind clustering of speech utterances based on speaker and language characteristics, in Fifth International Conference on Spoken Language Processing (ISCA, Pittsburgh, 1998)

  108. D.A. Reynolds, T.F. Quatieri, R.B. Dunn, Speaker verification using adapted Gaussian mixture models. Digit. Signal Process. 10(1), 19–41 (2000)

  109. D.A. Reynolds, R.C. Rose, Robust text-independent speaker identification using Gaussian mixture speaker models. IEEE Trans. Speech Audio Process. 3(1), 72–83 (1995)

  110. D.A. Reynolds, P. Torres-Carrasquillo, The MIT Lincoln laboratory RT-04F diarization systems: applications to broadcast audio and telephone conversations. Technical Report, DTIC Document (2004)

  111. M. Roch, Y. Cheng, Speaker segmentation using the MAP-adapted Bayesian information criterion, in ODYSSEY04-The Speaker and Language Recognition Workshop (ISCA, Pittsburgh, 2004)

  112. P.R. Roth, Effective measurements using digital signal analysis. IEEE Spectr. 8(4), 62–70 (1971)

  113. J. Rougui, M. Rziza, D. Aboutajdine, M. Gelgon, J. Martinez, F. Rabat, Fast incremental clustering of Gaussian mixture speaker models for scaling up retrieval in on-line broadcast, in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2006 (ICASSP 2006), vol. 5 (2006)

  114. M. Rouvier, S. Meignier, A global optimization framework for speaker diarization, in Odyssey 2012-The Speaker and Language Recognition Workshop (2012)

  115. M.A. Sato, S. Ishii, On-line EM algorithm for the normalized Gaussian network. Neural Comput. 12(2), 407–432 (2000)

  116. G. Schwarz, Estimating the dimension of a model. Ann. Stat. 6(2), 461–464 (1978)

  117. E. Shriberg, L. Ferrer, S. Kajarekar, A. Venkataraman, A. Stolcke, Modeling prosodic feature sequences for speaker recognition. Speech Commun. 46(3), 455–472 (2005)

  118. S. Shum, N. Dehak, E. Chuangsuwanich, D.A. Reynolds, J.R. Glass, Exploiting intra-conversation variability for speaker diarization, in Interspeech (2011), pp. 945–948

  119. S. Shum, N. Dehak, R. Dehak, J. Glass, Unsupervised methods for speaker diarization: an integrated and iterative approach. IEEE Trans. Audio Speech Lang. Process. 21(10), 2015–2028 (2013). doi:10.1109/TASL.2013.2264673

  120. S. Shum, N. Dehak, J. Glass, On the use of spectral and iterative methods for speaker diarization. System 1(w2), 2 (2012)

  121. M.A. Siegler, U. Jain, B. Raj, R.M. Stern, Automatic segmentation, classification and clustering of broadcast news audio, in Proceedings of DARPA Broadcast News Workshop (1997), p. 11

  122. J. Silovsky, J. Prazak, Speaker diarization of broadcast streams using two-stage clustering based on i-vectors and cosine distance scoring, in 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2012), pp. 4193–4196

  123. R. Sinha, S.E. Tranter, M.J. Gales, P.C. Woodland, The Cambridge University March 2005 speaker diarisation system, in Interspeech (2005), pp. 2437–2440

  124. P. Sivakumaran, J. Fortuna, A.M. Ariyaeeinia, On the use of the Bayesian information criterion in multiple speaker detection, in Interspeech (2001), pp. 795–798

  125. A. Solomonoff, A. Mielke, M. Schmidt, H. Gish, Clustering speakers by their voices, in Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 2 (1998), pp. 757–760

  126. S. Stevens, J. Volkmann, The relation of pitch to frequency: a revised scale. Am. J. Psychol. 53(3), 329–353 (1940)

  127. H. Sun, B. Ma, S. Kalayar Khine, H. Li, Speaker diarization system for RT07 and RT09 meeting room audio, in 2010 IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP) (2010), pp. 4982–4985

  128. H. Tang, S. Chu, M. Hasegawa-Johnson, T. Huang, Partially supervised speaker clustering. IEEE Trans. Pattern Anal. Mach. Intell. 34(5), 959–971 (2012). doi:10.1109/TPAMI.2011.174

  129. Y. Teh, M. Jordan, M. Beal, D. Blei, Hierarchical Dirichlet processes. J. Am. Stat. Assoc. 101(476), 1566–1581 (2006)

  130. S. Tranter, Two-way cluster voting to improve speaker diarisation performance, in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005 (ICASSP’05), vol. 1 (2005)

  131. A. Tritschler, R. Gopinath, Improved speaker segmentation and segments clustering using the Bayesian information criterion, in Sixth European Conference on Speech Communication and Technology (ISCA, Pittsburgh, 1999), pp. 679–682

  132. W. Tsai, H. Wang, On maximizing the within-cluster homogeneity of speaker voice characteristics for speech utterance clustering, in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, Toulouse, 2006

  133. W.H. Tsai, S.S. Cheng, Y.H. Chao, H.M. Wang, Clustering speech utterances by speaker using eigenvoice-motivated vector space models, in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005 (ICASSP’05), vol. 1 (2005), pp. 725–728

  134. W.H. Tsai, S.S. Cheng, H.M. Wang, Speaker clustering of speech utterances using a voice characteristic reference space, in Eighth International Conference on Spoken Language Processing (2004)

  135. F. Valente, Infinite models for speaker clustering, in Ninth International Conference on Spoken Language Processing (ISCA, Pittsburgh, 2006)

  136. F. Valente, C. Wellekens, Variational Bayesian speaker clustering, in ODYSSEY04-The Speaker and Language Recognition Workshop (ISCA, Pittsburgh, 2004)

  137. F. Valente, C. Wellekens, Variational Bayesian adaptation for speaker clustering, in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005 (ICASSP’05), vol. 1 (2005)

  138. D.A. van Leeuwen, The TNO speaker diarization system for NIST RT05s meeting data, in Machine Learning for Multimodal Interaction. Lecture Notes in Computer Science, vol. 3869 (Springer, Berlin, Heidelberg, 2006), pp. 440–449

  139. A. Vandecatseye, J. Martens, A fast, accurate and stream-based speaker segmentation and clustering algorithm, in Eighth European Conference on Speech Communication and Technology (ISCA, Pittsburgh, 2003)

  140. D. Vijayasenan, F. Valente, Speaker diarization of meetings based on large TDOA feature vectors, in 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2012), pp. 4173–4176. doi:10.1109/ICASSP.2012.6288838

  141. D. Vijayasenan, F. Valente, H. Bourlard, Agglomerative information bottleneck for speaker diarization of meetings data, in IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU) (2007), pp. 250–449

  142. D. Vijayasenan, F. Valente, H. Bourlard, Combination of agglomerative and sequential clustering for speaker diarization, in IEEE International Conference on Acoustics, Speech and Signal Processing, 2008 (ICASSP 2008) (2008), pp. 4361–4364. doi:10.1109/ICASSP.2008.4518621

  143. D. Vijayasenan, F. Valente, H. Bourlard, Integration of TDOA features in information bottleneck framework for fast speaker diarization, in Interspeech (2008), pp. 40–43

  144. D. Vijayasenan, F. Valente, H. Bourlard, Mutual information based channel selection for speaker diarization of meetings data, in IEEE International Conference on Acoustics, Speech and Signal Processing, 2009 (ICASSP 2009) (2009), pp. 4065–4068. doi:10.1109/ICASSP.2009.4960521

  145. D. Vijayasenan, F. Valente, H. Bourlard, An information theoretic combination of MFCC and TDOA features for speaker diarization. IEEE Trans. Audio Speech Lang. Process. 19(2), 431–438 (2011). doi:10.1109/TASL.2010.2048603

  146. D. Vijayasenan, F. Valente, H. Bourlard, Multistream speaker diarization of meetings recordings beyond MFCC and TDOA features. Speech Commun. 54(1), 55–67 (2012)

  147. O. Vinyals, G. Friedland, Modulation spectrogram features for improved speaker diarization, in Interspeech (2008), pp. 630–633

  148. A. Viterbi, Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Trans. Inf. Theory 13(2), 260–269 (1967)

  149. H. Wang, S. Cheng, METRIC-SEQDAC: a hybrid approach for audio segmentation, in Eighth International Conference on Spoken Language Processing (ISCA, Pittsburgh, 2004)

  150. N. Wiener, Extrapolation, Interpolation, and Smoothing of Stationary Time Series: With Engineering Applications, vol. 8 (MIT Press, Cambridge, 1964)

  151. A. Willsky, H. Jones, A generalized likelihood ratio approach to the detection and estimation of jumps in linear systems. IEEE Trans. Automat. Contr. 21(1), 108–112 (1976)

  152. C. Wooters, J. Fung, B. Peskin, X. Anguera, Towards robust speaker segmentation: the ICSI-SRI fall 2004 diarization system, in RT-04F Workshop, vol. 23 (2004)

  153. C. Wooters, M. Huijbregts, The ICSI RT07s speaker diarization system, in Multimodal Technologies for Perception of Humans (Springer, Berlin, 2008), pp. 509–519

  154. S. Wrigley, G. Brown, V. Wan, S. Renals, Feature selection for the classification of crosstalk in multi-channel audio, in Eighth European Conference on Speech Communication and Technology (ISCA, Pittsburgh, 2003)

  155. S. Wrigley, G. Brown, V. Wan, S. Renals, Speech and crosstalk detection in multichannel audio. IEEE Trans. Speech Audio Process. 13(1), 84–91 (2005)

  156. T. Wu, L. Lu, K. Chen, H. Zhang, UBM-based real-time speaker segmentation for broadcasting news, in ICME 2003, vol. 2 (2003), pp. 721–724

  157. K. Yamanishi, J.I. Takeuchi, G. Williams, P. Milne, On-line unsupervised outlier detection using finite mixtures with discounting learning algorithms, in Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (ACM, New York, 2000), pp. 320–324

  158. M. Zamalloa, L.J. Rodríguez-Fuentes, G. Bordel, M. Penagarikano, J.P. Uribe, Low-latency online speaker tracking on the AMI corpus of meeting conversations, in 2010 IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP) (2010), pp. 4962–4965

  159. B. Zhou, J. Hansen, Efficient audio stream segmentation via the combined T2 statistic and Bayesian information criterion. IEEE Trans. Speech Audio Process. 13(4), 467–474 (2005)

  160. B. Zhou, J.H. Hansen, Unsupervised audio stream segmentation and clustering via the Bayesian information criterion, in Interspeech (2000), pp. 714–717

  161. X. Zhu, C. Barras, L. Lamel, J.L. Gauvain, Speaker diarization: from broadcast news to lectures, in Machine Learning for Multimodal Interaction (Springer, Berlin, 2006), pp. 396–406

  162. X. Zhu, C. Barras, S. Meignier, J.L. Gauvain, Combining speaker identification and BIC for speaker diarization, in Interspeech, vol. 5 (2005), pp. 2441–2444

  163. P. Zochova, V. Radova, Modified DISTBIC algorithm for speaker change detection, in Ninth European Conference on Speech Communication and Technology (ISCA, Pittsburgh, 2005)

  164. E. Zwicker, E. Terhardt, Analytical expressions for critical-band rate and critical bandwidth as a function of frequency. J. Acoust. Soc. Am. 68, 1523 (1980)

Author information

Corresponding author

Correspondence to Trung Hieu Nguyen.

Copyright information

© 2015 Springer Science+Business Media New York

About this chapter

Cite this chapter

Nguyen, T.H., Chng, E.S., Li, H. (2015). Speaker Diarization: An Emerging Research. In: Ogunfunmi, T., Togneri, R., Narasimha, M. (eds) Speech and Audio Processing for Coding, Enhancement and Recognition. Springer, New York, NY. https://doi.org/10.1007/978-1-4939-1456-2_8

  • DOI: https://doi.org/10.1007/978-1-4939-1456-2_8

  • Publisher Name: Springer, New York, NY

  • Print ISBN: 978-1-4939-1455-5

  • Online ISBN: 978-1-4939-1456-2

  • eBook Packages: Engineering (R0)