Speaker Diarization: An Emerging Research

Nguyen, Trung Hieu; Chng, Eng Siong; Li, Haizhou

doi:10.1007/978-1-4939-1456-2_8

Abstract

Speaker diarization is the task of determining “Who spoke when?”: the objective is to annotate a continuous audio recording with speaker labels for the time regions in which each speaker talks. The labels need not be the actual speaker identities (recovering those is the task of speaker identification), as long as the same label is assigned to all regions uttered by the same speaker. These regions may overlap, since multiple speakers can talk simultaneously. Speaker diarization is thus essentially the combination of two processes: segmentation, in which the speaker turns are detected, and unsupervised clustering, in which segments from the same speaker are grouped. The clustering is considered an unsupervised problem because there is no prior information about the number of speakers, their identities, or the acoustic conditions (Meignier et al., Comput Speech Lang 20(2–3):303–330, 2006; Zhou and Hansen, IEEE Trans Speech Audio Process 13(4):467–474, 2005). This chapter presents the fundamentals of speaker diarization and the most significant work on the topic in recent years.
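As a rough illustration of the point that labels only need to be consistent, not tied to real identities, the sketch below represents a diarization output as a list of (start, end, label) regions and checks whether two outputs agree up to a one-to-one renaming of labels. The `Segment` type, the function name `same_up_to_relabeling`, and the example timings are assumptions made for illustration; they are not taken from the chapter, and the check is deliberately simpler than a real scoring metric.

```python
from typing import Dict, List, Tuple

# (start_sec, end_sec, label); the label is an arbitrary cluster name,
# not a speaker identity. Hypothetical representation for this sketch.
Segment = Tuple[float, float, str]


def same_up_to_relabeling(a: List[Segment], b: List[Segment]) -> bool:
    """Return True if a and b cover identical time regions and their
    labels agree under a one-to-one renaming.

    Assumption of this sketch: no two segments within one output share
    exactly the same (start, end) pair.
    """
    if len(a) != len(b):
        return False
    fwd: Dict[str, str] = {}  # label in a -> label in b
    bwd: Dict[str, str] = {}  # label in b -> label in a
    for (s1, e1, l1), (s2, e2, l2) in zip(sorted(a), sorted(b)):
        if (s1, e1) != (s2, e2):
            return False
        # Both directions must stay consistent for the renaming to be one-to-one.
        if fwd.setdefault(l1, l2) != l2 or bwd.setdefault(l2, l1) != l1:
            return False
    return True


# Two speakers whose regions overlap between 4 s and 5 s.
reference = [(0.0, 5.0, "alice"), (4.0, 9.0, "bob"), (9.0, 12.0, "alice")]
hypothesis = [(0.0, 5.0, "spk1"), (4.0, 9.0, "spk2"), (9.0, 12.0, "spk1")]
merged = [(0.0, 5.0, "spk1"), (4.0, 9.0, "spk1"), (9.0, 12.0, "spk1")]

print(same_up_to_relabeling(reference, hypothesis))  # True
print(same_up_to_relabeling(reference, merged))      # False
```

The hypothesis with anonymous labels `spk1` and `spk2` counts as equivalent to the reference, while the output that merges both speakers into one cluster does not.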

References

The ISL Meeting Corpus (2004), https://catalog.ldc.upenn.edu/LDC2004S05. Accessed 25 Aug 2014
The ICSI Meeting Corpus (2004), https://catalog.ldc.upenn.edu/LDC2004S02. Accessed 24 Aug 2014
NIST Meeting Room Pilot Corpus (2004), https://catalog.ldc.upenn.edu/LDC2004S09. Accessed 24 Aug 2014
The AMI corpus (2007), http://groups.inf.ed.ac.uk/ami/download/. Accessed 25 Aug 2014
A.G. Adami, S.S. Kajarekar, H. Hermansky, A new speaker change detection method for two-speaker segmentation, in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2002), vol. 4 (2002), pp. 3908–3911
A.G. Adami, L. Burget, S. Dupont, H. Garudadri, F. Grezl, H. Hermansky, P. Jain, S.S. Kajarekar, N. Morgan, S. Sivadas, Qualcomm-ICSI-OGI features for ASR, in Interspeech (2002)
J. Ajmera, H. Bourlard, I. Lapidot, I. McCowan, Unknown-multiple speaker clustering using HMM, in Interspeech (2002)
J. Ajmera, I. McCowan, H. Bourlard, Robust speaker change detection. IEEE Signal Process. Lett. 11(8), 649–651 (2004)
J. Ajmera, C. Wooters, A robust speaker clustering algorithm, in 2003 IEEE Workshop on Automatic Speech Recognition and Understanding, 2003 (ASRU’03) (2003), pp. 411–416
J. Allen, How do humans process and recognize speech? IEEE Trans. Speech Audio Process. 2(4), 567–577 (1994)
X. Anguera, BeamformIt acoustic beamformer (2009), http://www.xavieranguera.com/beamformit/. Accessed 24 Aug 2014
X. Anguera, M. Aguilo, C. Wooters, C. Nadeu, J. Hernando, Hybrid speech/non-speech detector applied to speaker diarization of meetings, in IEEE Odyssey 2006: The Speaker and Language Recognition Workshop (2006), pp. 1–6
X. Anguera, J. Hernando, Evolutive speaker segmentation using a repository system, in Proceedings of International Conference on Speech and Language Processing, Jeju Island, 2004
X. Anguera, J. Hernando, XBIC: real-time cross probabilities measure for speaker segmentation. University of California Berkeley, ICSI Berkeley Technical Report (2005)
X. Anguera, C. Wooters, J. Hernando, Automatic cluster complexity and quantity selection: towards robust speaker diarization, in Machine Learning for Multimodal Interaction (Springer, Berlin, 2006), pp. 248–256
X. Anguera, C. Wooters, J. Pardo, Robust speaker diarization for meetings: ICSI RT06s evaluation system, in Ninth International Conference on Spoken Language Processing (ISCA, Pittsburgh, 2006)
X. Anguera, C. Wooters, J. Pardo, J. Hernando, Automatic weighting for the combination of TDOA and acoustic features in speaker diarization for meetings, in Proceedings of ICASSP (2007), pp. 241–244
X. Anguera, C. Wooters, B. Peskin, M. Aguiló, Robust speaker segmentation for meetings: the ICSI-SRI spring 2005 diarization system, in Machine Learning for Multimodal Interaction (Springer, Berlin, 2006), pp. 402–414
C. Barras, X. Zhu, S. Meignier, J.L. Gauvain, Improving speaker diarization, in RT-04F Workshop (2004)
M. Ben, M. Betser, F. Bimbot, G. Gravier, Speaker diarization using bottom-up clustering based on a parameter-derived distance between adapted GMMs, in Eighth International Conference on Spoken Language Processing (ISCA, Pittsburgh, 2004)
F. Bimbot, L. Mathan, Text-free speaker recognition using an arithmetic-harmonic sphericity measure, in Third European Conference on Speech Communication and Technology (ISCA, Pittsburgh, 1993)
J.F. Bonastre, P. Delacourt, C. Fredouille, T. Merlin, C. Wellekens, A speaker tracking system based on speaker turn detection for NIST evaluation, in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, 2000 (ICASSP’00), vol. 2 (2000), pp. 1177–1180
S. Bozonnet, N. Evans, C. Fredouille, The LIA-EURECOM RT’09 speaker diarization system: enhancements in speaker modelling and cluster purification, in 2010 IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP) (2010), pp. 4958–4961. doi:10.1109/ICASSP.2010.5495088
J. Campbell et al., Speaker recognition: a tutorial. Proc. IEEE 85(9), 1437–1462 (1997)
W. Campbell, D. Sturim, D. Reynolds, Support vector machines using GMM supervectors for speaker verification. IEEE Signal Process. Lett. 13(5), 308–311 (2006). doi:10.1109/LSP.2006.870086
G.C. Carter, A.H. Nuttall, P.G. Cable, The smoothed coherence transform. Proc. IEEE 61(10), 1497–1498 (1973)
S. Cassidy, The Macquarie speaker diarization system for RT04s, in NIST 2004 Spring Rich Transcription Evaluation Workshop, Montreal, 2004
M. Cettolo, M. Vescovi, Efficient audio segmentation algorithms based on the BIC, in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP’03), vol. 6 (2003)
S. Chen, P. Gopalakrishnan, Speaker, environment and channel change detection and clustering via the Bayesian information criterion, in Proceedings of DARPA Broadcast News Transcription and Understanding Workshop (1998), pp. 127–132
T. Cover, J. Thomas, Elements of Information Theory (Wiley-Interscience, London, 2006)
S. Davis, P. Mermelstein, Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoust. Speech Signal Process. 28(4), 357–366 (1980) [see also IEEE Transactions on Signal Processing]
N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, P. Ouellet, Front-end factor analysis for speaker verification. IEEE Trans. Audio Speech Lang. Process. 19(4), 788–798 (2011). doi:10.1109/TASL.2010.2064307
P. Delacourt, D. Kryze, C. Wellekens, Detection of speaker changes in an audio document, in Sixth European Conference on Speech Communication and Technology (ISCA, Pittsburgh, 1999)
P. Delacourt, C. Wellekens, DISTBIC: a speaker-based segmentation for audio data indexing. Speech Commun. 32(1–2), 111–126 (2000)
A. Dempster, N. Laird, D. Rubin et al., Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. 39(1), 1–38 (1977)
R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification (Wiley, London, 2012)
C. Eckart, Optimal rectifier systems for the detection of steady signals, Scripps Institution of Oceanography, (UC San Diego 1952). Retrieved from: http://escholarship.org/uc/item/3676p6rt
E. El-Khoury, C. Senac, R. Andre-Obrecht, Speaker diarization: towards a more robust and portable system, in IEEE International Conference on Acoustics, Speech and Signal Processing, 2007 (ICASSP 2007), vol. 4 (2007), pp. 489–492. doi:10.1109/ICASSP.2007.366956
D.P. Ellis, J.C. Liu, Speaker turn segmentation based on between-channel differences, in NIST ICASSP 2004 Meeting Recognition Workshop, Montreal, 2004, pp. 112–117
T. Ferguson, A Bayesian analysis of some nonparametric problems. Ann. Stat. 1(2), 209–230 (1973)
J.G. Fiscus, J. Ajot, J.S. Garofolo, The rich transcription 2007 meeting recognition evaluation, in Multimodal Technologies for Perception of Humans (Springer, Berlin, 2008), pp. 373–389
J.G. Fiscus, J. Ajot, M. Michel, J.S. Garofolo, The Rich Transcription 2006 Spring Meeting Recognition Evaluation (Springer, Berlin, 2006)
J.G. Fiscus, N. Radde, J.S. Garofolo, A. Le, J. Ajot, C. Laprun, The rich transcription 2005 spring meeting recognition evaluation, in Machine Learning for Multimodal Interaction (Springer, Berlin, 2006), pp. 369–389
E. Fox, E. Sudderth, M. Jordan, A. Willsky, An HDP-HMM for systems with state persistence, in Proceedings of the 25th International Conference on Machine Learning (ACM, New York, 2008), pp. 312–319
E.B. Fox, E.B. Sudderth, M.I. Jordan, A.S. Willsky, A sticky HDP-HMM with application to speaker diarization. Ann. Appl. Stat. 5(2A), 1020–1056 (2011)
G. Friedland, O. Vinyals, Y. Huang, C. Muller, Fusing short term and long term features for improved speaker diarization, in IEEE International Conference on Acoustics, Speech and Signal Processing, 2009 (ICASSP 2009) (2009), pp. 4077–4080. doi:10.1109/ICASSP.2009.4960524
G. Friedland, A. Janin, D. Imseng, X. Anguera Miro, L. Gottlieb, M. Huijbregts, M. Knox, O. Vinyals, The ICSI RT-09 speaker diarization system. IEEE Trans. Audio Speech Lang. Process. 20(2), 371–381 (2012). doi:10.1109/TASL.2011.2158419
G. Friedland, O. Vinyals, Y. Huang, C. Muller, Prosodic and other long-term features for speaker diarization. IEEE Trans. Audio Speech Lang. Process. 17(5), 985–993 (2009). doi:10.1109/TASL.2009.2015089
R. Gangadharaiah, B. Narayanaswamy, N. Balakrishnan, A novel method for two-speaker segmentation, in Interspeech (2004)
J. Gauvain, C. Lee, Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains. IEEE Trans. Speech Audio Process. 2(2), 291–298 (1994)
J.L. Gauvain, L. Lamel, G. Adda, Partitioning and transcription of broadcast news data, in ICSLP, vol. 98 (1998), pp. 1335–1338
J.T. Geiger, F. Wallhoff, G. Rigoll, GMM-UBM based open-set online speaker diarization, in Interspeech (2010), pp. 2330–2333
H. Gish, M.H. Siu, R. Rohlicek, Segregation of speakers for speech recognition and speaker identification, in International Conference on Acoustics, Speech, and Signal Processing, 1991 (ICASSP-91) (1991), pp. 873–876
T. Hain, S. Johnson, A. Tuerk, P. Woodland, S. Young, Segment generation and clustering in the HTK broadcast news transcription system, in Proceedings of DARPA Broadcast News Transcription and Understanding Workshop, vol. 1998 (1998)
J. Hansen, B. Zhou, M. Akbacak, R. Sarikaya, B. Pellom, Audio stream phrase recognition for a national gallery of the spoken word: “One Small Step”, in Sixth International Conference on Spoken Language Processing (ISCA, Pittsburgh, 2000)
H. Hermansky, Perceptual linear predictive (PLP) analysis of speech. J. Acoust. Soc. Am. 87(4), 1738–1752 (1990)
H. Hermansky, N. Morgan, A. Bayya, P. Kohn, RASTA-PLP speech analysis technique, in IEEE International Conference on Acoustics, Speech, and Signal Processing, 1992 (ICASSP-92), vol. 1 (1992), pp. 121–124
M. Huijbregts, R. Ordelman, F. de Jong, Annotation of heterogeneous multimedia content using automatic speech recognition. Lecture Notes in Computer Science, Semantic Multimedia, vol. 4816 (Springer, Berlin, Heidelberg, 2007), pp. 78–90
D. Imseng, G. Friedland, An adaptive initialization method for speaker diarization based on prosodic features, in 2010 IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP) (2010), pp. 4946–4949
D. Istrate, C. Fredouille, S. Meignier, L. Besacier, J.F. Bonastre, NIST RT’05S evaluation: pre-processing techniques and speaker diarization on multiple microphone meetings, in Machine Learning for Multimodal Interaction (Springer, Berlin, 2006), pp. 428–439
H. Jin, F. Kubala, R. Schwartz, Automatic speaker clustering, in Proceedings of the DARPA Speech Recognition Workshop (1997), pp. 108–111
Q. Jin, T. Schultz, Speaker segmentation and clustering in meetings, in Interspeech, vol. 4 (2004), pp. 597–600
S. Johnson, Who spoke when?-automatic segmentation and clustering for determining speaker turns, in Sixth European Conference on Speech Communication and Technology (ISCA, Pittsburgh, 1999)
S.E. Johnson, P.C. Woodland, Speaker clustering using direct maximisation of the MLLR-adapted likelihood, in Proceedings of ICSLP 98 (1998), pp. 1775–1779
T. Kemp, M. Schmidt, M. Westphal, A. Waibel, Strategies for automatic segmentation of audio data, in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, 2000 (ICASSP’00), vol. 3 (2000), pp. 1423–1426
P. Kenny, G. Boulianne, P. Dumouchel, Eigenvoice modeling with sparse training data. IEEE Trans. Speech Audio Process. 13(3), 345–354 (2005). doi:10.1109/TSA.2004.840940
H. Kim, D. Ertelt, T. Sikora, Hybrid speaker-based segmentation system using model-level clustering, in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 1 (2005), pp. 745–748
B.E. Kingsbury, N. Morgan, S. Greenberg, Robust speech recognition using the modulation spectrogram. Speech Commun. 25(1), 117–132 (1998)
C. Knapp, G. Carter, The generalized correlation method for estimation of time delay. IEEE Trans. Acoust. Speech Signal Process. 24(4), 320–327 (1976)
T. Koshinaka, K. Nagatomo, K. Shinoda, Online speaker clustering using incremental learning of an ergodic hidden Markov model, in IEEE International Conference on Acoustics, Speech and Signal Processing, 2009 (ICASSP 2009) (2009), pp. 4093–4096. doi:10.1109/ICASSP.2009.4960528
R. Kuhn, J.C. Junqua, P. Nguyen, N. Niedzielski, Rapid speaker adaptation in eigenvoice space. IEEE Trans. Speech Audio Process. 8(6), 695–707 (2000)
I. Lapidot, SOM as likelihood estimator for speaker clustering, in Eighth European Conference on Speech Communication and Technology (ISCA, Pittsburgh, 2003)
K. Laskowski, C. Fugen, T. Schultz, Simultaneous multispeaker segmentation for automatic meeting recognition, in Proceedings of EUSIPCO, Poznan, 2007, pp. 1294–1298
K. Laskowski, Q. Jin, T. Schultz, Crosscorrelation-based multispeaker speech activity detection, in Eighth International Conference on Spoken Language Processing (ISCA, Pittsburgh, 2004)
K. Laskowski, T. Schultz, A geometric interpretation of non-target-normalized maximum cross-channel correlation for vocal activity detection in meetings, in Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Companion Volume, Short Papers, pp. 89–92. Association for Computational Linguistics (2007)
K. Laskowski, T. Schultz, Unsupervised learning of overlapped speech model parameters for multichannel speech activity detection in meetings, in Proceedings of ICASSP (2006), pp. 993–996
V.B. Le, O. Mella, D. Fohr, et al., Speaker diarization using normalized cross likelihood ratio, in Interspeech, vol. 7 (2007), pp. 1869–1872
D.A. van Leeuwen, The TNO speaker diarization system for NIST RT05s meeting data, in Machine Learning for Multimodal Interaction (Springer, Berlin, 2006), pp. 440–449
D.A. van Leeuwen, M. Konečný, Progress in the AMIDA speaker diarization system for meeting data, in Multimodal Technologies for Perception of Humans (Springer, Berlin, 2008), pp. 475–483
D. Liu, F. Kubala, Online speaker clustering, in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, 2004 (ICASSP’04), vol. 1 (2004), pp. 333–336
D. Liu, F. Kubala, Fast speaker change detection for broadcast news transcription and indexing, in Sixth European Conference on Speech Communication and Technology (1999)
J. López, D. Ellis, Using acoustic condition clustering to improve acoustic change detection on broadcast news, in Sixth International Conference on Spoken Language Processing (ISCA, Pittsburgh, 2000)
L. Lu, H. Zhang, Real-time unsupervised speaker change detection, in International Conference on Pattern Recognition, vol. 16 (2002), pp. 358–361
J. Luque, C. Segura, J. Hernando, Clustering initialization based on spatial information for speaker diarization of meetings, in Interspeech (2008), pp. 383–386
J. Makhoul, Linear prediction: a tutorial review. Proc. IEEE 63(4), 561–580 (1975)
A. Malegaonkar, A. Ariyaeeinia, P. Sivakumaran, J. Fortuna, Unsupervised speaker change detection using probabilistic pattern matching. IEEE Signal Process. Lett. 13(8), 509–512 (2006)
K. Markov, S. Nakamura, Never-ending learning system for on-line speaker diarization, in IEEE Workshop on Automatic Speech Recognition Understanding, 2007 (ASRU) (2007), pp. 699–704. doi:10.1109/ASRU.2007.4430197
K. Markov, S. Nakamura, Improved novelty detection for online GMM based speaker diarization, in Interspeech (2008), pp. 363–366
S. Meignier, J. Bonastre, S. Igounet, E-HMM approach for learning and adapting sound models for speaker indexing, in 2001: A Speaker Odyssey-The Speaker Recognition Workshop (ISCA, Pittsburgh, 2001)
S. Meignier, D. Moraru, C. Fredouille, J.F. Bonastre, L. Besacier, Step-by-step and integrated approaches in broadcast news speaker diarization. Comput. Speech Lang. 20(2–3), 303–330 (2006). doi:10.1016/j.csl.2005.08.002
X.A. Miró, Robust speaker diarization for meetings, Ph.D. thesis, Universitat Politècnica de Catalunya, Barcelona (2006)
D. Moraru, S. Meignier, L. Besacier, J.F. Bonastre, I. Magrin-Chagnolleau, The ELISA consortium approaches in speaker segmentation during the NIST 2002 speaker recognition evaluation, in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003 (ICASSP’03), vol. 2 (2003), p. II-89
D. Moraru, S. Meignier, C. Fredouille, L. Besacier, J.F. Bonastre, The ELISA consortium approaches in broadcast news speaker segmentation during the NIST 2003 rich transcription evaluation, in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, 2004 (ICASSP’04), vol. 1 (2004), p. I-373
K. Mori, S. Nakagawa, Speaker change detection and speaker clustering using VQ distortion for broadcast news speech recognition, in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, 2001 (ICASSP’01), vol. 1 (2001)
R.M. Neal, G.E. Hinton, A view of the EM algorithm that justifies incremental, sparse, and other variants, in Learning in Graphical Models (Springer, Berlin, 1998), pp. 355–368
A.Y. Ng, M.I. Jordan, Y. Weiss et al., On spectral clustering: analysis and an algorithm. Adv. Neural Inf. Process. Syst. 2, 849–856 (2002)
P. Nguyen, L. Rigazio, Y. Moh, J. Junqua, Rich transcription 2002 site report, Panasonic Speech Technology Laboratory (PSTL), in Proceedings of the 2002 Rich Transcription Workshop (2002)
M. Nishida, T. Kawahara, Unsupervised speaker indexing using speaker model selection based on Bayesian information criterion, in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003 (ICASSP’03), vol. 1 (2003), pp. 172–175
J.M. Pardo, X. Anguera, C. Wooters, Speaker diarization for multi-microphone meetings using only between-channel differences, in Machine Learning for Multimodal Interaction (Springer, Berlin, 2006), pp. 257–264
J.M. Pardo, X. Anguera, C. Wooters, Speaker diarization for multiple distant microphone meetings: mixing acoustic features and inter-channel time differences, in Interspeech (2006)
J.M. Pardo, R. Barra-Chicote, R. San-Segundo, R. de Córdoba, B. Martínez-González, Speaker diarization features: the UPM contribution to the RT09 evaluation. IEEE Trans. Audio Speech Lang. Process. 20(2), 426–435 (2012)
J. Pelecanos, S. Sridharan, Feature warping for robust speaker verification, in 2001: A Speaker Odyssey-The Speaker Recognition Workshop (2001)
L. Perez-Freire, C. Garcia-Mateo, A multimedia approach for audio segmentation in TV broadcast news, in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, 2004 (ICASSP’04), vol. 1 (2004)
T. Pfau, D. Ellis, A. Stolcke, Multispeaker speech activity detection for the ICSI meeting recorder, in Proceedings of ASRU, vol. 1 (2001)
L.R. Rabiner, A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 77(2), 257–286 (1989)
W.M. Rand, Objective criteria for the evaluation of clustering methods. J. Am. Stat. Assoc. 66(336), 846–850 (1971)
D. Reynolds, E. Singer, B. Carlson, G. O’Leary, J. McLaughlin, M. Zissman, Blind clustering of speech utterances based on speaker and language characteristics, in Fifth International Conference on Spoken Language Processing (ISCA, Pittsburgh, 1998)
D.A. Reynolds, T.F. Quatieri, R.B. Dunn, Speaker verification using adapted Gaussian mixture models. Digit. Signal Process. 10(1), 19–41 (2000)
D.A. Reynolds, R.C. Rose, Robust text-independent speaker identification using Gaussian mixture speaker models. IEEE Trans. Speech Audio Process. 3(1), 72–83 (1995)
D.A. Reynolds, P. Torres-Carrasquillo, The MIT Lincoln laboratory RT-04F diarization systems: applications to broadcast audio and telephone conversations. Technical Report, DTIC Document (2004)
M. Roch, Y. Cheng, Speaker segmentation using the MAP-adapted Bayesian information criterion, in ODYSSEY04-The Speaker and Language Recognition Workshop (ISCA, Pittsburgh, 2004)
P.R. Roth, Effective measurements using digital signal analysis. IEEE Spectr. 8(4), 62–70 (1971)
J. Rougui, M. Rziza, D. Aboutajdine, M. Gelgon, J. Martinez, F. Rabat, Fast incremental clustering of Gaussian mixture speaker models for scaling up retrieval in on-line broadcast, in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2006 (ICASSP 2006), vol. 5 (2006)
M. Rouvier, S. Meignier, A global optimization framework for speaker diarization, in Odyssey 2012-The Speaker and Language Recognition Workshop (2012)
M.A. Sato, S. Ishii, On-line EM algorithm for the normalized Gaussian network. Neural Comput. 12(2), 407–432 (2000)
G. Schwarz, Estimating the dimension of a model. Ann. Stat. 6(2), 461–464 (1978)
E. Shriberg, L. Ferrer, S. Kajarekar, A. Venkataraman, A. Stolcke, Modeling prosodic feature sequences for speaker recognition. Speech Commun. 46(3), 455–472 (2005)
S. Shum, N. Dehak, E. Chuangsuwanich, D.A. Reynolds, J.R. Glass, Exploiting intra-conversation variability for speaker diarization, in Interspeech (2011), pp. 945–948
S. Shum, N. Dehak, R. Dehak, J. Glass, Unsupervised methods for speaker diarization: an integrated and iterative approach. IEEE Trans. Audio Speech Lang. Process. 21(10), 2015–2028 (2013). doi:10.1109/TASL.2013.2264673
S. Shum, N. Dehak, J. Glass, On the use of spectral and iterative methods for speaker diarization. System 1(w2), 2 (2012)
M.A. Siegler, U. Jain, B. Raj, R.M. Stern, Automatic segmentation, classification and clustering of broadcast news audio, in Proceedings of DARPA Broadcast News Workshop (1997), p. 11
J. Silovsky, J. Prazak, Speaker diarization of broadcast streams using two-stage clustering based on i-vectors and cosine distance scoring, in 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2012), pp. 4193–4196
R. Sinha, S.E. Tranter, M.J. Gales, P.C. Woodland, The Cambridge university March 2005 speaker diarisation system, in Interspeech (2005), pp. 2437–2440
P. Sivakumaran, J. Fortuna, A.M. Ariyaeeinia, On the use of the Bayesian information criterion in multiple speaker detection, in Interspeech (2001), pp. 795–798
A. Solomonoff, A. Mielke, M. Schmidt, H. Gish, Clustering speakers by their voices, in Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 2 (1998), pp. 757–760
S. Stevens, J. Volkmann, The relation of pitch to frequency: a revised scale. Am. J. Psychol. 53(3), 329–353 (1940)
H. Sun, B. Ma, S. Kalayar Khine, H. Li, Speaker diarization system for RT07 and RT09 meeting room audio, in 2010 IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP) (2010), pp. 4982–4985
H. Tang, S. Chu, M. Hasegawa-Johnson, T. Huang, Partially supervised speaker clustering. IEEE Trans. Pattern Anal. Mach. Intell. 34(5), 959–971 (2012). doi:10.1109/TPAMI.2011.174
Y. Teh, M. Jordan, M. Beal, D. Blei, Hierarchical Dirichlet processes. J. Am. Stat. Assoc. 101(476), 1566–1581 (2006)
S. Tranter, Two-way cluster voting to improve speaker diarisation performance, in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005 (ICASSP’05), vol. 1 (2005)
A. Tritschler, R. Gopinath, Improved speaker segmentation and segments clustering using the Bayesian information criterion, in Sixth European Conference on Speech Communication and Technology (ISCA, Pittsburgh, 1999), pp. 679–682
W. Tsai, H. Wang, On maximizing the within-cluster homogeneity of speaker voice characteristics for speech utterance clustering, in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, Toulouse, 2006
W.H. Tsai, S.S. Cheng, Y.H. Chao, H.M. Wang, Clustering speech utterances by speaker using eigenvoice-motivated vector space models, in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005 (ICASSP’05), vol. 1 (2005), pp. 725–728
W.H. Tsai, S.S. Cheng, H.M. Wang, Speaker clustering of speech utterances using a voice characteristic reference space, in Eighth International Conference on Spoken Language Processing (2004)
F. Valente, Infinite models for speaker clustering, in Ninth International Conference on Spoken Language Processing (ISCA, Pittsburgh, 2006)
F. Valente, C. Wellekens, Variational Bayesian speaker clustering, in ODYSSEY04-The Speaker and Language Recognition Workshop (ISCA, Pittsburgh, 2004)
F. Valente, C. Wellekens, Variational Bayesian adaptation for speaker clustering, in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005 (ICASSP’05), vol. 1 (2005)
D. Van Leeuwen, T. Factors, The TNO speaker diarization system for NIST RT05s meeting data. Lecture Notes in Computer Science, Machine Learning for Multimodal Interaction (Springer Berlin Heidelberg 2006) vol. 3869, pp. 440
A. Vandecatseye, J. Martens, A fast, accurate and stream-based speaker segmentation and clustering algorithm, in Eighth European Conference on Speech Communication and Technology (ISCA, Pittsburgh, 2003)
D. Vijayasenan, F. Valente, Speaker diarization of meetings based on large TDOA feature vectors, in 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2012), pp. 4173–4176. doi:10.1109/ICASSP.2012.6288838
D. Vijayasenan, F. Valente, H. Bourlard, Agglomerative information bottleneck for speaker diarization of meetings data, in IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU) (2007), pp. 250–449
D. Vijayasenan, F. Valente, H. Bourlard, Combination of agglomerative and sequential clustering for speaker diarization, in IEEE International Conference on Acoustics, Speech and Signal Processing, 2008 (ICASSP 2008) (2008), pp. 4361–4364. doi:10.1109/ICASSP.2008.4518621
D. Vijayasenan, F. Valente, H. Bourlard, Integration of TDOA features in information bottleneck framework for fast speaker diarization, in Interspeech (2008), pp. 40–43
D. Vijayasenan, F. Valente, H. Bourlard, Mutual information based channel selection for speaker diarization of meetings data, in IEEE International Conference on Acoustics, Speech and Signal Processing, 2009 (ICASSP 2009) (2009), pp. 4065–4068. doi:10.1109/ICASSP.2009.4960521
D. Vijayasenan, F. Valente, H. Bourlard, An information theoretic combination of MFCC and TDOA features for speaker diarization. IEEE Trans. Audio Speech Lang. Process. 19(2), 431–438 (2011). doi:10.1109/TASL.2010.2048603
D. Vijayasenan, F. Valente, H. Bourlard, Multistream speaker diarization of meetings recordings beyond MFCC and TDOA features. Speech Commun. 54(1), 55–67 (2012)
O. Vinyals, G. Friedland, Modulation spectrogram features for improved speaker diarization, in Interspeech (2008), pp. 630–633
A. Viterbi, Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Trans. Inf. Theory 13(2), 260–269 (1967)
H. Wang, S. Cheng, METRIC-SEQDAC: a hybrid approach for audio segmentation, in Eighth International Conference on Spoken Language Processing (ISCA, Pittsburgh, 2004)
N. Wiener, Extrapolation, Interpolation, and Smoothing of Stationary Time Series: With Engineering Applications, vol. 8 (MIT Press, Cambridge, 1964)
A. Willsky, H. Jones, A generalized likelihood ratio approach to the detection and estimation of jumps in linear systems. IEEE Trans. Automat. Contr. 21(1), 108–112 (1976)
C. Wooters, J. Fung, B. Peskin, X. Anguera, Towards robust speaker segmentation: the ICSI-SRI fall 2004 diarization system, in RT-04F Workshop, vol. 23 (2004)
C. Wooters, M. Huijbregts, The ICSI RT07s speaker diarization system, in Multimodal Technologies for Perception of Humans (Springer, Berlin, 2008), pp. 509–519
S. Wrigley, G. Brown, V. Wan, S. Renals, Feature selection for the classification of crosstalk in multi-channel audio, in Eighth European Conference on Speech Communication and Technology (ISCA, Pittsburgh, 2003)
S. Wrigley, G. Brown, V. Wan, S. Renals, Speech and crosstalk detection in multichannel audio. IEEE Trans. Speech Audio Process. 13(1), 84–91 (2005)
T. Wu, L. Lu, K. Chen, H. Zhang, UBM-based real-time speaker segmentation for broadcasting news, in ICME 2003, vol. 2 (2003), pp. 721–724
K. Yamanishi, J.I. Takeuchi, G. Williams, P. Milne, On-line unsupervised outlier detection using finite mixtures with discounting learning algorithms, in Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (ACM, New York, 2000), pp. 320–324
M. Zamalloa, L.J. Rodríguez-Fuentes, G. Bordel, M. Penagarikano, J.P. Uribe, Low-latency online speaker tracking on the AMI corpus of meeting conversations, in 2010 IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP) (2010), pp. 4962–4965
B. Zhou, J. Hansen, Efficient audio stream segmentation via the combined T² statistic and Bayesian information criterion. IEEE Trans. Speech Audio Process. 13(4), 467–474 (2005)
B. Zhou, J.H. Hansen, Unsupervised audio stream segmentation and clustering via the Bayesian information criterion, in Interspeech (2000), pp. 714–717
X. Zhu, C. Barras, L. Lamel, J.L. Gauvain, Speaker diarization: from broadcast news to lectures, in Machine Learning for Multimodal Interaction (Springer, Berlin, 2006), pp. 396–406
X. Zhu, C. Barras, S. Meignier, J.L. Gauvain, Combining speaker identification and BIC for speaker diarization, in Interspeech, vol. 5 (2005), pp. 2441–2444
P. Zochova, V. Radova, Modified DISTBIC algorithm for speaker change detection, in Ninth European Conference on Speech Communication and Technology (ISCA, Pittsburgh, 2005)
E. Zwicker, E. Terhardt, Analytical expressions for critical-band rate and critical bandwidth as a function of frequency. J. Acoust. Soc. Am. 68, 1523 (1980)

Author information

Authors and Affiliations

Institute for Infocomm Research, Singapore, Singapore
Trung Hieu Nguyen & Haizhou Li
Nanyang Technological University, Singapore, Singapore
Eng Siong Chng

Corresponding author

Correspondence to Trung Hieu Nguyen.

Editor information

Editors and Affiliations

Dept. of Electrical Engineering, Santa Clara University, Santa Clara, California, USA
Tokunbo Ogunfunmi
School of EE&C Engineering, The University of Western Australia, Crawley, West Australia, Australia
Roberto Togneri
Qualcomm Inc., Santa Clara, California, USA
Madihally (Sim) Narasimha

About this chapter

Cite this chapter

Nguyen, T.H., Chng, E.S., Li, H. (2015). Speaker Diarization: An Emerging Research. In: Ogunfunmi, T., Togneri, R., Narasimha, M. (eds) Speech and Audio Processing for Coding, Enhancement and Recognition. Springer, New York, NY. https://doi.org/10.1007/978-1-4939-1456-2_8

DOI: https://doi.org/10.1007/978-1-4939-1456-2_8
Published: 18 September 2014
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4939-1455-5
Online ISBN: 978-1-4939-1456-2
eBook Packages: Engineering, Engineering (R0)
