Abstract
Speaker diarization is the task of determining “Who spoke when?”: the objective is to annotate a continuous audio recording with speaker labels for the time regions in which each speaker talks. The labels need not be the speakers' actual identities (recovering those is the separate task of speaker identification), as long as all regions uttered by the same speaker receive the same label. Regions may overlap, since multiple speakers can talk simultaneously. Speaker diarization is thus essentially a combination of two processes: segmentation, in which speaker turns are detected, and unsupervised clustering, in which segments from the same speaker are grouped. Clustering is treated as an unsupervised problem because there is no prior information about the number of speakers, their identities, or the acoustic conditions (Meignier et al., Comput Speech Lang 20(2–3):303–330, 2006; Zhou and Hansen, IEEE Trans Speech Audio Process 13(4):467–474, 2005). This chapter presents the fundamentals of speaker diarization and the most significant recent work on the topic.
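The two-stage view described in the abstract can be illustrated with a small toy sketch. It is not the method of the chapter or of any cited system: the one-dimensional "features", the two thresholds, and the function names `segment`, `cluster`, `diarize`, and `same_partition` are all assumptions made for illustration, and overlapping speech is not handled. The sketch detects speaker turns, groups segments without knowing the number of speakers in advance, and checks that two outputs are equivalent up to a renaming of labels.

```python
from statistics import mean


def segment(features, threshold):
    """Toy speaker-turn detection: start a new segment wherever two
    consecutive frames differ by more than `threshold`.
    Returns (start, end) frame indices, end exclusive."""
    boundaries = [0]
    for i in range(1, len(features)):
        if abs(features[i] - features[i - 1]) > threshold:
            boundaries.append(i)
    boundaries.append(len(features))
    return list(zip(boundaries[:-1], boundaries[1:]))


def cluster(features, segments, threshold):
    """Toy agglomerative clustering: repeatedly merge the two clusters
    with the closest mean feature value, stopping when the closest pair
    is farther apart than `threshold`. The number of clusters is not
    fixed in advance."""
    clusters = [[seg] for seg in segments]

    def centroid(c):
        return mean(f for s, e in c for f in features[s:e])

    while len(clusters) > 1:
        dist, i, j = min(
            (abs(centroid(a) - centroid(b)), i, j)
            for i, a in enumerate(clusters)
            for j, b in enumerate(clusters)
            if i < j
        )
        if dist > threshold:
            break
        merged = clusters.pop(j)  # j > i, so index i stays valid
        clusters[i].extend(merged)
    return clusters


def diarize(features, change_threshold=1.0, merge_threshold=1.0):
    """Segmentation followed by clustering; returns (start, end, label)."""
    segments = segment(features, change_threshold)
    labels = {}
    for k, c in enumerate(cluster(features, segments, merge_threshold)):
        for seg in c:
            labels[seg] = f"spk{k}"
    return [(s, e, labels[(s, e)]) for s, e in segments]


def same_partition(a, b):
    """True if two outputs agree up to a one-to-one renaming of labels."""
    if len(a) != len(b):
        return False
    fwd, bwd = {}, {}
    for (s1, e1, l1), (s2, e2, l2) in zip(a, b):
        if (s1, e1) != (s2, e2):
            return False
        if fwd.setdefault(l1, l2) != l2 or bwd.setdefault(l2, l1) != l1:
            return False
    return True


# Hypothetical frame-level features: two "speakers" near 0 and near 5.
frames = [0.0, 0.1, 0.0, 5.0, 5.1, 0.1, 0.0, 5.0]
result = diarize(frames)
print(result)
```

In a real system the scalar frames would be replaced by acoustic feature vectors and the distance thresholds by a statistical criterion; the point here is only that the label names themselves carry no meaning, which `same_partition` makes explicit.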
References
The ISL Meeting Corpus (2004), https://catalog.ldc.upenn.edu/LDC2004S05. Accessed 25 Aug 2014
The ICSI Meeting Corpus (2004), https://catalog.ldc.upenn.edu/LDC2004S02. Accessed 24 Aug 2014
NIST Meeting Room Pilot Corpus (2004), https://catalog.ldc.upenn.edu/LDC2004S09. Accessed 24 Aug 2014
The AMI corpus (2007), http://groups.inf.ed.ac.uk/ami/download/. Accessed 25 Aug 2014
A.G. Adami, S.S. Kajarekar, H. Hermansky, A new speaker change detection method for two-speaker segmentation, in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2002), vol. 4 (2002), pp. 3908–3911
A.G. Adami, L. Burget, S. Dupont, H. Garudadri, F. Grezl, H. Hermansky, P. Jain, S.S. Kajarekar, N. Morgan, S. Sivadas, Qualcomm-ICSI-OGI features for ASR, in Interspeech (2002)
J. Ajmera, H. Bourlard, I. Lapidot, I. McCowan, Unknown-multiple speaker clustering using HMM, in Interspeech (2002)
J. Ajmera, I. McCowan, H. Bourlard, Robust speaker change detection. IEEE Signal Process. Lett. 11(8), 649–651 (2004)
J. Ajmera, C. Wooters, A robust speaker clustering algorithm, in 2003 IEEE Workshop on Automatic Speech Recognition and Understanding, 2003 (ASRU’03) (2003), pp. 411–416
J. Allen, How do humans process and recognize speech? IEEE Trans. Speech Audio Process. 2(4), 567–577 (1994)
X. Anguera, BeamformIt acoustic beamformer (2009), http://www.xavieranguera.com/beamformit/. Accessed 24 Aug 2014
X. Anguera, M. Aguilo, C. Wooters, C. Nadeu, J. Hernando, Hybrid speech/non-speech detector applied to speaker diarization of meetings, in IEEE Odyssey 2006: The Speaker and Language Recognition Workshop (2006), pp. 1–6
X. Anguera, J. Hernando, Evolutive speaker segmentation using a repository system, in Proceedings of International Conference on Speech and Language Processing, Jeju Island, 2004
X. Anguera, J. Hernando, XBIC: real-time cross probabilities measure for speaker segmentation. University of California Berkeley, ICSI-Berkeley Technical Report (2005)
X. Anguera, C. Wooters, J. Hernando, Automatic cluster complexity and quantity selection: towards robust speaker diarization, in Machine Learning for Multimodal Interaction (Springer, Berlin, 2006), pp. 248–256
X. Anguera, C. Wooters, J. Pardo, Robust speaker diarization for meetings: ICSI RT06s evaluation system, in Ninth International Conference on Spoken Language Processing (ISCA, Pittsburgh, 2006)
X. Anguera, C. Wooters, J. Pardo, J. Hernando, Automatic weighting for the combination of TDOA and acoustic features in speaker diarization for meetings, in Proceedings of ICASSP (2007), pp. 241–244
X. Anguera, C. Wooters, B. Peskin, M. Aguiló, Robust speaker segmentation for meetings: the ICSI-SRI spring 2005 diarization system, in Machine Learning for Multimodal Interaction (Springer, Berlin, 2006), pp. 402–414
C. Barras, X. Zhu, S. Meignier, J.L. Gauvain, Improving speaker diarization, in RT-04F Workshop (2004)
M. Ben, M. Betser, F. Bimbot, G. Gravier, Speaker diarization using bottom-up clustering based on a parameter-derived distance between adapted GMMs, in Eighth International Conference on Spoken Language Processing (ISCA, Pittsburgh, 2004)
F. Bimbot, L. Mathan, Text-free speaker recognition using an arithmetic-harmonic sphericity measure, in Third European Conference on Speech Communication and Technology (ISCA, Pittsburgh, 1993)
J.F. Bonastre, P. Delacourt, C. Fredouille, T. Merlin, C. Wellekens, A speaker tracking system based on speaker turn detection for NIST evaluation, in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, 2000 (ICASSP’00), vol. 2 (2000), pp. 1177–1180
S. Bozonnet, N. Evans, C. Fredouille, The LIA-EURECOM RT’09 speaker diarization system: enhancements in speaker modelling and cluster purification, in 2010 IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP) (2010), pp. 4958–4961. doi:10.1109/ICASSP.2010.5495088
J. Campbell et al., Speaker recognition: a tutorial. Proc. IEEE 85(9), 1437–1462 (1997)
W. Campbell, D. Sturim, D. Reynolds, Support vector machines using GMM supervectors for speaker verification. IEEE Signal Process. Lett. 13(5), 308–311 (2006). doi:10.1109/LSP.2006.870086
G.C. Carter, A.H. Nuttall, P.G. Cable, The smoothed coherence transform. Proc. IEEE 61(10), 1497–1498 (1973)
S. Cassidy, The Macquarie speaker diarization system for RT04s, in NIST 2004 Spring Rich Transcription Evaluation Workshop, Montreal, 2004
M. Cettolo, M. Vescovi, Efficient audio segmentation algorithms based on the BIC, in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP’03), vol. 6 (2003)
S. Chen, P. Gopalakrishnan, Speaker, environment and channel change detection and clustering via the Bayesian information criterion, in Proceedings of DARPA Broadcast News Transcription and Understanding Workshop (1998), pp. 127–132
T. Cover, J. Thomas, Elements of Information Theory (Wiley-Interscience, London, 2006)
S. Davis, P. Mermelstein, Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoust. Speech Signal Process. 28(4), 357–366 (1980)
N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, P. Ouellet, Front-end factor analysis for speaker verification. IEEE Trans. Audio Speech Lang. Process. 19(4), 788–798 (2011). doi:10.1109/TASL.2010.2064307
P. Delacourt, D. Kryze, C. Wellekens, Detection of speaker changes in an audio document, in Sixth European Conference on Speech Communication and Technology (ISCA, Pittsburgh, 1999)
P. Delacourt, C. Wellekens, DISTBIC: a speaker-based segmentation for audio data indexing. Speech Commun. 32(1–2), 111–126 (2000)
A. Dempster, N. Laird, D. Rubin et al., Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. 39(1), 1–38 (1977)
R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification (Wiley, London, 2012)
C. Eckart, Optimal rectifier systems for the detection of steady signals. Scripps Institution of Oceanography, UC San Diego (1952), http://escholarship.org/uc/item/3676p6rt
E. El-Khoury, C. Senac, R. Andre-Obrecht, Speaker diarization: towards a more robust and portable system, in IEEE International Conference on Acoustics, Speech and Signal Processing, 2007 (ICASSP 2007), vol. 4 (2007), pp. 489–492. doi:10.1109/ICASSP.2007.366956
D.P. Ellis, J.C. Liu, Speaker turn segmentation based on between-channel differences, in NIST ICASSP 2004 Meeting Recognition Workshop, Montreal, 2004, pp. 112–117
T. Ferguson, A Bayesian analysis of some nonparametric problems. Ann. Stat. 1(2), 209–230 (1973)
J.G. Fiscus, J. Ajot, J.S. Garofolo, The rich transcription 2007 meeting recognition evaluation, in Multimodal Technologies for Perception of Humans (Springer, Berlin, 2008), pp. 373–389
J.G. Fiscus, J. Ajot, M. Michel, J.S. Garofolo, The Rich Transcription 2006 Spring Meeting Recognition Evaluation (Springer, Berlin, 2006)
J.G. Fiscus, N. Radde, J.S. Garofolo, A. Le, J. Ajot, C. Laprun, The rich transcription 2005 spring meeting recognition evaluation, in Machine Learning for Multimodal Interaction (Springer, Berlin, 2006), pp. 369–389
E. Fox, E. Sudderth, M. Jordan, A. Willsky, An HDP-HMM for systems with state persistence, in Proceedings of the 25th International Conference on Machine Learning (ACM, New York, 2008), pp. 312–319
E.B. Fox, E.B. Sudderth, M.I. Jordan, A.S. Willsky, A sticky HDP-HMM with application to speaker diarization. Ann. Appl. Stat. 5(2A), 1020–1056 (2011)
G. Friedland, O. Vinyals, Y. Huang, C. Muller, Fusing short term and long term features for improved speaker diarization, in IEEE International Conference on Acoustics, Speech and Signal Processing, 2009 (ICASSP 2009) (2009), pp. 4077–4080. doi:10.1109/ICASSP.2009.4960524
G. Friedland, A. Janin, D. Imseng, X. Anguera Miro, L. Gottlieb, M. Huijbregts, M. Knox, O. Vinyals, The ICSI RT-09 speaker diarization system. IEEE Trans. Audio Speech Lang. Process. 20(2), 371–381 (2012). doi:10.1109/TASL.2011.2158419
G. Friedland, O. Vinyals, Y. Huang, C. Muller, Prosodic and other long-term features for speaker diarization. IEEE Trans. Audio Speech Lang. Process. 17(5), 985–993 (2009). doi:10.1109/TASL.2009.2015089
R. Gangadharaiah, B. Narayanaswamy, N. Balakrishnan, A novel method for two-speaker segmentation, in Interspeech (2004)
J. Gauvain, C. Lee, Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains. IEEE Trans. Speech Audio Process. 2(2), 291–298 (1994)
J.L. Gauvain, L. Lamel, G. Adda, Partitioning and transcription of broadcast news data, in ICSLP, vol. 98 (1998), pp. 1335–1338
J.T. Geiger, F. Wallhoff, G. Rigoll, GMM-UBM based open-set online speaker diarization, in Interspeech (2010), pp. 2330–2333
H. Gish, M.H. Siu, R. Rohlicek, Segregation of speakers for speech recognition and speaker identification, in International Conference on Acoustics, Speech, and Signal Processing, 1991 (ICASSP-91) (1991), pp. 873–876
T. Hain, S. Johnson, A. Tuerk, P. Woodland, S. Young, Segment generation and clustering in the HTK broadcast news transcription system, in Proceedings of DARPA Broadcast News Transcription and Understanding Workshop, vol. 1998 (1998)
J. Hansen, B. Zhou, M. Akbacak, R. Sarikaya, B. Pellom, Audio stream phrase recognition for a national gallery of the spoken word: “One Small Step”, in Sixth International Conference on Spoken Language Processing (ISCA, Pittsburgh, 2000)
H. Hermansky, Perceptual linear predictive (PLP) analysis of speech. J. Acoust. Soc. Am. 87(4), 1738–1752 (1990)
H. Hermansky, N. Morgan, A. Bayya, P. Kohn, RASTA-PLP speech analysis technique, in IEEE International Conference on Acoustics, Speech, and Signal Processing, 1992 (ICASSP-92), vol. 1 (1992), pp. 121–124
M. Huijbregts, R. Ordelman, F. de Jong, Annotation of heterogeneous multimedia content using automatic speech recognition, in Semantic Multimedia. Lecture Notes in Computer Science, vol. 4816 (Springer, Berlin/Heidelberg, 2007), pp. 78–90
D. Imseng, G. Friedland, An adaptive initialization method for speaker diarization based on prosodic features, in 2010 IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP) (2010), pp. 4946–4949
D. Istrate, C. Fredouille, S. Meignier, L. Besacier, J.F. Bonastre, NIST RT’05S evaluation: pre-processing techniques and speaker diarization on multiple microphone meetings, in Machine Learning for Multimodal Interaction (Springer, Berlin, 2006), pp. 428–439
H. Jin, F. Kubala, R. Schwartz, Automatic speaker clustering, in Proceedings of the DARPA Speech Recognition Workshop (1997), pp. 108–111
Q. Jin, T. Schultz, Speaker segmentation and clustering in meetings, in Interspeech, vol. 4 (2004), pp. 597–600
S. Johnson, Who spoke when?-automatic segmentation and clustering for determining speaker turns, in Sixth European Conference on Speech Communication and Technology (ISCA, Pittsburgh, 1999)
S.E. Johnson, P.C. Woodland, Speaker clustering using direct maximisation of the MLLR-adapted likelihood, in Proceedings of ICSLP 98 (1998), pp. 1775–1779
T. Kemp, M. Schmidt, M. Westphal, A. Waibel, Strategies for automatic segmentation of audio data, in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, 2000 (ICASSP’00), vol. 3 (2000), pp. 1423–1426
P. Kenny, G. Boulianne, P. Dumouchel, Eigenvoice modeling with sparse training data. IEEE Trans. Speech Audio Process. 13(3), 345–354 (2005). doi:10.1109/TSA.2004.840940
H. Kim, D. Ertelt, T. Sikora, Hybrid speaker-based segmentation system using model-level clustering, in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 1 (2005), pp. 745–748
B.E. Kingsbury, N. Morgan, S. Greenberg, Robust speech recognition using the modulation spectrogram. Speech Commun. 25(1), 117–132 (1998)
C. Knapp, G. Carter, The generalized correlation method for estimation of time delay. IEEE Trans. Acoust. Speech Signal Process. 24(4), 320–327 (1976)
T. Koshinaka, K. Nagatomo, K. Shinoda, Online speaker clustering using incremental learning of an ergodic hidden Markov model, in IEEE International Conference on Acoustics, Speech and Signal Processing, 2009 (ICASSP 2009) (2009), pp. 4093–4096. doi:10.1109/ICASSP.2009.4960528
R. Kuhn, J.C. Junqua, P. Nguyen, N. Niedzielski, Rapid speaker adaptation in eigenvoice space. IEEE Trans. Speech Audio Process. 8(6), 695–707 (2000)
I. Lapidot, SOM as likelihood estimator for speaker clustering, in Eighth European Conference on Speech Communication and Technology (ISCA, Pittsburgh, 2003)
K. Laskowski, C. Fugen, T. Schultz, Simultaneous multispeaker segmentation for automatic meeting recognition, in Proceedings of EUSIPCO, Poznan, 2007, pp. 1294–1298
K. Laskowski, Q. Jin, T. Schultz, Crosscorrelation-based multispeaker speech activity detection, in Eighth International Conference on Spoken Language Processing (ISCA, Pittsburgh, 2004)
K. Laskowski, T. Schultz, A geometric interpretation of non-target-normalized maximum cross-channel correlation for vocal activity detection in meetings, in Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Companion Volume, Short Papers (Association for Computational Linguistics, 2007), pp. 89–92
K. Laskowski, T. Schultz, Unsupervised learning of overlapped speech model parameters for multichannel speech activity detection in meetings, in Proceedings of ICASSP (2006), pp. 993–996
V.B. Le, O. Mella, D. Fohr, et al., Speaker diarization using normalized cross likelihood ratio, in Interspeech, vol. 7 (2007), pp. 1869–1872
D.A. van Leeuwen, The TNO speaker diarization system for NIST RT05s meeting data, in Machine Learning for Multimodal Interaction (Springer, Berlin, 2006), pp. 440–449
D.A. van Leeuwen, M. Konečný, Progress in the AMIDA speaker diarization system for meeting data, in Multimodal Technologies for Perception of Humans (Springer, Berlin, 2008), pp. 475–483
D. Liu, F. Kubala, Online speaker clustering, in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, 2004 (ICASSP’04), vol. 1 (2004), pp. 333–336
D. Liu, F. Kubala, Fast speaker change detection for broadcast news transcription and indexing, in Sixth European Conference on Speech Communication and Technology (1999)
J. López, D. Ellis, Using acoustic condition clustering to improve acoustic change detection on broadcast news, in Sixth International Conference on Spoken Language Processing (ISCA, Pittsburgh, 2000)
L. Lu, H. Zhang, Real-time unsupervised speaker change detection, in International Conference on Pattern Recognition, vol. 16 (2002), pp. 358–361
J. Luque, C. Segura, J. Hernando, Clustering initialization based on spatial information for speaker diarization of meetings, in Interspeech (2008), pp. 383–386
J. Makhoul, Linear prediction: a tutorial review. Proc. IEEE 63(4), 561–580 (1975)
A. Malegaonkar, A. Ariyaeeinia, P. Sivakumaran, J. Fortuna, Unsupervised speaker change detection using probabilistic pattern matching. IEEE Signal Process. Lett. 13(8), 509–512 (2006)
K. Markov, S. Nakamura, Never-ending learning system for on-line speaker diarization, in IEEE Workshop on Automatic Speech Recognition Understanding, 2007 (ASRU) (2007), pp. 699–704. doi:10.1109/ASRU.2007.4430197
K. Markov, S. Nakamura, Improved novelty detection for online GMM based speaker diarization, in Interspeech (2008), pp. 363–366
S. Meignier, J. Bonastre, S. Igounet, E-HMM approach for learning and adapting sound models for speaker indexing, in 2001: A Speaker Odyssey-The Speaker Recognition Workshop (ISCA, Pittsburgh, 2001)
S. Meignier, D. Moraru, C. Fredouille, J.F. Bonastre, L. Besacier, Step-by-step and integrated approaches in broadcast news speaker diarization. Comput. Speech Lang. 20(2–3), 303–330 (2006). doi:10.1016/j.csl.2005.08.002. http://www.sciencedirect.com/science/article/pii/S0885230805000471
X.A. Miró, Robust speaker diarization for meetings, Ph.D. thesis, Universitat Politècnica de Catalunya, Barcelona (2006)
D. Moraru, S. Meignier, L. Besacier, J.F. Bonastre, I. Magrin-Chagnolleau, The ELISA consortium approaches in speaker segmentation during the NIST 2002 speaker recognition evaluation, in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003 (ICASSP’03), vol. 2 (2003), p. II-89
D. Moraru, S. Meignier, C. Fredouille, L. Besacier, J.F. Bonastre, The ELISA consortium approaches in broadcast news speaker segmentation during the NIST 2003 rich transcription evaluation, in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, 2004 (ICASSP’04), vol. 1 (2004), p. I-373
K. Mori, S. Nakagawa, Speaker change detection and speaker clustering using VQ distortion for broadcast news speech recognition, in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, 2001 (ICASSP’01), vol. 1 (2001)
R.M. Neal, G.E. Hinton, A view of the EM algorithm that justifies incremental, sparse, and other variants, in Learning in Graphical Models (Springer, Berlin, 1998), pp. 355–368
A.Y. Ng, M.I. Jordan, Y. Weiss et al., On spectral clustering: analysis and an algorithm. Adv. Neural Inf. Process. Syst. 2, 849–856 (2002)
P. Nguyen, L. Rigazio, Y. Moh, J. Junqua, Rich transcription 2002 site report, Panasonic Speech Technology Laboratory (PSTL), in Proceedings of the 2002 Rich Transcription Workshop (2002)
M. Nishida, T. Kawahara, Unsupervised speaker indexing using speaker model selection based on Bayesian information criterion, in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003 (ICASSP’03), vol. 1 (2003), pp. 172–175
J.M. Pardo, X. Anguera, C. Wooters, Speaker diarization for multi-microphone meetings using only between-channel differences, in Machine Learning for Multimodal Interaction (Springer, Berlin, 2006), pp. 257–264
J.M. Pardo, X. Anguera, C. Wooters, Speaker diarization for multiple distant microphone meetings: mixing acoustic features and inter-channel time differences, in Interspeech (2006)
J.M. Pardo, R. Barra-Chicote, R. San-Segundo, R. de Córdoba, B. Martínez-González, Speaker diarization features: the UPM contribution to the RT09 evaluation. IEEE Trans. Audio Speech Lang. Process. 20(2), 426–435 (2012)
J. Pelecanos, S. Sridharan, Feature warping for robust speaker verification, in 2001: A Speaker Odyssey-The Speaker Recognition Workshop (2001)
L. Perez-Freire, C. Garcia-Mateo, A multimedia approach for audio segmentation in TV broadcast news, in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, 2004 (ICASSP’04), vol. 1 (2004)
T. Pfau, D. Ellis, A. Stolcke, Multispeaker speech activity detection for the ICSI meeting recorder, in Proceedings of ASRU, vol. 1 (2001)
L.R. Rabiner, A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 77(2), 257–286 (1989)
W.M. Rand, Objective criteria for the evaluation of clustering methods. J. Am. Stat. Assoc. 66(336), 846–850 (1971)
D. Reynolds, E. Singer, B. Carlson, G. O’Leary, J. McLaughlin, M. Zissman, Blind clustering of speech utterances based on speaker and language characteristics, in Fifth International Conference on Spoken Language Processing (ISCA, Pittsburgh, 1998)
D.A. Reynolds, T.F. Quatieri, R.B. Dunn, Speaker verification using adapted Gaussian mixture models. Digit. Signal Process. 10(1), 19–41 (2000)
D.A. Reynolds, R.C. Rose, Robust text-independent speaker identification using Gaussian mixture speaker models. IEEE Trans. Speech Audio Process. 3(1), 72–83 (1995)
D.A. Reynolds, P. Torres-Carrasquillo, The MIT Lincoln laboratory RT-04F diarization systems: applications to broadcast audio and telephone conversations. Technical Report, DTIC Document (2004)
M. Roch, Y. Cheng, Speaker segmentation using the MAP-adapted Bayesian information criterion, in ODYSSEY04-The Speaker and Language Recognition Workshop (ISCA, Pittsburgh, 2004)
P.R. Roth, Effective measurements using digital signal analysis. IEEE Spectr. 8(4), 62–70 (1971)
J. Rougui, M. Rziza, D. Aboutajdine, M. Gelgon, J. Martinez, F. Rabat, Fast incremental clustering of Gaussian mixture speaker models for scaling up retrieval in on-line broadcast, in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2006 (ICASSP 2006), vol. 5 (2006)
M. Rouvier, S. Meignier, A global optimization framework for speaker diarization, in Odyssey 2012-The Speaker and Language Recognition Workshop (2012)
M.A. Sato, S. Ishii, On-line EM algorithm for the normalized Gaussian network. Neural Comput. 12(2), 407–432 (2000)
G. Schwarz, Estimating the dimension of a model. Ann. Stat. 6(2), 461–464 (1978)
E. Shriberg, L. Ferrer, S. Kajarekar, A. Venkataraman, A. Stolcke, Modeling prosodic feature sequences for speaker recognition. Speech Commun. 46(3), 455–472 (2005)
S. Shum, N. Dehak, E. Chuangsuwanich, D.A. Reynolds, J.R. Glass, Exploiting intra-conversation variability for speaker diarization, in Interspeech (2011), pp. 945–948
S. Shum, N. Dehak, R. Dehak, J. Glass, Unsupervised methods for speaker diarization: an integrated and iterative approach. IEEE Trans. Audio Speech Lang. Process. 21(10), 2015–2028 (2013). doi:10.1109/TASL.2013.2264673
S. Shum, N. Dehak, J. Glass, On the use of spectral and iterative methods for speaker diarization. System 1(w2), 2 (2012)
M.A. Siegler, U. Jain, B. Raj, R.M. Stern, Automatic segmentation, classification and clustering of broadcast news audio, in Proceedings of DARPA Broadcast News Workshop (1997), p. 11
J. Silovsky, J. Prazak, Speaker diarization of broadcast streams using two-stage clustering based on i-vectors and cosine distance scoring, in 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2012), pp. 4193–4196
R. Sinha, S.E. Tranter, M.J. Gales, P.C. Woodland, The Cambridge university March 2005 speaker diarisation system, in Interspeech (2005), pp. 2437–2440
P. Sivakumaran, J. Fortuna, A.M. Ariyaeeinia, On the use of the Bayesian information criterion in multiple speaker detection, in Interspeech (2001), pp. 795–798
A. Solomonoff, A. Mielke, M. Schmidt, H. Gish, Clustering speakers by their voices, in Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 2 (1998), pp. 757–760
S. Stevens, J. Volkmann, The relation of pitch to frequency: a revised scale. Am. J. Psychol. 53(3), 329–353 (1940)
H. Sun, B. Ma, S. Kalayar Khine, H. Li, Speaker diarization system for RT07 and RT09 meeting room audio, in 2010 IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP) (2010), pp. 4982–4985
H. Tang, S. Chu, M. Hasegawa-Johnson, T. Huang, Partially supervised speaker clustering. IEEE Trans. Pattern Anal. Mach. Intell. 34(5), 959–971 (2012). doi:10.1109/TPAMI.2011.174
Y. Teh, M. Jordan, M. Beal, D. Blei, Hierarchical Dirichlet processes. J. Am. Stat. Assoc. 101(476), 1566–1581 (2006)
S. Tranter, Two-way cluster voting to improve speaker diarisation performance, in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005 (ICASSP’05), vol. 1 (2005)
A. Tritschler, R. Gopinath, Improved speaker segmentation and segments clustering using the Bayesian information criterion, in Sixth European Conference on Speech Communication and Technology (ISCA, Pittsburgh, 1999), pp. 679–682
W. Tsai, H. Wang, On maximizing the within-cluster homogeneity of speaker voice characteristics for speech utterance clustering, in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, Toulouse, 2006
W.H. Tsai, S.S. Cheng, Y.H. Chao, H.M. Wang, Clustering speech utterances by speaker using eigenvoice-motivated vector space models, in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005 (ICASSP’05), vol. 1 (2005), pp. 725–728
W.H. Tsai, S.S. Cheng, H.M. Wang, Speaker clustering of speech utterances using a voice characteristic reference space, in Eighth International Conference on Spoken Language Processing (2004)
F. Valente, Infinite models for speaker clustering, in Ninth International Conference on Spoken Language Processing (ISCA, Pittsburgh, 2006)
F. Valente, C. Wellekens, Variational Bayesian speaker clustering, in ODYSSEY04-The Speaker and Language Recognition Workshop (ISCA, Pittsburgh, 2004)
F. Valente, C. Wellekens, Variational Bayesian adaptation for speaker clustering, in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005 (ICASSP’05), vol. 1 (2005)
D.A. van Leeuwen, The TNO speaker diarization system for NIST RT05s meeting data, in Machine Learning for Multimodal Interaction. Lecture Notes in Computer Science, vol. 3869 (Springer, Berlin/Heidelberg, 2006), pp. 440
A. Vandecatseye, J. Martens, A fast, accurate and stream-based speaker segmentation and clustering algorithm, in Eighth European Conference on Speech Communication and Technology (ISCA, Pittsburgh, 2003)
D. Vijayasenan, F. Valente, Speaker diarization of meetings based on large TDOA feature vectors, in 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2012), pp. 4173–4176. doi:10.1109/ICASSP.2012.6288838
D. Vijayasenan, F. Valente, H. Bourlard, Agglomerative information bottleneck for speaker diarization of meetings data, in IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU) (2007), pp. 250–449
D. Vijayasenan, F. Valente, H. Bourlard, Combination of agglomerative and sequential clustering for speaker diarization, in IEEE International Conference on Acoustics, Speech and Signal Processing, 2008 (ICASSP 2008) (2008), pp. 4361–4364. doi:10.1109/ICASSP.2008.4518621
D. Vijayasenan, F. Valente, H. Bourlard, Integration of TDOA features in information bottleneck framework for fast speaker diarization, in Interspeech (2008), pp. 40–43
D. Vijayasenan, F. Valente, H. Bourlard, Mutual information based channel selection for speaker diarization of meetings data, in IEEE International Conference on Acoustics, Speech and Signal Processing, 2009 (ICASSP 2009) (2009), pp. 4065–4068. doi:10.1109/ICASSP.2009.4960521
D. Vijayasenan, F. Valente, H. Bourlard, An information theoretic combination of MFCC and TDOA features for speaker diarization. IEEE Trans. Audio Speech Lang. Process. 19(2), 431–438 (2011). doi:10.1109/TASL.2010.2048603
D. Vijayasenan, F. Valente, H. Bourlard, Multistream speaker diarization of meetings recordings beyond MFCC and TDOA features. Speech Commun. 54(1), 55–67 (2012)
O. Vinyals, G. Friedland, Modulation spectrogram features for improved speaker diarization, in Interspeech (2008), pp. 630–633
A. Viterbi, Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Trans. Inf. Theory 13(2), 260–269 (1967)
H. Wang, S. Cheng, METRIC-SEQDAC: a hybrid approach for audio segmentation, in Eighth International Conference on Spoken Language Processing (ISCA, Pittsburgh, 2004)
N. Wiener, Extrapolation, Interpolation, and Smoothing of Stationary Time Series: With Engineering Applications, vol. 8 (MIT Press, Cambridge, 1964)
A. Willsky, H. Jones, A generalized likelihood ratio approach to the detection and estimation of jumps in linear systems. IEEE Trans. Automat. Contr. 21(1), 108–112 (1976)
C. Wooters, J. Fung, B. Peskin, X. Anguera, Towards robust speaker segmentation: the ICSI-SRI fall 2004 diarization system, in RT-04F Workshop, vol. 23 (2004)
C. Wooters, M. Huijbregts, The ICSI RT07s speaker diarization system, in Multimodal Technologies for Perception of Humans (Springer, Berlin, 2008), pp. 509–519
S. Wrigley, G. Brown, V. Wan, S. Renals, Feature selection for the classification of crosstalk in multi-channel audio, in Eighth European Conference on Speech Communication and Technology (ISCA, Pittsburgh, 2003)
S. Wrigley, G. Brown, V. Wan, S. Renals, Speech and crosstalk detection in multichannel audio. IEEE Trans. Speech Audio Process. 13(1), 84–91 (2005)
T. Wu, L. Lu, K. Chen, H. Zhang, UBM-based real-time speaker segmentation for broadcasting news, in ICME 2003, vol. 2 (2003), pp. 721–724
K. Yamanishi, J.I. Takeuchi, G. Williams, P. Milne, On-line unsupervised outlier detection using finite mixtures with discounting learning algorithms, in Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (ACM, New York, 2000), pp. 320–324
M. Zamalloa, L.J. Rodríguez-Fuentes, G. Bordel, M. Penagarikano, J.P. Uribe, Low-latency online speaker tracking on the AMI corpus of meeting conversations, in 2010 IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP) (2010), pp. 4962–4965
B. Zhou, J. Hansen, Efficient audio stream segmentation via the combined T2 statistic and Bayesian information criterion. IEEE Trans. Speech Audio Process. 13(4), 467–474 (2005)
B. Zhou, J.H. Hansen, Unsupervised audio stream segmentation and clustering via the Bayesian information criterion, in Interspeech (2000), pp. 714–717
X. Zhu, C. Barras, L. Lamel, J.L. Gauvain, Speaker diarization: from broadcast news to lectures, in Machine Learning for Multimodal Interaction (Springer, Berlin, 2006), pp. 396–406
X. Zhu, C. Barras, S. Meignier, J.L. Gauvain, Combining speaker identification and BIC for speaker diarization, in Interspeech, vol. 5 (2005), pp. 2441–2444
P. Zochova, V. Radova, Modified DISTBIC algorithm for speaker change detection, in Ninth European Conference on Speech Communication and Technology (ISCA, Pittsburgh, 2005)
E. Zwicker, E. Terhardt, Analytical expressions for critical-band rate and critical bandwidth as a function of frequency. J. Acoust. Soc. Am. 68, 1523 (1980)
Copyright information
© 2015 Springer Science+Business Media New York
About this chapter
Cite this chapter
Nguyen, T.H., Chng, E.S., Li, H. (2015). Speaker Diarization: An Emerging Research. In: Ogunfunmi, T., Togneri, R., Narasimha, M. (eds) Speech and Audio Processing for Coding, Enhancement and Recognition. Springer, New York, NY. https://doi.org/10.1007/978-1-4939-1456-2_8
DOI: https://doi.org/10.1007/978-1-4939-1456-2_8
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4939-1455-5
Online ISBN: 978-1-4939-1456-2
eBook Packages: Engineering, Engineering (R0)