Abstract
Lexical information extracted from speech transcripts can be used for speaker identification (SID), either on its own or, via fusion, to improve the performance of standard cepstral-based SID systems. Previous work established this mostly on isolated speech from single speakers (NIST SRE corpora, parliamentary speeches). In contrast, this work applies lexical SID approaches to a different type of data: the REPERE corpus of unsegmented multiparty conversations, mostly debates, discussions and Q&A sessions from TV shows. It is hypothesized that people give away clues to their identity when speaking in such settings, which this work aims to exploit. The impact on SID performance of the diarization front-end required to pre-process the unsegmented data is also measured. Four lexical SID approaches are studied, including TF-IDF, BM25 and LDA-based topic modeling. Results are analysed by TV show and speaker role. The lexical approaches achieve low error rates for certain speaker roles such as anchors and journalists, sometimes lower than a standard cepstral-based Gaussian Supervector - Support Vector Machine (GSV-SVM) system. In certain cases, the lexical system also yields a modest improvement over the cepstral-based system under score-level sum fusion. To highlight the potential of lexical information not just as a complement to cepstral-based SID systems but as an independent approach in its own right, initial studies on cross-media SID are briefly reported: instead of the speech data that all cepstral systems require, this approach uses Wikipedia texts to train lexical speaker models, which are then tested on speech transcripts to identify speakers.
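The abstract names TF-IDF among the lexical SID approaches. As a rough, hypothetical sketch of the idea (not the paper's actual implementation; speaker names and transcripts below are invented), each target speaker's training utterances are pooled into one document, turned into a TF-IDF vector, and a test transcript is assigned to the speaker with the highest cosine similarity:

```python
import math
from collections import Counter

# Hypothetical training transcripts: all of a target speaker's
# utterances concatenated into a single document.
speaker_docs = {
    "anchor": "good evening welcome back to the show tonight we debate the budget",
    "journalist": "my question concerns the report published last week on unemployment",
}

docs = {name: doc.split() for name, doc in speaker_docs.items()}
n_docs = len(docs)
vocab = {t for toks in docs.values() for t in toks}
# Smoothed inverse document frequency over the speaker documents.
idf = {t: math.log(n_docs / sum(t in toks for toks in docs.values())) + 1.0
       for t in vocab}

def tfidf_vector(tokens):
    """Term frequency times IDF; unseen terms get zero weight."""
    tf = Counter(tokens)
    return {t: tf[t] * idf.get(t, 0.0) for t in tf}

models = {name: tfidf_vector(toks) for name, toks in docs.items()}

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def identify(transcript):
    """Score the test transcript against every speaker model; return best match."""
    test = tfidf_vector(transcript.split())
    return max(models, key=lambda name: cosine(test, models[name]))
```

BM25 follows the same retrieval-style pattern with a different term-weighting function, while the LDA-based approach would replace the TF-IDF vectors with per-speaker topic distributions.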
Notes
Example: www.opensubtitles.org.
www.defi-repere.fr (in French)
Role annotations were provided by ELDA with the corpus.
Note that all 359 target speaker models are retained during scoring, not just the 63 in common, so that the level of difficulty (equivalently, the random chance performance) matches the open-set case.
It is assumed that manual transcription takes more effort than manual diarization/segmentation.
Only the REPERE training corpus was used to train topics, so as to have perfectly matched training data. The option of using other corpora for topic training will be studied later.
In this study, the set of 63 speakers in common between training and test corpora were used both for training target speaker models and as the reference set for test (ref. Section 4.1). The time limits of 120 s and 420 s were chosen as round numbers which divide this set nearly equally into 20 speakers for each of the 3 conditions (precisely, 20, 20 and 23 speakers respectively).
The scores for the GSV-SVM system were the distances of the test points to the decision hyperplane.
Only biographical articles were considered in this work, i.e. one article per speaker, e.g. http://fr.wikipedia.org/wiki/Luc_Besson.
Only manual transcripts were used in this study.
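One of the notes above states that the GSV-SVM scores were the distances of the test points to the decision hyperplane. As a toy illustration of that scoring rule (the weights and bias here are invented, not taken from the paper's trained models), the signed distance of a test point x to the hyperplane w.x + b = 0 is (w.x + b) / ||w||:

```python
import math

# Invented per-speaker SVM parameters for illustration only.
w = [0.8, -0.6]   # hyperplane normal vector
b = -0.2          # bias term

def svm_score(x):
    """Signed distance of test point x to the decision hyperplane:
    positive values fall on the target-speaker side, negative on the other."""
    margin = sum(wi * xi for wi, xi in zip(w, x)) + b
    return margin / math.sqrt(sum(wi * wi for wi in w))
```

Using the raw signed distance (rather than a hard accept/reject decision) gives a continuous score per target speaker, which is what makes score-level sum fusion with the lexical system possible.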
Acknowledgments
The authors would like to thank Lori Lamel for providing the alignments for the mmAM configuration, and François Yvon, Sophie Rosset, Sylvain Meignier and the anonymous reviewers for their helpful comments and advice. This work was partly realized as part of the Quaero Program and the QCompere project, respectively funded by OSEO (French State agency for innovation) and ANR (French national research agency).
Cite this article
Roy, A., Bredin, H., Hartmann, W. et al. Lexical speaker identification in TV shows. Multimed Tools Appl 74, 1377–1396 (2015). https://doi.org/10.1007/s11042-014-1940-3