
Lexical speaker identification in TV shows


Abstract

It is possible to use lexical information extracted from speech transcripts for speaker identification (SID), either on its own or, through fusion, to improve the performance of standard cepstral-based SID systems. This has previously been established mostly on isolated speech from single speakers (NIST SRE corpora, parliamentary speeches). In contrast, this work applies lexical SID approaches to a different type of data: the REPERE corpus, consisting of unsegmented multiparty conversations, mostly debates, discussions and Q&A sessions from TV shows. The hypothesis is that people give out clues to their identity when speaking in such settings, and this work aims to exploit those clues. The impact on SID performance of the diarization front-end required to pre-process the unsegmented data is also measured. Four lexical SID approaches are studied, including TFIDF, BM25 and LDA-based topic modeling. Results are analysed by TV show and by speaker role. The lexical approaches achieve low error rates for certain speaker roles such as anchors and journalists, sometimes lower than those of a standard cepstral-based Gaussian Supervector - Support Vector Machine (GSV-SVM) system. In certain cases, the lexical system also brings a modest improvement over the cepstral-based system through score-level sum fusion. To highlight the potential of lexical information not just as a complement to cepstral-based SID systems but as an independent approach in its own right, initial studies on crossmedia SID are briefly reported: instead of the speech data that all cepstral systems require, this approach uses Wikipedia texts to train lexical speaker models, which are then tested on speech transcripts to identify speakers.
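To make the lexical approach concrete, below is a minimal sketch of a TFIDF-based speaker identifier: one pooled training "document" per target speaker, scored against a test transcript by cosine similarity. It assumes scikit-learn, and the speaker names and transcripts are hypothetical placeholders rather than the paper's actual configuration; for crossmedia SID, the training documents would be Wikipedia texts instead of speech transcripts.

```python
# Minimal TF-IDF speaker-identification sketch (illustration only).
# Speakers and transcripts below are hypothetical placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# One pooled training "document" per target speaker.
train_docs = {
    "anchor_A": "welcome back tonight we discuss the election results ...",
    "guest_B": "as an economist I would argue that inflation ...",
}
speakers = list(train_docs)

# Fit the TF-IDF vocabulary on the training documents; each row of
# speaker_vecs is one speaker's lexical model.
vectorizer = TfidfVectorizer(lowercase=True)
speaker_vecs = vectorizer.fit_transform([train_docs[s] for s in speakers])

def identify(test_transcript: str) -> str:
    """Return the target speaker whose TF-IDF model best matches the transcript."""
    test_vec = vectorizer.transform([test_transcript])
    scores = cosine_similarity(test_vec, speaker_vecs)[0]
    return speakers[scores.argmax()]

print(identify("good evening and welcome back to tonight's debate"))
```

The BM25 and LDA-based variants follow the same train-and-score pattern, differing in how the speaker and test documents are represented and compared.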


Notes

  1. Example: www.opensubtitles.org.

  2. www.cis.uni-muenchen.de/~schmid/tools/TreeTagger

  3. www.defi-repere.fr (in French)

  4. www.elda.org

  5. Role annotations were provided by ELDA with the corpus.

  6. Note that all 359 target speaker models are retained during scoring, not just the 63 in common, so that the level of difficulty (equivalently, the random-chance performance) is the same as in the open-set case.

  7. It is assumed that manual transcription takes more effort than manual diarization/segmentation.

  8. Only the REPERE training corpus was used to train topics, so as to have perfectly matched training data. The option of using other corpora to train topics will be studied later.

  9. In this study, the set of 63 speakers common to the training and test corpora was used both for training target speaker models and as the reference set for testing (cf. Section 4.1). The time limits of 120 s and 420 s were chosen as round numbers which divide this set nearly equally across the 3 conditions (precisely, 20, 20 and 23 speakers respectively).

  10. The scores for the GSV-SVM system were the distances of the test points to the SVM decision hyperplane (a fusion sketch using such scores is given after this list).

  11. Only biographical articles were considered in this work, i.e. one article per speaker, e.g. http://fr.wikipedia.org/wiki/Luc_Besson.

  12. Only manual transcripts were used in this study.
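Note 10 describes the cepstral scores as signed distances to an SVM decision hyperplane, and the abstract mentions score-level sum fusion of the cepstral and lexical systems. The sketch below illustrates one plausible way to combine the two score streams. It is a minimal illustration only: LinearSVC stands in for the GSV-SVM back-end (the real system operates on GMM supervectors), the data are random placeholders, and the z-normalization before summing is an assumed choice, not necessarily the paper's.

```python
# Hypothetical sketch: sum fusion of cepstral (GSV-SVM-style) and lexical scores.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Placeholder "supervectors" and target/non-target labels; in a GSV-SVM
# system these would be GMM supervectors per speaker.
X_train = rng.normal(size=(40, 16))
y_train = rng.integers(0, 2, size=40)

# Signed distances of test points to the decision hyperplane (cf. Note 10)
# play the role of the cepstral scores.
svm = LinearSVC().fit(X_train, y_train)
cepstral_scores = svm.decision_function(rng.normal(size=(5, 16)))

# Placeholder lexical scores, e.g. TF-IDF cosine similarities.
lexical_scores = rng.normal(size=5)

def znorm(s):
    """Z-normalize a score stream so the two systems are on a comparable scale."""
    return (s - s.mean()) / s.std()

# Score-level sum fusion: add the normalized score streams per trial.
fused = znorm(cepstral_scores) + znorm(lexical_scores)
```

Normalizing each stream before summing keeps either system from dominating the fused score simply because of its scale.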


Acknowledgments

The authors would like to thank Lori Lamel for providing the alignments for the mmAM configuration, and François Yvon, Sophie Rosset, Sylvain Meignier and the anonymous reviewers for their helpful comments and advice. This work was partly realized as part of the Quaero Program and the QCompere project, respectively funded by OSEO (French State agency for innovation) and ANR (French national research agency).

Author information


Corresponding author

Correspondence to Anindya Roy.


About this article


Cite this article

Roy, A., Bredin, H., Hartmann, W. et al. Lexical speaker identification in TV shows. Multimed Tools Appl 74, 1377–1396 (2015). https://doi.org/10.1007/s11042-014-1940-3

