Abstract
Lexical information extracted from speech transcripts can be used for speaker identification (SID), either on its own or, via fusion, to improve the performance of standard cepstral-based SID systems. Previous work established this mostly on isolated speech from single speakers (NIST SRE corpora, parliamentary speeches). In contrast, this work applies lexical SID approaches to a different type of data: the REPERE corpus of unsegmented multiparty conversations, mostly debates, discussions and Q&A sessions from TV shows. It is hypothesized that people give away clues to their identity when speaking in such settings, which this work aims to exploit. The impact on SID performance of the diarization front-end required to pre-process the unsegmented data is also measured. Four lexical SID approaches are studied, including TF-IDF, BM25 and LDA-based topic modeling. Results are analysed by TV show and speaker role. The lexical approaches achieve low error rates for certain speaker roles such as anchors and journalists, sometimes lower than a standard cepstral-based Gaussian Supervector - Support Vector Machine (GSV-SVM) system. In certain cases, the lexical system also yields a modest improvement over the cepstral-based system under score-level sum fusion. To highlight the potential of lexical information not just as a complement to cepstral-based SID systems but as an independent approach in its own right, initial studies on cross-media SID are briefly reported: instead of the speech data that all cepstral systems require, this approach uses Wikipedia texts to train lexical speaker models, which are then tested on speech transcripts to identify speakers.
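The abstract names TF-IDF among the lexical SID approaches. As a rough, hypothetical sketch of the idea (not the paper's actual implementation; speaker names and transcripts below are invented), each target speaker's training utterances are pooled into one document, turned into a TF-IDF vector, and a test transcript is assigned to the speaker with the highest cosine similarity:

```python
import math
from collections import Counter

# Hypothetical training transcripts: all of a target speaker's
# utterances concatenated into a single document.
speaker_docs = {
    "anchor": "good evening welcome back to the show tonight we debate the budget",
    "journalist": "my question concerns the report published last week on unemployment",
}

docs = {name: doc.split() for name, doc in speaker_docs.items()}
n_docs = len(docs)
vocab = {t for toks in docs.values() for t in toks}
# Smoothed inverse document frequency over the speaker documents.
idf = {t: math.log(n_docs / sum(t in toks for toks in docs.values())) + 1.0
       for t in vocab}

def tfidf_vector(tokens):
    """Term frequency times IDF; unseen terms get zero weight."""
    tf = Counter(tokens)
    return {t: tf[t] * idf.get(t, 0.0) for t in tf}

models = {name: tfidf_vector(toks) for name, toks in docs.items()}

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def identify(transcript):
    """Score the test transcript against every speaker model; return best match."""
    test = tfidf_vector(transcript.split())
    return max(models, key=lambda name: cosine(test, models[name]))
```

BM25 follows the same retrieval-style pattern with a different term-weighting function, while the LDA-based approach would replace the TF-IDF vectors with per-speaker topic distributions.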
Notes
Example: www.opensubtitles.org.
www.defi-repere.fr (in French)
Role annotations were provided by ELDA with the corpus.
Note that all 359 target speaker models are retained during scoring, not just the 63 in common, so that the level of difficulty (equivalently, the random chance performance) matches the open-set case.
It is assumed that manual transcription takes more effort than manual diarization/segmentation.
Only the REPERE training corpus was used to train topics, so as to have perfectly matched training data. The option of using other corpora for topic training will be studied later.
In this study, the set of 63 speakers in common between training and test corpora were used both for training target speaker models and as the reference set for test (ref. Section 4.1). The time limits of 120 s and 420 s were chosen as round numbers which divide this set nearly equally into 20 speakers for each of the 3 conditions (precisely, 20, 20 and 23 speakers respectively).
The scores for the GSV-SVM system were the distances of the test points to the decision hyperplane.
Only biographical articles were considered in this work, i.e. one article per speaker, e.g. http://fr.wikipedia.org/wiki/Luc_Besson.
Only manual transcripts were used in this study.
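One of the notes above states that the GSV-SVM scores were the distances of the test points to the decision hyperplane. As a toy illustration of that scoring rule (the weights and bias here are invented, not taken from the paper's trained models), the signed distance of a test point x to the hyperplane w.x + b = 0 is (w.x + b) / ||w||:

```python
import math

# Invented per-speaker SVM parameters for illustration only.
w = [0.8, -0.6]   # hyperplane normal vector
b = -0.2          # bias term

def svm_score(x):
    """Signed distance of test point x to the decision hyperplane:
    positive values fall on the target-speaker side, negative on the other."""
    margin = sum(wi * xi for wi, xi in zip(w, x)) + b
    return margin / math.sqrt(sum(wi * wi for wi in w))
```

Using the raw signed distance (rather than a hard accept/reject decision) gives a continuous score per target speaker, which is what makes score-level sum fusion with the lexical system possible.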
Acknowledgments
The authors would like to thank Lori Lamel for providing the alignments for the mmAM configuration, and François Yvon, Sophie Rosset, Sylvain Meignier and the anonymous reviewers for their helpful comments and advice. This work was partly realized as part of the Quaero Program and the QCompere project, respectively funded by OSEO (French State agency for innovation) and ANR (French national research agency).
Cite this article
Roy, A., Bredin, H., Hartmann, W. et al. Lexical speaker identification in TV shows. Multimed Tools Appl 74, 1377–1396 (2015). https://doi.org/10.1007/s11042-014-1940-3