
Speaker Change Detection Using Binary Key Modelling with Contextual Information

  • Jose Patino
  • Héctor Delgado
  • Nicholas Evans
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10583)

Abstract

Speaker change detection can benefit a number of speech processing tasks, such as speaker diarization, recognition and detection. Current solutions rely either on highly localized data or on training with large quantities of background data. The former are efficient but tend to over-segment; the latter are more stable but less efficient, and need adaptation when the test data do not match the training data. Building on previous work in speaker recognition and diarization, this paper reports a new binary key (BK) modelling approach to speaker change detection which aims to strike a balance between efficiency and segmentation accuracy. The BK approach is trained on a controllable amount of contextual data rather than on external background data, and is efficient in terms of both computation and speaker discrimination. Experiments on a subset of the standard ETAPE database show that the new approach outperforms current state-of-the-art methods for speaker change detection, giving average relative improvements in segment coverage and purity of 18.71% and 4.51%, respectively.
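The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general binary key idea from the earlier BK diarization work, not the authors' implementation. It assumes a binary key background model (KBM) of diagonal Gaussians is already available (in the paper it is estimated from contextual data; here it is simply passed in), that a window's binary key marks the KBM components most often found among each frame's top-N best-scoring components, that adjacent windows are compared with a Jaccard-style ratio of shared to total active bits, and that one change is reported at the midpoint of each run of low-similarity positions. All function names and parameter values (`win`, `step`, `top_n`, `top_m`, `threshold`) are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Frame = Sequence[float]
# Hypothetical KBM component: (mean vector, variance vector) of a diagonal Gaussian.
Gaussian = Tuple[Sequence[float], Sequence[float]]


def log_likelihood(x: Frame, g: Gaussian) -> float:
    """Log-density of frame x under a diagonal-covariance Gaussian."""
    mean, var = g
    return -0.5 * sum(
        math.log(2 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def binary_key(frames: Sequence[Frame], kbm: Sequence[Gaussian],
               top_n: int, top_m: int) -> frozenset:
    """Count how often each KBM component is among a frame's top_n best
    components, then keep the top_m most frequent ones as the active bits."""
    counts = [0] * len(kbm)
    for x in frames:
        scores = [log_likelihood(x, g) for g in kbm]
        best = sorted(range(len(kbm)), key=lambda k: -scores[k])[:top_n]
        for k in best:
            counts[k] += 1
    # Ties on count are broken by component index so the key is deterministic.
    ranked = sorted(range(len(kbm)), key=lambda k: (-counts[k], k))
    return frozenset(ranked[:top_m])


def jaccard(a: frozenset, b: frozenset) -> float:
    """Shared active bits divided by all active bits of two binary keys."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def detect_changes(frames: Sequence[Frame], kbm: Sequence[Gaussian],
                   win: int = 20, step: int = 5, top_n: int = 2,
                   top_m: int = 2, threshold: float = 0.5) -> List[int]:
    """Compare binary keys of the windows left and right of each candidate
    point; return one change per run of low-similarity candidates."""
    low = []
    for t in range(win, len(frames) - win + 1, step):
        left = binary_key(frames[t - win:t], kbm, top_n, top_m)
        right = binary_key(frames[t:t + win], kbm, top_n, top_m)
        if jaccard(left, right) < threshold:
            low.append(t)
    changes: List[int] = []
    run: List[int] = []
    for t in low:
        if run and t - run[-1] > step:  # a gap closes the current run
            changes.append((run[0] + run[-1]) // 2)
            run = []
        run.append(t)
    if run:
        changes.append((run[0] + run[-1]) // 2)
    return changes


if __name__ == "__main__":
    # Toy 1-D "features": speaker A around 0.2, speaker B around 5.2 from frame 100.
    kbm = [([m], [0.25]) for m in (-0.5, 0.0, 0.5, 4.5, 5.0, 5.5)]
    frames = ([[0.1 + 0.1 * (i % 3)] for i in range(100)]
              + [[5.1 + 0.1 * (i % 3)] for i in range(100)])
    print(detect_changes(frames, kbm))  # → [102]
```

On this toy sequence the sketch reports a single change at frame 102, within half a window of the true boundary at frame 100; the offset comes from the tie-breaking rule when a window is split evenly between the two speakers. Binary keys make each window comparison a cheap set operation, which is the kind of efficiency the abstract attributes to the BK approach.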

Keywords

Speaker identification and verification; Speaker change detection; Binary keys; Speaker diarization

Notes

Acknowledgements

This work was supported through funding from the Agence Nationale de la Recherche (French research funding agency) in the context of the ODESSA project (ANR-15-CE39-0010). The authors acknowledge Hervé Bredin’s help in the evaluation of speaker change detection.


Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. Department of Digital Security, EURECOM, Sophia Antipolis, France
