Advertisement

Recurrent Neural Network Based Speaker Change Detection from Text Transcription Applied in Telephone Speaker Diarization System

  • Zbyněk Zajíc
  • Daniel Soutner
  • Marek Hrúz
  • Luděk Müller
  • Vlasta Radová
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11107)

Abstract

In this paper, we propose a speaker change detection system based on lexical information from the transcribed speech. For this purpose, we applied a recurrent neural network to decide if there is an end of an utterance at the end of a spoken word. Our motivation is to use the transcription of the conversation as an additional feature for a speaker diarization system to refine the segmentation step to achieve better accuracy of the whole diarization system. We compare the proposed speaker change detection system based on transcription (text) with our previous system based on information from spectrogram (audio) and combine these two modalities to improve the results of diarization. We cut the conversation into segments according to the detected changes and represent them by an i-vector. We conducted experiments on the English part of the CallHome corpus. The results indicate improvement in speaker change detection (by 0.5% relatively) and also in speaker diarization (by 1% relatively) when both modalities are used.

Keywords

Recurrent neural network Convolutional Neural Network Speaker change detection Speaker diarization I-vector 

References

  1. 1.
    Rouvier, M., Dupuy, G., Gay, P., Khoury, E., Merlin, T., Meignier, S.: An open-source state-of-the-art toolbox for broadcast news diarization. In: Interspeech, Lyon, pp. 1477–1481 (2013)Google Scholar
  2. 2.
    Sell, G., Garcia-Romero, D.: Speaker diarization with PLDA I-vector scoring and unsupervised calibration. In: IEEE Spoken Language Technology Workshop, South Lake Tahoe, pp. 413–417 (2014)Google Scholar
  3. 3.
    Hrúz, M., Zajíc, Z.: Convolutional neural network for speaker change detection in telephone speaker diarization system. In: ICASSP, New Orleans, pp. 4945–4949 (2017)Google Scholar
  4. 4.
    Zajíc, Z., Hrúz, M., Müller, L.: Speaker diarization using convolutional neural network for statistics accumulation refinement. In: Interpeech, Stockholm, pp. 3562–3566 (2017)Google Scholar
  5. 5.
    Dehak, N., Kenny, P.J., Dehak, R., Dumouchel, P., Ouellet, P.: Front-end factor analysis for speaker verification. IEEE Trans. Audio Speech Lang. Process. 19(4), 788–798 (2011)CrossRefGoogle Scholar
  6. 6.
    Shum, S., Dehak, N., Chuangsuwanich, E., Reynolds, D., Glass, J.: Exploiting intra-conversation variability for speaker diarization. In: Interspeech, Florence, pp. 945–948 (2011)Google Scholar
  7. 7.
    Valente, F., Vijayasenan, D., Motlicek, P.: Speaker diarization of meetings based on speaker role n-gram models. In: ICASSP, pp. 4416–4419. IEEE, Prague (2011)Google Scholar
  8. 8.
    Tranter, S.E., Yu, K., Evermann, G., Woodland, P.C.: Generating and evaluating segmentations for automatic speech recognition of conversational telephone speech. In: ICASSP, pp. 753–756. IEEE, Montreal (2004)Google Scholar
  9. 9.
    Kunešová, M., Zajíc, Z., Radová, V.: Experiments with segmentation in an online speaker diarization system. In: Ekštein, K., Matoušek, V. (eds.) TSD 2017. LNCS (LNAI), vol. 10415, pp. 429–437. Springer, Cham (2017).  https://doi.org/10.1007/978-3-319-64206-2_48CrossRefGoogle Scholar
  10. 10.
    Hrúz, M., Kunešová, M.: Convolutional neural network in the task of speaker change detection. In: Ronzhin, A., Potapova, R., Németh, G. (eds.) SPECOM 2016. LNCS (LNAI), vol. 9811, pp. 191–198. Springer, Cham (2016).  https://doi.org/10.1007/978-3-319-43958-7_22CrossRefGoogle Scholar
  11. 11.
    Soutner, D., Müller, L.: Application of LSTM neural networks in language modelling. In: Habernal, I., Matoušek, V. (eds.) TSD 2013. LNCS (LNAI), vol. 8082, pp. 105–112. Springer, Heidelberg (2013).  https://doi.org/10.1007/978-3-642-40585-3_14CrossRefGoogle Scholar
  12. 12.
    Hochreiter, S., Urgen Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)CrossRefGoogle Scholar
  13. 13.
    Zajíc, Z., Machlica, L., Müller, L.: Robust adaptation techniques dealing with small amount of data. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds.) TSD 2012. LNCS (LNAI), vol. 7499, pp. 480–487. Springer, Heidelberg (2012).  https://doi.org/10.1007/978-3-642-32790-2_58CrossRefGoogle Scholar
  14. 14.
    Kenny, P., Dumouchel, P.: Experiments in speaker verification using factor analysis likelihood ratios. In: Odyssey, Toledo, pp. 219–226 (2004)Google Scholar
  15. 15.
    Canavan, A., Graff, D., Zipperlen, G.: CALLHOME American English speech, LDC97S42. In: LDC Catalog. Linguistic Data Consortium, Philadelphia (1997)Google Scholar
  16. 16.
    Godfrey, J.J., Holliman, E.: Switchboard-1 release 2. In: LDC Catalog. Linguistics Data Consortium, Philadelphia (1997)Google Scholar
  17. 17.
    Daniel, P., et al.: Modelos animales de dolor neuropático. In: Workshop on Automatic Speech Recognition and Understanding, IEEE Catalog No.: CFP11SRW-USB (2011)Google Scholar
  18. 18.
    Harris, M., Aubert, X., Haeb-Umbach, R., Beyerlein, P.: A study of broadcast news audio stream segmentation and segment clustering. In: EUROSPEECH, Budapest, pp. 1027–1030 (1999)Google Scholar
  19. 19.
    Bredin, H.: TristouNet: triplet loss for speaker turn embedding. In: ICASSP, New Orleans, pp. 5430–5434 (2017)Google Scholar
  20. 20.
    Bredin, H.: pyannote.metrics: a toolkit for reproducible evaluation, diagnostic, and error analysis of speaker diarization systems. In: Interspeech, Stockholm, pp. 3587–3591 (2017)Google Scholar
  21. 21.
    Sell, G., Garcia-Romero, D., Mccree, A.: Speaker diarization with I-vectors from DNN senone posteriors. In: Interspeech, Dresden, pp. 3096–3099 (2015)Google Scholar
  22. 22.
    Fiscus, J.G., Radde, N., Garofolo, J.S., Le, A., Ajot, J., Laprun, C.: The rich transcription 2006 spring meeting recognition evaluation. Mach. Learn. Multimodal Interact. 4299, 309–322 (2006)CrossRefGoogle Scholar
  23. 23.
    India, M., Fonollosa, J., Hernando, J.: LSTM neural network-based speaker segmentation using acoustic and language modelling. In: Interspeech, Stockholm, pp. 2834–2838 (2017)Google Scholar

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. 1.Faculty of Applied Sciences, NTIS - New Technologies for the Information Society and Department of CyberneticsUniversity of West BohemiaPlzeňCzech Republic

Personalised recommendations