Multimodal speaker diarization for meetings using volume-evaluated SRP-PHAT and video analysis

Cabañas-Molero, P.; Lucena, M.; Fuertes, J. M.; Vera-Candeas, P.; Ruiz-Reyes, N.

doi:10.1007/s11042-018-5944-2

Published: 11 April 2018

Multimedia Tools and Applications, volume 77, pages 27685–27707 (2018)

P. Cabañas-Molero¹ (ORCID: orcid.org/0000-0002-2452-6037), M. Lucena², J. M. Fuertes², P. Vera-Candeas¹ & N. Ruiz-Reyes¹

Abstract

Speaker diarization is traditionally defined as the problem of determining “who speaks when” given an audio or video stream. It is an important task in many meeting-room applications, including automatic transcription of conversations, camera steering, and content summarization. When the room is equipped with microphone arrays and cameras, speakers can be distinguished by their location, and the problem can be addressed through localization techniques. This article proposes a multimodal speaker diarization system for meeting environments built on a modified steered response power with phase transform (SRP-PHAT) function that is evaluated on space volumes rather than discrete points. In our system, this function is used in combination with a circular array, enabling audio-based localization through the selection of local maxima. Voicing detection is used to detect speech frames, whereas video analysis is introduced to aid the decision when users move or speak simultaneously. The approach is evaluated on the well-known AMI dataset, with approximately 100 hours of realistic meeting recordings, and shows an average diarization error rate of 21%–25%.
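The abstract does not give the modified SRP-PHAT function itself, so the following is only a minimal sketch of the general idea of scoring a space volume instead of a single point: for each microphone pair, the GCC-PHAT values are accumulated over the whole range of time lags that a cell can produce, and the best-scoring cell is selected. It should not be read as the authors' implementation. Everything specific here is an assumption made for illustration: the function names, the 2-D geometry, the 8-microphone circular array of radius 0.1 m, the 48 kHz sampling rate, the four square candidate cells, the white-noise source with integer circular delays, and the one-sample `guard` that absorbs the rounding of those delays.

```python
import itertools
import numpy as np

C = 343.0  # speed of sound in m/s (assumed value)


def gcc_phat(x1, x2):
    """Circular GCC-PHAT; entry k holds lag k, negative lags wrap around."""
    n = len(x1)
    cross = np.fft.rfft(x1) * np.conj(np.fft.rfft(x2))
    cross /= np.abs(cross) + 1e-12  # PHAT weighting: keep only the phase
    return np.fft.irfft(cross, n)


def pair_gccs(signals):
    """GCC-PHAT for every microphone pair (a, b) with a < b."""
    pairs = itertools.combinations(range(len(signals)), 2)
    return {(a, b): gcc_phat(signals[a], signals[b]) for a, b in pairs}


def tdoa_samples(points, mic_a, mic_b, fs):
    """Time difference of arrival, in samples, for each candidate point."""
    da = np.linalg.norm(points - mic_a, axis=1)
    db = np.linalg.norm(points - mic_b, axis=1)
    return fs * (da - db) / C


def srp_phat_volume(gccs, mics, center, half, fs, guard=1):
    """Sum GCC-PHAT over the lag range spanned by a square cell.

    The lag range is estimated from the cell centre and its four corners,
    which is a simplification chosen for this sketch.
    """
    offsets = np.array([[0, 0], [-1, -1], [-1, 1], [1, -1], [1, 1]]) * half
    probes = center + offsets
    score = 0.0
    for (a, b), r in gccs.items():
        tau = tdoa_samples(probes, mics[a], mics[b], fs)
        lo = int(np.floor(tau.min())) - guard
        hi = int(np.ceil(tau.max())) + guard
        lags = np.arange(lo, hi + 1) % len(r)  # wrap negative lags
        score += r[lags].sum()
    return score


def demo():
    """Toy scene: one noise source located at the centre of cell 1."""
    fs = 48000
    rng = np.random.default_rng(0)
    angles = 2 * np.pi * np.arange(8) / 8
    mics = 0.1 * np.column_stack([np.cos(angles), np.sin(angles)])
    centers = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
    source = centers[1]
    s = rng.standard_normal(4096)
    dist = np.linalg.norm(source - mics, axis=1)
    delays = np.round(fs * dist / C).astype(int)
    signals = np.stack([np.roll(s, d) for d in delays])  # circular delays
    gccs = pair_gccs(signals)
    return [srp_phat_volume(gccs, mics, c, 0.1, fs) for c in centers]


if __name__ == "__main__":
    scores = demo()
    print("cell scores:", np.round(scores, 3))
    print("selected cell:", int(np.argmax(scores)))
```

In this toy setting, each of the 28 microphone pairs contributes its full correlation peak only to the cell that contains the source, so that cell obtains the highest score. With real recordings the peaks are smeared by reverberation and noise, and a practical system would pick several local maxima over a much finer partition of the room.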

Acknowledgements

This work was supported by the Andalusian Economy and Knowledge Council under project 2010-TIC6762, and the Spanish Ministry of Economy and Competitiveness under project TEC2015-67387-C4-2-R.

Author information

Authors and Affiliations

Department of Telecommunication Engineering, University of Jaén, Linares, Jaén, Spain
P. Cabañas-Molero, P. Vera-Candeas & N. Ruiz-Reyes
Department of Computer Science, University of Jaén, Jaén, Spain
M. Lucena & J. M. Fuertes

Corresponding author

Correspondence to P. Cabañas-Molero.

About this article

Cite this article

Cabañas-Molero, P., Lucena, M., Fuertes, J.M. et al. Multimodal speaker diarization for meetings using volume-evaluated SRP-PHAT and video analysis. Multimed Tools Appl 77, 27685–27707 (2018). https://doi.org/10.1007/s11042-018-5944-2

Received: 24 July 2017
Revised: 26 January 2018
Accepted: 26 March 2018
Published: 11 April 2018
Issue Date: October 2018
DOI: https://doi.org/10.1007/s11042-018-5944-2
