Real-Time Implementation of Speaker Diarization System on Raspberry PI3 Using TLBO Clustering Algorithm

Abstract

In recent years, extensive research has been carried out on implementations of speaker diarization systems. These systems require efficient clustering algorithms to perform well in real-time processing. Teaching–learning-based optimization (TLBO) is one such algorithm: it can be used to search for the optimum clustering in a reasonable time. In this paper, a speaker diarization (SD) system is implemented in real time on a Raspberry Pi 3 (RPi 3), using the TLBO technique as the classifier. The system was evaluated on a broadcast radio dataset (NDTV), and the experiments show that the technique achieves acceptable performance in terms of diarization error rate (DER = 21.90% and 35% in single- and cross-show diarization, respectively), accuracy (87.30%), and real-time factor (RTF = 2.40). We also tested the TLBO technique on a 2.4 GHz Intel Core i5 processor using the REPERE corpus, obtaining improved results in both execution time (xRT = 0.08 and 0.095) and DER (18.50% and 26.30%) for single- and cross-show speaker diarization, respectively.
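
The abstract reports DER and RTF without defining them. The sketch below assumes the commonly used definitions: RTF is processing time divided by audio duration, and DER is the sum of missed speech, false-alarm speech, and speaker-confusion time divided by total reference speech time. The function names and the sample numbers are hypothetical and are not taken from the paper, whose scoring setup (for example, any forgiveness collar) may differ.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Assumed definition: RTF = processing time / audio duration.

    Values below 1 mean the system runs faster than real time.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def diarization_error_rate(missed: float, false_alarm: float,
                           confusion: float, total_speech: float) -> float:
    """Assumed definition: DER = (missed + false alarm + confusion) / total speech.

    All arguments are durations in seconds; the result is a fraction.
    """
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (missed + false_alarm + confusion) / total_speech


# Hypothetical numbers chosen only to illustrate the arithmetic.
rtf = real_time_factor(processing_seconds=144.0, audio_seconds=60.0)
der = diarization_error_rate(missed=3.0, false_alarm=2.0,
                             confusion=5.0, total_speech=100.0)
print(f"RTF = {rtf:.2f}, DER = {der:.2%}")
```

Under these definitions, an RTF of 2.40 means that processing takes 2.4 times as long as the audio itself.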

Author information

Corresponding author

Correspondence to Karim Dabbabi.

Cite this article

Dabbabi, K., Hajji, S. & Cherif, A. Real-Time Implementation of Speaker Diarization System on Raspberry PI3 Using TLBO Clustering Algorithm. Circuits Syst Signal Process 39, 4094–4109 (2020). https://doi.org/10.1007/s00034-020-01357-2

Keywords

  • Real-time implementation
  • Single- and cross-show speaker diarization
  • TLBO algorithm
  • RTF
  • xRT
  • RPi3 board