Acoustic event diarization in TV/movie audios using deep embedding and integer linear programming

  • Yanxiong Li
  • Yuhan Zhang
  • Xianku Li
  • Mingle Liu
  • Wucheng Wang
  • Jichen Yang


In this study, we propose a method for acoustic event diarization based on deep embedding features and integer linear programming clustering. The deep embedding, learned by a deep auto-encoder network, represents the properties of different classes of acoustic events; integer linear programming is then used to merge audio segments belonging to the same class of acoustic events. Four kinds of TV/movie audio (21.5 h in total) are used as experimental data: Sport, Situation comedy, Award ceremony, and Action movie. We compare the deep embedding with state-of-the-art features, and compare the integer linear programming clustering with clustering algorithms adopted in previous works. Finally, the proposed method is compared with both supervised and unsupervised methods on the four kinds of TV/movie audio. The results show that, in terms of acoustic event error, the proposed method outperforms other unsupervised methods based on the agglomerative information bottleneck, the Bayesian information criterion, and spectral clustering, and is only slightly inferior to the supervised method based on a deep neural network.
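As a rough illustration of the clustering step, the sketch below merges segment embeddings by solving a small integer program exactly. The abstract does not give the formulation, so the objective here is an assumption borrowed from ILP clustering as commonly used in speaker diarization: minimize the number of cluster centres plus a normalized dispersion term, with every segment assigned to exactly one centre no farther than a threshold. The function name `ilp_cluster`, the threshold `delta`, the Euclidean distance, and the toy 2-D points are all hypothetical. For simplicity the integer program is solved by exhaustive enumeration rather than by an ILP solver, which is only practical for a handful of segments.

```python
from itertools import combinations
from math import dist


def ilp_cluster(points, delta):
    """Exactly solve a small clustering integer program by enumeration.

    Assumed objective (not taken from the paper):
        (#centres) + (1 / (N * delta)) * sum_j d(centre(j), j)
    subject to each point j being assigned to exactly one centre
    with d(centre(j), j) <= delta. Centres are chosen among the points.
    """
    n = len(points)
    d = [[dist(p, q) for q in points] for p in points]
    best_cost, best_assign = float("inf"), None
    # Enumerate every possible set of centres (feasible only for small n).
    for k in range(1, n + 1):
        for centres in combinations(range(n), k):
            assign, spread, feasible = [], 0.0, True
            for j in range(n):
                # For a fixed centre set, the nearest centre is optimal.
                c = min(centres, key=lambda i: d[i][j])
                if d[c][j] > delta:
                    feasible = False
                    break
                assign.append(c)
                spread += d[c][j]
            if not feasible:
                continue
            cost = k + spread / (n * delta)
            if cost < best_cost:
                best_cost, best_assign = cost, assign
    return best_assign, best_cost


# Toy "embeddings": two well-separated groups of three segments each.
points = [(0, 0), (0.1, 0), (0, 0.1), (5, 5), (5.1, 5), (5, 5.1)]
labels, cost = ilp_cluster(points, delta=1.0)
print(labels, round(cost, 4))
```

Each entry of `labels` is the index of the centre segment that the corresponding segment is merged with, so segments sharing a label form one acoustic event class. A real system would replace the enumeration with a proper ILP solver and the toy points with learned embeddings.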


Keywords: Deep embedding; Integer linear programming; Acoustic event detection; Audio content analysis



This work was supported by the National Natural Science Foundation of China (61771200, 6191101285, and 6191101306), the Project of International Science and Technology Cooperation of Guangdong Province (2019A050509001), the Open Project Program of the National Laboratory of Pattern Recognition (NLPR) (201800004), the Fundamental Research Funds for the Central Universities, South China University of Technology (Research on key techniques for analyzing complex audio scene contents, 2019), and the Project of Science and Technology of Guangzhou (201704040062).



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. School of Electronic and Information Engineering, South China University of Technology, Guangzhou, China
