Multi-view representation for sound event recognition


The sound event recognition (SER) task is gaining importance in emerging applications such as machine audition, audio surveillance, and environmental audio scene recognition. Recognizing sound events under noisy conditions in real-time surveillance applications is a difficult task. In this paper, we focus on learning patterns from multiple forms (views) of the given sound events. We propose two variants of a Multi-View Representation (MVR)-based approach for the SER task. The first variant combines auditory image-based features with cepstral features of the sound signal. The second variant combines statistical features extracted from the auditory images with cepstral features of the sound signal. In addition to these variants, Constant Q-transform and Variable Q-transform image-based features are explored as other potentially effective forms of multi-view representation. A discriminative model-based classifier is then used to recognize these representations as environmental sound events. The performance of the proposed MVR approaches is evaluated on three benchmark sound event datasets, namely ESC-50, DCASE2016 Task 2, and DCASE2018 Task 2. The recognition accuracy of the proposed MVR approach is significantly better than that of other approaches proposed in the recent literature.
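As a rough illustration of the second variant described above, the following sketch fuses moment-based statistics of an auditory image with a cepstral feature vector. It is not the authors' implementation: the function names, the choice of four moments (mean, variance, skewness, kurtosis), and the per-view L2 normalization are all assumptions made for this example, and the inputs are toy values rather than real spectrograms or MFCCs.

```python
import math


def image_moments(image):
    """Return [mean, variance, skewness, kurtosis] over all pixels of a 2-D image.

    Assumption: these four moments stand in for the "statistical features";
    the abstract does not specify which statistics are used.
    """
    values = [v for row in image for v in row]
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    std = math.sqrt(var)
    if std == 0.0:
        # A constant image has no defined skewness or kurtosis; report zeros.
        return [mean, 0.0, 0.0, 0.0]
    skew = sum(((v - mean) / std) ** 3 for v in values) / n
    kurt = sum(((v - mean) / std) ** 4 for v in values) / n
    return [mean, var, skew, kurt]


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (left unchanged if all zeros)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def multi_view_features(image, cepstral):
    """Concatenate the normalized image-statistics view and the cepstral view.

    Assumption: each view is L2-normalized before concatenation so that
    neither view dominates the fused vector by scale alone.
    """
    return l2_normalize(image_moments(image)) + l2_normalize(cepstral)


# Toy inputs: a 2x2 "spectrogram" and a 2-coefficient "cepstral" vector.
toy_image = [[0.0, 1.0], [1.0, 2.0]]
toy_cepstral = [3.0, 4.0]
fused = multi_view_features(toy_image, toy_cepstral)
print(len(fused))
```

In a full pipeline, the fused vector for each recording would be passed to a discriminative classifier such as a support vector machine; that step is omitted here.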


Availability of data and material

The ESC-50, DCASE2016 Task 2, and DCASE2018 Task 2 datasets used in our studies are publicly available.

Code availability

The code is available from the corresponding author upon request.



Acknowledgements

The authors acknowledge the financial support provided under Project No. DST/CSRI/2017/131(G) of the ‘Cognitive Science Research Initiative (CSRI)’ by the Department of Science and Technology, Government of India, to carry out this work.

Author information



Corresponding author

Correspondence to S. Chandrakala.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Chandrakala, S., M, V., N, S. et al. Multi-view representation for sound event recognition. SIViP (2021).


Keywords

  • Sound event recognition (SER)
  • Spectrograms
  • Mel-frequency cepstral coefficients (MFCCs)
  • Histogram of oriented gradients (HOG)
  • Moment-based features
  • Constant Q-transform (CQT)
  • Variable Q-transform (VQT)
  • Support vector machine (SVM)