Improved Convolutional Neural Networks for Acoustic Event Classification

  • Guichen Tang
  • Ruiyu Liang
  • Yue Xie
  • Yongqiang Bao
  • Shijia Wang

Abstract

To further exploit the potential of convolutional neural networks in acoustic event classification, an improved convolutional neural network called AecNet (Acoustic event classification net) is proposed. Because traditional convolutional neural networks lack a representation of low-level features, the proposed model includes more feature layers to preserve both the low-level and high-level information of the input. To extract features at different levels effectively, 1 × 1 convolutions are adopted to compress the feature maps of all convolutional layers except the top one. The condensed features are then concatenated into a single layer that contains features from all levels. In this way, feature learning is enhanced and a multi-scale convolutional neural network is constructed. To better capture the dynamic features of a sound clip, multi-channel spectrogram features are adopted, comprising the mel-spectrogram, its first-order deltas along frequency and time, and its second-order deltas along frequency and time. In the experiments, the number of FFT points, the number of mel bands and the type of mel-spectrogram deltas are discussed in detail, and reasonable choices are suggested for practice. Experimental results on the ESC-10, ESC-50 and DCASE datasets show that the proposed method improves recognition accuracy to varying degrees compared with some state-of-the-art results on standard benchmarks.
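The two ideas in the abstract, the five-channel input and the fusion of compressed multi-level features, can be sketched roughly as follows. This is only an illustration under stated assumptions and not the authors' implementation. The deltas are approximated with simple finite differences via `np.gradient`, the compressed maps are flattened before concatenation, and all function names, array shapes and channel counts are hypothetical. The feature maps are random stand-ins rather than outputs of a trained network.

```python
import numpy as np


def multichannel_features(mel):
    """Stack a mel-spectrogram of shape (n_mels, n_frames) with its deltas.

    Assumption: deltas are plain finite differences, which may differ
    from the delta computation used in the paper.
    """
    d_time = np.gradient(mel, axis=1)      # first-order delta along time
    d_freq = np.gradient(mel, axis=0)      # first-order delta along frequency
    dd_time = np.gradient(d_time, axis=1)  # second-order delta along time
    dd_freq = np.gradient(d_freq, axis=0)  # second-order delta along frequency
    return np.stack([mel, d_time, d_freq, dd_time, dd_freq])


def conv1x1(fmap, weight):
    """1 x 1 convolution: a linear mix of channels at every position.

    fmap has shape (C_in, H, W); weight has shape (C_out, C_in).
    """
    return np.einsum("oc,chw->ohw", weight, fmap)


def concat_levels(lower_maps, weights, top_map):
    """Compress every layer except the top one, flatten, and concatenate."""
    parts = [conv1x1(f, w).ravel() for f, w in zip(lower_maps, weights)]
    parts.append(top_map.ravel())  # the top layer is used uncompressed
    return np.concatenate(parts)


rng = np.random.default_rng(0)

# Hypothetical input: 40 mel bands, 100 frames.
mel = rng.random((40, 100))
x = multichannel_features(mel)  # shape (5, 40, 100)

# Hypothetical feature maps from three convolutional stages.
low = rng.random((32, 20, 50))
mid = rng.random((64, 10, 25))
top = rng.random((128, 5, 12))

# Hypothetical 1 x 1 kernels reducing each lower stage to 4 channels.
w_low = rng.random((4, 32))
w_mid = rng.random((4, 64))

fused = concat_levels([low, mid], [w_low, w_mid], top)
```

Compressing the lower layers before concatenation keeps the fused vector from being dominated by the large, high-resolution early feature maps, which is one plausible reason for the 1 × 1 step described above.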

Keywords

Convolutional neural networks · Acoustic event classification · Mel-spectrogram · Deep learning

Notes

Acknowledgments

This work was supported by the National Natural Science Foundation of China under Grant No. 61871213, the Six Talent Peaks Project in Jiangsu Province under Grant No. 2016-DZXX-023, the China Postdoctoral Science Foundation under Grant No. 2016M601696, the Qing Lan Project of Jiangsu Province, and the Jiangsu Planned Projects for Postdoctoral Research Funds under Grant No. 1601011B.

Compliance with ethical standards

Conflicts of Interest

The authors declare no conflict of interest.

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. School of Communication Engineering, Nanjing Institute of Technology, Nanjing, China
  2. School of Information Science and Engineering, Southeast University, Nanjing, China
