Abstract
Deep neural networks, including convolutional neural networks (CNNs), have recently been widely applied to acoustic scene classification (ASC). Motivated by the observation that some simplified CNNs outperform deep CNNs such as the Visual Geometry Group Net (VGG-Net), we simplify a VGG-Net style architecture into a shallow CNN with improved performance. Max pooling and batch normalization are also applied to improve accuracy. In a series of controlled tests on the Detection and Classification of Acoustic Scenes and Events (DCASE) 2016 data sets, our shallow CNN achieves a 6.7% improvement and reduces time complexity to 5% of that of the VGG-Net style CNN.
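To make the time-complexity comparison concrete, the following minimal sketch counts multiply-accumulate operations in the convolutional layers of two networks, using the common estimate of input channels × kernel area × output channels × output map area, summed over layers. Whether the article measures time complexity in exactly this way is an assumption, and the layer configurations below (`deep`, `shallow`) are hypothetical values chosen only for illustration; they are not the architectures evaluated in the article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConvLayer:
    """Shape summary of one convolutional layer (hypothetical values only)."""
    in_channels: int   # number of input feature maps
    out_channels: int  # number of filters
    kernel_h: int      # filter height
    kernel_w: int      # filter width
    out_h: int         # output feature map height
    out_w: int         # output feature map width


def conv_time_complexity(layers):
    """Sum the multiply-accumulate count over all convolutional layers."""
    return sum(
        layer.in_channels * layer.kernel_h * layer.kernel_w
        * layer.out_channels * layer.out_h * layer.out_w
        for layer in layers
    )


# Hypothetical VGG-style stack: pairs of 3x3 convolutions, with the feature
# map halved by pooling between the two pairs. Not taken from the article.
deep = [
    ConvLayer(1, 32, 3, 3, 64, 64),
    ConvLayer(32, 32, 3, 3, 64, 64),
    ConvLayer(32, 64, 3, 3, 32, 32),
    ConvLayer(64, 64, 3, 3, 32, 32),
]

# Hypothetical shallow network: a single convolution with a larger kernel.
shallow = [ConvLayer(1, 32, 5, 5, 64, 64)]

deep_cost = conv_time_complexity(deep)
shallow_cost = conv_time_complexity(shallow)
ratio = shallow_cost / deep_cost

print(f"deep: {deep_cost} MACs, shallow: {shallow_cost} MACs")
print(f"shallow cost as a fraction of deep cost: {ratio:.3f}")
```

In this estimate, most of the cost in the deep stack comes from layers whose input already has many channels, so removing those layers lowers the count far more than the drop in layer count alone would suggest.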
Additional information
Foundation item: Supported by the National Natural Science Foundation of China (61102127, 61231015), the National High Technology Research and Development Program of China (863 Program, 2015AA016306), the National Key Research and Development Program (2016YFB0502204), the Innovation Fund of Shanghai Aerospace Science and Technology (SAST, 2015014), the Key Technology R&D Program of Hubei Province (2014BAA153), and SKLSE-2015-A-06.
About this article
Cite this article
Lu, L., Yang, Y., Jiang, Y. et al. Shallow Convolutional Neural Networks for Acoustic Scene Classification. Wuhan Univ. J. Nat. Sci. 23, 178–184 (2018). https://doi.org/10.1007/s11859-018-1308-z
DOI: https://doi.org/10.1007/s11859-018-1308-z