Sample Dropout for Audio Scene Classification Using Multi-scale Dense Connected Convolutional Neural Network

Feng, Dawei; Xu, Kele; Mi, Haibo; Liao, Feifan; Zhou, Yan

doi:10.1007/978-3-319-97289-3_9

Dawei Feng¹⁵,
Kele Xu^15,16,
Haibo Mi¹⁵,
Feifan Liao¹⁶ &
…
Yan Zhou¹⁶

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 11016))

Included in the following conference series:

Pacific Rim Knowledge Acquisition Workshop

1601 Accesses
3 Citations

Abstract

Acoustic scene classification is an intricate problem for a machine. As an emerging field of research, deep Convolutional Neural Networks (CNN) achieve convincing results. In this paper, we explore the use of multi-scale Dense connected convolutional neural network (DenseNet) for the classification task, with the goal to improve the classification performance as multi-scale features can be extracted from the time-frequency representation of the audio signal. On the other hand, most of previous CNN-based audio scene classification approaches aim to improve the classification accuracy, by employing different regularization techniques, such as the dropout of hidden units and data augmentation, to reduce overfitting. It is widely known that outliers in the training set have a high negative influence on the trained model, and culling the outliers may improve the classification performance, while it is often under-explored in previous studies. In this paper, inspired by the silence removal in the speech signal processing, a novel sample dropout approach is proposed, which aims to remove outliers in the training dataset. Using the DCASE 2017 audio scene classification datasets, the experimental results demonstrates the proposed multi-scale DenseNet providing a superior performance than the traditional single-scale DenseNet, while the sample dropout method can further improve the classification robustness of multi-scale DenseNet.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 39.99; Price excludes VAT (USA)

Softcover Book: USD 54.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 60(2), 91–110 (2004)
Article Google Scholar
Bay, H., Tuytelaars, T., Van Gool, L.: SURF: speeded up robust features. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3951, pp. 404–417. Springer, Heidelberg (2006). https://doi.org/10.1007/11744023_32
Chapter Google Scholar
Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105 (2012)
Google Scholar
Dixit, M., Chen, S., Gao, D., Rasiwasia, N., Vasconcelos, N.: Scene classification with semantic fisher vectors. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2974–2983. IEEE (2015)
Google Scholar
Stowell, D., Giannoulis, D., Benetos, E., Lagrange, M., Plumbley, M.D.: Detection and classification of acoustic scenes and events. IEEE Trans. Multimedia 17(10), 1733–1746 (2015)
Article Google Scholar
Eghbal-Zadeh, H., Lehner, B., Dorfer, M., Widmer, G.: CP-JKU submissions for DCASE-2016: a hybrid approach using binaural i-vectors and deep convolutional neural networks. In: IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE) (2016)
Google Scholar
Mesaros, A., Heittola, T., Virtanen, T.: TUT database for acoustic scene classification and sound event detection. In: 2016 24th European Signal Processing Conference (EUSIPCO), pp. 1128–1132. IEEE (2016)
Google Scholar
Mesaros, A., et al.: DCASE 2017 challenge setup: tasks, datasets and baseline system. In: DCASE 2017-Workshop on Detection and Classification of Acoustic Scenes and Events (2017)
Google Scholar
Hinton, G., et al.: Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Sig. Process. Mag. 29(6), 82–97 (2012)
Article Google Scholar
Manning, C., Surdeanu, M., Bauer, J., Finkel, J., Bethard, S., McClosky, D.: The stanford CoreNLP natural language processing toolkit. In: Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 55–60 (2014)
Google Scholar
Geiger, J.T., Schuller, B., Rigoll, G.: Large-scale audio feature extraction and SVM for acoustic scene classification. In: 2013 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–4. IEEE (2013)
Google Scholar
Fonseca, E., Gong, R., Bogdanov, D., Slizovskaia, O., Gómez Gutiérrez, E., Serra, X.: Acoustic scene classification by ensembling gradient boosting machine and convolutional neural networks. In: Virtanen, T., et al. (eds.) Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017), 16 November 2017, Munich, Germany. Tampere (Finland): Tampere University of Technology, pp. 37–41. Tampere University of Technology (2017)
Google Scholar
Aytar, Y., Vondrick, C., Torralba, A.: Soundnet: learning sound representations from unlabeled video. In: Advances in Neural Information Processing Systems, pp. 892–900 (2016)
Google Scholar
Marchi, E., Tonelli, D., Xu, X., Ringeval, F., Deng, J., Schuller, B.: The up system for the 2016 DCASE challenge using deep recurrent neural network and multiscale kernel subspace learning. In: Detection and Classification of Acoustic Scenes and Events (2016)
Google Scholar
Bae, S.H., Choi, I., Kim, N.S.: Acoustic scene classification using parallel combination of LSTM and CNN. In: Proceedings of the Detection and Classification of Acoustic Scenes and Events 2016 Workshop (DCASE2016), pp. 11–15 (2016)
Google Scholar
Phan, H., Koch, P., Hertel, L., Maass, M., Mazur, R., Mertins, A.: CNN-LTE: a class of 1-x pooling convolutional neural networks on label tree embeddings for audio scene classification. In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 136–140. IEEE (2017)
Google Scholar
Xu, K., et al.: Mixup-based acoustic scene classification using multi-channel convolutional neural network. arXiv preprint arXiv:1805.07319 (2018)
Rakotomamonjy, A., Gasso, G.: Histogram of gradients of time-frequency representations for audio scene classification. IEEE/ACM Trans. Audio Speech Lang. Process. (TASLP) 23(1), 142–153 (2015)
Google Scholar
Piczak, K.J.: ESC: dataset for environmental sound classification. In: Proceedings of the 23rd ACM International Conference on Multimedia, pp. 1015–1018. ACM (2015)
Google Scholar
LeCun, Y., et al.: Learning algorithms for classification: a comparison on handwritten digit recognition. Neural Netw. Stat. Mech. Perspect. 261, 276 (1995)
Google Scholar
Li, B., Xu, K., Cui, X., Wang, Y., Ai, X., Wang, Y.: Multi-scale DenseNet-based electricity theft detection. arXiv preprint arXiv:1805.09591 (2018)
Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
Google Scholar
Huang, G., Liu, Z., Weinberger, K.Q., van der Maaten, L.: Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, no. 2, p. 3 (2017)
Google Scholar
Huang, G., Chen, D., Li, T., Wu, F., van der Maaten, L., Weinberger, K.Q.: Multi-scale dense convolutional networks for efficient prediction. arXiv preprint arXiv:1703.09844 (2017)
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15(1), 1929–1958 (2014)
MathSciNet MATH Google Scholar

Download references

Acknowledgement

This study was supported by the Strategic Priority Research Programme (17-ZLXD-XX-02-06-02-08).

Author information

Authors and Affiliations

Science and Technology on Parallel and Distributed Laboratory, School of Computer, National University of Defense Technology, Changsha, 410073, China
Dawei Feng, Kele Xu & Haibo Mi
School of Information and Communication, National University of Defense Technology, Wuhan, 430010, China
Kele Xu, Feifan Liao & Yan Zhou

Authors

Dawei Feng
View author publications
You can also search for this author in PubMed Google Scholar
Kele Xu
View author publications
You can also search for this author in PubMed Google Scholar
Haibo Mi
View author publications
You can also search for this author in PubMed Google Scholar
Feifan Liao
View author publications
You can also search for this author in PubMed Google Scholar
Yan Zhou
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Kele Xu .

Editor information

Editors and Affiliations

University of Tsukuba , Tokyo, Japan
Kenichi Yoshida
Shih Chien University, Taipei City, Taiwan
Maria Lee

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Feng, D., Xu, K., Mi, H., Liao, F., Zhou, Y. (2018). Sample Dropout for Audio Scene Classification Using Multi-scale Dense Connected Convolutional Neural Network. In: Yoshida, K., Lee, M. (eds) Knowledge Management and Acquisition for Intelligent Systems. PKAW 2018. Lecture Notes in Computer Science(), vol 11016. Springer, Cham. https://doi.org/10.1007/978-3-319-97289-3_9

Download citation

DOI: https://doi.org/10.1007/978-3-319-97289-3_9
Published: 27 July 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-97288-6
Online ISBN: 978-3-319-97289-3
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics