Abstract
The ConditionaL Neural Network (CLNN) exploits the temporal sequencing of a sound signal represented in a spectrogram, and its variant, the Masked ConditionaL Neural Network (MCLNN), induces the network to learn in frequency bands by embedding filterbank-like sparseness over the network's links using a binary mask. The masking also automates the concurrent exploration of different feature combinations, analogous to hand-crafting the optimum feature combination for a recognition task. We evaluated MCLNN performance on the Urbansound8k dataset of environmental sounds. We also present YorNoise, a collection of manually recorded rail and road traffic sounds, to investigate confusion rates among machine-generated sounds sharing low-frequency components. On Urbansound8k, the MCLNN achieved competitive results without augmentation, using 12% of the trainable parameters of an equivalent model based on state-of-the-art Convolutional Neural Networks. We further extended Urbansound8k with YorNoise, where experiments showed that common tonal properties affect classification performance.
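To make the masking concrete, below is a minimal NumPy sketch of a filterbank-like binary mask applied element-wise to a dense weight matrix. The band_mask helper, its bandwidth/overlap parameters, and the wrap-around band placement are illustrative assumptions drawn from the description above, not the authors' released implementation; the full MCLNN additionally conditions each frame on its temporal neighbours, which this sketch omits.

```python
import numpy as np

def band_mask(n_in, n_hidden, bandwidth, overlap):
    """Build a binary mask with filterbank-like sparseness:
    each hidden unit connects to one contiguous band of
    `bandwidth` input bins; successive bands shift by
    (bandwidth - overlap) bins and wrap around the input axis.
    (Illustrative placement rule, assumed for this sketch.)"""
    mask = np.zeros((n_in, n_hidden))
    shift = bandwidth - overlap
    for j in range(n_hidden):
        start = (j * shift) % n_in
        for k in range(bandwidth):
            mask[(start + k) % n_in, j] = 1.0
    return mask

rng = np.random.default_rng(0)
frames = rng.standard_normal((8, 60))      # batch of 8 spectrogram frames, 60 mel bins
W = rng.standard_normal((60, 100)) * 0.01  # dense weights: 60 inputs -> 100 hidden units
b = np.zeros(100)
M = band_mask(60, 100, bandwidth=20, overlap=5)

# Element-wise product keeps only in-band links; out-of-band
# weights are silenced during both training and inference.
hidden = np.tanh(frames @ (W * M) + b)
print(hidden.shape)                        # (8, 100)
```

Because the mask is fixed rather than learned, applying W * M at every forward pass zeroes the gradients of out-of-band weights as well, so the effective trainable parameter count drops in proportion to the mask's sparsity.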
This work is funded by the European Union’s Seventh Framework Programme for research, technological development and demonstration under grant agreement no. 608014 (CAPACITIE).
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Medhat, F., Chesmore, D., Robinson, J. (2017). Masked Conditional Neural Networks for Environmental Sound Classification. In: Bramer, M., Petridis, M. (eds) Artificial Intelligence XXXIV. SGAI 2017. Lecture Notes in Computer Science, vol 10630. Springer, Cham. https://doi.org/10.1007/978-3-319-71078-5_2
DOI: https://doi.org/10.1007/978-3-319-71078-5_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-71077-8
Online ISBN: 978-3-319-71078-5
eBook Packages: Computer Science (R0)