Abstract
Audio source separation from a monaural mixture, commonly termed monaural source separation, is an important and challenging problem in many applications. In this paper, a monaural source separation method using a convolutional neural network in the time domain is proposed. The proposed network, whose input and output are both time-domain signals, consists of three convolutional layers, each followed by a max-pooling layer, and two fully-connected layers. Two key ideas underlie the time-domain convolutional network: first, the convolutional layers learn features automatically instead of relying on hand-crafted features such as spectra; second, the phase is recovered automatically because both the input and output lie in the time domain. The proposed approach is evaluated on the TSP speech corpus for monaural source separation and achieves around a 4.31–7.77 dB SIR gain over deep neural network, recurrent neural network, and nonnegative matrix factorization baselines, while maintaining better SDR and SAR.
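The abstract's topology (three 1-D convolution + max-pooling stages followed by two fully-connected layers, mapping a waveform frame to a waveform frame) can be sketched as a forward pass in plain NumPy. The filter counts, kernel sizes, pooling widths, and the 256-sample frame length below are illustrative assumptions, not values taken from the paper:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def conv1d(x, w):
    """Valid 1-D convolution: x is (channels_in, length), w is (channels_out, channels_in, k)."""
    c_out, c_in, k = w.shape
    out_len = x.shape[1] - k + 1
    out = np.zeros((c_out, out_len))
    for o in range(c_out):
        for t in range(out_len):
            out[o, t] = np.sum(w[o] * x[:, t:t + k])
    return out

def maxpool1d(x, p):
    """Non-overlapping max-pooling of width p along the time axis."""
    c, n = x.shape
    n_out = n // p
    return x[:, :n_out * p].reshape(c, n_out, p).max(axis=2)

rng = np.random.default_rng(0)
frame = rng.standard_normal((1, 256))  # one time-domain input frame (assumed length)

# Three conv + max-pool stages; channel widths here are assumptions.
x = frame
for c_in, c_out in [(1, 8), (8, 16), (16, 32)]:
    w = 0.1 * rng.standard_normal((c_out, c_in, 5))  # kernel size 5 (assumed)
    x = maxpool1d(relu(conv1d(x, w)), 2)

# Two fully-connected layers; the final layer outputs a time-domain frame.
h = x.reshape(-1)
w1 = 0.1 * rng.standard_normal((64, h.size))
w2 = 0.1 * rng.standard_normal((256, 64))
y = w2 @ relu(w1 @ h)

print(y.shape)  # the network maps a waveform frame back to a waveform frame
```

Note how no spectrogram or phase reconstruction step appears anywhere: because the output is itself a waveform, the phase problem of magnitude-spectrum methods does not arise.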
References
Huang, P.S., Chen, S.D., Smaragdis, P., Hasegawa-Johnson, M.: Singing-voice separation from monaural recordings using robust principal component analysis. In: 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 57–60. IEEE (2012)
Huang, P.S., Kim, M., Hasegawa-Johnson, M., Smaragdis, P.: Deep learning for monaural speech separation. In: 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1562–1566. IEEE (2014)
Huang, P.S., Kim, M., Hasegawa-Johnson, M., Smaragdis, P.: Joint optimization of masks and deep recurrent neural networks for monaural source separation. IEEE/ACM Trans. Audio Speech Lang. Process. 23(12), 2136–2147 (2015)
Kabal, P.: TSP Speech Database. McGill University, Database Version 1(0), 09–02 (2002)
Lee, D.D., Seung, H.S.: Learning the parts of objects by non-negative matrix factorization. Nature 401(6755), 788–791 (1999)
Nair, V., Hinton, G.E.: Rectified linear units improve restricted Boltzmann machines. In: Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 807–814 (2010)
Raffel, C., McFee, B., Humphrey, E.J., Salamon, J., Nieto, O., Liang, D., Ellis, D.P.: mir_eval: a transparent implementation of common MIR metrics. In: Proceedings of the 15th International Society for Music Information Retrieval Conference, ISMIR (2014)
Smaragdis, P., Raj, B., Shashanka, M.: A probabilistic latent variable model for acoustic modeling. In: Advances in Models for Acoustic Processing, NIPS 148, 8–1 (2006)
Vincent, E., Gribonval, R., Févotte, C.: Performance measurement in blind audio source separation. IEEE Trans. Audio Speech Lang. Process. 14(4), 1462–1469 (2006)
Vinyals, O., Ravuri, S.V., Povey, D.: Revisiting recurrent neural networks for robust ASR. In: 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4085–4088. IEEE (2012)
Wang, Y., Narayanan, A., Wang, D.: On training targets for supervised speech separation. IEEE/ACM Trans. Audio Speech Lang. Process. 22(12), 1849–1858 (2014)
Wang, Y., Wang, D.: Towards scaling up classification-based speech separation. IEEE Trans. Audio Speech Lang. Process. 21(7), 1381–1390 (2013)
Wang, Y., Wang, D.: A deep neural network for time-domain signal reconstruction. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4390–4394. IEEE (2015)
Williamson, D.S., Wang, Y., Wang, D.: Complex ratio masking for joint enhancement of magnitude and phase. In: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5220–5224. IEEE (2016)
Acknowledgments
This work is supported by the National Natural Science Foundation of China under Grant 61071208.
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Zhang, P., Ma, X., Ding, S. (2017). Audio Source Separation from a Monaural Mixture Using Convolutional Neural Network in the Time Domain. In: Cong, F., Leung, A., Wei, Q. (eds) Advances in Neural Networks - ISNN 2017. ISNN 2017. Lecture Notes in Computer Science(), vol 10262. Springer, Cham. https://doi.org/10.1007/978-3-319-59081-3_46
Print ISBN: 978-3-319-59080-6
Online ISBN: 978-3-319-59081-3