Audio Source Separation from a Monaural Mixture Using Convolutional Neural Network in the Time Domain

  • Conference paper

Advances in Neural Networks - ISNN 2017 (ISNN 2017)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 10262)
Abstract

Audio source separation from a monaural mixture, termed monaural source separation, is an important and challenging problem in many applications. In this paper, a monaural source separation method using a convolutional neural network in the time domain is proposed. The proposed network, whose input and output are both time-domain signals, consists of three convolutional layers, each followed by a max-pooling layer, and two fully-connected layers. Two key ideas underlie the time-domain convolutional network: first, features are learned automatically by the convolutional layers rather than hand-extracted (e.g., as spectra); second, the phase is recovered automatically because both the input and output are in the time domain. The proposed approach is evaluated on the TSP speech corpus for monaural source separation, and achieves an SIR gain of around 4.31–7.77 dB over a deep neural network, a recurrent neural network, and nonnegative matrix factorization, while maintaining better SDR and SAR.
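The forward pass described in the abstract can be sketched as follows. This is an illustrative pure-Python sketch, not the authors' code: kernel length, channel counts, pooling width, layer sizes, and the channel-summing step are all assumptions made here for brevity. It only shows the shape of the computation: three 1-D convolutional layers, each followed by ReLU and max-pooling, then two fully-connected layers mapping back to a time-domain frame of the same length as the input.

```python
# Illustrative sketch of a time-domain CNN forward pass (assumed sizes).
import random

random.seed(0)

def conv1d(x, kernels):
    """'Valid' 1-D convolution of signal x; one output channel per kernel."""
    k = len(kernels[0])
    return [[sum(w[j] * x[i + j] for j in range(k))
             for i in range(len(x) - k + 1)] for w in kernels]

def relu(channels):
    return [[max(0.0, v) for v in ch] for ch in channels]

def maxpool(channels, width=2):
    """Non-overlapping max-pooling along time."""
    return [[max(ch[i:i + width]) for i in range(0, len(ch) - width + 1, width)]
            for ch in channels]

def merge(channels):
    # Simplification made here: sum channels so the next layer sees one signal.
    return [sum(vals) for vals in zip(*channels)]

def dense(vec, weights):
    return [sum(wi * xi for wi, xi in zip(row, vec)) for row in weights]

def features(x, conv_stack):
    signal = x
    for kernels in conv_stack:
        signal = merge(maxpool(relu(conv1d(signal, kernels))))
    return signal

# Toy dimensions: a 64-sample mixture frame, 3 conv layers of 4 length-5 kernels.
frame = [random.uniform(-1.0, 1.0) for _ in range(64)]
conv_stack = [[[random.gauss(0.0, 0.1) for _ in range(5)] for _ in range(4)]
              for _ in range(3)]
feat = features(frame, conv_stack)
fc1 = [[random.gauss(0.0, 0.1) for _ in feat] for _ in range(8)]
fc2 = [[random.gauss(0.0, 0.1) for _ in range(8)] for _ in range(64)]
hidden = [max(0.0, v) for v in dense(feat, fc1)]
# Time-domain estimate of one source, same length as the input frame.
estimate = dense(hidden, fc2)
```

In a trained system the weights would be learned by minimizing a time-domain reconstruction error between the estimate and the clean source, so no separate phase-reconstruction step is needed.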


Notes

  1. http://pengzhxyz.github.io/bss-time-cnn.


Acknowledgments

This work was supported by the National Natural Science Foundation of China under Grant 61071208.

Author information

Corresponding author: Xiaohong Ma.


Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Zhang, P., Ma, X., Ding, S. (2017). Audio Source Separation from a Monaural Mixture Using Convolutional Neural Network in the Time Domain. In: Cong, F., Leung, A., Wei, Q. (eds) Advances in Neural Networks - ISNN 2017. ISNN 2017. Lecture Notes in Computer Science, vol 10262. Springer, Cham. https://doi.org/10.1007/978-3-319-59081-3_46


  • DOI: https://doi.org/10.1007/978-3-319-59081-3_46

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-59080-6

  • Online ISBN: 978-3-319-59081-3

  • eBook Packages: Computer Science, Computer Science (R0)
