Abstract
This chapter introduces deep neural network (DNN) based mask estimation for supervised speech separation. Originating in computational auditory scene analysis (CASA), this approach treats speech separation as a mask estimation problem. Given a time-frequency (T-F) representation of noisy speech, the ideal binary mask (IBM) or ideal ratio mask (IRM) is defined to differentiate speech-dominant T-F units from noise-dominant ones. Mask estimation is then formulated as a supervised learning problem: a mapping function is learned from acoustic features extracted from noisy speech to an ideal mask. Three main aspects of supervised learning, namely learning machines, training targets, and features, are discussed in separate sections. Subsequently, we describe several representative supervised algorithms, mainly for monaural speech separation. For supervised separation, generalization to unseen conditions is a critical issue, and the generalization capability of supervised speech separation is also discussed.
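To make the two training targets concrete, the following sketch computes the IBM and IRM from clean-speech and noise magnitude spectrograms. The local SNR criterion for the IBM and the squared-magnitude form of the IRM follow common definitions in the literature; the threshold `lc_db=0` and exponent `beta=0.5` are typical choices, but the exact values vary across studies and should be treated as assumptions here.

```python
import numpy as np

def ideal_masks(speech_mag, noise_mag, lc_db=0.0, beta=0.5):
    """Compute the IBM and IRM from clean-speech and noise magnitude
    spectrograms of the same shape (time x frequency).

    IBM(t, f) = 1 if the local SNR exceeds lc_db, else 0.
    IRM(t, f) = (S^2 / (S^2 + N^2))^beta, a soft mask in [0, 1].
    """
    eps = 1e-10  # guard against division by zero and log of zero
    snr_db = 20.0 * np.log10((speech_mag + eps) / (noise_mag + eps))
    ibm = (snr_db > lc_db).astype(np.float32)  # 1 for speech-dominant units
    irm = (speech_mag**2 / (speech_mag**2 + noise_mag**2 + eps)) ** beta
    return ibm, irm
```

Applying either mask to the noisy spectrogram (element-wise multiplication) and resynthesizing yields the separated speech; in supervised separation, a DNN is trained to predict these masks from features of the noisy mixture alone.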
© 2018 Springer International Publishing AG
Cite this chapter
Chen, J., Wang, D. (2018). DNN Based Mask Estimation for Supervised Speech Separation. In: Makino, S. (eds) Audio Source Separation. Signals and Communication Technology. Springer, Cham. https://doi.org/10.1007/978-3-319-73031-8_9
Print ISBN: 978-3-319-73030-1
Online ISBN: 978-3-319-73031-8