Abstract
This chapter introduces deep neural network (DNN) based mask estimation for supervised speech separation. Originating in computational auditory scene analysis (CASA), this approach treats speech separation as a mask estimation problem. Given a time-frequency (T-F) representation of noisy speech, the ideal binary mask (IBM) or ideal ratio mask (IRM) is defined to differentiate speech-dominant T-F units from noise-dominant ones. Mask estimation is then formulated as a supervised learning problem: a mapping function is learned from acoustic features extracted from noisy speech to an ideal mask. Three main aspects of supervised learning, namely learning machines, training targets, and features, are discussed in separate sections. Subsequently, we describe several representative supervised algorithms, mainly for monaural speech separation. For supervised separation, generalization to unseen conditions is a critical issue, and the generalization capability of supervised speech separation is also discussed.
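To make the two training targets concrete, the following sketch computes the IBM and IRM from clean-speech and noise magnitude spectrograms. The local SNR criterion for the IBM and the squared-magnitude form of the IRM follow common definitions in the literature; the threshold `lc_db=0` and exponent `beta=0.5` are typical choices, but the exact values vary across studies and should be treated as assumptions here.

```python
import numpy as np

def ideal_masks(speech_mag, noise_mag, lc_db=0.0, beta=0.5):
    """Compute the IBM and IRM from clean-speech and noise magnitude
    spectrograms of the same shape (time x frequency).

    IBM(t, f) = 1 if the local SNR exceeds lc_db, else 0.
    IRM(t, f) = (S^2 / (S^2 + N^2))^beta, a soft mask in [0, 1].
    """
    eps = 1e-10  # guard against division by zero and log of zero
    snr_db = 20.0 * np.log10((speech_mag + eps) / (noise_mag + eps))
    ibm = (snr_db > lc_db).astype(np.float32)  # 1 for speech-dominant units
    irm = (speech_mag**2 / (speech_mag**2 + noise_mag**2 + eps)) ** beta
    return ibm, irm
```

Applying either mask to the noisy spectrogram (element-wise multiplication) and resynthesizing yields the separated speech; in supervised separation, a DNN is trained to predict these masks from features of the noisy mixture alone.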
© 2018 Springer International Publishing AG
Cite this chapter
Chen, J., Wang, D. (2018). DNN Based Mask Estimation for Supervised Speech Separation. In: Makino, S. (eds) Audio Source Separation. Signals and Communication Technology. Springer, Cham. https://doi.org/10.1007/978-3-319-73031-8_9
Print ISBN: 978-3-319-73030-1
Online ISBN: 978-3-319-73031-8