DNN Based Mask Estimation for Supervised Speech Separation

Chapter in: Audio Source Separation

Part of the book series: Signals and Communication Technology (SCT)

Abstract

This chapter introduces deep neural network (DNN) based mask estimation for supervised speech separation. Following an approach that originated in computational auditory scene analysis (CASA), we treat speech separation as a mask estimation problem. Given a time-frequency (T-F) representation of noisy speech, the ideal binary mask (IBM) or ideal ratio mask (IRM) is defined to differentiate speech-dominant T-F units from noise-dominant ones. Mask estimation is then formulated as a supervised learning problem: learning a mapping function from acoustic features extracted from noisy speech to an ideal mask. The three main aspects of supervised learning, namely learning machines, training targets, and features, are discussed in separate sections. Subsequently, we describe several representative supervised algorithms, mainly for monaural speech separation. For supervised separation, generalization to unseen conditions is a critical issue, and the generalization capability of supervised speech separation is also discussed.
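
To make the two training targets mentioned above concrete, here is a minimal NumPy sketch (ours, not taken from the chapter) that computes the IBM and IRM from magnitude spectrograms of the premixed clean speech and noise. The local SNR criterion `lc_db` and the compression exponent `beta` are common choices in this literature and are assumptions here, not values fixed by the abstract.

```python
import numpy as np

def ideal_masks(speech_mag, noise_mag, lc_db=0.0, beta=0.5):
    """Compute the IBM and IRM from magnitude spectrograms
    (shape: [freq, time]) of the premixed speech and noise."""
    eps = 1e-12  # guards against division by zero and log(0)
    speech_pow = speech_mag ** 2
    noise_pow = noise_mag ** 2
    # Local SNR (dB) at each T-F unit.
    snr_db = 10.0 * np.log10((speech_pow + eps) / (noise_pow + eps))
    # IBM: 1 where the unit is speech-dominant (local SNR above the
    # criterion), 0 where it is noise-dominant.
    ibm = (snr_db > lc_db).astype(np.float32)
    # IRM: soft mask in [0, 1]; beta = 0.5 applies square-root compression.
    irm = (speech_pow / (speech_pow + noise_pow + eps)) ** beta
    return ibm, irm
```

In the supervised formulation the abstract describes, a DNN is then trained (typically with a mean-squared-error loss) to predict such a mask from acoustic features of the noisy mixture; at test time the estimated mask is applied to the mixture's T-F representation to resynthesize the target speech.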


Author information

Corresponding author: Jitong Chen

Copyright information

© 2018 Springer International Publishing AG

About this chapter

Cite this chapter

Chen, J., Wang, D. (2018). DNN Based Mask Estimation for Supervised Speech Separation. In: Makino, S. (ed.) Audio Source Separation. Signals and Communication Technology. Springer, Cham. https://doi.org/10.1007/978-3-319-73031-8_9

  • DOI: https://doi.org/10.1007/978-3-319-73031-8_9

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-73030-1

  • Online ISBN: 978-3-319-73031-8

  • eBook Packages: Engineering, Engineering (R0)
