Abstract
Deep learning-based speech enhancement approaches such as deep neural networks (DNN) and long short-term memory (LSTM) networks have already demonstrated results superior to those of classical methods. However, these approaches do not take full advantage of temporal context information: while DNN and LSTM consider temporal context in the noisy source speech, they do not do so for the estimated clean speech. Both DNN and LSTM also tend to over-smooth spectra, which causes the enhanced speech to sound muffled. This paper proposes a novel architecture that addresses both issues, which we term a conditional generative model (CGM). By adopting an adversarial training scheme applied to a generator of deep dilated convolutional layers, CGM is designed to model the joint and symmetric conditions of both the noisy and the estimated clean spectra. We evaluate CGM against both DNN and LSTM in terms of Perceptual Evaluation of Speech Quality (PESQ) and Short-Time Objective Intelligibility (STOI) on TIMIT sentences corrupted by ITU-T P.501 and NOISEX-92 noise in a range of matched and mismatched noise conditions. Results show that both the CGM architecture and the adversarial training mechanism lead to better PESQ and STOI in all tested noise conditions. Beyond these improvements, CGM and adversarial training also mitigate over-smoothing.
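As a concrete illustration of the idea the abstract sketches, the following is a minimal, hypothetical rendering of a CGM-style setup: a generator built from stacked dilated convolutions (so the receptive field spans a wide temporal context) trained adversarially against a discriminator that scores joint (noisy, clean) spectrogram pairs. It is written in PyTorch purely for illustration (the references indicate the authors used MXNet), and every layer width, dilation rate, and loss weighting below is an assumed placeholder, not the paper's configuration.

```python
# A minimal sketch of a CGM-style enhancer, assuming log-power spectra
# shaped (batch, bins, frames). PyTorch is used for illustration only;
# all sizes and dilation rates are hypothetical, not the paper's values.
import torch
import torch.nn as nn
import torch.nn.functional as F

N_BINS = 257  # e.g. frequency bins of a 512-point STFT (assumed)

class Generator(nn.Module):
    """Maps noisy spectra to enhanced spectra via dilated convolutions."""
    def __init__(self, channels=64, dilations=(1, 2, 4, 8)):
        super().__init__()
        layers = [nn.Conv1d(N_BINS, channels, 3, padding=1), nn.ReLU()]
        for d in dilations:  # doubling dilation widens temporal context
            layers += [nn.Conv1d(channels, channels, 3, padding=d, dilation=d),
                       nn.ReLU()]
        layers += [nn.Conv1d(channels, N_BINS, 1)]
        self.net = nn.Sequential(*layers)

    def forward(self, noisy):
        return self.net(noisy)

class Discriminator(nn.Module):
    """Scores a joint (noisy, candidate-clean) pair of spectrograms."""
    def __init__(self, channels=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(2 * N_BINS, channels, 3, padding=1), nn.LeakyReLU(0.2),
            nn.Conv1d(channels, channels, 3, padding=1), nn.LeakyReLU(0.2),
            nn.Conv1d(channels, 1, 1),
        )

    def forward(self, noisy, candidate):
        joint = torch.cat([noisy, candidate], dim=1)  # condition on both
        return self.net(joint).mean(dim=(1, 2))       # one logit per example

def train_step(G, D, opt_g, opt_d, noisy, clean):
    ones = torch.ones(noisy.size(0))
    zeros = torch.zeros(noisy.size(0))
    # Discriminator step: real pairs -> 1, generated pairs -> 0.
    enhanced = G(noisy).detach()
    d_loss = (F.binary_cross_entropy_with_logits(D(noisy, clean), ones)
              + F.binary_cross_entropy_with_logits(D(noisy, enhanced), zeros))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()
    # Generator step: fool the discriminator while staying near the target.
    enhanced = G(noisy)
    g_loss = (F.binary_cross_entropy_with_logits(D(noisy, enhanced), ones)
              + F.mse_loss(enhanced, clean))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
```

In this sketch, each successive dilation doubles the temporal span a layer sees, so the enhanced output at a given frame conditions on a long window of the noisy input; the adversarial term, rather than a pure mean-squared-error target alone, is what pushes the generator away from over-smoothed spectra.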
Notes
Matched noises: white from the NOISEX-92 database; res_mono and con_bin from the ITU-T Recommendation P.501 database. Mismatched noises: destroyerops, f16 and m109 from the NOISEX-92 database.
References
M. Arjovsky, S. Chintala, L. Bottou, Wasserstein GAN. arXiv preprint arXiv:1701.07875 (2017)
S. Boll, Suppression of acoustic noise in speech using spectral subtraction. IEEE Trans. Acoust. Speech Signal Process. 27(2), 113–120 (1979)
J. Chen, D. Wang, Long short-term memory for speaker generalization in supervised speech separation. in INTERSPEECH (2016), pp. 3314–3318
T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, Z. Zhang, MXNet: a flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274 (2015)
Z. Chen, S. Watanabe, H. Erdogan, J. Hershey, Integration of speech enhancement and recognition using long short-term memory recurrent neural network. in INTERSPEECH (2015)
I. Cohen, B. Berdugo, Speech enhancement for non-stationary noise environments. Signal Process. 81(11), 2403–2418 (2001)
J. Du, Y. Tu, L.R. Dai, C.H. Lee, A regression approach to single-channel speech separation via high-resolution deep neural networks. IEEE/ACM Trans. Audio Speech Lang. Process. (TASLP) 24(8), 1424–1437 (2016)
Y. Ephraim, D. Malah, Speech enhancement using a minimum mean-square error log-spectral amplitude estimator. IEEE Trans. Acoust. Speech Signal Process. 33(2), 443–445 (1985)
H. Erdogan, J. R. Hershey, S. Watanabe, J. Le Roux, Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks. in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (IEEE, 2015), pp. 708–712
T. Gao, J. Du, L.R. Dai, C.H. Lee, SNR-based progressive learning of deep neural network for speech enhancement. in INTERSPEECH (2016), pp. 3713–3717
J.S. Garofolo, L.F. Lamel, W.M. Fisher, J.G. Fiscus, D.S. Pallett, DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1. NASA STI/Recon technical report n 93 (1993)
I. Goodfellow, NIPS 2016 tutorial: generative adversarial networks. arXiv preprint arXiv:1701.00160 (2016)
I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial nets. in Advances in Neural Information Processing Systems (2014), pp. 2672–2680
K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition. in IEEE Conference on Computer Vision and Pattern Recognition (2016), pp. 770–778
S. Hochreiter, J. Schmidhuber, Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
S. Ioffe, C. Szegedy, Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 (2015)
ITU: Test signals for use in telephonometry. ITU-T Recommendation P.501 (Aug. 1996)
D. Kingma, J. Ba, Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
A. Kumar, D. Florencio, Speech enhancement in multiple-noise conditions using deep neural networks. in INTERSPEECH (2016), pp. 3738–3742
V. Nair, G.E. Hinton, Rectified linear units improve restricted Boltzmann machines. in Proceedings of the 27th International Conference on Machine Learning (2010), pp. 807–814
S. Pascual, A. Bonafonte, J. Serrà, SEGAN: speech enhancement generative adversarial network. arXiv preprint arXiv:1703.09452 (2017)
A. Radford, L. Metz, S. Chintala, Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434 (2015)
ITU-T, Pulse code modulation (PCM) of voice frequencies. ITU-T Recommendation G.711 (1988)
A. Rix, J. Beerends, M. Hollier, A. Hekstra, Perceptual evaluation of speech quality (PESQ), an objective method for end-to-end speech quality assessment of narrowband telephone networks and speech codecs. ITU-T Recommendation P.862 (2001)
L. Sun, J. Du, L.R. Dai, C.H. Lee, Multiple-target deep learning for LSTM-RNN based speech enhancement. in Hands-free Speech Communications and Microphone Arrays (HSCMA), 2017 (IEEE, 2017), pp. 136–140
C.H. Taal, R.C. Hendriks, R. Heusdens, J. Jensen, An algorithm for intelligibility prediction of time–frequency weighted noisy speech. IEEE/ACM Trans. Audio Speech Lang. Process. (TASLP) 19(7), 2125–2136 (2011)
T. Tieleman, G. Hinton, Lecture 6.5 - RMSProp: divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning (2012)
T. Toda, A.W. Black, K. Tokuda, Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory. IEEE/ACM Trans. Audio Speech Lang. Process. (TASLP) 15(8), 2222–2235 (2007)
A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, K. Kavukcuoglu, WaveNet: a generative model for raw audio. CoRR abs/1609.03499 (2016)
A. van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, A. Graves, et al., Conditional image generation with PixelCNN decoders. in Advances in Neural Information Processing Systems (2016), pp. 4790–4798
A. Varga, H.J. Steeneken, Assessment for automatic speech recognition: II. NOISEX-92: a database and an experiment to study the effect of additive noise on speech recognition systems. Speech Commun. 12(3), 247–251 (1993)
Q. Wang, J. Du, L.R. Dai, C.H. Lee, Joint noise and mask aware training for DNN-based speech enhancement with sub-band features. in Hands-free Speech Communications and Microphone Arrays (HSCMA), 2017 (IEEE, 2017), pp. 101–105
Y. Wang, J. Chen, D. Wang, Deep neural network based supervised speech segregation generalizes to novel noises through large-scale training. Dept. of Comput. Sci. and Eng., The Ohio State Univ., Columbus, OH, USA, Tech. Rep. OSU-CISRC-3/15-TR02 (2015)
Y. Wang, A. Narayanan, D. Wang, On training targets for supervised speech separation. IEEE/ACM Trans. Audio Speech Lang. Process. (TASLP) 22(12), 1849–1858 (2014)
F. Weninger, H. Erdogan, S. Watanabe, E. Vincent, J. Le Roux, J.R. Hershey, B. Schuller, Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR. in International Conference on Latent Variable Analysis and Signal Separation (Springer, 2015), pp. 91–99
F. Weninger, J.R. Hershey, J. Le Roux, B. Schuller, Discriminatively trained recurrent neural networks for single-channel speech separation. in 2014 IEEE Global Conference on Signal and Information Processing (GlobalSIP) (IEEE, 2014), pp. 577–581
D.S. Williamson, Y. Wang, D. Wang, Complex ratio masking for joint enhancement of magnitude and phase. in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (IEEE, 2016), pp. 5220–5224
Y. Xu, J. Du, L.R. Dai, C.H. Lee, An experimental study on speech enhancement based on deep neural networks. IEEE Signal Process. Lett. 21(1), 65–68 (2014)
Y. Xu, J. Du, L.R. Dai, C.H. Lee, Global variance equalization for improving deep neural network based speech enhancement. in 2014 IEEE China Summit & International Conference on Signal and Information Processing (ChinaSIP) (IEEE, 2014), pp. 71–75
Y. Xu, J. Du, L.R. Dai, C.H. Lee, A regression approach to speech enhancement based on deep neural networks. IEEE/ACM Trans. Audio Speech Lang. Process. (TASLP) 23(1), 7–19 (2015). https://doi.org/10.1109/TASLP.2014.2364452
Y. Xu, J. Du, Z. Huang, L.R. Dai, C.H. Lee, Multi-objective learning and mask-based post-processing for deep neural network based speech enhancement. arXiv preprint arXiv:1703.07172 (2017)
X.L. Zhang, D. Wang, Multi-resolution stacking for speech separation based on boosted DNN. in INTERSPEECH (2015), pp. 1745–1749
X.L. Zhang, D. Wang, A deep ensemble learning method for monaural speech separation. IEEE/ACM Trans. Audio Speech Lang. Process. (TASLP) 24(5), 967–977 (2016)
Y. Zhao, D. Wang, I. Merks, T. Zhang, DNN-based enhancement of noisy and reverberant speech. in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (IEEE, 2016), pp. 6525–6529
Additional information
This work was supported in part by the National Natural Science Foundation of China (Grant Nos. 61273264 and U1613211) and by the Science and Technology Department of Anhui Province (Grant No. 15CZZ02007).
Cite this article
Li, ZX., Dai, LR., Song, Y. et al. A Conditional Generative Model for Speech Enhancement. Circuits Syst Signal Process 37, 5005–5022 (2018). https://doi.org/10.1007/s00034-018-0798-4