Abstract
When speech signals are captured in real acoustical environments, the captured signals are distorted by certain types of interference, such as ambient noise, reverberation, and extraneous speakers’ utterances. There are two important approaches to speech enhancement that reduce such interference in the captured signals. One approach is based on the spatial features of the signals, such as direction of arrival and acoustic transfer functions, and enhances speech using multichannel audio signal processing. The other approach is based on speech spectral models that represent the probability density function of the speech spectra, and it enhances speech by distinguishing between speech and noise based on the spectral models. In this chapter, we propose a new approach that integrates the above two approaches. The proposed approach uses the spatial and spectral features of signals in a complementary manner to achieve reliable and accurate speech enhancement. The approach can be applied to various speech enhancement problems, including denoising, dereverberation, and blind source separation (BSS). In particular, in this chapter, we focus on applying the approach to BSS. We show experimentally that the proposed integration can improve the performance of BSS compared with a conventional approach.
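As a purely illustrative sketch (not the chapter's algorithm), the complementary use of spatial and spectral cues described above can be demonstrated on a toy two-channel mixture: a spatial cue (inter-channel phase difference) assigns each time-frequency point to a source, while a simple spectral cue (a log-power gate) discards low-energy points. All signals, phase shifts, and thresholds here are synthetic assumptions.

```python
# Toy illustration: combining a spatial cue (inter-channel phase difference)
# with a spectral cue (log-power gate) to build a time-frequency mask.
import numpy as np

rng = np.random.default_rng(0)
F, T = 64, 100  # frequency bins, time frames

# Synthetic "source" spectrograms: each source is active in disjoint frames.
s1 = np.zeros((F, T), complex)
s2 = np.zeros((F, T), complex)
s1[:, :50] = rng.normal(size=(F, 50)) + 1j * rng.normal(size=(F, 50))
s2[:, 50:] = rng.normal(size=(F, 50)) + 1j * rng.normal(size=(F, 50))

# Two-channel mixture: channel 2 applies a source-dependent phase shift,
# a crude stand-in for an acoustic transfer function.
x1 = s1 + s2
x2 = s1 * np.exp(1j * 0.5) + s2 * np.exp(-1j * 0.5)

# Spatial feature: inter-channel phase difference at each TF point.
ipd = np.angle(x2 * np.conj(x1))

# Spectral feature: log power of the mixture at channel 1.
logpow = np.log(np.abs(x1) ** 2 + 1e-12)

# Combine both cues: assign a TF point to source 1 if its IPD is closer
# to +0.5 rad than to -0.5 rad, but only where the mixture carries energy.
active = logpow > logpow.mean()
mask1 = (np.abs(ipd - 0.5) < np.abs(ipd + 0.5)) & active

y1 = mask1 * x1  # masked estimate of source 1
```

With this construction, the mask selects only frames in which source 1 is active; real BSS replaces the hard phase comparison and energy gate with probabilistic spatial and spectral models estimated from data.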
Notes
- 1.
As noted later, despite this assumption, this scenario can represent a situation with long reverberation and can be used to achieve dereverberation.
- 2.
If we interpret the ATFs from \(s_{t}\) to \(z_{t}^{(m)}\) also as a part of the interference, we may formulate speech enhancement that estimates \(s_{t}\). This is beyond the scope of this chapter.
- 3.
The same model can be used to represent ambient noise, for example, as in [10]. The formulation of MLSE for denoising and its extension to MAPSE can be found in [12]. For MLSE-based dereverberation with the long-term linear prediction approach, the generative model of the interference can be defined as follows [10, 11, 16]:
$$p(\mathbf{a}_{n,f}\vert \theta _{f}) = \delta (\mathbf{a}_{n,f} -\mathbf{r}_{n,f}(\theta _{f})), \qquad (9.16)$$
where \(\delta(\cdot)\) is the Dirac delta function, and \(\mathbf{r}_{n,f}(\theta _{f}) = [r_{n,f}^{(1)}(\theta _{f}),r_{n,f}^{(2)}(\theta _{f}),\ldots,r_{n,f}^{(M)}(\theta _{f})]^{T}\) is the spatial vector of the interference signal, namely the late reverberation. The model parameter set \(\theta_{f}\) consists of the prediction coefficients, and in MLSE-based dereverberation the late reverberation \(r_{n,f}^{(m)}(\theta _{f})\) is modeled as the inner product of a vector of prediction coefficients and a vector of past captured signals. As discussed in [11], MLSE-based dereverberation can be extended to MAPSE-based dereverberation using the technique presented in this chapter.
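As a hedged illustration of the long-term linear-prediction model in note 3, the following sketch computes the late reverberation at one frequency bin as the inner product \(g^{H}\mathbf{x}_{\mathrm{past}}\) of prediction coefficients with a window of past captured frames. The variable names, prediction delay, and filter length are illustrative assumptions, not the chapter's settings.

```python
# Sketch: late reverberation as an inner product of prediction coefficients
# with past captured frames (long-term linear prediction, single bin).
import numpy as np

rng = np.random.default_rng(1)
N, K, D = 200, 10, 2  # frames, filter taps, prediction delay (all assumed)

x = rng.normal(size=N) + 1j * rng.normal(size=N)  # captured signal, one bin
g = rng.normal(size=K) + 1j * rng.normal(size=K)  # prediction coefficients

def late_reverb(x, g, n, D):
    """r_n = g^H x_past, with x_past = [x[n-D], ..., x[n-D-K+1]]."""
    K = len(g)
    past = x[n - D - K + 1 : n - D + 1][::-1]  # most recent past frame first
    return np.vdot(g, past)  # np.vdot conjugates g, giving g^H x_past

n = 50
r_n = late_reverb(x, g, n, D)
dereverbed = x[n] - r_n  # subtracting the predicted late reverberation
```

The prediction delay D excludes the most recent frames so that the direct sound and early reflections are not subtracted along with the late reverberation.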
References
J. Benesty, S. Makino, J. Chen (eds.), Speech Enhancement (Signals and Communication Technology) (Springer, Berlin, 2005)
C.M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics) (Springer, New York, 2010)
A.P. Dempster, N.M. Laird, D.B. Rubin, Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Stat. Soc. Ser. B Methodol. 39, 1–38 (1977)
N.Q.K. Duong, E. Vincent, R. Gribonval, Under-determined reverberant audio source separation using a full-rank spatial covariance model. IEEE Trans. Audio Speech Lang. Process. 18(7), 1830–1840 (2010)
M. Fujimoto, T. Nakatani, Model-based noise suppression using unsupervised estimation of hidden Markov model for non-stationary noise, in Proceedings of INTERSPEECH 2013 (2013), pp. 2982–2986
S. Gannot, M. Moonen, Subspace methods for multimicrophone speech dereverberation. EURASIP J. Adv. Signal Process. 2003(11), 1074–1090 (2003)
J.F. Gemmeke, T. Virtanen, A. Hurmalainen, Exemplar-based sparse representations for noise robust automatic speech recognition. IEEE Trans. Audio Speech Lang. Process. 19(7), 2067–2080 (2011)
S. Haykin, Adaptive Filter Theory, 5th edn. (Prentice Hall, Englewood Cliffs, 2013)
K. Iso, S. Araki, S. Makino, T. Nakatani, H. Sawada, T. Yamada, A. Nakamura, Blind source separation of mixed speech in a high reverberation environment, in Proceedings of 3rd Joint Workshop on Hands-free Speech Communication and Microphone Array (HSCMA-2011) (2011), pp. 36–39
N. Ito, S. Araki, T. Nakatani, Probabilistic integration of diffuse noise suppression and dereverberation, in Proceedings of IEEE ICASSP-2014 (2014), pp. 5204–5208
Y. Iwata, T. Nakatani, Introduction of speech log-spectral priors into dereverberation based on Itakura-Saito distance minimization, in Proceedings of IEEE ICASSP-2012 (2012), pp. 245–248
Y. Iwata, T. Nakatani, M. Fujimoto, T. Yoshioka, H. Saito, MAP spectral estimation of speech using log-spectral prior for noise reduction (in Japanese), in Proceedings of Autumn-2012 Meeting of the Acoustical Society of Japan (2012), pp. 795–798
Y. Izumi, N. Ono, S. Sagayama, Sparseness-based 2ch BSS using the EM algorithm in reverberant environment, in Proceedings of IEEE WASPAA-2007 (2007), pp. 147–150
P.C. Loizou, Speech Enhancement: Theory and Practice, 2nd edn. (CRC Press, Boca Raton, 2013)
P.J. Moreno, B. Raj, R.M. Stern, A vector Taylor series approach for environment-independent speech recognition, in Proceedings of IEEE ICASSP-1996, vol. 2 (1996), pp. 733–736
T. Nakatani, T. Yoshioka, K. Kinoshita, M. Miyoshi, B.H. Juang, Speech dereverberation based on variance-normalized delayed linear prediction. IEEE Trans. Audio Speech Lang. Process. 18(7), 1717–1731 (2010)
A. Ogawa, K. Kinoshita, T. Hori, T. Nakatani, A. Nakamura, Fast segment search for corpus-based speech enhancement based on speech recognition technology, in Proceedings of IEEE ICASSP-2014 (2014), pp. 1576–1580
D. Pearce, H.G. Hirsch, The AURORA experimental framework for the performance evaluation of speech recognition systems under noisy conditions, in Proceedings of INTERSPEECH-2000 (2000), pp. 29–32
R. Martin, Noise power spectral density estimation based on optimal smoothing and minimum statistics. IEEE Trans. Speech Audio Process. 9(5), 504–512 (2001)
S.J. Rennie, J.R. Hershey, P.A. Olsen, Single-channel multitalker speech recognition. IEEE SP Mag. 27(6), 66–80 (2010)
H. Sawada, S. Araki, R. Mukai, S. Makino, Grouping separated frequency components by estimating propagation model parameters in frequency-domain blind source separation. IEEE Trans. Audio Speech Lang. Process. 15(5), 1592–1604 (2007)
H. Sawada, S. Araki, S. Makino, Underdetermined convolutive blind source separation via frequency bin-wise clustering and permutation alignment. IEEE Trans. Audio Speech Lang. Process. 19(3), 516–527 (2011)
M. Seltzer, D. Yu, Y. Wang, An investigation of deep neural networks for noise robust speech recognition, in Proceedings of IEEE ICASSP-2013 (2013), pp. 7398–7402
M. Souden, J. Chen, J. Benesty, S. Affes, An integrated solution for online multichannel noise tracking and reduction. IEEE Trans. Audio Speech Lang. Process. 19, 2159–2169 (2011)
M. Togami, Y. Kawaguchi, R. Takeda, Y. Obuchi, N. Nukaga, Optimized speech dereverberation from probabilistic perspective for time varying acoustic transfer function. IEEE Trans. Audio Speech Lang. Process. 21(7), 1369–1380 (2013)
O. Yilmaz, S. Rickard, Blind separation of speech mixtures via time-frequency masking. IEEE Trans. Signal Process. 52(7), 1830–1847 (2004)
T. Yoshioka, T. Nakatani, M. Miyoshi, H.G. Okuno, Blind separation and dereverberation of speech mixtures by joint optimization. IEEE Trans. Audio Speech Lang. Process. 19(1), 69–84 (2011)
E. Vincent, H. Sawada, P. Bofill, S. Makino, J. Rosca, First stereo audio source separation evaluation campaign: data, algorithms and results, in Proceedings of International Conference on Independent Component Analysis (ICA) (2007), pp. 552–559
Copyright information
© 2015 Springer Science+Business Media New York
Cite this chapter
Iwata, Y., Nakatani, T., Yoshioka, T., Fujimoto, M., Saito, H. (2015). Maximum A Posteriori Spectral Estimation with Source Log-Spectral Priors for Multichannel Speech Enhancement. In: Ogunfunmi, T., Togneri, R., Narasimha, M. (eds) Speech and Audio Processing for Coding, Enhancement and Recognition. Springer, New York, NY. https://doi.org/10.1007/978-1-4939-1456-2_9
Print ISBN: 978-1-4939-1455-5
Online ISBN: 978-1-4939-1456-2