Abstract
When speech signals are captured in real acoustical environments, the captured signals are distorted by certain types of interference, such as ambient noise, reverberation, and extraneous speakers’ utterances. There are two important approaches to speech enhancement that reduce such interference in the captured signals. One approach is based on the spatial features of the signals, such as direction of arrival and acoustic transfer functions, and enhances speech using multichannel audio signal processing. The other approach is based on speech spectral models that represent the probability density function of the speech spectra, and it enhances speech by distinguishing between speech and noise based on the spectral models. In this chapter, we propose a new approach that integrates the above two approaches. The proposed approach uses the spatial and spectral features of signals in a complementary manner to achieve reliable and accurate speech enhancement. The approach can be applied to various speech enhancement problems, including denoising, dereverberation, and blind source separation (BSS). In particular, in this chapter, we focus on applying the approach to BSS. We show experimentally that the proposed integration can improve the performance of BSS compared with a conventional approach.
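As a purely illustrative sketch (not the chapter's algorithm), the complementary use of spatial and spectral cues described above can be demonstrated on a toy two-channel mixture: a spatial cue (inter-channel phase difference) assigns each time-frequency point to a source, while a simple spectral cue (a log-power gate) discards low-energy points. All signals, phase shifts, and thresholds here are synthetic assumptions.

```python
# Toy illustration: combining a spatial cue (inter-channel phase difference)
# with a spectral cue (log-power gate) to build a time-frequency mask.
import numpy as np

rng = np.random.default_rng(0)
F, T = 64, 100  # frequency bins, time frames

# Synthetic "source" spectrograms: each source is active in disjoint frames.
s1 = np.zeros((F, T), complex)
s2 = np.zeros((F, T), complex)
s1[:, :50] = rng.normal(size=(F, 50)) + 1j * rng.normal(size=(F, 50))
s2[:, 50:] = rng.normal(size=(F, 50)) + 1j * rng.normal(size=(F, 50))

# Two-channel mixture: channel 2 applies a source-dependent phase shift,
# a crude stand-in for an acoustic transfer function.
x1 = s1 + s2
x2 = s1 * np.exp(1j * 0.5) + s2 * np.exp(-1j * 0.5)

# Spatial feature: inter-channel phase difference at each TF point.
ipd = np.angle(x2 * np.conj(x1))

# Spectral feature: log power of the mixture at channel 1.
logpow = np.log(np.abs(x1) ** 2 + 1e-12)

# Combine both cues: assign a TF point to source 1 if its IPD is closer
# to +0.5 rad than to -0.5 rad, but only where the mixture carries energy.
active = logpow > logpow.mean()
mask1 = (np.abs(ipd - 0.5) < np.abs(ipd + 0.5)) & active

y1 = mask1 * x1  # masked estimate of source 1
```

With this construction, the mask selects only frames in which source 1 is active; real BSS replaces the hard phase comparison and energy gate with probabilistic spatial and spectral models estimated from data.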
Notes
- 1.
As noted later, despite this assumption, this scenario can represent a situation with long reverberation and can be used to achieve dereverberation.
- 2.
If we interpret the ATFs from \(s_{t}\) to \(z_{t}^{(m)}\) also as a part of the interference, we may formulate speech enhancement that estimates \(s_{t}\). This is beyond the scope of this chapter.
- 3.
The same model can be used to represent ambient noise, for example, as in [10]. The formulation of MLSE for denoising and its extension to MAPSE can be found in [12]. For MLSE-based dereverberation with the long-term linear prediction approach, the generative model of the interference can be defined as follows [10, 11, 16]:
$$p(\mathbf{a}_{n,f}\vert \theta _{f}) = \delta (\mathbf{a}_{n,f} -\mathbf{r}_{n,f}(\theta _{f})), \qquad (9.16)$$
where \(\delta(\cdot)\) is the Dirac delta function, and \(\mathbf{r}_{n,f}(\theta _{f}) = [r_{n,f}^{(1)}(\theta _{f}),r_{n,f}^{(2)}(\theta _{f}),\ldots,r_{n,f}^{(M)}(\theta _{f})]^{T}\) is the spatial vector of the interference signal, namely the late reverberation. The model parameter set \(\theta_{f}\) consists of the prediction coefficients, and in MLSE-based dereverberation the late reverberation \(r_{n,f}^{(m)}(\theta _{f})\) is modeled as the inner product of a vector of prediction coefficients and a vector of past captured signals. As discussed in [11], MLSE-based dereverberation can be extended to MAPSE-based dereverberation using the technique presented in this chapter.
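As a hedged illustration of the long-term linear-prediction model in note 3, the following sketch computes the late reverberation at one frequency bin as the inner product \(g^{H}\mathbf{x}_{\mathrm{past}}\) of prediction coefficients with a window of past captured frames. The variable names, prediction delay, and filter length are illustrative assumptions, not the chapter's settings.

```python
# Sketch: late reverberation as an inner product of prediction coefficients
# with past captured frames (long-term linear prediction, single bin).
import numpy as np

rng = np.random.default_rng(1)
N, K, D = 200, 10, 2  # frames, filter taps, prediction delay (all assumed)

x = rng.normal(size=N) + 1j * rng.normal(size=N)  # captured signal, one bin
g = rng.normal(size=K) + 1j * rng.normal(size=K)  # prediction coefficients

def late_reverb(x, g, n, D):
    """r_n = g^H x_past, with x_past = [x[n-D], ..., x[n-D-K+1]]."""
    K = len(g)
    past = x[n - D - K + 1 : n - D + 1][::-1]  # most recent past frame first
    return np.vdot(g, past)  # np.vdot conjugates g, giving g^H x_past

n = 50
r_n = late_reverb(x, g, n, D)
dereverbed = x[n] - r_n  # subtracting the predicted late reverberation
```

The prediction delay D excludes the most recent frames so that the direct sound and early reflections are not subtracted along with the late reverberation.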
References
J. Benesty, S. Makino, J. Chen (eds.), Speech Enhancement (Signals and Communication Technology) (Springer, Berlin, 2005)
C.M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics) (Springer, New York, 2010)
A.P. Dempster, N.M. Laird, D.B. Rubin, Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Stat. Soc. Ser. B Methodol. 39, 1–38 (1977)
N.Q.K. Duong, E. Vincent, R. Gribonval, Under-determined reverberant audio source separation using a full-rank spatial covariance model. IEEE Trans. Audio Speech Lang. Process. 18(7), 1830–1840 (2010)
M. Fujimoto, T. Nakatani, Model-based noise suppression using unsupervised estimation of hidden Markov model for non-stationary noise, in Proceedings of INTERSPEECH 2013 (2013), pp. 2982–2986
S. Gannot, M. Moonen, Subspace methods for multimicrophone speech dereverberation. EURASIP J. Adv. Signal Process. 2003(11), 1074–1090 (2003)
J.F. Gemmeke, T. Virtanen, A. Hurmalainen, Exemplar-based sparse representations for noise robust automatic speech recognition. IEEE Trans. Audio Speech Lang. Process. 19(7), 2067–2080 (2011)
S. Haykin, Adaptive Filter Theory, 5th edn. (Prentice Hall, Englewood Cliffs, 2013)
K. Iso, S. Araki, S. Makino, T. Nakatani, H. Sawada, T. Yamada, A. Nakamura, Blind source separation of mixed speech in a high reverberation environment, in Proceedings of 3rd Joint Workshop on Hands-free Speech Communication and Microphone Array (HSCMA-2011) (2011), pp. 36–39
N. Ito, S. Araki, T. Nakatani, Probabilistic integration of diffuse noise suppression and dereverberation, in Proceedings of IEEE ICASSP-2014 (2014), pp. 5204–5208
Y. Iwata, T. Nakatani, Introduction of speech log-spectral priors into dereverberation based on Itakura-Saito distance minimization, in Proceedings of IEEE ICASSP-2012 (2012), pp. 245–248
Y. Iwata, T. Nakatani, M. Fujimoto, T. Yoshioka, H. Saito, MAP spectral estimation of speech using log-spectral prior for noise reduction (in Japanese), in Proceedings of Autumn-2012 Meeting of the Acoustical Society of Japan (2012), pp. 795–798
Y. Izumi, N. Ono, S. Sagayama, Sparseness-based 2ch BSS using the EM algorithm in reverberant environment, in Proceedings of IEEE WASPAA-2007 (2007), pp. 147–150
P.C. Loizou, Speech Enhancement: Theory and Practice, 2nd edn. (CRC Press, Boca Raton, 2013)
P.J. Moreno, B. Raj, R.M. Stern, A vector Taylor series approach for environment-independent speech recognition, in Proceedings of IEEE ICASSP-1996, vol. 2 (1996), pp. 733–736
T. Nakatani, T. Yoshioka, K. Kinoshita, M. Miyoshi, B.H. Juang, Speech dereverberation based on variance-normalized delayed linear prediction. IEEE Trans. Audio Speech Lang. Process. 18(7), 1717–1731 (2010)
A. Ogawa, K. Kinoshita, T. Hori, T. Nakatani, A. Nakamura, Fast segment search for corpus-based speech enhancement based on speech recognition technology, in Proceedings of IEEE ICASSP-2014 (2014), pp. 1576–1580
D. Pearce, H.G. Hirsch, The AURORA experimental framework for the performance evaluation of speech recognition systems under noisy conditions, in Proceedings of INTERSPEECH-2000 (2000), pp. 29–32
R. Martin, Noise power spectral density estimation based on optimal smoothing and minimum statistics. IEEE Trans. Speech Audio Process. 9(5), 504–512 (2001)
S.J. Rennie, J.R. Hershey, P.A. Olsen, Single-channel multitalker speech recognition. IEEE SP Mag. 27(6), 66–80 (2010)
H. Sawada, S. Araki, R. Mukai, S. Makino, Grouping separated frequency components by estimating propagation model parameters in frequency-domain blind source separation. IEEE Trans. Audio Speech Lang. Process. 15(5), 1592–1604 (2007)
H. Sawada, S. Araki, S. Makino, Underdetermined convolutive blind source separation via frequency bin-wise clustering and permutation alignment. IEEE Trans. Audio Speech Lang. Process. 19(3), 516–527 (2011)
M. Seltzer, D. Yu, Y. Wang, An investigation of deep neural networks for noise robust speech recognition, in Proceedings of IEEE ICASSP-2013 (2013), pp. 7398–7402
M. Souden, J. Chen, J. Benesty, S. Affes, An integrated solution for online multichannel noise tracking and reduction. IEEE Trans. Audio Speech Lang. Process. 19, 2159–2169 (2011)
M. Togami, Y. Kawaguchi, R. Takeda, Y. Obuchi, N. Nukaga, Optimized speech dereverberation from probabilistic perspective for time varying acoustic transfer function. IEEE Trans. Audio Speech Lang. Process. 21(7), 1369–1380 (2013)
O. Yilmaz, S. Rickard, Blind separation of speech mixtures via time-frequency masking. IEEE Trans. Signal Process. 52(7), 1830–1847 (2004)
T. Yoshioka, T. Nakatani, M. Miyoshi, H.G. Okuno, Blind separation and dereverberation of speech mixtures by joint optimization. IEEE Trans. Audio Speech Lang. Process. 19(1), 69–84 (2011)
E. Vincent, H. Sawada, P. Bofill, S. Makino, J. Rosca, First stereo audio source separation evaluation campaign: data, algorithms and results, in Proceedings of International Conference on Independent Component Analysis (ICA) (2007), pp. 552–559
Copyright information
© 2015 Springer Science+Business Media New York
Cite this chapter
Iwata, Y., Nakatani, T., Yoshioka, T., Fujimoto, M., Saito, H. (2015). Maximum A Posteriori Spectral Estimation with Source Log-Spectral Priors for Multichannel Speech Enhancement. In: Ogunfunmi, T., Togneri, R., Narasimha, M. (eds) Speech and Audio Processing for Coding, Enhancement and Recognition. Springer, New York, NY. https://doi.org/10.1007/978-1-4939-1456-2_9
Print ISBN: 978-1-4939-1455-5
Online ISBN: 978-1-4939-1456-2