
Maximum A Posteriori Spectral Estimation with Source Log-Spectral Priors for Multichannel Speech Enhancement

  • Chapter
In: Speech and Audio Processing for Coding, Enhancement and Recognition
Abstract

When speech signals are captured in real acoustical environments, the captured signals are distorted by certain types of interference, such as ambient noise, reverberation, and extraneous speakers’ utterances. There are two important approaches to speech enhancement that reduce such interference in the captured signals. One approach is based on the spatial features of the signals, such as direction of arrival and acoustic transfer functions, and enhances speech using multichannel audio signal processing. The other approach is based on speech spectral models that represent the probability density function of the speech spectra, and it enhances speech by distinguishing between speech and noise based on the spectral models. In this chapter, we propose a new approach that integrates the above two approaches. The proposed approach uses the spatial and spectral features of signals in a complementary manner to achieve reliable and accurate speech enhancement. The approach can be applied to various speech enhancement problems, including denoising, dereverberation, and blind source separation (BSS). In particular, in this chapter, we focus on applying the approach to BSS. We show experimentally that the proposed integration can improve the performance of BSS compared with a conventional approach.


Notes

  1.

    As noted later, despite this assumption, this scenario can represent a situation with long reverberation, and can be used for achieving dereverberation.

  2.

    If we interpret the ATFs from \(s_{t}\) to \(z_{t}^{(m)}\) also as part of the interference, we may formulate speech enhancement that estimates \(s_{t}\). This is beyond the scope of this chapter.

  3.

    The same model can be used to represent ambient noise, for example, as in [10]. The formulation of MLSE for denoising and its extension to MAPSE can be found in [12]. For MLSE-based dereverberation with the long-term linear prediction approach, the generative model of the interference can be defined in the following form [10, 11, 16].

    $$p(\mathbf{a}_{n,f}\mid \theta_{f}) = \delta\big(\mathbf{a}_{n,f} - \mathbf{r}_{n,f}(\theta_{f})\big), \tag{9.16}$$

    where \(\delta(\cdot)\) is the Dirac delta function, and \(\mathbf{r}_{n,f}(\theta _{f}) = [r_{n,f}^{(1)}(\theta _{f}),r_{n,f}^{(2)}(\theta _{f}),\ldots,r_{n,f}^{(M)}(\theta _{f})]^{T}\) is the spatial vector of the interference signal, namely the late reverberation. The model parameter set \(\theta_{f}\) consists of the prediction coefficients, and in MLSE-based dereverberation the late reverberation \(r_{n,f}^{(m)}(\theta _{f})\) is modeled as the inner product of a vector of prediction coefficients and a vector of past captured signals. As shown in [11], MLSE-based dereverberation can be extended to MAPSE-based dereverberation using the technique discussed in this chapter.
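    To make the inner-product model concrete, the following sketch computes a late-reverberation estimate for a single frequency bin as the inner product of a prediction-coefficient vector and stacked past captured frames, then subtracts it from the current frame. All dimensions (number of microphones M, filter length L, prediction delay D) and the random coefficients are hypothetical placeholders, not values from this chapter; the sketch only illustrates the structure of \(\mathbf{r}_{n,f}(\theta_f)\) under these assumptions.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    M = 2   # microphones (hypothetical)
    L = 10  # prediction filter length in frames (hypothetical)
    D = 3   # prediction delay in frames (hypothetical)
    N = 50  # number of STFT frames in this frequency bin

    # Captured multichannel STFT signal in one frequency bin:
    # x[n] is an M-dimensional complex vector at frame n.
    x = rng.standard_normal((N, M)) + 1j * rng.standard_normal((N, M))

    # Prediction coefficients (part of theta_f): one complex filter of
    # length M*L per output channel, stacked as a (M*L, M) matrix.
    G = rng.standard_normal((M * L, M)) + 1j * rng.standard_normal((M * L, M))

    def late_reverb(x, G, n, D=D, L=L):
        """Late reverberation r_{n,f} modeled as the inner product of the
        prediction coefficients and the stacked past frames
        x[n-D-L+1], ..., x[n-D] (most recent first)."""
        past = x[n - D - L + 1 : n - D + 1][::-1].reshape(-1)  # length M*L
        return G.conj().T @ past  # M-dimensional late-reverberation estimate

    n = 20
    r = late_reverb(x, G, n)   # predicted late reverberation at frame n
    d = x[n] - r               # dereverberated (early-component) estimate
    ```

    In MLSE-based dereverberation the coefficients in `G` would be estimated by maximum likelihood rather than drawn at random; the point of the sketch is only the deterministic, delta-function-like mapping from past captured frames to the interference vector.
    
    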

References

  1. J. Benesty, S. Makino, J. Chen (eds.), Speech Enhancement (Signals and Communication Technology) (Springer, Berlin, 2005)

  2. C.M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics) (Springer, New York, 2010)

  3. A.P. Dempster, N.M. Laird, D.B. Rubin, Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Stat. Soc. Ser. B Methodol. 39, 1–38 (1977)

  4. N.Q.K. Duong, E. Vincent, R. Gribonval, Under-determined reverberant audio source separation using a full-rank spatial covariance model. IEEE Trans. Audio Speech Lang. Process. 18(7), 1830–1840 (2010)

  5. M. Fujimoto, T. Nakatani, Model-based noise suppression using unsupervised estimation of hidden Markov model for non-stationary noise, in Proceedings of INTERSPEECH 2013 (2013), pp. 2982–2986

  6. S. Gannot, M. Moonen, Subspace methods for multimicrophone speech dereverberation. EURASIP J. Adv. Signal Process. 2003(11), 1074–1090 (2003)

  7. J.F. Gemmeke, T. Virtanen, A. Hurmalainen, Exemplar-based sparse representations for noise robust automatic speech recognition. IEEE Trans. Audio Speech Lang. Process. 19(7), 2067–2080 (2011)

  8. S. Haykin, Adaptive Filter Theory, 5th edn. (Prentice Hall, Englewood Cliffs, 2013)

  9. K. Iso, S. Araki, S. Makino, T. Nakatani, H. Sawada, T. Yamada, A. Nakamura, Blind source separation of mixed speech in a high reverberation environment, in Proceedings of 3rd Joint Workshop on Hands-free Speech Communication and Microphone Arrays (HSCMA-2011) (2011), pp. 36–39

  10. N. Ito, S. Araki, T. Nakatani, Probabilistic integration of diffuse noise suppression and dereverberation, in Proceedings of IEEE ICASSP-2014 (2014), pp. 5204–5208

  11. Y. Iwata, T. Nakatani, Introduction of speech log-spectral priors into dereverberation based on Itakura-Saito distance minimization, in Proceedings of IEEE ICASSP-2012 (2012), pp. 245–248

  12. Y. Iwata, T. Nakatani, M. Fujimoto, T. Yoshioka, H. Saito, MAP spectral estimation of speech using log-spectral prior for noise reduction (in Japanese), in Proceedings of Autumn-2012 Meeting of the Acoustical Society of Japan (2012), pp. 795–798

  13. Y. Izumi, N. Ono, S. Sagayama, Sparseness-based 2ch BSS using the EM algorithm in reverberant environment, in Proceedings of IEEE WASPAA-2007 (2007), pp. 147–150

  14. P.C. Loizou, Speech Enhancement: Theory and Practice, 2nd edn. (CRC Press, Boca Raton, 2013)

  15. P.J. Moreno, B. Raj, R.M. Stern, A Vector Taylor Series approach for environment-independent speech recognition, in Proceedings of IEEE ICASSP-1996, vol. 2 (1996), pp. 733–736

  16. T. Nakatani, T. Yoshioka, K. Kinoshita, M. Miyoshi, B.H. Juang, Speech dereverberation based on variance-normalized delayed linear prediction. IEEE Trans. Audio Speech Lang. Process. 18(7), 1717–1731 (2010)

  17. A. Ogawa, K. Kinoshita, T. Hori, T. Nakatani, A. Nakamura, Fast segment search for corpus-based speech enhancement based on speech recognition technology, in Proceedings of IEEE ICASSP-2014 (2014), pp. 1576–1580

  18. D. Pearce, H.G. Hirsch, The AURORA experimental framework for the performance evaluation of speech recognition systems under noisy conditions, in Proceedings of INTERSPEECH-2000, vol. 2000 (2000), pp. 29–32

  19. R. Martin, Noise power spectral density estimation based on optimal smoothing and minimum statistics. IEEE Trans. Speech Audio Process. 9(5), 504–512 (2001)

  20. S.J. Rennie, J.R. Hershey, P.A. Olsen, Single-channel multitalker speech recognition. IEEE SP Mag. 27(6), 66–80 (2010)

  21. H. Sawada, S. Araki, R. Mukai, S. Makino, Grouping separated frequency components by estimating propagation model parameters in frequency-domain blind source separation. IEEE Trans. Audio Speech Lang. Process. 15(5), 1592–1604 (2007)

  22. H. Sawada, S. Araki, S. Makino, Underdetermined convolutive blind source separation via frequency bin-wise clustering and permutation alignment. IEEE Trans. Audio Speech Lang. Process. 19(3), 516–527 (2011)

  23. M. Seltzer, D. Yu, Y. Wang, An investigation of deep neural networks for noise robust speech recognition, in Proceedings of IEEE ICASSP-2013 (2013), pp. 7398–7402

  24. M. Souden, J. Chen, J. Benesty, S. Affes, An integrated solution for online multichannel noise tracking and reduction. IEEE Trans. Audio Speech Lang. Process. 19, 2159–2169 (2011)

  25. M. Togami, Y. Kawaguchi, R. Takeda, Y. Obuchi, N. Nukaga, Optimized speech dereverberation from probabilistic perspective for time varying acoustic transfer function. IEEE Trans. Audio Speech Lang. Process. 21(7), 1369–1380 (2013)

  26. O. Yilmaz, S. Rickard, Blind separation of speech mixtures via time-frequency masking. IEEE Trans. Signal Process. 52(7), 1830–1847 (2004)

  27. T. Yoshioka, T. Nakatani, M. Miyoshi, H.G. Okuno, Blind separation and dereverberation of speech mixtures by joint optimization. IEEE Trans. Audio Speech Lang. Process. 19(1), 69–84 (2011)

  28. E. Vincent, H. Sawada, P. Bofill, S. Makino, J. Rosca, First stereo audio source separation evaluation campaign: data, algorithms and results, in Proceedings of International Conference on Independent Component Analysis (ICA) (2007), pp. 552–559


Author information

Correspondence to Tomohiro Nakatani.


Copyright information

© 2015 Springer Science+Business Media New York

About this chapter

Cite this chapter

Iwata, Y., Nakatani, T., Yoshioka, T., Fujimoto, M., Saito, H. (2015). Maximum A Posteriori Spectral Estimation with Source Log-Spectral Priors for Multichannel Speech Enhancement. In: Ogunfunmi, T., Togneri, R., Narasimha, M. (eds) Speech and Audio Processing for Coding, Enhancement and Recognition. Springer, New York, NY. https://doi.org/10.1007/978-1-4939-1456-2_9

  • DOI: https://doi.org/10.1007/978-1-4939-1456-2_9
  • Publisher Name: Springer, New York, NY

  • Print ISBN: 978-1-4939-1455-5

  • Online ISBN: 978-1-4939-1456-2

  • eBook Packages: Engineering (R0)
