Abstract
This chapter deals with multichannel source separation and denoising based on the sparseness of source signals in the time-frequency domain. In this approach, time-frequency masks are typically estimated by clustering source location features, such as time and level differences between microphones. We describe this approach and its recent advances, and in particular introduce a recently proposed clustering method, observation vector clustering, which has attracted attention for its effectiveness. We present algorithms for observation vector clustering based on a complex Watson mixture model (cWMM), a complex Bingham mixture model (cBMM), and a complex Gaussian mixture model (cGMM), and show through experiments the effectiveness of observation vector clustering in source separation and denoising.
References
P.A. Naylor, N.D. Gaubitch, Speech Dereverberation (Springer, 2009)
M. Brandstein, D. Ward, Microphone Arrays: Signal Processing Techniques and Applications (Springer, 2001)
R. Zelinski, A microphone array with adaptive post-filtering for noise reduction in reverberant rooms, in Proceedings of ICASSP (1988), pp. 2578–2581
S. Gannot, D. Burshtein, E. Weinstein, Signal enhancement using beamforming and nonstationarity with applications to speech. IEEE Trans. SP 49(8), 1614–1626 (2001)
S. Doclo, M. Moonen, GSVD-based optimal filtering for single and multimicrophone speech enhancement. IEEE Trans. SP 50(9), 2230–2244 (2002)
S.F. Boll, Suppression of acoustic noise in speech using spectral subtraction. IEEE Trans. ASSP ASSP-27(2), 113–120 (1979)
Y. Ephraim, D. Malah, Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator. IEEE Trans. ASSP 32(6), 1109–1121 (1984)
R. Miyazaki, H. Saruwatari, T. Inoue, Y. Takahashi, K. Shikano, K. Kondo, Musical-noise-free speech enhancement based on optimized iterative spectral subtraction. IEEE Trans. ASLP 20(7), 2080–2094 (2012)
P. Smaragdis, Probabilistic decompositions of spectra for sound separation, in Blind Speech Separation, ed. by S. Makino, T.-W. Lee, H. Sawada (Springer, 2007), pp. 365–386
Ö. Yılmaz, S. Rickard, Blind separation of speech mixtures via time-frequency masking. IEEE Trans. SP 52(7), 1830–1847 (2004)
S. Araki, H. Sawada, R. Mukai, S. Makino, Underdetermined blind sparse source separation for arbitrarily arranged multiple sensors. Signal Process. 87(8), 1833–1847 (2007)
Y. Izumi, N. Ono, S. Sagayama, Sparseness-based 2ch BSS using the EM algorithm in reverberant environment, in Proceedings of WASPAA (2007), pp. 147–150
H. Sawada, S. Araki, S. Makino, A two-stage frequency-domain blind source separation method for underdetermined convolutive mixtures, in Proceedings of WASPAA (2007), pp. 139–142
M.I. Mandel, R.J. Weiss, D.P.W. Ellis, Model-based expectation-maximization source separation and localization. IEEE Trans. ASLP 18(2), 382–394 (2010)
D.H. Tran Vu, R. Haeb-Umbach, Blind speech separation employing directional statistics in an expectation maximization framework, in Proceedings of ICASSP (2010), pp. 241–244
H. Sawada, S. Araki, S. Makino, Underdetermined convolutive blind source separation via frequency bin-wise clustering and permutation alignment. IEEE Trans. ASLP 19(3), 516–527 (2011)
M. Delcroix, K. Kinoshita, T. Nakatani, S. Araki, A. Ogawa, T. Hori, S. Watanabe, M. Fujimoto, T. Yoshioka, T. Oba, Y. Kubo, M. Souden, S.-J. Hahm, A. Nakamura, Speech recognition in the presence of highly non-stationary noise based on spatial, spectral and temporal speech/noise modeling combined with dynamic variance adaptation, in Proceedings of the CHiME 2011 Workshop on Machine Listening in Multisource Environments (2011), pp. 12–17
M. Souden, S. Araki, K. Kinoshita, T. Nakatani, H. Sawada, A multichannel MMSE-based framework for speech source separation and noise reduction. IEEE Trans. ASLP 21(9), 1913–1928 (2013)
T. Nakatani, S. Araki, T. Yoshioka, M. Delcroix, M. Fujimoto, Dominance based integration of spatial and spectral features for speech enhancement. IEEE Trans. ASLP 21(12), 2516–2531 (2013)
T. Yoshioka, N. Ito, M. Delcroix, A. Ogawa, K. Kinoshita, M. Fujimoto, C. Yu, W.J. Fabian, M. Espi, T. Higuchi, S. Araki, T. Nakatani, The NTT CHiME-3 system: Advances in speech enhancement and recognition for mobile multi-microphone devices, in Proceedings of ASRU (2015), pp. 436–443
Y. Wang, D. Wang, Towards scaling up classification-based speech separation. IEEE Trans. ASLP 21(7), 1381–1390 (2013)
J. Heymann, L. Drude, R. Haeb-Umbach, Neural network based spectral mask estimation for acoustic beamforming, in Proceedings of ICASSP (2016), pp. 196–200
C. Bishop, Pattern Recognition and Machine Learning (Springer, 2006)
N. Murata, S. Ikeda, A. Ziehe, An approach to blind source separation based on temporal structure of speech signals. Neurocomputing 41(1–4), 1–24 (2001)
H. Sawada, R. Mukai, S. Araki, S. Makino, A robust and precise method for solving the permutation problem of frequency-domain blind source separation. IEEE Trans. SAP 12(5), 530–538 (2004)
H. Sawada, S. Araki, S. Makino, Measuring dependence of bin-wise separated signals for permutation alignment in frequency-domain BSS, in Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS) (2007), pp. 3247–3250
K.V. Mardia, I.L. Dryden, The complex Watson distribution and shape analysis. J. Roy. Stat. Soc.: Ser. B (Stat. Methodol.) 61(4), 913–926 (1999)
G. Watson, Equatorial distributions on a sphere. Biometrika 52, 193–201 (1965)
N. Ito, S. Araki, T. Nakatani, Modeling audio directional statistics using a complex Bingham mixture model for blind source extraction from diffuse noise, in Proceedings of ICASSP (2016), pp. 465–468
J.T. Kent, The complex Bingham distribution and shape analysis. J. Roy. Stat. Soc.: Ser. B (Methodol.) 56(2), 285–299 (1994)
C. Bingham, An antipodally symmetric distribution on the sphere. Ann. Stat. 2, 1201–1205 (1974)
N. Ito, S. Araki, T. Yoshioka, T. Nakatani, Relaxed disjointness based clustering for joint blind source separation and dereverberation, in Proceedings of IWAENC (2014), pp. 268–272
N.Q.K. Duong, E. Vincent, R. Gribonval, Under-determined reverberant audio source separation using a full-rank spatial covariance model. IEEE Trans. ASLP 18(7), 1830–1840 (2010)
E. Vincent, R. Gribonval, C. Févotte, Performance measurement in blind audio source separation. IEEE Trans. ASLP 14(4), 1462–1469 (2006)
J. Barker, R. Marxer, E. Vincent, S. Watanabe, The third ‘CHiME’ speech separation and recognition challenge: Dataset, task and baselines, in Proceedings of ASRU (2015), pp. 504–511
N. Ito, S. Araki, T. Nakatani, Permutation-free clustering of relative transfer function features for blind source separation, in Proceedings of EUSIPCO (2015), pp. 409–413
S. Sra, D. Karp, The multivariate Watson distribution: maximum-likelihood estimation and other aspects. J. Multivar. Anal. 114, 256–269 (2013)
K.V. Mardia, P.E. Jupp, Directional Statistics (Wiley, 1999)
Appendices
Appendix 1 Derivation of cWMM-Based Mask Estimation Algorithm
Here we derive the cWMM-based mask estimation algorithm in Sect. 11.3.1. The derivation of the E-step is straightforward and thus omitted. The update rules for the M-step are obtained by maximizing the following Q-function with respect to \(\varTheta _{\text {W},f}\):
Here, \(\mathbf {R}^{(k)}_f\) is defined by
and C denotes a constant independent of \(\varTheta _{\text {W},f}\).
The update rule for \(\alpha ^{(k)}_f\) follows immediately by noting the constraint (11.18) and applying the method of Lagrange multipliers.
The update rule for \(\mathbf {a}^{(k)}_f\) is obtained by maximizing \(Q(\varTheta _{\text {W},f})\) subject to (11.19). Noting (11.26), we see that this is equivalent to maximizing \(\mathbf {a}^{(k)\textsf {H}}_f\mathbf {R}^{(k)}_f\mathbf {a}^{(k)}_f\) subject to (11.19). By standard linear algebra, \(\mathbf {a}^{(k)}_f\) is therefore a unit-norm principal eigenvector of \(\mathbf {R}^{(k)}_f\).
The update rule for \(\kappa ^{(k)}_f\) is obtained by maximizing
Since \(\mathbf {a}^{(k)}_f\) is a unit-norm principal eigenvector of \(\mathbf {R}^{(k)}_f\), we have
where \(\lambda ^{(k)}_f\) is the principal eigenvalue of \(\mathbf {R}^{(k)}_f\). Therefore, we have the following nonlinear equation for \(\kappa ^{(k)}_{f}\):
Using (3.8) in [37], (11.58) can be solved approximately as follows:
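To make the algorithm concrete, here is a minimal sketch of the cWMM M-step for a single frequency bin in Python/NumPy. The variable names (`Z` for the unit-norm observation vectors, `lam` for the E-step posteriors) are ours, not the chapter's notation, and instead of the closed-form approximation the sketch solves the concentration equation numerically by bisection, using the fact that for the complex Watson distribution it takes the form \(M'(\kappa)/M(\kappa)=\lambda^{(k)}_f\) with \(M(\kappa)={}_1F_1(1;M;\kappa)\).

```python
import numpy as np

def kummer_ratio(kappa, M, terms=500):
    """M'(kappa) / M(kappa) with M(kappa) = 1F1(1; M; kappa),
    computed from the series 1F1(1; M; kappa) = sum_j kappa^j / (M)_j."""
    s, ds = 0.0, 0.0
    t = 1.0                      # t_j = kappa^j / (M)_j, starting at j = 0
    for j in range(terms):
        s += t
        if j >= 1:
            ds += j * t / kappa  # derivative of kappa^j / (M)_j
        t *= kappa / (M + j)
    return ds / s                # increases from 1/M (kappa -> 0) towards 1

def watson_kappa(lam_max, M, lo=1e-6, hi=100.0, iters=60):
    """Solve M'(kappa)/M(kappa) = lam_max for kappa by bisection."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if kummer_ratio(mid, M) < lam_max:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def cwmm_mstep(Z, lam):
    """One cWMM M-step for a single frequency bin.
    Z:   (T, M) unit-norm observation vectors
    lam: (T, K) E-step posteriors (time-frequency masks)
    Returns mixture weights alpha, mean directions A, concentrations kap."""
    T, M = Z.shape
    K = lam.shape[1]
    alpha = lam.mean(axis=0)
    A = np.zeros((K, M), dtype=complex)
    kap = np.zeros(K)
    for k in range(K):
        w = lam[:, k]
        # weighted correlation matrix R = sum_t w_t z_t z_t^H / sum_t w_t
        R = (Z.T * w) @ Z.conj() / w.sum()
        evals, evecs = np.linalg.eigh(R)
        A[k] = evecs[:, -1]                      # unit-norm principal eigenvector
        lam_max = min(evals[-1].real, 0.999)     # keep away from the kappa -> inf regime
        kap[k] = watson_kappa(max(lam_max, 1.0 / M + 1e-6), M)
    return alpha, A, kap
```

In a full EM implementation this M-step alternates with the E-step, which recomputes the posteriors `lam` from the updated parameters.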
Appendix 2 Derivation of cBMM-Based Mask Estimation Algorithm
Here we derive the cBMM-based mask estimation algorithm in Sect. 11.3.2. The update rule for the E-step is straightforward. The update rules for the M-step are obtained by maximizing the following Q-function with respect to \(\varTheta _{\text {B},f}\):
Here, \(c(\mathbf {B})\) is defined by (11.36), and \(\mathbf {R}^{(k)}_f\) by (11.55).
The update rule for \(\alpha ^{(k)}_f\) is obvious.
To derive the update rule for \(\mathbf {B}^{(k)}_{f}\), let us denote the mth largest eigenvalue of \(\mathbf {R}^{(k)}_{f}\) by \(\lambda ^{(k)}_{fm}\) and a corresponding unit-norm eigenvector by \(\mathbf {v}^{(k)}_{fm}\). We assume that \(\lambda ^{(k)}_{fm},\) \(m=1,\dots ,M,\) are all distinct and positive, which virtually always holds in practice. Then \(\mathbf {R}^{(k)}_{f}\) can be written as
From a result in [38], \(\mathbf {v}^{(k)}_{fm},\) \(m=1,\dots ,M\), are also eigenvectors of \(\mathbf {B}^{(k)}_{f}\). Hence, \(\mathbf {B}^{(k)}_{f}\) can be written in the form
Substituting (11.63) and (11.64) into (11.62) and disregarding terms independent of \(\beta ^{(k)}_{fm}, m=1,\dots ,M,\) we have
Therefore, we have
Using an approximation in [38], this nonlinear equation can be approximately solved as follows:
Substituting (11.67) into (11.64) and adding a matrix of the form \(\xi \mathbf {I}\) so that the largest eigenvalue of \(\mathbf {B}^{(k)}_{f}\) is zero, we obtain the following update rule for \(\mathbf {B}^{(k)}_{f}\):
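As an illustration, the resulting update can be sketched as follows in Python/NumPy. The naming is ours (`Z` for the unit-norm observation vectors, `gamma` for the posteriors of one class), and the sketch uses the high-concentration approximation \(\beta^{(k)}_{fm}\approx -1/\lambda^{(k)}_{fm}\) before shifting the spectrum so that the largest eigenvalue of \(\mathbf{B}^{(k)}_{f}\) is zero.

```python
import numpy as np

def cbmm_update_B(Z, gamma):
    """One M-step update of the cBMM parameter matrix B for one class.
    Z:     (T, M) unit-norm observation vectors
    gamma: (T,)   E-step posteriors for this class"""
    w = gamma / gamma.sum()
    # weighted correlation matrix R = sum_t w_t z_t z_t^H
    R = (Z.T * w) @ Z.conj()
    evals, evecs = np.linalg.eigh(R)             # ascending eigenvalues
    evals = np.clip(evals.real, 1e-8, None)      # guard against zero eigenvalues
    beta = -1.0 / evals                          # high-concentration approximation
    beta -= beta.max()                           # largest eigenvalue of B -> 0
    return (evecs * beta) @ evecs.conj().T       # B = sum_m beta_m v_m v_m^H
```

Shifting \(\mathbf{B}\) by \(\xi\mathbf{I}\) leaves the cBMM likelihood unchanged for unit-norm \(\mathbf{z}\), since \(\mathbf{z}^{\mathsf H}(\mathbf{B}+\xi\mathbf{I})\mathbf{z}=\mathbf{z}^{\mathsf H}\mathbf{B}\mathbf{z}+\xi\) and the normalizing constant absorbs the offset; fixing the largest eigenvalue at zero simply removes this ambiguity.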
Appendix 3 Derivation of cGMM-Based Mask Estimation Algorithm
Here we derive the cGMM-based mask estimation algorithm in Sect. 11.3.3. The derivation of the E-step is straightforward and thus omitted. The update rules for the M-step are obtained by maximizing the following Q-function with respect to \(\varTheta _{\text {G},f}\):
Here, C denotes a constant independent of \(\varTheta _{\text {G},f}\).
The update rule for \(\alpha ^{(k)}_f\) is obvious.
From (11.70), the update rule for \(\phi ^{(k)}_{tf}\) is given by
As for \(\mathbf {B}^{(k)}_{f}\), it should satisfy
Therefore, the update rule for \(\mathbf {B}^{(k)}_{f}\) is
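As a sketch, the two updates can be alternated as follows in Python/NumPy (our own naming; `Z` holds the observation vectors and `gamma` the posteriors for one class, and the code assumes the usual cGMM form in which \(\mathbf{z}_{tf}\) is zero-mean complex Gaussian with covariance \(\phi^{(k)}_{tf}\mathbf{B}^{(k)}_{f}\)).

```python
import numpy as np

def cgmm_mstep(Z, gamma, B, n_inner=1):
    """Alternate the phi and B updates of the cGMM M-step for one class.
    Z:     (T, M) observation vectors
    gamma: (T,)   E-step posteriors for this class
    B:     (M, M) current estimate of the class covariance shape"""
    T, M = Z.shape
    for _ in range(n_inner):
        Binv = np.linalg.inv(B)
        # phi_t = (1/M) z_t^H B^{-1} z_t
        phi = np.einsum('tm,mn,tn->t', Z.conj(), Binv, Z).real / M
        phi = np.maximum(phi, 1e-10)     # numerical floor
        # B = sum_t gamma_t z_t z_t^H / phi_t / sum_t gamma_t
        w = gamma / phi
        B = (Z.T * w) @ Z.conj() / gamma.sum()
    return phi, B
```

Since \(\phi^{(k)}_{tf}\) and \(\mathbf{B}^{(k)}_{f}\) are identifiable only up to a common scale, implementations often normalize \(\mathbf{B}^{(k)}_{f}\) (e.g., to unit trace) after each update.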
Copyright information
© 2018 Springer International Publishing AG
Cite this chapter
Ito, N., Araki, S., Nakatani, T. (2018). Recent Advances in Multichannel Source Separation and Denoising Based on Source Sparseness. In: Makino, S. (eds) Audio Source Separation. Signals and Communication Technology. Springer, Cham. https://doi.org/10.1007/978-3-319-73031-8_11
Print ISBN: 978-3-319-73030-1
Online ISBN: 978-3-319-73031-8