Abstract
Many traditional speech enhancement methods reduce noise in corrupted speech by processing the magnitude spectrum within a short-time Fourier analysis-modification-synthesis (AMS) framework. More recently, the use of the modulation domain for speech processing has been investigated; however, early efforts in this direction did not account for the changing properties of the modulation spectrum across time. Motivated by this and by evidence of the significance of the modulation domain, we investigated the processing of the modulation spectrum on a short-time basis for speech enhancement. For this purpose, a modulation domain based AMS framework was used, in which the time trajectory of each acoustic frequency bin was processed frame-wise in a secondary AMS framework. A number of enhancement algorithms were investigated in the short-time modulation domain, including spectral subtraction and MMSE magnitude estimation. In each case, the respective algorithm was used to modify the short-time modulation magnitude spectrum within the modulation AMS framework. Here we review the findings of this investigation, comparing the quality of stimuli enhanced using these modulation based approaches with that of stimuli enhanced using the corresponding modification algorithms applied in the acoustic domain. The results show that modulation domain based approaches give improved quality compared to their acoustic domain counterparts. Further, MMSE modulation magnitude estimation (MME) is shown to give better speech quality than modulation spectral subtraction (ModSSub). MME stimuli are found to have good removal of noise without the introduction of musical noise, which is problematic in spectral subtraction based enhancement. The results also show that ModSSub has minimal musical noise compared to acoustic spectral subtraction when the modulation frame duration is appropriately selected.
For modulation domain based methods, modulation frame duration is shown to be an important parameter, with quality generally improved by the use of shorter frame durations. From the results of the experiments conducted, it is concluded that the short-time modulation domain provides an effective alternative to the short-time acoustic domain for speech processing and that, in this domain, MME provides effective noise suppression without introducing musical noise distortion.
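The dual AMS idea described above can be illustrated with a minimal NumPy sketch. It is not the chapter's implementation: the Hamming window, the frame lengths and shifts, the noise estimate taken from the first few modulation frames (which assumes the recording starts with noise only), and the over-subtraction rule with parameters `alpha` and `beta` are all assumptions chosen for illustration, as are the function names.

```python
import numpy as np

def stft(x, frame_len, hop):
    """Analysis: Hamming-windowed frames -> one-sided FFT, shape (frames, bins)."""
    win = np.hamming(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] for i in range(n_frames)])
    return np.fft.rfft(frames * win, axis=1)

def istft(spec, frame_len, hop):
    """Synthesis: inverse FFT, then weighted overlap-add normalised by sum of win^2."""
    win = np.hamming(frame_len)
    frames = np.fft.irfft(spec, n=frame_len, axis=1) * win
    out = np.zeros((len(frames) - 1) * hop + frame_len)
    norm = np.zeros_like(out)
    for i, frame in enumerate(frames):
        out[i * hop:i * hop + frame_len] += frame
        norm[i * hop:i * hop + frame_len] += win ** 2
    return out / norm  # Hamming is never zero, so norm > 0 everywhere

def modulation_ssub(noisy, a_len=256, a_hop=64, m_len=32, m_hop=4,
                    noise_frames=2, alpha=1.0, beta=0.002):
    """Spectral subtraction applied to the short-time modulation magnitude spectrum.

    a_len/a_hop are in samples; m_len/m_hop are in acoustic frames.
    All default values are illustrative assumptions.
    """
    X = stft(noisy, a_len, a_hop)                 # primary (acoustic) AMS analysis
    mag, phase = np.abs(X), np.angle(X)
    trajectories = []
    for k in range(mag.shape[1]):                 # one trajectory per acoustic bin
        M = stft(mag[:, k], m_len, m_hop)         # secondary (modulation) analysis
        m_mag, m_phase = np.abs(M), np.angle(M)
        # Assumption: the first modulation frames contain noise only.
        noise_pow = np.mean(m_mag[:noise_frames] ** 2, axis=0)
        # Generic over-subtraction with a spectral floor (assumed rule).
        est_pow = np.maximum(m_mag ** 2 - alpha * noise_pow, beta * noise_pow)
        M_hat = np.sqrt(est_pow) * np.exp(1j * m_phase)   # keep modulation phase
        traj = istft(M_hat, m_len, m_hop)         # secondary synthesis
        trajectories.append(np.maximum(traj, 0.0))  # magnitudes cannot be negative
    new_mag = np.stack(trajectories, axis=1)
    T = new_mag.shape[0]                          # may be slightly shorter than mag
    Y = new_mag * np.exp(1j * phase[:T])          # reuse the noisy acoustic phase
    return istft(Y, a_len, a_hop)                 # primary synthesis
```

Calling `modulation_ssub(noisy)` on a 1-D float array returns an enhanced signal that may be a little shorter than the input, because partial frames at the end are dropped. With `alpha=0` and `beta=0` no modification takes place and the function reduces to a plain analysis-synthesis pass, which is a convenient sanity check for the framework itself.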
Notes
1. Note that for references made to the magnitude, phase or complex spectra throughout this text, the STFT modifier is implied unless otherwise stated. The acoustic and modulation modifiers are also included to disambiguate between acoustic and modulation domains.
Copyright information
© 2015 Springer Science+Business Media New York
About this chapter
Cite this chapter
Paliwal, K., Schwerin, B. (2015). Modulation Processing for Speech Enhancement. In: Ogunfunmi, T., Togneri, R., Narasimha, M. (eds) Speech and Audio Processing for Coding, Enhancement and Recognition. Springer, New York, NY. https://doi.org/10.1007/978-1-4939-1456-2_10
DOI: https://doi.org/10.1007/978-1-4939-1456-2_10
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4939-1455-5
Online ISBN: 978-1-4939-1456-2
eBook Packages: Engineering, Engineering (R0)