In this paper, an enhanced algorithm for monaural voiced speech segregation is proposed, based on several elaborate harmonic grouping strategies. The main contributions of the proposed algorithm lie in three aspects. First, the algorithm classifies time-frequency (T-F) units as resolved or unresolved using the carrier-to-envelope energy ratio, which gives more accurate classification than cross-channel correlation. Second, resolved T-F units are grouped according to the minimum amplitude principle, which has been verified to exist in human perception, as well as the harmonic principle. Finally, an "enhanced" envelope autocorrelation function is employed to detect amplitude modulation rates, which substantially reduces half-frequency errors in the grouping of unresolved units. Systematic evaluation and comparison show that separation performance is greatly improved by the proposed algorithm. Specifically, the signal-to-noise ratio (SNR) is improved by 0.96 dB compared with that of a previous method. In addition, the algorithm improves both the PESQ score and the subjective perception score.
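The abstract does not define the "enhanced" envelope autocorrelation function. As a minimal sketch only, the following Python code assumes it resembles the enhanced autocorrelation of Tolonen and Karjalainen: clip the autocorrelation to positive values, subtract a copy stretched by a factor of two along the lag axis, and clip again. This suppresses the peak at twice the true period, which is the peak responsible for half-frequency errors. All function names, the 8 kHz sampling rate, and the lag search range are illustrative assumptions, not the authors' implementation.

```python
import math


def autocorrelation(x, max_lag):
    """Biased autocorrelation of x for lags 0..max_lag."""
    n = len(x)
    return [sum(x[t] * x[t + lag] for t in range(n - lag))
            for lag in range(max_lag + 1)]


def enhance(acf):
    """Clip to positive values, subtract a lag-stretched copy, clip again.

    Assumed form of the "enhancement" (not taken from the paper): the
    stretched copy at lag k equals the clipped ACF at lag k / 2, obtained
    by linear interpolation for odd k.
    """
    clipped = [max(v, 0.0) for v in acf]
    enhanced = []
    for lag, value in enumerate(clipped):
        lo = lag // 2
        hi = min(lo + 1, len(clipped) - 1)
        frac = lag / 2.0 - lo
        stretched = clipped[lo] * (1.0 - frac) + clipped[hi] * frac
        enhanced.append(max(value - stretched, 0.0))
    return enhanced


def modulation_rate(envelope, fs, min_lag, max_lag):
    """Estimate the amplitude modulation rate (Hz) of an envelope.

    The envelope is mean-removed, its enhanced autocorrelation is computed,
    and the lag with the largest value in [min_lag, max_lag] is converted
    to a rate.
    """
    mean = sum(envelope) / len(envelope)
    centered = [v - mean for v in envelope]
    enh = enhance(autocorrelation(centered, max_lag))
    best_lag = max(range(min_lag, max_lag + 1), key=lambda lag: enh[lag])
    return fs / best_lag


if __name__ == "__main__":
    fs = 8000  # assumed sampling rate in Hz
    # Synthetic envelope modulated at 100 Hz (period of 80 samples).
    env = [1.0 + math.cos(2.0 * math.pi * 100.0 * t / fs) for t in range(800)]
    print(modulation_rate(env, fs, min_lag=20, max_lag=200))
```

On this synthetic envelope the plain autocorrelation still has a strong peak at lag 160 (half the true rate), whereas the enhanced version is zero there, so the peak picker has only the 80-sample lag to choose from in the search range.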
LIU WenJu was born in 1960. He received the B.S. degree in mathematics from Peking University, the M.S. degree in mathematics from Beijing University of Posts and Telecommunications, and the Ph.D. degree in computer applications from Tsinghua University, Beijing, China, in 1983, 1989 and 1993, respectively. Currently, he is a research professor at the National Key Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China. His research interests include speech recognition, speech synthesis, speaker recognition, keyword spotting, computational auditory scene analysis, speech enhancement, and noise reduction. Dr. Liu Wenju is a member of the Neural Network Committee of China and the Signal Processing Society of the IEEE. He is an editorial board member of the journal of Computer Science Application as well as a reviewer for numerous academic journals such as IEEE Transactions on Audio, Speech, and Language Processing and Cognitive Computation.
JIANG Wei was born in 1982. He received the B.S. degree from Yanshan University, Qinhuangdao, China, in 2005 and the M.S. degree from Harbin Institute of Technology, Harbin, China, in 2008. He is currently working toward the Ph.D. degree at the Institute of Automation, Chinese Academy of Sciences, Beijing, China. His research interests include speech segregation, computational auditory scene analysis, and acoustic properties of speech.
ZHANG XueLiang was born in 1981. He received the B.S. degree from Inner Mongolia University, Hohhot, China, in 2003, the M.S. degree from Harbin Institute of Technology, Harbin, China, in 2005, and the Ph.D. degree in Pattern Recognition and Intelligent System from the Institute of Automation, Chinese Academy of Sciences, Beijing, China, in 2010. Currently, he is a lecturer at the Computer Sciences Department, Inner Mongolia University. His research interests include speech separation, computational auditory scene analysis, and speech signal processing. Dr. Zhang Xueliang is a member of the International Speech Communication Association.
Cite this article
Liu, W., Zhang, X., Jiang, W. et al. Monaural voiced speech segregation based on elaborate harmonic grouping strategies. Sci. China Inf. Sci. 54, 2471–2480 (2011) doi:10.1007/s11432-011-4506-2
- computational auditory scene analysis
- voiced speech separation
- harmonistic principle
- minimum amplitude principle
- elaborate harmonic grouping strategies