Monaural voiced speech segregation based on elaborate harmonic grouping strategies


This paper proposes an enhanced algorithm for monaural voiced speech segregation built on several elaborate harmonic grouping strategies. Its contributions are threefold. First, the algorithm classifies time-frequency (T-F) units as resolved or unresolved by their carrier-to-envelope energy ratio, which yields more accurate classification than cross-channel correlation. Second, resolved T-F units are grouped according to both the harmonic principle and the minimum amplitude principle, the latter of which has been verified to exist in human perception. Third, an “enhanced” envelope autocorrelation function is used to detect amplitude modulation rates, which reduces half-frequency errors in the grouping of unresolved units. Systematic evaluation and comparison show that the proposed algorithm improves separation performance: the signal-to-noise ratio (SNR) is 0.96 dB higher than that of the previous method, and the PESQ score and the subjective perception score also improve.
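As a rough illustration of the third point, the sketch below estimates an amplitude modulation rate from the autocorrelation of a signal's envelope. It is not the authors' method: the abstract does not define the “enhanced” autocorrelation, so this uses a plain autocorrelation of the Hilbert envelope. The function names, the 80 to 400 Hz search range, and the synthetic test signal are all assumptions made for the example.

```python
import numpy as np

def hilbert_envelope(x):
    """Magnitude of the analytic signal, computed with an FFT."""
    n = len(x)
    spectrum = np.fft.fft(x)
    gain = np.zeros(n)
    gain[0] = 1.0                      # keep DC
    if n % 2 == 0:
        gain[n // 2] = 1.0             # keep Nyquist bin
        gain[1:n // 2] = 2.0           # double positive frequencies
    else:
        gain[1:(n + 1) // 2] = 2.0
    return np.abs(np.fft.ifft(spectrum * gain))

def estimate_am_rate(x, fs, f_min=80.0, f_max=400.0):
    """Pick the envelope-autocorrelation peak inside an assumed pitch range."""
    env = hilbert_envelope(x)
    env = env - env.mean()             # remove DC so the peak reflects modulation
    n = len(env)
    # Linear (zero-padded) autocorrelation via the power spectrum.
    power = np.abs(np.fft.rfft(env, 2 * n)) ** 2
    acf = np.fft.irfft(power)[:n]
    lag_min = int(fs / f_max)
    lag_max = min(int(fs / f_min), n - 1)
    best_lag = lag_min + int(np.argmax(acf[lag_min:lag_max + 1]))
    return fs / best_lag

# Hypothetical test signal: a 3 kHz carrier modulated at 100 Hz.
fs = 16000
t = np.arange(8000) / fs
x = (1.0 + 0.8 * np.cos(2 * np.pi * 100 * t)) * np.cos(2 * np.pi * 3000 * t)
rate = estimate_am_rate(x, fs)         # expected to be close to 100 Hz
```

Restricting the lag search to a plausible pitch range is one simple way to avoid picking a peak at twice the true period, which would produce the kind of half-frequency error the abstract mentions.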




Author information

Correspondence to WenJu Liu.

Additional information

LIU WenJu was born in 1960. He received the B.S. and M.S. degrees in mathematics from Peking University and Beijing University of Posts and Telecommunications, and the Ph.D. degree in computer applications from Tsinghua University, Beijing, China, in 1983, 1989 and 1993, respectively. Currently, he is a research professor at the National Key Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China. His research interests include speech recognition, speech synthesis, speaker recognition, keyword spotting, computational auditory scene analysis, speech enhancement, and noise reduction. Dr. Liu Wenju is a member of the Neural Network Committee of China and the Signal Processing Society of the IEEE. He is an editorial board member of the journal of Computer Science Application as well as a reviewer for numerous academic journals, including IEEE Transactions on Audio, Speech, and Language Processing and Cognitive Computation.

JIANG Wei was born in 1982. He received the B.S. degree from Yanshan University in Qinhuangdao, China in 2005 and the M.S. degree from Harbin Institute of Technology in Harbin, China in 2008. He is currently working toward the Ph.D. degree at the Institute of Automation, Chinese Academy of Sciences, Beijing, China. His research interests include speech segregation, computational auditory scene analysis and acoustic properties of speech.

ZHANG XueLiang was born in 1981. He received the B.S. degree from Inner Mongolia University in Hohhot, China in 2003, the M.S. degree from Harbin Institute of Technology in Harbin, China in 2005, and the Ph.D. degree in Pattern Recognition and Intelligent System from the Institute of Automation, Chinese Academy of Sciences, Beijing, China in 2010. Currently, he is a lecturer at the Computer Sciences Department, Inner Mongolia University. His research interests include speech separation, computational auditory scene analysis and speech signal processing. Dr. Zhang Xueliang is a member of the International Speech Communication Association.

Electronic supplementary material

Supplementary material, approximately 2.75 MB.



About this article

Cite this article

Liu, W., Zhang, X., Jiang, W. et al. Monaural voiced speech segregation based on elaborate harmonic grouping strategies. Sci. China Inf. Sci. 54, 2471–2480 (2011) doi:10.1007/s11432-011-4506-2



  • computational auditory scene analysis
  • voiced speech separation
  • harmonistic principle
  • minimum amplitude principle
  • elaborate harmonic grouping strategies