Whispered Speech Enhancement Based on Improved Mel Frequency Scale and Modified Compensated Phase Spectrum

  • Yi Wei
  • Chen Li (corresponding author)
  • Tianfeng Li
  • Yumin Zeng


This paper investigates whispered speech enhancement based on a novel improved Mel frequency scale, which is derived from the characteristics of whispered speech. During synthesis, the whispered speech magnitude spectrum is recombined with a modified phase spectrum rather than with the phase spectrum of the noisy whispered speech. The phase correction matters because low-energy components of the new complex spectrum are cancelled more strongly than high-energy components, which removes as much background noise as possible. Moreover, the noise estimation parameter in the compensated phase is obtained by a new method. The algorithm seeks a trade-off among whispered speech distortion, noise reduction and the level of residual musical noise. Objective and subjective evaluations show that the proposed algorithm outperforms comparable whispered speech enhancement algorithms.
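The cancellation effect described above can be illustrated with a minimal sketch of generic phase spectrum compensation, in the style of Stark, Wójcicki and Lyons (2008). This is not the modified compensation or the noise estimation method proposed in this paper, and it does not include the improved Mel frequency scale. The function name `psc_frame`, the parameter `lam` and its default value, and the assumption that a noise magnitude estimate `noise_mag` is already available are all illustrative choices, not details taken from the paper.

```python
import numpy as np

def psc_frame(noisy_frame, noise_mag, lam=3.74):
    """Generic phase spectrum compensation for one real-valued frame.

    noisy_frame: 1-D real array (one analysis frame, already windowed).
    noise_mag:   assumed estimate of the noise magnitude per FFT bin.
    lam:         assumed compensation strength (hypothetical default).
    """
    n = len(noisy_frame)
    spectrum = np.fft.fft(noisy_frame)

    # Antisymmetric weighting: +1 on positive-frequency bins,
    # -1 on negative-frequency bins, 0 at DC (and at Nyquist for even n).
    psi = np.zeros(n)
    psi[1:(n + 1) // 2] = 1.0
    psi[n // 2 + 1:] = -1.0

    # Add a real, antisymmetric offset scaled by the noise estimate,
    # then keep only the phase of the result.
    compensated = spectrum + lam * psi * noise_mag
    new_phase = np.angle(compensated)

    # Recombine the unchanged noisy magnitude with the compensated phase.
    modified = np.abs(spectrum) * np.exp(1j * new_phase)

    # The modified spectrum is no longer conjugate symmetric; taking the
    # real part makes bins whose magnitude is small relative to the
    # offset cancel against their mirrored bins.
    return np.real(np.fft.ifft(modified))
```

In a full enhancement chain, a function like this would be applied frame by frame to a windowed short-time analysis of the noisy signal and followed by overlap-add synthesis. A component whose magnitude is small relative to `lam * noise_mag` is almost fully cancelled, while a component much stronger than the offset passes nearly unchanged.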


Whispered speech enhancement; Mel frequency scale; Modified phase; Noise estimation



This project was supported by the National Key Research and Development Program of China (Grant No. 2017YFB0503500), the Natural Science Foundation of Jiangsu Province (Grant No. BK20171031) and the Key Laboratory of Virtual Geographic Environment (Nanjing Normal University), Ministry of Education (Grant No. 2017VGE01).

Supplementary material

Supplementary material 1: 34_2019_1164_MOESM1_ESM.mp4 (MP4, 19 kb)
Supplementary material 2: 34_2019_1164_MOESM2_ESM.mp4 (MP4, 34 kb)
Supplementary material 3: 34_2019_1164_MOESM3_ESM.mp4 (MP4, 20 kb)
Supplementary material 4: 34_2019_1164_MOESM4_ESM.mp4 (MP4, 32 kb)
Supplementary material 5: 34_2019_1164_MOESM5_ESM.mp4 (MP4, 19 kb)
Supplementary material 6: 34_2019_1164_MOESM6_ESM.mp4 (MP4, 19 kb)
Supplementary material 7: 34_2019_1164_MOESM7_ESM.mp4 (MP4, 19 kb)
Supplementary material 8: 34_2019_1164_MOESM8_ESM.mp4 (MP4, 19 kb)
Supplementary material 9: 34_2019_1164_MOESM9_ESM.mp4 (MP4, 34 kb)
Supplementary material 10: 34_2019_1164_MOESM10_ESM.mp4 (MP4, 34 kb)
Supplementary material 11: 34_2019_1164_MOESM11_ESM.mp4 (MP4, 34 kb)
Supplementary material 12: 34_2019_1164_MOESM12_ESM.mp4 (MP4, 34 kb)
Supplementary material 13: 34_2019_1164_MOESM13_ESM.mp4 (MP4, 34 kb)
Supplementary material 14: 34_2019_1164_MOESM14_ESM.mp4 (MP4, 34 kb)
Supplementary material 15: 34_2019_1164_MOESM15_ESM.mp4 (MP4, 34 kb)
Supplementary material 16: 34_2019_1164_MOESM16_ESM.mp4 (MP4, 34 kb)
Supplementary material 17: 34_2019_1164_MOESM17_ESM.mp4 (MP4, 20 kb)
Supplementary material 18: 34_2019_1164_MOESM18_ESM.mp4 (MP4, 20 kb)
Supplementary material 19: 34_2019_1164_MOESM19_ESM.mp4 (MP4, 20 kb)
Supplementary material 20: 34_2019_1164_MOESM20_ESM.mp4 (MP4, 20 kb)
Supplementary material 21: 34_2019_1164_MOESM21_ESM.mp4 (MP4, 32 kb)
Supplementary material 22: 34_2019_1164_MOESM22_ESM.mp4 (MP4, 32 kb)
Supplementary material 23: 34_2019_1164_MOESM23_ESM.mp4 (MP4, 32 kb)
Supplementary material 24: 34_2019_1164_MOESM24_ESM.mp4 (MP4, 32 kb)



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. School of Physics and Technology, Nanjing Normal University, Nanjing, China
  2. Key Laboratory of Virtual Geographic Environment (Nanjing Normal University), Ministry of Education, Nanjing, China
  3. State Key Laboratory Cultivation Base of Geographical Environment Evolution (Jiangsu Province), Nanjing, China
  4. Jiangsu Center for Collaborative Innovation in Geographical Information Resource Development and Application, Nanjing, China
