Histogram equalization of contextual statistics of speech features for robust speech recognition

Multimedia Tools and Applications

Abstract

In the recent past, we have witnessed a flurry of research activity aimed at developing novel and ingenious robustness methods for automatic speech recognition (ASR). Among them, histogram equalization (HEQ) of speech features constitutes one of the most prominent and successful lines of research, owing to its inherently neat formulation and remarkable performance. In this paper, we adopt an effective modeling framework for joint equalization of the spatial-temporal contextual statistics of speech features. On top of that, we explore various combinations of simple differencing and averaging operations to capture the contextual relationships among feature vector components, not only across different dimensions but also across consecutive speech frames, within the HEQ process. Furthermore, several variants of HEQ are investigated and integrated into the proposed framework to efficiently compensate for the effects of noise interference on the feature vector components. In addition, the utility of the methods derived from this framework is analyzed and compared extensively with that of several existing robustness methods. All experiments were carried out on the Aurora-2 database and task, and further verified on the Aurora-4 database and task. Experimental results suggest that our proposed methods offer substantial improvements over the baseline system and achieve performance competitive with, or better than, several existing noise robustness methods for speech recognition.
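
To make the core idea concrete, the sketch below illustrates utterance-level histogram equalization of speech features combined with simple averaging and differencing operations over neighboring frames. It is a minimal illustration only: the standard-normal reference distribution, the ±2-frame context span, and the helper names (histogram_equalize, add_simple_context) are assumptions made for exposition, not the exact configuration studied in the paper.

```python
import numpy as np
from scipy.stats import norm

def histogram_equalize(features):
    """Utterance-level HEQ: map each feature dimension to a standard-normal
    reference distribution by rank-based CDF matching."""
    T, _ = features.shape
    equalized = np.empty_like(features, dtype=float)
    for d in range(features.shape[1]):
        ranks = np.argsort(np.argsort(features[:, d]))  # rank of each frame, 0 .. T-1
        cdf = (ranks + 0.5) / T                         # empirical CDF, kept away from 0 and 1
        equalized[:, d] = norm.ppf(cdf)                 # inverse Gaussian CDF
    return equalized

def add_simple_context(features, span=2):
    """Append averaged and differenced versions of frames +/- `span` apart;
    a stand-in for the contextual operations explored in the paper."""
    fwd = np.roll(features, -span, axis=0)
    bwd = np.roll(features, span, axis=0)
    avg = 0.5 * (fwd + bwd)
    diff = fwd - bwd
    return np.hstack([features, avg, diff])

# Example: 200 frames of 13-dimensional MFCC-like features (random stand-in data).
mfcc = np.random.randn(200, 13)
contextual = add_simple_context(mfcc)
normalized = histogram_equalize(contextual)
```

In a practical front end the equalization statistics could also be estimated from training data rather than per utterance; the per-utterance form above is simply the easiest variant to state.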

Acknowledgments

This work is supported in part by the “Aim for the Top University Project” of National Taiwan Normal University (NTNU), sponsored by the Ministry of Education, Taiwan, and in part by the National Science Council, Taiwan, under Grants NSC 101-2221-E-003-024-MY3, NSC 102-2221-E-003-014-, NSC 101-2511-S-003-057-MY3, NSC 101-2511-S-003-047-MY3 and NSC 103-2911-I-003-301.

Author information

Corresponding author

Correspondence to Berlin Chen.

Cite this article

Hsieh, H.J., Chen, B. & Hung, J.W. Histogram equalization of contextual statistics of speech features for robust speech recognition. Multimed Tools Appl 74, 6769–6795 (2015). https://doi.org/10.1007/s11042-014-1929-y
