Histogram equalization of contextual statistics of speech features for robust speech recognition

Hsieh, Hsin-Ju; Chen, Berlin; Hung, Jeih-weih

doi:10.1007/s11042-014-1929-y

Published: 08 March 2014

Multimedia Tools and Applications, Volume 74, pages 6769–6795 (2015)

Hsin-Ju Hsieh¹,²,
Berlin Chen¹ &
Jeih-weih Hung²

Abstract

In the recent past, we have witnessed a flurry of research activity aimed at developing novel robustness methods for automatic speech recognition (ASR). Among them, histogram equalization (HEQ) of speech features constitutes one of the most prominent and successful lines of research, owing to its inherently neat formulation and remarkable performance. In this paper, we adopt an effective modeling framework for joint equalization of spatial-temporal contextual statistics of speech features. On top of that, we explore various combinations of simple differencing and averaging operations to render the contextual relationships of feature vector components, not only between different dimensions but also between consecutive speech frames, in the HEQ process. Furthermore, several variants of HEQ are investigated and integrated into the proposed modeling framework to efficiently compensate for the effects of noise interference on the feature vector components. In addition, the utility of the methods derived from this framework and of several existing robustness methods is analyzed and compared extensively. All experiments were carried out on the Aurora-2 database and task, and were further verified on the Aurora-4 database and task. Experimental results suggest that our proposed methods can offer substantial improvements over the baseline system and achieve performance competitive with or better than some of the existing noise robustness methods in speech recognition.
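
As a rough illustration of the two ingredients named in the abstract, the following Python sketch shows a generic rank-based HEQ that maps each feature dimension onto a standard normal reference, together with one simple averaging and differencing of consecutive frames. It is a minimal sketch and not the paper's formulation: the function names `heq_to_gaussian` and `temporal_context`, the choice of a Gaussian reference distribution, and the 1/2 scaling are all assumptions made for this example.

```python
import numpy as np
from statistics import NormalDist


def heq_to_gaussian(features):
    """Generic rank-based HEQ (illustrative, not the authors' variant).

    features: array of shape (num_frames, num_dims), e.g. cepstral features
    of one utterance. Each dimension is equalized independently so that its
    empirical distribution matches a standard normal reference (an assumed
    choice of reference for this sketch).
    """
    features = np.asarray(features, dtype=float)
    num_frames = features.shape[0]
    # Rank of every frame within its own dimension: 0 .. num_frames - 1.
    ranks = features.argsort(axis=0).argsort(axis=0)
    # Empirical CDF value, kept strictly inside (0, 1).
    cdf = (ranks + 0.5) / num_frames
    # Inverse CDF of the reference distribution, applied element-wise.
    inverse_reference_cdf = np.vectorize(NormalDist().inv_cdf)
    return inverse_reference_cdf(cdf)


def temporal_context(features):
    """Average and difference of each frame with its predecessor.

    The first frame is paired with itself. The 1/2 scaling is an assumption
    chosen so that avg + diff reconstructs the input exactly.
    """
    features = np.asarray(features, dtype=float)
    previous = np.vstack([features[:1], features[:-1]])
    avg = 0.5 * (features + previous)
    diff = 0.5 * (features - previous)
    return avg, diff


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    noisy = rng.exponential(size=(200, 3))  # synthetic stand-in for features
    avg, diff = temporal_context(noisy)
    # Equalize the contextual statistics separately, then recombine.
    restored = heq_to_gaussian(avg) + heq_to_gaussian(diff)
    print(restored.shape)
```

In this sketch the averaged and differenced streams are equalized separately and then summed; other ways of combining them are equally plausible, and the source does not specify which one the authors use.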

Acknowledgments

This work is supported in part by the “Aim for the Top University Project” of National Taiwan Normal University (NTNU), sponsored by the Ministry of Education, Taiwan, and in part by the National Science Council, Taiwan, under Grants NSC 101-2221-E-003-024-MY3, NSC 102-2221-E-003-014-, NSC 101-2511-S-003-057-MY3, NSC 101-2511-S-003-047-MY3 and NSC 103-2911-I-003-301.

Author information

Authors and Affiliations

Department of Computer Science & Information Engineering, National Taiwan Normal University, Taipei, Taiwan
Hsin-Ju Hsieh & Berlin Chen
Department of Electrical Engineering, National Chi Nan University, Nantou, Taiwan
Hsin-Ju Hsieh & Jeih-weih Hung

Corresponding author

Correspondence to Berlin Chen.

About this article

Cite this article

Hsieh, HJ., Chen, B. & Hung, Jw. Histogram equalization of contextual statistics of speech features for robust speech recognition. Multimed Tools Appl 74, 6769–6795 (2015). https://doi.org/10.1007/s11042-014-1929-y

Published: 08 March 2014
Issue Date: September 2015
DOI: https://doi.org/10.1007/s11042-014-1929-y
