Abstract
In the recent past, we have witnessed a flurry of research activity aimed at the development of novel and ingenious robustness methods for automatic speech recognition (ASR). Among them, histogram equalization (HEQ) of speech features constitutes one of the most prominent and successful lines of research, owing to its neat formulation and remarkable performance. In this paper, we adopt an effective modeling framework for joint equalization of spatial-temporal contextual statistics of speech features. On top of that, we explore various combinations of simple differencing and averaging operations to capture the contextual relationships of feature vector components, not only between different dimensions but also between consecutive speech frames, in the HEQ process. Furthermore, several variants of HEQ are investigated and integrated into the proposed modeling framework to efficiently compensate for the effects of noise interference on the feature vector components. In addition, the utility of the methods derived from this framework and of several existing robustness methods is analyzed and compared extensively. All experiments were carried out on the Aurora-2 database and task, and were further verified on the Aurora-4 database and task. Experimental results suggest that our proposed methods offer substantial improvements over the baseline system and achieve performance competitive with or better than some existing noise robustness methods in speech recognition.
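As a rough illustration of the ideas named in the abstract, the sketch below shows generic rank-based HEQ of a single feature dimension against a standard normal reference, together with simple differencing and averaging between consecutive frames. It is a minimal sketch under stated assumptions, not the framework proposed in the paper: the function names `heq_to_gaussian` and `temporal_context`, the Gaussian reference, and the particular difference and average definitions are all hypothetical choices made for illustration.

```python
from statistics import NormalDist


def heq_to_gaussian(values):
    """Equalize one feature dimension of an utterance to a standard normal.

    Assumption: the reference distribution is N(0, 1) and the test CDF is
    estimated from the ranks of the values within the utterance.
    """
    n = len(values)
    ref = NormalDist(0.0, 1.0)
    # Indices of the frames sorted by feature value (rank 0 = smallest).
    order = sorted(range(n), key=lambda i: values[i])
    out = [0.0] * n
    for rank, i in enumerate(order):
        # (rank + 0.5) / n keeps the CDF estimate strictly inside (0, 1),
        # so the inverse reference CDF is always defined.
        out[i] = ref.inv_cdf((rank + 0.5) / n)
    return out


def temporal_context(values):
    """Differences and averages of consecutive frames (illustrative only)."""
    pairs = list(zip(values, values[1:]))
    diffs = [b - a for a, b in pairs]
    avgs = [(a + b) / 2 for a, b in pairs]
    return diffs, avgs


if __name__ == "__main__":
    frames = [3.0, 1.0, 2.0]          # toy values for one cepstral dimension
    print(heq_to_gaussian(frames))    # equalized values, original order kept
    print(temporal_context(frames))   # (differences, averages)
```

Because the mapping depends only on ranks, it is monotonic: the ordering of the frames is preserved while their distribution is reshaped toward the reference.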
Acknowledgments
This work is supported in part by the “Aim for the Top University Project” of National Taiwan Normal University (NTNU), sponsored by the Ministry of Education, Taiwan, and in part by the National Science Council, Taiwan, under Grants NSC 101-2221-E-003-024-MY3, NSC 102-2221-E-003-014-, NSC 101-2511-S-003-057-MY3, NSC 101-2511-S-003-047-MY3 and NSC 103-2911-I-003-301.
Cite this article
Hsieh, HJ., Chen, B. & Hung, Jw. Histogram equalization of contextual statistics of speech features for robust speech recognition. Multimed Tools Appl 74, 6769–6795 (2015). https://doi.org/10.1007/s11042-014-1929-y
DOI: https://doi.org/10.1007/s11042-014-1929-y