Auditory Processing Inspired Robust Feature Enhancement for Speech Recognition

  • Hari Krishna Maganti
  • Marco Matassoni
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 273)


The performance of automatic speech recognition systems based on Mel-frequency cepstral features degrades significantly in noisy environments. This article investigates the feasibility of using bio-inspired auditory features to improve noise robustness. The features are based on auditory characteristics, namely gammatone filtering and modulation spectral processing, which emulate mechanisms of the cochlea and middle ear that contribute to the robustness of human hearing. Noise-robust features that emulate the frequency resolution of the cochlea are first extracted by gammatone filtering. Long-term modulation spectral processing, which preserves the speech intelligibility of the signal, is then applied. The features are compared and discussed on the basis of their performance on the Aurora5 database, which comprises the meeting recorder digit task, recorded with four different microphones in hands-free mode in a real meeting room, and simulated living room and office room data corrupted with different levels of additive noise. The performance of these features is also investigated for the CHiME challenge, which targets speech separation and recognition in background noise collected in a real family room using binaural microphones. The experimental results show that the proposed features provide considerable improvement over standard feature extraction techniques for both versions of the database.
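The two processing stages named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the 25 ms / 10 ms framing, the centre frequencies, the 16 Hz modulation cutoff, and the brick-wall FFT low-pass are all assumptions chosen for illustration. The gammatone impulse response uses the common fourth-order form with the equivalent rectangular bandwidth (ERB) approximation of Glasberg and Moore.

```python
import numpy as np

def erb(fc):
    """Equivalent rectangular bandwidth in Hz (Glasberg and Moore approximation)."""
    return 24.7 * (4.37 * fc / 1000.0 + 1.0)

def gammatone_ir(fc, fs, duration=0.025, order=4):
    """Unit-energy impulse response of a gammatone filter centred at fc Hz."""
    t = np.arange(int(duration * fs)) / fs
    b = 1.019 * erb(fc)  # bandwidth parameter in Hz
    g = t ** (order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)
    return g / np.sqrt(np.sum(g ** 2))

def gammatone_energies(signal, fs, centre_freqs, frame_len=0.025, hop=0.010):
    """Log frame energy per gammatone channel, shape (channels, frames)."""
    n, h = int(frame_len * fs), int(hop * fs)
    n_frames = 1 + (len(signal) - n) // h
    out = np.empty((len(centre_freqs), n_frames))
    for c, fc in enumerate(centre_freqs):
        y = np.convolve(signal, gammatone_ir(fc, fs), mode="same")
        for m in range(n_frames):
            frame = y[m * h : m * h + n]
            out[c, m] = np.log(np.sum(frame ** 2) + 1e-10)
    return out

def modulation_lowpass(traj, frame_rate, cutoff_hz=16.0):
    """Keep slow modulations of each channel trajectory.

    Assumed stand-in for modulation spectral processing: FFT bins above
    cutoff_hz are zeroed along the time axis of each channel.
    """
    spec = np.fft.rfft(traj, axis=1)
    freqs = np.fft.rfftfreq(traj.shape[1], d=1.0 / frame_rate)
    spec[:, freqs > cutoff_hz] = 0.0
    return np.fft.irfft(spec, n=traj.shape[1], axis=1)

# Usage: a 1 kHz tone sampled at 8 kHz, analysed with three assumed channels.
fs = 8000
tone = np.sin(2 * np.pi * 1000 * np.arange(fs) / fs)
energies = gammatone_energies(tone, fs, [250.0, 1000.0, 3000.0])
smoothed = modulation_lowpass(energies, frame_rate=100.0)
```

With a 10 ms hop the channel trajectories are sampled at 100 frames per second, so the modulation filter operates on modulation frequencies up to 50 Hz. A practical front end would use many more channels and typically a decorrelating transform before recognition, which this sketch omits.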


Keywords: Speech Recognition, Speech Signal, Additive Noise, Automatic Speech Recognition, Speech Enhancement





Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Hari Krishna Maganti (1)
  • Marco Matassoni (1)
  1. Fondazione Bruno Kessler, Center for Information Technology - IRST, Povo, Italy
