Multimedia Tools and Applications, Volume 75, Issue 12, pp 7391–7406

Acoustic feature extraction method for robust speaker identification

  • Zuoqiang Li
  • Yong Gao


When there is a mismatch between the acoustic training and testing environments, the performance of automatic speaker identification systems degrades significantly. This paper therefore proposes a robust feature extraction method for speaker recognition based on the gammatone filter. Instead of the traditional triangular filter banks, gammatone filter banks are used to simulate the auditory processing of the human cochlea. Cube-root compression, equal-loudness weighting, and relative spectral (RASTA) filtering are incorporated into the robust feature extraction process. A simulation experiment is conducted using a Gaussian mixture model (GMM) recognition algorithm. The results indicate that the proposed feature parameters are more robust and represent speaker characteristics better than the conventional mel-frequency cepstral coefficient (MFCC), cochlear filter cepstral coefficient (CFCC), and relative spectral perceptual linear prediction (RASTA-PLP) parameters.
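The pipeline the abstract describes can be sketched roughly as follows. This is not the authors' implementation: the filter design (4th-order gammatone impulse responses on ERB-spaced centre frequencies), the equal-loudness approximation, the classic RASTA filter coefficients, the processing order, and all parameters (32 bands, 25 ms frames, 10 ms hop, 13 cepstra) are assumptions for illustration only.

```python
import numpy as np
from scipy.signal import lfilter
from scipy.fft import dct


def erb_space(low_hz, high_hz, n):
    """ERB-rate spaced centre frequencies (Glasberg & Moore / Slaney)."""
    ear_q, min_bw = 9.26449, 24.7
    return -(ear_q * min_bw) + np.exp(
        np.arange(1, n + 1)
        * (-np.log(high_hz + ear_q * min_bw) + np.log(low_hz + ear_q * min_bw))
        / n
    ) * (high_hz + ear_q * min_bw)


def gammatone_ir(fc, fs, dur=0.064):
    """4th-order gammatone impulse response: t^3 e^{-2*pi*b*t} cos(2*pi*fc*t)."""
    t = np.arange(int(dur * fs)) / fs
    b = 1.019 * 24.7 * (4.37 * fc / 1000.0 + 1.0)  # bandwidth from ERB(fc)
    ir = t**3 * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)
    return ir / np.max(np.abs(ir))  # crude peak normalisation


def equal_loudness(f):
    """Equal-loudness weight at frequency f (Hermansky's PLP approximation)."""
    w2 = (2 * np.pi * f) ** 2
    return (w2 + 56.8e6) * w2**2 / ((w2 + 6.3e6) ** 2 * (w2 + 0.38e9))


def gf_rasta_cepstra(x, fs, n_bands=32, frame_len=0.025, hop=0.010, n_ceps=13):
    """Gammatone band energies -> equal loudness -> cube root -> RASTA -> DCT."""
    cfs = erb_space(100.0, 0.9 * fs / 2.0, n_bands)
    flen, fhop = int(frame_len * fs), int(hop * fs)
    n_frames = 1 + (len(x) - flen) // fhop
    energies = np.empty((n_frames, n_bands))
    for j, fc in enumerate(cfs):
        ch = np.convolve(x, gammatone_ir(fc, fs), mode="same")
        for i in range(n_frames):
            seg = ch[i * fhop : i * fhop + flen]
            energies[i, j] = np.sum(seg**2) + 1e-12  # avoid root of zero
    compressed = (energies * equal_loudness(cfs)) ** (1.0 / 3.0)
    # Classic RASTA band-pass on each band's trajectory across frames:
    # H(z) ~ 0.1 * (2 + z^-1 - z^-3 - 2 z^-4) / (1 - 0.98 z^-1)
    rasta = lfilter([0.2, 0.1, 0.0, -0.1, -0.2], [1.0, -0.98], compressed, axis=0)
    return dct(rasta, type=2, axis=1, norm="ortho")[:, :n_ceps]


if __name__ == "__main__":
    fs = 16000
    x = np.random.default_rng(0).standard_normal(fs)  # 1 s of noise as a stand-in
    print(gf_rasta_cepstra(x, fs).shape)  # (frames, n_ceps)
```

The RASTA stage filters each band's trajectory over time, suppressing the slowly varying spectral components introduced by a fixed channel; this is what gives such features their robustness to training/testing mismatch.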


Keywords: Robust speaker identification · Gammatone filter banks · Feature extraction · RASTA · CMVN



The authors would like to thank the authors of the cited references for their work, as well as their colleagues for helpful comments. The authors also thank the anonymous reviewers for their useful comments, which helped to revise the paper.



Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  1. College of Electronics and Information Engineering, Sichuan University, Chengdu, China
