Improved filter bank on multitaper framework for robust Punjabi-ASR system

  • Virender Kadyan
  • Archana Mantri
  • R. K. Aggarwal


The robustness of an automatic speech recognition (ASR) system depends on the accuracy of feature extraction and classification in the training phase. A mismatch between training and testing conditions during the classification of large feature vectors degrades performance. This paper addresses the robustness of acoustic information for a practical Punjabi dataset. Traditional feature extraction approaches, namely mel frequency cepstral coefficients (MFCC) and gammatone frequency cepstral coefficients (GFCC), suffer from high variance and leakage of spectral information. In addition, the large volume of feature information becomes difficult to handle for a large speech vocabulary. To overcome these problems, a principal component analysis (PCA) based multi-windowing technique is proposed. It is combined with the baseline GFCC and MFCC feature approaches after the taper parameter has been tuned. The proposed integrated approaches yield better feature vectors, which are then processed by a differential evolution + hidden Markov model (DE + HMM) based classifier. The integrated approaches perform substantially better at word recognition than conventional or fused feature extraction systems.
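To make the multi-windowing idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes sine tapers with uniform weights, a 512-point FFT, and PCA applied directly to log multitaper spectra; the abstract does not state the taper weights or where PCA sits in the pipeline. The mel or gammatone filter bank, the cepstral transform, and the DE + HMM classifier are omitted. All function names and parameter values are hypothetical.

```python
import numpy as np


def sine_tapers(frame_len, num_tapers):
    """Return a (num_tapers, frame_len) array of orthonormal sine tapers."""
    n = np.arange(1, frame_len + 1)
    k = np.arange(1, num_tapers + 1)[:, None]
    return np.sqrt(2.0 / (frame_len + 1)) * np.sin(np.pi * k * n / (frame_len + 1))


def multitaper_power_spectrum(frame, num_tapers=6, n_fft=512):
    """Average the periodograms obtained with each taper.

    Uniform weighting is an assumption; a tuned estimator would use
    non-uniform weights per taper.
    """
    tapers = sine_tapers(len(frame), num_tapers)
    periodograms = np.abs(np.fft.rfft(tapers * frame, n=n_fft, axis=1)) ** 2
    return periodograms.mean(axis=0)


def pca_reduce(features, num_components):
    """Project row-wise feature vectors onto their top principal components."""
    centered = features - features.mean(axis=0)
    # Rows of `components` are the principal directions, sorted by variance.
    _, _, components = np.linalg.svd(centered, full_matrices=False)
    return centered @ components[:num_components].T


# Synthetic stand-in for speech: 20 frames of 400 samples (25 ms at 16 kHz).
rng = np.random.default_rng(0)
frames = rng.standard_normal((20, 400))
log_spectra = np.log(np.array([multitaper_power_spectrum(f) for f in frames]) + 1e-10)
reduced = pca_reduce(log_spectra, num_components=13)
```

Averaging several tapered periodograms lowers the variance of the spectral estimate compared with a single Hamming-windowed periodogram, which is the motivation the abstract gives for multi-windowing. The PCA step then shrinks each frame's representation to a small number of decorrelated dimensions.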


Keywords

  • Gammatone frequency cepstral coefficients
  • Principal component analysis
  • Mel frequency cepstral coefficients
  • Sine-weighted cepstrum estimator




Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  • Virender Kadyan (1), email author
  • Archana Mantri (1)
  • R. K. Aggarwal (2)

  1. Department of Computer Science & Engineering, Chitkara University Institute of Engineering and Technology, Chitkara University, Punjab, India
  2. Department of Computer Engineering, NIT Kurukshetra, Kurukshetra, India
