Integrating DNN–HMM Technique with Hierarchical Multi-layer Acoustic Model for Text-Dependent Speaker Verification

  • Mohammad Azharuddin Laskar
  • Rabul Hussain Laskar

Abstract

Subspace techniques, such as i-vector/probabilistic linear discriminant analysis and joint factor analysis, have been the most commonly used techniques in text-dependent speaker verification. These techniques, however, do not model the temporal structure of the pass-phrase, which is an important cue in text-dependent speaker verification. The hierarchical multi-layer acoustic model (HiLAM) uses the Gaussian mixture model (GMM)-hidden Markov model (HMM) technique, which does account for the temporal information of the pass-phrase. Owing to this modeling of contextual information, HiLAM has been found to outperform the subspace techniques. In this paper, we propose integrating the DNN–HMM technique with HiLAM to further improve system performance. First, we attempt to define a speaker-text unit/class that can characterize speaker idiosyncrasies, which are known to be associated with shorter and more fundamental units of speech text. To this end, HiLAM is used to propose a new class definition, and the training data is aligned with respect to this class definition. The labeled data is then used to discriminatively train a deep neural network (DNN). The new method of alignment enables the network to learn the actual context of the pass-phrase components, which is not the case for a DNN trained in automatic speech recognition fashion. In addition, the network models the speaker idiosyncrasies associated with specific and finer text units. Replacing the GMM likelihoods of HiLAM with DNN posteriors leads to a significant improvement in performance over the baseline HiLAM system: a relative equal error rate (EER) reduction of up to 36.58% is observed on Part 1 of the RSR2015 database.
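The idea of substituting DNN posteriors for GMM likelihoods inside an HMM can be illustrated with a minimal Python sketch. It follows the generic hybrid DNN–HMM recipe (dividing each posterior by the class prior to obtain a scaled likelihood, then scoring with Viterbi decoding) and is not taken from the paper. The function names, the no-skip left-to-right topology, and the shared transition probabilities are assumptions made for illustration only.

```python
import numpy as np


def scaled_log_likelihoods(log_posteriors, log_priors):
    """Turn DNN log posteriors log p(s | x_t) into scaled log likelihoods.

    By Bayes' rule p(x_t | s) is proportional to p(s | x_t) / p(s), so
    subtracting the log prior gives a score that can stand in for a GMM
    log likelihood (up to a per-frame constant).
    log_posteriors: array of shape (T, S); log_priors: array of shape (S,).
    """
    return log_posteriors - log_priors


def viterbi_left_to_right(frame_scores, log_self, log_next):
    """Best-path log score through a left-to-right HMM without skips.

    frame_scores: (T, S) per-frame, per-state log scores.
    log_self, log_next: log probabilities of staying in a state or moving
    to the next one (assumed identical for all states in this sketch).
    The path is forced to start in state 0 and end in state S - 1.
    """
    num_frames, num_states = frame_scores.shape
    delta = np.full(num_states, -np.inf)
    delta[0] = frame_scores[0, 0]
    for t in range(1, num_frames):
        stay = delta + log_self
        advance = np.full(num_states, -np.inf)
        advance[1:] = delta[:-1] + log_next
        delta = np.maximum(stay, advance) + frame_scores[t]
    return delta[-1]


# Toy usage with made-up numbers: 4 frames, 2 states, uniform priors.
posteriors = np.array([[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.1, 0.9]])
priors = np.array([0.5, 0.5])
scores = scaled_log_likelihoods(np.log(posteriors), np.log(priors))
path_score = viterbi_left_to_right(scores, np.log(0.5), np.log(0.5))
```

In a verification setting, a path score of this kind would typically be compared against the score from a background model to form a log-likelihood ratio. How the paper normalizes its scores is not stated in the abstract.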

Keywords

Text-dependent speaker verification, DNN, HiLAM, DNN–HMM

Notes

Acknowledgements

The authors would like to thank the Speech and Image Processing Laboratory of the National Institute of Technology Silchar, Silchar, for supporting the research work.

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. Department of Electronics and Communication Engineering, National Institute of Technology Silchar, Silchar, India
