DNN and i-vector combined method for speaker recognition on multi-variability environments

Abstract

This article addresses the compensation of variability in automatic speaker verification systems in scenarios where variability due to utterance duration, reverberation, and environmental noise is simultaneously present. We introduce a new representation of the speaker's discriminative information, based on a deep neural network trained discriminatively for speaker classification combined with the i-vector representation. The proposed representation increases verification performance, reducing the error by between 2.5% and 7.9% across all variability conditions compared to the baseline systems. We also analyze the robustness of the speaker verification system using the interquartile range, obtaining a 1.19-fold improvement over the evaluated baselines.
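The interquartile-range (IQR) robustness measure mentioned above can be sketched as follows. This is a minimal illustration only, not the authors' code: the error-rate values below are hypothetical, and the robustness factor is simply the ratio of the baseline IQR to the proposed system's IQR across conditions.

```python
from statistics import quantiles

def iqr(values):
    """Interquartile range: spread between the 25th and 75th percentiles.

    Used as a robustness measure: a smaller IQR of error rates across
    evaluation conditions indicates a more stable system.
    """
    q1, _, q3 = quantiles(values, n=4)
    return q3 - q1

# Hypothetical EER values (%) of each system across variability conditions
# (short duration, reverberation, noise, and their combinations).
baseline_eers = [8.2, 9.5, 12.1, 10.4, 14.0]
proposed_eers = [7.9, 8.4, 8.9, 9.8, 10.6]

# Robustness improvement factor (here ~2.05 with these made-up numbers).
print(iqr(baseline_eers) / iqr(proposed_eers))
```

A factor greater than 1 means the proposed system's error rates are more tightly clustered across conditions than the baseline's.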


Figs. 1–4

Notes

  1.

    The UBM refers to a Universal Background Model of the population.

  2.

    http://dnt.kr.hsnr.de/download.html.
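For illustration of the UBM concept in Note 1 (not the authors' actual setup), a UBM is typically a large speaker-independent Gaussian mixture trained on acoustic features pooled from many speakers; frame posteriors against it yield the Baum-Welch statistics used for i-vector extraction. A minimal sketch with scikit-learn's `GaussianMixture` on synthetic MFCC-like features:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Hypothetical pooled acoustic features (frames x dims) from many speakers.
pooled_features = rng.normal(size=(2000, 13))

# The UBM: a speaker-independent GMM modelling the overall feature space.
# Real systems use 512-2048 components; 8 keeps this sketch fast.
ubm = GaussianMixture(n_components=8, covariance_type="diag",
                      max_iter=50, random_state=0)
ubm.fit(pooled_features)

# Per-frame component posteriors against the UBM: the zeroth- and
# first-order Baum-Welch statistics for i-vector extraction are
# accumulated from these responsibilities.
posteriors = ubm.predict_proba(pooled_features)
print(posteriors.shape)  # (2000, 8)
```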


Author information

Corresponding author

Correspondence to Flavio J. Reyes-Díaz.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article

Cite this article

Reyes-Díaz, F.J., Hernández-Sierra, G. & de Lara, J.R.C. DNN and i-vector combined method for speaker recognition on multi-variability environments. Int J Speech Technol (2021). https://doi.org/10.1007/s10772-021-09796-1

Keywords

  • Multi-variability compensation
  • Bottleneck features
  • Speaker verification
  • Short utterances
  • Additive noise
  • Reverberation
  • Deep neural network