This article addresses the compensation of variability in Automatic Speaker Verification systems in scenarios where variability due to utterance duration, reverberation, and environmental noise is simultaneously present. We introduce a new representation of the speaker's discriminative information, based on a deep neural network trained discriminatively for speaker classification combined with the i-vector representation. The proposed representation increases verification performance, reducing the error by between 2.5% and 7.9% across all variability conditions compared to baseline systems. We also analyze the robustness of the speaker verification system using the interquartile range, obtaining a 1.19-fold improvement over the evaluated baselines.
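The interquartile range (IQR) mentioned above can serve as a simple robustness measure: the narrower the spread of error rates across variability conditions, the more stable the system. The sketch below illustrates the idea with hypothetical per-condition error values; the function name, the ratio-based "gain", and the numbers are illustrative assumptions, not the paper's actual data or implementation.

```python
def iqr_robustness(values):
    """Interquartile range (Q3 - Q1) of per-condition error rates,
    using linear interpolation between closest ranks.

    A smaller IQR means performance varies less across conditions,
    i.e. the system is more robust.
    """
    s = sorted(values)

    def percentile(p):
        k = (len(s) - 1) * p   # fractional rank for percentile p
        f = int(k)             # lower neighbouring rank
        c = min(f + 1, len(s) - 1)
        return s[f] + (k - f) * (s[c] - s[f])

    return percentile(0.75) - percentile(0.25)

# Hypothetical EERs (%) of two systems under several
# duration/noise/reverberation conditions:
baseline = [8.0, 10.5, 12.0, 15.5]
proposed = [7.5, 8.5, 9.5, 11.0]

# Robustness gain as the ratio of baseline IQR to proposed IQR
# (a value > 1 means the proposed system is more stable).
gain = iqr_robustness(baseline) / iqr_robustness(proposed)
```

For these made-up numbers the baseline IQR is 3.0 and the proposed IQR is 1.625, so the gain exceeds 1, mirroring (but not reproducing) the kind of comparison the abstract reports.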
The UBM refers to a Universal Background Model representing the general speaker population.
Cite this article
Reyes-Díaz, F.J., Hernández-Sierra, G. & de Lara, J.R.C. DNN and i-vector combined method for speaker recognition on multi-variability environments. Int J Speech Technol (2021). https://doi.org/10.1007/s10772-021-09796-1
Keywords
- Multi-variability compensation
- Bottleneck features
- Speaker verification
- Short utterances
- Additive noise
- Deep neural network