Abstract
This paper presents a novel scheme for considering frame-level speaker relevancy during i-vector extraction for speaker recognition. In the proposed system, frame-level pointwise mutual information is used to directly modify the Baum-Welch statistics in order to extract a robust i-vector. Furthermore, a method is proposed for computing the frame-level speaker relevancy using a deep neural network (DNN) analogous to the DNNs used in robust automatic speech recognition (ASR). The results show that the modified i-vectors obtained using the proposed methods outperformed the conventional i-vectors.
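The core idea above, reweighting the Baum-Welch statistics by a per-frame speaker-relevancy score before i-vector extraction, can be illustrated with a minimal sketch. This is not the authors' implementation: the paper derives the weights from pointwise mutual information (and from a DNN), whereas here the `relevancy` vector is simply taken as a given input, and the function name and shapes are illustrative assumptions.

```python
import numpy as np

def weighted_baum_welch_stats(posteriors, features, relevancy):
    """Compute relevancy-weighted Baum-Welch statistics.

    posteriors: (T, C) frame-level GMM/DNN component posteriors
    features:   (T, D) acoustic feature frames
    relevancy:  (T,)   frame-level speaker-relevancy weights
                       (e.g. PMI-based, as in the paper)
    Returns the zeroth-order stats N (C,) and first-order stats F (C, D).
    """
    # Scale each frame's posterior by its relevancy weight, so frames
    # judged less speaker-relevant contribute less to the statistics.
    weighted = posteriors * relevancy[:, None]   # (T, C)
    N = weighted.sum(axis=0)                     # zeroth-order statistics
    F = weighted.T @ features                    # first-order statistics
    return N, F
```

With all weights equal to one this reduces to the standard Baum-Welch statistics; setting a frame's weight toward zero removes its contribution, which is the mechanism the proposed scheme exploits.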
Acknowledgments
This research was supported by the Projects for Research and Development of Police Science and Technology under the Center for Research and Development of Police Science and Technology and the Korean National Police Agency, funded by the Ministry of Science, ICT and Future Planning (PA-J000001-2017-101), and by a National Research Foundation of Korea (NRF) grant funded by the Korean government (MEST) (NRF-2015R1A2A1A15054343).
Copyright information
© 2018 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Kang, W.H., Cho, W.I., Jang, S.Y., Lee, H.S., Kim, N.S. (2018). I-Vector Extraction Using Speaker Relevancy for Short Duration Speaker Recognition. In: Kim, K., Kim, H., Baek, N. (eds) IT Convergence and Security 2017. Lecture Notes in Electrical Engineering, vol 449. Springer, Singapore. https://doi.org/10.1007/978-981-10-6451-7_10
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-6450-0
Online ISBN: 978-981-10-6451-7