Abstract
In this paper, we introduce a novel approach to Language Identification (LID). We first evaluate two commonly used state-of-the-art methods based on the UBM/GMM i-vector technique combined with a back-end classifier. The two methods differ in the input features used to train the UBM/GMM models: conventional MFCCs, or deep Bottleneck Features (BNF) extracted from a neural network. By analogy with successful algorithms developed for speaker recognition, we propose to train the BNF classifier directly on language targets rather than on conventional phone targets (i.e., the International Phonetic Alphabet). We show that the proposed approach reduces the number of targets by 96% when tested on 4 languages of the SpeechDat databases, which leads to a 94% reduction in the time needed to train the BNF classifier. On average, we achieve a relative improvement of approximately 35% in terms of average cost \(C_{avg}\), as well as Language Error Rate (LER), across all test duration conditions.
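To make the \(C_{avg}\) metric and the notion of relative improvement more concrete, the following is a minimal sketch rather than the authors' evaluation code. It assumes the closed-set form of the average cost used in the NIST LRE07 evaluation plan, with \(C_{miss} = C_{FA} = 1\) and \(P_{target} = 0.5\). The function names and all numeric inputs are hypothetical and are not taken from the paper.

```python
import numpy as np


def c_avg(p_miss, p_fa, p_target=0.5, c_miss=1.0, c_fa=1.0):
    """Closed-set average cost over N target languages (assumed LRE07-style form).

    p_miss[t]  : miss rate when t is the target language
    p_fa[t, n] : false-alarm rate for target t on segments of language n
                 (the diagonal is ignored)
    """
    p_miss = np.asarray(p_miss, dtype=float)
    p_fa = np.asarray(p_fa, dtype=float)
    n_lang = p_miss.shape[0]
    # Non-target prior is spread evenly over the other N - 1 languages.
    p_nontarget = (1.0 - p_target) / (n_lang - 1)
    # Sum false-alarm rates over non-target languages only.
    fa_sum = p_fa.sum(axis=1) - np.diag(p_fa)
    per_target = c_miss * p_target * p_miss + c_fa * p_nontarget * fa_sum
    return float(per_target.mean())


def relative_improvement(baseline, proposed):
    """Relative reduction of a cost or error rate with respect to a baseline."""
    return (baseline - proposed) / baseline


# Illustrative, made-up error rates for 4 languages (not results from the paper).
example_p_miss = [0.10, 0.20, 0.10, 0.20]
example_p_fa = np.full((4, 4), 0.05)
example_cost = c_avg(example_p_miss, example_p_fa)
```

With these definitions, a system whose cost drops from 0.100 to 0.065 would show a relative improvement of 35%, which is the kind of figure the abstract reports.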
Acknowledgement
This work was partially supported by several industrial projects at Idiap and the China Scholarship Council.
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Grisard, M., Motlicek, P., Allouchi, W., Baeriswyl, M., Lazaridis, A., Zhan, Q. (2019). Spoken Language Identification Using Language Bottleneck Features. In: Ekštein, K. (eds) Text, Speech, and Dialogue. TSD 2019. Lecture Notes in Computer Science, vol 11697. Springer, Cham. https://doi.org/10.1007/978-3-030-27947-9_32
DOI: https://doi.org/10.1007/978-3-030-27947-9_32
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-27946-2
Online ISBN: 978-3-030-27947-9
eBook Packages: Computer Science (R0)