Abstract
This paper presents a unified model that performs language and speaker recognition simultaneously. The model is based on a multi-task recurrent neural network in which the output of one task is fed as input to the other, leading to a collaborative learning framework that improves both language and speaker recognition by sharing information between the tasks. The preliminary experiments presented in this paper demonstrate that the multi-task model outperforms comparable task-specific models on both tasks. The improvement in language recognition is especially remarkable, which we believe is due to the speaker normalization effect of the information supplied by the speaker recognition component.
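The cross-feeding idea described above can be sketched in a few lines of NumPy. This is a minimal, hypothetical illustration only: it uses a vanilla tanh recurrence in place of the paper's LSTM cells, random weights, invented dimensions, and dummy input frames. The point is the wiring, in which each task branch receives the acoustic features concatenated with the other branch's previous output.

```python
import numpy as np

# Hypothetical dimensions; the paper's actual network sizes are not given here.
FEAT_DIM, HID, N_LANG, N_SPK = 40, 16, 3, 5
rng = np.random.default_rng(0)

def rnn_step(x, h, p):
    """One vanilla recurrent step (tanh); stands in for an LSTM cell."""
    return np.tanh(x @ p["Wx"] + h @ p["Wh"] + p["b"])

def make_branch(in_dim, out_dim):
    """Randomly initialised parameters for one task branch (sketch only)."""
    return {
        "Wx": rng.standard_normal((in_dim, HID)) * 0.1,
        "Wh": rng.standard_normal((HID, HID)) * 0.1,
        "b": np.zeros(HID),
        "Wo": rng.standard_normal((HID, out_dim)) * 0.1,
    }

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Each branch sees the features plus the OTHER task's previous output.
lang = make_branch(FEAT_DIM + N_SPK, N_LANG)
spk = make_branch(FEAT_DIM + N_LANG, N_SPK)

frames = rng.standard_normal((20, FEAT_DIM))  # a dummy utterance
h_l = h_s = np.zeros(HID)
y_l, y_s = np.zeros(N_LANG), np.zeros(N_SPK)

for x in frames:
    # Cross-feed: language branch conditions on speaker posteriors, and vice versa.
    h_l = rnn_step(np.concatenate([x, y_s]), h_l, lang)
    h_s = rnn_step(np.concatenate([x, y_l]), h_s, spk)
    y_l = softmax(h_l @ lang["Wo"])  # per-frame language posteriors
    y_s = softmax(h_s @ spk["Wo"])   # per-frame speaker posteriors
```

In a trained version of this architecture, feeding the speaker posteriors into the language branch is what would supply the speaker normalization effect noted in the abstract.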
Keywords
- Speaker Recognition (SRE)
- Collaborative Learning Approach
- Language Recognition (LRE)
- Constrained Maximum Likelihood Linear Regression (CMLLR)
- Long Short-term Memory (LSTM)
Notes
1. This database was collected by our institute for commercial use, so we cannot release the waveform data; however, the Fbank and MFCC features in Kaldi format have been published online at http://data.cslt.org, together with the Kaldi recipe to reproduce the results.
Acknowledgment
This work was supported by the National Natural Science Foundation of China under Grants No. 61371136 and No. 61633013, and by the National Basic Research Program (973 Program) of China under Grant No. 2013CB329302.
Copyright information
© 2018 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Li, L., Tang, Z., Wang, D., Abel, A., Feng, Y., Zhang, S. (2018). Collaborative Learning for Language and Speaker Recognition. In: Tao, J., Zheng, T., Bao, C., Wang, D., Li, Y. (eds) Man-Machine Speech Communication. NCMMSC 2017. Communications in Computer and Information Science, vol 807. Springer, Singapore. https://doi.org/10.1007/978-981-10-8111-8_6
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-8110-1
Online ISBN: 978-981-10-8111-8