Abstract
Language Identification (LI) is an important first step in several speech processing systems. With a growing number of voice-based assistants, speech LI has emerged as a widely researched field. The problem can be approached either implicitly, where only the speech of a language is available, or explicitly, where the speech is accompanied by its transcript. This paper takes the implicit approach because transcribed data is not available. We benchmark existing models and propose a new attention-based model for language identification that takes log-Mel spectrogram images as input. We also evaluate the effectiveness of raw waveforms as input features to neural network models for LI. Using data from the VoxForge dataset for training and evaluation, we classify six languages (English, French, German, Spanish, Russian and Italian) with an accuracy of 95.4% and four languages (English, French, German, Spanish) with an accuracy of 96.3%. The approach can be scaled further to incorporate more languages.
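To make the input representation concrete, the following is a minimal NumPy-only sketch of how a log-Mel spectrogram can be computed from a raw waveform. The sample rate, frame length, hop size, number of mel bands and the helper names (`mel_filterbank`, `log_mel_spectrogram`) are assumptions chosen for illustration; the abstract does not state the settings or tooling used in the paper.

```python
import numpy as np

def hz_to_mel(f):
    # Common (HTK-style) mel-scale formula.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    """Triangular filters whose edges are equally spaced on the mel scale."""
    edges_hz = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, centre, right = edges_hz[i], edges_hz[i + 1], edges_hz[i + 2]
        rising = (fft_freqs - left) / (centre - left)
        falling = (right - fft_freqs) / (right - centre)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def log_mel_spectrogram(wave, sr=16000, n_fft=512, hop=160, n_mels=64):
    """Return a (n_mels, n_frames) log-Mel spectrogram in decibels.

    All default parameter values are illustrative assumptions.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack([wave[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2      # (n_frames, n_fft//2 + 1)
    mel = power @ mel_filterbank(sr, n_fft, n_mels).T     # (n_frames, n_mels)
    return 10.0 * np.log10(np.maximum(mel, 1e-10)).T      # floor avoids log(0)

if __name__ == "__main__":
    # One second of a synthetic 440 Hz tone stands in for a speech clip.
    sr = 16000
    t = np.arange(sr) / sr
    spec = log_mel_spectrogram(np.sin(2 * np.pi * 440.0 * t), sr=sr)
    print(spec.shape)
```

The resulting two-dimensional array can be rendered or stacked as an image and passed to a convolutional network, which is the general idea behind spectrogram-based LI models; a raw-waveform model would instead consume `wave` directly.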
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Sarthak, Shukla, S., Mittal, G. (2019). Spoken Language Identification Using ConvNets. In: Chatzigiannakis, I., De Ruyter, B., Mavrommati, I. (eds) Ambient Intelligence. AmI 2019. Lecture Notes in Computer Science, vol 11912. Springer, Cham. https://doi.org/10.1007/978-3-030-34255-5_17
DOI: https://doi.org/10.1007/978-3-030-34255-5_17
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-34254-8
Online ISBN: 978-3-030-34255-5
eBook Packages: Computer Science, Computer Science (R0)