Spoken Language Identification Using ConvNets

Sarthak; Shukla, Shikhar; Mittal, Govind

doi:10.1007/978-3-030-34255-5_17

Sarthak, Shikhar Shukla & Govind Mittal

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11912)

Included in the following conference series:

European Conference on Ambient Intelligence

Abstract

Language Identification (LI) is an important first step in several speech processing systems. With the growing number of voice-based assistants, spoken LI has become a widely researched field. The problem can be approached implicitly, where only the speech for a language is available, or explicitly, where the speech comes with its corresponding transcript. This paper takes the implicit approach because transcribed data is not available. It benchmarks existing models and proposes a new attention-based model for language identification that takes log-Mel spectrogram images as input. We also evaluate the effectiveness of raw waveforms as input features to neural network models for LI tasks. Using data from the VoxForge dataset for training and evaluation, we classified six languages (English, French, German, Spanish, Russian and Italian) with an accuracy of 95.4% and four languages (English, French, German, Spanish) with an accuracy of 96.3%. The approach can be scaled further to incorporate more languages.
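
To make the log-Mel input mentioned in the abstract concrete, the following is a minimal sketch of how a log-Mel spectrogram can be computed from a raw waveform. It is not the authors' pipeline: the frame size, hop, number of Mel bands, sample rate, the HTK-style Mel formula and every function name here are illustrative assumptions. A naive DFT is used only to keep the example within the Python standard library; a dedicated audio library would normally replace it.

```python
import cmath
import math


def hz_to_mel(f):
    # HTK-style Mel formula (an assumption; other variants exist).
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def power_spectrum(frame):
    """Naive O(N^2) DFT power spectrum of one frame (bins 0 .. N/2)."""
    n = len(frame)
    spec = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        spec.append(abs(acc) ** 2)
    return spec


def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters whose edges are equally spaced on the Mel scale."""
    n_bins = n_fft // 2 + 1
    top = hz_to_mel(sample_rate / 2.0)
    edges_hz = [mel_to_hz(top * i / (n_mels + 1)) for i in range(n_mels + 2)]
    bin_hz = [k * sample_rate / n_fft for k in range(n_bins)]
    bank = []
    for m in range(1, n_mels + 1):
        lo, mid, hi = edges_hz[m - 1], edges_hz[m], edges_hz[m + 1]
        row = []
        for f in bin_hz:
            up = (f - lo) / (mid - lo)      # rising slope of the triangle
            down = (hi - f) / (hi - mid)    # falling slope of the triangle
            row.append(max(0.0, min(up, down)))
        bank.append(row)
    return bank


def log_mel_spectrogram(signal, sample_rate, n_fft=64, hop=32, n_mels=8):
    """Return a list of frames, each a list of n_mels log energies in dB."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / n_fft)
              for i in range(n_fft)]  # Hann window
    bank = mel_filterbank(n_mels, n_fft, sample_rate)
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        frame = [signal[start + i] * window[i] for i in range(n_fft)]
        power = power_spectrum(frame)
        mel = [sum(w * p for w, p in zip(row, power)) for row in bank]
        # Floor the energy before the log to avoid log10(0).
        frames.append([10.0 * math.log10(max(e, 1e-10)) for e in mel])
    return frames


# Hypothetical input: a 440 Hz tone standing in for a short speech clip.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 440.0 * t / sample_rate) for t in range(512)]
spec = log_mel_spectrogram(tone, sample_rate)
print(len(spec), len(spec[0]))  # 15 frames x 8 Mel bands
```

The resulting frames-by-bands matrix is what would be rendered as an image and passed to a convolutional network; a raw-waveform model would instead consume the `tone` samples directly.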

Author information

Authors and Affiliations

Analytics Quotient, Bangalore, India
Sarthak
Samsung R&D Institute India-Bangalore, Bangalore, India
Shikhar Shukla
Birla Institute of Technology and Science, Pilani, Rajasthan, India
Govind Mittal

Corresponding author

Correspondence to Shikhar Shukla.

Editor information

Editors and Affiliations

Sapienza University of Rome, Rome, Italy
Ioannis Chatzigiannakis
Philips Research, Eindhoven, The Netherlands
Boris De Ruyter
Hellenic Open University, Patras, Greece
Irene Mavrommati

About this paper

Cite this paper

Sarthak, Shukla, S., Mittal, G. (2019). Spoken Language Identification Using ConvNets. In: Chatzigiannakis, I., De Ruyter, B., Mavrommati, I. (eds) Ambient Intelligence. AmI 2019. Lecture Notes in Computer Science, vol 11912. Springer, Cham. https://doi.org/10.1007/978-3-030-34255-5_17

DOI: https://doi.org/10.1007/978-3-030-34255-5_17
Published: 04 November 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-34254-8
Online ISBN: 978-3-030-34255-5
eBook Packages: Computer Science, Computer Science (R0)
