A Language Identification System Based on Voxforge Speech Corpus

  • Khaled Lounnas
  • Mourad Abbas
  • Hocine Teffahi
  • Mohamed Lichouri
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 921)


In this work, we address the problem of language identification using the Voxforge speech corpus. We downloaded corpora for three languages (English, German and Persian) from Voxforge and recorded two additional corpora: one for Modern Standard Arabic (MSA) and one for Kabyl, one of the Algerian Berber dialects. We evaluated three classifiers: k-Nearest Neighbors (kNN), Support Vector Machines (SVM) and the Extra Trees classifier. We obtained an average precision of \(87.45\%\) for binary classification, compared to \(44\%\) for multi-class classification.
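To make the binary versus multi-class comparison concrete, the following is a minimal sketch of a kNN classifier scored by per-class precision. It is not the authors' pipeline: the abstract does not state which acoustic features, hyperparameters or libraries were used, so the 2-D Gaussian "features", the helper names (`knn_predict`, `macro_precision`, `make_toy_data`), the value `k=5`, and the reading of "average precision" as macro-averaged precision are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def knn_predict(train_X, train_y, x, k=3):
    """Label x by majority vote among its k nearest training points (Euclidean)."""
    nearest = sorted(zip(train_X, train_y), key=lambda p: math.dist(p[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def macro_precision(y_true, y_pred):
    """Mean over classes of TP / (TP + FP); a class never predicted scores 0."""
    scores = []
    for c in sorted(set(y_true)):
        # True labels of every sample that was predicted as class c.
        truths_for_c = [t for t, p in zip(y_true, y_pred) if p == c]
        scores.append(truths_for_c.count(c) / len(truths_for_c) if truths_for_c else 0.0)
    return sum(scores) / len(scores)


def make_toy_data(labels, n_per_class, seed):
    """Hypothetical stand-in for acoustic features: one 2-D Gaussian blob per language."""
    rng = random.Random(seed)
    X, y = [], []
    for i, label in enumerate(labels):
        for _ in range(n_per_class):
            X.append((rng.gauss(3.0 * i, 1.0), rng.gauss(0.0, 1.0)))
            y.append(label)
    return X, y


def evaluate(labels, k=5):
    """Train on one synthetic draw, test on another, return macro precision."""
    train_X, train_y = make_toy_data(labels, 40, seed=0)
    test_X, test_y = make_toy_data(labels, 20, seed=1)
    pred_y = [knn_predict(train_X, train_y, x, k=k) for x in test_X]
    return macro_precision(test_y, pred_y)


languages = ["English", "German", "Persian", "MSA", "Kabyl"]
multi_precision = evaluate(languages)        # five-way task
binary_precision = evaluate(languages[:2])   # one language pair (toy data only)
print(f"multi-class: {multi_precision:.2f}, binary: {binary_precision:.2f}")
```

The numbers printed here come from synthetic data and say nothing about the reported \(87.45\%\) and \(44\%\); the sketch only shows how the same classifier and metric can be applied to a language pair and to all five languages at once.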


Keywords: kNN, SVM, Extra Trees, Language identification



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  • Khaled Lounnas (1), corresponding author
  • Mourad Abbas (2)
  • Hocine Teffahi (1)
  • Mohamed Lichouri (1, 2)

  1. University of Sciences and Technology Houari Boumediene, Bab Ezzouar, Algeria
  2. Computational Linguistics Department, CRSTDLA, Bouzaréah, Algeria
