Language Identification for South African Bantu Languages Using Rank Order Statistics

  • Meluleki Dube
  • Hussein Suleman
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11853)


Language identification is an important preprocessing step in many data management, information retrieval, and transformation systems. However, Bantu languages are known to be difficult to identify because of a lack of data and the similarity among the languages. This paper investigates the performance of n-gram counting using rank orders to discriminate among the Bantu languages spoken in South Africa, using varying test and training data sizes. The highest average accuracy obtained was 99.3%, with a testing size of 495 characters and a training size of 600000 characters. The lowest average accuracy obtained was 78.72%, with a testing size of 15 characters and a training size of 200000 characters.
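
The abstract does not spell out the procedure, so the following is only a minimal sketch of how n-gram counting with rank orders is commonly done, assuming the out-of-place distance popularised by Cavnar and Trenkle. The function names (`ngram_profile`, `out_of_place`, `identify`), the maximum n-gram length of 3, and the profile size of 300 are illustrative assumptions, not values taken from the paper.

```python
from collections import Counter


def ngram_profile(text, n_max=3, top_k=300):
    """Return the top_k most frequent character n-grams (lengths 1..n_max),
    ordered from most to least frequent. n_max and top_k are assumed values."""
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences between two profiles. An n-gram that is
    missing from lang_profile receives a fixed maximum penalty."""
    rank = {gram: r for r, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(r - rank[gram]) if gram in rank else max_penalty
        for r, gram in enumerate(doc_profile)
    )


def identify(text, training):
    """Pick the language whose training profile is closest to the text.
    `training` maps a language label to a training string (hypothetical input)."""
    profiles = {lang: ngram_profile(sample) for lang, sample in training.items()}
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))
```

In this sketch the training size corresponds to the length of each string in `training` and the testing size to the length of `text`. Shorter test strings yield sparser profiles, which is one plausible reason accuracy would fall as the testing size shrinks.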


N-grams · Bantu languages · Rank order statistics



This research was partially funded by the National Research Foundation of South Africa (Grant numbers: 85470 and 105862) and the University of Cape Town. The authors acknowledge that opinions, findings and conclusions or recommendations expressed in this publication are those of the authors, and that the NRF accepts no liability whatsoever in this regard.



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. University of Cape Town, Cape Town, South Africa
