Language Identification for South African Bantu Languages Using Rank Order Statistics
Language identification is an important preprocessing step in many data management, information retrieval, and transformation systems. However, Bantu languages are known to be difficult to identify because of a lack of data and the similarity between the languages. This paper investigates the performance of n-gram counting using rank orders to discriminate among the Bantu languages spoken in South Africa, using varying test and training data sizes. The highest average accuracy obtained was 99.3%, with a test size of 495 characters and a training size of 600,000 characters. The lowest average accuracy obtained was 78.72%, with a test size of 15 characters and a training size of 200,000 characters.
Keywords: N-grams, Bantu languages, Rank order statistics
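To make the idea of n-gram counting with rank orders concrete, the following is a minimal sketch in the style of the out-of-place measure commonly associated with rank-order n-gram classifiers. It is an illustration under stated assumptions, not the paper's implementation: the function names (`ngram_profile`, `out_of_place`, `identify`), the n-gram lengths, the profile size, and the toy training strings are all hypothetical choices made here.

```python
from collections import Counter


def ngram_profile(text, n_max=3, top_k=300):
    """Map the top_k most frequent character n-grams to their rank (0 = most frequent).

    n_max and top_k are assumed values for illustration only.
    """
    text = text.lower()
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    # Sort by descending frequency; break ties alphabetically so ranks are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
    return {gram: rank for rank, (gram, _) in enumerate(ranked)}


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from the language profile get a fixed maximum penalty."""
    max_penalty = len(lang_profile)
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else max_penalty
        for gram, rank in doc_profile.items()
    )


def identify(text, lang_profiles):
    """Return the label whose profile is closest to the text's profile."""
    doc = ngram_profile(text)
    return min(lang_profiles, key=lambda lang: out_of_place(doc, lang_profiles[lang]))


# Toy placeholder "languages" (not real corpora), used only to exercise the code.
profiles = {
    "A": ngram_profile("abababab abab"),
    "B": ngram_profile("xyzxyz xyz"),
}
print(identify("abab", profiles))
```

In practice the training text per language would be far larger (the abstract reports training sizes of hundreds of thousands of characters), and shorter test strings give the distance measure fewer n-grams to work with, which is consistent with the lower accuracy reported for 15-character inputs.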
This research was partially funded by the National Research Foundation of South Africa (Grant numbers: 85470 and 105862) and the University of Cape Town. The authors acknowledge that opinions, findings, and conclusions or recommendations expressed in this publication are those of the authors, and that the NRF accepts no liability whatsoever in this regard.