Abstract
isiZulu is a Bantu language spoken by approximately 9 million people, yet very few written documents in the language are available on the Internet. The lack of electronic documents, and of supporting infrastructure to store and retrieve documents in isiZulu, is an additional threat to its survival as a written language. This paper documents an investigation into the creation of one such infrastructural element, a custom Web search engine for isiZulu, where previously no such system existed. The search engine focused on two language-specific elements: morphological parsing and statistical language modelling. Morphological parsing was shown to produce better results for isiZulu, an agglutinative language, than traditional affix-based stemming. Statistical language modelling successfully separated isiZulu documents from others, thus enabling the use of a language-based focused crawler.
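To illustrate how a statistical language model might separate isiZulu pages from others before a focused crawler follows their links, the following is a minimal sketch. The abstract does not specify the model used, so everything here is an assumption: character trigrams, add-one smoothing, a handful of toy training phrases, and the hypothetical names `char_ngrams`, `CharNgramModel` and `is_isizulu`. It is not the authors' implementation.

```python
import math
from collections import Counter

# Toy training phrases (assumed for illustration; a real system would
# train on a sizeable corpus for each language).
ZULU_SAMPLES = [
    "sawubona, unjani?",
    "ngiyabonga kakhulu",
    "hamba kahle",
    "umuntu ngumuntu ngabantu",
]
OTHER_SAMPLES = [
    "hello, how are you?",
    "thank you very much",
    "go well",
    "a person is a person through other people",
]


def char_ngrams(text, n=3):
    """Return the character n-grams of a lowercased, space-padded text."""
    padded = " " * (n - 1) + text.lower() + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class CharNgramModel:
    """Smoothed distribution over character n-grams (hypothetical helper)."""

    def __init__(self, samples, n=3):
        self.n = n
        self.counts = Counter()
        for sample in samples:
            self.counts.update(char_ngrams(sample, n))
        self.total = sum(self.counts.values())
        # Add-one smoothing: one extra vocabulary slot stands in for
        # every n-gram that was never seen in training.
        self.vocab = len(self.counts) + 1

    def avg_log_prob(self, text):
        """Average log-probability per n-gram, so text length cancels out."""
        grams = char_ngrams(text, self.n)
        denom = self.total + self.vocab
        score = sum(math.log((self.counts[g] + 1) / denom) for g in grams)
        return score / len(grams)


def is_isizulu(text, zulu_model, other_model, margin=0.0):
    """Accept the text if the isiZulu model explains it better."""
    gap = zulu_model.avg_log_prob(text) - other_model.avg_log_prob(text)
    return gap > margin


zulu_model = CharNgramModel(ZULU_SAMPLES)
other_model = CharNgramModel(OTHER_SAMPLES)

for page_text in ["Ngiyabonga kakhulu", "Thank you very much"]:
    print(page_text, "->", is_isizulu(page_text, zulu_model, other_model))
```

In a language-based focused crawler, a check of this kind would gate which fetched pages are indexed and have their outgoing links queued. The `margin` parameter is an assumed knob for trading precision against recall.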
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Malumba, N., Moukangwe, K., Suleman, H. (2015). AfriWeb: A Web Search Engine for a Marginalized Language. In: Allen, R., Hunter, J., Zeng, M. (eds) Digital Libraries: Providing Quality Information. ICADL 2015. Lecture Notes in Computer Science, vol 9469. Springer, Cham. https://doi.org/10.1007/978-3-319-27974-9_18
DOI: https://doi.org/10.1007/978-3-319-27974-9_18
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-27973-2
Online ISBN: 978-3-319-27974-9
eBook Packages: Computer Science (R0)