AfriWeb: A Web Search Engine for a Marginalized Language

Malumba, Nkosana; Moukangwe, Katlego; Suleman, Hussein

doi:10.1007/978-3-319-27974-9_18

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9469)

Included in the following conference series:

International Conference on Asian Digital Libraries

Abstract

isiZulu is a Bantu language spoken by approximately 9 million people, yet very few written isiZulu documents are available on the Internet. The lack of electronic documents, and of supporting infrastructure to store and retrieve documents in isiZulu, is an additional threat to its survival as a written language. This paper documents an investigation into the creation of one such infrastructural element, a custom Web search engine for isiZulu, where no such system previously existed. The search engine focused on two language-specific elements: morphological parsing and statistical language modelling. Morphological parsing was shown to produce better results for isiZulu, an agglutinative language, than traditional affix-based stemming. Statistical language modelling successfully separated isiZulu documents from others, thus enabling the use of a language-based focused crawler.
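
To make the idea of a language-based focused crawler more concrete, the following is a minimal sketch of how a statistical language model might separate isiZulu text from other text. It is not the authors' implementation: it assumes character trigram models with add-one smoothing, and the names `NgramLanguageModel`, `classify`, and `keep_isizulu`, as well as the handful of training phrases, are hypothetical toy choices made purely for illustration. A real system would train on a much larger corpus and would likely use a stronger smoothing method.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return the character n-grams of text, padded with spaces at both ends."""
    padded = " " * (n - 1) + text.lower() + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramLanguageModel:
    """Character n-gram counts for one language (toy model, add-one smoothing)."""

    def __init__(self, samples, n=3):
        self.n = n
        self.counts = Counter()
        for sample in samples:
            self.counts.update(char_ngrams(sample, n))
        self.total = sum(self.counts.values())

    def avg_log_prob(self, text, vocab_size):
        """Average smoothed log-probability per n-gram of text."""
        grams = char_ngrams(text, self.n)
        log_prob = sum(
            math.log((self.counts[g] + 1) / (self.total + vocab_size))
            for g in grams
        )
        return log_prob / len(grams)


# Toy training data (assumption): a few common phrases per language.
ZULU_SAMPLES = [
    "sawubona unjani",
    "ngiyaphila ngiyabonga kakhulu",
    "siyakwamukela",
    "abantu bakhuluma isizulu",
]
ENGLISH_SAMPLES = [
    "hello how are you",
    "i am fine thank you very much",
    "welcome",
    "people speak english",
]

MODELS = {
    "zu": NgramLanguageModel(ZULU_SAMPLES),
    "en": NgramLanguageModel(ENGLISH_SAMPLES),
}
# Shared vocabulary size so both models are smoothed on the same scale;
# the extra 1 reserves probability mass for unseen n-grams.
VOCAB_SIZE = len(set().union(*(m.counts for m in MODELS.values()))) + 1


def classify(text):
    """Return the language code whose model scores the text highest."""
    return max(MODELS, key=lambda lang: MODELS[lang].avg_log_prob(text, VOCAB_SIZE))


def keep_isizulu(pages):
    """Filter step a focused crawler could apply before indexing or following links."""
    return [page for page in pages if classify(page) == "zu"]


if __name__ == "__main__":
    print(classify("ngiyabonga kakhulu"))
    print(keep_isizulu(["sawubona unjani", "hello how are you"]))
```

In a crawler, a filter such as `keep_isizulu` would decide which fetched pages are indexed and whose outgoing links are queued, so that crawling effort stays concentrated on isiZulu content.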

Author information

Authors and Affiliations

Department of Computer Science, University of Cape Town, Private Bag X3, Rondebosch, South Africa
Nkosana Malumba, Katlego Moukangwe & Hussein Suleman

Corresponding author

Correspondence to Hussein Suleman.

Editor information

Editors and Affiliations

Yonsei University, Seoul, Korea (Republic of)
Robert B. Allen
School of ITEE, University of Queensland, St. Lucia, Queensland, Australia
Jane Hunter
School of Library & Information Science, Kent State University, Kent, Ohio, USA
Marcia L. Zeng

About this paper

Cite this paper

Malumba, N., Moukangwe, K., Suleman, H. (2015). AfriWeb: A Web Search Engine for a Marginalized Language. In: Allen, R., Hunter, J., Zeng, M. (eds) Digital Libraries: Providing Quality Information. ICADL 2015. Lecture Notes in Computer Science, vol 9469. Springer, Cham. https://doi.org/10.1007/978-3-319-27974-9_18

DOI: https://doi.org/10.1007/978-3-319-27974-9_18
Published: 18 December 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-27973-2
Online ISBN: 978-3-319-27974-9
eBook Packages: Computer Science, Computer Science (R0)
