
AfriWeb: A Web Search Engine for a Marginalized Language

  • Conference paper
Digital Libraries: Providing Quality Information (ICADL 2015)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9469)


Abstract

isiZulu is a Bantu language spoken by approximately 9 million people, yet very few written isiZulu documents are available on the Internet. The lack of electronic documents, and of supporting infrastructure to store and retrieve documents in isiZulu, is an additional threat to its survival as a written language. This paper documents an investigation into the creation of one such infrastructural element, a custom Web search engine for isiZulu, where no such system previously existed. The search engine focused on the language-specific elements of morphological parsing and statistical language modelling. Morphological parsing was shown to produce better results for isiZulu, an agglutinative language, than traditional affix-based stemming. Statistical language modelling successfully separated isiZulu documents from others, thus enabling the use of a language-based focused crawler.
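The abstract does not say which statistical language model was used, so the following is only a minimal sketch of the general idea behind separating isiZulu documents from others: train one character n-gram model per language and assign a page to the language whose model scores it highest. The use of character trigrams with add-one smoothing, the names `CharNgramModel`, `MODELS` and `identify_language`, and the tiny training strings are all assumptions made for illustration; a real crawler would train on a sizeable corpus for each language.

```python
import math
from collections import Counter


class CharNgramModel:
    """Character n-gram model with add-one smoothing (illustrative only)."""

    def __init__(self, text, n=3):
        self.n = n
        padded = self._pad(text)
        grams = [padded[i:i + n] for i in range(len(padded) - n + 1)]
        self.ngrams = Counter(grams)                      # counts of full n-grams
        self.contexts = Counter(g[:-1] for g in grams)    # counts of (n-1)-char prefixes
        self.chars = set(padded)

    def _pad(self, text):
        # Lowercase and pad so that word-initial and word-final n-grams exist.
        return " " * (self.n - 1) + text.lower() + " "

    def avg_log_prob(self, text, vocab_size):
        """Average smoothed log-probability per n-gram of `text`."""
        padded = self._pad(text)
        total, count = 0.0, 0
        for i in range(len(padded) - self.n + 1):
            gram = padded[i:i + self.n]
            p = (self.ngrams[gram] + 1) / (self.contexts[gram[:-1]] + vocab_size)
            total += math.log(p)
            count += 1
        return total / count


# Tiny, hand-picked samples (assumption): far too small for real use.
MODELS = {
    "zu": CharNgramModel(
        "sawubona unjani ngiyabonga kakhulu siyabonga "
        "umuntu ngumuntu ngabantu ikhasi elikhulu"
    ),
    "en": CharNgramModel(
        "the quick brown fox jumps over the lazy dog while "
        "the search engine crawls the web for documents"
    ),
}


def identify_language(text, models=MODELS):
    """Return the label of the model that gives `text` the highest score."""
    # Shared vocabulary size (plus one slot for unseen characters) keeps
    # the smoothed scores comparable across models.
    vocab_size = len(set().union(*(m.chars for m in models.values()))) + 1
    return max(models, key=lambda lang: models[lang].avg_log_prob(text, vocab_size))


print(identify_language("ngiyabonga kakhulu"))
print(identify_language("the lazy dog"))
```

In a language-based focused crawler, a check of this kind would typically run on each fetched page, with only pages identified as isiZulu being indexed and having their outgoing links followed.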



Author information


Corresponding author

Correspondence to Hussein Suleman.


Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Malumba, N., Moukangwe, K., Suleman, H. (2015). AfriWeb: A Web Search Engine for a Marginalized Language. In: Allen, R., Hunter, J., Zeng, M. (eds) Digital Libraries: Providing Quality Information. ICADL 2015. Lecture Notes in Computer Science, vol 9469. Springer, Cham. https://doi.org/10.1007/978-3-319-27974-9_18


  • DOI: https://doi.org/10.1007/978-3-319-27974-9_18


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-27973-2

  • Online ISBN: 978-3-319-27974-9

  • eBook Packages: Computer Science, Computer Science (R0)
