TrieIR: Indexing and Retrieval Engine for Kannada Unicode Text

  • Sumant Kulkarni
  • Srinath Srinivasa
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8279)

Abstract

Kannada is a phonetic language. In Kannada, the morphological forms of terms (especially nouns and verbs) are formed by adding morphological suffixes to their pure forms. Hence, when queried with one morphological form, search engines based on exact matching fail to identify other terms that are semantically similar but morphologically different, which reduces the quality of the search results. We observe that even though the morphological forms of a term look different, they can be grouped together based on their common prefixes. In this work we propose indexing and retrieval algorithms based on fuzzy matching. The indexing mechanism is inspired by prefix trees, and also by the fact that Unicode encodes Kannada terms in a way very similar to how the terms are generated using Kannada grammar. We also discuss a retrieval algorithm based on query term truncation and decayed scores, which improves retrieval of documents for a given query. The indexing and retrieval systems are still based on tf-idf; the novelty of the work lies in the way the algorithms bring similar terms together. This solution can be adapted to other South Indian languages with little or no modification, as their Unicode encoding and morphological behavior are similar to those of Kannada.
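
The abstract does not spell out the algorithms, so the following Python sketch is only one plausible reading of the idea, not the authors' implementation. It stores terms in a prefix tree keyed by Unicode code points, then answers a query by truncating the query term one code point at a time and multiplying the tf-idf contribution of each newly matched term by a decay factor per truncated code point. The names `PrefixIndex`, `search`, `decay` and `min_prefix_len`, the decay value 0.5, and the smoothed idf formula are all assumptions made for illustration.

```python
import math
from collections import defaultdict


class TrieNode:
    def __init__(self):
        self.children = {}                # code point -> TrieNode
        self.postings = defaultdict(int)  # doc_id -> frequency of the term ending here


class PrefixIndex:
    """Hypothetical prefix-tree index over whitespace-separated terms."""

    def __init__(self):
        self.root = TrieNode()
        self.doc_ids = set()

    def add_document(self, doc_id, text):
        self.doc_ids.add(doc_id)
        for term in text.split():
            node = self.root
            for ch in term:  # Python iterates a str by Unicode code point
                node = node.children.setdefault(ch, TrieNode())
            node.postings[doc_id] += 1

    def _find(self, prefix):
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def _terms_below(self, node, prefix):
        # Yield (term, postings) for every indexed term that starts with prefix.
        stack = [(node, prefix)]
        while stack:
            cur, term = stack.pop()
            if cur.postings:
                yield term, cur.postings
            for ch, child in cur.children.items():
                stack.append((child, term + ch))

    def search(self, query_term, decay=0.5, min_prefix_len=2):
        # decay and min_prefix_len are illustrative assumptions, not values from the paper.
        scores = defaultdict(float)
        seen_terms = set()
        n_docs = len(self.doc_ids)
        for cut in range(len(query_term) - min_prefix_len + 1):
            prefix = query_term[:len(query_term) - cut]
            node = self._find(prefix)
            if node is None:
                continue
            weight = decay ** cut  # each truncated code point lowers the score
            for term, postings in self._terms_below(node, prefix):
                if term in seen_terms:
                    continue  # already scored at a longer, better-matching prefix
                seen_terms.add(term)
                idf = math.log(1 + n_docs / len(postings))  # smoothed idf (assumption)
                for doc_id, tf in postings.items():
                    scores[doc_id] += weight * tf * idf
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


index = PrefixIndex()
index.add_document("d1", "ಮನೆಯ")   # "of the house"
index.add_document("d2", "ಮನೆಗೆ")  # "to the house"
index.add_document("d3", "ಮರ")     # "tree", an unrelated term
results = index.search("ಮನೆಯ")
```

In this toy index, the exact form in `d1` is matched without truncation, the related form in `d2` is reached only after one code point is dropped and so receives a decayed score, and the unrelated term in `d3` is never matched because truncation stops at `min_prefix_len`.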

Copyright information

© Springer International Publishing Switzerland 2013

Authors and Affiliations

  • Sumant Kulkarni (1)
  • Srinath Srinivasa (1)
  1. International Institute of Information Technology, Bangalore, India
