TrieIR: Indexing and Retrieval Engine for Kannada Unicode Text
Kannada is a phonetic language. In Kannada, the morphological forms of a term (especially of nouns and verbs) are built by attaching different morphological suffixes to its base form. As a result, when a query contains a morphological form, search engines that rely on exact matching fail to find terms that are semantically similar but morphologically different, which lowers the quality of the search results. We observe that although the morphological forms of a term look different, they can be grouped together by their common prefixes. In this work we propose indexing and retrieval algorithms based on fuzzy matching. The indexing mechanism is inspired by prefix trees. It also draws on the fact that Unicode encodes Kannada terms in much the same way that Kannada grammar generates them. For retrieval, we discuss an algorithm that truncates the query term and applies a decayed score, so that documents are retrieved more effectively for a given query. The indexing and retrieval systems are still based on tf-idf; the novelty of the work lies in the way the algorithms bring similar terms together. The solution can be extended to other South Indian languages with little or no modification, since their Unicode encoding and morphological behavior are similar to those of Kannada.
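The abstract gives no algorithmic detail, so the following is only a minimal sketch of how a prefix-tree index combined with query-term truncation and a decayed tf-idf score might look. It is not the authors' implementation. The class name `PrefixIndex`, the parameters `decay` and `min_prefix`, the whitespace tokenizer, the per-step geometric decay, and the particular idf formula are all assumptions made for illustration.

```python
import math
from collections import defaultdict


class PrefixIndex:
    """Hypothetical prefix-tree index over Unicode code points (illustrative only)."""

    def __init__(self):
        self.trie = {}  # nested dicts keyed by code point; key None marks a term end
        self.postings = defaultdict(lambda: defaultdict(int))  # term -> doc_id -> tf
        self.num_docs = 0

    def add_document(self, doc_id, text):
        self.num_docs += 1
        for term in text.split():  # assumption: whitespace tokenization
            self.postings[term][doc_id] += 1
            node = self.trie
            for ch in term:
                node = node.setdefault(ch, {})
            node[None] = term

    def terms_with_prefix(self, prefix):
        """Return every indexed term that starts with `prefix`."""
        node = self.trie
        for ch in prefix:
            if ch not in node:
                return []
            node = node[ch]
        found, stack = [], [node]
        while stack:
            current = stack.pop()
            for key, child in current.items():
                if key is None:
                    found.append(child)
                else:
                    stack.append(child)
        return found

    def search(self, query_term, decay=0.5, min_prefix=3):
        """Truncate the query one code point at a time; each truncation
        multiplies the weight of newly matched terms by `decay` (assumed scheme)."""
        scores = defaultdict(float)
        seen = set()
        weight = 1.0
        for end in range(len(query_term), min_prefix - 1, -1):
            for term in self.terms_with_prefix(query_term[:end]):
                if term in seen:
                    continue  # keep the weight from the longest matching prefix
                seen.add(term)
                docs = self.postings[term]
                idf = math.log(1 + self.num_docs / len(docs))
                for doc_id, tf in docs.items():
                    scores[doc_id] += weight * tf * idf
            weight *= decay
        return sorted(scores.items(), key=lambda item: item[1], reverse=True)


index = PrefixIndex()
index.add_document("d1", "ಮನೆಯ")   # "of the house"
index.add_document("d2", "ಮನೆಗೆ")   # "to the house"
index.add_document("d3", "ಮರ")     # "tree"
ranking = index.search("ಮನೆಯಲ್ಲಿ")  # "in the house", a form absent from the index
```

In this sketch the query form does not occur in any document, yet truncating it reaches the shared prefix ಮನೆ and retrieves both documents that contain other forms of the same noun. The document whose term shares the longer prefix with the query receives the higher score, and the unrelated term ಮರ is never matched because truncation stops at `min_prefix` code points.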