A Search Engine for Indian Languages

  • Ashwani Mujoo
  • Manoj Kumar Malviya
  • Rajat Moona
  • T V Prabhakar
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 1875)

Abstract

There is a great need for search engines that handle web documents written in languages other than English. In this paper, we describe the design issues of a search engine for Indian languages. We also describe the implementation of two such search engines, one for documents in ISCII and the other for documents in Unicode. The software allows full-text indexing and searching of a database of documents written in any Brahmi-based Indian language. The search engine gathers HTML documents from the web, indexes and compresses them, and then searches them for the given keywords. Its main features are phonetic tolerance, morphological analysis, compression and indexing, leading and trailing substring matches for keywords, and search through compressed documents. The implementation includes a search server architecture that can be accessed from a WYSIWYG front end implemented as a Java Swing applet. Performance results show that the search engine achieves a compression of almost 80 percent with appreciable precision and recall.
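To make the "leading and trailing substring matches" feature concrete, the following is a minimal sketch of how such lookups can be served from an inverted index. It is not the implementation described in the paper: the class `TinyIndex`, its method names, and the whitespace tokenizer are assumptions made for illustration, and compression, phonetic tolerance and morphological analysis are left out. Leading matches use a sorted term list, and trailing matches use a second sorted list of reversed terms.

```python
from bisect import bisect_left
from collections import defaultdict


class TinyIndex:
    """Toy inverted index (hypothetical, for illustration only)."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of document ids
        self._terms = []                  # sorted terms, built by finalize()
        self._reversed = []               # sorted reversed terms

    def add(self, doc_id, text):
        # Assumption: whitespace tokenization is good enough for the sketch.
        for term in text.split():
            self.postings[term].add(doc_id)

    def finalize(self):
        # Call once after all documents have been added.
        self._terms = sorted(self.postings)
        self._reversed = sorted(term[::-1] for term in self.postings)

    @staticmethod
    def _with_prefix(sorted_terms, prefix):
        # All strings sharing a prefix are contiguous in sorted order.
        i = bisect_left(sorted_terms, prefix)
        matches = []
        while i < len(sorted_terms) and sorted_terms[i].startswith(prefix):
            matches.append(sorted_terms[i])
            i += 1
        return matches

    def _docs(self, terms):
        docs = set()
        for term in terms:
            docs |= self.postings[term]
        return docs

    def leading(self, prefix):
        """Documents containing a term that starts with `prefix`."""
        return self._docs(self._with_prefix(self._terms, prefix))

    def trailing(self, suffix):
        """Documents containing a term that ends with `suffix`."""
        hits = self._with_prefix(self._reversed, suffix[::-1])
        return self._docs(term[::-1] for term in hits)


idx = TinyIndex()
idx.add(1, "भारत देश")
idx.add(2, "भारतीय भाषा")
idx.add(3, "महाभारत कथा")
idx.finalize()
print(idx.leading("भारत"))   # documents with a term beginning with the keyword
print(idx.trailing("भारत"))  # documents with a term ending with the keyword
```

The sketch compares Unicode code points directly, so a keyword matches only when it is encoded the same way as the indexed text. A real system for Brahmi-based scripts would normally normalize the text before indexing.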

Copyright information

© Springer-Verlag Berlin Heidelberg 2000

Authors and Affiliations

  • Ashwani Mujoo (1)
  • Manoj Kumar Malviya (1)
  • Rajat Moona (1)
  • T V Prabhakar (1)

  1. Department of Computer Science and Engineering, Indian Institute of Technology, Kanpur, India
