A Vertical Search Engine for School Information Based on Heritrix and Lucene

  • Hyo-Bong Lee
  • Franco Nazareno
  • Seung-Hyun Jung
  • Wan-Sup Cho
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6935)

Abstract

Web content is growing exponentially as Internet applications and services continue to develop and expand. While people enjoy the convenience of the Internet, they face the problem of obtaining useful information from this vast content quickly and accurately. The immediate response to this problem is the web search engine. We developed a vertical search engine for a specific domain, namely a university. The search engine consists of a Crawler, an Indexer, and a Searcher. The crawler component is implemented with the Heritrix crawler, which is based on recursion and archiving. In the indexer component, a reusable, extensible subsystem for building and managing the index is designed and implemented with the open-source package Lucene. In an experiment on the Chungbuk National University web sites, the system retrieved, on average over a set of typical keywords, more than four hundred times as many documents as Google or the university's search engines.
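The three-component structure named in the abstract can be illustrated with a toy sketch. It is not the authors' implementation and uses neither Heritrix nor Lucene: the in-memory `PAGES` table, the host name `www.example-univ.edu`, and the functions `crawl`, `build_index`, and `search` are all hypothetical names introduced here. The sketch assumes that "vertical" means restricting the crawl to one host, and that the index is a plain inverted index queried with AND semantics.

```python
import re
from collections import defaultdict, deque
from urllib.parse import urlparse

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
PAGES = {
    "http://www.example-univ.edu/": (
        "University home admissions",
        ["http://www.example-univ.edu/admissions", "http://other.example.com/"],
    ),
    "http://www.example-univ.edu/admissions": (
        "Admissions schedule for graduate school",
        ["http://www.example-univ.edu/"],
    ),
    "http://other.example.com/": ("Unrelated admissions news", []),
}


def crawl(seed, allowed_host):
    """Crawler: breadth-first traversal that only follows links on allowed_host."""
    seen, queue, fetched = {seed}, deque([seed]), {}
    while queue:
        url = queue.popleft()
        if url not in PAGES:
            continue
        text, links = PAGES[url]
        fetched[url] = text
        for link in links:
            if link not in seen and urlparse(link).hostname == allowed_host:
                seen.add(link)
                queue.append(link)
    return fetched


def build_index(docs):
    """Indexer: map each lowercased term to the set of URLs containing it."""
    index = defaultdict(set)
    for url, text in docs.items():
        for term in re.findall(r"\w+", text.lower()):
            index[term].add(url)
    return index


def search(index, query):
    """Searcher: return the URLs that contain every query term, sorted."""
    terms = re.findall(r"\w+", query.lower())
    if not terms:
        return []
    hits = set.intersection(*(index.get(t, set()) for t in terms))
    return sorted(hits)


docs = crawl("http://www.example-univ.edu/", "www.example-univ.edu")
index = build_index(docs)
print(search(index, "graduate admissions"))
```

Because the crawl never leaves the chosen host, the off-domain page is never indexed, which is the property that distinguishes a vertical engine from a general-purpose one in this sketch.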

Keywords

Information retrieval, Search Engine, Web Crawling, Indexing

Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Hyo-Bong Lee (1)
  • Franco Nazareno (2)
  • Seung-Hyun Jung (3)
  • Wan-Sup Cho (1)

  1. Dept. of Management Information Systems, u-BIZ BK21, Chungbuk National University, Korea
  2. Dept. of Bio-Information Technology, Chungbuk National University, Korea
  3. Dept. of Information Industrial Engineering, Chungbuk National University, Korea
