Metadata Based Web Mining for Topic-Specific Information Gathering

  • Jeonghee Yi
  • Neel Sundaresan
  • Anita Huang
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 1875)


As the World Wide Web grows at an exponential rate, we face the problem of rating pages in terms of quality and trust. In this situation, with significant linkage among web pages, what other pages say about a web page can be as important as, and more objective than, what the page says about itself. The cumulative knowledge of such recommendations (or the lack of them) can help a system decide whether or not to pursue a page. A web robot can also use this metadata, for example, to derive summary information about web documents written in a foreign language. In this paper, we describe how we exploit this type of metadata to drive a web information gathering system that forms the backend of a topic-specific search engine. The system uses metadata from hyperlinks to guide its crawl of the web so that it stays focused on a target topic: the crawler follows links that point to information related to the topic and avoids links to irrelevant pages. Moreover, the system uses the metadata to improve its definition of the target topic through association mining. Ultimately, the guided crawling system builds a rich repository of metadata, which is used to serve the search engine.
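
The crawling idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes an in-memory toy link graph (`TOY_WEB`), a made-up relevance rule (the fraction of anchor-text words that are topic terms), and a simple co-occurrence count standing in for association mining. All names, the threshold, and the support value are hypothetical.

```python
import heapq
from collections import Counter

# Hypothetical toy web: page -> list of (anchor_text, target_page).
TOY_WEB = {
    "seed": [("xml parser tutorial", "a"), ("celebrity gossip", "b")],
    "a": [("xml schema tools", "c"), ("cheap flights", "d")],
    "b": [("movie reviews", "e")],
    "c": [],
    "d": [],
    "e": [],
}


def relevance(anchor_text, topic_terms):
    """Fraction of anchor words that are topic terms (an assumed scoring rule)."""
    words = anchor_text.lower().split()
    if not words:
        return 0.0
    return sum(w in topic_terms for w in words) / len(words)


def focused_crawl(web, seed, topic_terms, threshold=0.3):
    """Visit pages best-first, following only links whose anchor text scores
    at or above the threshold."""
    frontier = [(-1.0, seed)]  # heapq is a min-heap, so scores are negated
    seen = {seed}
    visited = []
    while frontier:
        _, page = heapq.heappop(frontier)
        visited.append(page)
        for anchor, target in web.get(page, []):
            score = relevance(anchor, topic_terms)
            if score >= threshold and target not in seen:
                seen.add(target)
                heapq.heappush(frontier, (-score, target))
    return visited


def expand_topic(anchors, topic_terms, min_support=2):
    """Simplified stand-in for association mining: return words that co-occur
    with a topic term in at least min_support anchor texts."""
    counts = Counter()
    for anchor in anchors:
        words = set(anchor.lower().split())
        if words & topic_terms:
            counts.update(words - topic_terms)
    return {w for w, n in counts.items() if n >= min_support}


if __name__ == "__main__":
    print(focused_crawl(TOY_WEB, "seed", {"xml"}))
    print(expand_topic(
        ["xml parser tutorial", "xml schema tools", "xml parser download"],
        {"xml"},
    ))
```

In this sketch the crawler never visits pages reached only through off-topic anchor text, and terms returned by `expand_topic` could be added to `topic_terms` before the next crawl round. A real system would fetch pages over the network and use a proper association-rule miner.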


Resource Description Framework, Relevance Score, Candidate Term, Anchor Text, Entire Text
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2000

Authors and Affiliations

  • Jeonghee Yi (1)
  • Neel Sundaresan (2, 3)
  • Anita Huang (2)
  1. Computer Science, University of California, Los Angeles, USA
  2. IBM Almaden Research Center, San Jose, USA
  3. NehaNet Corp., San Jose, USA
