Using Metadata to Enhance Web Information Gathering

  • Conference paper
The World Wide Web and Databases (WebDB 2000)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 1997)

Abstract

With the web at close to a billion pages and growing at an exponential rate, we are faced with the issue of rating pages in terms of quality and trust. In this situation, what other pages say about a web page can be as important as what the page says about itself. The cumulative knowledge of these types of recommendations (or the lack thereof) can be objective enough to help a user or robot program to decide whether or not to pursue a web document. In addition, these annotations or metadata can be used by a web robot program to derive summary information about web documents that are written in a language that the robot does not understand. We use this idea to drive a web information gathering system that forms the core of a topic-specific search engine.

In this paper, we describe how our system uses metadata about hyperlinks to guide its crawl of the web. It sifts through information related to a particular topic and avoids traversing links that are unlikely to be of interest, so the guided crawler stays focused on the target topic. Along the way it builds a rich repository of link information, including metadata, that ultimately serves a search engine.
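
The idea of letting link metadata steer a crawl can be illustrated with a minimal sketch. It is not the authors' system: the toy in-memory web (`TOY_WEB`), the word-overlap `relevance` score, the `threshold` value, and the function name `guided_crawl` are all assumptions made for illustration. Anchor text stands in for "metadata about the hyperlinks", low-scoring links are recorded but not followed, and every link seen is stored in a repository keyed by target page.

```python
import heapq
from collections import defaultdict

# Hypothetical in-memory "web": page URL -> list of (anchor_text, target_url).
TOY_WEB = {
    "seed": [("XML tutorials and tools", "xml-hub"), ("celebrity gossip", "gossip")],
    "xml-hub": [("XML schema validator", "validator"), ("cheap flights", "travel")],
    "gossip": [("more gossip", "gossip2")],
    "validator": [],
    "travel": [],
    "gossip2": [],
}


def relevance(anchor_text, topic_terms):
    """Assumed scoring rule: fraction of anchor-text words that are topic terms."""
    words = anchor_text.lower().split()
    if not words:
        return 0.0
    return sum(w in topic_terms for w in words) / len(words)


def guided_crawl(web, seed, topic_terms, threshold=0.2):
    """Visit pages best-first by link score; do not follow links below threshold."""
    repository = defaultdict(list)  # target URL -> [(source URL, anchor text, score)]
    frontier = [(0.0, seed)]        # heapq is a min-heap, so scores are negated
    seen = {seed}
    visited = []
    while frontier:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        for anchor, target in web.get(url, []):
            score = relevance(anchor, topic_terms)
            # Record what this page "says about" the target, followed or not.
            repository[target].append((url, anchor, score))
            if score >= threshold and target not in seen:
                seen.add(target)
                heapq.heappush(frontier, (-score, target))
    return visited, dict(repository)


if __name__ == "__main__":
    pages, link_repo = guided_crawl(TOY_WEB, "seed", {"xml", "schema"})
    print(pages)
    print(sorted(link_repo))
```

In this sketch, off-topic pages are never fetched, yet the repository still holds the anchor text that pointed at them, which is the kind of third-party description the abstract suggests can summarize a page the crawler has not read.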




Copyright information

© 2001 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Yi, J., Sundaresan, N., Huang, A. (2001). Using Metadata to Enhance Web Information Gathering. In: Goos, G., Hartmanis, J., van Leeuwen, J., Suciu, D., Vossen, G. (eds) The World Wide Web and Databases. WebDB 2000. Lecture Notes in Computer Science, vol 1997. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45271-0_3

  • DOI: https://doi.org/10.1007/3-540-45271-0_3

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-41826-9

  • Online ISBN: 978-3-540-45271-3

  • eBook Packages: Springer Book Archive
