Abstract
With the web at close to a billion pages and growing at an exponential rate, we are faced with the issue of rating pages in terms of quality and trust. In this situation, what other pages say about a web page can be as important as what the page says about itself. The cumulative knowledge of these types of recommendations (or the lack thereof) can be objective enough to help a user or robot program to decide whether or not to pursue a web document. In addition, these annotations or metadata can be used by a web robot program to derive summary information about web documents that are written in a language that the robot does not understand. We use this idea to drive a web information gathering system that forms the core of a topic-specific search engine.
In this paper, we describe how our system uses metadata about hyperlinks to guide its crawl of the web. It sifts through the information related to a particular topic and avoids traversing links that are unlikely to be of interest, so the guided crawler stays focused on the target topic. Along the way it builds a rich repository of link information, including metadata, which ultimately serves a search engine.
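The abstract does not spell out how links are scored or stored, so the following is only a minimal sketch of the general idea, not the authors' system. It assumes that the "metadata about a hyperlink" is simply its anchor text, that relevance is the fraction of topic terms found in that text, and that the web is a small in-memory dictionary. The names `TOY_WEB`, `anchor_score`, `guided_crawl`, and the `threshold` parameter are all hypothetical.

```python
import heapq

# Hypothetical in-memory "web": URL -> list of (target URL, anchor text).
# A real crawler would fetch and parse pages instead.
TOY_WEB = {
    "a.example/start": [
        ("a.example/xml-intro", "XML tutorial for beginners"),
        ("a.example/recipes", "favourite pasta recipes"),
    ],
    "a.example/xml-intro": [
        ("b.example/rdf", "RDF and XML metadata"),
        ("b.example/cats", "pictures of cats"),
    ],
    "a.example/recipes": [("c.example/xml-cake", "cake")],
    "b.example/rdf": [],
    "b.example/cats": [],
    "c.example/xml-cake": [],
}


def anchor_score(anchor_text, topic_terms):
    """Fraction of topic terms that appear in the anchor text (assumed metric)."""
    words = set(anchor_text.lower().split())
    return len(words & topic_terms) / len(topic_terms)


def guided_crawl(seed, topic_terms, web, threshold=0.5):
    """Best-first crawl that follows only links whose anchor text looks on-topic.

    Returns (visit order, repository), where the repository maps each
    discovered target URL to the link metadata recorded for it.
    """
    topic_terms = {t.lower() for t in topic_terms}
    frontier = [(-1.0, seed)]  # max-heap via negated scores; seed is trusted
    visited, order, repository = set(), [], {}
    while frontier:
        _, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        order.append(url)
        for target, anchor in web.get(url, []):
            score = anchor_score(anchor, topic_terms)
            # Record what this page "says about" the target, followed or not.
            repository.setdefault(target, []).append(
                {"source": url, "anchor": anchor, "score": score}
            )
            if score >= threshold and target not in visited:
                heapq.heappush(frontier, (-score, target))
    return order, repository
```

With the topic terms `{"xml", "rdf"}` and the seed `a.example/start`, this sketch visits the start page, then `a.example/xml-intro`, then `b.example/rdf`. The recipes page is recorded in the repository but never fetched, so the link behind it is never discovered. That is the trade-off of pruning on link metadata alone: the crawl stays on topic but can miss relevant pages that are only reachable through poorly described links.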
Copyright information
© 2001 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Yi, J., Sundaresan, N., Huang, A. (2001). Using Metadata to Enhance Web Information Gathering. In: Goos, G., Hartmanis, J., van Leeuwen, J., Suciu, D., Vossen, G. (eds) The World Wide Web and Databases. WebDB 2000. Lecture Notes in Computer Science, vol 1997. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45271-0_3
DOI: https://doi.org/10.1007/3-540-45271-0_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-41826-9
Online ISBN: 978-3-540-45271-3
eBook Packages: Springer Book Archive