Using Metadata to Enhance Web Information Gathering

Yi, Jeonghee; Sundaresan, Neel; Huang, Anita

doi:10.1007/3-540-45271-0_3

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 1997)

Included in the following conference series:

International Workshop on the World Wide Web and Databases

Abstract

With the web approaching a billion pages and growing exponentially, rating pages for quality and trust has become a pressing problem. In this situation, what other pages say about a web page can be as important as what the page says about itself. The cumulative knowledge of such recommendations (or the lack thereof) can be objective enough to help a user or a robot program decide whether to pursue a web document. In addition, a web robot can use these annotations, or metadata, to derive summary information about web documents written in a language that the robot does not understand. We use this idea to drive a web information gathering system that forms the core of a topic-specific search engine.

In this paper, we describe how our system uses metadata about hyperlinks to guide its crawl of the web. It sifts through information related to a particular topic and avoids traversing links that are unlikely to be of interest, so the guided crawler stays focused on the target topic. Along the way it builds a rich repository of link information, including metadata, which ultimately serves a search engine.
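
The abstract does not specify the crawling algorithm, so the following is only a minimal sketch of the general idea of metadata-guided crawling, not the authors' system. It assumes that the link metadata is simply the anchor text, that relevance is the fraction of anchor words found in a topic vocabulary, and that the web is a small in-memory graph. The names `TOY_WEB`, `link_score`, `guided_crawl`, and the `threshold` parameter are all hypothetical.

```python
import heapq

# Hypothetical toy web: page -> list of (anchor_text, target_page).
# Anchor text stands in for "metadata about hyperlinks" (an assumption).
TOY_WEB = {
    "seed": [("XML tutorial", "a"), ("cheap flights", "b"),
             ("RDF and XML schema", "c")],
    "a": [("XML parsers", "d"), ("holiday photos", "e")],
    "b": [("XML news", "f")],
    "c": [("RDF metadata", "g")],
    "d": [], "e": [], "f": [], "g": [],
}


def link_score(anchor_text, topic_terms):
    """Fraction of anchor-text words that are topic terms (a simplistic, assumed measure)."""
    words = anchor_text.lower().split()
    if not words:
        return 0.0
    return sum(w in topic_terms for w in words) / len(words)


def guided_crawl(web, seed, topic_terms, threshold=0.3):
    """Visit pages best-first by link score, skipping links scored below threshold.

    Returns (visit_order, repository). The repository maps each page reached
    through an on-topic link to the anchor texts other pages used for it.
    """
    frontier = [(-1.0, seed)]  # heapq is a min-heap, so scores are negated
    seen = {seed}
    visited = []
    repository = {}
    while frontier:
        _, page = heapq.heappop(frontier)
        visited.append(page)
        for anchor, target in web.get(page, []):
            score = link_score(anchor, topic_terms)
            if score < threshold:
                continue  # prune links that look off topic
            repository.setdefault(target, []).append(anchor)
            if target not in seen:
                seen.add(target)
                heapq.heappush(frontier, (-score, target))
    return visited, repository


if __name__ == "__main__":
    order, repo = guided_crawl(TOY_WEB, "seed", {"xml", "rdf", "metadata"})
    print(order)
    print(repo)
```

In this toy graph the link labelled "cheap flights" is pruned, so page `b` is never fetched and page `f` behind it is never discovered, even though the link to `f` is itself on topic. That is the trade-off of pruning on link metadata alone: the crawler stays focused but can miss relevant pages reachable only through off-topic ones.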

Author information

Authors and Affiliations

Computer Science, University of California, 405 Hilgard Ave., Los Angeles, CA 90095, USA
Jeonghee Yi
NehaNet Corp., 2533 Paragon Dr., Suite E, San Jose, CA 95131, USA
Neel Sundaresan
IBM Almaden Research Center, 650 Harry Rd., San Jose, CA 95120, USA
Anita Huang

Editor information

Editors and Affiliations

Karlsruhe University, Germany
Gerhard Goos
Cornell University, NY, USA
Juris Hartmanis
Utrecht University, The Netherlands
Jan van Leeuwen
Computer Science and Engineering, University of Washington, Seattle, WA 98195-2350, USA
Dan Suciu
Universität Münster, Wirtschaftsinformatik, Steinfurter Str. 109, 48149 Münster, Germany
Gottfried Vossen

About this paper

Cite this paper

Yi, J., Sundaresan, N., Huang, A. (2001). Using Metadata to Enhance Web Information Gathering. In: Goos, G., Hartmanis, J., van Leeuwen, J., Suciu, D., Vossen, G. (eds) The World Wide Web and Databases. WebDB 2000. Lecture Notes in Computer Science, vol 1997. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45271-0_3

DOI: https://doi.org/10.1007/3-540-45271-0_3
Published: 22 June 2001
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-41826-9
Online ISBN: 978-3-540-45271-3
eBook Packages: Springer Book Archive
