
Information Discovery on the World-Wide-Web

  • Chapter
Multimedia Information Retrieval and Management

Part of the book series: Signals and Communication Technology (SCT)


Abstract

This chapter discusses data-mining techniques for web-related data, in particular techniques that help information seekers locate relevant information on the web. Two kinds of techniques are covered: web-structure mining and web-log mining. We examine three techniques that analyze the link structure of hypertext web pages: authorities and hubs [10], anchor points [9], and PageRank [13]. Because the web is huge and dynamic, no IR system can maintain a global view of it, so recommendations of web information have to be based on incomplete information. We discuss the idea of Internet GlOSS [2], which uses word statistics to make intelligent guesses about the topics of interest of web sites. We also discuss how the interests of web users can be abstracted into user profiles. Understanding both web users and web sites allows the two to be matched effectively. Finally, we explain how mining web-log data can uncover both the topics of interest of web sites and user profiles.


References

  1. V.N. Gudivada. Information Retrieval on the World Wide Web. IEEE Internet Computing, Vol. 1, No. 5, 1997, pp. 58–68.

  2. C.Y. Ng, Ben Kao, David Cheung. Text-Source Discovery and GlOSS Update in a Dynamic Web. Proceedings of the Fourth Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2000.

  3. W. Frakes, R. Baeza-Yates. Information Retrieval: Data Structures and Algorithms. Prentice-Hall, 1992.

  4. David Cheung, Ben Kao, Joseph Lee. Discovering User Access Patterns on the World-Wide Web. Knowledge-Based Systems, Elsevier Science, Vol. 10, No. 7, May 1998.

  5. A. Tomasic, L. Gravano, and H. Garcia-Molina. The Effectiveness of GlOSS for the Text-Database Discovery Problem. Proceedings of the 1994 ACM SIGMOD, 1994.

  6. http://www.google.com

  7. The Web Robots FAQ. http://info.webcrawler.com/mak/projects/robots/faq.html

  8. S. Feldman. Just the Answers, Please: Choosing a Web Search Service, The Magazine for Database Professionals, May 1997.

  9. Ben Kao, Joseph Lee, C.Y. Ng, and David Cheung. Anchor Point Indexing in Web Document Retrieval. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, Vol. 30, No. 3, 2000, pp. 364–373.

  10. J.M. Kleinberg. Authoritative Sources in a Hyperlinked Environment. Journal of the ACM, Vol. 46, 1999.

  11. B. Grossan. Search Engines: What They Are? How They Work? http://webreference.com/content/search/features.html

  12. J. Nielsen. The Art of Navigating Through Hypertext. Communications of the ACM, Vol. 33, No. 3, 1990, pp. 297–310.

  13. L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank Citation Ranking: Bringing Order to the Web. Technical report, Computer Systems Laboratory, Stanford University, 1998.

  14. R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Trawling the Web for Emerging Cyber-communities. Proceedings of the Eighth International Conference on the World-Wide Web, 1999.

  15. J. Dean and M.R. Henzinger. Finding Related Pages in the World-Wide-Web. Proceedings of the Eighth International Conference on the World-Wide Web, 1999.

  16. D.L. Lee et al. Document Ranking and the Vector-Space Model. IEEE Software, Vol. 14, No. 2, Mar/Apr 1997, pp. 67–75.

  17. G. Salton. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley, Reading, Mass., 1989.

  18. L. Gravano et al. The Efficacy of GlOSS for the Text Database Discovery Problem. ACM SIGMOD’94, 1994.

  19. L. Gravano et al. Precision and Recall of GlOSS Estimators for Database Discovery. PDIS’94, 1994.

  20. L. Gravano et al. Generalizing GlOSS to Vector-Space Databases and Broker Hierarchies. VLDB’95, May 1995.

  21. J. Hartigan. Clustering Algorithms. Wiley, New York, 1975.

  22. A. Arasu, J. Cho, H. Garcia-Molina, A. Paepcke and S. Raghavan. Searching the Web. ACM Transactions on Internet Technology, Vol. 1, No. 1, Aug 2001, pp. 2–43.

Copyright information

© 2003 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Kao, B., Cheung, D. (2003). Information Discovery on the World-Wide-Web. In: Feng, D.D., Siu, WC., Zhang, HJ. (eds) Multimedia Information Retrieval and Management. Signals and Communication Technology. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-05300-3_14

  • DOI: https://doi.org/10.1007/978-3-662-05300-3_14

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-05533-1

  • Online ISBN: 978-3-662-05300-3

  • eBook Packages: Springer Book Archive
