
Link Information as a Similarity Measure in Web Classification

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2857)

Abstract

The objective of this paper is to study how the link structure of the Web can be used to derive a similarity measure between documents. We evaluate five different measures and determine how accurately they predict the subject of Web pages. Experiments with a Web directory indicate that using links from external pages greatly increases the quality of the results. Gains as high as 45.9 points in F1 were obtained when compared to a text-based classifier. Among the similarity measures tested in this work, co-citation performed best in determining whether two Web pages are related. This work provides important insight into how similarity measures can be derived from links and applied to Web IR problems.
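As a rough illustration of the co-citation idea mentioned in the abstract, the sketch below scores two pages by how many linking pages they share. The Jaccard-style normalization (shared in-links divided by all in-links of either page), the function names `inlink_sets` and `cocitation`, and the toy link graph are assumptions made for this example; the abstract does not give the exact formula used in the paper.

```python
from collections import defaultdict


def inlink_sets(edges):
    """Map each target page to the set of pages that link to it."""
    parents = defaultdict(set)
    for source, target in edges:
        parents[target].add(source)
    return parents


def cocitation(parents, a, b):
    """Jaccard-style co-citation score (an assumed normalization).

    Returns the number of pages linking to both a and b, divided by
    the number of pages linking to at least one of them.
    """
    pa = parents.get(a, set())
    pb = parents.get(b, set())
    union = pa | pb
    if not union:
        return 0.0  # neither page has any in-links
    return len(pa & pb) / len(union)


# Hypothetical toy graph: (source page, target page) link pairs.
edges = [
    ("hub1", "a"), ("hub1", "b"),
    ("hub2", "a"), ("hub2", "b"),
    ("hub3", "a"), ("hub3", "c"),
]
parents = inlink_sets(edges)

print(cocitation(parents, "a", "b"))  # two shared in-links out of three
print(cocitation(parents, "a", "c"))  # one shared in-link out of three
print(cocitation(parents, "b", "c"))  # no shared in-links
```

In a classification setting, scores like these could feed a nearest-neighbour style classifier that assigns a page the category of the pages it is most often co-cited with; that pairing is also only an assumption here.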




Copyright information

© 2003 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Cristo, M., Calado, P., de Moura, E.S., Ziviani, N., Ribeiro-Neto, B. (2003). Link Information as a Similarity Measure in Web Classification. In: Nascimento, M.A., de Moura, E.S., Oliveira, A.L. (eds) String Processing and Information Retrieval. SPIRE 2003. Lecture Notes in Computer Science, vol 2857. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-39984-1_4


  • DOI: https://doi.org/10.1007/978-3-540-39984-1_4

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-20177-9

  • Online ISBN: 978-3-540-39984-1

  • eBook Packages: Springer Book Archive
