Link Information as a Similarity Measure in Web Classification

  • Marco Cristo
  • Pável Calado
  • Edleno Silva de Moura
  • Nivio Ziviani
  • Berthier Ribeiro-Neto
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2857)

Abstract

The objective of this paper is to study how the link structure of the Web can be used to derive a similarity measure between documents. We evaluate five different measures and determine how accurately they predict the subject of Web pages. Experiments with a Web directory indicate that the use of links from external pages greatly increases the quality of the results. Gains as high as 45.9 points in F1 were obtained when compared to a text-based classifier. Among the similarity measures tested in this work, co-citation performed best in determining whether two Web pages are related. This work provides important insight into how similarity measures can be derived from links and applied to Web IR problems.
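
As a rough illustration of the co-citation idea mentioned in the abstract, the sketch below scores two pages by how many in-linking pages they share. The normalization by the union of in-linking pages (a Jaccard-style ratio) is an assumption made for this example and is not necessarily the exact formula used in the paper; the toy link graph and the names `inlinks` and `cocitation` are hypothetical.

```python
def inlinks(graph):
    """Invert a page -> linked-pages mapping into page -> set of pages linking to it."""
    parents = {}
    for src, targets in graph.items():
        for dst in targets:
            parents.setdefault(dst, set()).add(src)
    return parents


def cocitation(parents_a, parents_b):
    """Shared in-linking pages divided by all in-linking pages (assumed normalization)."""
    union = parents_a | parents_b
    if not union:
        return 0.0  # neither page is linked to, so no evidence of relatedness
    return len(parents_a & parents_b) / len(union)


# Hypothetical toy link graph: each key links to the pages in its set.
links = {
    "hub1": {"a", "b"},
    "hub2": {"a", "b", "c"},
    "hub3": {"c"},
}

parents = inlinks(links)
sim_ab = cocitation(parents["a"], parents["b"])  # a and b share both of their in-linking pages
sim_ac = cocitation(parents["a"], parents["c"])  # a and c share only hub2
```

In this toy graph, pages `a` and `b` are always linked to together and so score higher than `a` and `c`, which share a single in-linking page. A classifier could then, for instance, assign a page the class most common among its highest-scoring neighbours.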

Keywords

Similarity Measure, Link Structure, Companion Algorithm, Link Information, Internal Link
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2003

Authors and Affiliations

  • Marco Cristo (1, 2)
  • Pável Calado (1)
  • Edleno Silva de Moura (3)
  • Nivio Ziviani (1)
  • Berthier Ribeiro-Neto (1)

  1. Computer Science Department, Federal University of Minas Gerais, Belo Horizonte, Brazil
  2. Fucapi, Technology Foundation, Manaus, Brazil
  3. Computer Science Department, Federal University of Amazonas, Manaus, Brazil
