Abstract
The objective of this paper is to study how the link structure of the Web can be used to derive a similarity measure between documents. We evaluate five different measures and determine how accurate they are in predicting the subject of Web pages. Experiments with a Web directory indicate that the use of links from external pages greatly increases the quality of the results. Gains as high as 45.9 points in F1 were obtained when compared to a text-based classifier. Among the similarity measures tested in this work, co-citation presented the best performance in determining whether two Web pages are related. This work provides important insight into how similarity measures can be derived from links and applied to Web IR problems.
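To make the idea of a link-based similarity measure concrete, the following is a minimal sketch of co-citation similarity, in which two pages count as related when many of the same pages link to both. The abstract does not give the formula, so the Jaccard-style normalisation used here is an assumption, as are the function names `build_inlinks` and `cocitation` and the toy link graph.

```python
from collections import defaultdict


def build_inlinks(edges):
    """Map each target page to the set of pages that link to it."""
    inlinks = defaultdict(set)
    for source, target in edges:
        inlinks[target].add(source)
    return inlinks


def cocitation(inlinks, a, b):
    """Share of linking pages that a and b have in common.

    Assumed normalisation: |P_a & P_b| / |P_a | P_b|, where P_x is the
    set of pages that link to x. Returns 0.0 when neither page has inlinks.
    """
    parents_a = inlinks.get(a, set())
    parents_b = inlinks.get(b, set())
    union = parents_a | parents_b
    if not union:
        return 0.0
    return len(parents_a & parents_b) / len(union)


# Hypothetical link graph as (source, target) pairs.
edges = [
    ("p1", "a"), ("p1", "b"),
    ("p2", "a"), ("p2", "b"),
    ("p3", "a"),
    ("p4", "c"),
]
inlinks = build_inlinks(edges)
print(cocitation(inlinks, "a", "b"))  # two shared parents out of three
print(cocitation(inlinks, "a", "c"))  # no shared parents
```

In a classification setting, scores of this kind could be used to find the labelled pages most similar to an unlabelled one, for example in a nearest-neighbour scheme; whether that matches the classifier used in the paper cannot be determined from the abstract alone.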
Copyright information
© 2003 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Cristo, M., Calado, P., de Moura, E.S., Ziviani, N., Ribeiro-Neto, B. (2003). Link Information as a Similarity Measure in Web Classification. In: Nascimento, M.A., de Moura, E.S., Oliveira, A.L. (eds) String Processing and Information Retrieval. SPIRE 2003. Lecture Notes in Computer Science, vol 2857. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-39984-1_4
DOI: https://doi.org/10.1007/978-3-540-39984-1_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-20177-9
Online ISBN: 978-3-540-39984-1
eBook Packages: Springer Book Archive