Link Information as a Similarity Measure in Web Classification

Cristo, Marco; Calado, Pável; de Moura, Edleno Silva; Ziviani, Nivio; Ribeiro-Neto, Berthier

doi:10.1007/978-3-540-39984-1_4

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2857)

Abstract

The objective of this paper is to study how the link structure of the Web can be used to derive a similarity measure between documents. We evaluate five different measures and determine how accurately they predict the subject of Web pages. Experiments with a Web directory indicate that using links from external pages greatly increases the quality of the results. Gains of up to 45.9 points in F₁ were obtained over a text-based classifier. Among the similarity measures tested in this work, co-citation performed best at determining whether two Web pages are related. This work provides important insight into how similarity measures can be derived from links and applied to Web IR problems.
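
As a rough illustration of the ideas named in the abstract, the sketch below computes a co-citation score between two pages from a list of links, together with the F₁ measure used for evaluation. The abstract does not state how the paper normalizes co-citation, so the Jaccard-style normalization here (shared linking pages divided by all linking pages) is an assumption, and the function names and toy link graph are hypothetical.

```python
from collections import defaultdict


def build_parents(links):
    """Map each page to the set of pages that link to it."""
    parents = defaultdict(set)
    for source, target in links:
        parents[target].add(source)
    return parents


def cocitation(parents, a, b):
    """Co-citation score of pages a and b.

    Assumed normalization: number of pages linking to both a and b,
    divided by the number of pages linking to at least one of them.
    """
    pa = parents.get(a, set())
    pb = parents.get(b, set())
    union = pa | pb
    if not union:
        return 0.0  # neither page has in-links, so no evidence of relatedness
    return len(pa & pb) / len(union)


def f1(precision, recall):
    """F1 measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy link graph given as (source page, target page) pairs.
links = [
    ("hub1", "pageA"), ("hub1", "pageB"),
    ("hub2", "pageA"), ("hub2", "pageB"),
    ("hub3", "pageA"), ("hub3", "pageC"),
]
parents = build_parents(links)

print(cocitation(parents, "pageA", "pageB"))  # two shared parents out of three
print(cocitation(parents, "pageB", "pageC"))  # no shared parents
```

In this toy graph, pageA and pageB are linked together by two of the three pages that point to either one, so they score higher than pageB and pageC, which share no linking page. A classifier could use such scores in place of, or alongside, text similarity.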

Author information

Authors and Affiliations

Computer Science Department, Federal University of Minas Gerais, Belo Horizonte, MG, Brazil
Marco Cristo, Pável Calado, Nivio Ziviani & Berthier Ribeiro-Neto
Fucapi, Technology Foundation, Manaus, AM, Brazil
Marco Cristo
Computer Science Department, Federal University of Amazonas, Manaus, AM, Brazil
Edleno Silva de Moura

Editor information

Editors and Affiliations

Department of Computing Science, University of Alberta, Canada
Mario A. Nascimento
Universidade Federal do Amazonas, Manaus, AM, Brasil
Edleno S. de Moura
INESC-ID/IST, R. Alves Redol 9, 1000, Lisboa, Portugal
Arlindo L. Oliveira

About this paper

Cite this paper

Cristo, M., Calado, P., de Moura, E.S., Ziviani, N., Ribeiro-Neto, B. (2003). Link Information as a Similarity Measure in Web Classification. In: Nascimento, M.A., de Moura, E.S., Oliveira, A.L. (eds) String Processing and Information Retrieval. SPIRE 2003. Lecture Notes in Computer Science, vol 2857. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-39984-1_4

DOI: https://doi.org/10.1007/978-3-540-39984-1_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-20177-9
Online ISBN: 978-3-540-39984-1
eBook Packages: Springer Book Archive