Link-Local Features for Hypertext Classification

  • Hervé Utard
  • Johannes Fürnkranz
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4289)


Previous work in hypertext classification has resulted in two principal approaches for incorporating information about the graph properties of the Web into the training of a classifier. The first approach uses the complete text of the neighboring pages, whereas the second approach uses only their class labels. In this paper, we argue that both approaches are unsatisfactory: the first one brings in too much irrelevant information, while the second approach is too coarse by abstracting the entire page into a single class label. We argue that one needs to focus on relevant parts of predecessor pages, namely on the region in the neighborhood of the origin of an incoming link. To this end, we will investigate different ways for extracting such features, and compare several different techniques for using them in a text classifier.
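
The idea of restricting attention to the region around the origin of an incoming link can be illustrated with a minimal sketch. This is not the extraction procedure of the paper: the class `LinkContextParser`, the function `link_local_features`, the `window` parameter, and the example URL are hypothetical names chosen for illustration. The sketch assumes that "neighborhood" means the anchor text plus a fixed number of words before and after the link in a predecessor page.

```python
import re
from html.parser import HTMLParser


class LinkContextParser(HTMLParser):
    """Collects the page's words and the word span of each link to target_url."""

    def __init__(self, target_url):
        super().__init__()
        self.target_url = target_url
        self.words = []          # all words of the page, in document order
        self.spans = []          # (start, end) word indices of matching anchors
        self._open_start = None  # start index of the anchor currently open

    def handle_starttag(self, tag, attrs):
        # Only links whose href equals the target page are of interest.
        if tag == "a" and dict(attrs).get("href") == self.target_url:
            self._open_start = len(self.words)

    def handle_endtag(self, tag):
        if tag == "a" and self._open_start is not None:
            self.spans.append((self._open_start, len(self.words)))
            self._open_start = None

    def handle_data(self, data):
        # Simplification: lowercase word tokens, punctuation dropped.
        self.words.extend(re.findall(r"\w+", data.lower()))


def link_local_features(html, target_url, window=3):
    """Return anchor text and surrounding words for each link to target_url.

    `window` (an assumed parameter) is the number of words kept on each side.
    """
    parser = LinkContextParser(target_url)
    parser.feed(html)
    parser.close()
    features = []
    for start, end in parser.spans:
        features.append({
            "anchor": parser.words[start:end],
            "before": parser.words[max(0, start - window):start],
            "after": parser.words[end:end + window],
        })
    return features
```

Calling `link_local_features(page_html, "http://example.org/x")` on a predecessor page yields one dictionary per matching link. The resulting word lists could then be fed to a text classifier as features of the target page, in place of either the full predecessor text or its class label alone.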


Keywords: Class Label, Feature Extraction Technique, Path Expression, Anchor Text, XPath Expression
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Hervé Utard (1)
  • Johannes Fürnkranz (1)
  1. Knowledge Engineering Group, TU Darmstadt, Darmstadt, Germany
