
Hierarchical Classification of HTML Documents with WebClassII

  • Conference paper
Advances in Information Retrieval (ECIR 2003)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2633)


Abstract

This paper describes a new method for classifying an HTML document into a hierarchy of categories. The hierarchy is involved in all phases of automated document classification, namely feature extraction, learning, and classification of a new document. The innovative aspects of this work are the feature selection process, the automated determination of thresholds for classification scores, and an experimental study on real-world Web documents that can be associated with any node in the hierarchy. Moreover, a new measure for evaluating system performance is introduced in order to compare three different techniques: flat, hierarchical with proper training sets, and hierarchical with hierarchical training sets. The method has been implemented in a client-server application named WebClassII. Results show that, for hierarchical techniques, it is better to use hierarchical training sets.
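The idea of assigning a document to any node of the hierarchy by comparing classification scores against thresholds can be illustrated with a minimal sketch. It is not the WebClassII implementation: the `Category` class, the fixed `threshold` values, the `KEYWORDS` table, and the keyword-overlap scorer are all hypothetical stand-ins chosen for illustration. The abstract does not specify how scores are computed, and the paper determines thresholds automatically rather than fixing them by hand.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class Category:
    name: str
    threshold: float = 0.5  # hypothetical fixed threshold (assumption)
    children: List["Category"] = field(default_factory=list)


Scorer = Callable[[str, Category], float]


def classify_top_down(doc: str, root: Category, score: Scorer) -> Category:
    """Descend from the root while some child scores at or above its threshold."""
    node = root
    while node.children:
        scored = [(score(doc, child), child) for child in node.children]
        passing = [(s, c) for s, c in scored if s >= c.threshold]
        if not passing:
            break  # no child is confident enough: stay at this internal node
        node = max(passing, key=lambda pair: pair[0])[1]
    return node


# Toy keyword sets per category (assumption, for illustration only).
KEYWORDS: Dict[str, Set[str]] = {
    "science": {"research", "experiment"},
    "physics": {"quantum", "particle"},
    "biology": {"cell", "gene"},
    "sports": {"match", "team"},
}


def keyword_score(doc: str, cat: Category) -> float:
    """Fraction of the category's keywords that occur in the document."""
    words = set(doc.lower().split())
    keys = KEYWORDS[cat.name]
    return len(words & keys) / len(keys)


root = Category("root", children=[
    Category("science", children=[Category("physics"), Category("biology")]),
    Category("sports"),
])

deep = classify_top_down("a research experiment on quantum particle behaviour",
                         root, keyword_score)
inner = classify_top_down("new research experiment announced", root, keyword_score)
print(deep.name, inner.name)
```

In this sketch the second document stops at the internal node `science` because neither child passes its threshold. A flat technique would instead score all categories at once, ignoring the tree structure.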




Copyright information

© 2003 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Ceci, M., Malerba, D. (2003). Hierarchical Classification of HTML Documents with WebClassII. In: Sebastiani, F. (eds) Advances in Information Retrieval. ECIR 2003. Lecture Notes in Computer Science, vol 2633. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36618-0_5


  • DOI: https://doi.org/10.1007/3-540-36618-0_5


  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-01274-0

  • Online ISBN: 978-3-540-36618-8

  • eBook Packages: Springer Book Archive
