Hierarchical Classification of HTML Documents with WebClassII

Ceci, Michelangelo; Malerba, Donato

doi:10.1007/3-540-36618-0_5

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2633)

Included in the following conference series:

European Conference on Information Retrieval

Abstract

This paper describes a new method for classifying an HTML document into a hierarchy of categories. The hierarchy is involved in all phases of automated document classification, namely feature extraction, learning, and classification of a new document. The innovative aspects of this work are the feature selection process, the automated determination of thresholds for classification scores, and an experimental study on real-world Web documents that can be associated with any node in the hierarchy. Moreover, a new measure for evaluating system performance is introduced in order to compare three different techniques (flat, hierarchical with proper training sets, and hierarchical with hierarchical training sets). The method has been implemented in a client-server application named WebClassII. Results show that, for hierarchical techniques, it is better to use hierarchical training sets.

Author information

Authors and Affiliations

Dipartimento di Informatica, Università degli Studi, via Orabona, 4, 70126, Bari, Italy
Michelangelo Ceci & Donato Malerba

Editor information

Editors and Affiliations

Istituto di Scienza e Tecnologie dell’Informazione, Consiglio Nazionale delle Ricerche, Via Giuseppe Moruzzi, 1, 56124, Pisa, Italy
Fabrizio Sebastiani

About this paper

Cite this paper

Ceci, M., Malerba, D. (2003). Hierarchical Classification of HTML Documents with WebClassII. In: Sebastiani, F. (eds) Advances in Information Retrieval. ECIR 2003. Lecture Notes in Computer Science, vol 2633. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36618-0_5

DOI: https://doi.org/10.1007/3-540-36618-0_5
Published: 15 April 2003
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-01274-0
Online ISBN: 978-3-540-36618-8
eBook Packages: Springer Book Archive
