Abstract
This paper describes a new method for the classification of a HTML document into a hierarchy of categories. The hierarchy of categories is involved in all phases of automated document classification, namely feature extraction, learning, and classification of a new document. The innovative aspects of this work are the feature selection process, the automated threshold determination for classification scores, and an experimental study on real-word Web documents that can be associated to any node in the hierarchy. Moreover, a new measure for the evaluation of system performances has been introduced in order to compare three different techniques (flat, hierarchical with proper training sets, hierarchical with hierarchical training sets). The method has been implemented in the context of a client-server application, named WebClassII. Results show that for hierarchical techniques it is better to use hierarchical training sets.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
References
Almuallim H., Akiba Y., & Kaneda S.: An efficient algorithm for finding optimal gain-ratio multiple-split tests on hierarchical attributes in decision tree learning. Proc. of the Nat. Conf. on Artificial Intelligence (AAAI’96) (1996) 703–708
Cleverdon C.: Optimizing convenient online access to bibliographic databases. Information Services and Use. 4 (1984) 37–47
D’Alessio S., Murray K., Schiaffino R., & Kershenbau A.: The effect of using hierarchical classifiers in text categorization. Proc. of the 6th Int. Conf. on “Recherche d’Information Assistée par Ordinateur”. (RIAO) (2000) 302–313
Dumais S. & Chen H.: Hierarchical classification of Web document. Proc. of the 23rd ACM Int. Conf. on Research and Development in Information Retrieval (SIGIR’00) (2000) 256–263
Esposito F., Malerba D., Di Pace L., & Leo P.: A Machine Learning Approach to Web Mining. In E. Lamma & P. Mello (Eds.). AI*IA 99: Advances in Artificial Intelligence, Lecture Notes in Artificial Intelligence, Vol. 1792, Berlin: Springer (2000) 190–201
Joachims T.: A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. Proc. of the 14th Int. Conf. on Machine Learning (1997) 143–151
Koller D. & Sahami M.: Hierarchically classifying documents using very few words. Proc. of the 14th Int. Conf. on Machine Learning ICML’97 (1997) 170–178
Malerba D., Esposito F., & Ceci M.: Mining HTML Pages to Support Document Sharing in a Cooperative System. In R. Unland, A. Chaudri, D. Chabane & W. Lindner (Eds.): XML-Based Data Management and Multimedia Engineering — EDBT 2002 Workshops, Lecture Notes in Computer Science, Vol. 2490, Berlin: Springer (2002)
McCallum A., Rosenfeld R., Mitchell T.M., Ng A. Y.: Improving text classification by shrinkage in a hierarchy of classes. Proc. of the 15th Int. Conf. on Machine Learning (ICML’98) (1998) 359–367
Mladenic D.: Machine learning on non-homogeneus, distribuited text data, PhD Thesis, University of Ljubjana (1998)
Porter M. F.: An algorithm for suffix stripping. Program, 14(3) (1980) 130–137
Salton G.: Automatic text processing: The transformation, analysis, and retrieval of information by computer. Reading, MA: Addison-Wesley (1989)
Sahami M.: Learning limited dependence Bayesian classifiers. Proc. of the 2nd Int. Conference on Knowledge Discovery in Databases (KDD’96) (1996) 335–338
Sebastiani F.: Machine Learning in Automated Text Categorization. ACM Computing Surveys 34 (2002) 1–47
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2003 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Ceci, M., Malerba, D. (2003). Hierarchical Classification of HTML Documents with WebClassII. In: Sebastiani, F. (eds) Advances in Information Retrieval. ECIR 2003. Lecture Notes in Computer Science, vol 2633. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36618-0_5
Download citation
DOI: https://doi.org/10.1007/3-540-36618-0_5
Published:
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-01274-0
Online ISBN: 978-3-540-36618-8
eBook Packages: Springer Book Archive