Abstract
Hypertext categorization is the automatic classification of web documents into predefined classes. It poses new challenges for automatic categorization because a hypertext document carries more information than plain text: hyperlinks, HTML tags, and metadata all supply evidence that is not available in traditional text classification. This paper examines (i) which document representation to use, and which extra information hidden in HTML pages to take into account to improve classification, and (ii) how to deal with the very high number of features that text produces. A hypertext dataset and three well-known learning algorithms (Naïve Bayes, K-Nearest Neighbour and C4.5) were used to evaluate the enriched text representation combined with feature reduction. The results showed that enhancing the basic text content with HTML page keywords, title and anchor links improved the accuracy of the classification algorithms.
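As an illustration of what "enhancing the basic text content" can look like in practice, the following is a minimal sketch, not the authors' implementation. It assumes Python's standard `html.parser` module; the class name `RichHTMLExtractor`, the function `bag_of_words`, the field names, and the simple lowercase tokenizer are all hypothetical choices made for this example. It separates a page into body text, title, meta keywords and anchor text, then builds a term-count vector from whichever fields are selected.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class RichHTMLExtractor(HTMLParser):
    """Splits a page into body text, title, meta keywords and anchor text."""

    TRACKED = ("title", "a", "script", "style")

    def __init__(self):
        super().__init__()
        self.parts = {"body": [], "title": [], "keywords": [], "anchor": []}
        self._open = []  # tracked tags that are currently open

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "keywords":
            # Meta keywords live in an attribute, not in element text.
            self.parts["keywords"].append(attrs.get("content") or "")
        elif tag in self.TRACKED:
            self._open.append(tag)

    def handle_endtag(self, tag):
        if tag in self._open:
            # Close the most recently opened matching tag.
            idx = len(self._open) - 1 - self._open[::-1].index(tag)
            del self._open[idx]

    def handle_data(self, data):
        if "script" in self._open or "style" in self._open:
            return  # code and CSS are not document content
        if "title" in self._open:
            self.parts["title"].append(data)
        elif "a" in self._open:
            self.parts["anchor"].append(data)
        else:
            self.parts["body"].append(data)


def bag_of_words(html, fields=("body", "title", "keywords", "anchor")):
    """Return term counts over the selected fields of an HTML page."""
    parser = RichHTMLExtractor()
    parser.feed(html)
    parser.close()
    counts = Counter()
    for field in fields:
        text = " ".join(parser.parts[field]).lower()
        counts.update(re.findall(r"[a-z]+", text))  # crude tokenizer (assumption)
    return counts


sample = (
    '<html><head><title>Course Page</title>'
    '<meta name="keywords" content="machine learning, course"></head>'
    '<body><p>Welcome to the course.</p>'
    '<a href="x.html">lecture notes</a></body></html>'
)
basic = bag_of_words(sample, fields=("body",))  # plain text content only
enriched = bag_of_words(sample)                 # plus title, keywords, anchors
print(basic)
print(enriched)
```

Calling `bag_of_words` with different `fields` tuples gives a simple way to compare a text-only representation against enriched ones. The resulting count vectors would still need feature reduction and a classifier, which this sketch does not cover.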
Copyright information
© 2004 Springer Science + Business Media, Inc.
About this chapter
Cite this chapter
Benbrahim, H., Bramer, M. (2004). Impact on Performance of Hypertext Classification of Selective Rich HTML Capture. In: Bramer, M., Devedzic, V. (eds) Artificial Intelligence Applications and Innovations. AIAI 2004. IFIP International Federation for Information Processing, vol 154. Springer, Boston, MA. https://doi.org/10.1007/1-4020-8151-0_25
DOI: https://doi.org/10.1007/1-4020-8151-0_25
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4020-8150-7
Online ISBN: 978-1-4020-8151-4
eBook Packages: Springer Book Archive