Neighbourhood Exploitation in Hypertext Categorization

Benbrahim, Houda; Bramer, Max

doi:10.1007/1-84628-102-4_19

Included in the following conference series:

International Conference on Innovative Techniques and Applications of Artificial Intelligence

Abstract

The exponential growth of the web has made it necessary to bring some order to its content. Hypertext categorization, the automatic classification of web documents into predefined classes, emerged to relieve humans of that task. The extra information available in a hypertext document poses new challenges for automatic categorization: HTML tags and the linked neighbourhood both provide rich information that is not available in traditional text classification. This paper examines (i) which of the extra information hidden in HTML tags and linked neighbourhood pages should be taken into consideration to improve classification, and (ii) how to deal with the high level of noise in linked pages. A hypertext dataset and four well-known learning algorithms (Naive Bayes, K-Nearest Neighbour, Support Vector Machine and C4.5) were used to exploit the enriched text representation. The results showed that careful use of the information in the linked neighbourhood and HTML tags improved the accuracy of the classification algorithms.
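
To make the idea of an enriched text representation concrete, the following is a minimal sketch and not the authors' method, which the abstract does not specify. It assumes that terms inside selected HTML tags are up-weighted and that only the titles of linked pages are added, at a reduced weight, as one simple way to limit noise from the neighbourhood. The names TAG_WEIGHTS, page_features and enriched_features, and all weight values, are hypothetical.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical tag weights chosen for illustration; not taken from the paper.
TAG_WEIGHTS = {"title": 3, "h1": 2, "a": 2}


class TaggedTextExtractor(HTMLParser):
    """Collect (tag, text) pairs, where tag is the innermost open tag."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        tag = self.stack[-1] if self.stack else "body"
        self.chunks.append((tag, data))


def tagged_chunks(html):
    """Return the (tag, text) pairs found in an HTML string."""
    parser = TaggedTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def page_features(html):
    """Weighted term counts for one page; untagged terms get weight 1."""
    features = Counter()
    for tag, text in tagged_chunks(html):
        weight = TAG_WEIGHTS.get(tag, 1)
        for token in tokenize(text):
            features[token] += weight
    return features


def enriched_features(html, neighbour_htmls, neighbour_weight=0.5):
    """Add only the titles of linked pages, at a reduced weight.

    Using titles alone is an assumed noise-reduction heuristic.
    """
    features = page_features(html)
    for neighbour in neighbour_htmls:
        for tag, text in tagged_chunks(neighbour):
            if tag == "title":
                for token in tokenize(text):
                    features[token] += neighbour_weight
    return features
```

The resulting term-weight dictionaries could then be passed to any of the classifiers named in the abstract. Restricting neighbour pages to their titles is only one possible noise filter; other choices, such as anchor text or headings, would fit the same structure.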

Author information

Authors and Affiliations

Department of Computer Science and Software Engineering, Portsmouth University, Portsmouth
Houda Benbrahim & Max Bramer

Editor information

Editors and Affiliations

Faculty of Technology, University of Portsmouth, Portsmouth, UK
Max Bramer BSc, PhD, CEng, FBCS, FIEE, FRSA
Department of Computer Science, University of Liverpool, Liverpool, UK
Frans Coenen
Nottingham Trent University, UK
Tony Allen

About this paper

Cite this paper

Benbrahim, H., Bramer, M. (2005). Neighbourhood Exploitation in Hypertext Categorization. In: Bramer, M., Coenen, F., Allen, T. (eds) Research and Development in Intelligent Systems XXI. SGAI 2004. Springer, London. https://doi.org/10.1007/1-84628-102-4_19

DOI: https://doi.org/10.1007/1-84628-102-4_19
Publisher Name: Springer, London
Print ISBN: 978-1-85233-907-4
Online ISBN: 978-1-84628-102-0
eBook Packages: Computer Science, Computer Science (R0)
