Neighbourhood Exploitation in Hypertext Categorization

  • Conference paper
Research and Development in Intelligent Systems XXI (SGAI 2004)

Abstract

The exponential growth of the web has made it necessary to bring some order to its content. Hypertext categorization, the automatic classification of web documents into predefined classes, emerged to relieve humans of that task. The extra information available in a hypertext document poses new challenges for automatic categorization. HTML tags and the linked neighbourhood both provide rich information for hypertext categorization that is not available in traditional text classification. This paper examines (i) which of the extra information hidden in HTML tags and linked neighbourhood pages should be taken into account to improve classification, and (ii) how to deal with the high level of noise in linked pages. A hypertext dataset and four well-known learning algorithms (Naive Bayes, K-Nearest Neighbour, Support Vector Machine and C4.5) were used to exploit the enriched text representation. The results showed that careful use of the information in the linked neighbourhood and HTML tags improved the accuracy of the classification algorithms.
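As a rough illustration of what an "enriched text representation" might look like, the sketch below builds a bag of words for a page from its own text, marks words found in selected HTML tags, and borrows only the titles of linked pages as a crude way to limit neighbourhood noise. The abstract does not say which tags or neighbour fields the authors used, so the tag set, the feature prefixes, and the names `TagText`, `page_fields` and `enriched_features` are all assumptions made for this example.

```python
import re
from collections import Counter
from html.parser import HTMLParser

TOKEN = re.compile(r"[a-z0-9]+")


class TagText(HTMLParser):
    """Collects page text, keeping text inside selected tags separately."""

    KEPT = {"title", "h1"}  # hypothetical choice of informative tags

    def __init__(self):
        super().__init__()
        self.stack = []
        self.by_tag = {"body": []}

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to the matching open tag, tolerating sloppy HTML.
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        current = self.stack[-1] if self.stack else "body"
        field = current if current in self.KEPT else "body"
        self.by_tag.setdefault(field, []).append(data)


def tokens(text):
    return TOKEN.findall(text.lower())


def page_fields(html):
    """Map each field ('body', 'title', 'h1') to its list of tokens."""
    parser = TagText()
    parser.feed(html)
    parser.close()
    return {field: tokens(" ".join(chunks))
            for field, chunks in parser.by_tag.items()}


def enriched_features(url, pages, links):
    """Bag of words: own text, tagged title/heading words, neighbour titles."""
    own = page_fields(pages[url])
    feats = Counter(own.get("body", []))
    for tag in ("title", "h1"):
        feats.update(f"{tag}:{w}" for w in own.get(tag, []))
    for neighbour in links.get(url, []):
        if neighbour in pages:
            # Assumption: use only the neighbour's title, not its full text.
            title = page_fields(pages[neighbour]).get("title", [])
            feats.update(f"nb_title:{w}" for w in title)
    return feats


# Toy data, invented for this example.
pages = {
    "a.html": "<html><head><title>Machine Learning Course</title></head>"
              "<body><h1>Syllabus</h1><p>Weekly lectures. "
              "<a href='b.html'>Staff</a></p></body></html>",
    "b.html": "<html><head><title>Department Staff</title></head>"
              "<body><p>Unrelated news and adverts</p></body></html>",
}
links = {"a.html": ["b.html"]}
feats = enriched_features("a.html", pages, links)
```

The resulting counts could then be fed to any of the four classifiers named above. Prefixing the features keeps a word in a title distinct from the same word in body text, and restricting neighbours to their titles is only one of many possible ways to filter noisy linked pages.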

Copyright information

© 2005 Springer-Verlag London Limited

About this paper

Cite this paper

Benbrahim, H., Bramer, M. (2005). Neighbourhood Exploitation in Hypertext Categorization. In: Bramer, M., Coenen, F., Allen, T. (eds) Research and Development in Intelligent Systems XXI. SGAI 2004. Springer, London. https://doi.org/10.1007/1-84628-102-4_19

  • DOI: https://doi.org/10.1007/1-84628-102-4_19

  • Publisher Name: Springer, London

  • Print ISBN: 978-1-85233-907-4

  • Online ISBN: 978-1-84628-102-0

  • eBook Packages: Computer Science, Computer Science (R0)
