Automatic Web Page Classification Using Various Features

  • Conference paper
Advances in Multimedia Information Processing - PCM 2008 (PCM 2008)

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 5353))

Abstract

A model for automatically classifying uncertain Web pages using multiple features is presented. Because the traditional tree structure can barely keep up with an avalanche of new Web pages, the proposed approach combines a partial use of the “bag of words” idea with classification fusion to describe and categorize Web pages. It extracts features of Web pages from three perspectives: consulting a Web directory service, analyzing the text features of the pages’ titles and meta-search keywords, and identifying the primary content of the pages. By fusing the results of these three dedicated classifiers, each Web page is assigned to one or more categories, together with a set of words that represents the page. Experiments were carried out to demonstrate the effectiveness of the proposed method: Web pages are classified into four categories using the proposed fusion method, and the dedicated classifiers are compared with the fusion methods.
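The abstract does not state which fusion rule the authors use. As an illustration only, the sketch below assumes score-level fusion: each of the three dedicated classifiers returns a score per category, the scores are combined by a weighted average, and every category above a threshold is assigned, so a page can receive more than one category. The function names, category labels, weights, and threshold are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List

Scores = Dict[str, float]  # category name -> classifier score in [0, 1]


def fuse_scores(classifier_scores: List[Scores], weights: List[float]) -> Scores:
    """Weighted average of per-category scores (assumed fusion rule)."""
    if not classifier_scores or len(classifier_scores) != len(weights):
        raise ValueError("need one weight per classifier")
    total = sum(weights)
    # Union of all category names seen by any classifier.
    categories = set().union(*classifier_scores)
    return {
        c: sum(w * s.get(c, 0.0) for s, w in zip(classifier_scores, weights)) / total
        for c in sorted(categories)
    }


def assign_categories(fused: Scores, threshold: float = 0.5) -> List[str]:
    """Keep every category at or above the threshold; fall back to the best one."""
    chosen = [c for c, score in fused.items() if score >= threshold]
    if not chosen:
        chosen = [max(fused, key=fused.get)]
    return sorted(chosen)


# Hypothetical outputs of the three dedicated classifiers for one page.
directory_scores = {"sports": 0.8, "news": 0.2}   # Web directory lookup
title_meta_scores = {"sports": 0.6, "news": 0.4}  # title and meta keywords
content_scores = {"sports": 0.5, "news": 0.5}     # primary page content

fused = fuse_scores(
    [directory_scores, title_meta_scores, content_scores],
    weights=[1.0, 1.0, 1.0],
)
labels = assign_categories(fused)
print(fused, labels)
```

With equal weights this reduces to a plain average; unequal weights would let a more reliable classifier dominate, which is one common reason to fuse at the score level rather than by majority vote.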

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Wen, H., Fang, L., Guan, L. (2008). Automatic Web Page Classification Using Various Features. In: Huang, YM.R., et al. Advances in Multimedia Information Processing - PCM 2008. PCM 2008. Lecture Notes in Computer Science, vol 5353. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-89796-5_38

  • DOI: https://doi.org/10.1007/978-3-540-89796-5_38

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-89795-8

  • Online ISBN: 978-3-540-89796-5

  • eBook Packages: Computer Science, Computer Science (R0)