Automatic Web Page Classification Using Various Features

Wen, Hao; Fang, Liping; Guan, Ling

doi:10.1007/978-3-540-89796-5_38


Part of the book series: Lecture Notes in Computer Science (LNISA, volume 5353)

Included in the following conference series:

Pacific-Rim Conference on Multimedia


Abstract

A model for automatically classifying uncertain Web pages using multiple features is presented. Because the traditional tree structure can barely keep up with an avalanche of new Web pages, the proposed approach partially adopts the “bag of words” idea and combines it with classification fusion to describe and categorize Web pages. The approach extracts features of Web pages from various perspectives, such as consulting a Web directory service, analyzing the text features of the pages’ titles and meta-search keywords, and identifying the primary content of the pages. By fusing the results of these three dedicated classifiers, Web pages are assigned to one or more categories, with a set of words representing each page. Experiments are carried out to demonstrate the effectiveness of the proposed method; in them, Web pages are classified into four categories using the proposed fusion method. A comparison between the dedicated classifiers and the fusion methods is also presented.
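
The abstract does not state which fusion rule the authors use, so the following is only a minimal sketch of one common option, a weighted average of per-category scores with a threshold that allows more than one category per page. The category names, the classifier names (`directory`, `title_meta`, `content`), the scores, and the threshold are all hypothetical and chosen purely for illustration.

```python
# Hypothetical category labels; the abstract only says there are four categories.
CATEGORIES = ["arts", "business", "science", "sports"]


def fuse_scores(classifier_scores, weights=None, threshold=0.5):
    """Fuse per-category scores from several classifiers by weighted average.

    classifier_scores maps a classifier name to a dict {category: score in [0, 1]}.
    Categories a classifier does not mention are treated as score 0.
    Returns (fused_scores, labels); labels holds every category whose fused
    score reaches the threshold, or the single best category if none does.
    """
    names = list(classifier_scores)
    if weights is None:
        # Assumption: equal trust in every classifier unless told otherwise.
        weights = {name: 1.0 for name in names}
    total_weight = sum(weights[name] for name in names)

    fused = {}
    for category in CATEGORIES:
        weighted_sum = sum(
            weights[name] * classifier_scores[name].get(category, 0.0)
            for name in names
        )
        fused[category] = weighted_sum / total_weight

    labels = [c for c in CATEGORIES if fused[c] >= threshold]
    if not labels:
        # Fall back to the top-scoring category so every page gets a label.
        labels = [max(CATEGORIES, key=lambda c: fused[c])]
    return fused, labels


# Made-up scores standing in for the three dedicated classifiers.
scores = {
    "directory": {"science": 0.9, "business": 0.1},
    "title_meta": {"science": 0.6, "business": 0.5},
    "content": {"science": 0.7, "business": 0.6, "sports": 0.1},
}
fused, labels = fuse_scores(scores)
```

With these made-up scores, only "science" reaches the default threshold of 0.5; lowering the threshold to 0.3 would also admit "business", which illustrates how a page can end up in more than one category.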



Author information

Authors and Affiliations

Department of Mechanical and Industrial Engineering, Ryerson University, Canada
Hao Wen & Liping Fang
Department of Electrical and Computer Engineering, Ryerson University, Canada
Ling Guan

Editor information

Editors and Affiliations

Department of Engineering Science, National Cheng Kung University, No. 1, University Road, 701, Tainan City, Taiwan
Yueh-Min Ray Huang
National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, 95, Zhongguancun East Road, 100190, Beijing, China
Changsheng Xu
Institute of Biomedical Engineering, National Cheng Kung University, No. 1, University Road, 701, Tainan City, Taiwan
Kuo-Sheng Cheng
Department of Electrical Engineering, National Cheng Kung University, No. 1, University Road, 701, Tainan City, Taiwan
Jar-Ferr Kevin Yang
Department of Electrical and Computer Engineering, Concordia University, S-EV005.139, 1515 St. Catherine West, Montreal, H4G 2W1, Quebec, Canada
M. N. S. Swamy
Microsoft Research Asia, 5/F, Beijing Sigma Center, No. 49, Zhichun Road, Hai Dian District, 100080, Beijing, China
Shipeng Li
Department of Information Management, National Kaohsiung University of Applied Sciences, No. 415, Jiangong Road, Sanmin District, 80778, Kaohsiung, Taiwan
Jen-Wen Ding


About this paper

Cite this paper

Wen, H., Fang, L., Guan, L. (2008). Automatic Web Page Classification Using Various Features. In: Huang, YM.R., et al. Advances in Multimedia Information Processing - PCM 2008. PCM 2008. Lecture Notes in Computer Science, vol 5353. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-89796-5_38


DOI: https://doi.org/10.1007/978-3-540-89796-5_38
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-89795-8
Online ISBN: 978-3-540-89796-5
eBook Packages: Computer Science, Computer Science (R0)
