Abstract
A model for automatically classifying uncertain Web pages using multiple features is presented. Because the traditional tree structure can barely keep up with an avalanche of new Web pages, the proposed approach partially adopts the “bag of words” idea and combines it with classification fusion to describe and categorize Web pages. The approach extracts features of Web pages from several perspectives: consulting a Web directory service, analyzing the text features of the pages’ titles and meta-search keywords, and identifying the primary content of the pages. By fusing the results of these three dedicated classifiers, each Web page is assigned to one or more categories, together with a set of words that represents the page. Experiments are carried out to demonstrate the effectiveness of the proposed method; in them, Web pages are classified into four categories using the proposed fusion method. A comparison between the dedicated classifiers and the fusion methods is also presented.
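The abstract does not state which fusion rule is used, so the following is only a minimal sketch of one common option: a weighted average of per-category scores from three classifiers, followed by a threshold that allows a page to fall into more than one category. The category names, the function name `fuse_scores`, the weights, and the threshold are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List

# Hypothetical category labels; the paper uses four categories but they are not named here.
CATEGORIES = ["arts", "business", "science", "sports"]


def fuse_scores(
    classifier_scores: List[Dict[str, float]],
    weights: List[float],
    threshold: float = 0.5,
) -> List[str]:
    """Fuse per-category scores by weighted averaging (an assumed rule).

    classifier_scores holds one dict per classifier, mapping a category to a
    score in [0, 1]; categories a classifier does not mention count as 0.
    Returns every category whose fused score reaches the threshold, or the
    single best category if none does.
    """
    if len(classifier_scores) != len(weights):
        raise ValueError("need exactly one weight per classifier")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")

    fused = {
        category: sum(
            weight * scores.get(category, 0.0)
            for scores, weight in zip(classifier_scores, weights)
        ) / total_weight
        for category in CATEGORIES
    }

    selected = [c for c in CATEGORIES if fused[c] >= threshold]
    if not selected:
        # Fall back to the single highest-scoring category.
        selected = [max(CATEGORIES, key=lambda c: fused[c])]
    return selected


# Illustrative scores from the three perspectives named in the abstract.
directory_scores = {"science": 0.9, "business": 0.2}
title_meta_scores = {"science": 0.6, "business": 0.7}
content_scores = {"science": 0.8, "business": 0.7}

labels = fuse_scores(
    [directory_scores, title_meta_scores, content_scores],
    weights=[1.0, 1.0, 1.0],
)
```

With equal weights, the illustrative page above is assigned to both "business" and "science", which mirrors the one-or-more-categories outcome described in the abstract. Other rules, such as majority voting or a learned combiner, would fit the same interface.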
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Wen, H., Fang, L., Guan, L. (2008). Automatic Web Page Classification Using Various Features. In: Huang, YM.R., et al. Advances in Multimedia Information Processing - PCM 2008. PCM 2008. Lecture Notes in Computer Science, vol 5353. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-89796-5_38
DOI: https://doi.org/10.1007/978-3-540-89796-5_38
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-89795-8
Online ISBN: 978-3-540-89796-5
eBook Packages: Computer Science, Computer Science (R0)