Web Document Classification by Keywords Using Random Forests
A web directory hierarchy is critical for serving users' search requests. Creating and maintaining such directories without the involvement of human experts requires accurate classification of web documents. In this paper, we explore web page classification using keywords from documents as attributes and the random forest learning method. Our initial results are promising: random forests performed better than several other well-known learning methods. When the number of topics increased from five to seven, random forests still performed better than the other methods, even though absolute classification rates decreased.
Keywords: web document classification, random forests, data mining, keywords, topics, web directory
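The approach summarized in the abstract, keywords as attributes fed to a random forest, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the keyword list, the two toy topics, the binary presence encoding, the Gini split criterion, and the forest settings (25 trees, a random subset of about sqrt(n) keywords tried per split). It uses only the Python standard library and is not the authors' implementation.

```python
import random
from collections import Counter

# Hypothetical keyword vocabulary; the paper's actual keywords are not given here.
KEYWORDS = ["football", "match", "league", "stock", "market", "bank"]

def to_features(text, keywords=KEYWORDS):
    """Binary attribute vector: 1 if the keyword occurs in the document."""
    words = set(text.lower().split())  # naive whitespace tokenizer
    return [1 if k in words else 0 for k in keywords]

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def build_tree(rows, labels, rng, n_try):
    """Grow one tree; each split considers only n_try randomly chosen keywords."""
    if len(set(labels)) == 1:          # pure node becomes a leaf
        return labels[0]
    best = None
    for f in rng.sample(range(len(rows[0])), n_try):
        left = [y for x, y in zip(rows, labels) if x[f] == 0]
        right = [y for x, y in zip(rows, labels) if x[f] == 1]
        if not left or not right:      # keyword does not separate anything
            continue
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if best is None or score < best[0]:
            best = (score, f)
    if best is None:                   # no usable split: majority-class leaf
        return Counter(labels).most_common(1)[0][0]
    f = best[1]
    split = {0: ([], []), 1: ([], [])}
    for x, y in zip(rows, labels):
        split[x[f]][0].append(x)
        split[x[f]][1].append(y)
    return (f, build_tree(*split[0], rng, n_try), build_tree(*split[1], rng, n_try))

def predict_tree(node, x):
    """Follow keyword tests down to a leaf label."""
    while isinstance(node, tuple):
        f, left, right = node
        node = right if x[f] else left
    return node

def train_forest(rows, labels, n_trees=25, seed=0):
    """Train n_trees trees, each on a bootstrap sample of the documents."""
    rng = random.Random(seed)
    n_try = max(1, int(len(rows[0]) ** 0.5))
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # sample with replacement
        forest.append(build_tree([rows[i] for i in idx],
                                 [labels[i] for i in idx], rng, n_try))
    return forest

def predict(forest, x):
    """Majority vote over all trees."""
    return Counter(predict_tree(t, x) for t in forest).most_common(1)[0][0]

# Toy labeled documents (invented for this sketch).
DOCS = [
    ("football match tonight in the league", "sports"),
    ("the league postponed the match", "sports"),
    ("football fans watch the match", "sports"),
    ("stock market fell as bank shares dropped", "finance"),
    ("the bank raised rates and the market reacted", "finance"),
    ("stock prices on the market", "finance"),
]
X = [to_features(text) for text, _ in DOCS]
y = [label for _, label in DOCS]
forest = train_forest(X, y)

print(predict(forest, to_features("football match in the league")))
print(predict(forest, to_features("bank stock market news")))
```

On real web pages the keyword list would be far larger and the tokenizer would need to handle markup and punctuation. The structure stays the same: one attribute per keyword and a majority vote over trees grown on bootstrap samples.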