Abstract
The World Wide Web (WWW) is a vast repository of information stored as web documents. The amount of information on the web is growing rapidly, and people rely increasingly on the Internet to acquire information. Internet World Stats reports that world Internet usage increased by 480 % between 2000 and 2011. This rapid growth has made it difficult to organize data and to find it. Categorizing data on the Internet would make it easier to find relevant information quickly and conveniently. Popular web directory projects such as the Yahoo directory and the Mozilla directory organize web pages by category, and this careful classification has made them popular among web users. However, a recent survey estimated that about 584 million websites are currently hosted on the Internet, and these directories list only a tiny fraction of them: only 2.5 % of available web pages are included. The directories also rely on human effort to classify web pages, and the rapid growth of the web has made this increasingly difficult, because manual or semi-automatic classification of websites is tedious and costly. For this reason, web page classification using machine learning algorithms has become a major research topic. A number of algorithms have been proposed that classify websites by analyzing their features. In this paper we introduce a fast, effective probabilistic classification model with good accuracy, based on machine learning and data mining techniques, for the automated classification of web pages into categories based on their textual content.
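The abstract does not say which probabilistic model the paper uses. Purely as an illustration of what probabilistic classification of pages by textual content can look like, the sketch below assumes a multinomial naive Bayes classifier with add-one smoothing. The class name, the tokenizer, and the toy training texts are hypothetical and are not taken from the paper.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesTextClassifier:
    """Illustrative multinomial naive Bayes with add-one (Laplace) smoothing.

    This is an assumed stand-in, not the model proposed in the paper.
    """

    def fit(self, documents, labels):
        self.class_doc_counts = Counter(labels)   # documents per category
        self.word_counts = defaultdict(Counter)   # word frequencies per category
        self.vocabulary = set()
        for text, label in zip(documents, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        self.total_docs = len(labels)
        return self

    def log_scores(self, text):
        """Return log P(category) + sum of log P(word | category) per category."""
        tokens = [t for t in tokenize(text) if t in self.vocabulary]
        vocab_size = len(self.vocabulary)
        scores = {}
        for label, doc_count in self.class_doc_counts.items():
            score = math.log(doc_count / self.total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + vocab_size))
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Hypothetical toy data standing in for the text extracted from web pages.
train_docs = [
    "football match score goal league",
    "stock market shares profit bank",
    "team wins cup final goal",
    "bank raises interest rates market",
]
train_labels = ["sports", "business", "sports", "business"]

model = NaiveBayesTextClassifier().fit(train_docs, train_labels)
print(model.predict("goal in the cup final"))
```

Words that never appeared in training are skipped at prediction time, and scores are summed in log space so that long pages do not cause numeric underflow. A real pipeline would also need HTML stripping, stop-word removal, and stemming, none of which is shown here.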
© 2014 Springer India
About this paper
Cite this paper
Haleem, H., Niyas, C., Verma, S., Kumar, A., Ahmad, F. (2014). Effective Probabilistic Model for Webpage Classification. In: Sengupta, S., Das, K., Khan, G. (eds) Emerging Trends in Computing and Communication. Lecture Notes in Electrical Engineering, vol 298. Springer, New Delhi. https://doi.org/10.1007/978-81-322-1817-3_29
DOI: https://doi.org/10.1007/978-81-322-1817-3_29
Publisher Name: Springer, New Delhi
Print ISBN: 978-81-322-1816-6
Online ISBN: 978-81-322-1817-3
eBook Packages: Engineering, Engineering (R0)