Effective Probabilistic Model for Webpage Classification

Haleem, Hammad; Niyas, C.; Verma, Siddharth; Kumar, Akshay; Ahmad, Faiyaz

doi:10.1007/978-81-322-1817-3_29

Part of the book series: Lecture Notes in Electrical Engineering (LNEE, volume 298)

Abstract

The World Wide Web (WWW) is a large repository of information stored in the form of web documents. The amount of information on the web is growing very rapidly, and people rely more and more on the Internet to acquire information. Internet World Stats reports that world Internet usage increased by 480% between 2000 and 2011. This exponential growth has made it difficult to organize data on the web and to find it. Categorizing data on the Internet would make it easier to find a relevant piece of information quickly and conveniently. Popular web directory projects such as the Yahoo directory and the Mozilla directory organize web pages by category, and this careful classification has made them popular among web users. However, a recent survey estimates that about 584 million websites are currently hosted on the Internet, and these directories list only a tiny fraction of them: they rely on human effort to classify web pages and include only 2.5% of available web pages. The rapid growth of the web has made manual classification increasingly difficult, because manual or semi-automatic classification of websites is tedious and costly. For this reason, web page classification using machine learning algorithms has become a major research topic. A number of algorithms have been proposed that classify websites by analyzing their features. In this paper we introduce a fast, effective probabilistic classification model, based on machine learning and data mining techniques, that automatically classifies web pages into categories according to their textual content and achieves good accuracy.
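
The abstract does not say which probabilistic model the chapter uses. Purely as an illustration of what classifying pages by their textual content can look like, the sketch below implements a multinomial Naive Bayes classifier with add-one smoothing. The choice of Naive Bayes, the toy training pages, the category names, and all function names are assumptions made for this example and are not taken from the chapter.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train(labeled_pages):
    """Count pages per category and word occurrences per category."""
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, category in labeled_pages:
        doc_counts[category] += 1
        for word in tokenize(text):
            word_counts[category][word] += 1
            vocabulary.add(word)
    return doc_counts, word_counts, vocabulary


def classify(text, model):
    """Return the category with the highest posterior log-probability."""
    doc_counts, word_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, -math.inf
    for category in doc_counts:
        # Log prior: fraction of training pages in this category.
        score = math.log(doc_counts[category] / total_docs)
        total_words = sum(word_counts[category].values())
        for word in tokenize(text):
            if word not in vocabulary:
                continue  # Skip words never seen during training.
            # Add-one (Laplace) smoothing avoids zero probabilities.
            count = word_counts[category][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Hypothetical toy training set; real pages would first be stripped of markup.
TRAINING_PAGES = [
    ("football match score league goal", "sports"),
    ("tennis player wins tournament final", "sports"),
    ("python code compiler software bug", "computers"),
    ("linux kernel software release download", "computers"),
]

model = train(TRAINING_PAGES)
print(classify("new software release fixes compiler bug", model))
print(classify("the league final ended with a late goal", model))
```

With this toy data the two calls return "computers" and "sports" respectively. A realistic pipeline would add steps such as stop-word removal and stemming before counting words; those are omitted here to keep the sketch short.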

Author information

Authors and Affiliations

Department of Computer Engineering, Faculty of Engineering and Technology, Jamia Millia Islamia, New Delhi, 110025, India
Hammad Haleem, C. Niyas, Siddharth Verma, Akshay Kumar & Faiyaz Ahmad

Corresponding author

Correspondence to Hammad Haleem.

Editor information

Editors and Affiliations

Department of Information Technology, BP Poddar Institute of Management and Technology, Kolkata, West Bengal, India
Sabnam Sengupta, Kunal Das & Gitosree Khan

About this paper

Cite this paper

Haleem, H., Niyas, C., Verma, S., Kumar, A., Ahmad, F. (2014). Effective Probabilistic Model for Webpage Classification. In: Sengupta, S., Das, K., Khan, G. (eds) Emerging Trends in Computing and Communication. Lecture Notes in Electrical Engineering, vol 298. Springer, New Delhi. https://doi.org/10.1007/978-81-322-1817-3_29

DOI: https://doi.org/10.1007/978-81-322-1817-3_29
Published: 25 February 2014
Publisher Name: Springer, New Delhi
Print ISBN: 978-81-322-1816-6
Online ISBN: 978-81-322-1817-3
eBook Packages: Engineering, Engineering (R0)
