Effective Probabilistic Model for Webpage Classification

  • Conference paper
Emerging Trends in Computing and Communication

Part of the book series: Lecture Notes in Electrical Engineering ((LNEE,volume 298))

Abstract

The World Wide Web (WWW) is a large repository of information stored in the form of web documents. The amount of information on the web is growing rapidly, and people rely more and more on the Internet to acquire information. Internet World Stats reports that world Internet usage increased by 480 % between 2000 and 2011. This exponential growth has made it difficult to organize data on the web and to find it. If data on the Internet were categorized, relevant information could be found quickly and conveniently. Popular web directory projects such as Yahoo Directory and the Mozilla directory organize web pages by category, and this careful classification has made them popular among web users. According to a recent survey, about 584 million websites are currently hosted on the Internet, yet these directories list only a tiny fraction of them: they rely on human effort to classify web pages, and only 2.5 % of available web pages are included. The rapid growth of the web has made manual classification increasingly difficult, mainly because manual or semi-automatic classification of websites is tedious and costly. For this reason, web page classification using machine learning algorithms has become a major research topic. A number of algorithms have been proposed that classify websites by analyzing their features. In this paper we introduce a fast, effective probabilistic classification model, based on machine learning and data mining techniques, that automatically classifies web pages into categories according to their textual content with good accuracy.
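The abstract does not say which probabilistic model the paper uses, so the following is only an illustrative sketch of how a probabilistic classifier can assign a page to a category from its text. It assumes multinomial Naive Bayes with add-one smoothing, one common choice for text classification; the class name `NaiveBayesTextClassifier`, the `tokenize` helper, and the toy training data are all hypothetical and are not taken from the paper.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing.

    Assumed model for illustration only; not the paper's stated method.
    """

    def fit(self, documents, labels):
        # Count documents per class (for priors) and words per class.
        self.class_doc_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocabulary = set()
        for text, label in zip(documents, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        self.total_docs = len(labels)
        return self

    def log_scores(self, text):
        # Words never seen in training are ignored.
        tokens = [t for t in tokenize(text) if t in self.vocabulary]
        vocab_size = len(self.vocabulary)
        scores = {}
        for label, doc_count in self.class_doc_counts.items():
            total_words = sum(self.word_counts[label].values())
            # log P(class) + sum of log P(word | class)
            score = math.log(doc_count / self.total_docs)
            for token in tokens:
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + vocab_size))
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Hypothetical toy data standing in for the text extracted from web pages.
train_docs = [
    "football match score league goal",
    "tennis player wins match tournament",
    "stock market shares price investors",
    "bank interest rates market economy",
]
train_labels = ["sports", "sports", "business", "business"]

classifier = NaiveBayesTextClassifier().fit(train_docs, train_labels)
print(classifier.predict("league goal and tournament score"))
```

Working in log space avoids numeric underflow when many word probabilities are multiplied, and the smoothing term keeps a single unseen word from zeroing out a category. A real pipeline would first strip HTML markup and would likely apply stop-word removal and stemming before counting words.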

Author information

Corresponding author

Correspondence to Hammad Haleem.

Copyright information

© 2014 Springer India

About this paper

Cite this paper

Haleem, H., Niyas, C., Verma, S., Kumar, A., Ahmad, F. (2014). Effective Probabilistic Model for Webpage Classification. In: Sengupta, S., Das, K., Khan, G. (eds) Emerging Trends in Computing and Communication. Lecture Notes in Electrical Engineering, vol 298. Springer, New Delhi. https://doi.org/10.1007/978-81-322-1817-3_29

  • DOI: https://doi.org/10.1007/978-81-322-1817-3_29

  • Publisher Name: Springer, New Delhi

  • Print ISBN: 978-81-322-1816-6

  • Online ISBN: 978-81-322-1817-3

  • eBook Packages: Engineering, Engineering (R0)
