Noise Elimination from Web Page Based on Regular Expressions for Web Content Mining

  • Conference paper
Advanced Computing, Networking and Informatics- Volume 1

Part of the book series: Smart Innovation, Systems and Technologies (SIST, volume 27)

Abstract

Web content mining discovers useful knowledge or information from web pages, so noisy data in a web document significantly degrades its performance. In this paper, a noise elimination method is proposed based on regular expressions followed by a Site Style Tree (SST). The proposed technique consists of two phases. In the first phase, a filtering method based on regular expressions is applied to web pages to remove noisy HTML tags. The filtered document then passes to the second phase, where an entropy-based measure is used to remove further noise. The page size is reduced considerably by eliminating a number of lines of code preceded by some predefined noisy HTML tags. The condensed web document is then used to form a Document Object Model (DOM) tree, and the Site Style Tree is subsequently formed by crawling pages from the same URL path of the website. Experiments were conducted on popular websites such as www.amazon.com, www.yahoo.com and www.abcnews.com. The experimental results show that the filtering method eliminates a significant amount of noise before the SST is introduced, so the overall space and time complexity is reduced compared with other SST-based approaches.
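The two phases described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not list the noisy tags or define the entropy measure, so the tag list `NOISY_TAGS` and the helper names `strip_noisy_tags` and `shannon_entropy` are assumptions chosen for illustration only.

```python
import math
import re
from collections import Counter

# Hypothetical list of "noisy" tags; the abstract does not enumerate them.
NOISY_TAGS = ("script", "style", "noscript", "iframe")


def strip_noisy_tags(html: str, tags=NOISY_TAGS) -> str:
    """Phase 1 sketch: drop each noisy element together with its body."""
    for tag in tags:
        pattern = re.compile(
            rf"<{tag}\b[^>]*>.*?</{tag}\s*>", re.IGNORECASE | re.DOTALL
        )
        html = pattern.sub("", html)
    # HTML comments are also removed here (an assumption, not from the paper).
    return re.sub(r"<!--.*?-->", "", html, flags=re.DOTALL)


def shannon_entropy(values) -> float:
    """Phase 2 sketch: entropy in bits of a block's content across pages."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

Under these assumptions, a block whose content is identical on every crawled page has entropy 0 and is a candidate for noise, while a block that differs from page to page scores higher. Regular expressions cannot parse arbitrary HTML, which is consistent with the abstract's use of them only as a pre-filter before the DOM tree is built.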

Corresponding author

Correspondence to Amit Dutta.

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Dutta, A., Paria, S., Golui, T., Kole, D.K. (2014). Noise Elimination from Web Page Based on Regular Expressions for Web Content Mining. In: Kumar Kundu, M., Mohapatra, D., Konar, A., Chakraborty, A. (eds) Advanced Computing, Networking and Informatics- Volume 1. Smart Innovation, Systems and Technologies, vol 27. Springer, Cham. https://doi.org/10.1007/978-3-319-07353-8_63

  • DOI: https://doi.org/10.1007/978-3-319-07353-8_63

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-07352-1

  • Online ISBN: 978-3-319-07353-8

  • eBook Packages: Engineering, Engineering (R0)
