Noise Elimination from Web Page Based on Regular Expressions for Web Content Mining

Dutta, Amit; Paria, Sudipta; Golui, Tanmoy; Kole, Dipak Kumar

doi:10.1007/978-3-319-07353-8_63

Part of the book series: Smart Innovation, Systems and Technologies (SIST, volume 27)

Abstract

Web content mining discovers useful knowledge or information from web pages, so noisy data in web documents significantly affects its performance. This paper proposes a noise elimination method based on regular expressions followed by a Site Style Tree (SST). The proposed technique consists of two phases. In the first phase, a filtering method based on regular expressions is applied to web pages to remove noisy HTML tags. The filtered document then passes to the second phase, where an entropy-based measure is used to remove further noise. The page size is reduced considerably by eliminating lines of code preceded by certain predefined noisy HTML tags. The reduced web document is then used to build a Document Object Model (DOM) tree, and the Site Style Tree is subsequently formed by crawling pages from the same URL path of the website. Experiments were conducted on popular websites such as www.amazon.com, www.yahoo.com and www.abcnews.com. The experimental results show that the filtering method eliminates a significant amount of noise before the SST is introduced, so the overall space and time complexity is reduced compared with other SST-based approaches.
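
The two phases described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not list the noisy tags or define the entropy measure, so the tag list `NOISY_TAGS`, the function names `strip_noisy_tags` and `shannon_entropy`, and the use of plain Shannon entropy are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Assumed list of "noisy" tags; the abstract does not say which tags the paper uses.
NOISY_TAGS = ("script", "style", "noscript", "iframe")

# Matches an opening noisy tag, its content, and the matching closing tag.
_BLOCK_RE = re.compile(
    r"<(" + "|".join(NOISY_TAGS) + r")\b[^>]*>.*?</\1\s*>",
    flags=re.IGNORECASE | re.DOTALL,
)
# Matches HTML comments, which carry no visible content.
_COMMENT_RE = re.compile(r"<!--.*?-->", flags=re.DOTALL)


def strip_noisy_tags(html: str) -> str:
    """Phase 1 (sketch): remove comments and whole blocks of noisy tags."""
    html = _COMMENT_RE.sub("", html)
    return _BLOCK_RE.sub("", html)


def shannon_entropy(values) -> float:
    """Phase 2 helper (sketch): Shannon entropy, in bits, of observed values.

    `values` could be, for example, the text or style seen at one tree
    position across several pages crawled from the same URL path.
    """
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


if __name__ == "__main__":
    page = (
        "<html><head><style>p{color:red}</style>"
        '<script type="text/javascript">var a = "<b>";</script></head>'
        "<body><!-- ad --><p>Hello</p></body></html>"
    )
    print(strip_noisy_tags(page))
    print(shannon_entropy(["nav", "nav", "nav"]))
    print(shannon_entropy(["story A", "story B"]))
```

In this sketch, a position whose content is identical on every crawled page has zero entropy, which SST-style methods generally read as a sign of template material rather than main content; any threshold for discarding such nodes would have to be chosen separately. Regular expressions cannot parse arbitrary HTML, so the filter is only suitable as a coarse pre-pass before the DOM tree is built.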

Author information

Authors and Affiliations

Department of IT, St. Thomas’ College of Engineering & Technology, Kolkata, India
Amit Dutta
Department of CSE, St. Thomas’ College of Engineering & Technology, Kolkata, India
Sudipta Paria, Tanmoy Golui & Dipak Kumar Kole

Corresponding author

Correspondence to Amit Dutta.

Editor information

Editors and Affiliations

Indian Statistical Institute, Machine Intelligence Unit, Kolkata, India
Malay Kumar Kundu
Dept. of Computer Science and Engineering, National Institute of Technology Rourkela, Rourkela, India
Durga Prasad Mohapatra
Dept. of Electronics and Tele-Communication Engineering, Jadavpur University Artificial Intelligence Laboratory, Kolkata, India
Amit Konar
Dept. of Computer Science and Engineering, St. Thomas' College of Engineering & Technology, Kidderpore, West Bengal, India
Aruna Chakraborty

About this paper

Cite this paper

Dutta, A., Paria, S., Golui, T., Kole, D.K. (2014). Noise Elimination from Web Page Based on Regular Expressions for Web Content Mining. In: Kumar Kundu, M., Mohapatra, D., Konar, A., Chakraborty, A. (eds) Advanced Computing, Networking and Informatics- Volume 1. Smart Innovation, Systems and Technologies, vol 27. Springer, Cham. https://doi.org/10.1007/978-3-319-07353-8_63

DOI: https://doi.org/10.1007/978-3-319-07353-8_63
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-07352-1
Online ISBN: 978-3-319-07353-8
eBook Packages: Engineering, Engineering (R0)
