Learning Noise in Web Data Prior to Elimination

  • Julius Onyancha
  • Valentina Plekhanova
  • David Nelson
Conference paper


This research explores how noise in web data is currently addressed. We establish that current approaches eliminate noise in web data mainly on the basis of the structure and layout of web pages, i.e. they treat as noise any data that does not form part of the main content of a web page. However, not all data that forms part of the main content is of interest to a user, and not all data considered noise is actually noise to a given user. The ability to distinguish useful data from noise while taking into account dynamic changes in user interests has not been fully addressed by current research. We aim to justify the claim that it is important to learn noise prior to elimination, not only to decrease levels of noise but also to reduce the loss of useful information that would otherwise be eliminated as noise. If the process of eliminating noise in web data is not user-driven, the web data made available to a user will not reflect their interests at the time of the request.


Dynamic session identification; Noise web data learning; User interest; User profile; Web log data; Web usage mining



Special thanks to the University of Sunderland, Computing and Engineering Library Services for their financial support towards publication of this research work.



Copyright information

© Springer Nature Singapore Pte Ltd. 2019

Authors and Affiliations

  • Julius Onyancha (1)
  • Valentina Plekhanova (1)
  • David Nelson (1)
  1. Faculty of Computer Science, University of Sunderland, Sunderland, UK
