Towards ontology-based multilingual URL filtering: a big data problem

  • Mubashar Hussain
  • Mansoor Ahmed
  • Hasan Ali Khattak
  • Muhammad Imran
  • Abid Khan
  • Sadia Din
  • Awais Ahmad
  • Gwanggil Jeon
  • Alavalapati Goutham Reddy

Abstract

Web content filtering is one of many techniques for limiting exposure to selected content on the Internet. Filtering has become routine over time, yet filtering multilingual web content remains difficult, especially in the big data landscape. The sheer volume of data makes it harder to build an effective content filtering system that works in real time. Several existing systems use artificial intelligence techniques to filter URLs and identify sites with objectionable content, but most of them classify only English-language URLs. When given multilingual URLs, these systems either fail to respond or over-block. This paper introduces a filtering system that classifies multilingual URLs against predefined criteria for the URL, title, and metadata of a web page. Ontological approaches, together with local multilingual dictionaries, serve as the knowledge base for blocking URLs that do not meet the filtering criteria. The proposed system classifies multilingual URLs into two categories, white and black, with high accuracy. An evaluation on a large dataset shows that the proposed system achieves promising accuracy, on a par with that reported in the state-of-the-art literature on semantic-based URL filtering.
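The classification step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the ontology-backed knowledge base is replaced here by flat, hypothetical per-language term sets (`BLOCKED_TERMS`), and the function names `tokenize`, `url_tokens`, and `classify` are assumptions introduced only for illustration. The sketch shows the general idea of checking tokens from the URL, title, and metadata of a page against multilingual dictionaries and assigning the page to a white or black category.

```python
import re
from urllib.parse import urlsplit, unquote

# Hypothetical per-language term lists. In the paper these roles are played by
# ontologies and local multilingual dictionaries; the entries here are placeholders.
BLOCKED_TERMS = {
    "en": {"casino", "betting"},
    "es": {"apuestas"},
    "ur": {"جوا"},
}

# Runs of Unicode letters and digits (underscores act as separators).
TOKEN_RE = re.compile(r"[^\W_]+")


def tokenize(text):
    """Split text into a set of lowercase word tokens in any script."""
    return set(TOKEN_RE.findall(text.lower()))


def url_tokens(url):
    """Extract tokens from the host, path, and query of a URL.

    Percent-encoded sequences are decoded first so that non-ASCII terms
    in the path or query can be matched against the dictionaries.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    return tokenize(" ".join([host, unquote(parts.path), unquote(parts.query)]))


def classify(url, title="", metadata=""):
    """Return "black" if any dictionary term appears in the URL, title,
    or metadata of the page, otherwise "white"."""
    tokens = url_tokens(url) | tokenize(title) | tokenize(metadata)
    for terms in BLOCKED_TERMS.values():
        if tokens & terms:
            return "black"
    return "white"
```

A call such as `classify("https://example.com/apuestas/hoy", "Noticias")` matches the Spanish placeholder term in the URL path. Exact token matching is a deliberate simplification: it keeps the sketch short but ignores morphology, transliteration, and the semantic relations that an ontology would provide.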

Keywords

Filtering; Information processing; Classification; Ontology engineering; Big data

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. Department of Computer Science, COMSATS Institute of Information Technology, Islamabad, Pakistan
  2. School of Computer Science and Engineering, Kyungpook National University, Daegu, Korea
  3. Department of Computer Science, Bahria University, Islamabad, Pakistan
  4. Department of Embedded Systems Engineering, Incheon National University, Incheon, Korea
  5. Department of Computer and Information Security, Sejong University, Seoul, Korea
