Hybridized Feature Set for Accurate Arabic Dark Web Pages Classification

  • Conference paper
Intelligent Software Methodologies, Tools and Techniques (SoMeT 2015)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 532)

Abstract

Security informatics and computational intelligence are becoming increasingly important for detecting terrorist activities, as extremist groups misuse many of the available Internet services to incite violence and hatred. However, the inadequate performance of statistics-based computational intelligence methods reduces the efficiency of intelligent techniques in supporting counterterrorism efforts and limits the opportunities for early detection of potential terrorist activities. In this paper, we propose a feature set hybridization method, based on feature selection and feature extraction methods, for accurate content classification of Arabic dark web pages. The proposed method hybridizes the feature sets so that the generated feature set contains fewer features while achieving higher classification performance. A dataset selected from the Dark Web Forum Portal (DWFP) is used to test the proposed method, which is based on Term Frequency-Inverse Document Frequency (TFIDF) as the feature selection method on the one hand, and Random Projection (RP) and Principal Component Analysis (PCA) as feature extraction methods on the other. Classification results using the Support Vector Machine (SVM) classifier show that high classification performance is achieved by hybridizing TFIDF and PCA, with an F1 score and accuracy of 99%.
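The abstract does not spell out how the selected and extracted feature sets are combined, so the following is only a minimal sketch of one plausible reading: keep the top-ranked TFIDF terms (selection), project the full TFIDF vectors to a few dimensions (extraction), and concatenate the two. All function names (`tfidf_matrix`, `select_top_terms`, `random_projection`, `hybridize`), the mean-weight ranking rule, the values of `k` and `d`, and the toy tokens are assumptions made for illustration and are not taken from the paper. Random Projection is used for the extraction step because it needs only the standard library; PCA, which the abstract reports as the better-performing choice, would take its place in a fuller implementation. The SVM classification step is omitted.

```python
import math
import random
from collections import Counter


def tfidf_matrix(docs):
    """Return (vocab, rows); rows[i][j] is the TF-IDF weight of term j in doc i."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([
            (counts[t] / len(tokens)) * math.log(n_docs / df[t]) if tokens else 0.0
            for t in vocab
        ])
    return vocab, rows


def select_top_terms(rows, k):
    """Feature selection (assumed rule): indices of the k terms with highest mean TF-IDF."""
    n_terms = len(rows[0])
    mean_weight = [sum(row[j] for row in rows) / len(rows) for j in range(n_terms)]
    ranked = sorted(range(n_terms), key=lambda j: mean_weight[j], reverse=True)
    return sorted(ranked[:k])


def random_projection(rows, d, seed=0):
    """Feature extraction: project each row onto d random Gaussian directions."""
    rng = random.Random(seed)
    n_terms = len(rows[0])
    scale = 1.0 / math.sqrt(d)
    directions = [
        [rng.gauss(0.0, 1.0) * scale for _ in range(n_terms)] for _ in range(d)
    ]
    return [
        [sum(w * x for w, x in zip(direction, row)) for direction in directions]
        for row in rows
    ]


def hybridize(rows, k, d, seed=0):
    """Concatenate k selected TF-IDF features with d projected features per document."""
    selected = select_top_terms(rows, k)
    projected = random_projection(rows, d, seed)
    return [[row[j] for j in selected] + proj for row, proj in zip(rows, projected)]


# Toy stand-ins for preprocessed (stop-word-filtered, stemmed) Arabic terms.
docs = [
    "alpha beta gamma alpha",
    "alpha delta epsilon",
    "beta gamma zeta",
    "alpha delta delta eta",
]
vocab, rows = tfidf_matrix(docs)
hybrid = hybridize(rows, k=3, d=2)
print(len(hybrid), len(hybrid[0]))  # 4 documents, k + d = 5 features each
```

In this sketch the hybrid vector has `k + d` features, which is smaller than the 7-term vocabulary; the resulting rows could then be passed to any classifier, such as an SVM.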

Notes

  1. Downloaded from http://arabicstopwords.sourceforge.net/.

Acknowledgment

Universiti Teknologi Malaysia (UTM) and the Ministry of Education Malaysia, under Research University grants 00M19, 02G71 and 4F550, are hereby acknowledged for some of the facilities that were utilized during the course of this research work. Moreover, Al-Quds Open University, Palestine, is acknowledged for supporting and funding the first author during his PhD study.

Author information

Corresponding author

Correspondence to Ali Selamat.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Sabbah, T., Selamat, A. (2015). Hybridized Feature Set for Accurate Arabic Dark Web Pages Classification. In: Fujita, H., Guizzi, G. (eds) Intelligent Software Methodologies, Tools and Techniques. SoMeT 2015. Communications in Computer and Information Science, vol 532. Springer, Cham. https://doi.org/10.1007/978-3-319-22689-7_13

  • DOI: https://doi.org/10.1007/978-3-319-22689-7_13

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-22688-0

  • Online ISBN: 978-3-319-22689-7

  • eBook Packages: Computer Science, Computer Science (R0)
