Hybridized Feature Set for Accurate Arabic Dark Web Pages Classification

Sabbah, Thabit; Selamat, Ali

doi:10.1007/978-3-319-22689-7_13


Part of the book series: Communications in Computer and Information Science (CCIS, volume 532)

Included in the following conference series:

International Conference on Intelligent Software Methodologies, Tools, and Techniques


Abstract

Security informatics and computational intelligence are becoming increasingly important for detecting terrorist activities, as extremist groups misuse many of the available Internet services to incite violence and hatred. However, the inadequate performance of statistics-based computational intelligence methods reduces the efficiency of intelligent techniques in supporting counterterrorism efforts and limits the opportunities for early detection of potential terrorist activities. In this paper, we propose a feature set hybridization method, based on feature selection and feature extraction methods, for accurate content classification of Arabic dark web pages. The proposed method hybridizes the feature sets so that the generated feature set contains fewer features that are nevertheless capable of achieving higher classification performance. A dataset selected from the Dark Web Forum Portal (DWFP) is used to test the performance of the proposed method, which is based on Term Frequency - Inverse Document Frequency (TFIDF) as the feature selection method on the one hand, and on Random Projection (RP) and Principal Component Analysis (PCA) as feature extraction methods on the other. Classification results using the Support Vector Machine (SVM) classifier show that high classification performance is achieved with the hybridization of TFIDF and PCA, reaching 99% in both F1 and accuracy.
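The general idea of combining a selected feature subset with an extracted, lower-dimensional representation can be sketched as follows. This is a minimal illustration only, not the authors' implementation: the abstract does not say how the two feature sets are merged, so simple concatenation is assumed here, as is ranking terms by their summed TFIDF weight. Random Projection stands in for the extraction step because it needs only the standard library; PCA and the SVM classifier are omitted. All function names (`tfidf_matrix`, `select_top_terms`, `random_projection`, `hybrid_features`) are hypothetical, and the toy documents are placeholders for tokenized, stemmed Arabic text.

```python
import math
import random
from collections import Counter


def tfidf_matrix(docs):
    """Return (vocabulary, matrix); matrix[i][j] is the TF-IDF weight of term j in doc i."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log(n_docs / doc_freq[term]) for term in vocab}
    matrix = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        matrix.append([term_freq[term] * idf[term] for term in vocab])
    return vocab, matrix


def select_top_terms(matrix, k):
    """Feature selection (assumed criterion): the k columns with the largest summed weight."""
    n_terms = len(matrix[0])
    totals = [sum(row[j] for row in matrix) for j in range(n_terms)]
    ranked = sorted(range(n_terms), key=lambda j: totals[j], reverse=True)
    return sorted(ranked[:k])


def random_projection(matrix, n_components, seed=0):
    """Feature extraction: project each row onto n_components random Gaussian directions."""
    rng = random.Random(seed)
    n_terms = len(matrix[0])
    scale = 1.0 / math.sqrt(n_components)
    directions = [
        [rng.gauss(0.0, 1.0) * scale for _ in range(n_components)]
        for _ in range(n_terms)
    ]
    return [
        [sum(row[j] * directions[j][c] for j in range(n_terms)) for c in range(n_components)]
        for row in matrix
    ]


def hybrid_features(docs, k_selected, n_components, seed=0):
    """Concatenate selected TF-IDF columns with projected components (assumed merge rule)."""
    vocab, matrix = tfidf_matrix(docs)
    keep = select_top_terms(matrix, k_selected)
    selected = [[row[j] for j in keep] for row in matrix]
    extracted = random_projection(matrix, n_components, seed)
    hybrid = [sel + ext for sel, ext in zip(selected, extracted)]
    return [vocab[j] for j in keep], hybrid


# Toy usage with placeholder tokens.
demo_docs = ["alpha beta gamma", "alpha beta delta", "alpha epsilon zeta"]
demo_terms, demo_features = hybrid_features(demo_docs, k_selected=2, n_components=3)
print(demo_terms, len(demo_features[0]))
```

Each document ends up with `k_selected + n_components` features, which is how a hybrid set can stay much smaller than the full vocabulary. In a fuller pipeline the resulting vectors would be passed to a classifier such as an SVM.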


Notes

1. Downloaded from http://arabicstopwords.sourceforge.net/.


Acknowledgment

Universiti Teknologi Malaysia (UTM) and the Ministry of Education Malaysia, under Research University grants 00M19, 02G71 and 4F550, are hereby acknowledged for some of the facilities that were utilized during the course of this research work. Moreover, Al-Quds Open University, Palestine, is acknowledged for supporting and funding the first author during his PhD study.

Author information

Authors and Affiliations

Faculty of Computing, Universiti Teknologi Malaysia (UTM), Skudai, Johor, Malaysia
Thabit Sabbah & Ali Selamat

Corresponding author

Correspondence to Ali Selamat.

Editor information

Editors and Affiliations

Iwate Prefectural University, Takizawa, Japan
Hamido Fujita
University of Naples "Federico II", Napoli, Italy
Guido Guizzi

About this paper

Cite this paper

Sabbah, T., Selamat, A. (2015). Hybridized Feature Set for Accurate Arabic Dark Web Pages Classification. In: Fujita, H., Guizzi, G. (eds) Intelligent Software Methodologies, Tools and Techniques. SoMeT 2015. Communications in Computer and Information Science, vol 532. Springer, Cham. https://doi.org/10.1007/978-3-319-22689-7_13

DOI: https://doi.org/10.1007/978-3-319-22689-7_13
Published: 01 September 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-22688-0
Online ISBN: 978-3-319-22689-7
eBook Packages: Computer Science, Computer Science (R0)
