Abstract
Security informatics and computational intelligence are gaining importance in detecting terrorist activities, as extremist groups misuse many of the available Internet services to incite violence and hatred. However, the inadequate performance of statistics-based computational intelligence methods reduces the efficiency of intelligent techniques in supporting counterterrorism efforts and limits the opportunities for early detection of potential terrorist activities. In this paper, we propose a feature set hybridization method, based on feature selection and feature extraction methods, for accurate content classification of Arabic dark web pages. The proposed method hybridizes the feature sets so that the generated feature set contains fewer features that are capable of achieving higher classification performance. A dataset selected from the Dark Web Forum Portal (DWFP) is used to test the performance of the proposed method, which is based on Term Frequency - Inverse Document Frequency (TFIDF) as the feature selection method on the one hand, and on Random Projection (RP) and Principal Component Analysis (PCA) as feature extraction methods on the other. Classification results using the Support Vector Machine (SVM) classifier show that high classification performance is achieved with the hybridization of TFIDF and PCA, which reaches 99 % for both F1 and accuracy.
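As a rough illustration of the idea summarized above, the following sketch combines a TFIDF-ranked subset of terms (feature selection) with a random projection of the full term space (feature extraction). It is not the authors' implementation. The abstract does not state how the two feature sets are merged, so the concatenation used here, the `log(N/df)` form of the inverse document frequency, the function names, and the toy documents are all assumptions made for illustration. RP is used rather than PCA only because it can be written with the Python standard library; the SVM classification stage is omitted.

```python
import math
import random
from collections import Counter


def tfidf_matrix(docs):
    """Return (vocab, rows); rows[i][j] is the TF-IDF weight of term j in doc i."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(t for toks in tokenized for t in set(toks))
    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        # Assumed weighting: raw term frequency times log(N / df).
        rows.append([tf[t] * math.log(n_docs / df[t]) for t in vocab])
    return vocab, rows


def select_top_terms(rows, k):
    """Feature selection: indices of the k terms with the highest summed weight."""
    n_terms = len(rows[0])
    scores = [sum(row[j] for row in rows) for j in range(n_terms)]
    ranked = sorted(range(n_terms), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


def random_projection(rows, n_components, seed=0):
    """Feature extraction: project each row onto random Gaussian directions."""
    rng = random.Random(seed)
    n_terms = len(rows[0])
    scale = 1.0 / math.sqrt(n_components)
    proj = [[rng.gauss(0.0, 1.0) * scale for _ in range(n_components)]
            for _ in range(n_terms)]
    return [[sum(row[j] * proj[j][c] for j in range(n_terms))
             for c in range(n_components)]
            for row in rows]


def hybridize(rows, k_selected, n_components, seed=0):
    """Assumed hybridization: concatenate selected and extracted features."""
    keep = select_top_terms(rows, k_selected)
    selected = [[row[j] for j in keep] for row in rows]
    extracted = random_projection(rows, n_components, seed)
    return [s + e for s, e in zip(selected, extracted)]


# Toy placeholder documents (hypothetical, not from the DWFP dataset).
docs = [
    "forum post attack plan",
    "forum post sports news",
    "sports match news result",
    "attack plan forum thread",
]
vocab, rows = tfidf_matrix(docs)
hybrid = hybridize(rows, k_selected=4, n_components=3)
print(len(vocab), "terms reduced to", len(hybrid[0]), "hybrid features")
```

In this toy run each document ends up with seven features (four selected terms plus three projected components) instead of the full vocabulary; the resulting vectors would then be passed to a classifier such as an SVM.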
Notes
1. Downloaded from http://arabicstopwords.sourceforge.net/.
Acknowledgment
Universiti Teknologi Malaysia (UTM) and the Ministry of Education Malaysia are hereby acknowledged, under Research University grants 00M19, 02G71 and 4F550, for some of the facilities used during the course of this research work. Moreover, Al-Quds Open University, Palestine, is acknowledged for supporting and funding the first author during his PhD study.
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Sabbah, T., Selamat, A. (2015). Hybridized Feature Set for Accurate Arabic Dark Web Pages Classification. In: Fujita, H., Guizzi, G. (eds) Intelligent Software Methodologies, Tools and Techniques. SoMeT 2015. Communications in Computer and Information Science, vol 532. Springer, Cham. https://doi.org/10.1007/978-3-319-22689-7_13
DOI: https://doi.org/10.1007/978-3-319-22689-7_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-22688-0
Online ISBN: 978-3-319-22689-7
eBook Packages: Computer Science (R0)