Abstract
Recently, random forest classification has become a popular choice for machine learning applications aimed at detecting spam content in online social networks. In this paper, we report a systematic analysis of random forest classification for this purpose. We assessed the impact of key parameters, such as the number of trees, the depth of trees, and the minimum size of leaf nodes, on classification performance. Our results show that controlling the complexity of random forest classifiers applied to social media spam is important for avoiding overfitting and optimizing performance. We also conclude that, to support the reproducibility of experimental results, it is important to report the key parameters of random forest classifiers.
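To make the three parameters named in the abstract concrete, the following is a minimal sketch of a toy random forest in plain Python, swept over the number of trees, the maximum tree depth, and the minimum leaf size. It is not the authors' implementation or data: the synthetic dataset, the function names (`build_tree`, `train_forest`, `make_data`) and the parameter values are all assumptions chosen for illustration.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def build_tree(X, y, max_depth, min_leaf, rng, depth=0):
    """Grow one tree. A leaf is a class label; a split is (feature, threshold, left, right)."""
    majority = Counter(y).most_common(1)[0][0]
    if depth >= max_depth or len(set(y)) == 1:
        return majority
    n_features = len(X[0])
    best = None
    # Random feature subset of size about sqrt(n_features) at each node.
    for f in rng.sample(range(n_features), max(1, int(n_features ** 0.5))):
        for t in sorted(set(row[f] for row in X))[:-1]:
            left = [i for i, row in enumerate(X) if row[f] <= t]
            right = [i for i, row in enumerate(X) if row[f] > t]
            if min(len(left), len(right)) < min_leaf:
                continue  # split would create a leaf smaller than allowed
            score = sum(len(s) * gini([y[i] for i in s]) for s in (left, right))
            if best is None or score < best[0]:
                best = (score, f, t, left, right)
    if best is None:
        return majority
    _, f, t, left, right = best
    return (f, t,
            build_tree([X[i] for i in left], [y[i] for i in left],
                       max_depth, min_leaf, rng, depth + 1),
            build_tree([X[i] for i in right], [y[i] for i in right],
                       max_depth, min_leaf, rng, depth + 1))

def predict_tree(node, row):
    """Follow splits until a leaf (a plain label) is reached."""
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if row[f] <= t else right
    return node

def tree_depth(node):
    """Number of splits on the longest root-to-leaf path."""
    if not isinstance(node, tuple):
        return 0
    return 1 + max(tree_depth(node[2]), tree_depth(node[3]))

def train_forest(X, y, n_trees, max_depth, min_leaf, seed=0):
    """Train n_trees trees, each on a bootstrap sample of the training data."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]
        forest.append(build_tree([X[i] for i in idx], [y[i] for i in idx],
                                 max_depth, min_leaf, rng))
    return forest

def accuracy(forest, X, y):
    """Fraction of rows whose majority-vote prediction matches the label."""
    hits = 0
    for row, label in zip(X, y):
        votes = Counter(predict_tree(tree, row) for tree in forest)
        hits += votes.most_common(1)[0][0] == label
    return hits / len(y)

def make_data(n, seed):
    """Hypothetical stand-in for spam features: 4 integer features, noisy binary label."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        row = [rng.randint(0, 9) for _ in range(4)]
        label = int(row[0] + row[1] > 9)
        if rng.random() < 0.15:
            label = 1 - label  # label noise, so that unconstrained trees can overfit
        X.append(row)
        y.append(label)
    return X, y

X_train, y_train = make_data(200, seed=1)
X_test, y_test = make_data(200, seed=2)

# Sweep the three complexity parameters and record them with every result.
results = []
for n_trees in (1, 10):
    for max_depth in (2, 20):
        for min_leaf in (1, 10):
            forest = train_forest(X_train, y_train, n_trees, max_depth, min_leaf)
            results.append({
                "n_trees": n_trees, "max_depth": max_depth, "min_leaf": min_leaf,
                "train_acc": accuracy(forest, X_train, y_train),
                "test_acc": accuracy(forest, X_test, y_test),
            })

for r in results:
    print(r)
```

Comparing `train_acc` with `test_acc` for each configuration is one simple way to look for overfitting, and storing the parameter values next to each score keeps every reported number tied to the settings that produced it.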
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Al-Janabi, M., Andras, P. (2017). A Systematic Analysis of Random Forest Based Social Media Spam Classification. In: Yan, Z., Molva, R., Mazurczyk, W., Kantola, R. (eds) Network and System Security. NSS 2017. Lecture Notes in Computer Science(), vol 10394. Springer, Cham. https://doi.org/10.1007/978-3-319-64701-2_31
DOI: https://doi.org/10.1007/978-3-319-64701-2_31
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-64700-5
Online ISBN: 978-3-319-64701-2
eBook Packages: Computer Science, Computer Science (R0)