A MapReduce-Based Distributed SVM for Scalable Data Type Classification

  • Chong Jiang
  • Ting Wu
  • Jian Xu
  • Ning Zheng
  • Ming Xu
  • Tao Yang
Conference paper
Part of the Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering book series (LNICST, volume 201)

Abstract

Data type classification is a significant problem in digital forensics and information security. In previous work, methods based on the support vector machine (SVM) have proven the most successful among the classification approaches tried. However, SVM training becomes notably computationally intensive as the number of training vectors grows. In this study, we propose a parallel distributed SVM (PDSVM) based on Hadoop MapReduce for scalable data type classification. First, the map phase determines the support vectors (SVs) in each split of the dataset by running sequential minimal optimization (SMO). Then, the reduce phase merges the SVs and computes the degree of global convergence. Finally, PDSVM uses the globally converged SVs to build the SVM model. The experimental results demonstrate that PDSVM can not only process large-scale training datasets but also performs well in terms of classification accuracy.
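The map, reduce, and final-model steps described in the abstract can be illustrated with a single-process sketch. It is not the authors' Hadoop implementation. The function names (`train_smo`, `map_phase`, `reduce_phase`, `pdsvm_train`, `predict`) are hypothetical, the kernel is assumed to be linear, and the SMO variant is a simplified one. The abstract does not define the "degree of global convergence", so the sketch assumes it is the fraction of the global SV set that changed between rounds. Feeding the merged SVs back into every split on the next round is likewise an assumption.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def train_smo(X, y, C=1.0, tol=1e-3, max_sweeps=200):
    """Simplified SMO with a linear kernel. Returns (alphas, bias)."""
    n = len(X)
    alpha, b = [0.0] * n, 0.0
    if n < 2:
        return alpha, b
    K = [[dot(p, q) for q in X] for p in X]
    for _ in range(max_sweeps):
        changed = False
        for i in range(n):
            # Prediction errors under the current multipliers.
            E = [sum(alpha[k] * y[k] * K[k][m] for k in range(n)) + b - y[m]
                 for m in range(n)]
            violates = ((y[i] * E[i] < -tol and alpha[i] < C) or
                        (y[i] * E[i] > tol and alpha[i] > 0))
            if not violates:
                continue  # KKT conditions already hold for point i
            # Second multiplier: the point with the largest |E_i - E_j|.
            j = max((k for k in range(n) if k != i),
                    key=lambda k: abs(E[i] - E[k]))
            ai, aj = alpha[i], alpha[j]
            if y[i] != y[j]:
                lo, hi = max(0.0, aj - ai), min(C, C + aj - ai)
            else:
                lo, hi = max(0.0, ai + aj - C), min(C, ai + aj)
            eta = 2 * K[i][j] - K[i][i] - K[j][j]
            if lo == hi or eta >= 0:
                continue
            aj_new = min(hi, max(lo, aj - y[j] * (E[i] - E[j]) / eta))
            if abs(aj_new - aj) < 1e-7:
                continue
            ai_new = min(C, max(0.0, ai + y[i] * y[j] * (aj - aj_new)))
            di, dj = y[i] * (ai_new - ai), y[j] * (aj_new - aj)
            b1 = b - E[i] - di * K[i][i] - dj * K[i][j]
            b2 = b - E[j] - di * K[i][j] - dj * K[j][j]
            if 0 < ai_new < C:
                b = b1
            elif 0 < aj_new < C:
                b = b2
            else:
                b = (b1 + b2) / 2
            alpha[i], alpha[j] = ai_new, aj_new
            changed = True
        if not changed:
            break
    return alpha, b

def map_phase(split, global_svs, C=1.0):
    """Train on one split plus the current global SVs; emit the local SVs."""
    data = list(dict.fromkeys(split + global_svs))  # de-duplicate, keep order
    alpha, _ = train_smo([p[0] for p in data], [p[1] for p in data], C)
    return [data[i] for i in range(len(data)) if alpha[i] > 1e-8]

def reduce_phase(sv_lists, previous_svs):
    """Merge local SVs; report the changed fraction (assumed convergence measure)."""
    merged = set(sv for svs in sv_lists for sv in svs)
    union = merged | set(previous_svs)
    change = len(merged ^ set(previous_svs)) / len(union) if union else 0.0
    return sorted(merged), change

def pdsvm_train(splits, C=1.0, eps=0.0, max_rounds=10):
    global_svs = []
    for _ in range(max_rounds):
        # On a cluster these map calls would run in parallel, one per split.
        sv_lists = [map_phase(s, global_svs, C) for s in splits]
        global_svs, change = reduce_phase(sv_lists, global_svs)
        if change <= eps:
            break
    X, y = [p[0] for p in global_svs], [p[1] for p in global_svs]
    alpha, b = train_smo(X, y, C)  # final model from the converged SVs
    return X, y, alpha, b

def predict(model, x):
    X, y, alpha, b = model
    score = sum(a * t * dot(v, x) for a, t, v in zip(alpha, y, X)) + b
    return 1 if score >= 0 else -1

# Toy usage: two well-separated 2-D clusters cut into four splits.
data = []
for k in range(20):
    data.append(((2.0 + 0.05 * k, 2.0 - 0.03 * k), 1))
    data.append(((-2.0 - 0.05 * k, -2.0 + 0.04 * k), -1))
splits = [data[i:i + 10] for i in range(0, len(data), 10)]
model = pdsvm_train(splits)
accuracy = sum(predict(model, x) == t for x, t in data) / len(data)
```

Each training point is a hashable `((features...), label)` tuple so that the reduce step can merge SVs with plain set operations. A real Hadoop job would serialize these as key/value records instead.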

Keywords

Data type classification Digital forensics Support vector machine Distributed MapReduce 

Notes

Acknowledgments

This work is supported by the Natural Science Foundation of China under Grant Nos. 61070212 and 61572165, the State Key Program of Zhejiang Province Natural Science Foundation of China under Grant No. LZ15F020003, and the Key Lab of Information Network Security of the Ministry of Public Security.

Copyright information

© ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering 2017

Authors and Affiliations

  • Chong Jiang (1)
  • Ting Wu (1)
  • Jian Xu (1)
  • Ning Zheng (1)
  • Ming Xu (1)
  • Tao Yang (2, corresponding author)
  1. Internet and Network Security Laboratory, School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, China
  2. The Third Research Institute of Ministry of Public Security, Hangzhou, China
