Cyber forensics framework for big data analytics in IoT environment using machine learning

  • Gurpal Singh Chhabra
  • Varinder Pal Singh
  • Maninder Singh


The growing volume of data from IoT-based environments is straining the skills of forensic analysts. Tangible sources usually have size limits, but communication traffic does not, which increases the need for efficient benchmarking of big data analysis. Solutions available to date rely on anomaly-based approaches, or on approaches based on deviation from a regular pattern. To handle the seized data, the authors propose an approach to big data forensics with high sensitivity and precision. The presented work proposes a generalized forensic framework that uses Google's programming model, MapReduce, as the backbone for traffic translation, extraction, and analysis of dynamic traffic features. The proposed technique uses open source tools such as Hadoop, Hive, Mahout, and R; apart from being open source, these tools support scalability and parallel processing. A comparative analysis of widely accepted machine learning models for P2P malware analysis in simulated real time is also presented. A dataset from CAIDA was processed in parallel to validate the proposed model. Finally, the forensic performance metrics show that the model achieves a sensitivity of 99%.
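As a rough illustration of the MapReduce idea behind traffic feature extraction, the following minimal sketch runs a map step and a reduce step in plain Python on a handful of made-up packet records. It is not the authors' implementation: the record layout, the flow key, the feature names, and the helper functions are all assumptions chosen for illustration, and a real deployment would run on Hadoop rather than in a single process. The two small helpers at the end show how sensitivity and precision are conventionally computed from confusion-matrix counts.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical packet records: (src, dst, protocol, size_bytes, timestamp_s).
# These values are invented for illustration; they are not CAIDA data.
PACKETS = [
    ("10.0.0.1", "10.0.0.2", "TCP", 100, 0.0),
    ("10.0.0.1", "10.0.0.2", "TCP", 300, 1.0),
    ("10.0.0.3", "10.0.0.4", "UDP", 50, 0.5),
    ("10.0.0.1", "10.0.0.2", "TCP", 200, 3.0),
]


def map_packet(packet):
    """Map step: emit (flow key, partial statistics) for one packet."""
    src, dst, proto, size, ts = packet
    key = (src, dst, proto)  # assumed flow key
    return key, {"packets": 1, "bytes": size, "first": ts, "last": ts}


def reduce_flow(a, b):
    """Reduce step: merge two partial statistics for the same flow."""
    return {
        "packets": a["packets"] + b["packets"],
        "bytes": a["bytes"] + b["bytes"],
        "first": min(a["first"], b["first"]),
        "last": max(a["last"], b["last"]),
    }


def extract_flow_features(packets):
    """Group mapped records by flow key, then reduce each group."""
    grouped = defaultdict(list)
    for key, value in map(map_packet, packets):
        grouped[key].append(value)  # stands in for the shuffle phase

    features = {}
    for key, values in grouped.items():
        total = reduce(reduce_flow, values)
        features[key] = {
            "packets": total["packets"],
            "bytes": total["bytes"],
            "duration": total["last"] - total["first"],
            "mean_size": total["bytes"] / total["packets"],
        }
    return features


def sensitivity(tp, fn):
    """True positive rate: TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def precision(tp, fp):
    """Positive predictive value: TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else 0.0
```

Calling `extract_flow_features(PACKETS)` returns one feature dictionary per flow key. Because `reduce_flow` is associative and commutative, partial results could be merged in any order, which is the property that lets this kind of aggregation be parallelized.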


Hadoop, Hive, HQL, Mahout, Sqoop, Cyber forensic framework




Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. Computer Science and Engineering Department, Thapar University, Patiala, India
