Big Data Forensics: Hadoop Distributed File Systems as a Case Study

  • Mohammed Asim
  • Dean Richard McKinnel
  • Ali Dehghantanha
  • Reza M. Parizi
  • Mohammad Hammoudeh
  • Gregory Epiphaniou


Big Data has fast become one of the most widely adopted paradigms in computer science, and it poses an equally significant challenge for forensic investigators. The Hadoop Distributed File System (HDFS) is one of the most popular big data platforms on the market, supporting parallel processing and data analytics at scale. However, HDFS is not without its risks: it has reportedly been targeted by cyber criminals as a means of stealing and exfiltrating confidential data. Using HDFS as a case study, we aim to detect remnants of malicious users’ activities within the HDFS environment. Our examination involves a thorough analysis of different areas of the HDFS environment, including a range of log files and disk images. Our experimental environment comprised four virtual machines, all running Ubuntu. This research provides a thorough understanding of the types of forensically relevant artefacts that are likely to be found during a forensic investigation of HDFS.


HDFS, Digital forensics, Hadoop, Big data, Distributed file systems



We would like to thank the editor and anonymous reviewers for their constructive comments. The views and opinions expressed in this article are those of the authors and not of the organisations with which the authors are or have been associated, or by which they have been supported.



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Mohammed Asim (1)
  • Dean Richard McKinnel (1)
  • Ali Dehghantanha (2)
  • Reza M. Parizi (3)
  • Mohammad Hammoudeh (4)
  • Gregory Epiphaniou (5)
  1. Department of Computer Science, University of Salford, Manchester, UK
  2. Cyber Science Lab, School of Computer Science, University of Guelph, Guelph, Canada
  3. Department of Software Engineering and Game Development, Kennesaw State University, Marietta, USA
  4. School of Computing, Mathematics and Digital Technology, Manchester Metropolitan University, Manchester, UK
  5. Wolverhampton Cyber Research Institute (WCRI), School of Mathematics and Computer Science, University of Wolverhampton, Wolverhampton, UK
