Associating Drives Based on Their Artifact and Metadata Distributions

  • Neil C. Rowe
Conference paper
Part of the Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering book series (LNICST, volume 259)


Associations between drive images can be important in many forensic investigations, particularly those involving organizations, conspiracies, or contraband. This work investigated metrics for comparing drives based on the distributions of 18 types of clues. The clues were email addresses, phone numbers, personal names, street addresses, possible bank-card numbers, GPS data, files in zip archives, files in rar archives, IP addresses, keyword searches, hash values on files, words in file names, words in file names of Web sites, file extensions, immediate directories of files, file sizes, weeks of file creation times, and minutes within weeks of file creation. Using a large corpus of drives, we computed distributions of document association using the cosine similarity TF/IDF formula and Kullback-Leibler divergence formula. We provide significance criteria for similarity based on our tests that are well above those obtained from random distributions. We also compared similarity and divergence values, investigated the benefits of filtering and sampling the data before measuring association, examined the similarities of the same drive at different times, and developed useful visualization techniques for the associations.
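As a rough illustration of the two measures named above, the following sketch compares two drives by the counts of one clue type (for example, email addresses). It is a minimal sketch under stated assumptions, not the paper's implementation: the abstract does not give the exact formula variants, so the TF-IDF weighting (raw count times log inverse drive frequency), the additive smoothing in the divergence, and all names (`tfidf_cosine`, `kl_divergence`, `doc_freq`, `num_drives`, `smoothing`) are hypothetical choices made for this example.

```python
import math

def tfidf_cosine(counts_a, counts_b, doc_freq, num_drives):
    """Cosine similarity of two drives' clue counts under TF-IDF weighting.

    counts_a, counts_b: dict mapping clue -> count on that drive.
    doc_freq: dict mapping clue -> number of drives containing the clue
              (assumed to cover every clue in both count dicts).
    num_drives: total number of drives in the corpus.
    """
    def weights(counts):
        # Assumed weighting: raw count times log of inverse drive frequency.
        return {clue: n * math.log(num_drives / doc_freq[clue])
                for clue, n in counts.items()}

    wa, wb = weights(counts_a), weights(counts_b)
    dot = sum(w * wb[clue] for clue, w in wa.items() if clue in wb)
    norm_a = math.sqrt(sum(w * w for w in wa.values()))
    norm_b = math.sqrt(sum(w * w for w in wb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # no informative clues on at least one drive
    return dot / (norm_a * norm_b)

def kl_divergence(counts_a, counts_b, smoothing=1e-9):
    """Kullback-Leibler divergence D(A || B) over the union of clues.

    Additive smoothing (an assumption of this sketch) keeps the value
    finite when a clue appears on only one of the two drives.
    """
    clues = set(counts_a) | set(counts_b)
    total_a = sum(counts_a.values()) + smoothing * len(clues)
    total_b = sum(counts_b.values()) + smoothing * len(clues)
    divergence = 0.0
    for clue in clues:
        p = (counts_a.get(clue, 0) + smoothing) / total_a
        q = (counts_b.get(clue, 0) + smoothing) / total_b
        divergence += p * math.log(p / q)
    return divergence
```

Note that the cosine measure is symmetric while the divergence is not, so `kl_divergence(a, b)` and `kl_divergence(b, a)` generally differ; comparing the two measures on the same drive pairs therefore requires choosing a direction or averaging both.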


Keywords: Drives, Forensics, Link analysis, Similarity, Divergence, Artifacts, Metadata



This work was supported by the Naval Research Program at the Naval Postgraduate School under JON W7B27. The views expressed are those of the author and do not represent the U.S. Government. Edith Gonzalez-Reynoso and Sandra Falgout helped.



Copyright information

© ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering 2019

Authors and Affiliations

  1. Computer Science, U.S. Naval Postgraduate School, Monterey, USA
