Abstract
Data corpora are very important for digital forensics education and research. Several corpora are available to academia; these range from small, manually created data sets of a few megabytes to many terabytes of real-world data. However, different corpora are suited to different forensic tasks. For example, real-world data corpora are often desirable for testing forensic tool properties such as effectiveness and efficiency, but these corpora typically lack the ground truth that is vital to performing proper evaluations. Synthetic data corpora can support tool development and testing, but only if the methodologies for generating the corpora guarantee data with realistic properties.
This paper presents an overview of the available digital forensic corpora and discusses the problems that may arise when working with specific corpora. The paper also describes a framework for generating synthetic corpora for education and research when suitable real-world data is not available.
Copyright information
© 2014 IFIP International Federation for Information Processing
About this paper
Cite this paper
Yannikos, Y., Graner, L., Steinebach, M., Winter, C. (2014). Data Corpora for Digital Forensics Education and Research. In: Peterson, G., Shenoi, S. (eds) Advances in Digital Forensics X. DigitalForensics 2014. IFIP Advances in Information and Communication Technology, vol 433. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-44952-3_21
DOI: https://doi.org/10.1007/978-3-662-44952-3_21
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-44951-6
Online ISBN: 978-3-662-44952-3
eBook Packages: Computer Science (R0)