Text Clustering for Digital Forensics Analysis

  • Sergio Decherchi
  • Simone Tacconi
  • Judith Redi
  • Alessio Leoncini
  • Fabio Sangiacomo
  • Rodolfo Zunino
Conference paper
Part of the Advances in Intelligent and Soft Computing book series (AINSC, volume 63)


In recent decades, digital forensics has become a prominent activity in modern investigations. An important source of evidence is often the information stored on the devices under investigation. Because this kind of inquiry is complex, the digital tools used to support it are a central concern. This paper introduces a clustering-based text mining technique for investigative purposes. The proposed methodology is applied experimentally to the publicly available Enron dataset, which fits a plausible forensic analysis context well.


Keywords: text clustering, forensics analysis, digital investigation





Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Sergio Decherchi (1)
  • Simone Tacconi (2)
  • Judith Redi (1)
  • Alessio Leoncini (1)
  • Fabio Sangiacomo (1)
  • Rodolfo Zunino (1)

  1. Dept. Biophysical and Electronic Engineering, University of Genoa, Genova, Italy
  2. Servizio Polizia Postale e delle Comunicazioni, Ministero dell’Interno
