Data Mining and Knowledge Discovery, Volume 16, Issue 3, pp 251–275

A framework for condensation-based anonymization of string data

  • Charu C. Aggarwal
  • Philip S. Yu


In recent years, privacy-preserving data mining has become an important problem because of the large amount of personal data that many business applications track. An important method for privacy-preserving data mining is condensation. This method is often used for multi-dimensional data, where pseudo-data is generated to mask the true values of the records. However, such methods are not easily applicable to string data, since they rely on multi-dimensional statistics to generate the pseudo-data. String data are especially important in the privacy-preserving data mining domain because most DNA and biological data are coded as strings. In this article, we discuss a new method for privacy-preserving mining of string data that uses simple template-based models. The template-based model turns out to be effective in practice and preserves important statistical characteristics of the strings, such as intra-record distances. We explore its behavior in the context of a classification application and show that the accuracy of the application is not significantly affected by the anonymization process.


Keywords: Privacy, Strings, Condensation





Copyright information

© Springer Science+Business Media, LLC 2008

Authors and Affiliations

  1. IBM T. J. Watson Research Center, Hawthorne, USA
  2. University of Illinois at Chicago, Chicago, USA
