Privacy Preserving Clustering

  • Somesh Jha
  • Luis Kruger
  • Patrick McDaniel
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3679)


The freedom and transparency of information flow on the Internet have heightened concerns about privacy. Given a set of data items, clustering algorithms group similar items together. Clustering has many applications, such as customer-behavior analysis, targeted marketing, forensics, and bioinformatics. In this paper, we present the design and analysis of a privacy-preserving k-means clustering algorithm in which only the cluster means at the various steps of the algorithm are revealed to the participating parties. The crucial step in our privacy-preserving k-means is the privacy-preserving computation of cluster means. We present two protocols for this computation, one based on oblivious polynomial evaluation and the other based on homomorphic encryption. We have a Java implementation of our algorithm. Using this implementation, we have performed a thorough evaluation of our privacy-preserving clustering algorithm on three data sets. Our evaluation demonstrates that privacy-preserving clustering is feasible: for example, our homomorphic-encryption-based algorithm finished clustering a large data set in approximately 66 seconds.
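The homomorphic-encryption idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' protocol or their Java code. It assumes a Paillier-style additively homomorphic scheme (the abstract does not name one), toy key sizes, non-negative integer inputs, and hypothetical helper names (keygen, encrypt, decrypt, add, scale, joint_mean). Party A holds a partial sum and count for one cluster, party B holds another, and a random multiplier z chosen by B masks the combined sum and count so that A can recover only their ratio, the joint mean.

```python
import math
import random
from fractions import Fraction

# Assumption: two small Mersenne primes stand in for a real key; far too weak for actual use.
P, Q = 2**31 - 1, 2**61 - 1

def keygen(p=P, q=Q):
    """Toy Paillier key with generator g = n + 1. Returns (public n, secret key)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # modular inverse, needs Python 3.8+
    return n, (lam, mu)

def encrypt(n, m):
    """Encrypt a non-negative integer m < n under public key n."""
    n2 = n * n
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (1 + m * n) % n2 * pow(r, n, n2) % n2

def decrypt(n, sk, c):
    lam, mu = sk
    u = pow(c, lam, n * n)
    return (u - 1) // n * mu % n

def add(n, c1, c2):
    """Ciphertext of m1 + m2, computed without decrypting."""
    return c1 * c2 % (n * n)

def scale(n, c, k):
    """Ciphertext of k * m, computed without decrypting."""
    return pow(c, k, n * n)

def joint_mean(sum_a, count_a, sum_b, count_b):
    """Both roles are simulated in one process, purely for illustration."""
    n, sk = keygen()  # A owns the secret key

    # A -> B: encrypted partial sum and count for one cluster.
    enc_sum_a, enc_count_a = encrypt(n, sum_a), encrypt(n, count_a)

    # B: fold in its own partial values, then mask both with the same random z.
    z = random.randrange(1, 2**32)
    enc_zs = scale(n, add(n, enc_sum_a, encrypt(n, sum_b)), z)
    enc_zc = scale(n, add(n, enc_count_a, encrypt(n, count_b)), z)

    # B -> A: A decrypts z*(sum_a+sum_b) and z*(count_a+count_b); z cancels in the ratio.
    return Fraction(decrypt(n, sk, enc_zs), decrypt(n, sk, enc_zc))
```

For instance, joint_mean(120, 4, 75, 1) yields the exact mean of the five pooled points as a Fraction. Real-valued coordinates would need a fixed-point encoding, and this sketch makes no security claim about what the masked values may leak.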





Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Somesh Jha (1)
  • Luis Kruger (1)
  • Patrick McDaniel (2)
  1. Computer Sciences Department, University of Wisconsin, Madison, USA
  2. Computer Science and Engineering, Pennsylvania State University, University Park, USA
