Toward Privacy in Public Databases

  • Shuchi Chawla
  • Cynthia Dwork
  • Frank McSherry
  • Adam Smith
  • Hoeteck Wee
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3378)


We initiate a theoretical study of the census problem. Informally, in a census individual respondents give private information to a trusted party (the census bureau), who publishes a sanitized version of the data. There are two fundamentally conflicting requirements: privacy for the respondents and utility of the sanitized data. Unlike in the study of secure function evaluation, in which privacy is preserved to the extent possible given a specific functionality goal, in the census problem privacy is paramount; intuitively, things that cannot be learned “safely” should not be learned at all.

An important contribution of this work is a definition of privacy (and privacy compromise) for statistical databases, together with a method for describing and comparing the privacy offered by specific sanitization techniques. We obtain several privacy results using two different sanitization techniques, and then show how to combine them via cross training. We also obtain two utility results involving clustering.


Auxiliary Information Statistical Database Private Information Retrieval Privacy Breach Real Database 
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.


  1. 1.
    Agrawal, D., Aggarwal, C.: On the Design and Quantification of Privacy Preserving Data Mining Algorithms. In: Proceedings of the 20th Symposium on Principles of Database Systems (2001)Google Scholar
  2. 2.
    Adam, N.R., Wortmann, J.C.: Security-Control Methods for Statistical Databases. A Comparative Study, ACM Computing Surveys 21(4), 515–556 (1989)CrossRefGoogle Scholar
  3. 3.
    Arora, S., Kannan, R.: Learning mixtures of arbitrary Gaussians. In: ACM STOC (2001)Google Scholar
  4. 4.
    Agrawal, R., Srikant, R.: Privacy-preserving data mining. In: Proc. of the ACM SIGMOD Conference on Management of Data, pp. 439–450 (2000)Google Scholar
  5. 5.
    Beck, L.: A security machanism for statistical database. ACM Transactions on Database Systems (TODS) 5(3), 316–338 (1980)zbMATHCrossRefGoogle Scholar
  6. 6.
    Chawla, S., Dwork, C., McSherry, F., Talwar, K.: On the Utility of Privacy- Preserving Histograms (November 2004) (in preparation)Google Scholar
  7. 7.
    Cox, L.H.: New Results in Disclosure Avoidance for Tabulations. In: International Statistical Institute Proceedings of the 46th Session, Tokyo, pp. 83–84 (1987)Google Scholar
  8. 8.
    Dasgupta, S.: Learning mixtures of Gaussians. In: IEEE FOCS (1999)Google Scholar
  9. 9.
    Denning, D.: Secure statistical databases with random sample queries. ACM Transactions on Database Systems (TODS) 5(3), 291–315 (1980)zbMATHCrossRefGoogle Scholar
  10. 10.
    Diaconis, P., Sturmfels, B.: Algebraic Algorithms for Sampling from Conditional Distributions. Annals of Statistics 26(1), 363–397 (1998)zbMATHCrossRefMathSciNetGoogle Scholar
  11. 11.
    Dinur, I., Nissim, K.: Revealing information while preserving privacy. In: Proceedings of the Symposium on Principles of Database Systems, pp. 202–210 (2003)Google Scholar
  12. 12.
    Dobra, A., Fienberg, S.E., Trottini, M.: Assessing the risk of disclosure of confidential categorical data, Bayesian Statistics 7, pp. 125–144. Oxford University Press, Oxford (2000)Google Scholar
  13. 13.
    Dwork, C.: A Cryptography-Flavored Approach to Privacy in Public Databases. In: lecture at Aladdin Workshop on Privacy in DATA (March 2003),
  14. 14.
    Dwork, C., Naor, M., et al.: Impossibility Results for Privacy-Preserving Data Sanitization (2004) (in preparation)Google Scholar
  15. 15.
    Dwork, C., Nissim, K.: Privacy-preserving datamining on vertically partitioned databases. In: Franklin, M. (ed.) CRYPTO 2004. LNCS, vol. 3152, pp. 528–544. Springer, Heidelberg (2004)Google Scholar
  16. 16.
    Evfimievski, A.V., Gehrke, J., Srikant, R.: Limiting privacy breaches in privacy preserving data mining. In: Proceedings of the Symposium on Principles of Database Systems, pp. 211–222 (2003)Google Scholar
  17. 17.
    Füredi, Z., Komlós, J.: The eigenvalues of random symmetric matrices, Combinatorica. Combinatorica 1(3), 233–241 (1981)zbMATHCrossRefMathSciNetGoogle Scholar
  18. 18.
    Gasarch, W.: A Survey on Private Information Retrieval. BEATCS Computational Complexity Column 82, 72–107 (2004)zbMATHMathSciNetGoogle Scholar
  19. 19.
    Gavison, R.: Privacy and the Limits of the Law. In: Johnson, D.G., Nissenbaum, H. (eds.) Computers, Ethics, and Social Values, pp. 332–351. Prentice Hall, Englewood Cliffs (1995)Google Scholar
  20. 20.
    Goldreich, O.: The Foundations of Cryptography, vol. 2. Cambridge University Press, Cambridge (2004)CrossRefGoogle Scholar
  21. 21.
    Goldwasser, S., Micali, S.: Probabilistic Encryption. JCSS 28(2), 270–299 (1984)zbMATHMathSciNetGoogle Scholar
  22. 22.
    Gusfield, D.: A Graph Theoretic Approach to Statistical Data Security. SIAM Journal on Computing 17(3), 552–571 (1988)zbMATHCrossRefMathSciNetGoogle Scholar
  23. 23.
    Indyk, P., Motwani, R.: Approximate Nearest Neighbor: Towards Removing the Curse of Dimensionality. In: Proceedings of the 30th Annual ACM Symposium on Theory of Computing (1998)Google Scholar
  24. 24.
    Kargupta, H., Datta, S., Wang, Q., Sivakumar, K.: On the Privacy Preserving Properties of Random Data Perturbation Techniques. In: Proceedings of the Third ICDM IEEE International Conference on Data Mining (2003)Google Scholar
  25. 25.
    Kleinberg, J.M., Papadimitriou, C.H., Raghavan, P.: Auditing Boolean Attributes. J. Comput. Syst. Sci. 66(1), 244–253 (2003)zbMATHCrossRefMathSciNetGoogle Scholar
  26. 26.
    McSherry, F.: Spectral Partitioning of Random Graphs. In: Proc. 42nd FOCS, pp. 529–537 (2001)Google Scholar
  27. 27.
    Raghunathank, T.E., Reiter, J.P., Rubin, D.B.: Multiple Imputation for Statistical Disclosure Limitation. J. Official Statistics 19(1), 1–16 (2003)Google Scholar
  28. 28.
    Roque, G.: Application and Analysis of the Mixture-of-Normals Approach to Masking Census Public-use Microdata (2003) (Manuscript)Google Scholar
  29. 29.
    Rubin, D.B.: Discussion: Statistical Disclosure Limitation. Journal of Official Statistics 9(2), 461–468 (1993)zbMATHGoogle Scholar
  30. 30.
    Sibson, R.: SLINK: an optimally efficient algorithm for the single-link cluster method. The Computer Journal 16(1), 30–34 (1973)CrossRefMathSciNetGoogle Scholar
  31. 31.
    Sweeney, L.: k-anonymity: a model for protecting privacy. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems 10(5), 557–570 (2002)zbMATHCrossRefMathSciNetGoogle Scholar
  32. 32.
    Sweeney, L.: Achieving k-anonymity privacy protection using generalization and suppression. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems 10(5), 571–588 (2002)zbMATHCrossRefMathSciNetGoogle Scholar
  33. 33.
    Vempala, S., Wang, G.: A spectral algorithm for learning mixtures of distributions. In: IEEE FOCS (2002)Google Scholar
  34. 34.
    Winkler, W.E.: Masking and re-identification methods for public-use microdata: Overview and research problems. In: Domingo-Ferrer, J., Torra, V. (eds.) PSD 2004. LNCS, vol. 3050, pp. 231–246. Springer, Heidelberg (2004)CrossRefGoogle Scholar
  35. 35.
    Winkler, W.E.: Re-identification methods for masked microdata. In: Domingo-Ferrer, J., Torra, V. (eds.) PSD 2004. LNCS, vol. 3050, pp. 216–230. Springer, Heidelberg (2004)CrossRefGoogle Scholar

Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Shuchi Chawla
    • 1
  • Cynthia Dwork
    • 2
  • Frank McSherry
    • 2
  • Adam Smith
    • 3
  • Hoeteck Wee
    • 4
  1. 1.Carnegie Mellon University 
  2. 2.Microsoft Research SVC 
  3. 3.Weizmann Institute of Science 
  4. 4.University of CaliforniaBerkeley

Personalised recommendations