Machine Learning, Volume 93, Issue 1, pp 163–183

Differential privacy based on importance weighting



This paper analyzes a novel method for publishing data while still protecting privacy. The method is based on computing weights that make an existing dataset, for which there are no confidentiality issues, analogous to the dataset that must be kept private. The existing dataset may be genuine but public already, or it may be synthetic. The weights are importance sampling weights, but to protect privacy, they are regularized and have noise added. The weights allow statistical queries to be answered approximately while provably guaranteeing differential privacy. We derive an expression for the asymptotic variance of the approximate answers. Experiments show that the new mechanism performs well even when the privacy budget is small, and when the public and private datasets are drawn from different populations.
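As a rough illustration of the pipeline described above, the following Python sketch weights a public sample so that a counting query approximates its answer on a different population. The abstract does not say how the weights are parameterized or where the noise enters, so the log-linear weight form, the Laplace noise on the two parameters, the `noise_scale` value, and all function names are assumptions made for this example. The noise is not calibrated to any sensitivity bound, so the sketch by itself does not guarantee differential privacy.

```python
import numpy as np


def log_linear_weights(public, coef, intercept):
    """Importance weights of the assumed form exp(intercept + coef * x)."""
    return np.exp(intercept + coef * np.asarray(public, dtype=float))


def perturb(params, noise_scale, rng):
    """Add Laplace noise to the weight parameters.

    noise_scale is a placeholder: a real mechanism would derive it from a
    sensitivity bound and the privacy budget.
    """
    params = np.asarray(params, dtype=float)
    return params + rng.laplace(0.0, noise_scale, size=params.shape)


def answer_query(public, weights, predicate):
    """Self-normalized weighted fraction of public records matching predicate."""
    hits = predicate(np.asarray(public))
    return float(np.sum(weights * hits) / np.sum(weights))


rng = np.random.default_rng(0)

# Stand-in for the public dataset: draws from N(0, 1).
public = rng.normal(0.0, 1.0, size=100_000)

# Assumed private population: N(0.5, 1). The exact log density ratio between
# N(0.5, 1) and N(0, 1) is 0.5 * x - 0.125, used here in place of a fitted model.
coef, intercept = 0.5, -0.125

noisy_coef, noisy_intercept = perturb([coef, intercept], noise_scale=0.05, rng=rng)
weights = log_linear_weights(public, noisy_coef, noisy_intercept)

# Statistical query: what fraction of the private population has x > 0?
estimate = answer_query(public, weights, lambda x: x > 0)
print(estimate)
```

Only the noisy weights and the public records are used to answer the query, which is the property that lets many queries be answered from one release. In this toy setup the private data never appears because the density ratio is known in closed form; in practice it would have to be estimated from the private data, which is where regularization and calibrated noise become necessary.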


Keywords: Privacy; Differential privacy; Importance weighting



Zhanglong Ji was funded in part by NIH grants UH2HL108785, U54HL108460, and UL1TR0001000. Charles Elkan was funded in part by NIH grant GM077402-05A1. The authors are grateful to the anonymous reviewers and to Kamalika Chaudhuri for comments that helped to improve the paper notably.



Copyright information

© The Author(s) 2013

Authors and Affiliations

  1. Department of Computer Science and Engineering 0404, University of California, San Diego, USA
