pMSE Mechanism: Differentially Private Synthetic Data with Maximal Distributional Similarity

  • Joshua Snoke
  • Aleksandra Slavković
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11126)


We propose a method for the release of differentially private synthetic datasets. In many contexts, data contain sensitive values which cannot be released in their original form in order to protect individuals’ privacy. Synthetic data is a protection method that releases alternative values in place of the original ones, and differential privacy (DP) is a formal guarantee for quantifying the privacy loss. We propose a method that maximizes the distributional similarity of the synthetic data relative to the original data using a measure known as the pMSE, while guaranteeing \(\epsilon \)-DP. We relax common DP assumptions concerning the distribution and boundedness of the original data. We prove theoretical results for the privacy guarantee and provide simulations for the empirical failure rate of the theoretical results under typical computational limitations. We give simulations for the accuracy of linear regression coefficients generated from the synthetic data compared with the accuracy of non-DP synthetic data and other DP methods. Additionally, our theoretical results extend a prior result for the sensitivity of the Gini Index to include continuous predictors.
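As a rough illustration of the pMSE measure named above, the sketch below computes a propensity-score mean squared error for one numeric variable. It stacks the original and synthetic values, estimates for each row the probability that it is synthetic, and averages the squared distance of those probabilities from the overall synthetic share. The propensity model here is a fixed equal-width binning chosen for brevity; that choice, the function name `pmse`, and the parameter `n_bins` are assumptions made for this example and are not taken from the paper. The sketch carries no privacy guarantee.

```python
def pmse(original, synthetic, n_bins=4):
    """Propensity-score MSE with an equal-width binning as a stand-in propensity model."""
    # Label original rows 0 and synthetic rows 1, then stack them.
    combined = [(x, 0) for x in original] + [(x, 1) for x in synthetic]
    n = len(combined)
    c = len(synthetic) / n  # overall share of synthetic rows

    lo = min(x for x, _ in combined)
    hi = max(x for x, _ in combined)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if all values are equal

    def bin_of(x):
        # Clamp so the maximum value lands in the last bin.
        return min(int((x - lo) / width), n_bins - 1)

    # counts[b] = [rows in bin b, synthetic rows in bin b]
    counts = [[0, 0] for _ in range(n_bins)]
    for x, label in combined:
        b = bin_of(x)
        counts[b][0] += 1
        counts[b][1] += label

    # Estimated propensity of a row = synthetic fraction within its bin.
    total = 0.0
    for x, _ in combined:
        rows, synth_rows = counts[bin_of(x)]
        total += (synth_rows / rows - c) ** 2
    return total / n


if __name__ == "__main__":
    print(pmse([1, 2, 3, 4], [1, 2, 3, 4]))      # indistinguishable samples
    print(pmse([0, 0, 0, 0], [10, 10, 10, 10]))  # fully separated samples
```

Under this sketch a value of 0 means the binning cannot tell the two samples apart, and with equally sized samples the largest possible value is 0.25, reached when every bin holds only original or only synthetic rows.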


Keywords: Differential privacy, Synthetic data, Classification trees



This work is supported by the U.S. Census Bureau and NSF grants BCS-0941553 and SES-1534433 to the Department of Statistics at the Pennsylvania State University. Thanks to Bharath Sriperumbudur for special aid in deriving the final form of the theoretical proof.



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. Department of Statistics, Pennsylvania State University, University Park, USA
