Some Clarifications Regarding Fully Synthetic Data

  • Jörg Drechsler
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11126)

Abstract

In recent years there has been some confusion about the circumstances under which datasets generated with the synthetic data approach should be considered fully synthetic, and about which estimator to use to obtain valid variance estimates from the synthetic data. This paper aims to provide guidance to resolve this confusion. It reviews the different approaches for generating synthetic datasets and discusses their similarities and differences. It also presents the different variance estimators that have been proposed for analyzing synthetic data. The advantages and limitations of the different estimators are discussed on the basis of two simulation studies. The paper concludes with general recommendations on how to judge which synthesis strategy and which variance estimator are most suitable in a given situation.
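
To make the idea of a variance estimator for synthetic data concrete, the following is a minimal sketch of two combining rules that are widely used in this literature: the rule usually attributed to Raghunathan, Reiter and Rubin (2003) for fully synthetic data, and the rule usually attributed to Reiter (2003) for partially synthetic data. Whether these are exactly the estimators compared in the paper is an assumption, since the abstract does not list them. The function name `combine_synthetic` and its return keys are hypothetical and chosen only for illustration.

```python
from statistics import mean, variance


def combine_synthetic(estimates, variances):
    """Combine results from m synthetic datasets (illustrative sketch).

    estimates: point estimates q_i, one per synthetic dataset
    variances: estimated variances u_i of those point estimates
    """
    m = len(estimates)
    if m < 2 or len(variances) != m:
        raise ValueError("need m >= 2 estimates and one variance per estimate")

    q_bar = mean(estimates)    # combined point estimate
    b_m = variance(estimates)  # between-dataset variance (divisor m - 1)
    u_bar = mean(variances)    # average within-dataset variance

    # Rule commonly used for fully synthetic data; it can turn out
    # negative, in which case it is not usable as a variance as is.
    t_full = (1 + 1 / m) * b_m - u_bar

    # Rule commonly used for partially synthetic data; always non-negative.
    t_partial = u_bar + b_m / m

    return {"q_bar": q_bar, "b_m": b_m, "u_bar": u_bar,
            "t_full": t_full, "t_partial": t_partial}


if __name__ == "__main__":
    # Made-up numbers, used only to show the call.
    print(combine_synthetic([10.0, 12.0, 14.0], [1.0, 1.0, 1.0]))
```

The two rules treat the between-dataset variance differently, so applying one of them to data generated under the other scheme can give a misleading variance estimate. This is why the choice of estimator has to follow from how the synthetic data were generated.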

Keywords

Confidentiality, Multiple imputation, Fully synthetic, Variance estimation


Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. Institute for Employment Research, Nuremberg, Germany
