Disclosure Risk Evaluation for Fully Synthetic Categorical Data

  • Conference paper
Privacy in Statistical Databases (PSD 2014)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8744)

Abstract

We present an approach for evaluating disclosure risks for fully synthetic categorical data. The basic idea is to compute probability distributions of unknown confidential data values given the synthetic data and assumptions about intruder knowledge. We use a “worst-case” scenario of an intruder knowing all but one of the records in the confidential data. To create the synthetic data, we use a Dirichlet process mixture of products of multinomial distributions, which is a Bayesian version of a latent class model. In addition to generating synthetic data with high utility, this model has a likelihood function that admits simple and convenient approximations to the disclosure risk probabilities via importance sampling. We illustrate the disclosure risk computations by synthesizing a subset of data from the American Community Survey.
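
The generative structure of a Dirichlet process mixture of products of multinomials can be sketched in a few lines. The Python example below is only an illustration under stated assumptions, not the authors' implementation: it truncates the stick-breaking construction at a fixed number of latent classes, uses flat Dirichlet priors on the within-class probabilities, and draws all parameters from the prior rather than from a posterior fitted to confidential data. The function names, the truncation level `n_classes`, and the concentration parameter `alpha` are hypothetical choices made for this sketch.

```python
import numpy as np


def stick_breaking_weights(alpha, n_classes, rng):
    """Truncated stick-breaking weights for the latent classes (sum to 1)."""
    v = rng.beta(1.0, alpha, size=n_classes)
    v[-1] = 1.0  # truncation: the last class takes whatever stick remains
    # remaining[h] = prod_{l < h} (1 - v[l]), with remaining[0] = 1
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * remaining


def sample_synthetic(n_records, levels, alpha=1.0, n_classes=10, seed=0):
    """Draw categorical records from a mixture of products of multinomials.

    levels[j] is the number of categories of variable j. All parameters are
    drawn from the prior here (an assumption of this sketch); a real
    synthesizer would use posterior draws given the confidential data.
    """
    rng = np.random.default_rng(seed)
    pi = stick_breaking_weights(alpha, n_classes, rng)
    # phi[j][h] is the probability vector of variable j within latent class h
    phi = [rng.dirichlet(np.ones(d), size=n_classes) for d in levels]
    # Each record first gets a latent class, then its variables are drawn
    # independently given that class.
    z = rng.choice(n_classes, size=n_records, p=pi)
    data = np.empty((n_records, len(levels)), dtype=int)
    for j, d in enumerate(levels):
        for i in range(n_records):
            data[i, j] = rng.choice(d, p=phi[j][z[i]])
    return data, z, pi


if __name__ == "__main__":
    data, z, pi = sample_synthetic(n_records=5, levels=[2, 3, 4])
    print(data)
```

Conditional on the latent class, the variables are independent, so the likelihood of a record is a weighted sum over classes of products of per-variable probabilities. That product form is what keeps the likelihood cheap to evaluate for candidate values of an unknown record.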

This research was supported by U.S. National Science Foundation grants CNS-10-12141 and SES-11-31897.

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Hu, J., Reiter, J.P., Wang, Q. (2014). Disclosure Risk Evaluation for Fully Synthetic Categorical Data. In: Domingo-Ferrer, J. (eds) Privacy in Statistical Databases. PSD 2014. Lecture Notes in Computer Science, vol 8744. Springer, Cham. https://doi.org/10.1007/978-3-319-11257-2_15

  • DOI: https://doi.org/10.1007/978-3-319-11257-2_15

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-11256-5

  • Online ISBN: 978-3-319-11257-2

  • eBook Packages: Computer Science, Computer Science (R0)
