Abstract
We present an approach for evaluating disclosure risks for fully synthetic categorical data. The basic idea is to compute probability distributions of unknown confidential data values given the synthetic data and assumptions about intruder knowledge. We use a “worst-case” scenario of an intruder knowing all but one of the records in the confidential data. To create the synthetic data, we use a Dirichlet process mixture of products of multinomial distributions, which is a Bayesian version of a latent class model. In addition to generating synthetic data with high utility, the likelihood function admits simple and convenient approximations to the disclosure risk probabilities via importance sampling. We illustrate the disclosure risk computations by synthesizing a subset of data from the American Community Survey.
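The synthesis model named above, a mixture of products of multinomial distributions, can be illustrated with a minimal sketch. It assumes a truncated stick-breaking representation of the Dirichlet process and, purely for illustration, draws the model parameters from a prior; in the approach described here they would instead come from a posterior fitted to the confidential data. The function names, the truncation level, and the variable sizes in `levels` are all hypothetical and do not come from the paper.

```python
import random

def stick_breaking_weights(alpha, n_classes, rng):
    """Truncated stick-breaking weights for a Dirichlet process prior."""
    weights, remaining = [], 1.0
    for _ in range(n_classes - 1):
        v = rng.betavariate(1.0, alpha)   # fraction of the remaining stick
        weights.append(remaining * v)
        remaining *= 1.0 - v
    weights.append(remaining)             # last class takes the leftover mass
    return weights

def draw_category_probs(n_levels, rng):
    """Draw a probability vector from a flat Dirichlet via normalized gammas."""
    g = [rng.gammavariate(1.0, 1.0) for _ in range(n_levels)]
    total = sum(g)
    return [x / total for x in g]

def sample_synthetic(n_records, weights, probs, rng):
    """Sample records; probs[h][j] is the distribution of variable j in class h."""
    records = []
    for _ in range(n_records):
        # Pick a latent class, then draw each variable independently within it.
        h = rng.choices(range(len(weights)), weights=weights)[0]
        record = [rng.choices(range(len(p)), weights=p)[0] for p in probs[h]]
        records.append(record)
    return records

rng = random.Random(0)
levels = [2, 3, 4]  # hypothetical: three categorical variables with 2, 3, 4 levels
# Assumption: parameters are drawn from the prior here, not from a fitted posterior.
weights = stick_breaking_weights(alpha=1.0, n_classes=5, rng=rng)
probs = [[draw_category_probs(d, rng) for d in levels] for _ in weights]
synthetic = sample_synthetic(10, weights, probs, rng)
```

Within a latent class the variables are independent, so dependence among variables in the synthetic records arises only through the mixing over classes.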
This research was supported by U.S. National Science Foundation grants CNS-10-12141 and SES-11-31897.
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Hu, J., Reiter, J.P., Wang, Q. (2014). Disclosure Risk Evaluation for Fully Synthetic Categorical Data. In: Domingo-Ferrer, J. (eds) Privacy in Statistical Databases. PSD 2014. Lecture Notes in Computer Science, vol 8744. Springer, Cham. https://doi.org/10.1007/978-3-319-11257-2_15
DOI: https://doi.org/10.1007/978-3-319-11257-2_15
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-11256-5
Online ISBN: 978-3-319-11257-2
eBook Packages: Computer Science (R0)