Disclosure Risk Evaluation for Fully Synthetic Categorical Data

Hu, Jingchen; Reiter, Jerome P.; Wang, Quanli

doi:10.1007/978-3-319-11257-2_15

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8744)

Included in the following conference series:

International Conference on Privacy in Statistical Databases

Abstract

We present an approach for evaluating disclosure risks for fully synthetic categorical data. The basic idea is to compute probability distributions of unknown confidential data values given the synthetic data and assumptions about intruder knowledge. We use a “worst-case” scenario of an intruder knowing all but one of the records in the confidential data. To create the synthetic data, we use a Dirichlet process mixture of products of multinomial distributions, which is a Bayesian version of a latent class model. In addition to generating synthetic data with high utility, the likelihood function admits simple and convenient approximations to the disclosure risk probabilities via importance sampling. We illustrate the disclosure risk computations by synthesizing a subset of data from the American Community Survey.
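
The computation outlined in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a finite mixture with a fixed number of classes `K` as a stand-in for the Dirichlet process mixture, a uniform intruder prior over candidate values, and randomly generated parameter draws in place of real MCMC output. All function and variable names (`log_pmf`, `risk_distribution`, `draws`, `levels`) are hypothetical. The importance weights use the identity that, for a parameter draw θ from the posterior given the full confidential data, the posterior with record i replaced by a candidate y differs by the factor p(y | θ) / p(y_i | θ).

```python
import itertools
import numpy as np


def _logsumexp(a, axis=None):
    """Numerically stable log(sum(exp(a))) along an axis."""
    m = np.max(a, axis=axis, keepdims=True)
    out = m + np.log(np.sum(np.exp(a - m), axis=axis, keepdims=True))
    return np.squeeze(out, axis=axis)


def log_pmf(records, pi, phi):
    """Log-probability of each record under a mixture of products of multinomials.

    records: (n, p) integer category codes (a single record is also accepted)
    pi:      (K,) mixture weights
    phi:     list of p arrays; phi[j] has shape (K, d_j)
    """
    records = np.atleast_2d(records)
    log_comp = np.zeros((records.shape[0], len(pi)))
    for j, phi_j in enumerate(phi):
        # log p(x_j | class k), accumulated over variables j
        log_comp += np.log(phi_j[:, records[:, j]]).T
    return _logsumexp(log_comp + np.log(pi), axis=1)


def risk_distribution(synthetic, true_record, candidates, draws):
    """Approximate p(y_i = y | synthetic data, all other records) per candidate y.

    Assumes a uniform prior over candidates; `draws` is a list of (pi, phi)
    pairs treated as posterior draws given the full confidential data.
    """
    log_lik_z = np.array([log_pmf(synthetic, pi, phi).sum() for pi, phi in draws])
    log_true = np.array([log_pmf(true_record, pi, phi)[0] for pi, phi in draws])
    log_post = np.empty(len(candidates))
    for c, y in enumerate(candidates):
        log_cand = np.array([log_pmf(y, pi, phi)[0] for pi, phi in draws])
        log_w = log_cand - log_true  # unnormalised importance weights
        # self-normalised estimate of log p(synthetic | y_i = y, other records)
        log_post[c] = _logsumexp(log_w + log_lik_z) - _logsumexp(log_w)
    return np.exp(log_post - _logsumexp(log_post))


# Toy setup: two variables with 2 and 3 levels, K = 2 latent classes.
rng = np.random.default_rng(0)
levels, K, H = [2, 3], 2, 50

# Stand-in parameter draws (random, NOT real posterior samples).
draws = []
for _ in range(H):
    pi = rng.dirichlet(np.ones(K))
    phi = [rng.dirichlet(np.ones(d), size=K) for d in levels]
    draws.append((pi, phi))

# Generate a small synthetic data set from the first draw.
pi0, phi0 = draws[0]
z = rng.choice(K, size=20, p=pi0)
synthetic = np.column_stack(
    [[rng.choice(d, p=phi0[j][k]) for k in z] for j, d in enumerate(levels)]
)

candidates = list(itertools.product(*[range(d) for d in levels]))
probs = risk_distribution(synthetic, np.array([0, 1]), candidates, draws)
```

Here `probs` holds one probability per candidate cell of the contingency table. In an actual risk evaluation the draws would come from the fitted synthesizer, and the entry for the true record would be compared against the others to judge how strongly the synthetic data point to it.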

This research was supported by U.S. National Science Foundation grants CNS-10-12141 and SES-11-31897.

Author information

Authors and Affiliations

Duke University, Durham, NC, 27708, USA
Jingchen Hu, Jerome P. Reiter & Quanli Wang

Editor information

Editors and Affiliations

Department of Computer Engineering and Mathematics, UNESCO Chair in Data Privacy, Universitat Rovira i Virgili, Av. Països Catalans 26, 43007, Tarragona, Catalonia
Josep Domingo-Ferrer

Cite this paper

Hu, J., Reiter, J.P., Wang, Q. (2014). Disclosure Risk Evaluation for Fully Synthetic Categorical Data. In: Domingo-Ferrer, J. (eds) Privacy in Statistical Databases. PSD 2014. Lecture Notes in Computer Science, vol 8744. Springer, Cham. https://doi.org/10.1007/978-3-319-11257-2_15

DOI: https://doi.org/10.1007/978-3-319-11257-2_15
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-11256-5
Online ISBN: 978-3-319-11257-2
eBook Packages: Computer Science (R0)
