Abstract
We present a new synthesizer for categorical data based on the Quasi-Multinomial distribution. The Quasi-Multinomial distribution has a tuning parameter, which allows a Quasi-Multinomial synthesizer to control the balance between the utility and the disclosure risks of synthetic data. We develop a Quasi-Multinomial synthesizer based on a popular categorical data synthesizer, the Dirichlet process mixture of products of multinomial distributions. We develop and present general sampling methods and an algorithm for the Quasi-Multinomial synthesizer. We illustrate its balance between utility and disclosure risks by synthesizing a sample from the American Community Survey.
Appendix
1.1 Appendix 1
Proof of Theorem 3: By symmetry it suffices to consider the case \(a_1<a_2\). To simplify the argument, write \(g(y;a)=\varGamma (a+y)/(\varGamma (a+1)(a+y)^{y-1})\). Then \({p_{BB}(y)}/{p_{QB}(y)}=g(y;a_1)g(n-y;a_2)/g(n;a_\cdot )\), so \({p_{BB}(y)}/{p_{QB}(y)}\) is minimized exactly when \(g(y;a_1)g(n-y;a_2)\) is minimized. Note also that \(g(y;a)r(y;a)=g(y+1;a)\), where \(r(y;a)=(a+y)^y/(a+y+1)^{y}\). Since \(r(0;a)=1\) and \(r(y;a)<1\) for \(y\ge 1\), \(g(y;a)\) is non-increasing in \(y\), and strictly decreasing for \(y\ge 1\).
Therefore, to minimize \(g(y_1;a_1)g(y_2;a_2)\) with \(y_1=y\) and \(y_2=n-y\), we increase \(y_1\) or \(y_2\) by one at each step, choosing whichever reduces \(g(y_1;a_1)g(y_2;a_2)\) more. Denote the values after the \(i\)th step by \((y_1,y_2)_i\), \(i=1,\dots ,n\). At each step \(y_1\) increases by one if \(r(y_1;a_1)\le r(y_2;a_2)\); otherwise \(y_2\) increases by one. Although \(r(0;a_1)=r(0;a_2)=1\), if \(0<a_1<a_2\) then \(r(1;a_1)<r(1;a_2)\). Hence we begin with \((y_1,y_2)_1=(1,0)\).
Moreover, \(r(y;a)\) is decreasing in \(y\), that is, \(r(y;a)>r(y';a)\) when \(y<y'\). Therefore \(r(1;a_2)>r(1;a_1)>r(i;a_1)\) for \(i\ge 2\), which leads to \((y_1,y_2)_n=(n,0)\). \(\square \)
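The ingredients of this proof can be checked numerically. The sketch below is not part of the paper: the function names `log_g`, `log_r` and `argmin_product` are hypothetical, and it assumes \(a_1, a_2 > 0\) and works on the log scale with `math.lgamma` to avoid overflow. It evaluates \(g(y;a)\) and \(r(y;a)\) as defined in the proof and locates the \(y\) that minimizes \(g(y;a_1)g(n-y;a_2)\).

```python
from math import lgamma, log


def log_g(y, a):
    """log g(y; a), where g(y; a) = Gamma(a+y) / (Gamma(a+1) * (a+y)^(y-1)).

    Assumes a > 0 and an integer y >= 0 (assumption of this sketch).
    """
    return lgamma(a + y) - lgamma(a + 1) - (y - 1) * log(a + y)


def log_r(y, a):
    """log r(y; a), where r(y; a) = (a+y)^y / (a+y+1)^y."""
    return y * (log(a + y) - log(a + y + 1))


def argmin_product(n, a1, a2):
    """Return the y in {0, ..., n} that minimizes g(y; a1) * g(n-y; a2)."""
    values = [log_g(y, a1) + log_g(n - y, a2) for y in range(n + 1)]
    return min(range(n + 1), key=values.__getitem__)


if __name__ == "__main__":
    n, a1, a2 = 10, 0.5, 2.0  # illustrative values with a1 < a2
    print("minimizing y:", argmin_product(n, a1, a2))
```

With \(a_1<a_2\) and \(n\ge 2\), the minimizing value should be \(y=n\), that is \((y_1,y_2)=(n,0)\), which is the conclusion of the proof. For \(n=1\) the two endpoints tie, because \(g(0;a)=g(1;a)=1\).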
1.2 Appendix 2
A summary of average acceptance rates r of our QB sampler proposed in Sect. 2.
1.3 Appendix 3
The table contains detailed results of regression-based utility of the DPMPM synthetic data in Sect. 4.
1.4 Appendix 4
Table 4 contains the minimum, first quartile, median, third quartile, and maximum of the identification disclosure risks of the DPMPM synthesizer. These values are all marked as horizontal lines in Figs. 3, 4 and 5.
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Hu, J., Hoshino, N. (2018). The Quasi-Multinomial Synthesizer for Categorical Data. In: Domingo-Ferrer, J., Montes, F. (eds) Privacy in Statistical Databases. PSD 2018. Lecture Notes in Computer Science, vol 11126. Springer, Cham. https://doi.org/10.1007/978-3-319-99771-1_6
DOI: https://doi.org/10.1007/978-3-319-99771-1_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-99770-4
Online ISBN: 978-3-319-99771-1
eBook Packages: Computer Science (R0)