The Quasi-Multinomial Synthesizer for Categorical Data

Hu, Jingchen; Hoshino, Nobuaki

doi:10.1007/978-3-319-99771-1_6

Conference paper
First Online: 25 August 2018

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11126)

Abstract

We present a new synthesizer for categorical data based on the Quasi-Multinomial distribution. The Quasi-Multinomial distribution provides a tuning parameter that allows a Quasi-Multinomial synthesizer to control the balance between the utility and the disclosure risks of synthetic data. We develop a Quasi-Multinomial synthesizer based on a popular categorical data synthesizer, the Dirichlet process mixtures of products of multinomial distributions. We present general sampling methods and an algorithm for the Quasi-Multinomial synthesizer, and we illustrate its balance between utility and disclosure risks by synthesizing a sample from the American Community Survey.

Author information

Authors and Affiliations

Vassar College, Poughkeepsie, USA
Jingchen Hu
Kanazawa University, Kanazawa, Japan
Nobuaki Hoshino

Corresponding author

Correspondence to Jingchen Hu.

Editor information

Editors and Affiliations

Rovira i Virgili University, Tarragona, Spain
Josep Domingo-Ferrer
University of Valencia, Burjassot, Spain
Francisco Montes

Appendix

1.1 Appendix 1

Proof of Theorem 3: By symmetry it suffices to consider the case $a_1<a_2$. To simplify the argument, write $g(y;a)=\varGamma (a+y)/(\varGamma (a+1)(a+y)^{y-1})$. Then ${p_{BB}(y)}/{p_{QB}(y)}=g(y;a_1)g(n-y;a_2)/g(n;a_\cdot )$, so ${p_{BB}(y)}/{p_{QB}(y)}$ is minimized when $g(y;a_1)g(n-y;a_2)$ is minimized. Note that $g(y;a)r(y;a)=g(y+1;a)$, where $r(y;a)=(a+y)^y/(a+y+1)^{y}$. Since $r(0;a)=1$ and $r(y;a)<1$ for $y\ge 1$, $g(y;a)$ is non-increasing in $y$, and strictly decreasing for $y\ge 1$. Starting from $(y_1,y_2)=(0,0)$, where $g(0;a_1)g(0;a_2)=1$, we therefore minimize $g(y;a_1)g(n-y;a_2)$ by increasing $y_1$ or $y_2$ one at a time, choosing at each step the increment that reduces $g(y_1;a_1)g(y_2;a_2)$ the most. Denote the values after the $i$th step by $(y_1,y_2)_i$, $i=1,\dots ,n$. At each step $y_1$ increases by one if $r(y_1;a_1)\le r(y_2;a_2)$; otherwise $y_2$ increases by one. At the first step $r(0;a_1)=r(0;a_2)=1$, but if $0<a_1<a_2$ then $r(1;a_1)<r(1;a_2)$. Hence we begin with $(y_1,y_2)_1=(1,0)$. Observing

$$\begin{aligned} \frac{d\log r(y;a)}{dy} &= \log \left( 1-\frac{1}{a+y+1}\right) +\frac{y}{(a+y)(a+y+1)}\\ &= -\frac{1}{a+y+1}-\frac{1}{2}\left( \frac{1}{a+y+1}\right) ^2-\cdots +\frac{y}{(a+y)(a+y+1)}<0, \end{aligned}$$

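we note that $r(y;a)>r(y';a)$ when $y<y'$. The inequality above holds because the first and last terms of the expansion sum to $-a/((a+y)(a+y+1))<0$ and every remaining term is negative. Therefore $r(1;a_2)>r(1;a_1)>r(i;a_1)$ for $i\ge 2$, so $y_1$ is increased at every step, which leads to $(y_1,y_2)_n=(n,0)$. $\square $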
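The argument in Appendix 1 can be checked numerically. The following is a minimal sketch, not part of the paper: it evaluates $g(y;a)$ and $r(y;a)$ exactly as defined in the proof, confirms the recurrence $g(y+1;a)=g(y;a)r(y;a)$, and searches for the $y$ that minimizes $g(y;a_1)g(n-y;a_2)$. The function names and the values of `n`, `a1`, and `a2` are illustrative assumptions.

```python
import math

def log_g(y, a):
    """log g(y; a), with g(y; a) = Gamma(a + y) / (Gamma(a + 1) * (a + y)**(y - 1))."""
    return math.lgamma(a + y) - math.lgamma(a + 1) - (y - 1) * math.log(a + y)

def r(y, a):
    """Step ratio r(y; a) = ((a + y) / (a + y + 1))**y from the proof."""
    return ((a + y) / (a + y + 1)) ** y

def argmin_product(n, a1, a2):
    """Return the y in 0..n that minimizes g(y; a1) * g(n - y; a2)."""
    return min(range(n + 1), key=lambda y: log_g(y, a1) + log_g(n - y, a2))

# Illustrative values only (assumed, not taken from the paper).
n, a1, a2 = 12, 0.3, 1.7

# Recurrence used in the proof: g(y + 1; a) = g(y; a) * r(y; a).
for y in range(n):
    lhs = math.exp(log_g(y + 1, a1))
    rhs = math.exp(log_g(y, a1)) * r(y, a1)
    assert math.isclose(lhs, rhs, rel_tol=1e-9)

# With a1 < a2 the minimizer should be y = n, i.e. (y1, y2) = (n, 0).
print(argmin_product(n, a1, a2))
```

Swapping `a1` and `a2` moves the minimizer to `y = 0`, which is the symmetric case the proof sets aside at the start.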

1.2 Appendix 2

A summary of average acceptance rates r of our QB sampler proposed in Sect. 2.

Table 2. Average acceptance rates: $r(0.1/\beta ,0.9/\beta ,n)$
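The QB sampler of Sect. 2 is not reproduced in this excerpt. As a rough illustration of how an acceptance rate of this kind can arise, the sketch below assumes a plain rejection sampler that proposes from the beta-binomial distribution and uses the bound from Appendix 1, under which $g(n;\min (a_1,a_2))/(g(y;a_1)g(n-y;a_2))\le 1$. Under that assumption the overall acceptance rate is $g(n;\min (a_1,a_2))/g(n;a_\cdot )$. The function names are hypothetical, and the output is not meant to reproduce Table 2.

```python
import math
import random

def log_g(y, a):
    """log g(y; a) as defined in Appendix 1."""
    return math.lgamma(a + y) - math.lgamma(a + 1) - (y - 1) * math.log(a + y)

def sample_beta_binomial(n, a1, a2, rng):
    """Draw y ~ BB(n, a1, a2): p ~ Beta(a1, a2), then y ~ Binomial(n, p)."""
    p = rng.betavariate(a1, a2)
    return sum(rng.random() < p for _ in range(n))

def sample_qb(n, a1, a2, rng):
    """Assumed rejection sampler for the quasi-binomial with a BB proposal.

    A proposal y is accepted with probability
    g(n; min(a1, a2)) / (g(y; a1) * g(n - y; a2)), which is at most 1
    by the minimization argument in Appendix 1.
    """
    log_floor = log_g(n, min(a1, a2))
    while True:
        y = sample_beta_binomial(n, a1, a2, rng)
        log_accept = log_floor - log_g(y, a1) - log_g(n - y, a2)
        if rng.random() < math.exp(log_accept):
            return y

def acceptance_rate(n, a1, a2):
    """Theoretical acceptance rate of the sketch: g(n; min(a1, a2)) / g(n; a1 + a2)."""
    return math.exp(log_g(n, min(a1, a2)) - log_g(n, a1 + a2))

# Illustrative parameters: a1 = 0.1 / beta, a2 = 0.9 / beta with beta = 1 (assumed).
rng = random.Random(0)
draws = [sample_qb(10, 0.1, 0.9, rng) for _ in range(5)]
print(draws, acceptance_rate(10, 0.1, 0.9))
```

For $n=1$ the acceptance probability equals one for both outcomes, so the sketch accepts every proposal; the rate falls as $n$ grows.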

1.3 Appendix 3

The table contains detailed results of regression-based utility of the DPMPM synthetic data in Sect. 4.

Table 3. 95% confidence intervals of logistic regression coefficients based on the original data and based on the m = 20 synthetic data generated by the DPMPM synthesizer.

1.4 Appendix 4

Table 4 contains the minimum, first quartile, median, third quartile, and maximum of the identification disclosure risks of the DPMPM synthesizer. These values are marked as horizontal lines in Figs. 3, 4 and 5.

Table 4. Minimum, first quartile (Q1), median, third quartile (Q3), and maximum of the identification disclosure risks of the DPMPM synthesizer.

Cite this paper

Hu, J., Hoshino, N. (2018). The Quasi-Multinomial Synthesizer for Categorical Data. In: Domingo-Ferrer, J., Montes, F. (eds) Privacy in Statistical Databases. PSD 2018. Lecture Notes in Computer Science, vol 11126. Springer, Cham. https://doi.org/10.1007/978-3-319-99771-1_6

DOI: https://doi.org/10.1007/978-3-319-99771-1_6
Published: 25 August 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-99770-4
Online ISBN: 978-3-319-99771-1
eBook Packages: Computer Science, Computer Science (R0)
