The Quasi-Multinomial Synthesizer for Categorical Data

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11126)

Abstract

We present a new synthesizer for categorical data based on the Quasi-Multinomial distribution. The Quasi-Multinomial distribution provides a tuning parameter, which allows a Quasi-Multinomial synthesizer to control the balance between the utility and the disclosure risks of synthetic data. We build the Quasi-Multinomial synthesizer on a popular categorical data synthesizer, the Dirichlet process mixtures of products of multinomial distributions (DPMPM). We develop and present the general sampling methods and the algorithm of the Quasi-Multinomial synthesizer. We illustrate the trade-off it achieves between utility and disclosure risks by synthesizing a sample from the American Community Survey.

Corresponding author

Correspondence to Jingchen Hu.

Appendix

1.1 Appendix 1

Proof of Theorem 3: By symmetry it suffices to consider the case \(a_1<a_2\). To simplify the argument, write \(g(y;a)=\varGamma (a+y)/(\varGamma (a+1)(a+y)^{y-1})\). Then \({p_{BB}(y)}/{p_{QB}(y)}=g(y;a_1)g(n-y;a_2)/g(n;a_\cdot )\). Hence \({p_{BB}(y)}/{p_{QB}(y)}\) is minimized exactly when \(g(y;a_1)g(n-y;a_2)\) is minimized. We also note that \(g(y;a)r(y;a)=g(y+1;a)\), where \(r(y;a)=(a+y)^y/(a+y+1)^{y}\). Since \(r(0;a)=1\) and \(r(y;a)<1\) for \(y\ge 1\), \(g(y;a)\) is non-increasing in \(y\). Therefore, to minimize \(g(y_1;a_1)g(y_2;a_2)\) subject to \(y_1+y_2=n\), we start from \((y_1,y_2)=(0,0)\) and increase \(y_1\) or \(y_2\) by one at each step, choosing whichever reduces \(g(y_1;a_1)g(y_2;a_2)\) more. Denote the values after the ith step by \((y_1,y_2)_i, i=1,\dots ,n\). At each step \(y_1\) increases by one if \(r(y_1;a_1)\le r(y_2;a_2)\); otherwise \(y_2\) increases by one. At the first step \(r(0;a_1)=r(0;a_2)=1\), so both choices leave the product unchanged; but if \(0<a_1<a_2\) then \(r(1;a_1)<r(1;a_2)\). Hence we begin with \((y_1,y_2)_1=(1,0)\). Observing

$$\begin{aligned} \frac{d\log r(y;a)}{dy}&= \log \left( 1-\frac{1}{a+y+1}\right) +\frac{y}{(a+y)(a+y+1)}\\&= -\frac{1}{a+y+1}-\frac{1}{2}\left( \frac{1}{a+y+1}\right) ^2-\cdots +\frac{y}{(a+y)(a+y+1)}<0, \end{aligned}$$

where the inequality holds because the first and last terms sum to \(-a/((a+y)(a+y+1))<0\) and all remaining terms are negative, we note that \(r(y;a)>r(y';a)\) when \(y<y'\). Therefore \(r(1;a_2)>r(1;a_1)>r(i;a_1)\) for \(i\ge 2\), so \(y_1\) is increased at every subsequent step, which leads to \((y_1,y_2)_n=(n,0)\).    \(\square \)
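As a numerical sanity check of the argument in this proof, the following minimal Python sketch evaluates \(g(y;a_1)g(n-y;a_2)/g(n;a_\cdot )\) on the log scale for every \(y=0,\dots ,n\) and reports where it is smallest. The function names (`log_g`, `r`, `log_ratio`, `argmin_ratio`) are illustrative and do not come from the paper, and \(a_\cdot \) is assumed to denote \(a_1+a_2\). The location of the minimum does not depend on that assumption, because \(g(n;a_\cdot )\) is constant in \(y\).

```python
import math


def log_g(y, a):
    """log of g(y; a) = Gamma(a + y) / (Gamma(a + 1) * (a + y)**(y - 1)), for a > 0."""
    return math.lgamma(a + y) - math.lgamma(a + 1.0) - (y - 1) * math.log(a + y)


def r(y, a):
    """Step ratio r(y; a) = ((a + y) / (a + y + 1))**y, so that g(y + 1; a) = g(y; a) * r(y; a)."""
    return ((a + y) / (a + y + 1.0)) ** y


def log_ratio(y, n, a1, a2):
    """log of g(y; a1) * g(n - y; a2) / g(n; a1 + a2).

    Assumption: a_dot in the proof is taken to be a1 + a2.
    """
    return log_g(y, a1) + log_g(n - y, a2) - log_g(n, a1 + a2)


def argmin_ratio(n, a1, a2):
    """Value of y in {0, ..., n} at which the ratio is smallest."""
    return min(range(n + 1), key=lambda y: log_ratio(y, n, a1, a2))


if __name__ == "__main__":
    n = 10
    print(argmin_ratio(n, 0.1, 0.9))  # case a1 < a2
    print(argmin_ratio(n, 0.9, 0.1))  # mirrored case a1 > a2
```

For \(n\ge 2\) and \(a_1<a_2\) the minimizer found this way is \(y=n\), and swapping \(a_1\) and \(a_2\) moves it to \(y=0\), in line with the symmetry used at the start of the proof. For \(n=1\) the two candidate values tie, since \(r(0;a)=1\).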

1.2 Appendix 2

Table 2 summarizes the average acceptance rates r of the QB sampler proposed in Sect. 2.

Table 2. Average acceptance rates: \(r(0.1/\beta ,0.9/\beta ,n)\)

1.3 Appendix 3

Table 3 contains detailed results on the regression-based utility of the DPMPM synthetic data in Sect. 4.

Table 3. 95% confidence intervals of logistic regression coefficients based on the original data and based on the m = 20 synthetic data generated by the DPMPM synthesizer.

1.4 Appendix 4

Table 4 contains the minimum, first quartile, median, third quartile, and maximum of the identification disclosure risks of the DPMPM synthesizer. These values are all marked as horizontal lines in Figs. 3, 4 and 5.

Table 4. Minimum, first quartile (Q1), median, third quartile (Q3), and maximum of the identification disclosure risks of the DPMPM synthesizer.

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Cite this paper

Hu, J., Hoshino, N. (2018). The Quasi-Multinomial Synthesizer for Categorical Data. In: Domingo-Ferrer, J., Montes, F. (eds) Privacy in Statistical Databases. PSD 2018. Lecture Notes in Computer Science, vol 11126. Springer, Cham. https://doi.org/10.1007/978-3-319-99771-1_6

  • DOI: https://doi.org/10.1007/978-3-319-99771-1_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-99770-4

  • Online ISBN: 978-3-319-99771-1

  • eBook Packages: Computer Science, Computer Science (R0)
