Nonparametric Generation of Synthetic Data for Small Geographic Areas

Sakshaug, Joseph W.; Raghunathan, Trivellore E.

doi:10.1007/978-3-319-11257-2_17

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8744)

Included in the following conference series:

International Conference on Privacy in Statistical Databases

Abstract

Computing and releasing statistics for small geographic areas is a common task for many statistical agencies, but releasing public-use microdata for these areas is much less common due to data confidentiality concerns. Accessing the restricted microdata is usually only possible within a research data center (RDC). This arrangement is inconvenient for many researchers who must travel long distances and, in some cases, pay a sizeable data usage fee to access the nearest RDC. An alternative data dissemination method that has been explored is to release public-use synthetic data. In general, synthetic data consists of imputed values drawn from a predictive model based on the observed data. Data confidentiality is preserved because no actual data values are released. The imputed values are typically drawn from a standard, parametric distribution, but often key variables of interest do not follow strict parametric forms. In this paper, we apply a nonparametric method for generating synthetic data for continuous variables collected from small geographic areas. The method is evaluated using data from the 2005–2007 American Community Survey. The analytic validity of the synthetic data is assessed by comparing parametric (baseline) and nonparametric inferences obtained from the synthetic data with those obtained from the observed data.
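
The contrast the abstract draws between parametric and nonparametric synthesis can be illustrated with a small sketch. The abstract does not describe the authors' actual procedure, so everything below is an assumption made for illustration: the function names are hypothetical, the parametric baseline is a plain normal fit, and the nonparametric draw inverts a piecewise-linear empirical CDF whose probabilities are perturbed with Bayesian-bootstrap-style weights. It covers a single continuous variable in a single area and ignores covariates and the small-area hierarchy.

```python
import bisect
import random
import statistics


def parametric_synthetic(observed, n, rng):
    """Baseline (assumed): draw n values from a normal fitted to the data."""
    mu = statistics.mean(observed)
    sigma = statistics.stdev(observed)
    return [rng.gauss(mu, sigma) for _ in range(n)]


def nonparametric_synthetic(observed, n, rng):
    """Illustrative only: draw n values by inverting a piecewise-linear
    empirical CDF. Random exponential gaps, once normalised, act as
    Bayesian-bootstrap-style weights on the intervals between order
    statistics, so each synthetic data set uses a perturbed CDF.
    """
    xs = sorted(observed)
    gaps = [rng.expovariate(1.0) for _ in range(len(xs) - 1)]

    # Cumulative probability at each order statistic: 0 at xs[0], 1 at xs[-1].
    running = [0.0]
    for g in gaps:
        running.append(running[-1] + g)
    cum = [r / running[-1] for r in running]

    draws = []
    for _ in range(n):
        u = rng.random()
        # Locate the interval with cum[i] <= u < cum[i + 1].
        i = min(bisect.bisect_right(cum, u) - 1, len(xs) - 2)
        frac = (u - cum[i]) / (cum[i + 1] - cum[i])
        # Interpolate between neighbouring observed values.
        draws.append(xs[i] + frac * (xs[i + 1] - xs[i]))
    return draws


if __name__ == "__main__":
    rng = random.Random(2014)
    # Hypothetical right-skewed, strictly positive variable (income-like).
    observed = [rng.lognormvariate(10.0, 1.0) for _ in range(200)]
    par = parametric_synthetic(observed, 1000, rng)
    non = nonparametric_synthetic(observed, 1000, rng)
    print("negative values, parametric:   ", sum(v < 0 for v in par))
    print("negative values, nonparametric:", sum(v < 0 for v in non))
```

In this sketch the nonparametric draws stay within the observed range and follow the skewed shape of the data, whereas the normal baseline is free to produce values outside that range. Interpolating between order statistics is one simple way to avoid copying observed values directly; it is not claimed to provide any particular level of disclosure protection.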

Author information

Authors and Affiliations

Institute for Employment Research, Nuremberg, 90478, Germany
Joseph W. Sakshaug
University of Michigan, Ann Arbor, MI, 48104, USA
Trivellore E. Raghunathan

Editor information

Editors and Affiliations

Department of Computer Engineering and Mathematics, UNESCO Chair in Data Privacy, Universitat Rovira i Virgili, Av. Països Catalans 26, 43007, Tarragona, Catalonia
Josep Domingo-Ferrer

About this paper

Cite this paper

Sakshaug, J.W., Raghunathan, T.E. (2014). Nonparametric Generation of Synthetic Data for Small Geographic Areas. In: Domingo-Ferrer, J. (eds) Privacy in Statistical Databases. PSD 2014. Lecture Notes in Computer Science, vol 8744. Springer, Cham. https://doi.org/10.1007/978-3-319-11257-2_17

DOI: https://doi.org/10.1007/978-3-319-11257-2_17
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-11256-5
Online ISBN: 978-3-319-11257-2
eBook Packages: Computer Science, Computer Science (R0)
