Abstract
Computing and releasing statistics for small geographic areas is a common task for many statistical agencies, but releasing public-use microdata for these areas is much less common due to data confidentiality concerns. Accessing the restricted microdata is usually possible only within a research data center (RDC). This arrangement is inconvenient for many researchers, who must travel long distances and, in some cases, pay a sizeable data usage fee to access the nearest RDC. One alternative dissemination method that has been explored is the release of public-use synthetic data. In general, synthetic data consist of imputed values drawn from a predictive model based on the observed data. Data confidentiality is preserved because no actual data values are released. The imputed values are typically drawn from a standard parametric distribution, but key variables of interest often do not follow strict parametric forms. In this paper, we apply a nonparametric method for generating synthetic data for continuous variables collected from small geographic areas. The method is evaluated using data from the 2005–2007 American Community Survey. The analytic validity of the synthetic data is assessed by comparing parametric (baseline) and nonparametric inferences obtained from the synthetic data with those obtained from the observed data.
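To make the contrast between parametric and nonparametric synthesis concrete, the following is a minimal sketch and not the method used in the paper. It assumes a hypothetical skewed, strictly positive variable (simulated here as log-normal) and compares two illustrative synthesizers: draws from a fitted normal distribution, and draws obtained by inverting a linearly interpolated empirical CDF. The function names and the simulated data are assumptions made for illustration only; the sketch ignores parameter uncertainty, small-area structure, and any formal assessment of disclosure risk.

```python
import random
import statistics


def synthesize_parametric(observed, n, rng):
    """Draw n synthetic values from a normal distribution fitted to the data."""
    mu = statistics.fmean(observed)
    sigma = statistics.stdev(observed)
    return [rng.gauss(mu, sigma) for _ in range(n)]


def synthesize_nonparametric(observed, n, rng):
    """Draw n synthetic values by inverting a linearly interpolated empirical CDF."""
    xs = sorted(observed)
    last = len(xs) - 1
    out = []
    for _ in range(n):
        pos = rng.random() * last        # fractional position in the sorted sample
        lo = int(pos)                    # index of the order statistic just below
        hi = min(lo + 1, last)           # index of the order statistic just above
        frac = pos - lo
        out.append(xs[lo] + frac * (xs[hi] - xs[lo]))
    return out


rng = random.Random(0)

# Hypothetical skewed, strictly positive variable (an assumption, not ACS data).
observed = [rng.lognormvariate(10, 1) for _ in range(2000)]

parametric = synthesize_parametric(observed, 2000, rng)
nonparametric = synthesize_nonparametric(observed, 2000, rng)

# The normal model ignores the skew and can produce impossible negative values;
# the interpolated empirical CDF stays within the observed range.
print("observed      min/median:", min(observed), statistics.median(observed))
print("parametric    min/median:", min(parametric), statistics.median(parametric))
print("nonparametric min/median:", min(nonparametric), statistics.median(nonparametric))
```

In this sketch the normal model places a noticeable share of its mass below zero, whereas the interpolated draws follow the shape of the observed distribution. That difference is the kind of analytic-validity gap the abstract refers to when it notes that key variables often do not follow strict parametric forms.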
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Sakshaug, J.W., Raghunathan, T.E. (2014). Nonparametric Generation of Synthetic Data for Small Geographic Areas. In: Domingo-Ferrer, J. (eds) Privacy in Statistical Databases. PSD 2014. Lecture Notes in Computer Science, vol 8744. Springer, Cham. https://doi.org/10.1007/978-3-319-11257-2_17
DOI: https://doi.org/10.1007/978-3-319-11257-2_17
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-11256-5
Online ISBN: 978-3-319-11257-2
eBook Packages: Computer Science (R0)