Nonparametric Generation of Synthetic Data for Small Geographic Areas

  • Conference paper
Privacy in Statistical Databases (PSD 2014)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8744)

Abstract

Computing and releasing statistics for small geographic areas is a common task for many statistical agencies, but releasing public-use microdata for these areas is much less common due to data confidentiality concerns. Accessing the restricted microdata is usually only possible within a research data center (RDC). This arrangement is inconvenient for many researchers who must travel large distances and, in some cases, pay a sizeable data usage fee to access the nearest RDC. An alternative data dissemination method that has been explored is to release public-use synthetic data. In general, synthetic data consists of imputed values drawn from a predictive model based on the observed data. Data confidentiality is preserved because no actual data values are released. The imputed values are typically drawn from a standard, parametric distribution, but often key variables of interest do not follow strict parametric forms. In this paper, we apply a nonparametric method for generating synthetic data for continuous variables collected from small geographic areas. The method is evaluated using data from the 2005-2007 American Community Survey. The analytic validity of the synthetic data is assessed by comparing parametric (baseline) and nonparametric inferences obtained from the synthetic data with those obtained from the observed data.
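
The contrast the abstract draws between parametric and nonparametric imputation can be illustrated with a minimal sketch. This is not the authors' method, which the abstract does not spell out. It assumes a single continuous outcome `y`, a single predictor `x`, and a simple linear predictive model fitted with plug-in estimates (parameter uncertainty is ignored for brevity). The function names `fit_line`, `synthesize_parametric`, and `synthesize_residual_resampling` are hypothetical. The parametric version draws errors from a fitted normal distribution; the nonparametric stand-in resamples the observed residuals, so the synthetic values can inherit skewness that a normal error model would smooth away.

```python
import random
import statistics


def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b, residuals)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    a = my - b * mx
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    return a, b, resid


def synthesize_parametric(x, y, rng):
    """Baseline: fitted value plus a draw from a normal error distribution."""
    a, b, resid = fit_line(x, y)
    # Residual standard deviation with n - 2 degrees of freedom.
    sigma = (sum(r * r for r in resid) / (len(x) - 2)) ** 0.5
    return [a + b * xi + rng.gauss(0.0, sigma) for xi in x]


def synthesize_residual_resampling(x, y, rng):
    """Illustrative nonparametric variant: fitted value plus a resampled residual."""
    a, b, resid = fit_line(x, y)
    return [a + b * xi + rng.choice(resid) for xi in x]


if __name__ == "__main__":
    rng = random.Random(0)
    # Made-up observed data with right-skewed errors (an assumption for the demo).
    x = [float(i) for i in range(1, 51)]
    y = [1.0 + 2.0 * xi + rng.expovariate(1.0) for xi in x]

    syn_param = synthesize_parametric(x, y, rng)
    syn_nonpar = synthesize_residual_resampling(x, y, rng)
    print("observed mean:      ", round(statistics.fmean(y), 3))
    print("parametric mean:    ", round(statistics.fmean(syn_param), 3))
    print("nonparametric mean: ", round(statistics.fmean(syn_nonpar), 3))
```

In practice, analytic validity would be checked as the abstract describes: by running the same analysis on the synthetic and the observed data and comparing the resulting inferences.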

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Sakshaug, J.W., Raghunathan, T.E. (2014). Nonparametric Generation of Synthetic Data for Small Geographic Areas. In: Domingo-Ferrer, J. (eds) Privacy in Statistical Databases. PSD 2014. Lecture Notes in Computer Science, vol 8744. Springer, Cham. https://doi.org/10.1007/978-3-319-11257-2_17

  • DOI: https://doi.org/10.1007/978-3-319-11257-2_17

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-11256-5

  • Online ISBN: 978-3-319-11257-2

  • eBook Packages: Computer Science, Computer Science (R0)
