Imputed data; Multiple imputation; Simulated data
Publication of synthetic – i.e., simulated – data is an alternative to masking for statistical disclosure control of microdata. The idea is to randomly generate data with the constraint that certain statistics or internal relationships of the original dataset should be preserved.
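As a rough illustration of generating data under such a constraint, the sketch below produces synthetic records for two numeric attributes so that their sample means and sample covariance match those of the original data. It is a simple moment-matching scheme based on a Gaussian model, which is an assumption made here for brevity and is not the multiple-imputation procedure itself. The function names (`mean_cov`, `cholesky2`, `synthesize`) and the toy records are hypothetical.

```python
import math
import random


def mean_cov(rows):
    """Return the mean vector and sample covariance (sxx, sxy, syy) of two-column data."""
    n = len(rows)
    mx = sum(r[0] for r in rows) / n
    my = sum(r[1] for r in rows) / n
    sxx = sum((r[0] - mx) ** 2 for r in rows) / (n - 1)
    syy = sum((r[1] - my) ** 2 for r in rows) / (n - 1)
    sxy = sum((r[0] - mx) * (r[1] - my) for r in rows) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def cholesky2(cov):
    """Lower-triangular Cholesky factor (l11, l21, l22) of a 2x2 covariance matrix."""
    sxx, sxy, syy = cov
    l11 = math.sqrt(sxx)
    l21 = sxy / l11
    l22 = math.sqrt(syy - l21 ** 2)
    return l11, l21, l22


def synthesize(original, n_synth, seed=0):
    """Generate n_synth synthetic records whose sample mean and covariance
    match those of `original` (up to floating-point error)."""
    rng = random.Random(seed)
    (mx, my), cov = mean_cov(original)
    l11, l21, l22 = cholesky2(cov)

    # Draw independent standard normal noise for both attributes.
    noise = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n_synth)]

    # Whiten the noise so its sample mean is zero and its sample covariance
    # is the identity, by inverting its own Cholesky factor.
    (zx, zy), zcov = mean_cov(noise)
    k11, k21, k22 = cholesky2(zcov)
    white = []
    for a, b in noise:
        u = (a - zx) / k11
        v = ((b - zy) - k21 * u) / k22
        white.append((u, v))

    # Re-colour the whitened noise with the original mean and covariance.
    return [(mx + l11 * u, my + l21 * u + l22 * v) for u, v in white]


# Toy original microdata (hypothetical values): (age, income in thousands).
original = [(23, 21.0), (35, 40.5), (41, 38.0), (52, 61.2), (29, 30.1), (60, 55.4)]
synthetic = synthesize(original, n_synth=10, seed=42)

print("original  mean/cov:", mean_cov(original))
print("synthetic mean/cov:", mean_cov(synthetic))
```

Only the first two moments are preserved in this sketch. Any other statistic or relationship that users need would have to be built into the generating model explicitly.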
The operation of the original proposal by Rubin [2] is outlined next. Consider an original microdata set X of n records drawn from a much larger population of N individuals, with background attributes A, non-confidential attributes B and confidential attributes C. Background attributes are observed and available for all N individuals in the population, whereas B and C are only available for the n records in the sample X. The first step is to construct from X a multiply-imputed population of N individuals. This population consists of the n records in X and M (the number of multiple imputations, typically between 3...
- 1. Hundepool A, Domingo-Ferrer J, Franconi L, Giessing S, Lenz R, Longhurst J, Nordholt ES, Seri G, De Wolf P-P. Handbook on statistical disclosure control. CENEX SDC Project, November 2006 (manuscript version 1.0). http://neon.vb.cbs.nl/CENEX/
- 2. Rubin DB. Discussion of statistical disclosure limitation. J Off Stat. 1993;9(2):461–8.