Abstract
Synthetic data generation has been proposed as a flexible alternative to more traditional statistical disclosure control (SDC) methods for limiting disclosure risk. It is functionally distinct from standard SDC methods in that it breaks the link between the data subjects and the data, such that reidentification is no longer meaningful. Orthodox measures of disclosure risk, which are based on reidentification, are therefore not applicable, and research into disclosure assessment measures specifically for synthetic data has been relatively limited. In this paper, we develop a method called Differential Correct Attribution Probability (DCAP). Using DCAP, we explore the effect of multiple imputation on the disclosure risk of synthetic data.
Notes
1. The level of confidence which is regarded as disclosive is a subjective judgement.
2. A synthetic dataset often contains multiple synthetic samples (m).
3. A statistically unique record is a record whose combination of characteristics is shared by no other record in the dataset.
4. Elliot (2014) presents a variant where the target is continuous, but we do not consider that here.
5. If the mean CAP score of the whole synthetic dataset is at the baseline, the target is effectively independent of the key, which may indicate that the data have a utility issue.
6. The synthetic datasets at the different imputation levels (m) are nested rather than independent.
7. We used Welch's t-test (DF = 5,131).
8. See, for example, Abowd and Vilhuber (2008) and Charest (2010) for uses of differential privacy in the synthesizing mechanism.
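The definition of a statistically unique record in the notes can be illustrated with a minimal Python sketch. The function name `statistical_uniques`, the variable names, and the toy records are all hypothetical and are not taken from the paper's data; the sketch only counts how often each combination of key values occurs and keeps the records whose combination occurs once.

```python
from collections import Counter


def statistical_uniques(records, key):
    """Return the records whose combination of key values occurs exactly once."""
    # Count every combination of key values present in the dataset.
    combos = Counter(tuple(r[v] for v in key) for r in records)
    # A record is statistically unique if no other record shares its combination.
    return [r for r in records if combos[tuple(r[v] for v in key)] == 1]


# Toy data: variable names and values are made up for illustration only.
records = [
    {"region": "North", "tenure": "own", "hh_size": 2},
    {"region": "North", "tenure": "own", "hh_size": 2},
    {"region": "North", "tenure": "rent", "hh_size": 1},
    {"region": "South", "tenure": "own", "hh_size": 4},
]

uniques = statistical_uniques(records, key=("region", "tenure", "hh_size"))
for r in uniques:
    print(r)
```

Uniqueness always depends on the chosen key: with a smaller key, fewer records tend to be unique.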
References
Caiola, G., Reiter, J.: Random forests for generating partially synthetic, categorical data. Trans. Data Priv. 3, 27–42 (2010)
Charest, A.: How can we analyze differentially-private synthetic datasets? J. Priv. Confid. 2(2), 21–33 (2010)
Chen, Y., Elliot, M., Sakshaug, J.: A genetic algorithm approach to synthetic data production. In: PrAISe 2016 Proceedings of the 1st International Workshop on AI for Privacy and Security, Hague, Netherlands. ACM (2016)
Dinur, I., Nissim, K.: Revealing information while preserving privacy. In: Principles of Database Systems, pp. 202–210 (2003)
Drechsler, J.: Using support vector machines for generating synthetic datasets. In: Domingo-Ferrer, J., Magkos, E. (eds.) PSD 2010. LNCS, vol. 6344, pp. 148–161. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-15838-4_14
Drechsler, J., Bender, S., Rässler, S.: Comparing fully and partially synthetic datasets for statistical disclosure control in the German IAB establishment panel. Trans. Data Priv. 1, 105–130 (2008)
Drechsler, J., Reiter, J.: An empirical evaluation of easily implemented, non-parametric methods for generating synthetic data. Comput. Stat. Data Anal. 55, 3232–3243 (2011)
Elliot, M.: Final Report on the Disclosure Risk Associated with the Synthetic Data Produced by the SYLLS Team. CMIST (2014). http://hummedia.manchester.ac.uk/institutes/cmist/archive-publications/reports/. Accessed 17 Mar 2017
Elliot, M., Mackey, E., O’Hara, K., Tudor, C.: The Anonymisation Decision-Making Framework, 1st edn. UKAN, Manchester (2016)
Elliot, M., Mackey, E., O’Shea, S., Tudor, C., Spicer, K.: End user licence to open government data? A simulated penetration attack on two social survey datasets. J. Off. Stat. 32(2), 329–348 (2016)
Elliot, M., Manning, A., Ford, R.: A computational algorithm for handling the special uniques problem. Int. J. Uncertain. Fuzziness Knowl.-Based Syst. 10(5), 493–509 (2002)
Fienberg, S., Makov, U.: Confidentiality, uniqueness, and disclosure limitation for categorical data. J. Off. Stat. 14(4), 385–397 (1998)
Kim, J., Winkler, W.: Masking microdata files. In: Proceedings of the Survey Research Methods Section, pp. 114–119. American Statistical Association (1995)
Little, R.: Statistical analysis of masked data. J. Off. Stat. 9(2), 407–426 (1993)
Machanavajjhala, A., Kifer, D., Gehrke, J., Venkitasubramaniam, M.: L-diversity: privacy beyond k-anonymity. ACM Trans. Knowl. Discov. Data 1(1), 1–52 (2007)
NatCen Social Research: British Social Attitudes Survey, 2014, [data collection], UK Data Service, 2nd edn (2016). Accessed 30 Apr 2018. SN: 7809. https://doi.org/10.5255/UKDA-SN-7809-2
Nowok, B., Raab, G., Dibben, C.: synthpop: Bespoke creation of synthetic data in R. J. Stat. Softw. 74(11), 1–26 (2016)
Office for National Statistics: Department for Environment, Food and Rural Affairs, Living Costs and Food Survey, 2014, [data collection], UK Data Service, 2nd edn (2016). Accessed 08 Mar 2018. SN: 7992. https://doi.org/10.5255/UKDA-SN-7992-3
Raab, G., Nowok, B., Dibben, C.: Practical data synthesis for large samples. J. Priv. Confid. 7(3), 67–97 (2016)
Reiter, J.: Using CART to generate partially synthetic, public use microdata. J. Off. Stat. 21, 441–462 (2005)
Reiter, J., Mitra, R.: Estimating risks of identification disclosure in partially synthetic data. J. Priv. Confid. 1(1), 99–110 (2009)
Reiter, J., Wang, Q., Zhang, B.: Bayesian estimation of disclosure risks for multiply imputed, synthetic data. J. Priv. Confid. 6(1), 17–33 (2014)
Rubin, D.B.: Statistical disclosure limitation. J. Off. Stat. 9(2), 461–468 (1993)
Skinner, C., Elliot, M.: A measure of disclosure risk for microdata. J. R. Stat. Soc. Ser. B 64(4), 855–867 (2002)
Smith, D., Elliot, M.: A measure of disclosure risk for tables of counts. Trans. Data Priv. 1(1), 34–52 (2008)
Winkler, W.: Re-identification methods for evaluating the confidentiality of analytically valid microdata. Research Report Series, 9 (2005)
Yancey, W., Winkler, W., Creecy, R.: Disclosure risk assessment in perturbative microdata protection. Research Report Series, 1 (2002)
Appendices
A An exploration into the CAP scores when smaller sized keys are used for the LCF
Key 6: GOR, Output area classifier, tenure, dwelling type, internet in hh, household size
Key 5: GOR, Output area classifier, tenure, dwelling type, internet in hh
Key 4: GOR, Output area classifier, tenure, dwelling type
Key 3: GOR, Output area classifier, tenure (Table 5).
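The keys above are nested: each smaller key drops the last variable of the next larger one. The following Python sketch shows how a mean CAP score could be compared across such nested keys. It rests on assumptions that are not stated in this section: the CAP for an original record is taken to be the proportion of synthetic records matching its key that also share its target value, records whose key has no synthetic match are scored 0, and the baseline is taken from the univariate distribution of the target. The function names, variable names, and toy records are hypothetical.

```python
from collections import Counter


def mean_cap(original, synthetic, key, target):
    """Mean CAP over original records (assumed definition, see lead-in)."""
    key_counts = Counter(tuple(r[v] for v in key) for r in synthetic)
    key_target_counts = Counter(
        (tuple(r[v] for v in key), r[target]) for r in synthetic
    )
    scores = []
    for r in original:
        k = tuple(r[v] for v in key)
        matches = key_counts[k]
        # Assumption: an original record with no key match in the synthetic data scores 0.
        scores.append(key_target_counts[(k, r[target])] / matches if matches else 0.0)
    return sum(scores) / len(scores)


def baseline_cap(original, target):
    """Mean chance of a correct attribution using only the target's univariate distribution."""
    counts = Counter(r[target] for r in original)
    n = len(original)
    return sum(counts[r[target]] / n for r in original) / n


# Toy data: made-up values, not drawn from the LCF.
original = [
    {"gor": "NW", "tenure": "own", "dwelling": "flat", "target": "employed"},
    {"gor": "NW", "tenure": "own", "dwelling": "house", "target": "retired"},
    {"gor": "SE", "tenure": "rent", "dwelling": "flat", "target": "employed"},
    {"gor": "SE", "tenure": "rent", "dwelling": "flat", "target": "employed"},
]
synthetic = [
    {"gor": "NW", "tenure": "own", "dwelling": "flat", "target": "employed"},
    {"gor": "NW", "tenure": "own", "dwelling": "house", "target": "employed"},
    {"gor": "SE", "tenure": "rent", "dwelling": "flat", "target": "employed"},
    {"gor": "SE", "tenure": "own", "dwelling": "flat", "target": "retired"},
]

full_key = ("gor", "tenure", "dwelling")
# Nested keys: each smaller key is a prefix of the full key.
caps = {
    size: mean_cap(original, synthetic, full_key[:size], "target")
    for size in (3, 2, 1)
}
base = baseline_cap(original, "target")

for size, cap in caps.items():
    print(f"key size {size}: mean CAP = {cap:.3f} (baseline {base:.3f})")
```

Scoring unmatched keys as 0 is one possible convention; treating them as undefined and excluding them from the mean is another, and the choice changes the resulting averages.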
B The average CAP scores for the BSA
(See Table 6).
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Taub, J., Elliot, M., Pampaka, M., Smith, D. (2018). Differential Correct Attribution Probability for Synthetic Data: An Exploration. In: Domingo-Ferrer, J., Montes, F. (eds) Privacy in Statistical Databases. PSD 2018. Lecture Notes in Computer Science, vol 11126. Springer, Cham. https://doi.org/10.1007/978-3-319-99771-1_9
DOI: https://doi.org/10.1007/978-3-319-99771-1_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-99770-4
Online ISBN: 978-3-319-99771-1
eBook Packages: Computer Science (R0)