Differential Correct Attribution Probability for Synthetic Data: An Exploration

  • Jennifer Taub
  • Mark Elliot
  • Maria Pampaka
  • Duncan Smith
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11126)


Synthetic data generation has been proposed as a flexible alternative to more traditional statistical disclosure control (SDC) methods for limiting disclosure risk. It is functionally distinct from standard SDC methods in that it breaks the link between the data subjects and the data, such that reidentification is no longer meaningful. Orthodox measures of disclosure risk, which are based on reidentification, are therefore not applicable. Research into disclosure risk measures designed specifically for synthetic data has been relatively limited. In this paper, we develop a method called Differential Correct Attribution Probability (DCAP). Using DCAP, we explore the effect of multiple imputation on the disclosure risk of synthetic data.


Keywords: Synthetic data, Disclosure risk, CART



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Jennifer Taub (1, corresponding author)
  • Mark Elliot (1)
  • Maria Pampaka (1)
  • Duncan Smith (1)
  1. The University of Manchester, Manchester, UK
