Assessing the privacy of randomized vector-valued queries to a database using the area under the receiver operating characteristic curve

  • Gregory J. Matthews
  • Ofer Harel


As the amount of data generated continues to increase, the privacy of individuals is a growing concern, and a large body of research has addressed methods of statistical disclosure control. Some of these methods release a randomized version of the data rather than the actual data. While methods of this type offer some protection, private information can still be disclosed, and quantifying the level of privacy they provide is often difficult. A method for assessing privacy with the receiver operating characteristic (ROC) curve, based on ideas related to differential privacy, was proposed previously, but it was demonstrated only for univariate randomized releases. Here, the ROC-based privacy measure is extended to the release of randomized vectors.
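To make the idea of an ROC-based privacy measure for vector releases concrete, the following is a minimal sketch under stated assumptions, not the procedure developed in the paper. It assumes two hypothetical neighbouring databases whose true vector-valued answers are `q0` and `q1`, independent Laplace noise added to each coordinate, and an adversary who scores each release with a log-likelihood ratio. The area under the ROC curve (AUC) of that score is estimated with the Mann-Whitney statistic: values near 0.5 suggest the two databases are hard to tell apart, and values near 1 suggest little protection. All function names and parameter values are illustrative.

```python
import random


def laplace(rng, scale):
    """Draw one Laplace(0, scale) variate as a difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def release(rng, true_answer, scale):
    """Randomized vector release: true answer plus independent Laplace noise."""
    return [t + laplace(rng, scale) for t in true_answer]


def log_lr(z, q0, q1, scale):
    """Log-likelihood ratio log f(z | q1) - log f(z | q0) under the noise model."""
    return sum((abs(zj - a) - abs(zj - b)) / scale for zj, a, b in zip(z, q0, q1))


def auc(neg, pos):
    """Mann-Whitney estimate of P(pos score > neg score); ties count one half."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def privacy_auc(q0, q1, scale, n=500, seed=0):
    """Estimate how well an adversary separates releases from q0 and q1."""
    rng = random.Random(seed)
    neg = [log_lr(release(rng, q0, scale), q0, q1, scale) for _ in range(n)]
    pos = [log_lr(release(rng, q1, scale), q0, q1, scale) for _ in range(n)]
    return auc(neg, pos)


if __name__ == "__main__":
    # Hypothetical answers: two column means that shift when one record changes.
    q0, q1 = [0.50, 0.30], [0.51, 0.31]
    for scale in (1.0, 0.01, 0.001):
        print(scale, round(privacy_auc(q0, q1, scale), 3))
```

Because the score combines all coordinates of the release into a single number, the same AUC summary applies whether the release is a scalar or a vector; only the noise model inside `log_lr` would change.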


Privacy, ROC curve, Statistical disclosure limitation



This project was partially supported by Award Number K01MH087219 from the National Institute of Mental Health. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institute of Mental Health or the National Institutes of Health.



Copyright information

© Springer Science+Business Media, LLC 2012

Authors and Affiliations

  1. Division of Biostatistics and Epidemiology, University of Massachusetts, Amherst, Amherst, USA
  2. Department of Statistics, University of Connecticut, Storrs, Storrs, USA
