Abstract
The rapid progress of human genome studies has led to a strong demand for aggregate human DNA data (e.g., allele frequencies and test statistics), whose public dissemination, however, has been impeded by privacy concerns. Prior research shows that it is possible to identify the presence of some participants in a study from such data and, in some cases, even to fully recover their DNA sequences. A critical issue, therefore, is how to evaluate this risk for individual data-sets and determine when they are safe to release. In this paper, we report research that makes the first attempt to address this issue. We first identify the space of the aggregate-data-release problem by examining common types of aggregate data and the typical threats they face. We then perform an in-depth study of different attack scenarios on different types of data, which sheds light on several fundamental questions in this problem domain. In particular, we find that attacks on aggregate data are difficult in general, as the adversary often does not have enough information and needs to solve NP-complete or NP-hard problems. On the other hand, we acknowledge that the attacks can succeed under some circumstances, particularly when the solution space of the problem is small. Based on this understanding, we propose a risk-scale system and a methodology to determine when to release an aggregate data-set and when not to. We also use real human-genome data to verify our findings.
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Zhou, X., Peng, B., Li, Y.F., Chen, Y., Tang, H., Wang, X. (2011). To Release or Not to Release: Evaluating Information Leaks in Aggregate Human-Genome Data. In: Atluri, V., Diaz, C. (eds) Computer Security – ESORICS 2011. ESORICS 2011. Lecture Notes in Computer Science, vol 6879. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23822-2_33
DOI: https://doi.org/10.1007/978-3-642-23822-2_33
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-23821-5
Online ISBN: 978-3-642-23822-2
eBook Packages: Computer Science (R0)