To Release or Not to Release: Evaluating Information Leaks in Aggregate Human-Genome Data

Zhou, Xiaoyong; Peng, Bo; Li, Yong Fuga; Chen, Yangyi; Tang, Haixu; Wang, XiaoFeng

doi:10.1007/978-3-642-23822-2_33

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 6879)

Included in the following conference series:

European Symposium on Research in Computer Security

Abstract

The rapid progress of human genome studies has created a strong demand for aggregate human DNA data (e.g., allele frequencies, test statistics, etc.), whose public dissemination, however, has been impeded by privacy concerns. Prior research shows that it is possible to identify the presence of some participants in a study from such data and, in some cases, even to fully recover their DNA sequences. A critical issue, therefore, is how to evaluate this risk on individual data-sets and determine when they are safe to release. In this paper, we report our research, which makes the first attempt to address this issue. We first identified the space of the aggregate-data-release problem by examining common types of aggregate data and the typical threats they face. Then, we performed an in-depth study of different attack scenarios on different types of data, which sheds light on several fundamental questions in this problem domain. In particular, we found that attacks on aggregate data are difficult in general, as the adversary often does not have enough information and needs to solve NP-complete or NP-hard problems. On the other hand, we acknowledge that the attacks can succeed under some circumstances, particularly when the solution space of the problem is small. Based on this understanding, we propose a risk-scale system and a methodology to determine when to release an aggregate data-set and when not to. We also used real human-genome data to verify our findings.
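The remark that attacks can succeed when the solution space is small can be made concrete with a toy brute-force count. The sketch below is only an illustration under stated assumptions, not the method of the paper: it assumes each participant is a row of 0/1 minor-allele indicators, that the released aggregates are per-SNP allele counts and (optionally) pairwise co-occurrence counts, and the function names are hypothetical.

```python
from itertools import combinations, product


def consistent_matrices(n, single_counts, pair_counts=None):
    """Enumerate n-row binary matrices that match the released aggregates.

    single_counts[j] is the number of rows carrying allele 1 at SNP j.
    pair_counts maps (i, j) to the number of rows carrying allele 1 at
    both SNP i and SNP j (assumed aggregate, optional).
    Each solution is a tuple of columns; a column is the tuple of row
    indices that carry allele 1.
    """
    column_choices = [list(combinations(range(n), c)) for c in single_counts]
    solutions = []
    for cols in product(*column_choices):
        sets = [set(col) for col in cols]
        if pair_counts is not None and not all(
            len(sets[i] & sets[j]) == k for (i, j), k in pair_counts.items()
        ):
            continue
        solutions.append(cols)
    return solutions


def as_row_multiset(n, cols):
    """Convert a solution to its sorted rows, ignoring participant order."""
    rows = [tuple(1 if r in col else 0 for col in cols) for r in range(n)]
    return tuple(sorted(rows))


# Toy data-set: 4 participants, 3 SNPs.
loose = consistent_matrices(4, [2, 2, 1])
tight = consistent_matrices(4, [2, 2, 1], {(0, 1): 2, (0, 2): 1, (1, 2): 1})
print(len(loose))  # 144 labeled matrices fit the single-SNP counts alone
print(len(tight))  # 12 remain once pairwise counts are also released
print(len({as_row_multiset(4, s) for s in tight}))  # 1 data-set up to row order
```

In this toy case the pairwise counts shrink the solution space to a single data-set (up to the order of participants), so the released aggregates pin down every row. The enumeration is exponential in the number of SNPs and participants, so it is usable only at toy sizes.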

Author information

Authors and Affiliations

Indiana University, Bloomington, USA
Xiaoyong Zhou, Bo Peng, Yong Fuga Li, Yangyi Chen, Haixu Tang & XiaoFeng Wang

Editor information

Editors and Affiliations

MSIS Department and CIMIC, Rutgers University, Washington Park 1, 07102, Newark, NJ, USA
Vijay Atluri
K.U. Leuven ESAT-COSIC, Kasteelpark Arenberg 10, 3001, Leuven-Heverlee, Belgium
Claudia Diaz

Cite this paper

Zhou, X., Peng, B., Li, Y.F., Chen, Y., Tang, H., Wang, X. (2011). To Release or Not to Release: Evaluating Information Leaks in Aggregate Human-Genome Data. In: Atluri, V., Diaz, C. (eds) Computer Security – ESORICS 2011. ESORICS 2011. Lecture Notes in Computer Science, vol 6879. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23822-2_33

DOI: https://doi.org/10.1007/978-3-642-23822-2_33
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-23821-5
Online ISBN: 978-3-642-23822-2
eBook Packages: Computer Science, Computer Science (R0)
