Exploring the unknown: assumptions about allelic architecture and strategies for susceptibility variant discovery
Identification of common-variant associations for many common disorders has been highly effective, but the loci detected so far typically explain only a small proportion of the genetic predisposition to disease. Extending explained genetic variance is one of the major near-term goals of human genetic research. Next-generation sequencing technologies offer great promise, but optimal strategies for their deployment remain uncertain, not least because we lack a clear view of the characteristics of the variants being sought. Here, I discuss what can and cannot be inferred about complex trait disease architecture from the information currently available and review the implications for future research strategies.
Keywords: Dark Matter; Causal Variant; Rare Causal Variant; Complex Trait Disease; Gene Discovery Effort
Abbreviations: CNV, copy number variant
Genome-wide association (GWA) analysis has provided the first effective strategy to allow a systematic dissection of the genetic basis of common, complex, multifactorial traits [1, 2]. Several hundred loci have been identified to stringent levels of significance [3]. Although for many of these we remain some distance from a complete enumeration of causal mechanisms, there have already been substantial advances in the understanding of disease: the role of autophagy in inflammatory bowel disease [4] and of cell adhesion in autism [5, 6], for instance.
However, for most common traits the proportion of the overall phenotypic variance explained remains small, limiting the extent to which prediction of individual disease risk is possible. There is growing speculation about the mechanisms that might account for the substantial proportion of trait heritability that remains to be characterized.
This speculation has repercussions well beyond recondite theoretical discussion about the genetic architecture of complex traits. With advances in technology (particularly next-generation sequencing) and growing enthusiasm for funding large-scale gene discovery efforts, hypotheses about the nature of this so-called 'genetic dark matter' have a direct bearing on research strategies. Recently, this debate has seemed increasingly polarized between those who feel a continued search for common susceptibility variants is of limited value, because all that remains to be found are variants of vanishingly small effect, and those who feel that, pending reductions in costs that will allow high-quality, whole-genome sequence data to be generated in adequately powered sample sizes, there is virtue in persisting with an approach of proven worth.
There is good reason to assume that this 'dark matter' is neither an illusion created by inflated estimates of heritability nor the consequence of marked non-additivity of effects [10, 11]. If so, then the sum total of genetic variance should largely be explicable in terms of the main effects of all the risk alleles of various types (single nucleotide polymorphisms, indels, copy number variants (CNVs) and inversions), allele frequencies (rare, low-frequency and common) and effect sizes. So far, the only parts of this 'space' explored systematically are those occupied by rare, penetrant alleles (principally through linkage analysis of monogenic phenotypes) and common, mostly low-effect alleles (accessible through GWA analysis). As we seek to make sensible decisions about the direction of future discovery efforts - in terms of the characteristics of the variants we are seeking and the technologies we should use to find them - we need to understand what the exploration of the 'known' genetic landscape can tell us about the parts that remain largely uncharted.
Contrasting views of the genetic landscape
One long-standing view is that complex trait susceptibility is predominantly a matter of common variants. Common variants collectively account for most individual variation in DNA sequence, and the same might be expected to be the case for phenotypic variation. If true, the results of GWA studies so far indicate that most of the as-yet-undiscovered variants must (in Europeans at least) have very small effects, because the high coverage and large sample sizes used will have left few, if any, large common-variant effects undiscovered. Evidence (for example, from large-scale meta-analyses) is, for many traits, consistent with the notion of a long 'polygenic tail' of small effects, but it remains unclear how much of overall heritability can be explained under this model. The idea that complex-trait susceptibility involves a very large number of variants of modest effect has led some to suggest that the value of all such discoveries is diminished, on the basis that one learns little about the biology of disease if too many genes are implicated. However, for many phenotypes, the overall salience of the loci of greatest effect emerging from GWA studies (the pathways implicated and the relationships to monogenic forms of the same traits) argues forcefully against such a nihilistic interpretation [9, 13, 14].
The contrasting viewpoint holds that common-trait susceptibility derives mostly from the action of rare or low-frequency variants [15, 16]. Although such variants account for less individual sequence variation than common variants, they may have a disproportionate effect on disease susceptibility. The more recent origin of low-frequency variants may allow alleles with more dramatic phenotypic effects to be represented in the population. Also, large-effect alleles may cause phenotypic disturbances that are not as easily buffered by compensatory changes during development as are well tolerated, small-effect, common-variant alleles. Recent evidence that large, rare CNVs are associated with behavioral and psychiatric disease phenotypes [5, 17, 18] supports this view. Some argue that such a rare variant architecture is precisely what one would expect for diseases causing low reproductive fitness, though this rationalization fails to explain the high yield of common-variant signals reported for other diseases, such as type 1 diabetes, that were, until recently, fatal during early life [19]. It has even been suggested that many of the common-variant associations discovered by recent GWA studies may turn out to be due to the concerted action of multiple low-frequency and rare causal variants. The NOD2 (CARD15) signal for Crohn's disease indicates that this is certainly possible. For many diseases, however, evidence that common-variant associations are consistent across multiple ethnic groups [21] represents a strong counter to such a model: one would expect the linkage disequilibrium patterns around recent rare and low-frequency causal variants to result in far more inter-ethnic heterogeneity than is actually observed.
The best of both worlds
Although both extreme positions have merit, the likelihood is that, for most diseases, the architecture of predisposition features causal variants that have a wide range of allele frequencies and effect sizes. For most complex traits, the absence of compelling signals from linkage studies conducted in families segregating multifactorial diseases imposes an upper bound to feasible effect sizes; even so, it is easy to show that a limited number of low-frequency susceptibility alleles of medium effect could go a long way to explaining missing heritability. For example, the effect of a low-frequency variant with a population minor allele frequency of 1% and a per-allele odds ratio of 3, when measured in terms of sibling relative risk (a commonly used measure of familial aggregation), exceeds that of the largest common-variant effect known for type 2 diabetes (around TCF7L2). Twenty such variants across the genome would account for most of the unexplained heritability for this condition. Such a constellation of variants could provide a respectable tool for individual disease prediction, and the variants discovered would (because of their relatively large effect size) be valuable resources for detailed molecular and physiological study. The extent to which variants with these characteristics are segregating in the population remains unknown, but this is an area in which the combination of next-generation sequencing technologies and large-scale association analysis provides a powerful stimulus to discovery. Early results of this approach (such as the identification of low-frequency variants within the IFIH1 gene that have a marked effect on type 1 diabetes susceptibility) are encouraging.
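The sibling relative risk comparison above can be sketched numerically. The snippet below is a minimal illustration, not the calculation used for this article. It assumes a single locus acting multiplicatively on risk (genotype relative risks 1, γ, γ²), treats the per-allele odds ratio as a relative risk (reasonable only when the disease is not too common), and uses the standard single-locus result λs = (1 + w/2)², where w = pq(γ − 1)²/(pγ + q)². The TCF7L2 risk-allele frequency (0.30) and odds ratio (1.4) are illustrative round figures assumed here, and the function name is hypothetical.

```python
def sibling_relative_risk(risk_allele_freq, allele_relative_risk):
    """Single-locus sibling relative risk under a multiplicative model.

    Assumes genotype relative risks of 1, gamma and gamma**2, and that the
    per-allele odds ratio can stand in for the relative risk gamma.
    """
    p = risk_allele_freq
    q = 1.0 - p
    gamma = allele_relative_risk
    # w is the ratio of the variance to the squared mean of the per-allele
    # risk multiplier (1 for the reference allele, gamma for the risk allele).
    w = p * q * (gamma - 1.0) ** 2 / (p * gamma + q) ** 2
    return (1.0 + w / 2.0) ** 2


# Low-frequency variant described in the text: MAF 1%, per-allele odds ratio 3.
low_freq = sibling_relative_risk(0.01, 3.0)

# Common variant with assumed, illustrative TCF7L2-like values (not from the text).
common = sibling_relative_risk(0.30, 1.4)

# Twenty independent low-frequency variants, assuming their effects on
# sibling relative risk combine multiplicatively across loci.
twenty_low_freq = low_freq ** 20

print(f"low-frequency variant: {low_freq:.4f}")
print(f"common variant:        {common:.4f}")
print(f"twenty low-frequency:  {twenty_low_freq:.3f}")
```

Under these assumptions the low-frequency variant gives a sibling relative risk of roughly 1.04, against roughly 1.03 for the common variant, and twenty such independent variants combine to a value a little above 2.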
Strategy and the 'lumpiness' of the genome
Ultimately, we can expect large-scale, high-depth, genome-wide sequencing to enable the systematic exploration of the entire allele-frequency, effect-size space and provide empirical resolution of many of these issues. However, there remain serious financial, logistical and analytical barriers to the implementation of this technology, and the number of such experiments that could be supported by the major funders is, for the time being, limited.
All this means that, for the next few years, the power of next-generation sequencing will need to be used carefully if a profusion of underpowered discovery efforts is to be avoided. Efforts targeted to specific genomic regions (around particular candidate genes or pathways or exons across the genome, for example) are attractive because high coverage of the selected areas in large sample sizes can be generated at reasonable cost. Whole-genome sequencing will, for now, be restricted to low-pass coverage across respectable sample sizes, or high-depth coverage in smaller, highly selected, phenotypically extreme sample sets.
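The trade-off between low-pass and high-depth designs can be illustrated with a simple sketch. It is a hedged illustration under stated assumptions, not a model of any real platform. It assumes that the number of reads carrying the variant allele at a heterozygous site is Poisson-distributed with mean equal to half the average depth, that a variant is called when at least two such reads are seen, and that cost scales with samples multiplied by depth. The budget figure, depths and function names are hypothetical.

```python
import math


def prob_variant_seen(depth, min_reads=2):
    """Chance that a heterozygous variant is covered by >= min_reads variant reads.

    Assumes variant-allele read counts follow Poisson(depth / 2).
    """
    lam = depth / 2.0
    p_fewer = sum(
        math.exp(-lam) * lam ** k / math.factorial(k) for k in range(min_reads)
    )
    return 1.0 - p_fewer


def designs_for_budget(total_depth_units, depths):
    """List (depth, samples affordable, per-sample detection probability).

    Assumes cost is proportional to samples * depth (a simplification).
    """
    return [
        (depth, total_depth_units // depth, prob_variant_seen(depth))
        for depth in depths
    ]


# Hypothetical budget of 12,000 sample-by-depth units, split two ways.
designs = designs_for_budget(12000, [4, 30])
for depth, n_samples, p_seen in designs:
    print(f"depth {depth}x: {n_samples} samples, P(variant seen) = {p_seen:.3f}")
```

With these made-up figures, the low-pass design affords several times more samples but misses a substantial fraction of heterozygous sites in any one individual, whereas the high-depth design detects almost all of them in far fewer people.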
Letting several well designed flowers bloom
With only limited empirical data to guide future locus-discovery efforts, extrapolation from the modest proportion of genetic variance so far explained is fraught with danger. The menu of possible research strategies is large, but each choice makes some implicit assumption about the characteristics of the variants being sought and the genomic architecture of the disease under consideration. Given uncertainties over the true state of nature, it is difficult to say which approaches will be most productive. This argues for open minds, a healthy disdain for orthodoxy, and careful exploration of the technological and methodological options. At the same time, it is important that the next wave of large-scale discovery efforts is designed so as to test assumptions about trait architecture and technological performance so that lessons of generic value to the field can be learned.
I thank the many colleagues around the world who contributed to the discussions that informed this article.
- 2. Hindorff LA, Sethupathy P, Junkins HA, Ramos EM, Mehta JP, Collins FS, Manolio TA: Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc Natl Acad Sci USA. 2009, 106: 9362-9367. doi:10.1073/pnas.0903103106
- 3. OPG: A Catalog of Published Genome-Wide Association Studies. http://www.genome.gov/gwastudies
- 4. Cadwell K, Liu JY, Brown SL, Miyoshi H, Loh J, Lennerz JK, Kishi C, Kc W, Carrero JA, Hunt S, Stone CD, Brunt EM, Xavier RJ, Sleckman BP, Li E, Mizushima N, Stappenbeck TS, Virgin HW: A key role for autophagy and the autophagy gene Atg16l1 in mouse and human intestinal Paneth cells. Nature. 2008, 456: 259-263. doi:10.1038/nature07416
- 5. Glessner JT, Wang K, Cai G, Korvatska O, Kim CE, Wood S, Zhang H, Estes A, Brune CW, Bradfield JP, Imielinski M, Frackelton EC, Reichert J, Crawford EL, Munson J, Sleiman PM, Chiavacci R, Annaiah K, Thomas K, Hou C, Glaberson W, Flory J, Otieno F, Garris M, Soorya L, Klei L, Piven J, Meyer KJ, Anagnostou E, Sakurai T, et al.: Autism genome-wide copy number variation reveals ubiquitin and neuronal genes. Nature. 2009, 459: 569-573. doi:10.1038/nature07953
- 6. Wang K, Zhang H, Ma D, Bucan M, Glessner JT, Abrahams BS, Salyakina D, Imielinski M, Bradfield JP, Sleiman PM, Kim CE, Hou C, Frackelton E, Chiavacci R, Takahashi N, Sakurai T, Rappaport E, Lajonchere CM, Munson J, Estes A, Korvatska O, Piven J, Sonnenblick LI, Alvarez Retuerto AI, Herman EI, Dong H, Hutman T, Sigman M, Ozonoff S, Klin A, et al.: Common genetic variants on 5p14.1 associate with autism spectrum disorders. Nature. 2009, 459: 528-533. doi:10.1038/nature07999
- 13. Weedon MN, Lango H, Lindgren CM, Wallace C, Evans DM, Mangino M, Freathy RM, Perry JR, Stevens S, Hall AS, Samani NJ, Shields B, Prokopenko I, Farrall M, Dominiczak A, Johnson T, Bergmann S, Beckmann JS, Vollenweider P, Waterworth DM, Mooser V, Palmer CN, Morris AD, Ouwehand WH, Zhao JH, Li S, Loos RJ, et al.: Genome-wide association analysis identifies 20 loci that influence adult height. Nat Genet. 2008, 40: 575-583. doi:10.1038/ng.121
- 14. Loos RJF, Lindgren CM, Li S, Wheeler E, Zhao JH, Prokopenko I, Inouye M, Freathy RM, Attwood AP, Beckmann JS, Berndt SI, The Prostate, Bergmann S, Bennett AJ, Bingham SA, Bochud M, Brown M, Cauchi S, Connell JM, Cooper C, Davey Smith G, Day I, Dina C, De S, Dermitzakis ET, Doney ASD, Elliott KS, Elliott P, Evans DM, Farooqi IS, et al.: Association studies involving over 90,000 people demonstrate that common variants near to MC4R influence fat mass, weight and risk of obesity. Nat Genet. 2008, 40: 768-775. doi:10.1038/ng.140
- 18. Stefansson H, Rujescu D, Cichon S, Pietiläinen OP, Ingason A, Steinberg S, Fossdal R, Sigurdsson E, Sigmundsson T, Buizer-Voskamp JE, Hansen T, Jakobsen KD, Muglia P, Francks C, Matthews PM, Gylfason A, Halldorsson BV, Gudbjartsson D, Thorgeirsson TE, Sigurdsson A, Jonasdottir A, Jonasdottir A, Bjornsson A, Mattiasdottir S, Blondal T, Haraldsson M, Magnusdottir BB, Giegling I, Möller HJ, Hartmann A, et al.: Large recurrent microdeletions associated with schizophrenia. Nature. 2008, 455: 232-236. doi:10.1038/nature07229
- 19. Barrett JC, Clayton DG, Concannon P, Akolkar B, Cooper JD, Erlich HA, Julier C, Morahan G, Nerup J, Nierras C, Plagnol V, Pociot F, Schuilenburg H, Smyth DJ, Stevens H, Todd JA, Walker NM, Rich SS: Genome-wide association study and meta-analysis find that over 40 loci affect risk of type 1 diabetes. Nat Genet. 2009, doi:10.1038/ng.381
- 21. Ng MC, Park KS, Oh B, Tam CH, Cho YM, Shin HD, Lam VK, Ma RC, So WY, Cho YS, Kim HL, Lee HK, Chan JC, Cho NH: Implication of genetic variants near TCF7L2, SLC30A8, HHEX, CDKAL1, CDKN2A/B, IGF2BP2, and FTO in type 2 diabetes and obesity in 6,719 Asians. Diabetes. 2008, 57: 2226-2233. doi:10.2337/db07-1583
- 24. Ghoussaini M, Song H, Koessler T, Al Olama AA, Kote-Jarai Z, Driver KE, Pooley KA, Ramus SJ, Kjaer SK, Hogdall E, DiCioccio RA, Whittemore AS, Gayther SA, Giles GG, Guy M, Edwards SM, Morrison J, Donovan JL, Hamdy FC, Dearnaley DP, Ardern-Jones AT, Hall AL, O'Brien LT, Gehr-Swain BN, Wilkinson RA, Brown PM, Hopper JL, Neal DE, Pharoah PD, Ponder BA, et al.: Multiple loci with different cancer specificities within the 8q24 gene desert. J Natl Cancer Inst. 2008, 100: 962-966. doi:10.1093/jnci/djn190