Genomic annotation of disease-associated variants reveals shared functional contexts
Variation in non-coding DNA, encompassing gene regulatory regions such as enhancers and promoters, contributes to risk for complex disorders, including type 2 diabetes. While genome-wide association studies have successfully identified hundreds of type 2 diabetes loci throughout the genome, the vast majority of these reside in non-coding DNA, which complicates the process of determining their functional significance and level of priority for further study. Here we review the methods used to experimentally annotate these non-coding variants, to nominate causal variants and to link them to diabetes pathophysiology. In recent years, chromatin profiling, massively parallel sequencing, high-throughput reporter assays and CRISPR gene editing technologies have rapidly become indispensable tools. Rather than treating individual variants in isolation, we discuss the importance of accounting for context, both genetic (such as flanking DNA sequence) and environmental (such as cellular state or environmental exposure). Incorporating these features shows promise in terms of revealing biologically convergent molecular signatures across distant and seemingly unrelated loci. Studying regulatory elements in the proper context will be crucial for interpreting the functional significance of disease-associated variants and applying the resulting knowledge to improve patient care.
Keywords: Chromatin; Diabetes; Epigenome; Gene expression; Genetics; Genome-wide association study; Human; Reporter assay; Review; Transcription
ATAC-seq: Assay for transposase-accessible chromatin sequencing
ChIP-seq: Chromatin immunoprecipitation sequencing
dCas9: Dead CRISPR-associated protein 9
DHS: DNase I hypersensitive site
eQTL: Expression quantitative trait loci
GATA1: GATA-binding factor 1
GFP: Green fluorescent protein
GWAS: Genome-wide association studies
MPRA: Massively parallel reporter assay
RFX: Regulatory factor X
sgRNA: Single guide RNA
STARR-seq: Self-transcribing active regulatory region sequencing
Here, we review genome-wide approaches to map regulatory elements, highlighting studies that have successfully applied these approaches to add gene regulatory context to type 2 diabetes GWAS variants. We discuss how integrating these maps may identify convergent mechanisms across distant loci, and shared biology underpinning type 2 diabetes pathogenesis. Finally, we propose future studies that could provide additional mechanistic information.
Tissue-specific maps of chromatin state identify relevant regulatory regions
In eukaryotes, DNA is wrapped around histone proteins to form nucleosomes; strings of nucleosomes then form a higher-order structure called chromatin in the cell nucleus. Beyond structural roles in packaging DNA, histones contribute to establishing and maintaining cell-type-specific gene expression programs by signalling through post-translational modifications and by regulating the accessibility of DNA to transcription factors.
Protruding ends of histone proteins can be post-translationally modified with various covalent marks (e.g. methylation, acetylation). Genome-wide maps for various histone marks have been generated for human cell lines and primary tissues using chromatin immunoprecipitation and sequencing (ChIP-seq) [11, 12], revealing that histones in different regions of the genome are decorated with distinct marks, reflecting the regulatory activity of those regions. For example, transcription start sites are marked by tri-methylation of histone H3 lysine 4 (H3K4me3), while enhancers are marked by both mono-methylation of H3K4 (H3K4me1) and acetylation of H3K27 (H3K27ac). Chromatin segmentation analyses integrate combinations of histone marks to annotate the genome into discrete ‘states’, each of which can be given a label that describes the underlying regulatory activity, such as promoter or enhancer (Fig. 1c). Parker, Stitzel and colleagues constructed chromatin state maps for pancreatic islets, and identified islet-specific stretch enhancers (SEs), which are long (≥3 kb) segments of the genome that are continuously decorated with enhancer-associated histone marks. SEs were found near critical pancreatic islet genes (e.g. INS1), and were enriched for GWAS variants associated with type 2 diabetes and related traits. Similar observations have been made for disease-relevant cell types in other disease models [16, 17, 18]. These studies collectively represent the first level of functional convergence, in which disease-relevant variants across the genome are enriched in a set of large enhancers active in specific tissues.
However, while chromatin state analysis is useful for narrowing down the regions of interest to a small subset of regulatory regions, the resolution of analysis is approximately 200 bp (a consequence of the fact that each nucleosome contains about 147 bp of DNA wrapped around the histones), which is still too coarse to pinpoint the underlying sequence motif(s) that could be mediating a genetic regulatory effect.
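The stretch enhancer definition above (contiguous enhancer-state segments spanning at least 3 kb) can be illustrated with a minimal sketch. It assumes a chromatin segmentation is already available as a list of (chromosome, start, end, state) tuples with half-open coordinates; the function name, the state label "enhancer" and the toy segments are hypothetical and are not taken from the studies cited here.

```python
from typing import List, Tuple

# (chromosome, start, end, state); coordinates are half-open, in bp
Segment = Tuple[str, int, int, str]


def call_stretch_enhancers(segments: List[Segment],
                           enhancer_states=frozenset({"enhancer"}),
                           min_length: int = 3000) -> List[Tuple[str, int, int]]:
    """Merge abutting enhancer-state segments and keep runs of >= min_length bp."""
    merged = []
    current = None  # [chrom, start, end] of the enhancer run being extended
    for chrom, start, end, state in sorted(segments):
        if state not in enhancer_states:
            continue
        if current and current[0] == chrom and start <= current[2]:
            # Segment abuts (or overlaps) the current run: extend it
            current[2] = max(current[2], end)
        else:
            if current:
                merged.append(tuple(current))
            current = [chrom, start, end]
    if current:
        merged.append(tuple(current))
    return [(c, s, e) for c, s, e in merged if e - s >= min_length]


# Toy segmentation (hypothetical): one 3.6 kb enhancer run and one 1 kb enhancer
toy_segments = [
    ("chr1", 0, 1000, "promoter"),
    ("chr1", 1000, 2400, "enhancer"),
    ("chr1", 2400, 4600, "enhancer"),
    ("chr1", 4600, 5000, "quiescent"),
    ("chr1", 5000, 6000, "enhancer"),
]
print(call_stretch_enhancers(toy_segments))  # [('chr1', 1000, 4600)]
```

Only the 3.6 kb run passes the length threshold; the isolated 1 kb enhancer is treated as a typical enhancer rather than a stretch enhancer.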
Accessible/open chromatin regions
Compared with analysis of histone marks, open chromatin analyses (especially ATAC-seq) have a higher resolution, permitting the identification of specific transcription factor motifs that may be systematically altered by risk alleles. These findings represent a higher-resolution form of convergence: not only are islet enhancers enriched for overlap with type 2 diabetes GWAS variants, but the specific RFX motifs within these larger islet enhancers are systematically disrupted by risk alleles. The next step, identifying the putative target gene of the regulatory motif, can be accomplished with expression quantitative trait loci (eQTL) studies, which test for population-level statistical associations between gene expression and genetic variation to assign SNPs to target genes (Fig. 2b). Several such studies have been conducted across diverse and diabetes-relevant human tissues, such as adipose tissue, islets, liver and skeletal muscle [23, 25, 26, 27, 28], and larger emerging studies promise to be valuable sources of eQTLs. Additional layers of regulatory annotation could reveal additional signatures of convergence.
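The core of an eQTL test, as described above, is a regression of expression on genotype. The sketch below is a deliberately simplified illustration: it assumes genotypes coded as allele dosage (0, 1 or 2 copies of the tested allele) and a single normalised expression value per individual, and it computes only the least-squares slope. The function name and the toy values are hypothetical; published analyses additionally adjust for covariates and correct for multiple testing.

```python
from typing import Sequence


def eqtl_slope(dosages: Sequence[float], expression: Sequence[float]) -> float:
    """Least-squares slope of expression on allele dosage (0, 1 or 2 copies)."""
    n = len(dosages)
    if n != len(expression) or n < 2:
        raise ValueError("need matched dosage and expression values for >= 2 individuals")
    mean_x = sum(dosages) / n
    mean_y = sum(expression) / n
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    if sxx == 0:
        raise ValueError("no genotype variation at this SNP")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosages, expression))
    return sxy / sxx


# Toy data (hypothetical): expression rises by about one unit per allele copy
toy_dosages = [0, 0, 1, 1, 2, 2]
toy_expression = [1.0, 1.2, 1.9, 2.1, 3.0, 3.2]
print(eqtl_slope(toy_dosages, toy_expression))
```

A positive slope means that each additional copy of the tested allele is associated with higher expression of the gene; a SNP-gene pair would be reported as an eQTL only if the slope differs significantly from zero.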
Activity-based functional genomics to nominate causal variants
While mapping histone marks and open chromatin regions can identify candidate regulatory regions, complementary approaches are needed to functionally validate the effect of individual genetic variants on enhancer activity. One such method is to narrow down putative causal variants to a subset through statistical genetic fine-mapping (reviewed in ). However, even with large cohorts and diverse ancestries, there frequently remain a number of plausible candidate regulatory variants. Because statistical genetic fine-mapping techniques and functional genomics may yield discordant results, it is important to compare the subsets of variants that emerge from these approaches.
Massively parallel reporter assay
One such study sought to functionally validate variants associated with erythrocyte traits. The authors tested a library of 75 GWAS lead SNPs and 2,681 others in high linkage disequilibrium (r² ≥ 0.8). Sequences with significant activity in their MPRA were more likely to originate from highly accessible regions of chromatin termed DNase I hypersensitive sites (DHSs) in erythroid cell types in vivo, and were also more likely to overlap with the binding sites for a critical erythroid transcription factor, GATA-binding factor 1 (GATA1), and its cofactor T cell acute lymphocytic leukaemia 1 (TAL1), as compared with sequences that did not exhibit activity in their reporter assay. There were 32 variants with a concordant effect on regulatory activity and chromatin accessibility, i.e. alleles that were more active in MPRA exhibited higher chromatin accessibility at their native loci. In another recent study, an MPRA library was tested in two different human immortalised cell lines (HepG2 and K562), identifying motifs and variants predictive of cell-type-specific activity. Thus, reporter assays at least partially reflect the native transcriptional regulatory environment, and employing these assays in distinct cell types provides the cellular and developmental contexts needed to understand the biological effects of genetic variants in different tissues and organs.
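The linkage disequilibrium threshold used to select proxy SNPs can be made concrete with the standard r² statistic, r² = D² / [p_A(1 − p_A) p_B(1 − p_B)], where D = p_AB − p_A p_B. The sketch below assumes phased haplotypes with alleles coded 0/1; the function name and the toy haplotypes are hypothetical and do not reproduce the data of the study described above.

```python
from typing import List, Tuple


def ld_r2(haplotypes: List[Tuple[int, int]]) -> float:
    """r^2 between two biallelic SNPs from phased haplotypes coded (0/1, 0/1)."""
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n          # frequency of allele 1 at SNP A
    p_b = sum(b for _, b in haplotypes) / n          # frequency of allele 1 at SNP B
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("at least one SNP is monomorphic in this sample")
    d = p_ab - p_a * p_b                             # disequilibrium coefficient D
    return d * d / denom


# Toy haplotypes (hypothetical)
perfect_ld = [(1, 1)] * 4 + [(0, 0)] * 6
partial_ld = [(1, 1)] * 4 + [(0, 0)] * 5 + [(1, 0)] * 1
for name, haps in (("perfect", perfect_ld), ("partial", partial_ld)):
    r2 = ld_r2(haps)
    print(name, round(r2, 3), "proxy" if r2 >= 0.8 else "excluded")
```

Under an r² ≥ 0.8 rule, the first pair would be carried forward as a proxy for the lead SNP, whereas the second would not.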
Self-transcribing active regulatory region sequencing (STARR-seq) is another promising MPRA approach, in which cloned candidate DNA fragments stimulate their own transcription, and the resulting enhancer activities are measured by RNA sequencing (RNA-seq). One recent application of STARR-seq examined putative enhancers overlapping GWAS loci associated with cancer risk, and observed that ~18% of fragments tested had activity above background. Active fragments were enriched for those bearing active histone marks at their endogenous loci, which suggests that the regulatory signatures needed to establish these chromatin states in vivo are at least partially encoded on these fragments, and may be disrupted by variants within them.
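Reporter read-outs of this kind are commonly summarised as the ratio of transcribed RNA to input DNA for each tested sequence, and allelic effects as the difference in that ratio between alleles. The sketch below is an assumption-laden illustration rather than the analysis used in any study cited here: it assumes barcode-level counts that are already depth-normalised, adds a pseudocount of 1, and omits the statistical testing across replicates that a real analysis would need. All names and counts are hypothetical.

```python
import math
from typing import Sequence, Tuple

Counts = Tuple[Sequence[float], Sequence[float]]  # (RNA counts, DNA counts) per barcode


def mpra_activity(rna_counts: Sequence[float], dna_counts: Sequence[float],
                  pseudocount: float = 1.0) -> float:
    """Mean log2(RNA/DNA) across the barcodes tagging one allele."""
    if len(rna_counts) != len(dna_counts) or not rna_counts:
        raise ValueError("need matched, non-empty RNA and DNA barcode counts")
    ratios = [math.log2((r + pseudocount) / (d + pseudocount))
              for r, d in zip(rna_counts, dna_counts)]
    return sum(ratios) / len(ratios)


def allelic_skew(ref: Counts, alt: Counts) -> float:
    """log2 activity of the alternate allele minus that of the reference allele."""
    return mpra_activity(*alt) - mpra_activity(*ref)


# Toy counts (hypothetical): two barcodes per allele
ref_counts = ([99, 199], [99, 99])
alt_counts = ([399, 399], [99, 99])
print(allelic_skew(ref_counts, alt_counts))
```

A positive skew indicates that the alternate allele drives more reporter transcription than the reference allele in the cell type tested.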
There are several limitations associated with current MPRA approaches. First, fragments may have different activities on episomal plasmids as compared with their endogenous loci, where they are packaged into chromatin and flanked by native sequence. Second, the activity of some of the enhancers may require the presence of multiple transcription factor binding motifs (reviewed in ), such that MPRAs may fail to detect activity for individual fragments in isolation. Third, some reporter plasmids use synthetic, heterologous elements such as a minimal promoter to test enhancer fragments. Prior studies have suggested there may be a requirement for promoters and enhancers to be of compatible ‘types’ [39, 40]; therefore, if candidate enhancers are incompatible with the type of promoters used in the MPRA, this may lead to false-negative or false-positive results. Finally, MPRAs are only intended to test whether individual cis-regulatory elements (or allelic variants) are sufficient to activate gene expression; a key complementary question, which we discuss next, is whether individual sites are necessary for enhancer activity.
In situ mutagenesis of regulatory elements using CRISPR/Cas9
Genome and epigenome engineering methods provide powerful new tools for studying gene dysregulation in type 2 diabetes. Gene or regulatory element knockout cells or animals can now be routinely derived using CRISPR/Cas9-directed mutagenesis (Fig. 3c). In addition, modulation of target gene expression has been demonstrated by fusing catalytically dead CRISPR-associated protein 9 (dCas9) to various transcriptional effectors (reviewed in ). These tools allow researchers to test enhancer function in the native chromosomal context (i.e. in situ) to circumvent some of the limitations associated with MPRAs, and to create animal or cell line models for selected alleles.
Pooled in situ approaches combine Cas9 nuclease with libraries of single guide RNAs (sgRNAs) that densely tile a target locus to map functional regulatory elements that are required for target gene expression. One of the first uses of this approach targeted DHSs surrounding the BCL11A gene, and discovered several critical regions that, upon deletion, resulted in reduced expression of the BCL11A gene and a concomitant increase in expression of fetal haemoglobin (which is normally repressed by BCL11A). Within these critical regions, the authors also identified a binding site for the key erythroid transcription factor GATA1. In this example, maps of erythroid-specific chromatin accessibility narrowed the search to a set of potentially important sites, but a systematic knockout screen was needed to define the regulatory grammar.
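Designing a tiling library starts with enumerating every position in the locus that Cas9 can target. The sketch below assumes the commonly used Streptococcus pyogenes Cas9, which requires an NGG protospacer-adjacent motif (PAM) immediately 3′ of a 20-nt protospacer; it scans both strands of a sequence and reports candidate protospacers. The function names are hypothetical, and real library design would further filter guides for off-target matches and predicted efficiency.

```python
from typing import List, Tuple

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


def find_spcas9_sites(seq: str, spacer_len: int = 20) -> List[Tuple[str, int, str]]:
    """List (strand, start, protospacer) for every protospacer followed by NGG.

    start is the 0-based forward-strand coordinate of the leftmost base of the
    full target site (protospacer plus 3-nt PAM).
    """
    seq = seq.upper()
    target_len = spacer_len + 3
    sites = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for i in range(len(s) - target_len + 1):
            # PAM is N-G-G: only the last two bases are constrained
            if s[i + spacer_len + 1:i + target_len] == "GG":
                start = i if strand == "+" else len(seq) - i - target_len
                sites.append((strand, start, s[i:i + spacer_len]))
    return sorted(sites, key=lambda site: site[1])


# Toy sequence (hypothetical): a single forward-strand site
print(find_spcas9_sites("A" * 20 + "TGG"))
```

The density of NGG motifs in a region therefore sets an upper bound on how finely a nuclease-based screen can tile it.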
Small indels (insertion or deletion of bases) induced by non-homologous end-joining may not be sufficient to disrupt enhancer function; therefore, more recent approaches use pairs of sgRNAs in each cell to delete larger DNA fragments [43, 44]. An alternative strategy to investigate the impact of regulatory variants using in situ mutagenesis involves the introduction of precise single-nucleotide mutations at the target locus, which can be achieved by providing an exogenous DNA template with desired mutations for homology-directed repair.
Current pooled in situ mutagenesis approaches require a functional phenotype that can be perturbed and selected for. One such example is the expression level of an endogenous gene (e.g. fetal haemoglobin in ref. ), or a tagged gene product (e.g. GFP). However, it is difficult to apply these schemes to broad collections of enhancers, since most of their target genes are unknown and, even if they were known, screening them would require laborious construction of many bespoke reporter cell lines. A recently developed approach, MOsaic Single-cell Analysis by Indexed CRISPR Sequencing (MOSAIC-seq), provides a promising general alternative for in situ enhancer mutagenesis. It combines the targeting of a dCas9-Krüppel associated box (KRAB) transcriptional repressor to candidate regulatory loci (i.e. CRISPR interference, or CRISPRi) with a read-out provided by single-cell RNA sequencing (scRNA-seq) to measure the resulting change in gene expression. Besides conducting a proof-of-principle experiment by targeting the β-globin locus in K562 cells, the authors demonstrated its utility by targeting constituent enhancers within 15 different super-enhancers to dissect the relative contribution of each constituent to target gene expression. Establishing appropriate cellular models and read-outs remains a challenge for applying these techniques to type 2 diabetes; nevertheless, they hold promise for finely mapping individual cis-regulatory sites and establishing their grammar.
Beyond the CRISPR modification and expression profiling experiments described above, proximity in the 3D chromatin environment of the nucleus provides another signal to pair cis-regulatory SNPs with their target genes. One such example in islets is a C3 region and ISL gene promoter. A potential gold standard for linking a candidate cis-regulatory SNP to its target gene would be the observation of consistent results across these approaches, from statistical association to experimental profiling and perturbation.
Cis-regulatory elements operate in a context-specific manner
Factors that modulate the nuclear trans environment (e.g. transcription factor abundance and localisation) greatly influence how cells execute cis-regulatory programs. Such factors could be intrinsic properties of different cell types established during development, or could be modulated by extrinsic stimuli, such as stress or hormone signalling. However, most functional genomic maps and reporter screens carried out to date have been obtained under steady-state (or basal) conditions. Therefore, integrating these maps and screens with developmental or treatment-induced dynamics represents an important direction for future studies. Here we present examples that illustrate how studying the impact of genetic variants under the proper context may be crucial for revealing functional convergence of disease-associated variants (Fig. 3a).
Environmental perturbation may be required to reveal the activity of some regulatory elements. For example, one study described ‘latent enhancers’ in mouse bone marrow-derived macrophages, which under basal conditions exhibit neither the histone marks typically associated with enhancers nor chromatin accessibility, but rapidly acquire these features in response to an inflammatory agent (lipopolysaccharide [LPS]) or inflammatory cytokines such as IL-4 and IFNγ. Similar observations were reported in human primary monocytes upon stimulation: genetic associations with gene expression varied depending on which immune stimuli were applied, such that treatment and genotype interacted to affect gene expression. In a particularly striking example, at one SNP associated with HIP1 expression, the direction of association reversed under treatment, with the selected allele negatively correlated with HIP1 expression in unstimulated cells but positively correlated after stimulation with LPS. Another example from the same study highlighted how dynamic gene expression kinetics required selecting the proper experimental time points: when cells were stimulated with LPS for 2 h, SNP rs2275888 was only associated with expression of one gene; however, the same SNP became associated with expression of five others after 24 h of stimulation.
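A genotype-by-treatment interaction of the kind described above can be illustrated by estimating the genotype effect separately in each condition and comparing the two. The sketch below is a simplified illustration with hypothetical names and toy values; it assumes the same individuals are measured under both conditions and reports only effect sizes, whereas a real analysis would fit a joint model with an interaction term, include covariates and assess significance.

```python
from typing import Dict, Sequence


def slope(x: Sequence[float], y: Sequence[float]) -> float:
    """Least-squares slope of y on x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need matched x and y values for >= 2 individuals")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("no genotype variation")
    return sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx


def response_eqtl(dosages: Sequence[float], expr_basal: Sequence[float],
                  expr_stimulated: Sequence[float]) -> Dict[str, object]:
    """Genotype effect in each condition, their difference, and whether the sign flips."""
    b_basal = slope(dosages, expr_basal)
    b_stim = slope(dosages, expr_stimulated)
    return {
        "basal": b_basal,
        "stimulated": b_stim,
        "interaction": b_stim - b_basal,
        "sign_flip": b_basal * b_stim < 0,
    }


# Toy data (hypothetical): the allele lowers expression at baseline
# but raises it after stimulation
toy_dosages = [0, 1, 2, 0, 1, 2]
toy_basal = [3.0, 2.0, 1.0, 3.2, 2.2, 1.2]
toy_stimulated = [1.0, 2.5, 4.0, 1.0, 2.5, 4.0]
print(response_eqtl(toy_dosages, toy_basal, toy_stimulated))
```

In this toy case the allele effect changes sign between conditions, which is the pattern reported for the HIP1-associated SNP; an analysis restricted to basal cells would have assigned the allele the opposite direction of effect.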
Genome-wide regulatory maps made under different treatments illustrate the widespread impact of environmental exposures on gene regulation. For example, analyses in the mouse liver showed that 24 h of fasting induced changes in chromatin accessibility and H3K27ac signals around thousands of DHSs located near fasting-induced genes. Combining RNA-seq and ChIP-seq analyses for key transcription factors whose motifs were enriched within fasting-induced DHSs, the authors identified the glucocorticoid receptor as a critical factor that makes fasting-induced enhancers accessible so that other factors such as cAMP responsive element binding protein 1 (CREB1) can bind and activate gluconeogenesis programs in the liver.
While the above studies clearly highlight the importance of studying gene regulation under diverse environmental conditions, the application of this emerging concept is still limited in the diabetes genomics literature; a few recent examples are described here. To identify glucose-responsive regulatory elements in pancreatic islet beta cells, the INS-1E rat pancreatic islet beta cell line was treated with glucose for 2 and 12 h and genome-wide changes in occupancy of MED1 protein (a subunit of the mediator complex that is involved in long-range interaction between enhancers and promoters), DHS and enhancer RNA transcription were measured. Clustering analysis based on temporal dynamics identified six different patterns, which correlated with temporal dynamics of nearby glucose-responsive genes in their RNA-seq data. Motif enrichment analyses within these glucose-responsive regulatory elements identified the motif for carbohydrate response element binding protein (ChREBP), a transcription factor that was previously implicated in glucose-induced gene regulation in pancreatic islet beta cells. Two recent studies focused on a type 2 diabetes variant (rs508419) that overlaps with a skeletal muscle-specific promoter region at the ANK1 locus [28, 50]. Human skeletal muscle eQTL data indicate that risk allele dosage is associated with higher ANK1 expression. Testing of the SNP region by luciferase reporter assays in the C2C12 mouse skeletal muscle myoblast cell line showed that the risk allele exhibited higher promoter activity than the non-risk counterpart. Interestingly, however, the researchers were able to detect impaired glucose uptake only when they treated cells with insulin; under basal conditions, increased ANK1 protein did not affect glucose uptake.
Of note, prior islet eQTL studies showed that the risk allele of the same variant (rs508419) is associated with reduced expression of the transcription factor NKX6-3 [23, 25], representing a tissue-dependent effect of regulatory variants, and potentially a more complicated genetic architecture at this locus that is yet to be revealed. These examples highlight the importance and the challenges of modelling environmental stimuli in functional genomic studies of diabetes.
GWAS continue to identify genomic loci contributing to type 2 diabetes risk; however, interpretation of these signals remains challenging because most GWAS variants occur outside protein-coding genes. In recent years, massively parallel sequencing, high-throughput reporter assays and CRISPR gene editing technologies have quickly become indispensable tools for researchers to further understand the molecular basis of complex human diseases such as type 2 diabetes. In this review, we have considered how these approaches may be employed to further resolve GWAS-detected loci to identify individual variants and their functional effects. While the data generated so far have provided deeper insight into the gene regulation of type 2 diabetes risk variants, our understanding of the tissue specificity of these variants, and their interplay with environmental stimuli remains limited. Since enhancers integrate and transduce environmental signals to execute gene expression programs, studying the impact of genetic variants under diverse conditions will be crucial for furthering our understanding of disease-associated variants. Moving forward, we believe that generating functional annotations in different environmental contexts and genetic perturbations will help partition swathes of GWAS signals into coherent, tissue-specific subsets to shed light on underlying pathophysiologies. In summary, by employing the approaches discussed, additional convergent functional contexts are likely to emerge, and this information would enable higher-resolution patient stratification and determination of individualised risk.
We thank members of the Kitzman and Parker laboratories, and associated collaborators, for invaluable discussions. We apologise in advance to authors whose work we were unable to cite or discuss because of space limitations.
All authors were responsible for drafting the article and revising it critically for important intellectual content. All authors approved the version to be published.
Work in the laboratories of SCJP is supported by the American Diabetes Association Pathway to Stop Diabetes Initiator Award 1-14-INI-07 (SCJP) and NIH/NIDDK grants R00 DK099240 and R01 DK117960 (SCJP).
Duality of interest
The authors declare that there is no duality of interest associated with this manuscript.
- 3. Mahajan A, Taliun D, Thurner M et al (2018) Fine-mapping of an expanded set of type 2 diabetes loci to single-variant resolution using high-density imputation and islet-specific epigenome maps. bioRxiv. https://doi.org/10.1101/245506
- 5. Type 2 Diabetes Knowledge Portal. Available from www.type2diabetesgenetics.org/gene/geneInfo/WFS1. Accessed 5 November 2018
- 22. Buenrostro JD, Giresi PG, Zaba LC, Chang HY, Greenleaf WJ (2013) Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and nucleosome position. Nat Methods 10(12):1213–1218. https://doi.org/10.1038/nmeth.2688
- 25. van de Bunt M, Manning Fox JE, Dai X et al (2015) Transcript expression data from human islets links regulatory signals from genome-wide association studies for type 2 diabetes and glycemic traits to their downstream effectors. PLoS Genet 11(12):e1005694. https://doi.org/10.1371/journal.pgen.1005694
- 43. Gasperini M, Findlay GM, McKenna A et al (2017) CRISPR/Cas9-mediated scanning for regulatory elements required for HPRT1 expression via thousands of large, programmed genomic deletions. Am J Hum Genet 101(2):192–205. https://doi.org/10.1016/j.ajhg.2017.06.010