A semi-supervised approach uncovers thousands of intragenic enhancers differentially activated in human cells
Transcriptional enhancers are generally known to regulate gene transcription from afar. Their activation involves a series of changes in chromatin marks and recruitment of protein factors. These enhancers may also occur inside genes, but how many may be active in human cells and their effects on the regulation of the host gene remain unclear.
We describe a novel semi-supervised method based on the relative enrichment of chromatin signals between two conditions to predict active enhancers. We applied this method to the tumoral K562 and the normal GM12878 cell lines to predict enhancers that are differentially active in one cell type. These predictions show enhancer-like properties according to positional distribution, correlation with gene expression and production of enhancer RNAs. Using this model, we predict 11,055 and 11,917 intragenic active enhancers in K562 and GM12878, respectively, and relate the differential activation of these enhancers to expression and splicing differences of the host genes.
We propose that the activation or silencing of intragenic transcriptional enhancers modulates the regulation of the host gene by means of a local change of the chromatin and the recruitment of enhancer-related factors that may interact with the RNA directly or through the interaction with RNA binding proteins. Predicted enhancers are available at http://regulatorygenomics.upf.edu/Projects/enhancers.html.
Keywords: Transcription Start Site; K562 Cell; Relative Enrichment; GM12878 Cell; Transcriptional Enhancer
RNA Polymerase II
Histone 3 Lysine 36 tri-methylation
Histone 3 Lysine 9 de-methylation
Histone 3 Lysine 27 tri-methylation
Histone 3 Lysine 27 acetylation
Histone 3 Lysine 4 mono/di/tri methylation
Histone 3 Lysine 79 tri-methylation
Histone 4 Lysine 20 mono-methylation
Histone 3 Lysine 9 acetylation
H2A Histone family member Z
Signal transducer and activator of transcription 1
Bayesian Information Criterion
Transcription start site
Chromatin interaction analysis by paired-end tag sequencing
Transcriptional enhancers are characterized by specific chromatin signatures, which differ depending on whether the enhancer is active or not [1, 2, 3, 4, 5]. Transcriptional enhancers have generally been identified by studying the genome-wide binding of the acetyl-transferase P300, a ubiquitous enhancer co-activator [1, 6, 7]. However, not all P300-bound enhancers show activity. Enhancers have also been characterized by their chromatin state [1, 2, 9, 10]; and, although the mono-methylation of histone 3 lysine 4 (H3K4me1) has been identified as an important signature for enhancers, this mark is not sufficient for enhancer activation [3, 11]. In fact, recent evidence shows that other marks like H3K27ac [1, 3, 4, 5] and H3K4me3 [5, 11] may be necessary for enhancer activity. Additionally, the recruitment of RNAPII and the concomitant production of enhancer-associated RNAs (eRNAs) have also been associated with active enhancers [3, 4, 5, 12, 13].
Although enhancers are typically defined to regulate gene transcription at a distance, about 50 % of potential enhancers predicted by high-throughput methods lie within protein-coding genes and some overlap exons [14, 15]. Intragenic enhancers can regulate the expression of the host gene or of a nearby gene, and have been proposed to act as alternative promoters. These results raise the question of how many intragenic enhancers may be active in a cell and whether upon their activation or silencing they may affect the processing of the host gene, possibly by means of local changes of the chromatin state. In this direction, there is evidence that some enhancers upstream of a reporter gene can affect splicing in vitro, and that intragenic enhancers bound by Argonaute-1 (AGO1) protein can affect the constitutive and alternative splicing of the host gene. In this work we describe a computational method to predict active enhancers based on chromatin signals. This method, which uses the relative enrichment of chromatin signals between cell lines to detect cell-specific active enhancers, predicts thousands of intragenic active enhancers. Additionally, we find evidence that the differential activation of enhancers inside genes affects the expression and splicing of the host genes. We propose that the activation or silencing of intragenic transcriptional enhancers can modulate the regulation of the host gene through a local change of the chromatin.
Annotated human enhancers with a mouse homologous enhancer that has been experimentally validated were downloaded from VISTA . The gene set was obtained from the 7th release of GENCODE, human assembly GRCh37 (hg19). ChIP-Seq and RNA-Seq datasets were downloaded from ENCODE  for K562 and GM12878 cells. The datasets used were: ChIP-Seq for CTCF, EZH2, P300, RNAPII, PU.1 (SPI1), STAT1, H3K9ac, H3K27ac, H3K4me1, H3K4me2, H3K4me3, H3K27me3, H3K36me3, H3K79me2, H4K20me1 and H2A.Z; one Control ChIP-Seq experiment and one input experiment; RNA-Seq for short (<200 nt) and long (>200 nt), polyA+ and polyA- RNAs from whole cell, nucleus and cytosol; and DNaseI data for the same cell lines. All datasets were downloaded in the form of mapped reads to the reference hg19 genome in BAM format.
Relative enrichment calculation
We considered sliding windows of 1500 nt along the entire genome, as suggested by the length distribution of experimentally validated enhancers [19, 21] (Additional file 1: Figure S1), with a slide shift of 500 nt, resulting in a total of 3,086,047 overlapping windows. In order to avoid mixing enhancer signal with genic and promoter signals, we discarded windows that were closer than 500 nt to an annotated TSS. The same approach was applied to intergenic (Additional file 1: Figure S2A) and intragenic (Additional file 1: Figure S2B) regions. Although there are more intergenic windows (~3·10^6 vs ~2.2·10^6), in both cases the number of windows with signal was similar (~1.5 million windows), and these were kept for further processing. The relative enrichment of chromatin signals between the two cell lines was calculated to predict active enhancers in K562 (relative increase of activation marks in K562 with respect to GM12878) and silent enhancers in K562 (relative decrease of activation marks in K562 with respect to GM12878, i.e. active in GM12878). Full quantile normalization for counts and GC content was applied using EDASeq. GC content in each region was calculated as the proportion of G + C in the 1500 nt window. After normalization, the z-score of the relative enrichment of each ChIP-Seq signal between K562 and GM12878 was calculated with Pyicoteo using the pyicoenrich function (https://bitbucket.org/regulatorygenomicsupf/pyicoteo). A vector of z-scores per region was obtained, which we refer to as attributes, consisting of the 17 enrichment z-scores for the ChIP-Seq and Input datasets. A positive z-score for a region indicates an increase in ChIP-Seq signal in K562 relative to GM12878 in that region, whereas a negative z-score indicates a decreased signal in K562 relative to GM12878; and z-scores close to zero indicate no significant differences between the cell lines.
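The windowing step described above can be sketched in a few lines of Python. This is a minimal illustration rather than the pipeline actually used: the function names, the half-open coordinate convention and the exact treatment of the 500 nt boundary are assumptions made for the example.

```python
from bisect import bisect_left

def sliding_windows(chrom_length, size=1500, step=500):
    """Yield half-open (start, end) windows that fit entirely in the chromosome."""
    for start in range(0, chrom_length - size + 1, step):
        yield start, start + size

def far_from_tss(window, sorted_tss, min_dist=500):
    """Return True if every TSS is at least min_dist nt away from the window.

    A TSS inside the window counts as distance 0. `sorted_tss` must be sorted.
    The boundary convention (distance < min_dist is "too close") is an assumption.
    """
    start, end = window
    lo = start - min_dist + 1   # smallest TSS position that is too close
    hi = end - 1 + min_dist     # smallest TSS position that is far enough downstream
    i = bisect_left(sorted_tss, lo)
    return i == len(sorted_tss) or sorted_tss[i] >= hi

def gc_fraction(sequence):
    """Proportion of G + C bases in a window sequence (case-insensitive)."""
    seq = sequence.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0

# Toy chromosome of 3000 nt with a single hypothetical TSS at position 2100.
tss = [2100]
kept = [w for w in sliding_windows(3000) if far_from_tss(w, tss)]
```

With a 500 nt shift, each position in the interior of a chromosome is covered by three consecutive 1500 nt windows, which is why overlapping predictions are merged at a later stage.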
For all datasets, except for the ChIP-Seq with non-specific antibody and for the RNA-Seq datasets, we used replicates. The relative enrichments were calculated with respect to the distribution described by the comparison between replicates. When replicates were not available, these were simulated by pooling the two conditions and dividing them using random sampling .
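The idea of scoring a between-condition difference against a replicate-derived background, with pseudo-replicates obtained by pooling and random splitting, can be sketched as follows. This is a deliberately simplified illustration and not the Pyicoteo implementation: it uses a single global mean and standard deviation for the background, a pseudocount of 1, and per-region read counts as input, all of which are assumptions of the example.

```python
import math
import random
import statistics

def log2_ratio(a, b, pseudocount=1.0):
    """Log2 fold change between two counts, stabilized with a pseudocount."""
    return math.log2((a + pseudocount) / (b + pseudocount))

def simulate_replicates(counts_a, counts_b, seed=0):
    """Pool the reads of two conditions per region and split them at random
    into two pseudo-replicates (each read goes to either half with p = 0.5)."""
    rng = random.Random(seed)
    rep1, rep2 = [], []
    for a, b in zip(counts_a, counts_b):
        pooled = a + b
        first = sum(1 for _ in range(pooled) if rng.random() < 0.5)
        rep1.append(first)
        rep2.append(pooled - first)
    return rep1, rep2

def enrichment_zscores(counts_a, counts_b, rep1, rep2):
    """Z-score of each region's log2 ratio relative to the replicate background."""
    background = [log2_ratio(x, y) for x, y in zip(rep1, rep2)]
    mu = statistics.mean(background)
    sd = statistics.stdev(background)
    return [(log2_ratio(a, b) - mu) / sd for a, b in zip(counts_a, counts_b)]

# Hypothetical read counts for five regions in each cell line.
k562 = [100, 50, 10, 80, 60]
gm12878 = [20, 50, 90, 75, 65]
rep1, rep2 = simulate_replicates(k562, gm12878, seed=1)
z = enrichment_zscores(k562, gm12878, rep1, rep2)
```

In this toy input the first region receives a positive z-score (more signal in K562) and the third a negative one (more signal in GM12878), matching the sign convention used for the attributes.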
Feature selection was performed using Boruta , which finds informative features by measuring the relevance of each attribute with respect to a reference attribute, also called correlation class, and in comparison with a random model extracted from the original dataset. Boruta uses the correlation class to evaluate the other features against it using Random Forests . We performed this analysis using as correlation class each of the individual marks (Additional file 1: Figure S3). In each case, the correlation was performed 10 times using normalized counts on a subset of 5000 intergenic windows, sampled randomly in each one of the 10 iterations. In order to avoid possible biases, in each analysis the correlation class was defined as the ChIP-Seq signal minus the level of Input DNA. The features used as negative controls were the ChIP-Seq sample for a non-specific antibody (Control sample) (Additional file 1: Figure S3D) and H4K20me1 (Additional file 1: Figure S3E), which has been associated to transcription repression and heterochromatin but not to enhancer activity [26, 27]. Running the selection algorithm with the H3K4me1 mark, the average Boruta score for the control increased notably, suggesting that the mark is present in many regions along the genome (Additional file 1: Figure S3C).
Fifteen thousand arbitrary intergenic windows of length 1500 bp were used as seed for the prediction model. Various different seed selections of the same size did not change the results significantly. These 15,000 windows were clustered using Mclust . Mclust is based on finite normal mixture modeling and uses the Bayesian Information Criterion (BIC)  for model optimization. The BIC score plateaus at 3 clusters for most models (Additional file 1: Figure S4A). The seed windows corresponded to 552 active, 616 silent and 13,832 no-change windows. This indicates that there are mostly three main classes, two that correspond to active and silent enhancers, and a class composed of a gradient of multiple chromatin states, which show little or no relative change of chromatin activity. This is further supported by the uncertainty plot, which shows that regions classified with higher certainty are on the extreme values of the correlation (Additional file 1: Figure S4B). The final model used for clustering was the centroid type (labeled as VEV), which creates clusters with variable volume, equal shape, and variable orientation. This model was used to classify the genome-wide 1500 bp windows (Additional file 1: Figure S2) using the same clustering method Mclust to predict intergenic enhancers. Intragenic enhancers were calculated using the same seed of 15,000 intergenic windows as before. The clustering was performed in the same way as for intergenic enhancers. As controls we calculated 4 sets of randomized positions (intergenic/intragenic and active/silent putative predictions). These sets were calculated from the predicted enhancers by randomizing the positions, not closer than 500 nt to any gene, avoiding gaps, genic regions, and other random locations previously generated, and keeping the same length and the same number of regions (Additional file 1: Figure S2A). 
Random intragenic enhancers were generated similarly by placing the intragenic enhancers in a random location inside the same gene, avoiding regions of 1 kb around any internal TSSs and avoiding other random enhancers previously generated (Additional file 1: Figure S2B). All predicted intergenic and intragenic enhancers can be visualized in the UCSC genome browser through the link http://regulatorygenomics.upf.edu/Projects/enhancers.html.
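For reference, the clustering step described above rests on standard Gaussian mixture quantities, written here in the notation commonly used for Mclust. The symbols (n windows, G clusters, mixing proportions \pi_k, and so on) are introduced for this summary and do not appear in the original text.

```latex
% Mixture density over attribute vectors x_i (z-scores of the selected marks)
f(x_i) = \sum_{k=1}^{G} \pi_k \, \phi(x_i \mid \mu_k, \Sigma_k)

% VEV covariance structure: variable volume \lambda_k, equal shape A,
% variable orientation D_k
\Sigma_k = \lambda_k \, D_k \, A \, D_k^{\top}

% BIC in the Mclust convention (larger is better);
% \hat{L}: maximized likelihood, m: number of free parameters, n: number of windows
\mathrm{BIC} = 2 \ln \hat{L} - m \ln n

% Posterior probability that window i belongs to cluster k
\tau_{ik} = \frac{\pi_k \, \phi(x_i \mid \mu_k, \Sigma_k)}
                 {\sum_{j=1}^{G} \pi_j \, \phi(x_i \mid \mu_j, \Sigma_j)}
```

Under this reading, G = 3 corresponds to the active, silent and no-change classes, and \tau_{ik} is the posterior probability to which the 0.95 cutoff is later applied.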
Linking enhancers to genes
Enhancers were linked to genes by selecting the closest TSS on either direction and by using ChIA-PET data for RNAPII in K562 cells for two replicates from ENCODE . An enhancer was considered connected to a gene if there were at least 3 ChIA-PET pairs connecting both the predicted enhancer and the region of 1 kb around the TSS of the gene. Random enhancers used as controls were calculated as described above. For the association of enhancers to genes, only enhancers that were between 2 and 100 kb from a TSS were considered. Genes associated to cancer were obtained from the Cancer Gene Census (http://cancer.sanger.ac.uk/cancergenome/projects/census/) .
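The two linking criteria above can be illustrated with a short sketch. It is not the code used for the analysis: the use of the enhancer midpoint for distances, the reading of "1 kb around the TSS" as 500 nt on each side, the single-chromosome coordinates and all function names are assumptions made for the example.

```python
from bisect import bisect_left

def closest_tss(position, sorted_tss):
    """Return the TSS nearest to `position` in either direction.
    `sorted_tss` must be a non-empty sorted list."""
    i = bisect_left(sorted_tss, position)
    candidates = sorted_tss[max(i - 1, 0):i + 1]
    return min(candidates, key=lambda t: abs(t - position))

def link_by_distance(enhancer_mid, sorted_tss, min_d=2_000, max_d=100_000):
    """Link an enhancer to its closest TSS if it lies 2 to 100 kb away, else None."""
    tss = closest_tss(enhancer_mid, sorted_tss)
    return tss if min_d <= abs(tss - enhancer_mid) <= max_d else None

def overlaps(a, b):
    """True if two half-open intervals (start, end) share at least one base."""
    return a[0] < b[1] and b[0] < a[1]

def linked_by_chiapet(enhancer, tss, pet_pairs, flank=500, min_pairs=3):
    """True if at least `min_pairs` ChIA-PET pairs join the enhancer and the TSS region.
    `pet_pairs` is a list of (anchor1, anchor2) intervals."""
    tss_region = (tss - flank, tss + flank)
    n = sum(
        1 for left, right in pet_pairs
        if (overlaps(left, enhancer) and overlaps(right, tss_region))
        or (overlaps(right, enhancer) and overlaps(left, tss_region))
    )
    return n >= min_pairs

# Hypothetical coordinates on one chromosome.
tss_list = [10_000, 200_000]
pairs = [((14_800, 15_100), (9_900, 10_050))] * 3
```

For instance, an enhancer centred at 15,000 would be linked to the TSS at 10,000 (5 kb away), whereas one centred at 10,500 would be discarded as too close.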
Expression and splicing analysis
For every gene in GENCODE (v7) annotation , the most upstream TSS (TSS1) and all alternative TSSs (TSS2, TSS3, etc.) were considered. Each pair TSS1-TSS2, TSS2-TSS3, etc. was considered as an alternative transcription event. RNAPII relative enrichment levels were measured around each TSS using the same method as before. To control possible association with upstream enhancers, we discarded all alternative TSS events that had a predicted intergenic enhancer (active or silent) 100 kb upstream of the gene. We calculated the expression levels of the annotated transcript isoforms using cufflinks v2.1.1  with parameters --library-type fr-firststrand --no-effective-length-correction --min-frags-per-transfrag 5 and masking all rRNAs, tRNAs and mitochondrial sequences. The relative changes in transcript abundance were obtained using Cuffdiff with parameters --library-type fr-firststrand --min-reps-for-js-test 1, using the merged GTF file obtained from Cufflinks for GM12878 and K562, along with the bam files of GM12878 and K562 with replicates. This provided 3552 genes (6.68 %) with relative changes in expression between the two cell lines.
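The construction of alternative transcription events from consecutive TSSs can be sketched as below. The handling of strand (so that "most upstream" means the largest coordinate on the minus strand) and the function name are assumptions of this illustration.

```python
def alternative_tss_events(tss_positions, strand):
    """Order the TSSs of a gene from most upstream to most downstream and
    return consecutive pairs (TSS1-TSS2, TSS2-TSS3, ...)."""
    ordered = sorted(set(tss_positions), reverse=(strand == "-"))
    return list(zip(ordered, ordered[1:]))

# Hypothetical gene with three annotated TSSs.
plus_events = alternative_tss_events([100, 500, 300], "+")
minus_events = alternative_tss_events([100, 500, 300], "-")
```

A gene with a single annotated TSS yields no events under this definition.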
We defined the events to be regulated if they had |delta PSI| > 0.1 in at least one replicate comparison between cell lines. Using two pairings of the replicates, this gave rise to 339 and 293 events (148 in common) with the cytosolic samples, and 367 and 378 (210 in common) for the nuclear samples. Additionally, we defined a set of alternative events that do not change splicing by imposing |delta PSI| < 0.05 between the same replicate comparisons used before. This gave rise to 1722 and 1534 (1328 in common) for the cytosolic samples, and 1627 and 1497 (1278 in common) for the nuclear samples.
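The thresholds above amount to a three-way classification of each event in a given replicate comparison, followed by an intersection across pairings. The sketch below illustrates this. The label "intermediate" for events falling between the two cutoffs, the event identifiers and the PSI values are hypothetical.

```python
def classify_event(psi_a, psi_b, regulated_cutoff=0.1, unchanged_cutoff=0.05):
    """Classify one event from its PSI values in the two cell lines."""
    delta = abs(psi_a - psi_b)
    if delta > regulated_cutoff:
        return "regulated"
    if delta < unchanged_cutoff:
        return "unchanged"
    return "intermediate"

def regulated_set(psi_by_event):
    """Return the identifiers of events classified as regulated in one pairing.
    `psi_by_event` maps event id -> (PSI in K562, PSI in GM12878)."""
    return {ev for ev, (a, b) in psi_by_event.items()
            if classify_event(a, b) == "regulated"}

# Two hypothetical replicate pairings.
pairing_1 = {"ev1": (0.80, 0.30), "ev2": (0.50, 0.52), "ev3": (0.50, 0.58)}
pairing_2 = {"ev1": (0.75, 0.35), "ev2": (0.40, 0.60), "ev3": (0.51, 0.55)}
common = regulated_set(pairing_1) & regulated_set(pairing_2)
```

The "in common" counts reported above correspond to the size of such an intersection.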
Results and discussion
Modeling and prediction of active transcriptional enhancers
Interestingly, when P300 or H3K4me1 were used as a correlation feature, the signals H3K27ac and H3K4me3 did not appear as the most significantly associated (Additional file 1: Figure S3B and C). Additionally, P300 seemed to associate with the largest subset of features, which is consistent with experimental evidence showing that P300 associates generally with enhancers [1, 6]. However, enhancers with H3K4me1 and/or P300 occupancy are not always active [3, 11], since H3K4me1 precedes enhancer-binding factors and P300 may be present in poised and intermediate enhancer states. On the other hand, we did not find RNAPII and H3K36me3 to be strong predictors of enhancer activity (Fig. 1a and Additional file 1: Figure S3A), even though they have been previously detected on enhancers [12, 13]. Additionally, although we found a strong correlation of PU.1 (SPI1) with H3K27ac, it does not correlate with H3K4me3, hence it is likely that PU.1 associates with a subset of the putative enhancers. Based on these results, we decided to keep those features that scored consistently above the technical and biological controls in the feature selection analysis using H3K27ac and H3K4me3 as correlation classes, including these two marks. That is, we used as predictors of enhancer activity the following signals: P300, H3K27ac, H3K9ac, H3K4me1/me2/me3 and H2A.Z.
Clustering the genomic windows according to the relative enrichment of the selected features (Methods) resulted in three optimal classes (Additional file 1: Figure S4). We recovered one class characterized by enrichment in H3K4me3 and H3K27ac (Fig. 1b), which we considered to be enhancers that are active in K562 cells (silent in GM12878). We recovered a second class characterized by a depletion of these same marks in K562 (Fig. 1b), which we considered to be active enhancers in GM12878 (silent in K562). Finally, the third cluster showed small or no changes in most of the signals, indicating that these regions do not have any differential activity between the two cell lines. These regions do not necessarily represent enhancers and are labeled as no-change. These three groups (active, silent, no-change) define the three predictable classes of our computational model, two of which can be identified with enhancer classes: active and silent. The genome-wide classification analysis resulted in 66,079 windows predicted to be active in K562 (silent in GM12878) and 64,436 windows predicted to be active in GM12878 (silent in K562).
In-silico validation of active transcriptional enhancers
In order to evaluate the accuracy of our predictions, we first compared our predicted enhancer windows with the enhancer regions predicted in the same cell lines by ChromHMM. The majority of our enhancers predicted as active in K562 or GM12878 overlap with ChromHMM windows labeled as weak or strong enhancers in the same cells (Additional file 1: Figure S5A and B). On the other hand, when we compared active windows with ChromHMM labels in the other cell line, the majority correspond to ChromHMM silent windows (Additional file 1: Figure S5C and D), as expected. Furthermore, the overlap of our active enhancers with predicted ChromHMM enhancers increases with the posterior probability of our predictions (Additional file 1: Figure S5E and F). In contrast, when comparing the active enhancers in one cell line with the ChromHMM labels from the other cell line, we found no correlation with the posterior probability (Additional file 1: Figure S5G and H). Based on these comparisons, we kept predictions with a posterior probability of > 0.95, which resulted in 36,301 active windows in K562 and 37,859 active windows in GM12878. Overlapping windows were then clustered into 16,646 active enhancers in K562 and 16,328 active enhancers in GM12878, which distribute evenly along the genome (Additional file 1: Figure S6A). These enhancers have a mean length of 3053 bp and the majority of them (87.65 %) are shorter than 5 kb (Additional file 1: Figure S6B). There were also 273 (1.38 %) predictions longer than 10 kb, which may correspond to large-scale chromatin domains or to clusters of enhancers. We filtered out those predictions longer than 5 kb, resulting in 10,365 active enhancers and 9777 silenced enhancers, with mean lengths of 2704.6 and 2588 bp (median lengths of 2500 and 2000 bp), respectively. These average lengths are in agreement with previous analyses of enhancers from ChIP-Seq data of histone marks and protein factors [5, 11, 15].
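The post-processing described in this paragraph (posterior cutoff, merging of overlapping windows, and length filter) can be sketched as follows. This is an illustrative reconstruction under stated assumptions: windows are (start, end, posterior) tuples on a single chromosome with half-open coordinates, and regions of exactly 5 kb are retained.

```python
def merge_windows(windows, min_posterior=0.95, max_length=5000):
    """Keep confident windows, merge overlapping ones, drop long regions."""
    kept = sorted((s, e) for s, e, p in windows if p > min_posterior)
    merged = []
    for s, e in kept:
        if merged and s < merged[-1][1]:
            # Overlaps the previous region: extend it.
            merged[-1][1] = max(merged[-1][1], e)
        else:
            merged.append([s, e])
    return [(s, e) for s, e in merged if e - s <= max_length]

# Hypothetical windows: two that merge, one below the cutoff, one isolated,
# and a run of nine overlapping windows spanning 5.5 kb.
windows = [(0, 1500, 0.99), (500, 2000, 0.97), (1000, 2500, 0.50), (5000, 6500, 0.96)]
windows += [(s, s + 1500, 0.99) for s in range(10_000, 14_001, 500)]
enhancers = merge_windows(windows)
```

Because windows are 1500 nt long and shifted by 500 nt, merged regions in this scheme have lengths of 1500 nt plus a multiple of 500 nt, which is consistent with the reported median lengths of 2500 and 2000 bp.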
We additionally investigated whether enhancer-associated RNAs (eRNAs) are found in our predictions. Enhancer activity correlates with the production of polyA+ (Fig. 2c) and polyA- (Additional file 1: Figure S8A) long (>200 bp) nuclear RNAs, compared to silent enhancers. This relative enrichment is much larger than for the other RNA subclasses (Additional file 1: Figure S8B). Interestingly, there is also enrichment of cytosolic polyA+ RNAs (Additional file 1: Figure S8C), but not of cytosolic polyA- RNAs (Additional file 1: Figure S8D) or short RNAs (<200 bp) (Additional file 1: Figure S8E and F). Moreover, not all enhancers predicted as active appear to generate eRNAs: 26.4 and 32.1 % of the predicted active enhancers in K562 have a significant (left-tailed p-value < 0.01) increase of nuclear polyA+ and polyA-, respectively. In comparison, only 1.25 % of active enhancers have significant (left-tailed p-value < 0.01) increase for short nuclear RNAs. For cytosolic polyA+, 18.7 % of the predicted active enhancers in K562 have a significant (left-tailed p-value < 0.01) increase of eRNAs. In contrast, only 9.2 % of these active enhancers have a significant enrichment of short total RNAs and polyA- cytosolic RNAs, respectively.
Although enhancers can regulate genes from afar, they tend to be enriched upstream of genes (Visel et al. ). We therefore connected enhancers to genes by choosing for each enhancer the closest annotated transcription start site (TSS) in either direction. With this approximation, active intergenic enhancers show enrichment at distances close to TSSs compared to random regions and to silent enhancers (Fig. 2d). Using these enhancer-TSS pairs, we calculated the relative change in gene expression measured from RNA-Seq data (Methods). We observed that genes with activated enhancers at a distance between 2 and 10 kb show up-regulation, whereas genes with silenced enhancers in the same distance range show down-regulation (Fig. 2e). Moreover, this association is conserved when the distance range of the enhancers is extended to be between 10 and 100 kb from the closest gene (Additional file 1: Figure S9A). Further support for transcription activity in association to our predicted enhancers was found measuring the relative density of RNAPII around the TSS in genes close to predicted enhancers, which was found to correlate with enhancer activity (Additional file 1: Figure S9B).
We additionally searched for evidence of direct physical interactions for the enhancer-TSS pairs calculated above by using ChIA-PET data for RNAPII. Although only a small fraction of active enhancers have ChIA-PET links to TSS regions (1.6 %), there is enrichment over silent enhancers and randomized regions (Additional file 1: Figure S9C), indicating that predicted active enhancers tend to have more ChIA-PET links than silent enhancers and than expected by chance. Finally, we investigated whether enhancers active in K562 have any association with genes that have been involved in cancer. Using the cancer gene census, we found that enhancers predicted to be active in K562 are enriched for genes related to cancer, compared to random regions and to enhancers silent in K562 (active in GM12878) (Fig. 2f). Interestingly, oncogenes can be linked more frequently to active enhancers and suppressors can be linked more frequently to silent enhancers (Additional file 1: Figure S10). In summary, these analyses indicate that our predicted enhancers show properties of active enhancers. We therefore set out to predict intragenic enhancers using the same computational model.
Thousands of intragenic enhancers are differentially activated in human cells
Active enhancers regulating the expression of nearby genes have been observed in exons [14, 15] and about 50 % of enhancers predicted by high-throughput methods lie within protein-coding genes. Additionally, by comparing the overlap of validated VISTA elements with the annotation in Gencode.v7, we observe that there is no preference for intragenic or intergenic regions (Additional file 1: Figure S1). All this evidence indicates that intragenic enhancers represent an important regulatory component of the genome. However, it remains an open question how many intragenic enhancers may be active in a given cell. Accordingly, we decided to apply our predictive model to localize putative intragenic enhancers that are activated in K562 relative to GM12878, and vice versa.
In order to predict intragenic active enhancers, we considered 1.5 kb sliding windows inside genes, starting 500 bp downstream of the first TSS and eliminating all windows that overlap with a 1 kb region around every annotated alternative TSS (Additional file 1: Figure S2). This resulted in an initial set of 2,206,307 possible 1.5 kb windows, for which we used the same selected chromatin features as for the intergenic enhancers. Using a seed of 15,000 intergenic regions and the same clustering approach as before, we predicted 73,080 active and 92,225 silenced regions. As we did previously with intergenic enhancers, we compared our intragenic predictions with ChromHMM predictions, with similar results (Additional file 1: Figure S11). Accordingly, we only kept windows predicted with posterior probability > 0.95, resulting in 42,297 and 55,624 active intragenic enhancer windows in K562 and GM12878, respectively. After clustering overlapping windows, we obtained 17,791 active intragenic enhancers in K562 (relative to GM12878) and 21,108 active intragenic enhancers in GM12878 (relative to K562), falling inside a total of 5162 genes (10.11 % of all genes) and 5933 (11.61 %) genes, respectively. The mean length of these predictions is 3665 bp, with the majority (82.81 %) being shorter than 5 kb (Additional file 1: Figure S12). As before, we kept those shorter than 5 kb, resulting in 11,055 and 11,917 candidate active intragenic enhancers in K562 and GM12878, respectively.
Our predicted intragenic enhancers tend to occur in separate genes, with only 29.2 % of the genes hosting enhancers of both types. The majority of intragenic enhancers active in K562 (78.24 %) or active in GM12878 (80.61 %) fall in intronic regions, and 26.02 % in K562 (22.07 % in GM12878) overlap at least partially with an exon. However, comparing the proportion of exonic and intronic regions covered by enhancers with the actual proportions of these regions in genes, we find no preference for exons or introns (Additional file 2). Additionally, even though we observed a preference for intragenic enhancers to fall on the first intron (Additional file 1: Figure S13), this effect can be explained by the fact that first introns are on average longer in human (Additional file 2).
Intragenic enhancers affect the regulation of the host gene
We next decided to evaluate whether there is any association between the presence of enhancers and regulated events in genes. To this end, we compared only genes that have one or more of the 5319 calculated alternative splicing events (Methods) and separated these genes according to whether they have one or more regulated events (|delta PSI| > 0.1) or not (|delta PSI| < 0.05) between the two cell lines. We found that in all comparisons the proportion of genes with regulated events was higher for those genes that have active enhancers (either in K562 or GM12878) (Additional file 1: Table S4). Using nuclear RNA-Seq for the calculation of PSI values, the comparison was statistically significant (Fisher p-value < 0.05) for both replicates for genes with active enhancers in GM12878, whereas the same association for enhancers active in K562 was only significant for one of the replicate comparisons (Fisher p-value = 0.01) (Additional file 1: Table S4). Moreover, these associations remained significant when we considered only those genes that do not change expression between both cell lines (Additional file 1: Table S5). The regulated events in genes with active or silent intragenic enhancers present equal proportions of each pattern of PSI change, i.e. increase or decrease of PSI (Additional file 1: Figure S16), which is consistent with the observed dual effect that a chromatin change can have on splicing. Additionally, the direction of change of PSI does not correlate with the position, upstream or downstream, of the enhancer relative to the regulated exon (Additional file 1: Figure S17). Remarkably, the majority of the regulated events are located 5000 nt from an enhancer (Additional file 1: Figure S18). However, we did not find any significant difference with the distribution of distances of non-regulated events to nearby enhancers (Additional file 1: Figure S19).
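The association test used in this paragraph is a Fisher exact test on a 2x2 table of genes (with or without an active enhancer, with or without a regulated event). The text does not state which alternative hypothesis or software was used, so the sketch below is only an illustration: it computes a one-sided (enrichment) p-value from the hypergeometric distribution, and the function name and table layout are assumptions.

```python
from math import comb

def fisher_one_sided(a, b, c, d):
    """One-sided Fisher exact p-value for enrichment in cell `a`.

    Table layout (hypothetical labels):
                          regulated   not regulated
        with enhancer         a             b
        without enhancer      c             d
    Returns P(X >= a) under the hypergeometric null with fixed margins.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)
    upper = min(row1, col1)
    tail = sum(comb(row1, k) * comb(row2, col1 - k) for k in range(a, upper + 1))
    return tail / total

# Small worked example: (C(4,3)*C(4,1) + C(4,4)*C(4,0)) / C(8,4) = 17/70.
p = fisher_one_sided(3, 1, 1, 3)
```

A two-sided variant would instead sum the probabilities of all tables that are no more likely than the observed one.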
As an example, we show the case of a regulated exon in the microtubule-actin crosslinking factor 1 gene (MACF1) (Fig. 4c). We observe a cassette exon with increased inclusion (delta PSI = 0.72) in K562 cells. The regulated exon is flanked by two enhancers predicted to be active in K562, one of which shows binding of PU.1 in K562, but not in GM12878 (Fig. 4c). This, together with the rest of our findings, suggests that the binding of PU.1 to a nearby enhancer, possibly in combination with other factors, could control the inclusion of this exon in MACF1. MACF1 has been implicated in the Wnt signalling pathway  and the inclusion of a cassette exon in MACF1 was observed before to be associated to lung adenocarcinoma . This result suggests the interesting possibility that the binding of PU.1 to an enhancer inside the MACF1 gene may affect its splicing, thereby altering Wnt signaling and contributing to the oncogenic transformation associated to PU.1 . In conclusion, we have found a possible association between the activity of intragenic enhancers and the regulation of the pre-mRNA. In particular, we find evidence that the activation of intragenic enhancers, besides affecting the activity of internal TSSs, can also potentially influence the inclusion of nearby exons.
We have developed a novel semi-supervised method that exploits the relative enrichment and depletion of multiple signals from ChIP-Seq experiments to predict enhancers that are active in one cell line relative to another. Applying this method to ENCODE data we predicted a total of 21,420 enhancers that are active in K562 relative to GM12878 (silent in GM12878 cells) and 21,694 enhancers that are active in GM12878 relative to K562 (silent in K562), including intragenic and intergenic enhancers.
The number of active enhancers is cell type specific and very much dependent on the method used to detect them . Although activation of enhancers is generally associated to a number of histone modifications, only a small fraction of the many candidate enhancers previously identified using a variety of techniques may be active in a given cell. For instance, Heintzman et al. found 24,566 putative enhancers in K562 cells with approximately 20 % of them overlapping putative enhancers detected in HeLa cells . In contrast, ChromHMM  predicts more than 60,000 non-abutting genomic regions to be strong enhancers and about three times as many for weak enhancers. There are two main reasons for the discrepancies with our predicted number of active enhancers: the resolution of the genome segmentation is very different and we only predict enhancers that are active in one condition but not in the other. That is, we do not detect enhancers active or silent in both cell types. Nonetheless, we found a good agreement between the regions we predict as active or silent enhancers and the annotations from ChromHMM for the same cell lines.
Our predicted enhancers are H3K27ac dependent and are defined almost entirely by chromatin signals. The relevant predictive features confirm that active enhancers are characterized not only by the presence of H3K4me1, but also by the presence of H3K27ac, H3K4me3 and RNAPII [4, 5, 12, 13]. We also observed that active enhancers show an enrichment of the histone variant H2A.Z, which has been identified to demarcate regulatory regions . In contrast, CTCF and EZH2 and the histone marks H3K36me3 and H4K20me1 do not seem to play any prominent role in enhancer activation. H3K27me3 is the only feature that shows a pattern of depletion in active enhancers and enrichment in silent enhancers, but mainly in long enhancer-like regions (data not shown), which may be related to other regulatory mechanisms. We additionally found that predicted enhancer activity correlates strongly with production of long nuclear RNAs, rather than short ones, which can be polyA+ as well as polyA-. However, we observe that not all active enhancers produce eRNAs. Furthermore, although RNAPII and H3K36me3 have been detected on enhancers in relation to eRNA production [12, 13], we did not find them as strong predictors of enhancer activity.
When we applied the same predictive model to predict intragenic enhancers, we found a similar number of active intragenic enhancers as for intergenic ones. This result suggests that there exist in cells a considerable number of differentially activated intragenic enhancers, which may have a relevant contribution to the mechanisms of cell-specific gene regulation. Since active enhancers are characterized by a local modification of the chromatin state, we hypothesized that our predictions could be linked to relative differences between the same two cell lines in expression and splicing. We observed that intragenic transcriptional enhancers, upon activation or silencing, affect the activity of downstream alternative transcription start sites. Surprisingly, they can also affect the most upstream TSS. This generalizes previous findings indicating that intragenic enhancers can act as internal alternative promoters . We also found that intragenic enhancers, upon activation or silencing, associate to the differential inclusion of nearby exons. However, a considerable proportion of splicing changes occur in genes that change expression (51.1 % for genes containing differentially included exons in K562 and 52.9 % for differentially excluded exons in GM12878). This indicates that the main effect of the activation of enhancers may be related to the activation of alternative transcription in the gene and alternative splicing may be a byproduct of that. The observed changes may be mediated by the changes in the RNAPII elongation produced due to the chromatin change. However, active intragenic enhancers show enrichment in open chromatin marks (H3K4me3, H3K27ac) that have not been associated before to changes in RNAPII elongation.
On the other hand, we found a strong association of PU.1 (SPI1) with active enhancers in K562 cells and, in particular, a significant increase in PU.1 occupancy in 26.8 % of active enhancers. PU.1 has been shown before to be an essential co-factor for enhancer activity and is known to bind to H3K4me1 sites in macrophages and B cells in a cell-specific manner [55, 56]. Moreover, PU.1 has been observed to regulate alternative splicing from the promoter and can interact with the RNA binding proteins FUS (TLS) and NONO (p54nrb) [58, 59]. In fact, PU.1 has been proposed to bind RNA and to perform a function antagonistic to that of the RNA binding proteins TLS and NONO in the regulation of splicing [59, 60]. Along these lines, we found enrichment of regulated events in genes with enhancers, which suggests that PU.1 could be regulating the splicing of some of these genes through its binding to intragenic enhancers, possibly by interacting with RNA binding proteins. In support of this model, we find an enrichment of regulated events in genes with enhancers that are active or silent relative to the other cell line. We postulate that intragenic enhancers provide localized and cell-type specific mechanisms to link the chromatin state to RNA processing.
In summary, there is increasing evidence that changes in the chromatin state can affect the processing of the pre-mRNA [44, 45, 46, 47, 48, 61, 62, 63, 64, 65], and different models for this regulation have been proposed. From our analysis a picture emerges whereby localized chromatin changes inside genes can take place by means of the activation of intragenic transcriptional enhancers. We propose that the differential activation and silencing of transcriptional enhancers that fall within genes could explain the localized chromatin variations that have been observed before to affect the expression and splicing of genes, either through the modulation of RNAPII activity or through the recruitment of factors that can interfere with RNA processing, like PU.1.
The authors would like to thank E. Furlong, Y. Barash, B. Blencowe and U. Braunschweig for useful discussions. This work was supported by grants from Plan Nacional I + D (BIO2011-23920) and Consolider (CSD2009-00080) from MINECO (Spanish Government), and by the Sandra Ibarra Foundation for Cancer (FSI 2013). JGV and BS were supported by FPI grants from the MINECO (Spanish Government), BES-2009-018064 and BES-2012-052683, respectively.
- 3. Creyghton MP, Cheng AW, Welstead GG, Kooistra T, Carey BW, Steine EJ, Hanna J, et al. Histone H3K27ac separates active from poised enhancers and predicts developmental state. PNAS. 2010;21931–6. doi: 10.1073/pnas.1016071107.
- 15. Birnbaum RY, Clowney EJ, Agamy O, Kim MJ, Zhao J, Yamanaka T, Pappalardo Z, et al. Coding exons function as tissue-specific enhancers of nearby genes. Genome Res. 2012;1059–1068.
- 18. Alló M, Agirre E, Bessonov S, Bertucci P, Gómez Acuña L, Buggiano V, Bellora N, Singh B, Petrillo E, Blaustein M, Miñana B, Dujardin G, Pozzi B, Pelisch F, Bechara E, Agafonov DE, Srebrow A, Lührmann R, Valcárcel J, Eyras E, Kornblihtt AR. Argonaute-1 binds transcriptional enhancers and controls constitutive and alternative splicing in human cells. Proc Natl Acad Sci U S A. 2014;111(44):15622–9.
- 24. Kursa MB, Jankowski A, Rudnicki WR. Boruta–a system for feature selection. Fundamenta Informaticae. 2010;101(4):271–85.
- 25. Liaw A, Wiener M. Classification and regression by randomForest. R News. 2002;2(3):18–22.
- 28. Fraley C, Raftery A. Mclust version 3 for R: normal mixture modeling and model-based clustering. 2007.
- 31. Harrow J, Frankish A, Gonzalez JM, Tapanari E, Diekhans M, Kokocinski F, Aken BL, Barrell D, Zadissa A, Searle S, Barnes I, Bignell A, Boychenko V, Hunt T, Kay M, Mukherjee G, Rajan J, Despacio-Reyes G, Saunders G, Steward C, Harte R, Lin M, Howald C, Tanzer A, Derrien T, Chrast J, Walters N, Balasubramanian S, Pei B, Tress M, Rodriguez JM, Ezkurdia I, van Baren J, Brent M, Haussler D, Kellis M, Valencia A, Reymond A, Gerstein M, Guigó R, Hubbard TJ. GENCODE: the reference human genome annotation for The ENCODE Project. Genome Res. 2012;22(9):1760–74.
- 32. Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, Salzberg SL, Wold BJ, Pachter L. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol. 2010;28(5):511–5.
- 44. Sims 3rd RJ, Millhouse S, Chen CF, Lewis BA, Erdjument-Bromage H, Tempst P, Manley JL, Reinberg D. Recognition of trimethylated histone H3 lysine 4 facilitates the recruitment of transcription postinitiation factors and pre-mRNA splicing. Mol Cell. 2007;28(4):665–76.
- 54. Ghisletti S, Barozzi I, Mietton F, Polletti S, De Santa F, Venturini E, Gregory L, Lonie L, Chew A, Wei CL, Ragoussis J, Natoli G. Identification and characterization of enhancers controlling the inflammatory gene expression program in macrophages. Immunity. 2010;32(3):317–28.
- 55. Heinz S, Benner C, Spann N, Bertolino E, Lin YC, Laslo P, Cheng JX, Murre C, Singh H, Glass CK. Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. Mol Cell. 2010;38:576–89.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.