Combined single-cell profiling of expression and DNA methylation reveals splicing regulation and heterogeneity
Alternative splicing is a key regulatory mechanism in eukaryotic cells and increases the effective number of functionally distinct gene products. Using bulk RNA sequencing, splicing variation has been studied across human tissues and in genetically diverse populations. This has identified disease-relevant splicing events, as well as associations between splicing and genomic features, including sequence composition and conservation. However, variability in splicing between single cells from the same tissue or cell type and its determinants remains poorly understood.
We applied parallel DNA methylation and transcriptome sequencing to differentiating human induced pluripotent stem cells to characterize splicing variation (exon skipping) and its determinants. Our results show that variation in single-cell splicing can be accurately predicted based on local sequence composition and genomic features. We observe moderate but consistent contributions from local DNA methylation profiles to splicing variation across cells. A combined model that is built based on genomic features as well as DNA methylation information accurately predicts different splicing modes of individual cassette exons. These categories include the conventional inclusion and exclusion patterns, but also more subtle modes of cell-to-cell variation in splicing. Finally, we identified and characterized associations between DNA methylation and splicing changes during cell differentiation.
Our study yields new insights into alternative splicing at the single-cell level and reveals a previously underappreciated link between DNA methylation variation and splicing.
KeywordsSingle-cell analysis Alternative splicing DNA methylation Splicing prediction Cell differentiation Multi-omics
- ES cell
Embryonic stem cell
- iPS cell
Induced pluripotent stem cell
RNA splicing enables efficient gene encoding and contributes to gene expression variation by alternative exon usage . Alternative splicing is pervasive and affects more than 95% of human genes . Splicing is known to be regulated in a tissue-specific manner [3, 4], and alternative splicing events have been implicated in human diseases . Bulk RNA sequencing (RNA-seq) of human tissues and cell lines has been applied to identify and quantify different splicing events , where in particular exon skipping at cassette exons, the most prevalent form of alternative splicing , has received considerable attention.
Different factors have been linked to splicing of cassette exons, including sequence conservation  and genomic features such as the local sequence composition as well as the length of the exon and flanking introns [5, 8]. Although there is some evidence for a role of DNA methylation in splicing regulation, this relationship is not fully understood and alternative models have been proposed [9, 10, 11]. The transcriptional repressor CTCF has been shown to slow down RNA polymerase II (Pol II), resulting in increased exon inclusion rates. By inhibiting CTCF binding, DNA methylation can cause reduced exon inclusion rate . Alternatively, increased DNA methylation of the MeCP2 pathway has been associated with increased exon inclusion rates. MeCP2 recruits histone deacetylases in methylated contexts that wrap the DNA more tightly around the histones. This interplay between MeCP2 and DNA methylation slows down Pol II, thus leading to an increased exon inclusion rate . Finally, HP1, which serves as an adapter between DNA methylation and transcription factors, increases the exon inclusion rate if it is bound upstream of the alternative exon. Binding of HP1 to the alternative exon leads to increased exon skipping . These alternative mechanisms point to a complex regulation of splicing via an interplay between DNA sequence and DNA methylation, both in proximal as well as distal contexts of the alternative exon.
Technological advances in single-cell RNA-seq have enabled investigating splicing variation at a single-cell resolution [8, 12, 13]. We here leverage recent protocols for parallel sequencing of RNA and bisulfite-treated DNA from the same cell (single-cell methylation and transcriptome sequencing; scM&T-seq ) to study single-cell splicing while accounting for cell-specific DNA methylome profiles. We apply our approach to investigate the associations between single-cell splicing variation and DNA methylation at two states of human induced pluripotent stem (iPS) cell differentiation.
Single-cell splicing variation during endoderm differentiation
We applied parallel single-cell methylation and transcriptome sequencing (scM&T-seq) to differentiating induced pluripotent stem (iPS) cells from one cell line (joxm_1) of the Human Induced Pluripotent Stem Cell Initiative (HipSci) [15, 16]. We profiled 93 cells from 2 different cell types, namely cells in the iPS state (iPS) and cells following 3 days of differentiation towards definitive endoderm (endoderm). After quality control, this resulted in 84 and 57 cells, respectively (the “Methods” section), which were used for analysis. In each cell, we quantified cassette exon inclusion rates (the “Methods” section, Additional file 1: Table S1, Additional file 2: Table S2). We quantified splicing rates for between 1386 and 4917 cassette exons in each cell (minimum coverage of 5 reads), estimating splicing rates (PSI) as the fraction of reads that include the alternative exon versus the total number of reads at the cassette exon (the “Methods” section). Differences in sequencing depth and cell type explained most of the differences in the number of quantified splicing events between cells (Additional file 3: Figure S1, Additional file 1: Table S1, Additional file 2: Table S2). DNA methylation profiles were imputed using DeepCpG , yielding on average 23.1 M CpG sites in iPS and 21.6 M CpG sites in endoderm cells. We considered 6265 iPS and 3873 endoderm cassette exons that were detected in at least 10 cells for further analysis.
Methylation heterogeneity across cells is associated with splicing variability
Next, to identify locus-specific correlations between DNA methylation heterogeneity and variation in splicing across cells, we tested for associations between differences in imputed DNA methylation levels across cells and splicing rates (Spearman correlation; the “Methods” section).
For each cassette exon, we tested for associations between the splicing rate (PSI) and variation in DNA methylation in each of 7 sequence contexts: the upstream, alternative, and downstream exons, and the 5′ and 3′ end of the 2 introns (the “Methods” section, Fig. 1c). Genome-wide, this identified 424 cassette exons with a methylation-splicing associations in iPS cells (out of 5564 tested cassette exons, Q < 0.05, Additional file 3: Figure S2a, Additional file 4: Table S3) and 245 associations in endoderm cells (out of 2811 tested, Q < 0.05, Additional file 3: Figure S2a, Additional file 4: Table S3). The majority of these associations were observed in the upstream alternative exon (~ 75%), with approximately equal numbers of positive (increased DNA methylation is linked to increased alternative exon inclusion) and negative (increased DNA methylation is linked to decreased alternative exon inclusion) associations. In iPSC, 58% of correlations are positive, and 55% of the correlations are positive in endoderm cells. Most associations could be detected significantly in more than 1 context for a given exon with consistent effect directions (Additional file 3: Figure S2b, c). Similarly, we observed largely concordant associations across the 2 cell types in our data. Among the exons that are expressed in both iPS and endoderm (n = 3743), 77% of the associations identified in iPS were nominally replicated in endoderm cells (P < 0.05, with a consistent effect direction), and 89% of the associations identified in endoderm were also observed in iPS cells (P < 0.05, with a consistent effect direction). Genes with negative associations between DNA methylation in the 3 upstream regions and PSI were enriched for HOXA2 transcription factor binding sites (iPS—78/118 query genes linked to HOXA2, adjusted P = 6.02 × 10−4; endoderm—60/90 query genes linked to HOXA2, adjusted P = 9.03 × 10−3; enrichment based on g:Profiler ).
Prediction of splicing at the single-cell level
Next, we considered using an extended feature set in the linear models, using up to 826 DNA methylation features, including a k-mer alphabet that takes the methylation status of cytosines into account, as well as the DNA methylation rate and variance across CpG sites in each of 7 sequence contexts of a cassette exon (the “Methods” section). We considered 2 strategies to aggregate the methylation features, either (i) to capture patterns of methylation in individual cells (“genomic and cell methylation” features) or (ii) using averaged methylation features across all cells (“genomic and mean methylation” features; Additional file 5: Table S4, Fig. 1c). The inclusion of either type of methylation features into the model yielded an increased prediction accuracy, where larger gains in prediction accuracy were observed when accounting for DNA methylation information in individual cells (“genomic and cell methylation” versus “genomic and mean methylation”). Notably, the inclusion of DNA methylation features did not improve the accuracy of models for average splicing rates (“pseudo-bulk” models; Additional file 3: Figure S4). This observation in combination with the results from the association analysis between methylation and splicing indicates that DNA methylation can primarily explain the cell-to-cell variation in splicing at individual loci, whereas genomic features by design explain the variation across loci. These findings were consistent across iPS and endoderm cells, and we observed analogous benefits of accounting for DNA methylation when applying the same models to previous scMT-seq datasets from mouse embryonic stem (ES) cells  (the “Methods” section, Additional file 3: Figure S3 & S4).
Next, to assess the relevance of the considered features, we considered regression models based on individual features trained in individual cells. Consistent with previous bulk studies [5, 7], this identified features derived from the alternative exon and its neighboring contexts, namely the 3′ end of the upstream intron and the 5′ end of the downstream intron, as most informative (Additional file 6: Table S5). Within these contexts, sequence conservation of the alternative exon was the most relevant individual feature. Other relevant features included the k-mers CT, CTC, and CCT of the alternative exon (Fig. 2b), sequence patterns that show a close resemblance to CTCF-binding motifs. Although CTCF or CTCF-like motifs have previously been implicated splicing, these previous studies identified motifs upstream  or downstream  of the alternative exon as associated with increased splicing, whereas the k-mers in our model are located in the alternative exon and associated with decrease the inclusion rate [9, 21].
The relevance of the cell-specific features for splicing prediction as quantified by regression weights was markedly consistent across iPS and endoderm cells. This consistency extends to the mouse ES cell dataset, where again features of the alternative exon and sequence conservation scores were identified as the most relevant predictors for splicing (Additional file 6: Table S5, Additional file 3: Figure S5). Despite the overall consistency in feature relevance (r2 = 0.79, average correlation between weights across all iPS and endoderm cells), principal component analysis (PCA) applied to the feature relevance matrix across all cells identified subtle coordinated axes of variation of the feature relevance (Fig. 2c). The first two principal components (PC) clearly separate iPS from endoderm cells, differences that are primarily attributed to k-mers of the downstream intron (I2) that contain methylated and unmethylated cytosine bases (Fig. 2d, Additional file 7: Table S6). Consistent with this, a single-cell methylation model trained on endoderm cells yielded only moderate prediction accuracy in iPS cells (r2 = 0.52), highlighting the cell-type specificity of splicing models that account for DNA methylation information. This points towards a combination of differences in sequence composition, potentially transcription factor activity, and DNA methylation as the main determinants of cell-type specific splicing regulation.
Finally, we considered more complex regression models based on convolutional neural networks to predict single-cell splicing based on DNA sequence and an extended genomics alphabet including base-level DNA methylation information (deposited at kipoi , the “Methods” section). We observed only limited benefits when including DNA methylation information (Additional file 3: Supplementary Results and Figure S6). These results line up with the locus-specific DNA methylation and the linear regression results, supporting the hypothesis that global splicing information is primarily encoded by DNA sequence and conservation, and DNA methylation is linked to splicing in a locus-specific manner.
Prediction of splicing modes for individual exons
We found that a model based on DNA methylation alone did not yield accurate predictions although methylation contained some information for identifying underdispersed cassette exons (Fig. 3d, Additional file 3: Figure S8b). Given this, we investigated the distribution of DNA methylation patterns across splicing categories, observing distinct distributions of DNA methylation in the upstream exon of underdispersed cassette exons (Fig. 3e). This effect was consistent, although less pronounced, in other sequence contexts (decreasing from the upstream to the downstream exon, Additional file 3: Figure S9a-b).
We assessed the consistency of these results across iPS and endoderm cells, as well as in mouse ES cells. To do this, we trained the genomic model on endoderm cassette exons and assessed this model’s predictions on iPS-specific cassette exons, which resulted in a prediction accuracy that was similar to the within cell-type prediction performance (macro-AUC = 0.82, Additional file 3: Figure S10a). However, the inclusion of the DNA methylation features into the model resulted in a decline in the cross-prediction performance (macro-AUC = 0.54, Additional file 3: Figure S10b). As in the linear model cross-replication analysis, this finding emphasizes the importance of cell type-specific DNA methylation for accurately predicting splicing. Next, we observed that the performance for splicing category prediction in mouse ES cells was very similar to the performance in the endoderm and iPS cells (macro-AUC = 0.82, in the genomic and the genomic and methylation model). We observed the same distinct distributions of DNA methylation in the upstream exon of underdispersed cassette exons (Additional file 3: Figure S9c). However, the relationship between the DNA methylation levels and underdispersed cassette exons category could not be replicated in the mouse ES cells (Additional file 3: Figure S7b).
Splicing category switches across cell differentiation
After observing the category switches between the cell types, we set out to build a final set of logistic ridge regression models based on genomic and methylation features to predict category switching ability of cassette exons during differentiation (Fig. 4c for prediction performance, Additional file 10: Table S9). This model had limited power to predict category switches (AUCs 0.51–0.64), and DNA methylation did not significantly improve the prediction of any category although moderately higher predictions can be seen for the switching behavior of over- and underdispersed cassette exons.
Lastly, we assessed if DNA methylation changed within the cassette exons switching between the cell types. The DNA methylation levels of cassette exons that switched category only changed minimally (Additional file 3: Figure S11). However, we observed that DNA methylation of the alternative exon of switching cassette exons differed from non-switching cassette exons at the iPS state (Fig. 4b). DNA methylation of both switching included and switching excluded cassette exons was increased around C1 in comparison to their relevant non-switching counterparts. In the case of switching overdispersed cassette exons, we observed higher DNA methylation levels within and in the vicinity of the alternative exon.
Here, we present the first analysis of alternative splicing in single cells that considers both genomic and epigenetic factors. Our study focuses on the variation of splicing in cassette exons at two different states of cell differentiation. We show that splicing events do not strictly follow the previously suggested cell or gene models of splicing patterns, but instead, we find a substantial proportion of exons that are better described by an intermediate model (Fig. 1b).
We show that single-cell splicing of cassette exons is influenced by genomic features as previously assessed in bulk data, but also by DNA methylation differences. We observe that DNA methylation is related to splicing phenotypes, with the strongest link to single-cell splicing ratios. When assessing splicing variation in bulk populations (pseudo-bulk), most of the information encoded in DNA methylation is lost. A reason for this might be the strong correlation between genomic and methylation features, in particular between DNA methylation and cytosine-related features. Additionally, our results indicate that the relationship between splicing and DNA methylation is locus-specific (Additional file 3: Figure S2). This may also explain why DNA methylation has limited benefits when prediction average splicing rates across cells or splicing quantified using bulk RNA-seq.
Besides the sequence conservation, a feature that has previously been described in bulk studies , the most relevant features to predict splicing were the k-mers CTC, CT, and CCT within the alternative exon (Additional file 3: Figure S2b). These k-mers point towards involvement of CTCF. Previous work has shown that CTCF motifs within introns are linked to splicing by slowing down RNA polymerase II, thereby leading to a higher chance of exon inclusion . Interestingly, there is a known link between DNA methylation and CTCF motifs . Methylation of CTCF-binding sites can block CTCF and thereby result in decreased inclusion rates of an exon. As the methylated k-mer equivalents were less predictive of splicing, we suggest a more complex involvement of DNA methylation in alternative splicing, potentially by locus-specific effects, which our current models are not able to capture.
In addition to modeling splicing ratios, we also considered categorical models of splicing to gain insights into the variability of splicing across cells (Fig. 3). The categories considered in our model reflect both the overall splicing rate and splicing variability across cells. Exons with included versus excluded splicing states could be accurately predicted. In contrast, the intermediate splicing categories which are reflective of single-cell variability could only be predicted with a lower accuracy. This might be due to the lower number of cassette exons assigned to these categories (multimodal n = 506, overdispersed n = 427, underdispersed n = 110, versus included n = 3278 and excluded n = 1944 in iPS cells) or reflect increased vulnerability to assay noise or more complex regulatory dependencies. As in the linear regression models, we observed that DNA sequence conservation scores were the most informative features for predicting splicing categories (Additional file 6: Table S5). Interestingly, for intermediate categories, the genomic information in the vicinity of the alternative exon rather than of the exon itself seemed to be predictive of splicing variability. Whereas DNA methylation did not contribute to improving the splicing prediction, we observe that DNA methylation levels of underdispersed cassette exons were significantly reduced in all genomic contexts, most significantly in the upstream exon. We hypothesize that the lower DNA methylation levels of underdispersed cassette exons give the sequence motifs more power to control splicing levels, i.e., increased DNA methylation levels lead to more stochasticity in splicing. This hypothesis is supported by the effect direction of methylation features, which are opposite between overdispersed and underdispersed cassette exons. We finally observe that the methylation k-mers are on average less informative of splicing than non-methylation features, potentially further supporting our hypothesis.
By leveraging data from two cell types, we were able to assess the consistency of splicing prediction across cell types and the relevant genomic and methylation features (Fig. 2c). The differences between features predictive of splicing between iPS and endoderm cells were primarily observed within the (methylated) k-mers, which are consistent with the known alteration of transcription factor activity and DNA methylation differences between cell types. Next, we were able to confirm the findings from Song et al.  that only a limited number of cassette exons switch splicing categories between cell types (Fig. 4a). Additionally, as previously described in the context of neural differentiation , switches between included and excluded categories were not observed. Most of the category switches were observed within the three intermediate splicing categories. Hereby, DNA methylation differences seemed to predate the switching ability. Using ridge regression, we were able to predict if a cassette exon would switch its splicing category between the cell types. Again, DNA methylation seemed to be particularly informative of intermediate splicing. It improved the predictability of switching in over- and underdispersed categories.
The novelties of our analyses also represent their main limitations. Single-cell sequencing intrinsically delivers fewer reads to assess gene expression and DNA methylation levels. Especially the genome coverage of the bisulfite-treated DNA sequencing remains low due to the low quantities of starting material. Using computational imputation, we were able to mitigate this effect to some extent. However, imputation strategies have limitations and in particular, loci that lack methylation information cannot be recovered.
The intrinsic properties of single-cell data also affect the accuracy of the estimated splicing ratios per cassette exon. We opted for a lenient threshold on read depth to determine splicing ratio, which delivered more cassette exons to train our models, but also rendered splicing ratios less accurate in comparison to deep-sequenced bulk data. The low read depth increases the chance of missing an isoform or cassette exon, an effect known as a dropout. Dropouts in single-cell RNA-seq data can have a strong impact on the fit of the cell or gene model. If one of the isoforms was completely unobserved, this would decrease the fit of the gene model. On the contrary, sequencing multiple cells at once would decrease the fit of the cell model. Given that our results are robust across cassette exons, cell types, and species, the overall findings we report are however not likely to be affected.
In summary, we showed for the first time that alternative splicing and splicing variability across cells can be predicted with genomic and DNA methylation information in single cells. We assessed the impact of DNA methylation and cellular features on cassette exon splicing and were able to replicate our findings in two human cell types and mouse ES cells. We investigated the stability and variance of splicing between the two cell types, and importantly, we showed that DNA methylation primes splicing switches during differentiation.
Single-cell transcription and methylation data was generated from a single donor from the Human Induced Pluripotent Stem Cells Initiative (HipSci) [15, 16], using the previously described protocol for single-cell methylation and transcriptome sequencing in the same cells (scM&T-seq) (see  for details). Line joxm_1, an induced pluripotent stem cell (iPSC) line derived from fibroblasts cells from HipSci project, was cultured and triggered into differentiation towards endoderm. scM&T-seq data was generated for 93 cells (together with 1 empty well as negative control and two 15-cell and 50-cell positive controls) at the undifferentiated time point (iPS) and the definitive endoderm time point (endoderm), yielding 186 cells for analysis.
Cell handling and differentiation
The joxm_1 IPSC line was cultured in Essential 8 (E8) media (LifeTech) according to the manufacturer’s instructions. For dissociation and plating, cells were washed × 1 with DPBS and dissociated using StemPro Accutase (Life Technologies, A1110501) at 37 °C for 3–5 min. Colonies were fully dissociated through gentle pipetting. Cells were washed × 1 with MEF medium  and pelleted gently by centrifuging at 285×g for 5 min. Cells were re-suspended in E8 media, passed through a 40-μm cell strainer, and plated at a density of 60,000 cells per well of a gelatin/MEF-coated 12-well plate in the presence of 10 μM Rock inhibitor—Y27632 [10 mM] (Sigma, Cat # Y0503—5 mg). Media was replaced with fresh E8 free of Rock inhibitor every 24 h post-plating. Differentiation into definitive endoderm commenced 72 h post-plating as previously described .
FACS preparation and analysis of cells
During all staining steps, cells were protected from light. Cells were dissociated into single cells using Accutase and washed × 1 with MEF medium as described above. Approximately 1 × 106 cells were resuspended in 0.5 mL of differentiation state-specific medium containing 5 μL of 1 mg/mL Hoechst 33342 (Thermo Scientific). Staining with Hoechst was carried out at 37 °C for 30 min. Unbound Hoechst dye was removed by washing the cells with 5 mL PBS + 2% BSA + 2 mM EDTA (FACS buffer); BSA and PBS were nuclease-free. For the staining of cell surface markers Tra-1-60 (BD560380) and CXCR4 (eBioscience 12-9999-42), cells were resuspended in 100 μL of FACS buffer with enough antibodies to stain 1 × 106 cells according to the manufacturer’s instructions and were placed on ice for 30 min. Cells were washed with 5 mL of FACS buffer, passed through a 35-μM filter to remove clumps, and re-suspended in 250 μL of FACS buffer for live cell sorting on the BD Influx Cell Sorter (BD Biosciences). Live/dead marker 7AAD (eBioscience 00-6993) was added just prior to analysis according to the manufacturer’s instructions, and only living cells were considered when determining the differentiation capacities. Living cells stained with Hoechst but not Tra-1-60 or CXCR4 were used as gating controls.
As previously described in Angermeuller et al. , scM&T-seq library preparation was performed following the published protocols for G&T-seq  and scBS-seq , with minor modifications as follows. G&T-seq washes were performed with 20 μl volumes, reverse transcription and cDNA amplification were performed using the original Smart-seq2 volumes , and Nextera XT libraries were generated from 100 to 400 pg of cDNA, using 1/5 of the published volumes. RNA-seq libraries were sequenced as 96-plexes on a HiSeq 2000 using v4 chemistry and 125 bp paired-end reads. BS-seq libraries were sequenced as 24-plexes using the same machine and settings, which yielded a mean of 7.4 M raw reads after trimming.
Gene expression quantification
For single-cell RNA-seq data, adapters were trimmed from reads using Trim Galore! [27, 28, 29], using default settings. Trimmed reads were mapped to the human reference genome build 37 using STAR  (version: 020201) in two-pass alignment mode, using the defaults proposed by the ENCODE consortium (STAR manual). Expression quantification was performed separately using Salmon  (version: 0.8.2), using the “--seqBias,” “--gcBias,” and “VBOpt” options on transcripts derived from ENSEMBL 75. Transcript-level expression values were summarized at the gene level (estimated counts) and quality control of scRNA-seq data was performed using scater . Cells with the following features were retained for analysis: (i) at least 50,000 counts from endogenous genes, (ii) at least 5000 genes with non-zero expression, (iii) less than 90% of counts are assigned to the top 100 expressed genes per cell, (iv) less than 20% of counts are assigned to ERCC spike-in sequences, and (v) a Salmon mapping rate of at least 40%. These filters jointly removed 9 iPS cells and 36 endoderm cells from our analysis.
Of the 186 cells, 84 (iPS) and 57 (endoderm) cells passed QC on gene expression data as described above. Exon splicing rates in individual cells were quantified using the data-dependent module of BRIE . BRIE calls splicing at predefined cassette exons and quantifies splicing using exon reads in single-cell data. By default, BRIE combines informative prior learned from sequence features and a likelihood calculated from RNA-seq reads by a mixture modeling framework that is similar to MISO . As our aim is to model the local and global determinants of splicing, we used splicing rate estimates based on the observed data at individual exons only. We detected and quantified splicing for between 1386 and 4917 exons per cell (minimum coverage 5 reads, in total considered 6265 (iPS) and 3873 (endoderm) cassette exons that were detected in at least 10 cells for further analysis.
The following settings were used to quantify splicing with BRIE: exons have to be located on autosomes and input chromosomes and should not be overlapped by any other alternatively spliced exon. The surrounding introns have to be longer than 100 bp, the length of the alternative exon regions has to be between 50 and 450 bp with a minimum distance of 500 bp from the next TSS or TTS, and the exon has to be surrounded by AG-GT. The default annotation file gencode.v19.annotation.gtf and the reference genome GRCh37.p13.genome.fa were downloaded from https://www.gencodegenes.org/human/release_19.html (May 2018) and used for subsequent analyses.
We used three different measurements to quantify splicing ratios (PSI), namely single-cell splicing ratios, pseudo-bulk splicing ratios, and variance of splicing ratios. To calculate single-cell PSI per cassette exon per cell, we only considered splicing events that were supported by at least five reads and limited the analysis to cassette exons which were observed in at least ten cells. To derive pseudo-bulk PSI per cassette exon, we aggregated the single-cell PSI values per cassette exon. The variance of PSI per cassette exon was defined as the standard deviation of PSI across single cells.
DNA methylation pre-processing and quantification
For DNA methylation data, single-cell bisulfite sequencing (scBS-seq) data was processed as previously described . Reads were trimmed with Trim Galore! [27, 28, 29], using default settings for DNA methylation data and additionally removing the first 6 bp. Subsequently, Bismark  (v0.16.3) was used to map the bisulfite data to the human reference genome (build 38), in single-end non-directional mode, which was followed by de-duplication and DNA methylation calling using default settings. We removed cells with low alignment rates (alignment rate < 15%) and cells with a library size of less than 1 M reads, resulting in 84 iPS cells and 53 endoderm cells with RNA and DNA methylation information.
To mitigate typically low coverage of scBS-seq profiles (20–40%; ), we applied DeepCpG  to impute unobserved methylation states of individual CpG sites. DNA methylation profiles in iPS and endoderm cells were imputed separately. The cell type-specific models were built using CpG and genomic information according to DeepCpG’s setup of a joint model (see  for details and default values; see Additional file 1: Table S1 for imputation accuracy as measured on a validation set per sample).
Predicted methylation states were binarized according to DeepCpG probability outputs as follows: sites with a probability of equal to or lower than 0.3 were set to 0 (un-methylated base), all methylation sites with a probability of greater than 0.7 were set to 1 (methylated base). Intermediate methylation levels were handled as missing. After imputation the methylation data was aligned back to human genome version 37 to match the expression data, using the UCSC lift-over tool .
We integrated the imputed methylation information into the DNA sequence by distinguishing methylated (M) and un-methylated (U) cytosines. Cytosines without methylation information after imputation were assigned the value of the closest cytosine with methylation information. If there was no methylation information within 900 bp around the cytosine, its state was set to un-methylated.
Cell and gene model assumptions
To assess if our PSI variation patterns follow the gene or the cell model , we compared the distribution of splicing rates to a binomial distribution that is expected according to the cell model and to the expected distribution according to the gene model.
The cell model assumes that each individual cell expresses only a single splice isoform, and hence models PSI variation as a bimodal distribution at the single cell level. The alternative gene model assumes splicing regulation on the gene level. The mean PSI of a gene is determined by the sequence. Each time a gene is transcribed, the probability of exon inclusion equals mean PSI. However, the limited number of transcripts leads to fluctuation in the observed PSI, and the binomial distribution is restrained by the upper boundary of the standard deviation. To obtain this upper boundary, we simulated the PSI of each cell as a binomial distribution and calculated the standard deviation across the cells. We only considered genes that were covered by at least 5 reads per cell in least 10 cells. To obtain the mean standard deviation, we repeated this simulation 400 times.
The genomic features used to predict the splicing ratios and its variance were based on the features described by BRIE and Xiong et al. [5, 8]. As these features were specifically designed to study exon skipping events at cassette exons, they capture sequence variation around the alternatively spliced exon. This region is first split in five genomic contexts: the alternative exon itself, the two neighboring exons and the introns between the exons. Logarithmic length, relative length, and the strength of the splice site motifs at the exon-intron boundaries were calculated per genomic context. The strength of the splice site was defined as the similarity between this splice site and known splice motives. Additional features were calculated on seven genomic contexts, the three exons and the 5′ and 3′ boundaries of the two introns. Only the two boundary contexts of the introns (300 bp length) were used since intron length is highly variable and the boundaries are found to be the most relevant contexts for splicing.
Altogether, 607 features were calculated for these genomic contexts per cassette exon: PhastCons scores  that describe sequence conservation, length of the sequence contexts, and sequence composition-based k-mer frequencies (with k ≤ 3) (“genomic” features, the “Methods” section, Additional file 5: Table S4). The k-mers reflect the percentage of nucleotides in the context that match the respective specific motif. The PhastCons scores were retrieved for alignments of 99 vertebrate genomes with the human genome from hg19.100way.phastCons.bw from UCSC (May 2018) .
In addition to the genomic features, we defined up to 826 DNA methylation features derived from the imputed DNA methylation information, including an extended k-mer alphabet that takes the methylation status into account, as well as DNA methylation average and variance (across CpG sites), in each of the 7 sequence contexts of a cassette exon. Methylation features describe the methylation patterns of either individual cells (“genomic and cell methylation” features) or averaged across cells (“genomic and mean methylation” features; Additional file 5: Table S4). More specifically, for the single-cell PSI model, we considered cell-specific methylation levels; the k-mer features were extended by including un-methylated (U) and methylated (M) cytosine into the alphabet as follows: Cytosines without methylation information after imputation were assigned the value of the closest cytosine with methylation information. If there was no methylation information within 900 bp around the cytosine, its state was set to un-methylated. For the bPSI model, we included the mean frequencies of the k-mers that contained “M” or “U” across cells and the averaged methylation values as described above.
In bulk RNA-seq data, splicing events can be broadly categorized into two major categories: included and excluded. Leveraging the single-cell information, we defined more fine-grained splicing categories that reflect both splicing rates and splicing variability across cells (inspired by Song et al. ): (1) excluded (mean PSI < 0.2), (2) included (mean PSI > 0.8), (3) overdispersed, (4) underdispersed, and (5) multimodal (Fig. 3a). The latter three categories categorize the extent of splicing variation across cells, since cassette exons with intermediate average splicing rates (here 0.2 ≤ mean PSI ≤ 0.8, Fig. 1) exhibit substantial differences in splicing variance. To characterize cells into these three categories, we calculated the distribution of the distance between the observed and the expected variation per cell type. The expected variation was calculated by a scaled binomial standard deviation, where the scaling factor and the mean splice rate of the alternative exon  are fit to all data points. We then defined the overdispersed cassette exons as those for which the deviation from the expected PSI was higher than the third quartile plus 1.5x interquartile range (IQR) (corresponding to > 0.016 in iPS and > 0.022 in endoderm). Likewise, for the definition of the underdispersed cassette exons, we used the first quartile minus 1.5x IQR as the threshold (corresponding to less than − 0.032 in iPS and less than − 0.039 in endoderm cells). The remaining cassette exons were assigned to the multimodal category.
Relating DNA methylation heterogeneity and splicing
We applied Spearman correlation to link splicing at a single locus to variation in DNA methylation observed between cells. The test was performed per sequence context of the cassette exon (Fig. 1c). We only considered cassette exons where variation in splicing and variation of DNA methylation of the relevant context were observed. In total, 5280 iPS and 2622 endoderm cassette exons were tested. The P values were adjusted for multiple testing using the Q value [37, 38] package in R. The gene enrichment across the cassette exons was performed using g:Profiler  (version: 2017-10-25, g:Profiler Ensembl 90), using all observed cassette exons per cell type as background. Multiple testing correction for the enrichments was performed within g:Profiler.
Prediction of PSI and categories
We applied linear ridge regression to model single-cell and pseudo-bulk PSI and (multi-class) logistic ridge regression to model PSI categories. The models are based on only the genomic features or on both genomic and DNA methylation features. The performance of linear models was evaluated using Pearson r2 between predicted and observed splicing rates. For the multi-class prediction models, we applied a one-versus-rest scheme and report the per-category and the macro-average area under the receiver operating curves (AUC). To determine the most relevant individual features, we additionally trained regression models based on each single feature. Per feature, we report, in the case of the linear models, Pearson correlation (r, r2) and, in the case of the logistic models, the absolute weight multiplied by the standard deviation of the feature and the AUC. We assessed the performance and parameters of the models by using a tenfold cross validation (CV) with fixed training-validation splits. To assess the variability of prediction performances, we repeated the CV procedure four times with different CV splits. Error bars indicate ± 1 standard deviation of the respective statistic (AUC, r2).
To replicate our results, we processed the mouse ES single-cell scM&T-seq data (n = 80) presented in Angermueller et al. . We reprocessed the aligned RNA and DNA methylation data to quantify splicing following the same protocols that were applied to the human data, with the following changes: GRCm38 was used as a reference for imputation, genome and transcriptome annotations were based on gencode v18 (“GRCm38.p6.genome.fa” as genomic, “gencode.vM18.annotation.gff3” as transcriptomic reference, available at ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M18/ [August 2018]), and conservation scores were taken from “mm10.60way.phastCons.bw” downloaded from UCSC  (August 2018).
Out of the 80 cells, in total, 12 cells did not pass quality control on the transcriptome data, Cells with less than 500,000 sequenced reads or had less than 80% of the reads aligned to the genome were removed. Additionally, 4 cells did not pass quality on the DNA methylome data. Cells with less than 1 million reads aligned and bismark mapping efficiency below 7% were discarded. The filters yielded 68 cells that were used for the splicing analysis and 64 that are used for the analyses including DNA-methylation data. In these cells, we quantified between 649 and 1433 cassette exons per mouse ES cell (minimum coverage of 5 reads); in the replication analysis, we considered 2194 exons that were supported by at least 1 cells.
Availability of source code
Python and R were used for data processing, modeling, and visualization of the results. All regression models are based on implementations available in the package scikit-learn . Software and scripts are available as jupyter notebooks at https://github.com/PMBio/scmt_splicing .
We thank the staff in the Cellular Genetics and Phenotyping and Sequencing core facilities at the Wellcome Trust Sanger Institute. We acknowledge the participation of all NIHR Cambridge BioResource volunteers, and thank the NIHR Cambridge BioResource centre staff for their contribution. We thank the National Institute for Health Research and NHS Blood and Transplant. The NIHR/Wellcome Trust Cambridge Clinical Research Facility supported the volunteer recruitment. We thank Y. Huang & R Argelaguet for helpful discussions and comments on the manuscript.
This work was funded with a strategic award from the Wellcome Trust and UK Medical Research Council (WT098503). Work at the Wellcome Trust Sanger Institute was further supported by Wellcome Trust grant WT090851. S.M.L. received support from the Stiftung Familie Klee. L.U. received support from core funding of the European Molecular Biology Laboratory and the European Union’s Horizon 2020 research and innovation programme (grant agreement number N635290). M.J.B. was supported by a fellowship from the EMBL Interdisciplinary Postdoc (EI3POD) program under Marie Skłodowska-Curie Actions COFUND (grant number 664726). Research from L.V. laboratories is supported by the European Research Council advanced grant New-Chol, the Cambridge University Hospitals National Institute for Health Research Biomedical Research Center, a core support grant from the Wellcome Trust and Medical Research Council to the Wellcome Trust – Medical Research Council Cambridge Stem Cell Institute (PSAG028). WR was supported by grants from BBSRC (BB/K010867/1), Wellcome Trust (095645/Z/11/Z), EU BLUEPRINT, and EpiGeneSys.
Availability of data and materials
All sample metadata can be accessed via the HipSci data portal (http://www.hipsci.org), which references to EMBL-EBI archives that are used to store the HipSci data. DNA methylation and RNA expression data corresponding to the main analysis of this manuscript are available via the SRA under bioproject: PRJEB15062 , sample metadata is available in Additional File 1: Table S1. The replication dataset, originally generated for the Angermueller et al.  paper , is available via the SRA under bioproject: PRJNA300666.
SML contributed to the data processing, splicing ratio modeling, deep modeling, and writing of the manuscript. LU contributed to the splicing variation modeling, supervision of splicing modeling and deep modeling, and writing of the manuscript. DJM contributed to the gene expression data processing and cell-level feature calculations and contributed to the editing. MC, SC, SA, LV, and WR contributed to the data generation and editing. IE supervised the master thesis project this study is based on and contributed to the editing. OS contributed to the supervision of modeling, writing of the manuscript, and study design. MJB contributed to the methylation data processing, supervision of splicing modeling, and writing of the manuscript. All authors read and approved the final manuscript.
Ethics approval and consent to participate
All samples for the HipSci resource were collected from consented research volunteers recruited from the NIHR Cambridge BioResource (http://www.cambridgebioresource.org.uk). Samples were collected initially under ethics for iPSC derivation (REC Ref: 09/H0304/77, V2 04/01/2013), with later samples collected under a revised consent (REC Ref: 09/H0304/77, V3 15/03/2013).
Consent for publication
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
- 1.Black DL. Mechanisms of alternative pre-messenger RNA splicing. Annu Rev Biochem. 2003;72:291–336. https://doi.org/10.1146/annurev.biochem.72.121801.161720.CrossRefPubMedGoogle Scholar
- 21.Brooks AN, Aspden JL, Podgornaia AI, Rio DC, Brenner SE. Identification and experimental validation of splicing regulatory elements in Drosophila melanogaster reveals functionally conserved splicing enhancers in metazoans. RNA. 2011;17:1884–94. https://doi.org/10.1261/rna.2696311.CrossRefPubMedPubMedCentralGoogle Scholar
- 22.Avsec Z, Kreuzhuber R, Israeli J , Xu N, Cheng J, Shrikumar A, et al. Kipoi: accelerating the community exchange andreuse of predictive models for genomics. BioRxiv. 2018:375345. https://doi.org/10.1101/375345.
- 27.Krueger F. Trim Galore!: A wrapper tool around Cutadapt and FastQC to consistently apply quality and adapter trimming to FastQ files, with some extra functionality for MspI-digested RRBS-type (Reduced Representation Bisufite-Seq) libraries. 2012. https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/.
- 28.Martin M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal 2011;17:10–12Google Scholar
- 29.Andrews S. FastQC: a quality control tool for high throughput sequence data. 2010. http://www.bioinformatics.babraham.ac.uk/projects/fastqc.Google Scholar
- 37.Storey JD. False discovery rates. USA: Princet Univ Princeton; 2010.Google Scholar
- 38.Bass Jds, Dabney A, Robinson D. qvalue: Q-value estimation for false discovery rate control. 2015. http://github.com/jdstorey/qvalue.
- 39.Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, et al. Scikit-learn: Machine Learning in Python. J Mach Learn Res. 2011 12;2825–30.Google Scholar
- 40.Linker S, Urban L, Clark SJ, Chhatiwala M, Amatya S,McCarthy DJ et al. Source code available from: https://doi.org/10.5281/zendo.2542560.
- 41.Linker S, Urban L, Clark SJ, Chhatiwala M, Amatya S,McCarthy DJ et al. [Dataset] Available from: https://www.ebi.ac.uk/ena/data/view/PRJEB15062
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.