IntEREst: intron-exon retention estimator
In-depth study of the intron retention levels of transcripts provides insights into the mechanisms regulating pre-mRNA splicing efficiency. Additionally, detailed analysis of retained introns can link these introns to post-transcriptional regulation or identify aberrant splicing events in human diseases.
We present IntEREst, Intron–Exon Retention Estimator, an R package that supports rigorous analysis of non-annotated intron retention events (in addition to the ones annotated by RefSeq or similar databases) and supports intra-sample in addition to inter-sample comparisons. It accepts binary sequence alignment/map (.bam) files as input and determines genome-wide estimates of intron retention or exon-exon junction levels. Moreover, it includes functions for comparing subsets of user-defined introns (e.g. U12-type vs U2-type), and its plotting functions allow visualization of the distribution of the retention levels of the introns. Statistical methods are adapted from the DESeq2, edgeR and DEXSeq R packages to extract the significantly more or less retained introns. Analyses can be performed either sequentially (on a single core) or in parallel (on multiple cores). We used IntEREst to investigate U12- and U2-type intron retention in human and plant RNAseq datasets with defects in the U12-dependent spliceosome due to mutations in the ZRSR2 component of this spliceosome. Additionally, we compared the retained introns discovered by IntEREst with those of other methods and studies.
IntEREst is an R package for intron retention and exon-exon junction level analysis of RNA-seq data. Both the human and plant analyses show that the U12-type introns are retained at a higher level than the U2-type introns already in the control samples, but the retention is exacerbated in patient or plant samples carrying a mutated ZRSR2 gene. Intron retention events caused by ZRSR2 mutation that we discovered using IntEREst (DESeq2-based function) show considerable overlap with the retained introns discovered by other methods (e.g. IRFinder and the edgeR-based function of IntEREst). Our results indicate that increases in both the number of biological replicates and the depth of the sequencing library promote the discovery of retained introns, but the effect of library size gradually diminishes beyond 35 million reads mapped to the introns.
Keywords: RNA-seq; Intron retention; Alternative splicing; RNA; Expression analysis; Bioconductor; U12-type introns; U2-type introns
FPKM: Fragments per kilobase per million reads
PSI: Percent spliced in
Alternative pre-mRNA splicing is a cellular process in eukaryotes that generates multiple transcripts from a single gene. Of the various types of alternative splicing (reviewed by Hamid and Makeyev), intron retention (IR) events have been less well characterized than the alternative splicing events that are more frequent in mammals, such as exon skipping and the choice of alternative 5' splice site (5'ss) and 3' splice site (3'ss). While the best characterized IR events have been detected in humans with diseases caused by mutations in the core pre-mRNA splicing machinery, recent work has established that regulated IR events are also part of the normal regulation of gene expression [2, 3] and function in important biological processes such as cellular differentiation. Furthermore, in some taxa such as plants, IR is one of the most prominent mechanisms of alternative splicing.
A well-established example of IR involves U12-type introns (also called minor introns), which are spliced less efficiently than the U2-type (major) introns. The classification into major U2-type introns and minor U12-type introns derives from the coexistence of two parallel pre-mRNA splicing machineries in the cells of most metazoan species. The majority of metazoan introns are excised by the "major" U2-dependent spliceosome and are therefore referred to as U2-type or major introns. A small subset of metazoan introns, approximately 0.35% or roughly 700-800 introns in mammals, are excised by a parallel U12-dependent spliceosome, also known as the minor spliceosome. The targets of the U12-dependent spliceosome are minor introns, which feature highly conserved but divergent 5'ss and branch point sequences (BPS), making it possible to identify these introns computationally [8, 9]. One of the main characteristics of the minor spliceosome is that it is less efficient than the major spliceosome [10, 11, 12]. As a result of the inefficient splicing, elevated levels of transcripts containing unspliced minor introns are retained in the nucleus and targeted by nuclear RNA decay pathways. Moreover, disease-causing mutations in the snRNA and protein components of the minor spliceosome (e.g. U4atac and U12 snRNAs, U11/U12-65K and ZRSR2 proteins) show, among other splicing defects, a further increase in IR levels of the U12-type introns [13, 14, 15, 16, 17, 18, 19].
Various alternative splicing analysis tools have been developed [20, 21, 22]; however, few tools exist that focus on extracting novel intron retention (IR) events and performing differential IR analysis. For a robust analysis of retention levels of introns within and between samples we developed IntEREst, i.e. Intron–Exon Retention Estimator, which is based on the intron retention analysis used in Niemelä et al. IntEREst accepts standard binary sequence alignment/map (.bam) files as input and estimates the genome-wide retention levels of the introns using sequencing reads mapping to introns, intron-exon boundaries, or exon-exon junctions. The results are provided both as IR fold changes and as relative PSI or Ψ (percent spliced in) values and can be further analyzed by any of the several statistical methods included, e.g. a differential intron retention test based on the “exon usage test” provided by DEXSeq [25, 26], a differential IR test based on the count data differential analysis tools provided by DESeq2, or the exact test, generalized linear models and quasi-likelihood test adapted from edgeR [28, 29]. The statistical tests calculate p-values based on the null hypothesis that IR does not vary across the analyzed sample groups. The resulting p-values estimated for each intron allow subsequent identification of introns that show a statistically significant difference in IR between the sample groups. IntEREst also provides tools for plotting the distribution of retention levels of the introns of interest within single or multiple samples. In addition, large datasets that demand significant computation time can be analyzed in parallel on multiple computing cores. IntEREst is available as a Bioconductor package and, together with its manuals, is accessible through https://bioconductor.org/packages/release/bioc/html/IntEREst.html.
IntEREst is an R package that supports various functions to measure the retention levels of the introns, perform statistical differential intron retention analysis across samples, and plot the distribution of retention levels of different types of introns across samples. The main design aim of IntEREst has been to support analysis of relatively low IR values (<10%) that are more challenging to analyze with the existing software but are typical for the U12-type introns and for U2-type introns in human diseases with mild defects in spliceosome function. In such cases the commonly used Ψ values, particularly with default cutoffs, may underestimate the extent of IR. Specifically, the advantages of IntEREst are the ability to use multiple test samples and controls, the possibility to define complex experimental designs (incorporating various sample annotations such as age and sex) for IR comparisons across samples, parallelization of the computation across multiple nodes/cores, integration into the Bioconductor environment, and the use of both intronic and exon-junction reads, either alone or together, to estimate the IR levels. Additionally, besides providing a global IR analysis, IntEREst supports analysis of user-defined subsets of introns, e.g. U12-type and U2-type introns.
The RNAseq read summarization functions (i.e. interest() and interest.sequential()) accept a .bam read alignment file and a reference as inputs, and output the raw (un-normalized) and normalized number of fragments mapping to each exon or intron. The reference includes coordinates of exons and introns together with their annotations, such as gene and transcript names, and intron type identifier. The reference can be built using the referencePrepare() function supported by IntEREst. Note that the intron identifiers used in our analysis are U12- and U2-type introns, but the application of IntEREst is not limited to the comparison of these intron types. Other classifications can be defined by the user and the retention levels of the introns can be plotted and compared across the user-defined classes. The functions in the IntEREst package that are specific to the comparisons of the U2- vs U12-type introns, e.g. u12Boxplot(), u12DensityPlot(), u12Index() start with “u12”.
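To make the distinction between raw and normalized counts concrete, the following is a minimal sketch of an FPKM-style normalization (fragments per kilobase per million mapped fragments). It is a generic Python illustration and not IntEREst's R code; the function name `fpkm`, the example numbers, and the assumption that the normalized values are FPKM-like are all ours.

```python
def fpkm(fragment_count, feature_length_bp, total_mapped_fragments):
    """Fragments per kilobase of feature per million mapped fragments."""
    if feature_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("feature length and library size must be positive")
    kilobases = feature_length_bp / 1_000
    millions = total_mapped_fragments / 1_000_000
    return fragment_count / (kilobases * millions)


# Hypothetical intron: 50 fragments, 2,000 bp long,
# in a library of 20 million mapped fragments.
example_fpkm = fpkm(50, 2_000, 20_000_000)
```

Dividing by both feature length and library size makes values comparable across introns of different lengths and across samples sequenced to different depths.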
IntEREst features two functions that estimate the raw and normalized intron retention levels: 1) interest(), which can run in parallel on multiple computing cores, and 2) interest.sequential(), which runs sequentially on a single computing core. These functions use the bpiterate() function from the BiocParallel R Bioconductor package to read and analyze the mapped reads, m reads at a time (by default m = 1 million), to comply with the memory limitations of the running environment. When running interest.sequential(), the mapped reads are analyzed as batches of m reads (or read pairs if the isPaired parameter is set to TRUE) at a time on a single computing core. With interest() it is possible to analyze n batches of m reads (i.e. m×n reads or read pairs) simultaneously, distributed over n computing cores, and to repeat this process until all reads have been analyzed.
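The batching scheme can be sketched as follows. This is a simplified Python illustration under our own assumptions, not the BiocParallel-based implementation: each read is represented only by the ID of the intron it maps to, the helper names are hypothetical, and worker threads stand in for computing cores.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def batches(reads, m):
    """Yield successive lists of at most m reads."""
    iterator = iter(reads)
    while True:
        chunk = list(islice(iterator, m))
        if not chunk:
            return
        yield chunk


def count_batch(chunk):
    """Count how many reads in one batch hit each intron."""
    return Counter(chunk)


def count_sequential(reads, m):
    """Process one batch of m reads at a time on a single worker."""
    total = Counter()
    for chunk in batches(reads, m):
        total.update(count_batch(chunk))
    return total


def count_parallel(reads, m, n):
    """Process batches of m reads on n workers and merge the partial counts."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=n) as pool:
        for partial in pool.map(count_batch, batches(reads, m)):
            total.update(partial)
    return total


# Toy input: each entry is the (hypothetical) intron a read maps to.
toy_reads = ["intronA", "intronB", "intronA", "intronC",
             "intronA", "intronB", "intronC"]
sequential_counts = count_sequential(toy_reads, m=3)
parallel_counts = count_parallel(toy_reads, m=3, n=2)
```

Because per-batch counts are simply summed, both strategies give the same totals; only memory use (bounded by m) and wall-clock time (reduced by n) differ.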
IntEREst provides a function lfc() that estimates the log2 fold change (FC) of the retention levels between two conditions; moreover, it includes a function psi() to measure the Ψ values, i.e. the percent spliced in, for all studied introns. We have adapted several statistical tests from multiple sources for intron retention and exon-junction analysis: DESeq2, edgeR [28, 29], and DEXSeq [25, 26]. All these methods can be used to study intron retention changes across samples on a genome-wide scale. However, the DEXSeq-based method (i.e. the DEXSeqInterest() function) differs from the others in that it uses the differential exon usage method to perform gene-wise comparisons.
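The two summary measures can be illustrated with a short sketch. It uses common textbook definitions, which are assumptions on our part: Ψ as the percentage of reads supporting intron inclusion among inclusion plus exon-exon junction reads, and log2 FC with a pseudocount of 1. The exact formulas inside lfc() and psi() may differ, and the Python function names below are hypothetical.

```python
import math


def psi(retention_reads, junction_reads):
    """Percent spliced in: share of reads that support intron inclusion."""
    total = retention_reads + junction_reads
    if total == 0:
        return float("nan")  # no evidence either way
    return 100.0 * retention_reads / total


def log2_fold_change(test_level, control_level, pseudocount=1.0):
    """log2 ratio of retention levels; the pseudocount avoids division by zero."""
    return math.log2((test_level + pseudocount) / (control_level + pseudocount))


# Hypothetical intron: 10 retention reads vs 90 exon-exon junction reads.
example_psi = psi(10, 90)
# Hypothetical normalized retention levels: 31 in test, 7 in control.
example_lfc = log2_fold_change(31, 7)
```

In this toy case Ψ is 10% and the log2 FC is 2 (a four-fold increase after adding the pseudocount), which shows how a clear relative change can coexist with a modest absolute Ψ.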
Results and discussion
Genome-wide analysis of retention of U2 and U12-type introns
To demonstrate the application of IntEREst in comparing retention levels of various types of introns across several samples, we reanalyzed the RNAseq data from myelodysplastic syndrome (MDS) patients and control subjects included in the study by Madan et al. Specifically, we compared the genome-wide retention levels of U12-type vs U2-type introns across the MDS samples. This disease is caused by mutations in the ZRSR2 gene, which encodes an integral protein component of the minor spliceosome. Moreover, the original analysis of the dataset reported that the ZRSR2 mutations in the patient samples led to increased retention of primarily U12-type introns, while the U2-type introns were reported to be less affected. The dataset represents 16 individuals: 8 were diagnosed with MDS and featured mutations in the ZRSR2 gene (referred to as ZRSR2mut), 4 were diagnosed with MDS but lacked the ZRSR2 mutations (referred to as ZRSR2wt), and 4 were healthy individuals (HEALTHY).
We ran a genome-wide retention comparison of U12-type introns to U2-type introns. To carry out the analysis, we used RefSeq as a reference and identified and annotated 510 U12-type introns using the annotateU12() function, which uses Position Weight Matrices (PWM) extracted from the U12DB database. Next we performed the differential IR analysis using the DESeq2-based function of IntEREst (comparing the ZRSR2mut samples vs ZRSR2wt and HEALTHY). The DESeq2 test was run by considering results from both the intron retention and the exon-exon junction runs of the interest() function. Initially, by using the interestResultIntEx() function, a result object was built that includes information on both intron retention and exon-exon junction levels (see Additional file 1 for more details).
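How a position weight matrix can flag candidate splice-site motifs is sketched below. The 4-position matrix, the uniform background, and the function names are invented for illustration; they are not the U12DB matrices, and this is not the scoring rule or thresholds used by annotateU12().

```python
import math

# Hypothetical PWM: probability of each base at each of 4 motif positions.
TOY_PWM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]
BACKGROUND = 0.25  # assumed uniform base composition


def pwm_score(seq, pwm=TOY_PWM, background=BACKGROUND):
    """Log-odds score of seq against the PWM (higher means a better match)."""
    if len(seq) != len(pwm):
        raise ValueError("sequence length must equal PWM width")
    return sum(math.log2(column[base] / background)
               for column, base in zip(pwm, seq.upper()))


def best_window(seq, pwm=TOY_PWM):
    """Slide the PWM along seq and return (best_score, start_index)."""
    width = len(pwm)
    if len(seq) < width:
        raise ValueError("sequence shorter than PWM")
    return max((pwm_score(seq[i:i + width], pwm), i)
               for i in range(len(seq) - width + 1))


best_score, best_start = best_window("GGATACGG")
```

A positive score means the window resembles the motif more than the background; an intron would be labeled with a class only if its best-scoring windows exceed chosen thresholds.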
To further evaluate the validity and generality of our results, we compared the MDS results to similar results that we obtained from analyzing an additional Maize dataset (see Additional file 1 for more details). The Maize data consists of 6 samples (i.e. 3 roots and 3 shoots, referred to as RGH3mut) that feature mutations in the gene RGH3 (the ortholog of the human ZRSR2 gene) and 6 samples (3 roots and 3 shoots, referred to as RGH3wt) that lack the mutation. The results of the Maize data analysis mirror our findings with the MDS data. Analogous to the MDS data, the RGH3mut samples showed increased IR in ~46% of U12-type introns, while only ~0.46% of the U2-type introns showed an increase in IR (see Additional file 1: Figure S7).
Together, our results suggest that IntEREst provides reliable quantification of differential IR events. Specifically, our results are not only consistent with the well-documented increased retention levels of U12-type introns [6, 11, 31], but are also in concordance with the molecular function of the ZRSR2 protein (and its Maize ortholog, i.e. RGH3) in the recognition of U12-type introns [17, 34].
Benchmarking and comparison to other methods
We evaluated the performance of IntEREst in two ways using the MDS benchmark dataset. First, we carried out an internal analysis comparing IntEREst results obtained with the different statistical analysis packages implemented in IntEREst. Subsequently, we carried out a comparison with both the published results of the MDS analysis and IRFinder, a dedicated software for IR analysis. Note that all comparisons described in the following are based on the introns that were available in both references used by the compared counterparts.
Differentially up- and down-regulated introns in methods implemented in IntEREst
We compared the three methods implemented in IntEREst for differential intron retention analysis, i.e. DESeq2, the GLM function of edgeR, and DEXSeq, referred to hereafter as IntEREst-DESeq2, IntEREst-edgeR and IntEREst-DEXSeq, respectively. DESeq2 and edgeR have previously been reported to yield somewhat dissimilar results in differential gene expression analysis. In contrast, the DEXSeq method differs in its application (see above). For the IntEREst-DESeq2 and IntEREst-edgeR comparison, we first merged the intron-exon and the exon-exon junction results (obtained by running interest() in its two running modes) using interestResultIntEx(). Subsequently, we used the deseqInterest() and glmInterest() functions (i.e. the IntEREst functions based on DESeq2 and edgeR-GLM) to analyze the change in IR relative to the change in the junction levels of the flanking exons. We used an adjusted p-value (Benjamini and Hochberg) cutoff of 0.01 to identify introns that are retained at a significantly higher or lower level in the ZRSR2mut samples compared to controls (see Additional file 1 for more details).
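For readers unfamiliar with the Benjamini and Hochberg adjustment, the following minimal Python sketch shows how raw p-values are converted to adjusted p-values and filtered at the 0.01 cutoff. The p-values are invented, and in practice the adjustment is performed by the underlying statistical packages rather than by hand-written code like this.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted
    # values never increase as the raw p-value decreases.
    for rank in range(n, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * n / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical raw p-values for four introns.
toy_p_values = [0.001, 0.002, 0.02, 0.5]
toy_adjusted = benjamini_hochberg(toy_p_values)

# Indices of introns passing the adjusted p-value cutoff of 0.01.
significant = [i for i, q in enumerate(toy_adjusted) if q < 0.01]
```

Here the third intron has a raw p-value of 0.02 but an adjusted value of about 0.027, so only the first two introns pass the cutoff.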
Comparison of IntEREst-DESeq2 to IntEREst-DEXSeq revealed a considerable overlap between the two methods (Fig. 3 c). However, IntEREst-DEXSeq identified a large number of significantly less retained introns not identified by IntEREst-DESeq2 (Fig. 3 d). This outcome reflects the gene-wise method adopted in DEXSeq, where the variation in the retention level of each intron is compared to the relative retention variation of all other introns within the same gene, rather than solely comparing the genome-wide changes in IR levels. This results in a more symmetric distribution of up- and down-regulated intron retention signals (Fig. S4). As a consequence, the significantly more and less retained introns discovered by IntEREst-DEXSeq were found in the same genes more than twice as frequently as those identified by IntEREst-DESeq2. Furthermore, IntEREst-DEXSeq only considers the reads that map to either introns or exons (here the intron read counts were used) and does not support the use of both intron retention and exon-exon junction information.
IntEREst-DESeq2 and IRFinder show extensive overlap
We next compared IntEREst-DESeq2 to IRFinder, a dedicated IR analysis software that also uses the DESeq2 package in its downstream analysis. Since IntEREst-DESeq2 counts reads that map to the exons, we used the mean of the number of reads mapping to the 5’ and 3’ flanking exons. In contrast, IRFinder counts the junction reads that map across the flanking exons. Running IRFinder with the default parameters extracted 250 introns showing significantly increased IR in ZRSR2mut samples, most of which (i.e. 235) overlapped with the introns discovered by IntEREst-DESeq2 (Fig. 3 e). Note that IntEREst utilized more intron/exon-mapped reads than IRFinder. This was particularly evident for introns with lower retention levels, thus providing better-supported fold-change estimates for such introns (Additional file 1: Figure S5).
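An overlap comparison of this kind reduces to set operations on intron identifiers. The sketch below assumes introns are keyed by genomic coordinates so that calls from two tools can be matched; the coordinates and the helper name are hypothetical, and only the counts 235 and 250 come from the text above.

```python
def overlap_summary(calls_a, calls_b):
    """Summarize shared and tool-specific introns between two call sets."""
    shared = calls_a & calls_b
    return {
        "shared": len(shared),
        "only_a": len(calls_a - calls_b),
        "only_b": len(calls_b - calls_a),
        "fraction_of_a_shared": len(shared) / len(calls_a) if calls_a else float("nan"),
    }


# Hypothetical calls keyed by (chromosome, start, end, strand).
tool_a_calls = {("chr1", 100, 500, "+"), ("chr1", 900, 1400, "+"),
                ("chr2", 50, 300, "-"), ("chr3", 10, 80, "+")}
tool_b_calls = {("chr1", 900, 1400, "+"), ("chr2", 50, 300, "-"),
                ("chr3", 10, 80, "+"), ("chr4", 5, 60, "-"),
                ("chrX", 70, 200, "+")}
toy_summary = overlap_summary(tool_a_calls, tool_b_calls)

# Fraction of the 250 IRFinder introns also found by IntEREst-DESeq2.
reported_fraction = 235 / 250
```

With the reported counts, 94% of the introns called by IRFinder were also called by IntEREst-DESeq2.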
Enhanced discovery of IR events in MDS samples
We further compared our IR results with the original analysis of the MDS dataset by Madan et al. We found that IntEREst-DESeq2 was able to identify most (i.e. 177 out of 205 introns) of the significant IR events reported by Madan et al., but it also discovered a large number of additional events not reported in the original study (Fig. 3f), representing both U12-type (149) and U2-type (1195) introns. In contrast, the events that were reported in the original study but missed in our analysis all represent borderline cases featuring low fold-changes and low statistical significance (Additional file 1: Figure S6).
Together, our results revealed that the different methods implemented in IntEREst are able to identify a highly overlapping set of high-confidence differentially retained introns. Additionally, each method also identified IR events that are unique to that method. This provides the flexibility to select the approach best suited to the particular research question.
Sample size and sequencing library size sensitivity
A similar trend was also observed when analyzing the effect of intron/exon read coverage levels. Here we distributed 5-50 million reads according to the relative retention levels of the introns and exon-exon junction levels (based on the complete data) in each sample, followed by analysis with IntEREst-DESeq2. In our analyses we assumed that the quality and read coverage are equal in all the individual MDS datasets. As a result, we observed that an increase in the sequencing library size leads to the discovery of increasing numbers of introns showing statistically significant deviation in IR levels. However, the rate of increase in the number of discovered IR events decreases and levels off at the highest library sizes (more than 35M; Fig. 4c).
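One way to simulate a library of a given size is to draw reads with probabilities proportional to the observed counts. The sketch below does this with a seeded multinomial-style draw; the feature names, counts, and the choice of a multinomial draw are our assumptions rather than a description of the exact resampling procedure used in the analysis.

```python
import random


def resample_counts(observed_counts, target_reads, seed=0):
    """Draw target_reads reads with probabilities proportional to observed counts."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    features = list(observed_counts)
    weights = [observed_counts[feature] for feature in features]
    draws = rng.choices(features, weights=weights, k=target_reads)
    resampled = dict.fromkeys(features, 0)
    for feature in draws:
        resampled[feature] += 1
    return resampled


# Hypothetical complete-data counts for one sample.
observed = {"intron1": 30, "junction1": 70, "intron2": 0}

# Simulate a library of 1,000 reads (real analyses would use millions).
simulated = resample_counts(observed, target_reads=1000)
```

Repeating the draw for each target size (for example from 5 to 50 million reads) and rerunning the differential test yields the kind of library-size curve described above. For realistic read numbers a vectorized multinomial sampler would be much faster than this pure-Python loop.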
Here we present IntEREst, an R package for intron retention and exon-exon junction analysis. Our method is able to extract the significantly retained introns and carry out intra- and inter-sample comparisons of the retention levels of the introns and exon junction levels. We used IntEREst to analyze the publicly available MDS data, and our results confirm that mutations in the ZRSR2 gene, a component of the minor spliceosome involved in recognition of the 3΄ splice site of U12-type introns, lead to increased IR, particularly of the U12-type introns. Furthermore, our results show that, compared to the U2-type introns, the IR of U12-type introns is already higher in the control samples, but the mutations in the ZRSR2 gene further exacerbate the IR in the patient cells. These conclusions are further supported by our analysis of Maize data with mutations in the plant ortholog of the ZRSR2 gene, which, similarly to the human data, also show a strong bias towards increased IR of the U12-type, but not U2-type, introns. The introns showing significantly higher or lower IR in the ZRSR2mut samples vs control samples in the MDS dataset that we discovered using IntEREst-DESeq2 (Additional file 2) overlap with the introns identified by IRFinder and IntEREst-edgeR. Furthermore, we not only detected most of the IR events reported in the original study by Madan et al., but also discovered additional significant IR events featuring both U12- and U2-type introns.
The resampling analysis of ZRSR2mut vs control samples shows that by including more biological replicates and a larger sequencing library size, an increasing number of significant IR events can be discovered. While the maximum number of biological replicates (eight) used in this study is not sufficient to estimate the optimal number required for IR discovery, we note that library sizes with more than 35M mapped reads start to approach the point where the improvements in detecting novel IR events are marginal. In sum, we believe that IntEREst is a reliable tool in the R/Bioconductor environment for detailed intron retention analysis of RNAseq datasets.
Availability and requirements
IntEREst is implemented as an R package freely available at the Bioconductor repository.
Project name: IntEREst
Archived version: 1.2.2
Project home page: https://github.com/gacatag/IntEREst/
Operating system(s): Platform independent
Programming language: R
Other requirements: R v 3.4 or higher
Any restrictions to use by non-academics: No restrictions
This work was supported by Academy of Finland grants (140087, 308657 and 284601 to MJF; 275151 and 292307 to DG) and the Sigrid Jusélius Foundation (to MJF). AO has been supported by the University of Helsinki Viikki Doctoral Programme in Molecular Biosciences.
Availability of data and materials
The myelodysplastic syndrome (MDS) data is available in the NCBI Gene Expression Omnibus database under accession GSE63816, and the Maize data is available under accession GSE57466.
AO developed the software, ran analyses and wrote the manuscript. DG directed the project and assisted with writing the manuscript. MJF directed the project, assisted with the analysis and wrote the manuscript. All authors read and approved the final manuscript.
Ethics approval and consent to participate
Consent for publication
The authors declare that they have no competing interests.
- 19. Verma B, Akinyi MV, Norppa AJ, Frilander MJ. Minor spliceosome and disease. Semin Cell Dev Biol. 2017.
- 22. Sacomoto GA, Kielbassa J, Chikhi R, Uricaru R, Antoniou P, Sagot MF, et al. KISSPLICE: de-novo calling alternative splicing events from RNA-seq data. BMC Bioinform. 2012;13(Suppl 6):S5.
- 30. Morgan M, Obenchain V, Lang M, Thompson R. BiocParallel: Bioconductor facilities for parallel evaluation. 2016. Available from: https://github.com/Bioconductor/BiocParallel
- 32. Terpstra T. The asymptotic normality and consistency of Kendall’s test against trend, when ties are present in one ranking. Proc Kon Ned Akad V Wetensch A. 1952;55:327–33.
- 36. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B Methodol. 1995:289–300.
Open Access: This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.