Splicing heterogeneity: separating signal from noise
Single-cell analyses have revealed a tremendous variety among cells in the abundance and chemical composition of RNA. Much of this heterogeneity is due to alternative splicing by the spliceosome. Little is known about how many of the resulting isoforms are biologically functional or just provide noise with little to no impact. The dynamic nature of the spliceosome provides numerous opportunities for regulation but is also the source of stochastic fluctuations. We discuss possible origins of splicing stochasticity, the experimental approaches for studying heterogeneity in isoforms, and the potential biological significance of noisy splicing in development and disease.
EST: Expressed sequence tag
hnRNP: Heterogeneous nuclear ribonucleoprotein
inoFISH: Inosine fluorescence in situ hybridization
smFISH: Single-molecule fluorescent in situ hybridization
snRNA: Small nuclear RNA
SNV: Single nucleotide variant
In recent years, there has been substantial progress in the development of methodologies to interrogate gene expression in single cells. Single-cell imaging has historically been the workhorse technology for such studies, but applications such as single-cell sequencing have rapidly advanced, with recent publications drawing conclusions from tens of thousands of individual cells [1, 2, 3, 4]. The picture that emerges from these studies is that gene expression varies from cell to cell. These differences can be both genetic and non-genetic, and they can be stable or dynamic. Differences can arise from programmed specialization during development or through random processes that occur in the cell. Even at the mRNA level, abundance, sequence, and chemical modifications can vary among transcripts that are produced from the same sequence of DNA. Making sense of this variation has become an immense experimental and theoretical challenge.
The process of RNA synthesis leads to variation in mRNA abundance, which has been studied extensively. However, RNA processing, specifically pre-mRNA splicing, has the potential to be an equally important source of variability in gene expression. Since the first discovery of splicing 40 years ago [6, 7, 8], accumulating knowledge about the spliceosome’s assembly and enzymatic mechanism, about the process of splice site selection, and about the coupling with transcription has depicted a complex, multi-step, dynamic process involving a massive molecular machine. Each of these steps in splicing is subject to regulation, leading to the amazing diversity of alternatively spliced transcripts in virtually every organism in which RNA splicing is present. Each of these steps is, however, also subject to random fluctuations. As in all reactions that occur at the molecular level and rely on small numbers of molecules, stochastic (i.e., random) effects are the rule rather than the exception. This phenomenon was evident in the earliest observations of alternative splicing using chromatin spreads of the Drosophila chorion gene. In the same transcription unit, two alternative splicing isoforms were observed at the single-molecule level. Since then, the proportion of transcripts that show alternative splicing has asymptotically approached 100%: from ‘a list of genes’ in the 1980s, to ‘74% of all genes’ inferred from expressed sequence tag (EST)–genomic alignments and microarrays [11, 12], to ‘98–100% of multi-exonic genes’ in the next-generation sequencing era [13, 14, 15, 16]. Single-cell sequencing has now revealed that splicing variability exists among tissues and between individuals [17, 18, 19, 20].
Which transcripts are functional? How do we detect meaningful changes not only in alternative splicing but also in RNA editing or alternative poly-adenylation? And what experimental and conceptual advances will be needed for the next stage of research? In this special issue, new techniques and datasets are presented that are at the forefront of RNA biology. Here, we focus on the current understanding of variability in RNA processing, mostly on splicing. We hope to frame the following questions. 1) Where does splicing stochasticity come from? 2) How do we measure splicing variability? 3) What is the biological significance of splicing heterogeneity?
Noise in splicing: where does it come from?
For each nascent RNA molecule generated from transcription, the spliceosome needs to first recognize the correct splice sites, then assemble to complete intron removal and exon ligation, and then disassemble. Intron and exon definition is the key step in the initiation of spliceosome assembly. The 5′ splice site consensus AG|GURAGU is present at the exon–intron junction. The 15-nucleotide 3′ splice site Y10NCAG|G is present at the intron–exon junction. At a variable distance upstream of the 3′ splice site (10–50 nucleotides (nt) for human transcripts) is the branch point consensus YNYURAY [44, 45, 46, 47]. The dinucleotide pair GU-AG is present in over 98% of all intron sequences that are removed by the spliceosome, but variations are found in neighboring bases [48, 49] (Fig. 1c).
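The consensus motifs described above can be turned into a simple sequence scan. The following is a minimal sketch, not a splice-site predictor: the helper names (`consensus_to_pattern`, `find_sites`) and the toy pre-mRNA are hypothetical, Y10 is read as ten consecutive pyrimidines, and a real analysis would use position weight matrices rather than exact consensus matches.

```python
import re

# IUPAC codes used in the consensus motifs (R = purine, Y = pyrimidine, N = any base).
IUPAC = {"A": "A", "C": "C", "G": "G", "U": "U",
         "R": "[AG]", "Y": "[CU]", "N": "[ACGU]"}

FIVE_SS = "AGGURAGU"          # AG|GURAGU, exon-intron junction
BRANCH_POINT = "YNYURAY"      # branch point consensus
THREE_SS = "Y" * 10 + "NCAGG" # Y10NCAG|G, intron-exon junction

def consensus_to_pattern(consensus):
    """Translate an IUPAC consensus string into a regular expression."""
    return "".join(IUPAC[base] for base in consensus)

def find_sites(rna, consensus):
    """Return the start index of every (possibly overlapping) consensus match."""
    lookahead = "(?=" + consensus_to_pattern(consensus) + ")"
    return [m.start() for m in re.finditer(lookahead, rna)]

# Hypothetical toy pre-mRNA: exon end, short intron, start of the next exon.
toy_pre_mrna = ("CCAG" "GUAAGU" "AUAUAG" "CACUAAC" "AAA"
                "UUUCUUUCUU" "ACAG" "GCC")

five_ss_hits = find_sites(toy_pre_mrna, FIVE_SS)
branch_hits = find_sites(toy_pre_mrna, BRANCH_POINT)
three_ss_hits = find_sites(toy_pre_mrna, THREE_SS)
print(five_ss_hits, branch_hits, three_ss_hits)
```

In a real genome, such a scan would return many matches that are never used as splice sites, which is one reason why exact consensus matching alone is of limited value.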
Randomness is generated by several aspects of this splice-site-recognition step. First, the sequence information from the nascent RNA transcripts is ambiguous and highly degenerate, especially in mammals. The intron or exon definition step requires spliceosomes to read the information from more than ~30 bases accurately. This recognition mostly relies on the base-pairing between U1 and U2 small nuclear RNAs (snRNAs) and the nascent RNA, but RNA modifications and bulged nucleotides make this base-pairing highly flexible [49, 51]. Sequence alone is not sufficient to allow the accurate identification of splicing boundaries, even for short introns (≤ 134 bp) in human transcripts. Moreover, many sequences in the mammalian genome match the consensus but are not recognized as real splice sites, and the mechanism behind this discrimination is poorly understood. Second, mutations and single nucleotide variants (SNVs) in the template sequence generate moving targets for the spliceosome. Millions of genetic variants in the human genome have been uncovered through the 1000 Genomes project. Multiple methods, such as machine learning, splicing quantitative trait loci (QTL), and integrative genome-wide association studies (iGWAS), have revealed that SNVs are associated with alternative splicing. These SNVs could change the splice sites directly or could alter a splicing regulatory sequence. Furthermore, the long introns in human transcripts also provide ample mutational opportunities for the creation of new or weak splice sites and for the generation of new exons (exonization) [20, 57]. Third, this ‘reading and recognition’ process is coordinated by splicing enhancer and silencer sequences through recruitment of SR proteins and heterogeneous nuclear ribonucleoproteins (hnRNPs). Binding motifs for SR proteins and hnRNPs can be found in the majority of exons and introns [59, 60]. The roles of RNA-binding proteins (RBPs) can be synergistic or competitive.
The output of a splicing event will be affected by the motif sequence of the pre-mRNA and the array of RBP concentrations in the cell.
The complexity in the template pre-mRNA is a primary source of stochasticity, even before considering the assembly of the spliceosome itself. The spliceosome consists of hundreds of proteins and multiple snRNAs. Initially, splicing ‘commitment’ was thought to occur once the intron–exon boundary had been defined. Recent studies have revealed, however, that the spliceosome is a highly flexible and reversible enzyme. Spliceosome assembly can be initiated by either a U1- or a U2-first pathway. After assembly initiation, the spliceosome can switch between different catalytic conformations that favor forward or reverse progress. The splicing catalytic process is iso-energetic and driven by numerous ATPases, resulting in two transesterification reactions that are both reversible in the proper ionic environment [63, 64]. Recent single-molecule research on spliceosome assembly has revealed that almost all of the steps in splicing are reversible [65, 66]. In the context of a highly flexible and reversible spliceosome assembly process, the alternative splicing decision may be the result of kinetic competition between different spliceosome assembly pathways.
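The idea of kinetic competition can be illustrated with a deliberately simplified simulation. This sketch assumes two competing assembly pathways that each complete in a single irreversible, exponentially distributed step; the rates `k_a` and `k_b` are hypothetical and the reversibility discussed above is ignored. Under these assumptions, the fraction of transcripts processed through pathway A should approach k_a / (k_a + k_b).

```python
import random

def competition_fraction(k_a, k_b, n_events=100_000, seed=0):
    """Fraction of simulated events in which pathway A finishes before pathway B."""
    rng = random.Random(seed)
    wins_a = 0
    for _ in range(n_events):
        t_a = rng.expovariate(k_a)  # waiting time for pathway A
        t_b = rng.expovariate(k_b)  # waiting time for pathway B
        if t_a < t_b:
            wins_a += 1
    return wins_a / n_events

# Hypothetical rates in arbitrary units (not measured values).
k_a, k_b = 2.0, 1.0
simulated = competition_fraction(k_a, k_b)
expected = k_a / (k_a + k_b)
print(f"simulated fraction via A: {simulated:.3f}, analytic: {expected:.3f}")
```

Even with fixed rates, each individual transcript follows one pathway or the other at random, so isoform choice is probabilistic at the single-molecule level.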
One explanation for the lack of consensus across the various methods used might be the difficulties in detecting the whole dynamic range of splicing. For example, if splicing time is exponentially distributed over a broad range (Fig. 2a, b), the measured time will depend on the time resolution of the method. Imaging or pulse-chase methods might overestimate the duration of very short events or might underestimate the duration of extremely long events. Likewise, for steady-state biochemical methods, the inferred dynamic parameters rely on the assumption that all intermediates are identified and analyzed, whether they are on chromatin or in the nucleoplasm.
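The dependence of the measured time on the time resolution can be sketched numerically. The example below assumes exponentially distributed splicing times with a hypothetical mean of 10 (arbitrary time units) and models a measurement as clipping: events shorter than the time resolution are recorded at the resolution, and events longer than the observation length are recorded at that length. Both the clipping model and all parameter values are illustrative assumptions.

```python
import random

def measured_mean(times, resolution=0.0, max_duration=float("inf")):
    """Mean duration after clipping each event to [resolution, max_duration]."""
    clipped = [min(max(t, resolution), max_duration) for t in times]
    return sum(clipped) / len(clipped)

rng = random.Random(0)
true_mean_time = 10.0  # hypothetical mean splicing time
times = [rng.expovariate(1.0 / true_mean_time) for _ in range(50_000)]

unbiased = measured_mean(times)
coarse_resolution = measured_mean(times, resolution=5.0)      # short events overestimated
short_observation = measured_mean(times, max_duration=15.0)   # long events underestimated

print(f"no clipping:        {unbiased:.2f}")
print(f"coarse resolution:  {coarse_resolution:.2f}")
print(f"short observation:  {short_observation:.2f}")
```

Under these assumptions, the same underlying distribution yields different apparent splicing times depending on the window that a method can access.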
Taken together, the stochasticity of splicing could result in variability in both splice site selection and splicing kinetics. How are splicing kinetics associated with splice site selection? Does alternative splicing exhibit different kinetics? Evidence is emerging that, at least for certain genes, alternative splicing occurs mostly post-transcriptionally, whereas constitutive exons are spliced co-transcriptionally. In addition, changing the nucleotides next to GU at the 5′ splice site can alter both the kinetics of spliceosome remodeling and splicing efficiency. How the spliceosome makes a choice amongst splice sites during the kinetic competition between splicing and transcription [70, 71, 74, 76] is still an unanswered question that requires further investigation.
Can we measure the extent of stochastic RNA processing experimentally?
The initial concept of ‘splicing noise’ comes from the analysis of EST sequences and microarray-based mRNA abundance measurements. These data suggested a positive correlation between the number of alternative isoforms and the number of splicing reactions (i.e., the number of introns per gene and the level of gene expression). A more precise evaluation was provided by the de novo identification of splice junctions based on RNA-seq data. Such studies revealed the existence of a large class of low-abundance isoforms. Most of these isoforms contain the GU-AG dinucleotides, which indicates that they are generated from a random splice site choice. When thousands of independent RNA-seq datasets were combined, a significant number of previously unannotated splice junctions became evident across different tissues and cell types. Although the current focus when analyzing these data is still on major alternative splicing events, a more comprehensive analysis across all splicing junctions would be beneficial for elucidating the distribution of isoform frequencies. Interestingly, a simple two-parameter Weibull distribution can be used to explain the statistical distribution of the isoforms of all transcribed genes, indicating a possible general model of stochastic splicing.
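To make the two-parameter Weibull description concrete, the sketch below fits a shape and a scale parameter to synthetic data. It is an illustration only: the data are simulated rather than taken from any isoform dataset, the parameter values are hypothetical, and the fitting method (least squares on a Weibull probability plot with median ranks) is one common choice, not necessarily the one used in the cited work. It relies on the Weibull cumulative distribution F(x) = 1 − exp(−(x/scale)^shape), which becomes a straight line when ln(−ln(1 − F)) is plotted against ln x.

```python
import math
import random

def fit_weibull(values):
    """Estimate (shape, scale) by least squares on a Weibull probability plot."""
    xs = sorted(v for v in values if v > 0)
    n = len(xs)
    log_x = [math.log(x) for x in xs]
    # Median-rank estimate of the empirical CDF at each ordered value.
    y = [math.log(-math.log(1.0 - (i - 0.3) / (n + 0.4))) for i in range(1, n + 1)]
    mean_x = sum(log_x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in log_x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_x, y))
    shape = sxy / sxx                      # slope of the linearized CDF
    intercept = mean_y - shape * mean_x    # equals -shape * ln(scale)
    scale = math.exp(-intercept / shape)
    return shape, scale

# Synthetic stand-in for a per-gene isoform statistic (hypothetical parameters).
rng = random.Random(42)
true_scale, true_shape = 2.0, 0.8
samples = [rng.weibullvariate(true_scale, true_shape) for _ in range(5000)]

shape_hat, scale_hat = fit_weibull(samples)
print(f"estimated shape: {shape_hat:.2f}, estimated scale: {scale_hat:.2f}")
```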
Ideally, measurement of the stochasticity in splicing requires capturing each individual event in a population. Single-cell RNA-seq [3, 81, 82, 83, 84, 85, 86, 87] provides a promising avenue, but there are two major challenges. The first comes from the single-molecule capture efficiency. Using a spike-in assisted evaluation, Wold and colleagues were able to provide an estimate of single-molecule capture efficiency of around 0.1, meaning that rare events are not represented in the single-cell sequencing library. The second challenge is to distinguish the biological stochasticity from technical noise, which is an enduring issue in single-cell analysis. Careful evaluation of the technical noise with quantitative statistical methods is necessary. Two recent studies carried out splicing analysis at the single-cell level [89, 90]. One unexpected discovery is that about 20% of the genes exhibit a bimodal distribution of certain splicing isoforms (Fig. 2c, d). These bimodal genes are related to differentiation and cell-type determination. After excluding technical artifacts caused by a low capture rate, there are two possible explanations for the bimodal distribution. First, the distribution may be due to extrinsic noise. For example, heterogeneity in the concentration of splicing regulators in different cells might result in the same pre-mRNA being processed differently. Second, the bimodality might be caused by intrinsic noise. For example, in transcription, slow promoter kinetics will result in a bimodal distribution of gene expression. Similarly, a slow transition parameter in isoform processing could also generate a bimodal distribution of isoforms in a cell population.
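The consequence of a capture efficiency of around 0.1 for rare isoforms can be estimated with a short calculation. The sketch assumes that each molecule in a cell is captured independently with the same probability, which is a simplification; the function name and the example copy numbers are hypothetical.

```python
def detection_probability(copies, capture_efficiency=0.1):
    """Probability that at least one of `copies` molecules is captured,
    assuming independent capture of each molecule."""
    return 1.0 - (1.0 - capture_efficiency) ** copies

# Hypothetical isoform copy numbers per cell.
for copies in (1, 5, 10, 50):
    p = detection_probability(copies)
    print(f"{copies:>3} copies per cell -> detected with probability {p:.2f}")
```

Under this assumption, an isoform present at a single copy per cell is missed in about nine out of ten cells, so its apparent absence in a given cell carries little information.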
Single-molecule long-read sequencing (PacBio RNA-seq, Iso-Seq) [92, 93] is another promising technique for surveying isoform diversity. It can provide confident high-quality reads for transcripts over 20 kb, and over 10% of novel splice junctions have been identified through this strategy. The drawbacks are low throughput (i.e., limited reads per SMRT cell) and the potential for relatively high error rates in long reads.
Single-cell sequencing is comprehensive but suffers from low sensitivity and the potential for the introduction of error during library preparation and analysis. Single-molecule imaging is a complementary method. Single-molecule fluorescent in situ hybridization (smFISH) [94, 95] is a powerful way to quantify the absolute abundance of endogenous RNA transcripts in individual cells. Alternative splicing can be visualized by detecting the unique sequences of the different isoforms. The major advantage of this method compared to single-cell RNA-seq is that it provides both spatial information and sequence-specific information. For example, by probing the introns undergoing alternative splicing in the genes Sxl and nPTB, Vargas et al. showed that alternatively processed introns have delayed kinetics and are more frequently detected in the nucleoplasm. Waks et al. probed the alternatively spliced exons in the genes CAPRIN1 and MKNK2, and examined the cell-to-cell variability by measuring the fraction of isoform abundance. Notably, they found that the distribution of the isoform ratio could be explained by a theoretical stochastic model. Nevertheless, standard smFISH requires the targeting of a single transcript with probes of approximately 48 oligonucleotides, each spanning about 17–22 nt and labeled at their 3′ end with one fluorophore. For the large majority of alternatively spliced isoforms, which only have slight differences in their mRNA sequences, a more sensitive approach such as the recently developed inosine fluorescence in situ hybridization (inoFISH) is necessary.
Both smFISH and inoFISH require killing cells, and neither addresses the dynamic nature of splicing. To explore the stochasticity in splicing, it is necessary to record splicing kinetics in living cells. Taking advantage of the bacteriophage MS2 stem-loop and fluorescence-labeled coat proteins, researchers can now record RNA dynamics at the single-molecule level. Initially, fluorescence recovery after photobleaching (FRAP) together with MS2 stem-loop-labeled genes was used to monitor splicing and transcription kinetics. The improvement in the imaging and analysis of RNAs at the single-molecule level enabled the direct observation of nascent RNAs at the gene locus. The fluctuation of intron and exon signals was recorded, and transcription and splicing kinetics were extracted through the cross-correlation function [70, 72]. With the advance of genome editing [99, 100], it is now possible to label single molecules of RNA produced from endogenous loci, which will allow tracing of the nascent RNA synthesized under physiological conditions. The information provided by live imaging of splicing of endogenous genes will extend our understanding of the stochasticity in splicing kinetics, including the impact of signaling networks and the chromatin environment.
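How a cross-correlation function can reveal a delay between two fluctuating signals is sketched below. This is a toy illustration and not the analysis pipeline of the cited studies: the two traces are synthetic white noise, the downstream trace is simply a delayed, noisy copy of the upstream one, and the delay of 7 frames is a hypothetical value. Real fluorescence traces are autocorrelated and require more careful normalization and averaging.

```python
import random

def cross_correlation(a, b, max_lag):
    """Mean-subtracted cross-correlation of a[t] with b[t + lag] for lag = 0..max_lag."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    result = {}
    for lag in range(max_lag + 1):
        pairs = n - lag
        total = sum((a[t] - mean_a) * (b[t + lag] - mean_b) for t in range(pairs))
        result[lag] = total / pairs
    return result

rng = random.Random(1)
true_delay = 7     # hypothetical delay between the two signals, in frames
n_frames = 2000

# Synthetic upstream signal and a delayed, noisy downstream copy.
upstream = [rng.gauss(0.0, 1.0) for _ in range(n_frames)]
downstream = [
    (upstream[t - true_delay] if t >= true_delay else 0.0) + rng.gauss(0.0, 0.5)
    for t in range(n_frames)
]

correlation = cross_correlation(upstream, downstream, max_lag=20)
best_lag = max(correlation, key=correlation.get)
print(f"lag with the highest cross-correlation: {best_lag} frames")
```

The lag at which the cross-correlation peaks estimates the typical delay between the two signals; in a live-cell experiment, the shape of the whole curve carries additional kinetic information.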
Tremendous progress in the single-cell sequencing and real-time measurement of single-molecule fluorescence has accelerated our understanding of splicing stochasticity. An integrated method that combines the ‘bird’s-eye view’ provided by high-throughput sequencing and the detailed information from time-lapse single-molecule microscopy will facilitate further advancements.
Understanding the physiological role of noise in RNA processing
To understand a potential functional role for variability (stochastic or otherwise) in RNA sequence, a natural starting point is the assessment of the protein products. The proposition of ‘one gene, multiple proteins’ is rooted in the early days soon after the discovery of alternative splicing. Yet, there is debate on the extent to which alternative splicing can change the protein reservoir. Of course, there are numerous examples showing that functionally distinct proteins are generated from alternative splicing isoforms. More recently, using ribosome profiling, it has been shown that more than 75% of medium-to-high abundance alternative cassette exons are occupied by ribosomes. Over 60% of these cassette exons preserve the reading frame, in agreement with the observation that short, frame-preserving cassette exons are more evolutionarily favored. An opposing view is that although thousands of alternative splicing isoforms are identified through RNA-seq, only a small portion of them are identified by large-scale mass spectrometry. In the early days of GENCODE, Tress et al. examined the limited number of reported alternative splicing events. They concluded that many alternatively spliced transcripts, if translated, would drastically change the structure and function of the protein products. Nevertheless, it is hard to predict the protein structure that would result from some isoforms, or whether the sequence would result in an unstable folding status. The follow-up study, based on a large-scale human proteomics database analysis, suggests that most highly expressed genes have one dominant isoform. Nevertheless, owing to the limited sensitivity of mass spectrometry-based proteomics, we still do not know what proportion of alternative splicing isoforms will result in functional proteins.
Did biological systems evolve to suppress splicing noise? Alternatively, has the system evolved to exploit this noise? The most common noise-reducing regulatory mechanism is negative feedback. RNA quality control systems, such as nonsense-mediated decay (NMD), nonstop decay (NSD), and no-go decay (NGD), have evolved to mitigate errors in RNA processing. In addition to negative feedback, kinetic proofreading also plays a role in dampening splicing noise [107, 108]. On the other hand, noisy splicing has been proposed to give rise to population heterogeneity and may be essential in neurogenesis [109, 110], innate immunity, and evolution [112, 113]. Notably, recent work has also demonstrated a global alteration in splicing in cancers that involve mutations in core spliceosomal subunits such as U2AF1 and SF3B1. Intensive sequencing of patient samples suggests that the splicing changes in these patients are minor and highly variable [115, 116, 117]. To date, it has been difficult to attribute either the cancer phenotype or the prognosis to isoform changes affecting a specific set of genes. Cancer is an evolutionary disease and these spliceosomal mutations often occur at an early stage [118, 119, 120]. One possibility might be that the mutations in spliceosomal proteins function as an amplifier of splicing noise, as has been suggested for splicing alterations in other disease states. Low-abundance isoforms that are generated through splicing noise may allow the new variant to be evolutionarily tested and could benefit tumor progression in a heterogeneous way.
Current limitations and outlook
Splicing has been studied intensively, but it is only one of the processes that determine the chemical composition of mRNA. The roles of RNA editing and RNA modifications are now coming into focus as additional potential sources of heterogeneity. Transcriptome profiling techniques are powerful because of the exquisite detail they provide, and imaging allows researchers to follow cells over time. Future efforts to combine these advantages in order to generate longitudinal studies of transcription and splicing are promising but in the early stages. In the meantime, the problem of interpreting the phenotypic consequences of variability remains a considerable challenge.
We would like to thank Drs Huimin Chen and Murali Palangat for critical reading of the manuscript.
This work is supported by the Intramural Research Program of the NIH, National Cancer Institute, Center for Cancer Research.
Both authors wrote the manuscript and approved the final version.
The authors declare that they have no conflict of interests.
- 36. Swinstead EE, Paakinaho V, Presman DM, Hager GL. Pioneer factors and ATP-dependent chromatin remodeling factors interact dynamically: a new perspective: multiple transcription factors can effect chromatin pioneer functions through dynamic interactions with ATP-dependent chromatin remodeling factors. BioEssays. 2016;38:1150–7.
- 70. Coulon A, Ferguson ML, de Turris V, Palangat M, Chow CC, Larson DR. Kinetic competition during the transcription cycle results in stochastic RNA processing. eLife. 2014;3. https://doi.org/10.7554/eLife.03939.
- 79. Tapial J, Ha KCH, Sterne-Weiler T, Gohr A, Braunschweig U, Hermoso-Pulido A, et al. An atlas of alternative splicing profiles and functional associations reveals new regulatory programs and genes that simultaneously express multiple major isoforms. Genome Res. 2017;27:1759–68.
- 122.Single Cell Analysis Challenge. https://commonfund.nih.gov/singlecell/challenge. Accessed 30 Sep 2017.
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.