Next-generation sequencing (NGS) has transformed diagnostic protocols for Mendelian diseases. In the past, finding the cause of a child’s suffering could be a long, frustrating, and often futile battle for the parents of an affected child; the availability of whole-exome sequencing (WES) and whole-genome sequencing (WGS) has made molecular diagnosis possible, at least conceptually, for every patient. Genetic confirmation of a diagnosis can be key for treatment, removes uncertainty, and may be important for future family planning. However, this promise has not been fully kept. For mitochondrial and other diseases, analysis of the coding sequence does not lead to a diagnosis in 50–75% of patients. This figure indicates that in numerous cases the pathogenic variants escaped detection, were detected but erroneously classified as variants of uncertain significance (VUS), or were part of a more complex genetic constellation.

Limitations of DNA sequencing in diagnostics

Only 25–50% of patients receive a firm genetic diagnosis after WES, often because of limitations in the coverage of the target regions, the detection of intronic and regulatory variants, the bioinformatic filtering and prioritization of potentially pathogenic variants, and knowledge about the molecular and clinical consequences of genetic variants. WGS improves the coverage and allows detection of extra-exonic variants and structural variants [12]. When focusing on the coding region, however, WGS currently improves diagnostics of recessive disorders only marginally. When the search space is extended to the full genome, the filter that is currently most effective in the exome, a minor allele frequency threshold of 0.1–1.0%, no longer narrows the candidates sufficiently. In a single WES dataset, the frequency filter already yields on average 100–200 variants (25 bi-allelic) requiring manual interpretation. Outside the exome, the number of such variants is two orders of magnitude higher [1]. Moreover, although our understanding of coding variants is incomplete, our understanding of non-coding sequences is severely restricted. The capabilities of sequencing technology and bioinformatics tools are developing quickly and provide comprehensive genome annotation much faster than we can define the clinically relevant impact of detected variants.
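The frequency-based filtering step can be illustrated with a minimal sketch. It is not the pipeline used in the cited studies: the `Variant` record, its field names, the gene labels, and the 0.1% default threshold are assumptions chosen for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """Hypothetical, simplified variant record (not a real VCF parser)."""
    gene: str
    maf: float       # minor allele frequency in a reference population (0..1)
    genotype: str    # "het" or "hom" (assumed encoding)


def rare_variants(variants, max_maf=0.001):
    """Keep variants whose minor allele frequency is below max_maf."""
    return [v for v in variants if v.maf < max_maf]


def biallelic_genes(variants):
    """Genes compatible with recessive inheritance: one homozygous variant
    or at least two heterozygous variants (phase is not checked here)."""
    het_counts = defaultdict(int)
    genes = set()
    for v in variants:
        if v.genotype == "hom":
            genes.add(v.gene)
        else:
            het_counts[v.gene] += 1
    genes.update(g for g, n in het_counts.items() if n >= 2)
    return sorted(genes)


calls = [
    Variant("GENE_A", 0.0002, "hom"),
    Variant("GENE_B", 0.0005, "het"),
    Variant("GENE_B", 0.0001, "het"),
    Variant("GENE_C", 0.0004, "het"),
    Variant("GENE_D", 0.2, "hom"),   # common variant, removed by the filter
]
rare = rare_variants(calls)
candidates = biallelic_genes(rare)
```

In this toy input, GENE_C carries a single heterozygous rare variant and is therefore not reported as a recessive candidate; this is the kind of gene that transcriptome data may help to re-prioritize.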

Limitations in DNA variant interpretation

A definitive diagnosis is based on the discovery of known pathogenic variant(s) in a patient whose clinical presentation resembles a clinical picture that has been reported multiple times, usually as listed in the disease-variant database ClinVar [9]. However, this is not the common situation, either at the variant level or at the phenotype level. We observe a continuous extension of the phenotypic spectrum associated with variants in the same gene or even with the same variant. The increasing overlap of clinical presentations of genetically different disorders further weakens the discriminating power of established genotype–phenotype associations. Even the interpretation of possible protein-truncating variants in genes for which non-truncating/in-frame pathogenic loss-of-function variants are known can be challenging. If transcripts affected by such variants escape nonsense-mediated mRNA decay (NMD), they may still produce a functional protein by the mechanism of translation re-initiation [15], by functional alternative transcript isoforms [13], or by preserved/residual function of the truncated protein [7]. A recent systematic study shows that exons present only in tissue-specific isoforms may not be essential for protein function [5]. In many cases the candidate variant is even more difficult to interpret, as for rare missense, (near) splice site, intronic, and synonymous variants. Therefore, data describing the functional consequences at the molecular level are required to advance diagnostics.

The value of RNA sequencing in diagnostics

Transcriptomics by RNA sequencing (RNA-seq) takes advantage of new sequencing protocols and allows direct insights into the transcriptome of cell lines or tissues, reflecting a snapshot of a specific time point [11]. With a focus on protein-coding genes, the procedures usually include an enrichment step for full-length poly(A) transcripts followed by cDNA synthesis and sequencing; however, many other protocols exist to analyze total RNA, circular RNA, or microRNA, to name a few. RNA-seq of full-length mRNA can detect and quantify known, pre-defined RNA species as well as rare and novel transcript variants and isoforms [3]. Hence, it uncovers the transcriptional consequences of genetic variants, whether they were previously prioritized or missed by the filters applied in the bioinformatics pipeline. RNA-seq provides a single assay to validate and quantify the impact of potential regulatory or splice defects for all genes expressed in a biological sample. Moreover, RNA-seq has been validated as a tool to indicate novel Mendelian disease genes through the identification of pathogenic variants in the respective genes [8]. With a diagnostic yield between 10 and 35%, two recent studies convincingly demonstrated the power of combined DNA and RNA sequencing [4, 8]. Whereas Kremer et al. performed RNA-seq on fibroblast cell lines from patients with suspected mitochondrial disorders, Cummings et al. studied muscle biopsy samples from patients with muscular disorders. In both cases, the tissue was carefully selected: more than 90% of the known mitochondrial disease genes were reliably detected in fibroblast cell lines, and likewise muscular disease genes in muscle biopsies. However, this does not hold for the whole spectrum of tissues; for example, in blood, the tissue that is usually available, only about two thirds of the known disease genes are expressed.

RNA-seq data can be analyzed using gene-specific questions to refine transcript isoform annotation and to verify the consequence of a suspected variant on a specific transcript, thereby replacing quantitative RT-PCR and cDNA sequencing with a comprehensive assay that includes a number of controls. In cases where only the index case is available, it enables haplotype phasing of two variants in different exons that are represented on continuous RNA reads. Examples in which transcriptome analysis provided complementary information to DNA sequencing include three cases with non-pathogenic protein-truncating variants (Fig. 1). In a case with a homozygous frameshift mutation in exon 2 of ATP7B, we detected mRNA expression comparable with healthy controls, suggesting that NMD could be bypassed by the mechanism of translation re-initiation. This was confirmed by Western blot and functional tests of copper export capacity [15]. In another case, we identified bi-allelic frameshift variants in FLAD1, which encodes FAD synthase (FADS). Because FADS is essential for the cellular supply of FAD cofactors, the finding of bi-allelic frameshift variants was unexpected. RNA-seq analysis discovered a novel FLAD1 isoform missing the affected exon, explaining why patients with bi-allelic FLAD1 frameshift variants still retain substantial FADS activity [13]. In a patient with a mitochondrial disorder, we found homozygous protein-truncating variants in LYRM7 and MTO1, two genes encoding essential mitochondrial proteins. Transcriptome and proteome studies confirmed normal expression of the truncated MTO1, and we did not find any indication of impaired MTO1 activity [7].
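The idea of phasing two heterozygous variants from reads that cover both positions can be sketched as follows. This is a simplified illustration under stated assumptions: the input format (one tuple of observed alleles per read or read pair) and the function name `phase_from_reads` are hypothetical, and sequencing errors and allele-specific decay are ignored.

```python
from collections import Counter


def phase_from_reads(read_alleles):
    """Classify two heterozygous variants as cis or trans.

    read_alleles: iterable of (allele_at_variant_1, allele_at_variant_2)
    tuples, one per read (or read pair) covering both positions; each
    allele is "ref" or "alt". Returns (label, cis_reads, trans_reads).
    """
    counts = Counter(read_alleles)
    # Reads carrying both or neither variant support the same haplotype (cis).
    cis = counts[("alt", "alt")] + counts[("ref", "ref")]
    # Reads carrying exactly one of the two variants support opposite haplotypes.
    trans = counts[("alt", "ref")] + counts[("ref", "alt")]
    if cis == trans:
        return "undetermined", cis, trans
    return ("cis" if cis > trans else "trans"), cis, trans


# Toy data: 13 reads support opposite haplotypes, 1 read (e.g. an error) does not.
reads = [("alt", "ref")] * 7 + [("ref", "alt")] * 6 + [("alt", "alt")]
result = phase_from_reads(reads)
```

A "trans" result is compatible with compound heterozygosity. A real analysis would need a minimum read count and an error model before drawing that conclusion.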

Fig. 1

Selected examples where transcriptome analysis provided complementary information to DNA sequencing. UTR untranslated region, NMD nonsense-mediated decay

In addition to the validation of the impact of an identified VUS on the corresponding transcript, RNA-seq data can also be analyzed transcriptome-wide to detect aberrant gene expression. In such systematic analysis of RNA-seq data, searching for extremes (as detailed below) allows candidate disease-causing genes for rare disorders to be identified and prioritized. To focus on rare and recessive diseases, we applied stringent filtering for rare events with strong effect sizes, as described below.

Mono-allelic expression (MAE) describes the situation in which one allele is silenced, so that only the other allele is expressed. When assuming a recessive mode of inheritance, genes with a single heterozygous rare coding variant identified by WES or WGS analysis are not prioritized [6]. However, MAE of such a variant fits the assumption of a recessive mode of inheritance. Detection of mono-allelic expression can thus help to re-prioritize heterozygous rare variants. Our setting is based on the use of fibroblast cell lines, where about 7500 heterozygous SNPs identified by genotyping are covered by RNA-seq reads at least ten times, allowing detection of mono-allelic expression in which one allele accounts for at least 90% of the reads [8]. Six of the MAE alleles carry rare single-nucleotide variants (SNVs) affecting the protein sequence.
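A minimal sketch of such an MAE call at a single heterozygous SNP is shown below. The coverage cutoff of ten reads and the 90% allele fraction are taken from the text; the two-sided binomial test against balanced expression, the significance level, and the SNP labels are assumptions made for illustration and are not necessarily the statistical model of the cited study.

```python
from math import comb


def binom_sf(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def is_mono_allelic(alt_reads, ref_reads, min_cov=10, min_frac=0.9, alpha=0.05):
    """Flag a heterozygous SNP as mono-allelically expressed.

    Requires at least min_cov reads, a major-allele fraction of at least
    min_frac, and a significant deviation from 50:50 expression.
    """
    total = alt_reads + ref_reads
    if total < min_cov:
        return False                      # not enough coverage to decide
    major = max(alt_reads, ref_reads)
    if major / total < min_frac:
        return False                      # both alleles clearly expressed
    p_value = min(1.0, 2 * binom_sf(major, total))   # two-sided test
    return p_value < alpha


# Hypothetical (alt, ref) read counts at three heterozygous SNPs.
snps = {"rs_example_1": (19, 1), "rs_example_2": (12, 8), "rs_example_3": (5, 0)}
mae_calls = {snp: is_mono_allelic(alt, ref) for snp, (alt, ref) in snps.items()}
```

With thousands of SNPs tested per sample, a multiple-testing correction would be needed in practice; it is omitted here for brevity.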

Aberrant expression, identified as a gene expression outlier, occurs when the expression of a gene lies outside its physiological range. It usually implies impaired expression of both alleles, with expression levels decreased to less than 50% of those of controls. It can result from RNA degradation through nonsense-mediated decay (NMD) caused by either apparently protein-truncating variants or splice defects, but it can also result from non-coding variants in regulatory regions such as promoters, enhancers, and suppressors, from variants in the untranslated regions of the transcripts, or from combinations thereof. The genome-wide analysis reveals a median of only one aberrantly expressed gene per sample (Fig. 2; [8]).
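The notion of an expression outlier can be illustrated with a simple z score computed for one gene across samples. This is a deliberately simplified stand-in and not the statistical model of the cited work: the log2 transformation with a pseudocount, the cutoff of |z| > 3, the sample names, and the read counts are all assumptions for the example.

```python
from math import log2
from statistics import mean, stdev


def expression_z_scores(counts_by_sample):
    """z score of log2(count + 1) for one gene across all samples."""
    logged = {s: log2(c + 1) for s, c in counts_by_sample.items()}
    mu = mean(logged.values())
    sigma = stdev(logged.values())        # sample standard deviation
    return {s: (x - mu) / sigma for s, x in logged.items()}


# Toy normalized read counts: 19 controls near 1000 reads, one patient at 100.
counts = {f"control_{i}": 1000 + 10 * i for i in range(19)}
counts["patient"] = 100

z = expression_z_scores(counts)
outliers = [s for s, score in z.items() if abs(score) > 3]
```

Because the tested sample contributes to the mean and standard deviation, a z score cannot become large in small cohorts, so a fixed cutoff such as 3 is only meaningful when enough samples are available.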

Fig. 2

Complex I subunit NDUFA10 expression outlier in a patient with complex I deficiency. a Volcano plot visualization shows five expression outliers (red dots), with NDUFA10 showing the lowest z score and p value. b Normalized read counts over 160 samples indicate this sample to be the only expression level outlier for NDUFA10. c Integrative Genomics Viewer (IGV) representation of a homozygous 20 base pair deletion in the 5′UTR of NDUFA10

Aberrant splicing has been recognized as a major cause of Mendelian disorders for a long time [14]. A systematic study of SNVs in ClinVar predicted that 20 to 30% of VUS and pathogenic variants cause aberrant splicing patterns [10]. However, the prediction of splicing defects from genetic sequences is difficult, because splicing involves a complex set of cis-regulatory elements that are not yet fully understood. Some of them are located deep within introns and are thus not covered by WES. Hence, direct probing of splice isoforms by RNA-seq is important and has led to the discovery of multiple splicing defects in single-gene studies. To detect aberrant splicing events, we adapted an algorithm for splicing quantitative trait loci to the context of rare disorders. This pipeline is based on an annotation-free algorithm that is also able to detect novel splice sites. A median of five aberrantly spliced genes is detected per sample [8]. Aberrant splicing is not only caused by variation affecting known splice sites or splice motifs; it can also be the consequence of variants creating novel splice sites or splice motifs within coding or deep-intronic regions (Fig. 3). Splicing abnormalities include exon creation, skipping, extension, and truncation, as well as intron inclusion, or a combination thereof. They often lead to premature in-frame stop codons, which provoke degradation of the RNA by NMD and may therefore frequently be detected as aberrant expression. The RNA-seq data allow the characterization of all novel transcript isoforms. Quantification of the reads connecting the reference and aberrantly spliced exons may provide a direct readout of the DNA variant’s consequences (Fig. 4).
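The quantification of junction-spanning reads can be sketched as the fraction of split reads that support each competing junction at a shared splice site. The function name and junction labels are hypothetical. The counts follow the Fig. 3 example (14 of 18 reads supporting the novel acceptor), under the assumption that the remaining 4 reads support the reference acceptor.

```python
def junction_usage(junction_reads):
    """Fraction of split reads supporting each junction at a shared splice site.

    junction_reads: mapping of junction label -> number of split reads.
    """
    total = sum(junction_reads.values())
    if total == 0:
        raise ValueError("no reads cover this splice site")
    return {name: n / total for name, n in junction_reads.items()}


# Counts modeled on the Fig. 3 example; the 4 reference reads are an assumption.
usage = junction_usage({"novel_acceptor": 14, "reference_acceptor": 4})
novel_fraction = round(usage["novel_acceptor"], 2)
```

Comparing this fraction between a patient and controls (about 10% of reads for the minor isoform in controls in the Fig. 3 example) gives a direct, quantitative readout of the shift in isoform usage.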

Fig. 3

Example of a synonymous homozygous variant in C10orf2 creating a novel splice acceptor site. a Sashimi plot visualization of a detected exon truncation. The main transcript has a deletion of 62 base pairs of exon 2. b IGV visualization of DNA and RNA sequencing data. The predicted synonymous variant enhances splicing 4 base pairs downstream and changes the minor isoform (10% of reads in controls) to the main isoform (14 out of 18 reads)

Fig. 4

Example of a complex pattern of aberrant splicing due to a homozygous near splice variant in MRPL44. As a consequence of the near splice variant, three new transcript isoforms are produced, including exon truncation (isoform 2), intron retention (isoform 3), and exon elongation (isoform 4), in addition to the reference transcript isoform 1 (18% of all transcripts)

The small number of fewer than 20 aberrantly expressed genes per sample allows manual inspection and evaluation of the RNA-seq data and improves clinical interpretation in the context of the genetic and clinical data.

The RNA-seq protocols and bioinformatics pipelines presently in use focus on the gene level for expression outliers, on the exon/intron or splice-site level for aberrant splicing, and on SNPs for mono-allelic expression in a specific tissue or cell line. The development of long-read sequencing will also allow more complex situations in large genes with multiple transcript isoforms to be considered. Single-cell RNA-seq protocols will increase the resolution from the average expression level of a tissue to specific cell types and will allow cell-specific regulation and imprinting mechanisms to be studied. However, the methods provide only a snapshot of the cells studied, and the non-detection of aberrant expression or splicing in a surrogate tissue does not allow the conclusion that expression and splicing are normal in the affected tissue, which represents a clear limitation. Currently, several RNA-seq analysis pipelines are available, but further improvement is necessary to optimize sensitivity and specificity. To automate and optimize the correction for confounding technical, environmental, or common genetic variation, we recently developed OUTRIDER, which improved the detection of aberrant expression based on the assessment of statistical significance [2]. Further method development is nevertheless required, especially for the detection of aberrant splicing events and the prediction of causal variants.

Practical conclusion

  • By integrating phenotype and genotype information only, less than 50% of Mendelian disorders are diagnosed

  • This diagnostic gap is in part due to limitations in prioritizing and interpreting identified variants

  • Transcriptomics by RNA sequencing provides complementary functional information to DNA sequencing

  • RNA sequencing delivers quantitative data on RNA expression levels, aberrant splicing, and allele-specific expression

  • The systematic analysis helps to prioritize variants, whether or not they are predicted to change the encoded protein

  • RNA sequencing has been validated as a tool for the discovery of pathogenic variants in novel Mendelian disease genes

  • RNA sequencing has a high potential to be implemented as a routine tool to improve molecular diagnosis.