Indel sensitive and comprehensive variant/mutation detection from RNA sequencing data for precision medicine
RNA-seq is the most commonly used sequencing application. Not only does it measure gene expression, but it is also an excellent medium for detecting important variants such as single nucleotide variants (SNVs), insertions/deletions (indels) and fusion transcripts. However, detecting these variants from RNA-seq is challenging and complex. Here we describe a sensitive and accurate analytical pipeline that detects multiple types of mutations at once for translational precision medicine.
The pipeline incorporates the most sensitive aligners for indels in RNA-seq, best practices for data preprocessing and variant calling, and STAR-Fusion for chimeric transcripts. Variants/mutations are annotated, and key genes can be extracted for further investigation and clinical action. Three datasets were used to evaluate the performance of the pipeline for SNVs, indels and fusion transcripts.
For the well-defined variants of NA12878 from the GIAB project, sensitivities of about 95% and 80% were obtained for SNVs and indels, respectively, in the matching RNA-seq data. Comparison with other variant-specific tools showed good performance of the pipeline. For the lung cancer dataset with 41 known oncogenic mutations, 39 were detected by the pipeline with the STAR aligner and all 41 with the GSNAP aligner. An actionable EML4-ALK fusion was also detected in one of the tumors, which also showed outlier ALK expression. For 9 fusions spiked into RNA-seq libraries at different concentrations, the pipeline detected all of them in the unfiltered results, although some at very low concentrations may be missed when filtering is applied.
The new workflow is an accurate and comprehensive mutation profiler for RNA-seq data. Key or actionable mutations are reliably detected, which makes RNA-seq a practical alternative source for personalized medicine.
Keywords: RNA sequencing, Somatic mutations, Insertion/deletion, Fusion transcript, Gene expression, Targeted therapy, Precision medicine
FFPM: Fusion fragments per million total reads
GIAB: Genome in a Bottle
SNVs: Single nucleotide variants
Somatic mutations are a hallmark of tumors, and inherited mutations cause certain genetic disorders. Characterizing these mutations and exploring their clinical relevance constitute a critical part of personalized medicine. Mutations take multiple forms; common ones include single nucleotide variants (SNVs), short insertions/deletions (indels) and fusion transcripts. SNVs and indels are primarily detected from DNA sequencing, such as whole-genome, exome, targeted or amplicon sequencing. However, RNA-seq is the most popular sequencing application because it contains much richer genomic information. Not only does it measure gene expression, but it can also detect important variants such as SNVs, indels and fusion transcripts, some of which are known actionable mutations for tumor treatment. A good example is the EGFR single-base mutation (L858R) in exon 21 and the in-frame deletions (ranging from 12 to 18 bases) in exon 19, both of which can be targeted by EGFR tyrosine kinase inhibitors such as gefitinib and erlotinib, with clear clinical benefits to patients. Although fusion transcript detection from RNA-seq is common [2, 3, 4], RNA-seq is still rarely used for SNV or indel detection in clinical settings, for several reasons. First, detecting these variants from RNA-seq is much more challenging: RNA transcripts are spliced molecules derived from different parts of the genome, so exon-exon junction-aware alignment is needed, and such alignments cause difficulty for variant calling tools, which are mostly developed for DNA sequencing. Second, because the primary goal of RNA-seq is gene expression profiling, commonly used RNA-seq mapping programs often perform ungapped mapping; reads containing insertions or deletions are then unmappable, and these variants are missed. Third, even for the same alignment, variant calling tools perform differently, particularly for indel detection.
Another concern for RNA-seq-based mutation detection is differential gene expression, which leads to variable coverage between genes and impairs variant detection in genes expressed at low levels. This is highly relevant for mutation discovery. At the same time, data show that key or driver mutations often occur in expressed genes and tend to be conserved and easily detectable in RNA-seq, which makes RNA-seq-based mutation detection a potentially cost-effective alternative if it can profile multiple types of information simultaneously.
Many RNA-seq workflows have been developed [7, 8, 9], but most perform one particular function in research settings. MAP-RSeq is a comprehensive analytical pipeline with gene expression quantification, fusion transcript detection and SNV detection, but it cannot detect indels. PRADA focuses on fusion detection and annotation. A recent tool, Opossum, conducts comprehensive pre-processing of RNA-seq alignments before variant calling by either Platypus or GATK HaplotypeCaller, but only SNVs were evaluated. As a continuation of our previous work on detecting indels from RNA-seq, we have developed an integrated RNA-seq pipeline, “PanMutsRx”, with the goal of reporting common and clinically important mutations (SNVs, indels and fusion transcripts) at once. PanMutsRx implements RNA-seq alignment programs that conduct gapped and junction-aware mapping, performs rigorous pre-processing steps unique to RNA-seq before variant calling, incorporates selected best-performing single-sample variant and paired somatic mutation callers, and optionally reports mutations for a list of genes of interest. Using a sample from the Genome in a Bottle Consortium, for which variants are well defined from multi-platform DNA sequencing, we demonstrate its good performance in SNV and indel detection. We also tested a set of clinical samples with known mutations and fusion transcripts and show that nearly all important mutations were detectable, which makes the pipeline a candidate for clinical applications.
Sequence read alignment: The pipeline includes two aligners, GSNAP and STAR, with STAR as the default (both can also be run at the same time). STAR is an ultrafast junction-aware RNA-seq aligner that uses sequential maximum mappable seed search followed by seed clustering and stitching. A two-step alignment is implemented to increase accuracy: splice junctions detected in the first step are used to guide the second alignment. STAR is not only very fast but also very sensitive for indel detection, as demonstrated in our previous work. GSNAP is another fast junction-aware aligner that tolerates complex genomic events such as variants and indels and was shown to be more sensitive for longer indels when sequence reads are short. Read group information is added and duplicate reads are tagged with the SAMBLASTER tool.
Aligned read preprocessing for SNV/indel detection: Variant calling from RNA-seq is much more complex than from DNA-seq, and the gapped (spliced) alignments are incompatible with existing variant callers. This module prepares the aligned BAM file for the subsequent variant calling step. SplitNCigar, indel realignment and base recalibration are done with the GATK toolkit. In the SplitNCigar step, reads are split into exon segments and sequences that overhang into intronic regions are hard-clipped.
Variant/somatic mutation calling: Our previous work showed that GATK HaplotypeCaller performed best for single-sample SNV and indel detection, and that Strelka was better for paired tumor/normal somatic mutation calling in RNA-seq. They are implemented as the single-sample variant caller and the somatic caller, respectively.
Variant annotation: Functional annotation of identified variants is a key step toward understanding their potential clinical impact. Annovar, a lightweight, simple-to-use and efficient annotation tool, is integrated into the workflow for variant interpretation.
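As an illustration of the splitting idea only, the following Python sketch divides one spliced alignment into exon segments wherever its CIGAR string contains an N (skipped region) operation. It is not the GATK implementation: the function name `split_on_n` and the example alignment are hypothetical, and the clipping of intronic overhangs and the mapping-quality adjustments that GATK performs are not reproduced here.

```python
import re

# One CIGAR operation: a length followed by an operation code (SAM format).
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def split_on_n(pos, cigar):
    """Split one spliced alignment into exon segments at each N operation.

    pos is the 1-based leftmost reference position of the alignment.
    Returns a list of (segment_start, segment_cigar) tuples.
    """
    segments = []
    seg_start = pos   # reference start of the segment being built
    ref_cursor = pos  # current position on the reference
    current = []      # CIGAR operations collected for the current segment
    for length, op in CIGAR_RE.findall(cigar):
        length = int(length)
        if op == "N":
            # Skipped region (intron): close the segment and jump over it.
            if current:
                segments.append((seg_start, "".join(current)))
            ref_cursor += length
            seg_start = ref_cursor
            current = []
        else:
            current.append(f"{length}{op}")
            if op in "MD=X":
                # Only these operations consume reference bases.
                ref_cursor += length
    if current:
        segments.append((seg_start, "".join(current)))
    return segments


# Hypothetical read: 50 aligned bases, a 1000-base intron, then 30M2D20M.
example_segments = split_on_n(100, "50M1000N30M2D20M")
print(example_segments)
```

Each returned segment contains no N operation, so a variant caller designed for DNA-seq can treat it as an ordinary contiguous alignment.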
Fusion Transcripts: Fusion transcripts are characteristic of tumors and highly relevant to targeted therapy [18, 19]. It is important to detect potentially targetable fusions for guided therapy. To provide seamless integration, STAR-Fusion is incorporated for this function.
Gene expression: Gene-level quantification is done with the featureCounts software. Both raw read counts and normalized RPKM values are generated, and the expression data can be used for outlier gene expression detection or differential expression analysis.
Summary report and output structure: The output file structure is illustrated in Fig. 1b. Read alignments are provided as BAM files, indexed for viewing in IGV; single-sample variant and somatic variant calls are in VCF format; fusion transcripts are provided in tab-separated files; and gene expression is reported as raw counts and RPKM values in tab-separated files.
The workflow can execute individual modules separately, and log files are generated for troubleshooting. Additional options are provided to run the workflow in an Open Grid Engine parallel cluster environment; other grid engine types may require changes. Parameters used for all steps are provided in Additional file 1: Table S1.
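For readers unfamiliar with the normalization, RPKM (reads per kilobase of gene length per million reads) can be computed from raw counts as sketched below. This is a minimal illustration, not the pipeline's code: the function name `rpkm`, the gene names, counts and lengths are made up, and the library size is assumed to be the sum of counted reads (some implementations use the total number of mapped reads instead).

```python
def rpkm(counts, lengths_bp):
    """Convert raw gene-level read counts to RPKM.

    counts: dict mapping gene -> raw read count
    lengths_bp: dict mapping gene -> gene length in bases
    Assumption: library size is the sum of all counted reads.
    """
    total_reads = sum(counts.values())
    if total_reads == 0:
        raise ValueError("no reads counted; RPKM is undefined")
    # RPKM = count / (length in kb) / (library size in millions)
    #      = count * 1e9 / (length in bases * library size)
    return {
        gene: count * 1e9 / (lengths_bp[gene] * total_reads)
        for gene, count in counts.items()
    }


# Hypothetical counts and gene lengths.
example_counts = {"geneA": 500, "geneB": 1500}
example_lengths = {"geneA": 1000, "geneB": 3000}
example_rpkm = rpkm(example_counts, example_lengths)
print(example_rpkm)
```

In this example geneB has three times the reads of geneA but is also three times as long, so both genes receive the same RPKM value.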
Test data and pipeline evaluation
To evaluate the performance of PanMutsRx in SNV/Indel and fusion detection, we used 3 datasets.
HapMap NA12878 RNA-seq and DNA-seq dataset
The Genome in a Bottle (GIAB) consortium released a benchmark SNP and indel dataset for sample NA12878 by integrating multiple DNA sequencing datasets, including whole-genome sequencing. For the same sample, RNA-seq was performed through the ENCODE project. We downloaded the raw RNA-seq data (https://www.encodeproject.org/; ENCFF377UIC, with 147 million paired-end reads of 100 bp), analyzed them with our pipeline for SNVs and indels, and compared the calls with the benchmark DNA variants. Because variants can be detected from RNA-seq only in coding regions of expressed genes, the comparison was limited to genomic positions with at least 10X coverage in the RNA-seq data where variants are reported in the GIAB reference dataset. Sensitivity was calculated, separately for SNVs and indels, as the percentage of DNA variants at these positions that were correctly called from RNA-seq. For specificity, we extracted all positions with at least 10X RNA-seq coverage at which the GIAB benchmark set reports no variant in DNA (true negatives, TN). Any variant reported from RNA-seq at these positions was considered a false positive (FP), and specificity was obtained by the formula TN/(TN + FP).
We also ran sample ENCFF377UIC through other public tools and compared their relative performance for SNV and indel detection. Opossum is an RNA-seq pre-processing tool used before variant calling by either Platypus or GATK HaplotypeCaller and shows good performance in SNV detection. In addition to the GATK best practices for RNA-seq variant calling, which PanMutsRx follows, Opossum merges overlapping reads and modifies the base qualities at the ends of these reads before splitting them. Opossum can use TopHat or STAR alignments; we used the latter because the former does not allow indel detection. RVBoost, used along with MAP-RSeq, is an RNA variant prioritization method with demonstrated good performance. It uses several attributes unique to RNA-seq and a boosting method to train a model on reliable variants, and then prioritizes the RNA SNVs based on the trained model.
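The sensitivity and specificity calculation described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions rather than the evaluation code used for the paper: the function name `evaluate_calls`, the data structures and all example positions are hypothetical, and TN is taken here to mean covered benchmark-negative positions at which RNA-seq reports no variant.

```python
def evaluate_calls(coverage, truth, calls, min_depth=10):
    """Compute (sensitivity, specificity) of RNA-seq calls against a benchmark.

    coverage: dict mapping position -> RNA-seq read depth
    truth:    dict mapping position -> benchmark (DNA) variant allele
    calls:    dict mapping position -> variant allele called from RNA-seq
    Only positions with depth >= min_depth are assessed.
    """
    assessable = {pos for pos, depth in coverage.items() if depth >= min_depth}

    # Sensitivity: benchmark variants at covered positions that were
    # called from RNA-seq with the same allele.
    truth_sites = assessable & set(truth)
    tp = sum(1 for pos in truth_sites if calls.get(pos) == truth[pos])
    sensitivity = tp / len(truth_sites) if truth_sites else float("nan")

    # Specificity: covered positions with no benchmark variant; any
    # RNA-seq call there is a false positive.
    negatives = assessable - set(truth)
    fp = sum(1 for pos in negatives if pos in calls)
    tn = len(negatives) - fp
    specificity = tn / (tn + fp) if negatives else float("nan")
    return sensitivity, specificity


# Hypothetical positions keyed by (chromosome, coordinate).
example_coverage = {("chr1", 100): 25, ("chr1", 200): 12, ("chr1", 300): 4,
                    ("chr1", 400): 30, ("chr1", 500): 15}
example_truth = {("chr1", 100): "A>G", ("chr1", 200): "C>T", ("chr1", 300): "G>A"}
example_calls = {("chr1", 100): "A>G", ("chr1", 400): "T>C"}
print(evaluate_calls(example_coverage, example_truth, example_calls))
```

In the toy data, the benchmark variant at coordinate 300 is excluded because its depth is below 10X, which mirrors the restriction to sufficiently expressed positions.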
Lung cancer adenocarcinoma RNA-seq datasets with known oncogenic or targetable mutations
Lung cancer is one of the tumor types harboring a high number of mutations, and some of these mutations are sensitive to targeted therapy, such as the EGFR single nucleotide mutation in exon 21 (L858R) and intermediate-size indels (12 to 18 bases) in exon 19, which are targeted by tyrosine kinase inhibitors, and the EML4-ALK fusion, which is targeted by the kinase inhibitor crizotinib. The diversity of cancer mutations and the high yield of targeted therapy provide an excellent use case to demonstrate the usability of the PanMutsRx pipeline. To this end, we downloaded a lung adenocarcinoma dataset from SRA (ERP001058) consisting of RNA-seq data for 77 tumor and normal pairs, in which all of the aforementioned mutations are known to be present. The libraries were sequenced as paired-end reads of 101 cycles and analyzed in paired mode for somatic mutations (comparing each tumor with the matched normal sample from the same patient). The detected somatic mutations were compared with the known mutations.
Synthetic spike-in cancer gene fusions of mRNA-seq data
This publicly available dataset was created for the community to evaluate fusion detection algorithms; 9 well-known oncogenic fusion transcripts were spiked into RNA-seq libraries over a wide range of molarities. We downloaded 6 samples, at concentrations of −3.47, −4.17, −5.87, −6.17, −6.87 and −8.57, and evaluated them with our pipeline. To reduce false positives, we applied filters requiring that the combined normalized split and spanning fragment counts exceed 0.1 FFPM (fusion fragments per million total reads; J_FFPM + S_FFPM > 0.1) and that the split reads be supported by at least 25 bases on both sides of the putative breakpoint (“LargeAnchorSupport” == “YES_LDAS”).
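The two filtering criteria can be expressed as a short Python sketch over a tab-separated fusion table. The column names `#FusionName`, `J_FFPM`, `S_FFPM` and `LargeAnchorSupport` are assumed to match the STAR-Fusion version used and may differ in other releases; the function name `filter_fusions` and the fusion names and values in the example table are placeholders, not results from this study.

```python
import csv
import io


def filter_fusions(tsv_text, min_ffpm=0.1):
    """Return names of fusions passing the expression and anchor filters.

    Keeps a fusion when J_FFPM + S_FFPM > min_ffpm and the junction reads
    have long anchors on both sides (LargeAnchorSupport == "YES_LDAS").
    Column names are assumptions about the fusion caller's output format.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    kept = []
    for row in reader:
        combined_ffpm = float(row["J_FFPM"]) + float(row["S_FFPM"])
        has_anchor = row["LargeAnchorSupport"] == "YES_LDAS"
        if combined_ffpm > min_ffpm and has_anchor:
            kept.append(row["#FusionName"])
    return kept


# Placeholder table with made-up fusion names and values.
EXAMPLE_TSV = "\n".join([
    "#FusionName\tJ_FFPM\tS_FFPM\tLargeAnchorSupport",
    "GENEA--GENEB\t0.30\t0.10\tYES_LDAS",  # passes both filters
    "GENEC--GENED\t0.04\t0.03\tYES_LDAS",  # combined FFPM too low
    "GENEE--GENEF\t0.50\t0.20\tNO_LDAS",   # lacks large anchor support
])
print(filter_fusions(EXAMPLE_TSV))
```

Lowering `min_ffpm` would retain more low-abundance fusions at the cost of more false positives, which is the trade-off noted for the lowest spike-in concentrations.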
Comparison of SNVs and indels detected from RNA-seq with gold-standard DNA-seq of HapMap NA12878
Performance comparison with other RNA-seq variant calling tools
Lung cancer adenocarcinoma RNA-seq dataset with known targetable/oncogenic mutations
Key known mutations of oncogenic genes detected by PanMutsRx
EGFR micro deletion
Gene expression quantification
RNA-seq is one of the most commonly used sequencing applications because it measures the dynamics of genome transcription. Beyond research, it also holds great promise for clinical diagnostics, prognostics and therapeutics in various diseases, particularly cancers. To put this into practice, various bioinformatics challenges need to be overcome, and the types of information from RNA-seq that can be reliably used in clinical applications need to be established. Differential expression, alternative splicing and allele-specific expression can be obtained only from RNA-seq, and RNA-seq is also an excellent platform for fusion transcript detection. The challenge lies in detecting single nucleotide variants and small indels. Our previous evaluation showed that although SNVs can be reliably detected, indels are ignored by common RNA-seq tools, which calls for a more sensitive pipeline. PanMutsRx was developed to meet this specific and critical need.
PanMutsRx was designed for ease of use and simultaneous detection of multiple types of mutations. Our assessment showed high sensitivity and specificity for SNVs and small indels. Fusion transcripts can be easily detected, and gene expression can be used alongside for cross-validation of fusion transcripts or for other applications. In oncology practice, only a very limited number of mutations have available drugs, and capturing these mutations is of paramount priority. Our previous and current work suggests that although many unique mutations can be detected from either DNA-seq (such as exome-seq) or RNA-seq, the important and actionable mutations are often conserved in RNA-seq. This suggests that useful and relevant information can be extracted from RNA-seq alone, reducing the complexity of multi-genomic data. We provide a post-processing script to extract SNVs, indels, fusion transcripts or expression values for a user-provided list of genes.
Available RNA-seq workflows mostly focus on one particular function, for example gene expression, SNV detection or fusion transcript detection, which has the advantage of easy management. However, running each analysis separately requires redundant work and significant effort. PanMutsRx aims to perform all clinically relevant tasks at once by selecting high-performing tools for each application. The choice of RNA-seq aligner makes much less difference for SNVs than for indels, and our selection of STAR and GSNAP for PanMutsRx was based on our comprehensive comparison of several tools. Our current data further validated their good performance. With STAR alignment, PanMutsRx pre-processing generated results very similar to those of Opossum pre-processing. GATK HaplotypeCaller was more sensitive than Platypus for both SNVs and indels under default settings; parameter optimization may be needed to achieve better results. The slight gain from Opossum on some occasions may justify its adoption. Because PanMutsRx is highly modular, a better tool can be integrated easily.
Calls can be missed in RNA-seq for several reasons. We found that the majority of missed variants had insufficient alternative allele support; they could be called but were filtered out. These positions can be recovered by reducing the filtering stringency, at the cost of more false positives. Although indel detection performs reasonably well, there is room for further improvement.
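The gene-list extraction idea can be illustrated with a few lines of Python. This sketch is not the post-processing script distributed with PanMutsRx: the record layout (a `type` field and a `genes` list), the function name `extract_for_genes` and the example records are hypothetical and serve only to show how SNVs, indels and fusions can be filtered against one user-supplied gene list.

```python
def extract_for_genes(records, genes_of_interest):
    """Keep records that involve at least one gene from the user's list.

    records: iterable of dicts with a "genes" list (a fusion has two genes)
    genes_of_interest: iterable of gene symbols; matching is case-insensitive
    """
    wanted = {gene.upper() for gene in genes_of_interest}
    selected = []
    for record in records:
        record_genes = {gene.upper() for gene in record["genes"]}
        if wanted & record_genes:
            selected.append(record)
    return selected


# Hypothetical annotated results from the variant and fusion modules.
example_records = [
    {"type": "SNV", "genes": ["EGFR"], "detail": "L858R"},
    {"type": "indel", "genes": ["GENE_X"], "detail": "hypothetical deletion"},
    {"type": "fusion", "genes": ["EML4", "ALK"], "detail": "EML4--ALK"},
]
example_hits = extract_for_genes(example_records, ["egfr", "ALK"])
print([hit["detail"] for hit in example_hits])
```

Treating a fusion as a record with two genes lets a single gene list select point mutations and fusion partners in one pass.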
We have developed a sensitive and comprehensive RNA-seq analytical pipeline which can capture multiple mutations simultaneously (single nucleotide, small insertion/deletion, chimeric transcripts or abnormal gene expression) and can be potentially used in clinical practice and precision medicine.
We thank Genome in a Bottle (GIAB) consortium, ENCODE (https://www.encodeproject.org/) and GEO (https://www.ncbi.nlm.nih.gov/geo/) for making data publicly available for the research community and the reviewers for their constructive comments.
This work was supported by the Mayo Clinic Center for Individualized Medicine, which also provided the cost of publication.
Availability of data and materials
Project name: Indel sensitive and comprehensive variant/mutation detection from RNA sequencing data for precision medicine (PanMutsRx)
Project home page: https://github.com/m081429/PanMutsRx
Operating system(s): Linux
Programming language: Python, Perl and Shell
Prerequisites: Java 1.8 or greater, Python 3.4.3 or greater, Perl 5.16.2 and qsub (if running parallel processing on a cluster)
License: GNU GPLv3
Any restrictions to use by non-academics: license needed
About this supplement
This article has been published as part of BMC Medical Genomics Volume 11 Supplement 3, 2018: Selected articles from the 7th Translational Bioinformatics Conference (TBC 2017): medical genomics. The full contents of the supplement are available online at https://bmcmedgenomics.biomedcentral.com/articles/supplements/volume-11-supplement-3.
NP implemented the workflow, conducted data analysis and drafted the manuscript. AB performed the data analysis and participated in manuscript drafting. JPK participated in the design and supervision of the study. ZS conceived of the study, performed data analysis and coordination, and revised the manuscript. All authors read and approved the final manuscript.
Ethics approval and consent to participate
Not applicable as all data used in this work were publicly available as described.
Consent for publication
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
- 3. Van Allen EM, Robinson D, Morrissey C, Pritchard C, Imamovic A, Carter S, Rosenberg M, McKenna A, Wu YM, Cao X, et al. A comparative assessment of clinical whole exome and transcriptome profiling across sequencing centers: implications for precision cancer medicine. Oncotarget. 2016;7(33):52888–99.
- 5. Sun Z, Bhagwate A, Prodduturi N, Yang P, Kocher JA. Indel detection from RNA-seq data: tool evaluation and strategies for accurate detection of actionable mutations. Brief Bioinform. 2016.
- 6. Sun Z, Wang L, Eckloff BW, Deng B, Wang Y, Wampfler JA, Jang J, Wieben ED, Jen J, You M, et al. Conserved recurrent gene mutations correlate with pathway deregulation and clinical outcomes of lung adenocarcinoma in never-smokers. BMC Med Genet. 2014;7:32.
- 22. The GATK Best Practices for variant calling on RNAseq, in full detail [http://gatkforums.broadinstitute.org/discussion/3892/the-gatk-best-practices-for-variant-calling-on-rnaseq-in-full-detail] [Accessed date: 11/02/2016].
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.