Systematic evaluation of RNA-Seq preparation protocol performance
RNA-Seq is currently the most widely used tool to analyze whole-transcriptome profiles. There are numerous commercial kits available to facilitate preparing RNA-Seq libraries; however, it is still not clear how some of these kits perform in terms of: 1) ribosomal RNA removal; 2) read coverage or recovery of exonic vs. intronic sequences; 3) identification of differentially expressed genes (DEGs); and 4) detection of long non-coding RNA (lncRNA). In RNA-Seq analysis, understanding the strengths and limitations of commonly used RNA-Seq library preparation protocols is important, as this technology remains costly and time-consuming.
In this study, we present a comprehensive evaluation of four RNA-Seq kits. We used three standard input protocols (the Illumina TruSeq Stranded Total RNA and mRNA kits and a modified NuGEN Ovation v2 kit) and one ultra-low-input protocol (the TaKaRa SMARTer Ultra Low RNA Kit v3). Our evaluation of these kits included quality control measures such as overall reproducibility, 5′ and 3′ end-bias, and the identification of DEGs, lncRNAs, and alternatively spliced transcripts. Overall, we found that the two Illumina kits were most similar in terms of recovering DEGs, and the Illumina, modified NuGEN, and TaKaRa kits allowed identification of a similar set of DEGs. However, we also discovered that the Illumina, NuGEN and TaKaRa kits each enriched for different sets of genes.
At the manufacturers’ recommended input RNA levels, all the RNA-Seq library preparation protocols evaluated were suitable for distinguishing between experimental groups, and the TruSeq Stranded mRNA kit was universally applicable to studies focusing on protein-coding gene profiles. The TruSeq protocols tended to capture genes with higher expression and GC content, whereas the modified NuGEN protocol tended to capture longer genes. The SMARTer Ultra Low RNA Kit may be a good choice at the low RNA input level, although it was inferior to the TruSeq mRNA kit at standard input level in terms of rRNA removal, exonic mapping rates and recovered DEGs. Therefore, the choice of RNA-Seq library preparation kit can profoundly affect data outcomes. Consequently, it is a pivotal parameter to consider when designing an RNA-Seq experiment.
Keywords: Next generation sequencing, RNA-Seq, Quality control
Abbreviations
ABRF: Association of Biomolecular Resource Facilities
CPM: Count per million fragments mapped to exons
DEGs: Differentially expressed genes
ERCC: External RNA Controls Consortium
FDR: False discovery rate
FPKM: Fragments per kilobase per million
GEO: Gene Expression Omnibus
lncRNAs: Long non-coding RNAs
MD Anderson: The University of Texas MD Anderson Cancer Center
mESCs: Mouse embryonic stem cells
PCA: Principal component analysis
RNA-Seq: Ribonucleic acid sequencing
Omics technology, driven by next-generation sequencing (NGS) coupled with new and increasingly robust bioinformatics pipelines, has triggered exponential growth in the accumulation of large biological datasets. The first NGS study, published in 2005, reported the highly accurate sequencing of 25 million DNA bases in less than a day, representing a vast improvement in cost and throughput over traditional Sanger sequencing methods. Shortly thereafter, NGS technology was applied to RNA sequencing (RNA-Seq) [2, 3, 4, 5], and since then, the sensitivity, accuracy, reproducibility, and flexibility of RNA-Seq have made it the gold standard in transcriptomic research. Over the last ten years, approximately 53,700 RNA-Seq datasets have been deposited in the Gene Expression Omnibus (GEO) database. These RNA-Seq datasets provide information about the whole transcriptome, including gene fusions, differential expression of coding and non-coding genes, and splice variants in different experimental conditions. Increasing evidence confirms that changes in the transcriptome are a result of biological alterations, making RNA-Seq a driving force behind the exploration of global regulatory networks in cells, tissues, organisms, and diseases.
RNA-Seq is used primarily to identify differentially expressed genes (DEGs) in different biological conditions, but it is also used to discover non-coding RNAs such as microRNAs and long non-coding RNAs (lncRNAs). RNA-Seq studies have already shown that differences in RNA preparation and enrichment during library preparation can cause fundamental variations in experimental outcomes. Hence, comprehensive evaluation of RNA-Seq library preparation methods using different kits has provided a baseline from which to compare their overall capabilities and to guide future research applications. Several earlier studies have already identified potential confounding factors affecting RNA-Seq performance and analysis [8, 9, 10, 11, 12, 13, 14, 15]. These include two large-scale projects, the Sequencing Quality Control (SEQC) project of the SEQC/MAQC-III (MicroArray Quality Control) Consortium, led by the US Food and Drug Administration, and the Association of Biomolecular Resource Facilities (ABRF) next-generation sequencing (NGS) study. They also include an evaluation of three Illumina RNA-Seq protocols for degraded and low-quantity samples, a study of gene quantification in clinical samples using the Illumina TruSeq Stranded Total RNA and mRNA RNA-Seq protocols, and additional investigations focused on low-input or single-cell sequencing [12, 13, 14, 15].
The SEQC project evaluated the sensitivity, specificity, reproducibility, and complexity of gene expression, DEGs, and splice junction detection from RNA-Seq performed at multiple sites, using the same commercial reference library and External RNA Controls Consortium (ERCC) RNA spike-in controls as well as experimental samples, but using different sequencing platforms and bioinformatics pipelines. Overall, the SEQC project found that RNA-Seq data generated from vendor-prepared libraries were stable across sites but variable across protocols, implying that data variability likely originated from differences in library preparation and/or sequencing platforms. Parameters affecting library preparation include fragmentation time, ribosomal RNA (rRNA) depletion methods, cDNA synthesis procedures, library purification methods, ligation efficiency, and RNA quality. The SEQC study also illustrated that for the most highly expressed genes, DEGs were consistently identified across sites and platforms and that de novo splice junction discovery was robust but sensitive to sequencing depth.
The ABRF-NGS study evaluated not only the sensitivity, specificity, reproducibility, and complexity of gene expression, but also differential gene expression and splice junction detection among different combinations of sequencing platforms and library preparation methods, taking into account size-specific fractionation and RNA integrity. In general, the results across platforms and library preparation methods were highly correlated, but greater read depth was necessary to recover rare transcripts and splice site junctions present at low frequency, especially those resulting from putative novel and complex splicing events. Library preparation influenced the detection of non-polyA tail transcripts, 3′ UTRs, and introns, primarily due to inherent differences between rRNA reduction methods, i.e., rRNA depletion and polyA enrichment, with the former method capturing more structural and non-coding RNAs, and the latter method capturing more full-length mRNAs. More importantly, although gene quantification was robust, transcriptome coverage was sensitive to the pipelines applied during the analyses; however, surrogate variable analysis proved useful in making direct comparisons across platforms.
Schuierer et al. evaluated three Illumina library preparation kits, representing polyA selection, ribosomal RNA depletion and exon capture methods, respectively, on RNA-Seq samples spanning a wide range of input quantity and quality. They found that the ribosomal RNA depletion method generally performed well, whereas the exon capture method performed best for highly degraded RNA samples. Zhao et al. evaluated polyA selection vs. rRNA depletion using clinical samples and recommended the former over the latter in most cases where the interest is protein-coding gene quantification.
More recently, increasing interest in investigating rare cell populations and detailed biological mechanisms has led to a demand for protocols generating high quality libraries from nanogram quantities of total RNA [12, 13] and even single cells [14, 15]. Dissecting the characteristics of RNA-Seq protocols designed to obtain data from low-input or degraded samples will benefit studies involving both rare cell populations and fixed clinical samples. For low-quantity RNA analysis, it has been established that the NuGEN protocol yields data with better transcriptome complexity but has less effective rRNA depletion, while the SMARTer Ultra Low RNA Kit has better performance on transcriptome annotation but demonstrates bias with respect to underrepresenting transcripts with high GC content. cDNA amplification can help compensate for extremely small amounts of starting materials in low quantity RNA-Seq, but amplification itself may introduce problems, such as duplication, that affect library performance. ABRF evaluated several low-input RNA amplification kits and identified certain underlying differences, such as two distinct categories of genes recovered in the libraries prepared with two distinct rRNA-reduction techniques, polyA enrichment and rRNA depletion. The sensitivity of gene detection and accuracy of gene expression level assessments were consistent across approaches but divergent across RNA input amounts. The SMARTer protocol provided a near perfect correlation between obtained values and the actual amount of ERCC standard included as a spike-in control. Although this prior study provides insight into the effects of RNA amplification, it employed an artificial system using commercial RNA from TaKaRa mixed with the ERCC control RNAs, which likely oversimplifies the transcriptome complexity of real cells, thus necessitating similar work in whole-cell systems.
The source of data variation among different library preparation methods remains unclear. Therefore, in the present study, we carefully compared the results we obtained from several commercial RNA-Seq library preparation kits with different rRNA depletion and cDNA synthesis methods to understand the strength of each protocol. The first goal of our study was to investigate confounding factors in RNA-Seq library preparation protocols using three standard input kits: the TruSeq Stranded Total RNA and mRNA Library Prep Kits from Illumina, and a modified NuGEN Ovation® RNA-Seq System. Defining the properties of the data generated using these protocols may aid users in designing their future RNA-Seq strategies. The second part of our study was to thoroughly evaluate the SMARTer Ultra Low RNA Kit using mouse embryonic stem cells (mESCs). Our results demonstrated that the TruSeq Stranded mRNA protocol was the best for transcriptome profiling and that the TruSeq Stranded Total RNA and mRNA protocols were comparable, whereas the modified NuGEN protocol performed less well for whole transcriptome analysis, but might be a better choice for studies focused on non-coding RNAs. Lastly, although the results obtained with the SMARTer Ultra Low RNA Kit were comparable to those of the TruSeq Stranded mRNA kit for most metrics and for identification of DEGs, the absolute expression levels were only moderately correlated. We conclude that each RNA-Seq protocol has individual strengths for particular individual applications that need to be considered for a successful RNA-Seq experiment.
Experimental design and RNA-Seq data quality metrics
We used the manufacturer-recommended optimal input amounts (1 μg for both the Illumina TruSeq Stranded Total RNA and the Illumina TruSeq Stranded mRNA protocols; and 100 ng for the modified NuGEN Ovation v2; hereafter, “standard protocol”) (Fig. 1a). We also compared all three of these protocols with 100 ng input RNA (Fig. 1a and the Additional file figures). As described in a recent study, and as shown in Fig. 1a, the Illumina TruSeq Stranded Total RNA protocol uses Ribo-Zero to remove rRNA, whereas the TruSeq Stranded mRNA protocol enriches mRNA through polyA selection. In contrast, the modified NuGEN Ovation v2 protocol synthesizes cDNA directly from total RNA with a combination of random and oligo(dT) primers, followed by cDNA fragmentation on a Covaris instrument, whereas both TruSeq protocols use divalent cations under elevated temperature to fragment purified RNAs. For the TaKaRa SMARTer Ultra Low RNA Kit, we used total RNA from 100 and 1000 mESCs, or approximately 1 and 10 ng RNA, respectively. To check whether this modified ultra-low input protocol was capable of generating quality data, we compared the mESC dataset derived from the TaKaRa SMARTer cDNA synthesis step combined with Nextera library preparation to the high-quality datasets obtained using the TruSeq Stranded mRNA protocol with 2 μg total RNA as the input level.
The data analysis flow and the data quality metrics used in this study to evaluate RNA-Seq protocols are diagrammed in Fig. 1c and detailed below.
Mapping statistics (standard input protocols)
Read coverage over transcripts (standard input protocols)
Positional signal bias in RNA-Seq data can lead to inaccurate transcript quantification. Therefore, we examined the read coverage over transcripts longer than 1000 bps and found excessive enrichment of fragments at the 3′-end and depletion of signal at the 5′-end for samples prepared with the modified NuGEN protocol (Fig. 2d and Additional file 1: Figure S1D). Reads from the TruSeq Stranded Total RNA and TruSeq Stranded mRNA protocols were more evenly distributed along the entire length of the transcript (Fig. 2d and Additional file 1: Figure S1D). Closer examination of each nucleotide within 1000 bps of the 5′- and 3′- ends confirmed that the modified NuGEN protocol failed to capture the RNA signal towards the 5′-end (Additional file 2: Figure S2A, C), and also suggested that the TruSeq Stranded mRNA protocol missed the signal within 200 bp of the 3′-end, compared to the TruSeq Stranded Total RNA protocol (Additional file 2: Figure S2B, D).
Representation of the transcriptome (standard input protocols)
To evaluate the capability of the RNA-Seq protocols for detecting coding genes and lncRNAs, we performed saturation analysis to count the number of coding genes and lncRNAs detected at increasing sequencing depth. For coding genes, the saturation curves from the TruSeq Stranded Total RNA and mRNA libraries looked very similar and were superior to those from the NuGEN libraries (Fig. 3b and Additional file 3: Figure S3B). For lncRNAs, the modified NuGEN protocol outperformed both the TruSeq Stranded Total RNA and mRNA protocols, yielding more lncRNAs at the same sequencing depth (Fig. 3c and Additional file 3: Figure S3C). However, for lncRNAs, none of the libraries were close to saturation at the sequencing depth used for our experiments. To examine the sequencing depth required to reach saturation for lncRNA detection, we repeated our saturation analysis after pooling samples from the same RNA-Seq protocol together. Our analysis showed that the modified NuGEN protocol still exceeded the other two protocols in lncRNA recovery, even when sequencing depth approached saturation (Fig. 3d and Additional file 3: Figure S3D).
Another important application of RNA-Seq is to identify alternatively spliced variants, which frequently occur in mammalian genes. In this regard, we conducted saturation analysis comparing the number of reads to the number of detected splice sites (Fig. 3e and Additional file 3: Figure S3E). We recovered the lowest number of splice junctions using the modified NuGEN protocol and the highest number with the TruSeq Stranded mRNA protocol.
Concordance of expression quantification (standard input protocols)
Concordance of DEGs recovered with standard input protocols
Mapping statistics, read coverage bias and transcriptome representation (ultra-low protocol)
Concordance of expression quantification and DE detection (ultra-low protocol)
Comparing global gene expression in differing biological contexts is a cornerstone of contemporary biology. As microarray technology is being supplanted by RNA-Seq methods for many applications, it is imperative to determine which library preparation protocols are best suited for specific needs, for example the recovery of coding vs. non-coding RNAs and reliable discernment of DEGs. Here, we have examined three different standard RNA-Seq library preparation protocols and one low-input protocol in terms of overall reproducibility, rRNA contamination, read coverage, 5′- and 3′-end bias, and recovery of exonic vs. intronic sequences, lncRNAs, and DEGs. These protocols were the standard input Illumina TruSeq Stranded Total RNA, Illumina TruSeq Stranded mRNA, and modified NuGEN Ovation v2 kits; and the low-input TaKaRa SMARTer Ultra Low RNA Kit v3, tested at two different input levels, 100 (~ 1 ng RNA) and 1000 (~ 10 ng RNA) cells. Although all protocols yielded reproducible data, the Illumina kits generally outperformed the modified NuGEN Ovation v2 kit at standard RNA input levels. The modified NuGEN protocol was useful for the recovery of lncRNAs and intronic sequences, but also had higher levels of rRNA contamination.
Undesirable recovery of rRNA
One impediment to the efficient recovery of meaningful RNA-Seq data is repetitive rRNA. Nearly 80% of RNA in a cell is rRNA, making it preferable to remove this class of RNA prior to library construction. RNA-Seq library preparation protocols depend on one of two means of reducing rRNA contamination: rRNA depletion and polyA enrichment. Of the three standard protocols and the one ultra-low input protocol we evaluated, the TruSeq Stranded Total RNA and the modified NuGEN Ovation RNA-Seq System V2 protocols employ rRNA depletion methods, whereas the TruSeq Stranded mRNA protocol and SMARTer Ultra Low protocol use polyA enrichment methods to reduce rRNA contamination in sequencing libraries. In our present study, the modified NuGEN protocol libraries averaged 15–20% of their reads mapping to rRNA, as compared to 1–5% for the TruSeq protocols (Fig. 2a and Additional file 1: Figure S1A). These results are consistent with those reported by Adiconis et al. (23.2%), but lower than those reported by Shanker et al. (35%). However, our NuGEN rRNA mapping rates were much higher than those reported by both Sun et al. and Alberti et al., who had only a 1% rRNA mapping rate for both their Illumina- and NuGEN-created libraries. While we cannot explain the differences in rRNA mapping rates for the NuGEN libraries in these studies, in our core facility, the NuGEN Ovation v2 kit libraries consistently resulted in a 15–20% rRNA mapping rate, not only in this study, but also in prior sequencing libraries constructed in our facility (data not shown), thus providing part of the impetus for the current study. We also examined the rRNA mapping rate in libraries prepared from two polyA-enrichment protocols, the Illumina TruSeq Stranded mRNA protocol and the TaKaRa SMARTer Ultra Low RNA protocol. The SMARTer protocol yielded a 7–9% rRNA mapping rate, which was inferior to the TruSeq protocol at standard RNA input levels (1%) (Fig. 6a). The 7–9% mapping rate yielded by the SMARTer protocol in our facility was consistent with that reported by Adiconis et al. and Alberti et al. Overall, the protocols we tested were able to remove the majority of rRNA. Although the modified NuGEN protocol showed relatively higher rRNA content, the presence of rRNA is not expected to bias expression quantification, so an increase in sequencing depth should be able to compensate.
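The compensation argument above can be made concrete with simple arithmetic. The following is a minimal sketch, not part of the original analysis: it assumes that rRNA reads are simply discarded and that all other reads are equally informative, and the function names are hypothetical. It ignores differences in overall mapping rate and exonic fraction, which also vary between protocols.

```python
def required_total_reads(informative_reads: float, rrna_fraction: float) -> float:
    """Total reads needed so that `informative_reads` remain once rRNA reads are discarded.

    Assumption (not from the study): every non-rRNA read counts as informative.
    """
    if not 0.0 <= rrna_fraction < 1.0:
        raise ValueError("rrna_fraction must be in [0, 1)")
    return informative_reads / (1.0 - rrna_fraction)


def depth_ratio(rrna_fraction_a: float, rrna_fraction_b: float) -> float:
    """Factor by which library A must be sequenced deeper than library B
    to obtain the same number of non-rRNA reads."""
    for frac in (rrna_fraction_a, rrna_fraction_b):
        if not 0.0 <= frac < 1.0:
            raise ValueError("rRNA fractions must be in [0, 1)")
    return (1.0 - rrna_fraction_b) / (1.0 - rrna_fraction_a)


if __name__ == "__main__":
    # Illustrative values taken from the ranges quoted in the text:
    # 20% rRNA (upper end for the modified NuGEN libraries) vs. 5% (upper end for TruSeq).
    print(depth_ratio(0.20, 0.05))
    print(required_total_reads(20e6, 0.20))
```

Under these assumptions, a library with 20% rRNA reads needs roughly 19% more total reads than one with 5% rRNA reads to deliver the same number of non-rRNA reads.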
Overall mapping, end bias and exonic coverage
The TruSeq protocols yielded a ≥ 90% overall mapping rate for fragments with both ends mapped to the genome, compared to 60% for the modified NuGEN protocol (Fig. 2b and Additional file 1: Figure S1B). This is on par with a prior study showing NuGEN rRNA-depleted libraries had a 75% alignment rate and TruSeq polyA-enrichment mRNA libraries had a 90% alignment rate.
To assess whether complete transcripts were evenly captured by the three standard library preparation protocols, we examined read coverage over the length of the full transcript. Our results, like those of Adiconis et al., indicated that NuGEN libraries displayed augmented 3′-end signal and depleted 5′-end signal, perhaps due to using a combination of both oligo(dT) and random primers during cDNA synthesis. The TruSeq Stranded mRNA libraries were also somewhat biased, as reflected by a lack of reads within 200 bps of the 3′-end, relative to the TruSeq Total RNA libraries (Additional file 2: Figure S2B, D). This may be because of the difference between the rRNA reduction approaches used by the TruSeq mRNA and TruSeq Total RNA protocols (polyA enrichment vs. rRNA depletion), resulting in more unmappable reads near the 3′-end in TruSeq mRNA libraries due to the presence of polyA tails in these reads.
To determine how well each protocol performed in recovering the transcriptome, we examined the composition of the uniquely mapped fragments from the two Illumina and the modified NuGEN protocols. Ninety percent of our reads were mapped to exons using the TruSeq Stranded mRNA kit, 67–84% using the Total RNA kit, and 35–46% using the NuGEN kit (Fig. 3a and Additional file 3: Figure S3A), which is consistent with similar studies using these kits [9, 11, 13, 18], suggesting that polyA-enrichment protocols may be superior to rRNA depletion protocols for studies focusing on exonic RNA [11, 13, 18]. This is further supported by our finding that, compared to the three standard input protocols, the polyA-based TaKaRa SMARTer Ultra Low RNA Kit had almost the same exonic coverage as the TruSeq Stranded mRNA protocol (Fig. 6d). The inverse was true for the recovery of intronic sequences, with rRNA-depleted libraries outperforming the polyA-enrichment libraries. For example, the modified NuGEN protocol yielded ~ 50% intronic sequences, which was on par with the results of Shanker et al. (after removing PCR duplicates), whereas our TruSeq Stranded Total RNA libraries consisted of 14–28% intronic sequences. In contrast, the TruSeq Stranded mRNA libraries contained only 6–8% intronic sequences (Fig. 3a and Additional file 3: Figure S3A). We also found that the modified NuGEN kit yielded better lncRNA recovery. In this case, better lncRNA recovery may be due to differences in the cDNA synthesis step rather than in the rRNA depletion step: whereas the TruSeq Stranded Total RNA protocol uses only random primers for cDNA synthesis, the modified NuGEN protocol uses a combination of random and oligo(dT) primers, thus allowing more efficient capture of both coding and non-coding RNAs with and without polyA tails. However, it is also possible that some of the lncRNAs identified in the rRNA-depleted libraries are merely false signals originating from intronic reads from other coding genes rather than lncRNAs. Additionally, it is worth noting that in our saturation analysis (Fig. 3b, c and Additional file 3: Figure S3B, C), the curves reached saturation at ~ 60% coding genes or ~ 30% lncRNAs, suggesting that achieving increased coverage of coding genes or lncRNAs beyond these levels by deeper sequencing would be very difficult.
Gene quantification and identification of DEGs
Quantifying gene expression and identifying DEGs between samples from different biological conditions are two of the primary goals for most RNA-Seq experiments. In the current study, we identified 960 and 1028 DEGs between experimental and control tumor tissues using the TruSeq Total RNA and mRNA protocols (manuscript in preparation), respectively, which was fewer than the 1430 DEGs identified using the modified NuGEN protocol (Fig. 5b). This contrasts with the work of Sun et al., who recovered fewer DEGs from NuGEN libraries than from TruSeq polyA-enrichment libraries. To explore this difference, we validated our RNA-Seq-identified DEGs using qRT-PCR. We found that a greater proportion of DEGs identified using the TruSeq Stranded Total RNA and mRNA libraries were supported by our qRT-PCR results compared to DEGs identified using the modified NuGEN protocol libraries. That is, the modified NuGEN protocol may have resulted in more false-positive DEGs than did the TruSeq protocols. The comparable performance of the TruSeq Total and mRNA protocols in our study contrasts with the results of Zhao et al., who directly compared the TruSeq Stranded Total and mRNA protocols using clinical samples. They found that the TruSeq Stranded mRNA libraries more accurately predicted gene expression levels than the TruSeq Stranded Total RNA libraries.
Although the SMARTer Ultra Low RNA Kit-generated libraries were able to capture the effect of biological differences between experimental and control samples, overall, its performance was inferior to that of the TruSeq Stranded mRNA protocol, given both the higher amount of rRNA recovered and the lower number of DEGs recovered (Figs. 6 and 7). This may be due to the very different levels of input RNA used in these two protocols.
Limitations and future work
There are still some limitations in this study that could be addressed in future work. For example, this study did not include spike-in RNAs, which could serve as a sample-independent benchmark to further evaluate the accuracy of DEG detection in libraries prepared by different protocols. Future work could also consider investigating additional ultra-low RNA-Seq protocols and using standard RNA samples such as Universal Human Reference RNA (UHRR) for an easier comparison to other studies.
In summary, all the RNA-Seq library preparation protocols evaluated in this study were suitable for distinguishing between experimental groups when using the manufacturers’ recommended amount of input RNA. However, we made some discoveries that might have been previously overlooked. First, we found that the TruSeq Stranded mRNA protocol is universally applicable to studies focusing on dissecting protein-coding gene profiles when the amount of input RNA is sufficient, whereas the modified NuGEN protocol might provide more information in studies designed to understand lncRNA profiles. Therefore, choosing the appropriate RNA-Seq library preparation protocol for recovering specific classes of RNA should be a part of the overall study design. Second, when dealing with small amounts of input RNA, the SMARTer Ultra Low RNA Kit may be a good choice in terms of rRNA removal, exonic mapping rates and recovered DEGs. Third, our saturation analysis indicated that the required sequencing depth depends on the biological question being addressed by each individual study. Roughly, a minimum of 20 M aligned reads/mate-pairs is required for a project designed to detect coding genes, and increasing the sequencing depth to ≥130 M reads may be necessary to thoroughly investigate lncRNAs (note: the needed sequencing depth may also vary depending on the biological samples and study design). Omics technology and big data will facilitate the development of personalized medicine, but we should understand the effects of the experimental parameters and control for them as thoroughly as possible.
Biological samples and RNA isolation
The use of mice in this project has been reviewed and approved by The University of Texas MD Anderson Cancer Center (MD Anderson) IACUC committee (ACUF 04–89-07138, S. Fischer) and (ACUF MODIFICATION 00001124-RN01, T. Chen). C57BL/6 mice were purchased from The Jackson Laboratory (Bar Harbor, ME). For the three standard input RNA-Seq library preparation protocols (Illumina TruSeq Stranded Total RNA, TruSeq Stranded mRNA, and the modified NuGEN Ovation RNA-Seq kits), total RNA was isolated from three xenograft tumors (biological replicates) from each of the control [30% calorie-restricted (CR) diet] and experimental [diet-induced obese (OB)] xenograft mouse models in the C57BL/6 genetic background. C57BL/6 mice were chosen, in part, because they are susceptible to obesity when fed a high-fat diet. We fed the mice with two commercial diets following previously established guidelines (Research Diets, Inc., New Brunswick, NJ): a CR diet (D03020702) for lean C57BL/6 mice (30% CR), and a diet-induced obesity (DIO) diet (D12492; consumed ad libitum) for OB C57BL/6 mice, 10 mice per group. Mice were humanely euthanized using carbon dioxide followed by cervical dislocation, per IACUC-approved procedures. A manuscript describing the details of the mouse obesity/tumor xenograft study, including transcriptomic profiling results, is in preparation. For the SMARTer Ultra Low RNA Kit, designed to evaluate both rare cell populations and fixed clinical samples, three mESC lines (biological replicates) from Zbtb24 knockout (1lox/1lox) clones and three Zbtb24 wild-type (2lox/+) clones were used as experimental and control samples, respectively. The mice used for this part of the study were generated in-house at MD Anderson Science Park. A manuscript describing the Zbtb24 KO mESCs, including transcriptomic profiling results, is also in preparation.
Total RNA from mouse xenograft tumor tissues was isolated using TRIZOL following the manufacturer’s protocol. Isolated RNA samples were treated with DNase I followed by purification with a QIAGEN RNeasy Mini kit (Madison, WI). Total RNA from mESCs was extracted using the QIAGEN RNeasy Mini kit with on-column DNase treatment following the manufacturer’s protocol. Both concentration and quality of all the isolated RNA samples were measured and checked with an Agilent Bioanalyzer 2100 and Qubit. All RNA samples had RNA integrity numbers > 8.90. For the low-cell-input experiments, 100 cells and 1000 cells (~ 1 and 10 ng RNA, respectively, according to the SMARTer Ultra Low RNA kit user manual) were used directly without isolating total RNA in accordance with manufacturer recommendations.
TruSeq stranded total RNA and mRNA library preparations
Libraries were prepared using the Illumina TruSeq Stranded Total RNA (Cat. # RS-122-2301) or mRNA (Cat. # RS-122-2101) kit according to the manufacturer’s protocol starting with 1 μg total RNA. Briefly, rRNA-depleted RNAs (Total RNA kit) or purified mRNAs (mRNA kit) were fragmented and converted to cDNA with reverse transcriptase. The resulting cDNAs were converted to double stranded cDNAs and subjected to end-repair, A-tailing, and adapter ligation. The constructed libraries were amplified using 8 cycles of PCR.
NuGEN ovation RNA-Seq system v2 modified with SPRI-TE library construction system
Total RNA (100 ng) was converted to cDNA using the NuGEN Ovation RNA-Seq System v2 (Cat. # 7102–32) (NuGEN, San Carlos, CA) following the manufacturer’s protocol. NuGEN-amplified double-stranded cDNAs were broken into ~ 180 base pair (bp) fragments by sonication with a Covaris S220 instrument (Covaris, Woburn, MA). Fragmented cDNAs were processed on a SPRI-TE library construction system (Beckman Coulter, Fullerton, CA). Uniquely indexed NEXTflex adapters (Bioo Scientific, Austin, TX) were ligated onto each sample to allow for multiplexing. Adapter-ligated libraries were amplified [1 cycle at 98 °C for 45 s; 15 cycles at 98 °C for 15 s, 65 °C for 30 s, and 72 °C for 30 s; 1 cycle at 72 °C for 1 min; and a hold at 4 °C] using a KAPA library amplification kit (KAPA Biosystems, Wilmington, MA) and purified with AMPure XP beads (Beckman Coulter).
Modified protocol for the SMARTer ultra low RNA and Nextera DNA library preparation kits
mESCs were lysed in the reaction buffer included in the SMARTer Ultra Low RNA Kit v3 (Cat. # 634849) (TaKaRa, Japan). cDNA was then synthesized using the SMARTer Ultra Low RNA Kit followed by library construction using the Nextera DNA Sample Preparation Kit (Cat. # FC-131-1024) (Illumina, San Diego, CA), according to the manufacturers’ protocols. We performed 10 cycles of PCR for 1000 cells (~ 10 ng RNA) (SMARTer 1000), and 18 cycles of PCR for 100 cells (~ 1 ng RNA) (SMARTer 100).
Ten pM of pooled libraries were processed using a cBot (Illumina) for cluster generation before sequencing on an Illumina HiSeq 2500 (2 × 76 bp run).
RNA-Seq data analysis
Reads were mapped to rRNA sequences (GI numbers: 262231778, 120444901, 120444900, 328447215, 38176281 and Ensembl IDs: ENSMUST00000082388, ENSMUST00000082390, ENSMUST00000083988, ENSMUST00000157970) using Bowtie2 (version 2.1.0). Reads that were not mapped to rRNAs were then mapped to the mouse genome (mm10) using TopHat (version 2.0.10).
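The two-step mapping described above can be sketched as a pair of command lines. This is only an illustration: the exact options used in the study are not stated, so the use of Bowtie2's `--un-conc` option to collect read pairs that did not align concordantly to rRNA, the thread count, and all file and index names are assumptions. The functions only build the argument lists; they could be passed to `subprocess.run(cmd, check=True)` on a machine where both tools are installed.

```python
from typing import List


def rrna_filter_command(rrna_index: str, reads_1: str, reads_2: str,
                        unaligned_pattern: str, sam_out: str,
                        threads: int = 4) -> List[str]:
    """Build a Bowtie2 command that aligns read pairs to an rRNA index.

    Assumption: pairs that fail to align concordantly are treated as
    "not mapped to rRNA" and written out via --un-conc. If the path
    contains a '%', Bowtie2 replaces it with 1 or 2 for the two mates.
    """
    return [
        "bowtie2",
        "-p", str(threads),
        "-x", rrna_index,
        "-1", reads_1,
        "-2", reads_2,
        "--un-conc", unaligned_pattern,
        "-S", sam_out,
    ]


def genome_mapping_command(genome_index: str, reads_1: str, reads_2: str,
                           out_dir: str, threads: int = 4) -> List[str]:
    """Build a TopHat command that maps the rRNA-filtered pairs to the genome."""
    return ["tophat", "-p", str(threads), "-o", out_dir,
            genome_index, reads_1, reads_2]


# Hypothetical file and index names, for illustration only.
step1 = rrna_filter_command("rRNA_index", "sample_R1.fastq", "sample_R2.fastq",
                            "sample_norRNA_R%.fastq", "sample_rRNA.sam")
step2 = genome_mapping_command("mm10", "sample_norRNA_R1.fastq",
                               "sample_norRNA_R2.fastq", "tophat_sample")
```

The fraction of read pairs retained after the first step gives a direct estimate of how well each kit removed rRNA.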
Read coverage over transcripts
The longest transcript from each gene was chosen to represent the gene. The reads were then mapped to all the transcript sequences using Bowtie2. Transcripts with fewer than 200 total fragment counts or shorter than 1000 bp were filtered out, leaving at least 12,000 transcripts for each sample. Each full-length transcript was subdivided evenly into 1000 bins. The mean coverage of fragments over each bin was normalized to the total coverage over the whole transcript and then averaged over all the transcripts. Alternatively, the coverage of fragments over each position of the 1000 bp downstream of the 5′-end or upstream of the 3′-end was normalized by the mean coverage of the whole transcript, and then averaged over all the transcripts.
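The binning and normalization can be illustrated with a short Python sketch. It assumes that per-base coverage is already available for each transcript and that "total coverage" means the sum of per-base coverage over the transcript; the function names and the toy input are hypothetical and are not taken from the study's code.

```python
from typing import List, Sequence, Tuple


def binned_profile(coverage: Sequence[float], n_bins: int = 1000) -> List[float]:
    """Mean coverage of each of n_bins equal-width bins, divided by the
    total (summed) per-base coverage of the transcript."""
    length = len(coverage)
    total = float(sum(coverage))
    if length < n_bins or total == 0:
        raise ValueError("transcript too short or without coverage")
    profile = []
    for b in range(n_bins):
        start = b * length // n_bins
        end = (b + 1) * length // n_bins
        mean_cov = sum(coverage[start:end]) / (end - start)
        profile.append(mean_cov / total)
    return profile


def average_profile(transcripts: Sequence[Tuple[int, Sequence[float]]],
                    n_bins: int = 1000,
                    min_fragments: int = 200,
                    min_length: int = 1000) -> List[float]:
    """Average the normalized profiles over all transcripts passing the filters.

    Each transcript is given as (fragment_count, per_base_coverage).
    """
    kept = [binned_profile(cov, n_bins)
            for frags, cov in transcripts
            if frags >= min_fragments and len(cov) >= min_length]
    if not kept:
        raise ValueError("no transcript passed the filters")
    return [sum(column) / len(kept) for column in zip(*kept)]


# Toy example with 4 bins instead of 1000 so the numbers are easy to follow.
toy = [
    (250, [2] * 8),                   # uniform coverage
    (300, [0, 0, 0, 0, 4, 4, 4, 4]),  # coverage only in the 3' half
    (5, [9] * 8),                     # removed: too few fragments
]
profile = average_profile(toy, n_bins=4, min_fragments=200, min_length=4)
```

In the toy example the averaged profile rises toward the last bins, which is the pattern a 3′-biased library would show.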
Discovery of splicing junctions
The number of known splicing junctions (defined as junctions with both 5′- and 3′- splice sites annotated in the reference gene set) supported by at least one read in each sample was counted using RSeQC (version 2.6.4).
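The definition of a known junction can be made concrete with a small sketch. This is not RSeQC's implementation: it assumes junctions are represented as (chromosome, intron start, intron end) tuples, ignores strand, and takes hypothetical read counts per junction as input.

```python
from typing import Dict, Iterable, Set, Tuple

Junction = Tuple[str, int, int]  # (chromosome, intron start, intron end)


def count_known_junctions(read_counts: Dict[Junction, int],
                          annotated: Iterable[Junction],
                          min_reads: int = 1) -> int:
    """Count observed junctions whose two splice sites are both annotated
    and that are supported by at least min_reads reads."""
    start_sites: Set[Tuple[str, int]] = set()
    end_sites: Set[Tuple[str, int]] = set()
    for chrom, start, end in annotated:
        start_sites.add((chrom, start))
        end_sites.add((chrom, end))

    known = 0
    for (chrom, start, end), n_reads in read_counts.items():
        both_annotated = (chrom, start) in start_sites and (chrom, end) in end_sites
        if both_annotated and n_reads >= min_reads:
            known += 1
    return known


# Hypothetical annotation and per-sample junction read counts.
annotation = [("chr1", 100, 200), ("chr1", 300, 400)]
observed = {
    ("chr1", 100, 200): 5,  # annotated junction
    ("chr1", 100, 400): 1,  # both sites annotated (exon skipping)
    ("chr1", 100, 250): 3,  # only one site annotated
    ("chr2", 100, 200): 2,  # neither site annotated on this chromosome
    ("chr1", 300, 400): 0,  # annotated but unsupported
}
n_known = count_known_junctions(observed, annotation)
```

Note that under this definition a junction joining two annotated splice sites counts as known even if that exact pairing is absent from the annotation.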
Each point in a saturation curve was generated by randomly selecting the desired number of fragments and calculating the percentage of genes with more than 10 fragments over all the genes. For each sample, this procedure was repeated three times and the curve represents the average percentage of genes at each corresponding number of fragments.
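One point of such a saturation curve could be computed as in the sketch below. It assumes that each fragment has already been assigned to a gene, so that a sample is a list of gene names with one entry per fragment; the function name, the seed, and the toy data are hypothetical.

```python
import random
from collections import Counter
from typing import List, Sequence


def saturation_point(fragment_genes: Sequence[str], n_fragments: int,
                     n_genes_total: int, min_fragments: int = 10,
                     n_repeats: int = 3, seed: int = 0) -> float:
    """Average percentage of genes with more than min_fragments fragments
    after randomly drawing n_fragments fragments, over n_repeats draws."""
    rng = random.Random(seed)
    percentages: List[float] = []
    for _ in range(n_repeats):
        sample = rng.sample(list(fragment_genes), n_fragments)
        counts = Counter(sample)
        detected = sum(1 for c in counts.values() if c > min_fragments)
        percentages.append(100.0 * detected / n_genes_total)
    return sum(percentages) / len(percentages)


# Toy sample: gene A has 50 fragments, B has 11, C has 10; a fourth gene has none.
toy_fragments = ["A"] * 50 + ["B"] * 11 + ["C"] * 10
curve = [saturation_point(toy_fragments, n, n_genes_total=4)
         for n in (0, 30, len(toy_fragments))]
```

With all fragments drawn, genes A and B exceed the threshold while C, at exactly 10 fragments, does not, so the curve ends at 50% of the four genes.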
Hierarchical clustering of samples was performed in R on the log2(cpm + 1) values of all the genes: pairwise distances between samples were computed with the dist function (Euclidean method), and the samples were then clustered with the hclust function (complete method).
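The study used R for this step; the same idea can be sketched in plain Python to show what is being computed. The sketch assumes that cpm is the raw count divided by the sample's total count times one million, and it uses a simple complete-linkage agglomeration rather than R's hclust; the sample names and counts are invented.

```python
import math
from typing import Dict, List, Sequence, Tuple


def log2_cpm(counts: Sequence[int]) -> List[float]:
    """log2(cpm + 1), with cpm computed from the sample's total count."""
    library_size = sum(counts)
    return [math.log2(c / library_size * 1e6 + 1) for c in counts]


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def complete_linkage(profiles: Dict[str, Sequence[float]]
                     ) -> List[Tuple[Tuple[str, ...], Tuple[str, ...], float]]:
    """Agglomerative clustering; the distance between two clusters is the
    largest pairwise distance between their members (complete linkage)."""
    clusters: List[Tuple[str, ...]] = [(name,) for name in profiles]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = max(euclidean(profiles[a], profiles[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Invented counts for three genes in four samples.
raw_counts = {
    "ctrl_1": [100, 200, 700],
    "ctrl_2": [110, 190, 700],
    "treat_1": [700, 200, 100],
    "treat_2": [690, 210, 100],
}
profiles = {name: log2_cpm(c) for name, c in raw_counts.items()}
merges = complete_linkage(profiles)
```

In this toy data the replicates of each group merge first, and the two groups join only in the final step.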
Software used in this study
Fast gapped-read alignment with Bowtie 2
TopHat: discovering splice junctions with RNA-Seq
HTSeq — A Python framework to work with high-throughput sequencing data
edgeR: a Bioconductor package for differential expression analysis of digital gene expression data
RSeQC: quality control of RNA-seq experiments
Box plots of gene expression, GC content and gene length
Between a pair of protocols, the genes with elevated expression in one protocol compared to the other protocol were identified by edgeR at FDR < 0.01 and log2 ratio > 1. Then the gene expression, GC content, and gene length for the two groups of more highly expressed genes were plotted in box plots. The gene expression is the average FPKM (number of fragments per kilobase per million mapped fragments) value of all the samples used in the evaluation of the standard input or ultralow input protocols. The longest transcript representing each gene was used to calculate both gene GC content and length.
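The quantities behind these box plots can be illustrated as follows. The sketch assumes that differential expression results are already available as (gene, log2 ratio, FDR) tuples, with the ratio expressed as protocol A over protocol B, and that genes higher in B are those with a log2 ratio below -1; all names and values are hypothetical.

```python
from typing import Iterable, List, Tuple


def fpkm(fragments: int, transcript_length_bp: int,
         total_mapped_fragments: int) -> float:
    """Fragments per kilobase of transcript per million mapped fragments."""
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


def gc_content(sequence: str) -> float:
    """Fraction of G and C bases in a transcript sequence."""
    seq = sequence.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def split_by_protocol(results: Iterable[Tuple[str, float, float]],
                      fdr_cutoff: float = 0.01,
                      log2_ratio_cutoff: float = 1.0
                      ) -> Tuple[List[str], List[str]]:
    """Return the genes elevated in protocol A and those elevated in protocol B."""
    higher_in_a: List[str] = []
    higher_in_b: List[str] = []
    for gene, log2_ratio, fdr in results:
        if fdr >= fdr_cutoff:
            continue
        if log2_ratio > log2_ratio_cutoff:
            higher_in_a.append(gene)
        elif log2_ratio < -log2_ratio_cutoff:
            higher_in_b.append(gene)
    return higher_in_a, higher_in_b


# Hypothetical edgeR-style results: (gene, log2 ratio A/B, FDR).
toy_results = [
    ("gene1", 2.3, 0.001),   # higher in A
    ("gene2", -1.8, 0.004),  # higher in B
    ("gene3", 3.0, 0.2),     # not significant
    ("gene4", 0.5, 0.0001),  # significant but small change
]
group_a, group_b = split_by_protocol(toy_results)
```

The expression, GC content, and length of the genes in each of the two groups can then be summarized side by side.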
We thank Dr. Briana Dennehey for editorial assistance, Dr. Sharon Dent and Sara Gaddis for their critical reading of the manuscript, and Joi Holcomb for her help with the preparation of the figures. For the work conducted at The University of Texas MD Anderson Cancer Center (MD Anderson) - Science Park, we thank the Research Animal Support Facility for animal maintenance and care (NIH P30 CA16672).
H-PC, YL, and JS conceived the project, designed experiments, interpreted data and wrote the manuscript; H-PC and YL conducted most bioinformatics analyses; KL helped with bioinformatics analyses; YPC, YT, MW, and MSS performed sequencing experiments; JER and CDM performed animal experiments; SMF, TC, and DGT helped with data interpretation; YT, JSK, MSS, and DGT helped with writing the manuscript. All authors read and approved the manuscript.
This project was supported by Cancer Prevention and Research Institute of Texas (CPRIT) Core Facility Support Awards (RP120348 and RP170002) to J.S. and NIH (1R01AI1214030A1) to T.C. The funding agencies had no role in the design of the study, in the collection, analysis, and interpretation of data, or in the writing of the manuscript.
Ethics approval and consent to participate
The use of mice in this project has been reviewed and approved by the MD Anderson IACUC committee (ACUF 04–89-07138, S. Fischer) and (ACUF MODIFICATION 00001124-RN01, T. Chen).
Consent for publication
Competing interests
The authors declare that they have no competing interests.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.