The impact of read length on quantification of differentially expressed genes and splice junction detection
- 3.6k Downloads
The initial next-generation sequencing technologies produced reads of 25 or 36 bp, and only from a single-end of the library sequence. Currently, it is possible to reliably produce 300 bp paired-end sequences for RNA expression analysis. While read lengths have consistently increased, people have assumed that longer reads are more informative and that paired-end reads produce better results than single-end reads. We used paired-end 101 bp reads and trimmed them to simulate different read lengths, and also separated the pairs to produce single-end reads. For each read length and paired status, we evaluated differential expression levels between two standard samples and compared the results to those obtained by qPCR.
We found that, with the exception of 25 bp reads, there is little difference for the detection of differential expression regardless of the read length. Once single-end reads are at a length of 50 bp, the results do not change substantially for any level up to, and including, 100 bp paired-end. However, splice junction detection significantly improves as the read length increases with 100 bp paired-end showing the best performance. We performed the same analysis on two ENCODE samples and found consistent results confirming that our conclusions have broad application.
A researcher could save substantial resources by using 50 bp single-end reads for differential expression analysis instead of using longer reads. However, splicing detection is unquestionably improved by paired-end and longer reads. Therefore, an appropriate read length should be used based on the final goal of the study.
KeywordsRoot Mean Square Deviation Read Length Splice Junction Orphan Gene Splice Detection
Association of Molecular Resource Facilities
differentially expressed gene
log2-based fold change
root mean square deviation.
One of the main questions for a researcher performing a sequencing experiment is the length of reads to use and whether to use single-end reads or paired-end reads. Longer reads should, a priori, increase the level of uniquely mapping reads, but such longer reads have an increased cost in reagents and an increase in running time for the instrument. While the determination of the proper read length for an experiment is important across all sequencing experiments, including genome re-sequencing, de novo sequencing, RNA-seq, and ChIP-seq, we have only focused on the use of RNA-seq for differentially expressed genes (DEGs) and isoform detection.
The initial reads on Illumina and other next-generation platforms were extremely short and often only ranged up to 25 or 36 bp . While these reads were sufficient for some assays, a substantial percentage of the reads could not be mapped uniquely and were often discarded due to the inability to determine their correct matching location within the genome . More recently, the lengths of reads have increased substantially and sequencers have been improved to allow for the sequencing of both ends of a fragment to allow for paired-end sequences. The current read length that is standard for many experiments is paired-end 100 bp reads and there is also the possibility of running paired-end 300 bp reads.
Approximate cost of sequencing for each read length and sequencing type on a HiSeq 2500, high-output mode v3 (eight lanes per flowcell)
Per lane cost
We have used data from the SEQC Sequencing study to investigate the effects of read-length on RNA-seq results and then validated the results using data from the ENCODE consortium. Since our main goal was to investigate the role of read length in determining RNA-seq results, we wanted to minimize all other variables. Therefore, we obtained the same sets of physical reads for the entire experiment and these reads were bioinformatically trimmed to produce reads of shorter lengths. This trimming is akin to what would have been obtained if the sequencing machine had been stopped earlier than it was for the longer reads. The quality and error profile of the 50th base of a 50 bp read is the same as that of the 50th base of a 100 bp read.
We took two samples from the Association of Molecular Resource Facilities (ABRF) SEQC study that consisted of RNA standards (see "Materials and methods"), labeled here as A and B. For each sample, we took three sets of paired-end 101 bp reads to form the basis of the analysis. These reads were then processed to produce 100, 75, 50 and 25 bp paired-end reads and 100, 75, 50 and 25 bp single-end reads. All of the reads were then aligned to the human genome using the STAR aligner.
Mapping statistics and splice junction detection
The number of splice junctions detected for each alignment was also determined (Fig. 1b). Alignments with 25 bp read lengths resulted in a significantly lower number of splice junctions detected (t = −13.13, p value = 2.2 × 10−16). Moreover, the numbers of splice junctions detected by the alignment of 100 bp reads was significantly higher than with any other read lengths (t = 7.08, p value = 9.5 × 10−8). Also, single-end reads detected fewer splice junctions overall when compared with the paired-end reads. In summary, longer paired-end reads are significantly better for splice junction detection.
Differential expression at different read lengths
The top 200 DEGs for all read lengths for each differential expression method and each sample were sorted both by log2 based fold-change (Log2FC) of both up-regulated (+Log2FC) and down-regulated genes (−Log2FC) and by p value and compared. The number of orphan (read-length-specific) genes was then calculated. For single-end samples (Fig. 2a), 25 bp read lengths gave the highest percentage (average of 13.8 %) of orphan genes among all differential expression methods and samples. This shows that the differential expression profile calculated using the 25 bp read length is significantly different. Using paired-end 25 bp reads reduced this difference (Fig. 2b; down to an average of 5 %); however, among paired-end reads, 25 bp reads still gave the highest difference. The differences between 50 bp, 75 bp and 100 bp, for both paired-end and single-end reads, were small (0–12 % orphan genes).
The number of read lengths that support any specific DEG was calculated using the same overlap as above. For single-end reads (Fig. 2c), the percentage of DEGs supported by all four read lengths is fairly low. This percentage is significantly higher (t = −4.85, p value = 0.00015) when using paired-end reads (Fig. 2d), which suggests that paired-end reads are better for determining DEGs. Also, sorting on p value instead of Log2FC improves the overlap when using single-end reads and significantly improves the overlap when using paired-end reads.
We next looked at the lists of the top 200 DEGs identified for each read length to determine their overlap (Fig. 3c, d). These lists represent the main result of an RNA-seq experiment performed using a particular read length. Overlap between the lists of the top 200 DEGs sorted by Log2FC shows that single-end 25 bp reads perform the worst; however, this performance is nicely recovered when using paired-end 25 bp reads (average of 121.5 versus 137 genes, respectively). Most 100 bp reads performed similarly when comparing paired-end versus single-end reads. Therefore, the gain in performance is not significant when using reads >50 bp for paired-end reads and >75 bp for single-end reads.
Additionally, we looked at genes that are paralogous (n = 21,459) to investigate whether longer read lengths result in higher mapping of overall reads to these genes (Figure S6a in Additional file 1). We calculated the total number of reads mapped to these genes in the paired-end SEQC dataset of samples A and B. To our surprise, with the exception of 25 bp reads, all the read lengths resulted in equal numbers of reads aligned to the paralogous genes. The overall trend of number of reads mapped as a function of read length stays the same regardless of paralogous gene length (Figure S6b in Additional file 1).
To further confirm this result, we examined well-validated, “golden junctions”, which were found by all five sequencing platforms from the ABRF study on RNA-seq  (Figure S1 in Additional file 1). These data showed that 25 bp reads showed the lowest percentage overlap with golden junctions (9.6 %). However, read lengths longer than 25 bp all showed high overlap (97 %) and there was relatively little difference between read lengths. Therefore, read lengths of 50 bp or longer can still be used for detecting bona fide junctions.
Confirmation of results on ENCODE samples
In order to confirm our findings on an additional data set, we performed an identical analysis on RNA-seq data from NHDF and IMR90 cells obtained from an ENCODE study (Figures S3, S4, and S5 in Additional file 1). The only difference was that there were only two replicates available for each ENCODE sample whereas there were three replicates available for the SEQC samples. Additionally, there was no qPCR data available for the ENCODE samples. The results for SEQC samples were mainly replicated when using the ENCODE samples, with a consistent trend of little improvement over short reads’ utility for DEG detection but longer reads consistently improving isoform detection.
Confirmation of results using RSEM/EBseq method
We also aligned all the SEQC samples using RSEM  and then performed differential expression using EBseq . The differential expression tables were then analyzed to find the number of orphan reads and percentage overlap similar to in Fig. 2 (Figure S2 in Additional file 1). RSEM/EBseq showed higher overlap and less orphan genes between read lengths. However, the overall trend is still similar and thus the conclusions made using the STAR aligner still hold true.
When a researcher performs an RNA-seq experiment, he or she is confronted with the question of which read length to use and whether single-end reads are sufficient, or if they require the sequencing of both ends of their fragments. We have utilized two distinct datasets (SEQC and ENCODE) to provide insight into this critical aspect of experimental design. Our findings show that longer reads are not necessarily significantly better than shorter reads and an intermediate length of reads would provide adequate results for differential expression analysis.
We found that 25 bp reads are simply too short for effective use in most RNA-seq applications. They result in high numbers of multi-mapped reads and, depending upon the alignment protocol, may thus result in loss or inaccurate mapping of the data. This effect is even more pronounced when using 25 bp single-end reads. Also, they result in the highest number of orphan genes, which means the gene quantification accuracy may be questionable. Other than 25 bp reads, all other read lengths show similar accuracy and there is little gain in accuracy with longer read lengths. Paired-end reads improve accuracy of differential expression as shown by the qPCR comparison, although this gain is not significant. Paired-end reads are useful more at the alignment level than the differential expression level. They lower the percentage of multi-mapped reads and thus fewer reads are thrown out before quantification. They might also help in quantifying genes that have a high number of duplicated regions, as evidenced by our analysis of pseudogenes.
For an experiment whose goal is only to determine differential expression, the improvement beyond 50 bp single-end reads is not substantial and would not justify the added cost. This lack of improvement is due to the fact that there is not a considerable gain in the mappability of sequencing reads beyond 50 bp read lengths. This mapping lies at the heart of detecting differential expression when the reads from two conditions are mapped back to the reference genome and the number of reads mapping to a gene in each condition are compared (after normalization). Using 50 bp single-end reads costs half as much as using 100 bp paired-end reads (Table 1). Thus, the amount of money saved is substantial.
More troubling, though, is the lack of consistency of results between different read lengths. One would hope that genes that were detected as differentially expressed at one read length would also be detected as differentially expressed at a different read length since the experiment properly represented the expression of genes in the cell. We found this not to be the case and there was substantial variability in the lists of top DEGs between read lengths. This is likely due to a combination of splice junction overlap and gene annotation (e.g., short genes), as well as the usual factors inherent to library preparation, such as library size, RNA fragmentation, and GC content biases .
For splicing, we found contrasting results to those for DEGs. Since splicing detection inherently relies on sequence assembly, and longer sequences improve assembly, longer sequences led to superior splicing detection. This improvement was for the detection of both known and novel splice variants. Use of 100 bp reads not only resulted in many more splice junctions being detected than with smaller read lengths, but also the detection of many splice junctions that are specific to 100 bp reads only. These junctions would not have been detected if smaller read lengths were used. The percentage of common splice junctions between all four read lengths is very low, but improves if 25 bp read lengths are not considered. This suggests that 25 bp reads do not find most of the splice junctions detected by longer reads.
The answer for the ideal length of sequencing for RNA-seq depends on the desired results of the experiment. If only a list of DEGs is desired, then 50 bp single-end reads would be sufficient for most studies. In contrast, for splicing detection, our results suggest that the longest reads possible should be used, including using paired-end reads.
Materials and methods
The raw data for this analysis came from two samples from the ABRF SEQC study. The Universal Human Reference RNA (740000, Agilent Technologies) and Ambion FirstChoice Human Brain Reference RNA (AM6000, Life Technologies) were used as samples A and B, respectively, in the MicroArray Quality Control (MAQC) experiments initiated in 2005 and summarized in Nature Biotechnology in 2006 . These RNA samples are well characterized and were used as part of the SEQC study by the US Food and Drug Administration [Seqc/Maqc-III Consortium. "A comprehensive assessment of RNA-seq accuracy, reproducibility and information content by the Sequencing Quality Control Consortium." Nature biotechnology 32.9 (2014): 903-914.]
Three paired-end 100 bp replicates of sample A and three paired-end 100 bp replicates of sample B were selected and downloaded (Gene Expression Omnibus accession GSE47792). For confirmation of our results in an independent sample, we used two replicates each of paired-end 100 bp ENCODE RNA-seq samples for IMR90 and NHDF cells. The files were downloaded from the UCSC ENCODE repository  with sample names Nhdf70717012CellTotalFastqRd[1, 2]Rep1, Nhdf70717012CellTotalFastqRd[1, 2]Rep2, Imr90CellTotalFastqRd[1, 2]Rep1, and Imr90CellTotalFastqRd[1, 2]Rep2.
Sequencing data preprocessing
FASTQC  was run on all the samples to make sure there were no sequencing errors. Raw sequences were aligned to hg19 genome assembly (UCSC) with STAR RNA-seq aligner version 2.3.0e . In order to eliminate variability between sequencing runs, we used the same exact reads and trimmed them to lengths of 100, 75, 50 and 25 bp computationally and also separated the paired ends to only use the first read from each pair in order to simulate single-end sequencing.
RNA-seq differential gene expression analysis
Read counts were calculated using the BedTools’ intersectBed command  and the GENCODE v.17 annotation was used for gene calls. Sample A versus sample B and imr90 versus nhdf differential expression were run on all read counts with single-end and paired-end reads. Three methods of differential expression were used: Cufflinks v.2.1.1 , DESeq v.1.10.1  and EdgeR v.3.0.8 . Aligned BAM files were provided to Cuffdiff from the Cufflinks suite with default options. Raw read counts of uniquely aligned reads were used for DESeq and EdgeR. All read counts where the average read count from all three replicates was less than 10 were discarded before running analysis with DESeq and EdgeR. The differential expression tables were then analyzed using p values and log2FC cutoffs. This analysis was also performed using the aligner RSEM  and differential expression method EBseq . The gene IDs for paralogous genes were downloaded from Ensembl Biomart v.72 by selecting attributes, then homology, and then paralogs. The list was further filtered by homology type to only keep within_species_paralog.
PrimePCR RT-qPCR gene expression analysis
RT-qPCR data were only available for samples A and B. Undetectable Cq values (Cq > 35 or Cq = 0) were removed from data for samples A and B. The standard deviation of the Cq values for each gene was calculated, and the gene MYSM1 exhibited the lowest standard deviation. The data were normalized by subtracting the average Cq of MYSM1 from each PrimePCR target to give the log2 difference between the endogenous control and the target gene. The normalized Cq values were used to calculate Log2FC between A and B. This differential expression Log2FC was then used to calculate the Pearson and Spearman correlation to the RNAseq data using the cor() function from the R stats package. The RMSD was also calculated using the Log2FC of RNAseq and qPCR data.
Splice junction analysis
Non-canonical junction detection was turned off in order for the aligned files to be compatible with Cufflinks. All other parameters were left as default. Splice junction files produced by STAR were used for all splice junction analysis. The total number of splice junctions detected for each sample and read length was taken from the log.final.out file printed by the aligner STAR . SJ.out.tab is also a file printed by STAR, which contains all the junctions detected, and other metrics for each junction. STAR was run in novel splice junction detection mode and there was no splice junction database provided. These junctions were intersected using bedtools  and the Venn diagrams were created using VennDiagram package in R . Known junctions were annotated using 1 bp window of GENCODE v.17 splice junction file since STAR output is 1-based and GENCODE bed files are 0-based. Overlap plots were created using ggplot2 . The “golden junctions” are junctions found by all five platforms (454, Illumina, PGM, Proton, 454) .
This study was supported in part by NIH grants R01NS076465 and R44HG005297, as well as the Imra T. Hirschl Trust. Vallee, WorldQuant, and Pershing Square Foundations.
- 6.SEQC/MAQC-III Consortium. A comprehensive assessment of RNA-seq accuracy, reproducibility and information content by the Sequencing Quality Control Consortium. Nature biotechnology. 2014;32(9):903-14.Google Scholar
- 8.ENCODE RNA-seq samples for IMR90 and NHDF cells produced at Cold Spring Harbor Laboratory. http://hgdownload.cse.ucsc.edu/goldenpath/hg19/encodeDCC/wgEncodeCshlLongRnaSeq/.
- 16.ggplot2. http://ggplot2.org/.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.