Elimination of PCR duplicates in RNA-seq and small RNA-seq using unique molecular identifiers
RNA-seq and small RNA-seq are powerful, quantitative tools to study gene regulation and function. Common high-throughput sequencing methods rely on polymerase chain reaction (PCR) to expand the starting material, but not every molecule amplifies equally, causing some to be overrepresented. Unique molecular identifiers (UMIs) can be used to distinguish undesirable PCR duplicates derived from a single molecule and identical but biologically meaningful reads from different molecules.
We have incorporated UMIs into RNA-seq and small RNA-seq protocols and developed tools to analyze the resulting data. Our UMIs contain stretches of random nucleotides whose lengths sufficiently capture diverse molecule species in both RNA-seq and small RNA-seq libraries generated from mouse testis. Our approach yields high-quality data while allowing unique tagging of all molecules in high-depth libraries.
Using simulated and real datasets, we demonstrate that our methods increase the reproducibility of RNA-seq and small RNA-seq data. Notably, we find that the amount of starting material and sequencing depth, but not the number of PCR cycles, determine PCR duplicate frequency. Finally, we show that computational removal of PCR duplicates based only on their mapping coordinates introduces substantial bias into data analysis.
KeywordsRNA-seq Small RNA-seq Unique molecular identifier UMI PCR duplicates PCR cycle Starting material Sequencing depth Transcriptome Ribognome
Coefficient of variation
Polymerase chain reaction
Unique molecular identifier
High-throughput sequencing of long (> 100 nt) or small (18–50 nt) RNA provides a quantitative measure of RNA abundance. However, RNA-seq and small RNA-seq library construction can introduce bias at multiple steps, such as fragmentation of long RNAs, reverse transcription, adapter ligation, library amplification by PCR, and sequencing. Commonly used high-throughput sequencing platforms, including those made by Illumina and Pacific Biosciences, require PCR amplification during library construction to increase the number of cDNA molecules to an amount sufficient for sequencing. However, PCR stochastically introduces errors that can propagate to later cycles [1, 2]. PCR also amplifies different molecules with unequal probabilities . PCR duplicates are reads that are made from the same original cDNA molecule via PCR.
A common practice to eliminate PCR duplicates is to remove all but one read of identical sequences, assuming that such reads have been created from the same cDNA molecule by PCR . This assumption may be flawed, especially with ever higher sequencing throughput, which increases the chance of observing reads with identical sequences but from different cDNA molecules. The situation is further exacerbated for small genomes and for techniques that interrogate a subspace of the genome. For example, the majority of small RNA-seq reads are microRNAs (miRNAs) or PIWI-interacting RNAs (piRNAs), which derive from loci that amount to just a few percent of the genome [4, 5, 6, 7]. The assumption also has systematic biases. For example, a shorter gene is more likely to give rise to identical RNA-seq reads than a longer gene with the same transcript level, simply because the “genomic space” for the random process of RNA fragmentation is smaller for the shorter gene. Finally, the conventional definition of PCR duplicates is based on mapping coordinates—reads mapping to the exact same genomic location are considered to have identical sequences. However, many small RNAs with the same sequence can be produced from multiple genomic loci; thus, strategies using genome mapping to identify PCR duplicates ignore the situation that identical reads arise from distinct sites in the genome.
Standard library preparation and sequencing procedures typically have pre-specified PCR and sequencing error rates, but parameters such as the amount of starting RNA used to generate a library, the number of reads sequenced (i.e., sequencing depth), and the number of PCR cycles used are often adjusted to accommodate sample source, abundance, and quality. While the notion that more PCR amplification increases artefactual duplicate reads in high-throughput sequencing makes intuitive sense and is widely accepted, high PCR cycle numbers are often necessitated by scarce starting materials, another likely cause for duplicate reads. Thus, the contribution of PCR cycle number to PCR duplicates is often confounded with the contributions of starting materials and sequence depth.
Unique molecular identifiers (UMIs) are often used to accurately detect PCR duplicates and quantify transcript abundance [8, 9, 10, 11, 12, 13, 14, 15, 16]. If each molecule in the starting pool is barcoded with a UMI, i.e., all molecules are unique, then reads with the same UMI must be PCR duplicates. In practice, only the molecules in the starting pool that have identical sequences need to have different UMIs.
One strategy to incorporate UMIs introduces pre-defined, manually-selected sequences into the adapters. This strategy can avoid UMIs with suboptimal GC content and minimize complementarity between or within UMI sequences . Because UMI identities are unambiguously defined, sequencing and PCR errors can be easily corrected. However, implementing pre-defined UMIs requires a large number of costly, custom-synthesized oligonucleotides.
An alternative strategy employs adapters that contain random nucleotides at certain positions in the adapters. The combinations of the random-nucleotide positions lead to an exponential number of different UMIs at almost no extra cost, because incorporating a random nucleotide costs the same as incorporating a specific nucleotide during DNA synthesis. UMIs bearing either five (45 = 1,024 unique barcodes) or ten random nucleotides (410 = 1,048,576 unique barcodes) were implemented cost-effectively and shown to improve PCR duplicate removal [8, 14]. A higher number of unique combinations can be achieved simply by increasing the number of random-nucleotide positions. The number of UMI combinations must be sufficiently large because as mentioned above, the chance that two cDNA molecules with identical sequences in the starting pool are tagged with the same UMI combination needs to be infinitesimally small.
Here, we describe novel experimental protocols and computational methods to unambiguously identify PCR duplicates in RNA-seq and small RNA-seq data. We show that removing PCR duplicates using UMI information is accurate, whereas removing PCR duplicates without UMIs is overly aggressive, eliminating many biologically meaningful reads. Finally, we show that the amount of starting materials and sequencing depth determine the level of PCR duplicates, without additional contribution from the extent of PCR amplification.
Adapting standard RNA-seq procedures to incorporate UMIs
Adapting standard small RNA-seq protocol to incorporate UMIs
Previously, we established a reliable and robust small RNA-seq protocol by modifying a published method which utilizes oligonucleotides compatible with Illumina sequencing platforms [24, 25]. Compared to UMI RNA-seq, incorporation of UMIs into this small RNA-seq protocol requires additional considerations. First, the number of distinct UMI combinations needs to be significantly greater than what is required for RNA-seq. For example, millions of piRNA species—an abundant class of small RNAs in the animal germ line—can be routinely detected in a single individual, and it is estimated that there can be as many as 1 million distinct piRNA molecules in a single spermatocyte or round spermatid [4, 6, 26, 27]. The most abundant piRNA species in this study has 42,281 reads. In the soma, the most abundant miRNA can take up > 40% of the total sequencing depth  —tens of millions of reads. Such enormous abundance requires a sufficiently high number of UMI combinations to capture all distinct sequences. Second, the length range of small RNAs (< 50 nt) plus a longer UMI is still well within the read length achievable by common sequencing instruments. Third, the length of a small RNA is a defining feature of its identity and thus, insertions or deletions could lead to misclassification of small RNAs. The second and third considerations also indicate that small RNA-seq is ideally suited for the testing of a large combination of UMIs.
We tested UMIs containing 10 consecutive random nucleotides. Although both the 3′ and 5′ adapters containing 10-nt UMIs ligated to small RNAs with nearly the same efficiency as the original adapters without UMIs, the resulting small RNA-seq libraries yielded unexpectedly short, variable-length reads that contained truncated insert and adapter sequences (data not shown). We speculate that long stretches of random nucleotides interfere with oligonucleotide annealing, a critical step in cDNA synthesis, PCR, and sequencing, by increasing the chance that a primer anneals to a UMI instead of its target sequences. Inter- and intramolecular annealing of 10 nt UMIs may also contribute to truncated reads.
To avoid a long stretch of random nucleotides, we used the UMI locator strategy described above to space out several short stretches of random nucleotides. For each adapter, we designed three trinucleotide UMI sequences, each separated from another by a trinucleotide UMI locator (e.g., 5′–NNN-CGA-NNN-TAC-NNN–3′; Fig. 2a, b). Two adapters with such UMIs can produce a trillion combinations, which should suffice all deep-sequencing applications. Similar to our RNA-seq strategy, we designed adapters with two different sets of UMI locator sequences—mixed at equimolar—to increase the sequence complexity in the early sequencing cycles. This strategy allowed us to successfully generate and sequence the UMI small RNA-seq libraries, unambiguously locate UMIs, and computationally remove reads containing insertions or deletions in UMIs due to reverse transcription, PCR, and sequencing errors (Fig. 2b). We tested our method using total RNAs extracted from mouse testes isolated 17.5 days after birth. To assess the impact of the amount of starting materials on PCR duplicates, we prepared small RNA-seq libraries using a range of 39–5,000 ng RNAs made from serial dilution. To test the effect of PCR cycles, we gradually increased the PCR cycles for each library with a two-cycle increment. The resulting UMI small RNA-seq libraries yielded high-quality sequencing data, comparable to those generated with the original non-UMI protocol (Additional file 1).
Diverse UMIs capture all read species in RNA-seq and small RNA-seq
As mentioned above, to accurately identify PCR duplicates using UMIs, it is critical that the number of distinct UMIs far exceeds the maximal number of starting molecules with identical sequences, such that these molecules have an infinitesimal probability of being ligated to adapters with the same UMI. Previous UMI methods were designed for sequencing single cells or an organism with a less complex transcriptome than mammals [10, 13]. In particular, testis has a higher-complexity transcriptome than many other tissues such as muscle, liver, and even brain , demanding a large number of UMI combinations. Our UMI RNA-seq protocol theoretically provides ~ 1 million (410) distinct combinations, and we asked whether this diversity far exceeded the maximal number of reads with identical sequences in our libraries. Indeed, the transcripts derived from the 299-bp 7S RNA 1 gene produce 19,271 identical reads mapping to the same genomic coordinate, all of which are attached to distinct UMI sequences, indicating that all of these reads were from different starting RNA molecules. In conclusion, our UMI RNA-seq protocol is more than sufficient to disambiguate biologically identical reads from PCR duplicates.
Our UMI small RNA-seq provides an even higher number of possible combinations with 18 nt UMIs—68.7 billion (418)—much larger than the number of reads currently produced by a sequencing run. The most abundant small RNA species in our datasets is a piRNA with 42,281 reads, far fewer than the number of UMI combinations our protocol provides. We conclude that the UMI lengths used in the RNA-seq and small RNA-seq protocols contain a sufficient UMI diversity for current and, most likely, future sequencing experiments.
Error-correction for UMIs only slightly improves PCR duplicate identification
To test whether UMIs could accurately identify PCR duplicates, we first evaluated their performance using simulated data. Assuming a library has sufficiently diverse UMI sequences, the simplest way to determine biologically identical reads is to look for reads with the same sequence but are tagged by different UMIs. This approach assumes that there is no error in the replication or reading of the UMI sequences, since such errors could render identical UMI sequences different and vice versa, causing misidentification of PCR duplicates. UMI errors could occur during PCR sequencing, and computationally correcting these errors has been shown to improve identification of PCR duplicates [14, 16, 30, 31, 32].
We first assumed that there was no error in UMIs (Fig. 3a) and found that on average, (estimate – truth)/truth = 2.10% across 10,000 trials under the baseline condition. Thus, without performing UMI error correction, we slightly overestimated the total number of biological molecules as an error in a UMI would artificially create an extra UMI, and in turn, we slightly underestimated the fraction of PCR duplicates (red vs gray lines in Fig. 4; Additional file 2: Figure S1). Next, we used the UMI graph approach described above (Fig. 3a, b) for correcting errors in UMIs, and the new average of (estimate – truth)/truth = 0.0388%. Even though correcting UMI errors consistently gives better (estimate – truth)/truth than not correcting the errors, the absolute difference in the fractions of PCR duplicates between the two approaches is small (Fig. 4; Additional file 2: Figure S1). For example, under the baseline condition, the true fraction of duplicates was 37.8 ± 3.2%; without correcting UMI errors yielded 36.5 ± 3.3%, and correcting UMI errors gave 37.8 ± 3.2%.
Under some extreme conditions, correcting UMI errors yields substantially better results. For example, if we modify PCR error rate in the baseline condition from the default 3 × 10− 5 [33, 34, 35] to 10− 3, correcting UMI errors still yields a fraction of duplicates (37.2 ± 3.2%) very close to the truth (37.2 ± 3.1%), while not correcting the errors underestimates the fraction of duplicates (32.1 ± 3.5%). In conclusion, error-correction for UMIs consistently, albeit slightly, improves PCR duplicate identification. Therefore, we performed error correction for all following analyses.
Removing PCR duplicates without using UMIs is fundamentally flawed
Does the practice of removing PCR duplicates without UMIs improve the quantification of both long and short transcripts and in particular, of small RNAs such as microRNAs or piRNAs, which collectively originate from a small portion of the genome? We compared PCR duplicate identification using UMIs together with mapping coordinates of the reads to the conventional approach of using coordinates alone.
When only mapping coordinates were used (RNA-seq data from eight mouse tissues) (Additional file 1), 16.4–44.5% RNA-seq reads were determined to be PCR duplicates, whereas using UMI information in conjunction with coordinates identified only 1.89–10.67% as duplicates. That is, the majority of reads mapping to identical coordinates were in fact not PCR duplicates but rather from distinct starting molecules that should be counted for transcript abundance. The situation is even worse for small RNA-seq data, when only small RNA sequences were used, the majority (56.0–76.8%) of reads were flagged as PCR duplicates and therefore excluded from analysis. In contrast, when UMI information was used together with the sequences of reads, just 1.05–13.6% of reads were determined to be duplicates. Thus, most of the identical reads in RNA-seq and small RNA-seq are biologically real and not PCR duplicates, consistent with the view that small RNAs, which tend to come from precisely the same small genomic regions, can easily be mistaken for PCR duplicates when UMI information is not used. Moreover, the assumption that common mapping coordinates indicate PCR duplicates becomes increasingly problematic as sequencing depth increases, because the chance of observing two identical reads that legitimately derive from different molecules before PCR also increases.
UMIs improve data reproducibility
One metric for evaluating the quality of experimental data is the reproducibility between technical replicates. We evaluated how UMIs affect the reproducibility of transcript quantification using five libraries generated using the same sample of total mouse testis RNA, but with gradually decreasing amounts of starting RNA and correspondingly increasing numbers of PCR cycles: 4 μg (8 PCR cycles), 2 μg (9 PCR cycles), 1 μg (10 PCR cycles), 500 ng (11 PCR cycles), 125 ng (13 PCR cycles) (Additional file 1). We then analyzed the data sets treating PCR duplicates using one of three approaches: (1) no PCR duplicates were removed; (2) PCR duplicates were removed using the conventional approach of identical genomic locations; and (3) PCR duplicates were removed using UMIs together with mapping coordinates. We compared the three approaches by calculating coefficients of variation (CV = S.D./mean) for transcript abundance across the five RNA-seq libraries. Compared to removing no duplicates, removing duplicates according to their mapping coordinates decreased the total CV by 5.80% (from 4,210 to 3,960), while using UMIs with mapping coordinates decreased the total CV by 6.67% (from 4,210 to 3,930) (Fig. 5c). For example, when two RNA-seq libraries (125 ng with 12 PCR cycles and 1 μg with 10 PCR cycles) were compared, the number of transcripts whose abundance differed by ≥25% decreased when duplicates were removed (1,880 without duplicate removal, 1,503 removing duplicates by genomic coordinates, and 1,415 removing duplicates using UMIs). We conclude that removing PCR duplicates in RNA-seq data using mapping coordinates alone improves the overall precision of transcript quantification for all genes, and removing PCR duplicates by combining coordinates and UMIs further improves the precision, albeit slightly.
Next, we evaluated the performance of these three approaches for a series of small RNA-seq libraries (starting material 39–5,000 ng). Compared to removing no duplicates, using UMIs to remove duplicates decreased the total CV by 8.72% (Fig. 5c). Surprisingly, removing duplicates according to their mapping coordinates alone increased CV by 79.1% (from 6,490 to 11,620) (Fig. 5c). For example, between two small RNA-seq libraries in this series, one generated from 150 ng and the other from 1 μg of the same total RNA sample, genomic loci (piRNA genes and GENCODE-annotated genes) whose small RNA abundance differed by ≥25% decreased 8.30% when duplicates were removed using UMIs (from 2,613 to 2,396 genes). In contrast, when duplicates were removed using solely mapping coordinates, the number of such irreproducible genes increased by 159% (6,762 genes). These results show that removing PCR duplicates with UMIs leads to more consistent quantification across libraries, whereas removing duplicates without UMIs is overly aggressive and decreases the reproducibility of small RNA-seq experiments.
PCR cycles alone do not determine the frequency of PCR duplicates
One might think that the number of PCR cycles used to amplify the initial cDNA is the major cause of PCR duplicates in sequencing libraries . We sought to test this and to identify other experimental contributing factors. As described above, we performed computer simulations to test the impact of UMI error correction on PCR duplicate detection. We considered seven parameters that could impact the level of PCR duplicates during an RNA-seq or small RNA-seq experiment. Assuming that we have performed UMI error correction, we now examine in detail these seven parameters for their impact on the level of PCR duplicates.
Four of the parameters—PCR amplification efficiency, PCR error rate, sequencing error rate, and UMI length—are specified by the experimental reagents and sequencing platform and typically not adjusted from experiment to experiment. Our simulation results indicate that varying the sequencing error rate, the PCR error rate, or the UMI length around their default values in the baseline condition (i.e., within the ranges stipulated by experimental settings) did not have a significant effect on the fraction of PCR duplicates (the blue line is flat around the dashed vertical line in Additional file 2: Figure S1A–C, top panels). In comparison, PCR efficiency had a measurable effect (the blue line in the top panel of Additional file 2: Figure S1D reveals a negative correlation with PCR efficiency). This is because that at lower PCR efficiency, some molecules are less likely to be amplified and become underrepresented, causing a decrease in library complexity and correspondingly higher fractions of PCR duplicates.
The other three parameters—the number of initial molecules, the number of molecules sequenced (i.e., sequencing depth), and the number of PCR cycles—are often adjusted to meet specific experimental conditions. Our simulations revealed that a change in PCR cycle number alone only minimally affected the fraction of PCR duplicates (the blue line in the top-left panel of Fig. 4 is nearly flat around the dashed vertical line), because the starting molecules of the original pool are proportionally propagated to the final library . In contrast, decreasing the number of initial molecules or increasing the number of molecules sequenced sharply raised the frequency of PCR duplicates (Fig. 4, two top-right panels).
Simulations argue that the increase in PCR duplicates is not a consequence of greater PCR amplification but rather is caused by the use of lower starting material. To test this idea, we analyzed a second set of nine UMI small RNA-seq libraries, all generated from 5 μg total RNA from the same mouse testis, but amplified using 14 to 30 PCR cycles (Additional file 1). Consistent with the simulations, these libraries did not show a discernable trend between fraction of PCR duplicates and the number of PCR cycles (Fig. 6b). Thus, the higher fraction of PCR duplicates observed in libraries made from low amounts of RNA followed by high PCR cycle numbers more likely reflects the reduced complexity of the starting pool, rather than the increased number of PCR cycles. Together, our simulated and experimental data demonstrate that less starting RNA or higher sequencing depth, but not more PCR cycles per se, accounts for the frequency of PCR duplicates. Nonetheless, our recommendation remains to optimize PCR cycle numbers during library preparation because several other artifacts are known to be associated with excessive PCR amplification, such as decreasing amplification efficiency as PCR reaction proceeds , biased amplification of sequences with different GC content , or chimeric PCR products due to cross hybridization [41, 42]. Moreover, when the starting material has low complexity, as for single-cell sequencing, excessive PCR amplification is likely to compound all such problems.
The two most widely used computational tools for PCR duplicate removal, Picard MarkDuplicates (http://broadinstitute.github.io/picard/) and SAMtools rmdup  rely only on the mapping coordinates of sequencing reads. Our data suggest that most identical reads reflect biological reality. Thus, removing PCR duplicate reads using only mapping coordinates erroneously eliminates many usable reads, particularly those produced from short transcripts and small RNAs. This is consistent with an earlier study that compared RNA-seq data generated using three different library preparation methods (Smart-Seq, TruSeq and UMI-seq)  and reported that removing PCR duplicates without UMIs introduced bias into RNA-seq data. There are also computational methods that assess PCR duplicate rates without UMIs, e.g., using reads mapping to heterozygous variants  or gene expression levels , but they only estimate duplicate frequencies and cannot accurately identify duplicate reads. Our approach relies on UMIs and can accurately identify all PCR duplicates, regardless of gene expression and genomic variants.
The eight mouse tissues we analyzed span a range of transcriptome complexity: previous analyses showed that the mouse testis transcriptome contains ~ 18,700 autosomal protein-coding transcripts, ~ 8,600 non-coding RNAs, and ~ 31.7 Mb of intergenic RNA, while the liver transcriptome contains only ~ 15,500 autosomal protein-coding transcripts, ~ 1,000 non-coding RNAs, and ~ 7.2 Mb of intergenic RNA . Among the eight mouse tissues we tested, removing duplicate reads based on only mapping coordinates eliminates many biologically meaningful reads even when the libraries were made using ample starting RNA and optimal experimental conditions. Given the anti-correlation between RNA complexity and PCR duplicate occurrence, UMIs will improve the accuracy of comparing long or small RNA abundance across different tissues or cell types. Short RNAs, such as miRNAs and piRNAs, as well as highly abundant transcripts are particularly susceptible to underestimation by the conventional mapping coordinate method of PCR duplicate removal.
We tested the importance of a key aspect of data processing required for error correction using UMIs and showed that under typical experimental conditions for bulk sequencing (Fig. 4, dotted lines; Additional file 2: Figure S1), correcting or not correcting errors in the UMI sequences has little absolute effect on PCR duplicate quantification. However, sequencing libraries made from a small number of cells or from a small amount of tissue or RNA, have become increasingly common , and they are more severely affected by PCR duplicates. Single-cell sequencing poses three specific challenges for PCR duplicate removal. First, it uses a limited amount of starting RNA, causing low library complexity. Second, the ongoing discovery of new species of non-coding RNAs, many poorly understood, increases the number of species being measured, requiring longer UMIs. Finally, the increasingly high sequencing depth provided by advances in technology increases both the number of species that can be detected and the background noise. Together, these three factors make PCR duplicate measurement without UMI error correction especially problematic for single-cell sequencing. Indeed, a number of protocols have been developed to incorporate UMIs for single-cell sequencing . We compile all UMI methods known to us in Additional file 3, including bulk and single-cell, RNA-seq and small RNA-seq protocols.
Compared to other UMI protocols (Additional file 3), our method has several novel features. First, our UMI RNA-seq protocol fragments the RNA before random hexamer priming of cDNA synthesis. It captures both poly(A) + and poly(A) − RNA while limiting 3′-end bias. In comparison, template-switching cDNA synthesis is commonly used in single-cell sequencing protocols; it captures full length cDNAs more efficiently but is also prone to significant 3′-end bias and loss of poly(A) − RNA species due to oligo(dT) priming [14, 48, 49]. Second, our UMI RNA-seq protocol has yielded high-quality data from 125 ng starting material and our UMI small RNA-seq protocol from 39 ng. Our protocols do not require pre-amplification and perform fragmentation prior to cDNA synthesis. Consequently, these methods capture all PCR duplicates. Third, our protocol can accommodate a large number of UMIs: random sequences with 5✕2 nt for RNA-seq and 9✕2 nt for small RNA-seq. Thus, our protocol can be used to study RNA with high sequence diversity: e.g. piRNA in the germline. In our hands long stretches of random nucleotides interferes with library construction and sequencing, whereas our use and placement of UMI locators improves sequencing quality. With the increasing use of small amounts of starting materials, our protocol complements other approaches (Additional file 3).
Our UMI approach builds on well-established protocols, requiring few changes in the procedures and little additional cost. We expect UMI analysis to be particularly useful when sequencing RNAs derived from a limited number of genomic loci, such as CaptureSeq  and CAGE-seq . Our approach can theoretically be adapted to any sequencing technique using synthetic oligonucleotide adapters. For example, chromatin immunoprecipitation sequencing (ChIP-seq) and its alternative CUT&RUN survey the genomic regions bound by proteins of interest [52, 53]. The CUT&RUN method uses a nuclease to achieve more precise chromatin cleavage than the conventional ChIP-seq procedure, which utilizes sonication to randomly shear the DNA. Consequently, CUT&RUN increases the likelihood of identical but meaningful reads. By their nature, protein-bound fragments also map to a smaller portion of genomic positions than RNA-seq reads. UMIs can improve discovery of protein binding sites by minimizing noise. Similarly, degradome sequencing profiles the 5′ ends of 3′ cleaved RNA products ; incorporating UMIs will enable precise quantification of cleaved RNA abundance.
We described experimental protocols and computational methods that, by incorporating UMIs into standard procedures, allow accurate PCR duplicate removal from RNA-seq and small RNA-seq data. Our approach increases reproducibility and decreases noise in sequencing libraries generated using a broad range of starting RNA amount and number of PCR cycles, enabling accurate quantification of the abundance of both long and short RNAs.
C57BL/6 J mice were sacrificed using CO2 and cervical dislocation and tissues were collected from mice maintained and used according to guidelines approved by the Institutional Animal Care and Use Committee of the University of Massachusetts Medical School (A-2222-17).
Ribosomal RNA depletion for RNA-seq
Total RNA was extracted from tissues using the mirVana kit (ThermoFisher Scientific, Waltham, MA, USA) following manufacturer’s instructions. The ribosomal RNA depletion method was adapted from previously published protocols for human samples [55, 56]. One hundred and eighty-six 50 nt-long DNA oligonucleotides complementary to the entire sequences of mouse 18S, 28S, 5S, and 5.8S rRNAs, and mitochondria 16S rRNA and 16S rRNA precursor were used at 0.5 μM (f.c.) for each oligonucleotide. Total mouse testis RNA was isolated with mirVana and incubated with 1 μL of the DNA oligonucleotide mixture per 1 μg, and rRNA oligonucleotide hybridization buffer (100 mM Tris-Cl pH 7.4, 200 mM NaCl) was added to make up to 10 μL. Oligonucleotide hybridization was carried out by heating the sample at 95 °C for 3 min, then slowly cooling it down (− 0.1 °C/second) to 22 °C in a thermocycler. The reaction was further incubated at 22 °C for 5 min before being placed on ice. Thermostable RNase H (Lucigen, Middleton, MA) was added (5 U/μg total RNA), and the reaction adjusted to 50 mM Tris-Cl pH 7.4, 100 mM NaCl, and 20 mM MgCl2. and incubated at 45 °C for 30 min. After DNase treatment with Turbo DNase (1 μL/μg total RNA) according to the manufacturer’s instructions, the rRNA-depleted RNA was purified using RNA Clean & Concentrator-5 (Zymo Research, Irvine, CA, USA).
RNA-seq library construction
RNA-seq library generation was similar to previously published , except for the use of a more effective rRNA deletion method (above) and the UMI adapters. Briefly, total RNA was randomly fragmented using heat (94 °C for 5 min) and Mg2+. First strand cDNA was generated using ribosomal-depleted, fragmented total RNA with Superscript III (Invitrogen, Carlsbad, CA, USA). cDNA then underwent dUTP incorporation, end repair, and A-tailing. Resulting cDNA was incubated with a mixture of three sets of UMI-containing adapters, each carrying a distinct consensus sequence as described in the results for adapter ligation (Fig. 1b–c). The ligated cDNA was treated with UDG (New England Biolabs, Ipswich, MA, USA) and amplified by PCR. Length distribution and quality of the resulting library was analyzed by Agilent 2100 Bioanalyzer (Agilent Genomics, Santa Clara, MA, USA). The libraries were quantified using KAPA library quantification kit (KAPA Biosystems, Wilmington, MA, USA) and sequenced using NextSeq 500 (Illumina, San Diego, CA, USA) paired-end sequencing.
Small RNA-seq library construction
Small RNA-seq library preparation was similar to previously described [24, 25] with updated modifications. Briefly, 18–35 nt mouse small RNA was purified from a 15% denaturing urea-polyacrylamide gel (National Diagnostics, Atlanta, GA, USA). Small RNA was first ligated to 3′ DNA adapters with adenylated 5′ and dideoxycytosine-blocked 3′ ends in 50 mM Tris-HCl (pH 7.5), 10 mM MgCl2, 10 mM DTT, and 50% PEG 8000 (New England Biolabs, Ipswich, MA, USA) with T4 Rnl2tr K227Q (homemade) at 25 °C for 16 h. The 3′ adapter contained UMIs in 3 nt-blocks of random nucleotides separated by pre-defined 3 nt consensus sequences (NNN-GTC-NNN-TAG-NNN, Fig. 2b). The 54–71 nt (18–35 nt small RNA + 36 nt 3′ UMI adapter) 3′ ligated product was purified from a 10% denaturing urea-polyacrylamide gel. The 3′ ligated product was then ligated to a mixed pool of equimolar amount of 5′ RNA adapters containing UMIs in 3 nt-blocks of random nucleotides and one of the two distinct consensus sequence sets (NNN-CGA-NNN-UAC-NNN and NNN-AUC-NNN-AGU-NNN) in 50 mM Tris-HCl (pH 7.8), 10 mM MgCl2, 10 mM DTT, 1 mM ATP with T4 RNA ligase (Ambion, Foster City, CA, USA) at 25 °C for 2 h. The ligated product was precipitated with ethanol, and cDNA synthesis was performed using AMV reverse transcriptase (New England Biolabs, Ipswich, MA, USA). cDNA was PCR-amplified with a common forward primer (5′–AAT GAT ACG GCG ACC ACC GAC AGG TTC AGA GTT CTA CAG TCC GA-3′) and a reverse primer containing 6 nt Illumina multiplexing barcode (5′–CAA GCA GAA GAC GGC ATA CGA GAT NNN NNN GTG ACT GGA GTT CCT TGG CAC CCG AGA ATT CCA–3′) using AccuPrime Pfx DNA polymerase (ThermoFisher, Waltham, MA, USA). Finally, the PCR product was purified from a 2% Certified Ultra Low Range agarose gel (Bio-Rad Laboratories, Hercules, CA, USA). Length distribution and quality of the resulting libraries was analyzed by Agilent 2100 Bioanalyzer (Agilent Technologies, Santa Clara, CA, USA). The libraries were quantified using the KAPA library quantification kit (KAPA Biosystems, Wilmington, MA, USA) and sequenced using NextSeq 500 single-end sequencing (Illumina, San Diego, CA, USA).
Simulation procedure was performed according to . Briefly, we simulated 7 parameters: PCR amplification probability, PCR and sequencing error rates, UMI length, number of initial molecules, number of sequenced molecules, and PCR cycle numbers, by varying one parameter and keeping other parameters constant. For each combination of the 7 parameters, we performed 10,000 replicates. UMI error correction was implemented as described in , except that we used read sequences instead of genomic coordinates when determining PCR duplicates for small RNA-seq. We used NetworkX  for graph-related algorithms, and pysam (https://github.com/pysam-developers/pysam) for handling SAM/BAM files. Reads were mapped to the mouse mm10 genome as described in . For each UMI RNA-seq read, the UMI locator was required to be a perfect match of GGG, TCA, or ATC. For each UMI small RNA-seq read, the 5′ UMI locator was required to match NNNCGANNNTACNNN or NNNATCNNNAGTNNN and the 3′ UMI locator was required to match NNNGTCNNNTAGNNN, allowing 1 error across all of the non-N positions. Reads not meeting these requirements were discarded. UMIs in RNA-seq data were identified using umitools reformat_fastq. Subsequently, PCR duplicates were marked using umitools mark_duplicates. UMIs in small RNA-seq were identified and removed using umitools reformat_sra_fastq. When reads were analyzed without UMIs, PCR duplicates were identified using MarkDuplicates from Picard (https://github.com/broadinstitute/picard).
We thank members of the Weng and Zamore laboratories for helpful discussions and comments on the manuscript.
This work was supported in part by National Institutes of Health grants P01HD078253 and R37GM062862 to PDZ and HD078253 to ZW. The funding agency had no role in the design of the study and collection, analysis, and interpretation of data and in writing the manuscript.
Availability of data and materials
The tools developed for handling UMIs in our RNA-seq and small RNA-seq data can be found at https://github.com/weng-lab/umitools, and via PyPI (package: umitools). RNA-seq and small RNA-seq data have been deposited in the NCBI SRA under the accession number PRJNA416930.
All authors have read and approved the final version of the manuscript. YF, Bioinformatics analysis, Writing; PHW, Protocol design, RNA-seq, small RNA-seq, Writing; TB, Small RNA-seq; ZW, Project design and management, Writing; PDZ, Project conception, design and management, Writing.
The Institutional Animal Care and Use Committee of the University of Massachusetts Medical School (A-2222-17) provided official ethics board approval for this study. Tissues used in this study were collected from C57BL/6 J mice (Jackson Laboratory, stock number 664) maintained and used according to the approved protocols.
Consent for publication
The authors declare that they have no competing interests.
- 5.Aravin A, Gaidatzis D, Pfeffer S, Lagos-Quintana M, Landgraf P, Iovino N, Morris P, Brownstein MJ, Kuramochi-Miyagawa S, Nakano T, Chien M, Russo JJ, Ju J, Sheridan R, Sander C, Zavolan M, Tuschl T. A novel class of small RNAs bind to MILI protein in mouse testes. Nature. 2006;442:203–7.PubMedGoogle Scholar
- 6.Girard A, Sachidanandam R, Hannon GJ, Carmell MA. A germline-specific class of small RNAs binds mammalian Piwi proteins. Nature. 2006;442:199–202.Google Scholar
- 12.Collins JE, Wali N, Sealy IM, Morris JA, White RJ, Leonard SR, Jackson DK, Jones MC, Smerdon NC, Zamora J, Dooley CM, Carruthers SN, Barrett JC, Stemple DL, Busch-Nentwich EM. High-throughput and quantitative genome-wide messenger RNA sequencing for molecular phenotyping. BMC Genomics. 2015;16:578.CrossRefPubMedPubMedCentralGoogle Scholar
- 18.Mohn F, Sienski G, Handler D, Brennecke J. The rhino-deadlock-cutoff complex licenses noncanonical transcription of dual-strand piRNA clusters in drosophila. Cell. 2014;157:1364-79.Google Scholar
- 22.Illumina: Illumina: Low-Diversity Sequencing on the Illumina HiSeq® Platform. Technical Note: DNA Sequencing 2014,Google Scholar
- 23.Illumina: Illumina: NextSeq® 500 System Guide. Technical Note: DNA Sequencing 2016,Google Scholar
- 25.Li C, Vagin VV, Lee S, Xu J, Ma S, Xi H, Seitz H, Horwich MD, Syrzycka M, Honda BM, Kittler EL, Zapp ML, Klattenhoff C, Schulz N, Theurkauf WE, Weng Z, Zamore PD. Collapse of germline piRNAs in the absence of Argonaute3 reveals somatic piRNAs in flies. Cell. 2009;137:509–21.CrossRefPubMedPubMedCentralGoogle Scholar
- 27.Houwing S, Kamminga LM, Berezikov E, Cronembold D, Girard A, van den Elst H, Filippov DV, Blaser H, Raz E, Moens CB, Plasterk RH, Hannon GJ, Draper BW, Ketting RF. A role for Piwi and piRNAs in germ cell maintenance and transposon silencing in Zebrafish. Cell. 2007;129:69–82.CrossRefPubMedGoogle Scholar
- 28.Yue F, Cheng Y, Breschi A, Vierstra J, Wu W, Ryba T, Sandstrom R, Ma Z, Davis C, Pope BD, Shen Y, Pervouchine DD, Djebali S, Thurman RE, Kaul R, Rynes E, Kirilusha A, Marinov GK, Williams BA, Trout D, Amrhein H, Fisher-Aylor K, Antoshechkin I, DeSalvo G, See LH, Fastuca M, Drenkow J, Zaleski C, Dobin A, Prieto P, Lagarde J, Bussotti G, Tanzer A, Denas O, Li K, Bender MA, Zhang M, Byron R, Groudine MT, McCleary D, Pham L, Ye Z, Kuan S, Edsall L, Wu YC, Rasmussen MD, Bansal MS, Kellis M, Keller CA, Morrissey CS, Mishra T, Jain D, Dogan N, Harris RS, Cayting P, Kawli T, Boyle AP, Euskirchen G, Kundaje A, Lin S, Lin Y, Jansen C, Malladi VS, Cline MS, Erickson DT, Kirkup VM, Learned K, Sloan CA, Rosenbloom KR, Lacerda de Sousa B, Beal K, Pignatelli M, Flicek P, Lian J, Kahveci T, Lee D, Kent WJ, Ramalho Santos M, Herrero J, Notredame C, Johnson A, Vong S, Lee K, Bates D, Neri F, Diegel M, Canfield T, Sabo PJ, Wilken MS, Reh TA, Giste E, Shafer A, Kutyavin T, Haugen E, Dunn D, Reynolds AP, Neph S, Humbert R, Hansen RS, De Bruijn M, Selleri L, Rudensky A, Josefowicz S, Samstein R, Eichler EE, Orkin SH, Levasseur D, Papayannopoulou T, Chang KH, Skoultchi A, Gosh S, Disteche C, Treuting P, Wang Y, Weiss MJ, Blobel GA, Cao X, Zhong S, Wang T, Good PJ, Lowdon RF, Adams LB, Zhou XQ, Pazin MJ, Feingold EA, Wold B, Taylor J, Mortazavi A, Weissman SM, Stamatoyannopoulos JA, Snyder MP, Guigo R, Gingeras TR, Gilbert DM, Hardison RC, Beer MA, Ren B, Mouse ENCODEC. A comparative encyclopedia of DNA elements in the mouse genome. Nature. 2014;515:355–64.CrossRefPubMedPubMedCentralGoogle Scholar
- 31.Macosko EZ, Basu A, Satija R, Nemesh J, Shekhar K, Goldman M, Tirosh I, Bialas AR, Kamitaki N, Martersteck EM, Trombetta JJ, Weitz DA, Sanes JR, Shalek AK, Regev A, McCarroll SA. Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell. 2015;161:1202–14.CrossRefPubMedPubMedCentralGoogle Scholar
- 51.Carninci P, Sandelin A, Lenhard B, Katayama S, Shimokawa K, Ponjavic J, Semple CA, Taylor MS, Engström PG, Frith MC, Forrest AR, Alkema WB, Tan SL, Plessy C, Kodzius R, Ravasi T, Kasukawa T, Fukuda S, Kanamori-Katayama M, Kitazume Y, Kawaji H, Kai C, Nakamura M, Konno H, Nakano K, Mottagui-Tabar S, Arner P, Chesi A, Gustincich S, Persichetti F, Suzuki H, Grimmond SM, Wells CA, Orlando V, Wahlestedt C, Liu ET, Harbers M, Kawai J, Bajic VB, Hume DA, Hayashizaki Y. Genome-wide analysis of mammalian promoter architecture and evolution. Nat Genet. 2006;38:626–35.CrossRefPubMedGoogle Scholar
- 53.Skene PJ, Henikoff S. An efficient targeted nuclease strategy for high-resolution mapping of DNA binding sites. eLife. 2017:e21856.Google Scholar
- 56.Adiconis X, Borges-Rivera D, Satija R, DeLuca DS, Busby MA, Berlin AM, Sivachenko A, Thompson DA, Wysoker A, Fennell T, Gnirke A, Pochet N, Regev A, Levin JZ. Comparative analysis of RNA sequencing methods for degraded or low-input samples. Nat Methods. 2013;10:623–9.CrossRefPubMedPubMedCentralGoogle Scholar
- 57.Hagberg A, Swart P, Chult DS. Exploring network structure, dynamics, and function using NetworkX, Proceedings of the 7th Python in Science Conference; 2008. p. 11–6.Google Scholar
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.