ReorientExpress: reference-free orientation of nanopore cDNA reads with deep learning
We describe ReorientExpress, a method to perform reference-free orientation of transcriptomic long sequencing reads. ReorientExpress uses deep learning to correctly predict the orientation of the majority of reads, in particular when trained on a closely related species or in combination with read clustering. ReorientExpress enables long-read transcriptomics in non-model organisms and in samples without a genome reference, without requiring additional technologies, and is available at https://github.com/comprna/reorientexpress.
Long-read sequencing technologies allow the systematic interrogation of transcriptomes from any species. However, functional characterization requires knowledge of the correct 5′-to-3′ orientation of reads. Oxford Nanopore Technologies (ONT) allows the direct measurement of RNA molecules in the native orientation, but the sequencing of complementary-DNA (cDNA) libraries generally yields a larger number of reads [1, 2]. Although strand-specific adapters can be used, error rates hinder their correct detection. Current methods to analyze nanopore transcriptomic reads rely on the comparison to a genome or transcriptome reference [2, 3] or on the use of additional technologies such as in “hybrid sequencing” that employs long- and short-read data, which limits the applicability of rapid and cost-effective long-read sequencing for transcriptomics beyond model species. To facilitate the de novo interrogation of transcriptomes in species or samples for which a genome or transcriptome reference is not available, we have developed ReorientExpress, a new tool to perform reference-free orientation of ONT reads from a cDNA library. ReorientExpress uses deep neural networks (DNNs) to predict the orientation of cDNA long reads independently of adapters and without using a reference. ReorientExpress correctly predicts the orientation of the majority of cDNA reads, in particular when trained on a related species or in combination with read clustering, thereby enabling the reference-free characterization of transcriptomes.
Sequence-based prediction of read orientation
To demonstrate the suitability of ReorientExpress to predict the orientation of cDNA reads from samples without a genome or transcriptome reference available, we mimicked this situation by building DNN models in one species and testing them on a related species. We thus trained an MLP model (k = 1,…,5) with the mouse transcriptome. This model tested on human ONT cDNA reads showed a precision of 0.79 and recall of 0.71, which is comparable to the MLP model trained on human data (Fig. 2a) (Additional file 1: Table S4). Interestingly, this model showed a higher accuracy (precision and recall = 0.87) when tested on human DRS reads as compared to human cDNA reads (Additional file 1: Table S4). We also trained an MLP model (k = 1,…,5) with the transcriptome annotation for Candida glabrata and tested it on S. cerevisiae ONT cDNA reads. This model yielded accuracy values as high as for the previous S. cerevisiae model (precision and recall = 0.94) (Fig. 2b) (Additional file 1: Table S4). As observed before for S. cerevisiae DRS reads, the model accuracy dropped when tested on DRS reads (precision and recall = 0.87) (Additional file 1: Table S4). We obtained similar results for the cross-species comparisons with the CNN model, with an improvement in accuracy for the mouse model applied to human DRS reads, and a drop for the C. glabrata model applied to S. cerevisiae DRS reads (Additional file 1: Table S4).
Reference-free interpretation of long-read transcriptome data generally involves some form of clustering [8, 9]. Thus, to further demonstrate the utility of ReorientExpress for reference-free interrogation of transcriptomes with long-reads, we performed clustering of the cDNA reads (see the “Methods” section). For the majority of clusters in human (> 81%) and S. cerevisiae (> 85%), ReorientExpress predicted correctly more than 50% of the reads in the cluster (Fig. 2c) (Additional file 1: Figure S1) (the proportion of clusters for each model can be found in Additional file 1: Table S5). That is, for most clusters, more than half the reads in those clusters can be correctly oriented. Accordingly, by taking the orientation of the cluster to be determined by that of the majority of reads, we could improve the overall orientation. To test this, we applied a majority vote per cluster to set the orientation of all reads in the cluster to be the majority label predicted by ReorientExpress. With this, ReorientExpress established the right orientation for the majority of cDNA reads for human and yeast, with up to 96.2% of human reads and up to 98% of S. cerevisiae reads correctly oriented (Fig. 2d) (Additional file 1: Table S6).
Comparisons with other models and inputs
Interestingly, inverting the procedure and training with ONT cDNA reads yields good accuracy when testing on annotated transcripts, but when training on ONT DRS reads the accuracy decreases (Additional file 1: Table S7). This could be a consequence of a higher proportion of base-calling errors in DRS reads due to the presence of RNA modifications, leading to a decrease in the identification of relevant sequence motifs learned by the model. To test this, we trained the MLP model with DRS reads from in vitro transcribed (IVT) RNA and obtained slightly better accuracy than with DRS reads when testing on cDNA reads (Additional file 1: Table S7). Additionally, we observed no dependency on the base-caller used to obtain the sequence of reads. In particular, using Guppy-rapid or Guppy-high accuracy to base-call the IVT RNA reads did not show any differences in the accuracy of the MLP model (Additional file 1: Table S8). This indicates that DRS errors may prevent accurate training of sequence-based models.
We also observed a dependency of the accuracy on the length of the reads. The prediction accuracy decreased for shorter reads (Additional file 1: Table S8), which suggests that either short molecules or partial reads may pose a limitation for the accurate prediction of orientation. To further test the effect of read length on the prediction accuracy, we trimmed a number of nucleotides from both ends of the cDNA reads in the test set. The accuracy was not significantly affected by trimming up to 200 nt (Additional file 1: Table S9). Similarly, when we trimmed the training set by different amounts up to 200 nt, keeping the test set fixed, the accuracy did not change significantly (Additional file 1: Table S10). Thus, incomplete annotations can still be valid to train a model, and complete annotations can yield accurate results on partial reads. This is relevant for the application to cDNA reads, which may be fragmented due to internal priming. These results also indicate that DNN models are able to capture predictive features beyond the presence of adapters or poly-A tails to predict the 5′-to-3′ orientation of RNA molecules.
For comparison, we ran pychopper (https://github.com/nanoporetech/pychopper), which can identify the orientation of cDNA reads by detecting the sequencing adapters (see the “Methods” section). We analyzed all cDNA reads whose orientation was labeled previously. For the human cDNA reads, only ~ 23.5% were classified accurately by pychopper. We also compared the accuracy of ReorientExpress with primer-chop (https://gitlab.com/mcfrith/primer-chop), which produced the correct orientation for ~ 54% of all reads tested. These results justify the use of more sophisticated models to predict orientation. Additionally, we trained and tested a support vector machine (SVM) and a Random Forest (RF), using as inputs the same k-mer frequencies. Both methods showed worse accuracy compared to the MLP model for the same test data. However, for S. cerevisiae the accuracy of both models trained with the S. cerevisiae annotation was high (precision and recall 0.86 for the RF, and 0.95 for the SVM) (Additional file 1: Table S10). Finally, we also tested ReorientExpress with PacBio cDNA reads from sorghum. We trained two MLP models, one with the Ensembl cDNA annotations from sorghum and another with maize. Both models showed high accuracy when tested against sorghum PacBio reads (precision and recall ~ 0.95) (Additional file 1: Table S12).
Association of RNA types and sequence motifs with read orientation prediction
To investigate whether ReorientExpress captures recognizable RNA motifs, we took advantage of the possibility to use the convolutional filters of the CNN to identify sequence motifs captured by the model as done previously [12, 13] (see the “Methods” section). From these filters, we found 32 candidate motifs (Additional file 2), which we compared with known protein-RNA binding motifs. This method detected motifs similar to those described for the RNA binding proteins PCBP1, ELAVL1 (HuR), and RBM42 (Fig. 3c), among others (Additional file 3). Thus, sequence motifs that are relevant to predict molecule orientation recapitulate some of the binding specificities of proteins that control the metabolism of the RNA.
Here, we have shown that deep neural network (DNN) models trained on transcript sequences are able to provide an accurate orientation of cDNA long reads. We hypothesized that sequence motifs that are specific to RNA metabolism would be identifiable in long sequencing reads despite the presence of errors, and found that some of the sequences relevant to predict molecule orientation are similar to known motifs involved in RNA-protein binding. We described how DNN models maintained good accuracy despite using trimmed reads, and worked well on nanopore as well as on PacBio reads. ReorientExpress provides a crucial aid in the interpretation of transcripts using cDNA long reads in samples for which the genome reference is unavailable, as is the case for many non-model organisms. In this context, identifying the right strand of cDNA reads helps in the accurate detection of open reading frames as well as sequence motifs relevant for RNA metabolism, thereby enabling gene regulation studies despite not having a genome reference available.
ReorientExpress can also be relevant in general for the study of human and model organisms beyond the available references. Direct analysis of the long reads from unstranded libraries is not only cost- and time-effective but also can accurately identify antisense transcripts that are known to play important regulatory roles. Accurate identification of read orientation can provide better estimates of expression levels of sense and antisense genes, which in turn will improve our understanding of the transcriptome and of gene evolution. Establishing the orientation of cDNA long reads without relying on a particular genome reference is also relevant to determine gene sequence variability between individuals at different genomic scales, for instance in terms of short variations in exons or differences in gene content.
Our analyses show that ReorientExpress can be very valuable in combination with long read clustering [8, 9] to facilitate more accurate downstream analyses of transcriptomes. The ability to predict the 5′-to-3′ orientation of cDNA long reads using models trained on related species makes ReorientExpress a key processing tool for the study of transcriptomes from non-model organisms with long-reads.
Training and testing ReorientExpress
ReorientExpress (https://github.com/comprna/reorientexpress) implements deep neural network (DNN) models using keras (https://github.com/keras-team/keras) and Tensorflow (https://github.com/tensorflow). All input data is preprocessed to discard reads that contain N’s. For reads from direct RNA-seq experiments, uracil (U) is transformed into thymine (T). Input reads can be optionally trimmed, and this is done for the same length on both sides of each input sequence. For training purposes, a random selection of half the sequences is reverse-complemented to obtain a balanced training set. Optionally, all sequences can be reverse-complemented to double up the training input. ReorientExpress implements two different DNN models, a multi-layer perceptron (MLP) and a convolutional neural network (CNN). The MLP model has 5 hidden layers, with the last layer providing the probability that a read is not in the correct orientation, and with dropout layers to reduce overfitting (Additional file 1: Table S1). In the MLP model, sequences are processed to build a matrix of k-mer frequencies, from k = 1 up to a specified k-mer length (default k = 1,...,5). The normalization is performed per input sequence and per k-mer length. That is, for a fixed k, each k-mer count is divided by the total number of k-mers in the sequence of length L, so that frequency(k-mer) = count(k-mer)/(L - k + 1). Using the k-mer frequencies ensures that the input size is the same for all transcripts regardless of the transcript length. MLPs are simpler than CNNs, so they are faster to train and run. On the other hand, CNNs can model relative spatial relationships; hence, they can take sequence context into account. For this reason, we also included a CNN model in ReorientExpress. For the CNN model, we used an architecture similar to lenet, with 3 convolutional layers, 3 pooling layers, and 3 dense layers, with different filter sizes (Additional file 1: Table S2).
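As an illustration only, the preprocessing and k-mer featurization described above can be sketched in Python as follows. The function names (`preprocess`, `kmer_frequencies`, `make_training_set`) and the label convention (1 for a reverse-complemented sequence, 0 otherwise) are assumptions made for this sketch and are not taken from the ReorientExpress source code.

```python
import random
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def preprocess(seq, trim=0):
    """Convert U to T, discard sequences with N, trim both ends equally.

    Returns the cleaned sequence, or None if the sequence is discarded.
    """
    seq = seq.upper().replace("U", "T")
    if "N" in seq:
        return None
    if trim > 0:
        seq = seq[trim:len(seq) - trim]
    return seq


def kmer_frequencies(seq, max_k=5):
    """Feature vector of k-mer frequencies for k = 1..max_k.

    For each k, every count is divided by the number of k-mers in the
    sequence, L - k + 1, so the vector length does not depend on L.
    """
    features = []
    for k in range(1, max_k + 1):
        total = len(seq) - k + 1
        counts = {}
        for i in range(max(total, 0)):
            kmer = seq[i:i + k]
            counts[kmer] = counts.get(kmer, 0) + 1
        for kmer in map("".join, product("ACGT", repeat=k)):
            features.append(counts.get(kmer, 0) / total if total > 0 else 0.0)
    return features


def make_training_set(sequences, seed=0):
    """Reverse-complement a random half of the sequences.

    Label 1 marks a reverse-complemented sequence (hypothetical convention).
    """
    rng = random.Random(seed)
    flipped = set(rng.sample(range(len(sequences)), len(sequences) // 2))
    return [
        (reverse_complement(s), 1) if i in flipped else (s, 0)
        for i, s in enumerate(sequences)
    ]
```

With the default k = 1,...,5 this gives 4 + 16 + 64 + 256 + 1024 = 1364 features per sequence, and the frequencies for each k sum to 1.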
For the CNN model, each input sequence was divided into overlapping sequences of 500 nt, overlapping by 250 nt. For transcripts of length between 250 and 500, we added Ns at the end of the sequence. We used one hot encoding as input for each one of the 500-nt windows.
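The windowing and one-hot encoding can be sketched as follows. Several details are assumptions of this sketch rather than statements about ReorientExpress: sequences shorter than 250 nt produce no window, a trailing remainder shorter than a full window is dropped, and N is encoded as an all-zero row.

```python
def split_into_windows(seq, size=500, step=250):
    """Split a sequence into windows of `size` nt that overlap by `size - step`.

    Sequences with length between `step` and `size` are padded with N at the end.
    """
    if len(seq) < step:
        return []  # assumption: too short to be used
    if len(seq) < size:
        return [seq + "N" * (size - len(seq))]
    # assumption: a trailing partial window is dropped
    return [seq[s:s + size] for s in range(0, len(seq) - size + 1, step)]


def one_hot(window):
    """One-hot encode a window as rows in A, C, G, T order (N gives all zeros)."""
    index = {"A": 0, "C": 1, "G": 2, "T": 3}
    rows = []
    for base in window:
        row = [0, 0, 0, 0]
        if base in index:
            row[index[base]] = 1
        rows.append(row)
    return rows
```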
Once a model is trained, or given an already available model, ReorientExpress can predict the orientation of a set of unlabeled reads in prediction mode. ReorientExpress feeds the normalized k-mer counts for each read for the MLP model, or the sliding windows for the CNN model to predict the orientation. In the MLP model, the last layer has only one node, which applies a sigmoid function to approximate a probability from the score it receives. The probability can be interpreted as the certainty that the input read is not in the correct orientation. So, a read with a score greater than 0.5 is predicted to be in the wrong orientation and is reverse-complemented. For the CNN model, for each window tested, the output is a posterior of the orientation given that window. To provide a prediction for each input read, ReorientExpress takes the mean value for both orientations independently and outputs the orientation with the greatest mean.
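The two decision rules can be written compactly as below. This is a minimal sketch: the model outputs are passed in as plain numbers, the function names are hypothetical, and a tie between the two CNN means is resolved in favor of the forward orientation, which is an assumption.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def orient_with_mlp(seq, p_wrong, threshold=0.5):
    """Reverse-complement the read if the MLP score exceeds the threshold.

    `p_wrong` is the sigmoid output, read as the probability that the
    read is not in the correct orientation.
    """
    return reverse_complement(seq) if p_wrong > threshold else seq


def orient_with_cnn(seq, window_posteriors):
    """Average the per-window posteriors and pick the orientation with the larger mean.

    `window_posteriors` is a list of (p_forward, p_reverse) pairs, one per window.
    """
    if not window_posteriors:
        raise ValueError("at least one window is required")
    n = len(window_posteriors)
    mean_forward = sum(p_fwd for p_fwd, _ in window_posteriors) / n
    mean_reverse = sum(p_rev for _, p_rev in window_posteriors) / n
    # assumption: ties keep the read as given
    return reverse_complement(seq) if mean_reverse > mean_forward else seq
```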
The test mode is aimed at evaluating the accuracy of a model using as input sequences with known orientation. The program generates predictions for the input reads and compares them with the provided labels, returning a precision (proportion of the predictions that are correct), a recall (true positive rate, proportion of labeled cases that are correctly predicted), an F1-score (harmonic mean of precision and recall), and the total number of input reads. As input for any of the three modes, train, predict, and test, one can use three types of datasets: experimental, annotation, or mapped. Experimental data refers to any kind of long-read data for which the orientation is known, such as direct RNA-seq, and reads are considered to be given in the 5′-to-3′ orientation. Annotation data refers to the transcript sequences from a reference annotation, such as the human transcriptome reference. Annotation is considered to be in the right 5′-to-3′ orientation and can include the transcript type, such as protein coding and processed transcript. Mapped data refers to sequencing data, usually cDNA, whose orientation has been annotated by an independent method, e.g., by mapping the reads to a reference. In this case, a PAF file for the mapping, together with the FASTA/FASTQ file, is required. The labeled data is used for training or testing. In predict mode, the data does not require labeling and ReorientExpress provides a prediction. More details are provided at https://github.com/comprna/reorientexpress.
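For reference, the metrics reported in test mode follow the usual definitions for binary classification, sketched below. The choice of the positive class (label 1, a read that needs to be reverse-complemented) is an assumption of this example.

```python
def precision_recall_f1(true_labels, predicted_labels, positive=1):
    """Return (precision, recall, F1) for the given positive class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall
    return precision, recall, 2 * precision * recall / (precision + recall)
```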
Deep neural network (DNN) models tested
Models used for the analyses described in the manuscript are provided at https://github.com/comprna/reorientexpress. The human model was trained using the Gencode annotation release 28, and the mouse model was built using the mouse Gencode release M19. The Ensembl annotation (https://fungi.ensembl.org/) was used to train the Saccharomyces cerevisiae (R64-1-1) and the Candida glabrata (ASM254v2) models. Ensembl annotations (http://plants.ensembl.org) were used from sorghum (Sorghum bicolor NCBIv3) and from maize (Zea mays B73_RefGen_v4) to build models to test on PacBio data. From the annotation files, we only used the most frequent transcript annotation types: protein coding, lincRNA, processed transcripts, antisense, and retained intron. We trained the models using 50,000 randomly selected transcript sequences from the annotation, or all of them if there were fewer than 50,000 (S. cerevisiae and C. glabrata). The results did not change when running the analysis with different sets of 50,000 transcripts.
To test ReorientExpress on cDNA reads, we first calculated a set of cDNA reads for which orientation could be determined unambiguously in an independent way. We used human cDNA from the Nanopore consortium (cDNA 1D pass reads from JHU run 1) (available from https://github.com/nanopore-wgs-consortium/NA12878/blob/master/nanopore-human-transcriptome/fastq_fast5_bulk.md) and S. cerevisiae cDNA reads from SRA (SRR6059708). We mapped the cDNA reads to the corresponding transcriptome annotation using minimap2 without secondary alignments (minimap2 -cx map-ont -t7 --secondary=no). We kept only reads with maximum mapping quality (MAPQ = 60) and that were uniquely mapping. For human, 899,431 out of 962,598 reads were mapped in this way, 282,444 of which had MAPQ = 60. After removing the ~ 4% multimapping cases, we finally obtained 270,296 reads with orientation unambiguously assigned. For S. cerevisiae, 4,000,698 out of a total of 5,045,243 reads were mapped, 3,089,543 of which had MAPQ = 60. After removing the ~ 3% multimapping cases, we finally obtained 2,984,873 reads with orientation unambiguously assigned. Additionally, we used direct RNA sequencing (DRS) for human (JHU Run 1 available from https://github.com/nanopore-wgs-consortium/NA12878/blob/master/nanopore-human-transcriptome/fastq_fast5_bulk.md) and for S. cerevisiae from SRA (SRR6059706). We also tested ReorientExpress with PacBio cDNA reads from sorghum (data available at https://zenodo.org/record/49944#.XCkXQC-ZN24). We trained two MLP models with the Ensembl cDNA annotations (http://plants.ensembl.org) from sorghum (Sorghum bicolor NCBIv3) and maize (Zea mays B73_RefGen_v4).
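A possible way to derive orientation labels from the minimap2 output is sketched below. It relies on the standard PAF column layout (query name in column 1, relative strand in column 5, mapping quality in column 12). Treating a read as multimapping when it has more than one alignment record passing the MAPQ filter is an assumption of this sketch, and the function name is hypothetical.

```python
from collections import defaultdict


def orientation_labels_from_paf(lines, min_mapq=60):
    """Map read name to strand ('+' or '-') for uniquely mapped, high-MAPQ reads.

    Because reads are mapped to transcript sequences, '+' means that the
    read is already in the 5'-to-3' orientation of the transcript.
    """
    strands = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip malformed or empty lines
        name, strand, mapq = fields[0], fields[4], int(fields[11])
        if mapq >= min_mapq:
            strands[name].append(strand)
    # assumption: reads with several retained records are multimapping
    return {name: s[0] for name, s in strands.items() if len(s) == 1}
```

The function accepts any iterable of lines, so an open PAF file handle can be passed directly.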
Other models for comparison
We tested a Random Forest model and an SVM model using Scikit-learn. The models were trained using 50,000 random annotated transcripts. The features used as training input were the normalized k-mer frequencies for each sequence and the orientation as classification label. Further details are provided in Additional file 1. We also ran pychopper (https://github.com/nanoporetech/pychopper) (cdna_classifier.py command), and primer-chop (https://gitlab.com/mcfrith/primer-chop) on the 270,296 human cDNA reads that we had labeled previously and using the full list of barcodes provided by pychopper. Pychopper made predictions for 24% of the reads, of which 98% were correctly classified, i.e., ~ 23.5% (63,520) of the total reads. Primer-chop made predictions for 175,539 (65%) reads, of which 83% were correctly classified, i.e., ~ 54% of all reads tested.
Testing the dependency with base-callers
We used Guppy rapid and Guppy high accuracy (v2.2.3) with the signal files from the in vitro transcript RNA sequenced with MinION by the Nanopore Consortium (available from https://github.com/nanopore-wgs-consortium/NA12878/blob/master/nanopore-human-transcriptome/fastq_fast5_bulk.md). As this is direct RNA sequencing, the orientation of the reads can be readily used to test the accuracy of our models.
Clustering and majority vote
We performed clustering of the human and S. cerevisiae cDNA reads using IsONclust. Only cDNA reads that had been assigned an orientation by mapping as described above were used for clustering. We predicted the read 5′-to-3′ orientation for the same reads with ReorientExpress and calculated for each cluster the proportion of reads that were correctly oriented. As IsONclust does not give clusters with oriented reads, the orientation of all cDNA reads was taken from the mapping described above. In each cluster, we then predicted the read orientation with ReorientExpress and selected the majority label to assign all reads in the cluster: if the majority (> 50%) of reads were predicted to be already in 5′-to-3′ orientation (forward), we set all reads to forward. Otherwise, all reads were reverse-complemented. The accuracy of all reads was then calculated by comparing our predictions with the predetermined orientations.
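The majority-vote rule can be illustrated with the following sketch. The data structures (dictionaries keyed by read name) and function names are hypothetical. As stated above, a cluster is set to forward only when strictly more than 50% of its reads are predicted forward, so ties resolve to reverse.

```python
from collections import defaultdict


def majority_vote(cluster_of, predicted):
    """Assign every read the majority predicted label of its cluster.

    cluster_of: read name -> cluster identifier
    predicted:  read name -> "forward" or "reverse"
    """
    members = defaultdict(list)
    for read, cluster in cluster_of.items():
        members[cluster].append(read)
    consensus = {}
    for reads in members.values():
        n_forward = sum(predicted[r] == "forward" for r in reads)
        label = "forward" if n_forward / len(reads) > 0.5 else "reverse"
        for r in reads:
            consensus[r] = label
    return consensus


def accuracy(labels, truth):
    """Fraction of reads whose label matches the orientation from mapping."""
    return sum(labels[r] == truth[r] for r in truth) / len(truth)
```

In this scheme a minority of wrongly predicted reads in a cluster is corrected by the vote, whereas a cluster in which most predictions are wrong is flipped entirely to the wrong orientation.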
We studied the 32 filters from the first layer of the CNN to obtain the sequences that are most informative for predicting the orientation, using an approach similar to [12, 13]. To explore exhaustively all potential motifs, we used activations above 0 and converted the associated sequences to position weight matrices (PWMs). The derived 32 motif matrices (Additional file 2) were then compared against the CISBP-RNA database (http://cisbp-rna.ccbr.utoronto.ca/) using the TOMTOM algorithm (http://meme-suite.org/doc/tomtom.html) for the comparison of PWM-based motifs and selecting matches with p value < 0.05 (Additional file 3).
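The conversion of activating subsequences into a matrix can be sketched as follows. This is a simplified illustration, not the procedure used in the manuscript: it assumes that the activation at index i of a filter corresponds to the subsequence starting at position i, and it builds a plain position frequency matrix without pseudocounts or background correction. Both function names are hypothetical.

```python
def activating_subsequences(seq, activations, width):
    """Collect the subsequences whose filter activation is above 0.

    Assumes activations[i] was computed on seq[i:i + width].
    """
    return [
        seq[i:i + width]
        for i, value in enumerate(activations)
        if value > 0 and i + width <= len(seq)
    ]


def position_frequency_matrix(sequences, alphabet="ACGT"):
    """Per-position base frequencies for a set of equal-length sequences."""
    if not sequences:
        raise ValueError("no sequences given")
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("all sequences must have the same length")
    matrix = []
    for pos in range(length):
        column = [s[pos] for s in sequences]
        matrix.append({b: column.count(b) / len(sequences) for b in alphabet})
    return matrix
```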
Peer review information
Barbara Cheifet was the primary editor on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.
The review history is available as Additional file 4.
ARR implemented the MLP model, and AS implemented the CNN model. ARR and AS performed the data analysis and benchmarking experiments with the input from JAI and IDLR. The project was devised and coordinated by EE. EE, ARR, and AS drafted the manuscript. All authors read and approved the final manuscript.
Part of this work was funded by the Spanish Government and FEDER with grants BIO2017-85364-R and MDM-2014-0370 and by Catalan Government (AGAUR) with grant SGR2017-1020. JI was supported by a PhD grant from FCT (Fundação para a Ciência e a Tecnologia) Portugal. IdlR had funding from an FPI grant from the Spanish Government (PRE2018-083413).
Ethics approval and consent to participate
Ethics approval is not applicable for this study.
The authors declare that they have no competing interests.
- 8. Marchet C, Lecompte L, Da Silva C, Cruaud C, Aury J-M, Nicolas J, et al. De novo clustering of long reads by gene from transcriptomics data. Nucleic Acids Res. 2018. Available from: http://www.ncbi.nlm.nih.gov/pubmed/30260405.
- 9. Sahlin K, Medvedev P. De novo clustering of long-read transcriptome data using a greedy, quality-value based algorithm. In: International Conference on Research in Computational Molecular Biology. Springer, Cham. 2019. pp. 227–42. Available from: https://www.biorxiv.org/content/early/2018/11/06/463463.
- 10. Sessegolo C, Cruaud C, Da Silva C, Cologne A, Dubarry M, Derrien T, Lacroix V, Aury JM. Transcriptome profiling of mouse samples using nanopore sequencing of cDNA and RNA molecules. Sci Rep. 2019;9(1):14908. https://doi.org/10.1038/s41598-019-51470-9. PubMed PMID: 31624302. Available from: http://biorxiv.org/content/early/2019/07/16/575142.abstract.
- 15. Blevins WR, Ruiz-Orera J, Messeguer X, Blasco-Moreno B, Villanueva-Cañas JL, Espinar L, et al. Frequent birth of de novo genes in the compact yeast genome. bioRxiv. 2019:575837. Available from: http://biorxiv.org/content/early/2019/03/13/575837.abstract.
- 20. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: machine learning in Python. J Mach Learn Res. 2012;12:2825–30.
- 22. Ruiz-Reche A, Srivastava A, Eyras E. ReorientExpress. Github. Available from: https://github.com/comprna/reorientexpress.
- 23. Ruiz-Reche A, Srivastava A, Eyras E. ReorientExpress. Source code. Available from: https://doi.org/10.5281/zenodo.3528433.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.