Multithreaded comparative RNA secondary structure prediction using stochastic context-free grammars
- 5.6k Downloads
- 12 Citations
Abstract
Background
The prediction of the structure of large RNAs remains a particular challenge in bioinformatics, due to the computational complexity and low levels of accuracy of state-of-the-art algorithms. The pfold model couples a stochastic context-free grammar to phylogenetic analysis for a high accuracy in predictions, but the time complexity of the algorithm and underflow errors have prevented its use for long alignments. Here we present PPfold, a multithreaded version of pfold, which is capable of predicting the structure of large RNA alignments accurately on practical timescales.
Results
We have distributed both the phylogenetic calculations and the inside-outside algorithm in PPfold, resulting in a significant reduction of runtime on multicore machines. We have addressed the floating-point underflow problems of pfold by implementing an extended-exponent datatype, enabling PPfold to be used for large-scale RNA structure predictions. We have also improved the user interface and portability: alongside standalone executable and Java source code of the program, PPfold is also available as a free plugin to the CLC Workbenches. We have evaluated the accuracy of PPfold using BRaliBase I tests, and demonstrated its practical use by predicting the secondary structure of an alignment of 24 complete HIV-1 genomes in 65 minutes on an 8-core machine and identifying several known structural elements in the prediction.
Conclusions
PPfold is the first parallelized comparative RNA structure prediction algorithm to date. Based on the pfold model, PPfold is capable of fast, high-quality predictions of large RNA secondary structures, such as the genomes of RNA viruses or long genomic transcripts. The techniques used in the parallelization of this algorithm may be of general applicability to other bioinformatics algorithms.
Keywords
Nonterminal Symbol Multicore Machine Lower Triangular Matrice Column Pair Structure Prediction AlgorithmBackground
Recent years have seen an explosion in the amount of biological data available from large-scale genome sequencing projects, but this increase has not been met by a corresponding increase in single-core computer power to bioinformatically analyze this data. It is therefore predicted that the scientific community will face serious computational problems in the coming years in their efforts to interpret genomic data. The prediction of RNA secondary structure remains a particularly challenging problem, in a large part due to its computational complexity: even without pseudoknot prediction, the execution time of state-of-the-art algorithms scales as O(L^{3}) or worse with the length of the sequence, L. One way to address this problem is by exploiting heuristics to reduce complexity, but this happens at the cost of accuracy in predictions, which is particularly detrimental in the case of already inaccurate algorithms. Another possibility is to apply emerging multithreading paradigms to more accurate algorithms, and obtain the precise results in a fraction of the time.
RNA secondary structure prediction algorithms are typically based on either thermodynamic or stochastic context-free grammar (SCFG) models, and are implemented using dynamic programming. A recent review [1] gives an overview over existing tools. Previous attempts to parallelize RNA structure prediction algorithms have included thermodynamic prediction [2, 3, 4, 5] and the SCFG-based profiling of RNAs [6], as well as massively parallel genetic algorithms [7] and hardware-accelerated folding on FPGA chips [8] and GPUs [9]. Despite improved runtimes, the accuracy of these algorithms remains low, due to models that may not be appropriate for very long sequences.
In this study, we address both of these issues, and create PPfold, an improved and multithreaded version of pfold. To our knowledge, PPfold is the first example of a multithreaded comparative RNA secondary structure prediction algorithm. We demonstrate its practical use by predicting the secondary structure of an alignment of 24 HIV-1 genomes.
Results and Discussion
Algorithm
PPfold uses the same combined evolutionary and SCFG model as pfold[10, 11]. A summary of this model is provided in the Methods section. Here we focus strictly on the parts of PPfold that present improvements on pfold.
Multithreading the phylogenetic calculations
After estimating the phylogenetic tree, the pfold algorithm calculates column- and column-pair likelihoods, based on post-order traversal through the tree. We have only distributed the calculation of column-pair likelihoods, as this is the most time-consuming part with a time complexity of O(L^{2}), where L is the length of the alignment. It is desirable to distribute the calculations as evenly as possible, so all processing units have an equal workload. As all column-pairs are treated independently, a natural division for multithreading arises by grouping a number of column pairs together in such a way that there are as many groups with equal numbers of column pairs as processing units. However, a unique mapping from the number of groups, n, to the size of each group, s, does not exist.
Multithreading the inside-outside algorithm
The inside-outside algorithm fills two lower triangular matrices of dimension L for each nonterminal symbol of the grammar, through dynamic programming. The Knudsen-Hein grammar contains 3 nonterminal variables, so in total there are 6 such matrices to be filled. The algorithm exhibits heavy dependencies, making its distribution into independent "jobs" nontrivial. In PPfold, we have chosen an asynchronous wavefront computational approach that exploits the geometry of the algorithm.
Additional file 1:Animation of distribution of the SCFG calculations. The animation demonstrates the asynchronous wavefront computational approach, and was created on the basis of actual runtime data during the folding of a sequence of 460 nt, on a 2-core machine. The animation is divided into three parts: the inside (red/green), outside (yellow/blue) and expectation (cyan/magenta) calculations. The first colour represents jobs that are ready to be executed (because all their dependencies are fulfilled), and are therefore placed in a queue. At any time during the animation, the two jobs that entered the queue earliest are under execution (not shown), as the execution happens in a threadpool corresponding to the available number of cores (here, 2). When a job is finished, it changes to the second colour, and any newly available jobs (with finished dependencies) are pushed onto the queue. (AVI 3 MB)
Memory use is optimized with a large number of divisions, where only the lower triangular matrices are stored. However, the space complexity of the algorithm remains O(L^{2}), and roughly 6 GB of memory are needed to fold a 10000 nucleotide-long alignment.
Multithreading expectation maximization
In contrast to many SCFG-based RNA secondary structure prediction programs, pfold returns the structure with the maximum number of expected correctly predicted nested structural elements, instead of the maximum likelihood estimate. To find this structure, it is necessary to calculate a matrix of expectation values for every fraction of the alignment, similarly to the inside algorithm. The details of this are described in the Methods section. As these calculations also contain bifurcations, they are distributed analogously to the inside algorithm.
Underflow
Floating-point underflow arises commonly in dynamic programming for the folding of long RNAs, due to the multiplication of several thousands of low probabilities with each other. It has effectively prevented the correct folding of large RNA alignments with pfold, as the values of the inside-outside variables decrease both with the length of alignment the number of sequences in it. A common solution is to calculate with log-probabilities, implementing addition as a "log sum" function with a lookup table. Other possible solutions include multiplying the rules of the grammar with a factor, such that underflow is reached more slowly, thus extending the foldable length of alignments, or multiplying a block of probabilities lower than a certain cutoff value by a scaling factor.
In PPfold we have taken an alternative approach and implemented an extended exponent datatype, consisting of a float "fraction" and an integer "exponent" (base 2) part. Together, 64 bits are used to store each number - the same amount of space as a double-precision floating point number, so the overall memory requirements of the algorithm are not increased substantially. For each nonterminal symbol (in the inside, outside and expectation parts of the algorithm), we store two 2-dimensional arrays: one for the exponents and one for the fractions. For every operation, we combine numbers from these arrays using custom bitmasking and bitshifting methods.
Implementation
PPfold has been written in Java 5.0, and integrated into the CLC Workbenches using the CLC Developer Kit (version 3.31) API. The source code and executables are available for download at http://www.daimi.au.dk/~compbio/pfold/downloads.html. PPfold consists of an "algorithm" package that can be compiled and run independently of the CLC Workbenches, and a "plugin" package that provides interfacing with the CLC Workbenches.
The "algorithm" package includes all classes that are involved in the processing of sequences and alignments and creating the final structures. It has no dependencies on any CLC classes, is capable of taking command-line arguments and provides a simple graphical user interface for the selection of input files. Export formats currently supported by PPfold include. ct, .seq (with reliability scores) and white text, as well as basepairing probability data.
The "plugin" package makes use of the CLC Developer Kit API such that PPfold becomes a full-featured plugin to the CLC Workbenches. Futhermore, integration into "minigrid-enabled" CLC Workbenches makes it possible to distribute calculations to a collaborative mini-grid of computers [13]. The details of this aspect of our work will be published elsewhere.
Testing and benchmarking
Performance
Performance
Alignment | divisions | 1 core (sec) | 2 cores (sec) | 4 cores (sec) | 8 cores (sec) | pfold(sec) |
---|---|---|---|---|---|---|
1 | 5.41 | 4.85 | 4.84 | 4.84 | ||
2 × 500 nt | 4 | 5.74 | 3.06 | 2.15 | 2.16 | 0.59 |
35 | 3.70 | 2.05 | 1.25 | 0.92 | ||
1 | 51.8 | 52.0 | 51.1 | 50.4 | ||
20 × 1000 nt | 4 | 46.6 | 27.9 | 19.0 | 19.6 | 7.3 |
35 | 35.7 | 18.3 | 9.7 | 5.8 | ||
1 | 1738 | 1640 | 1581 | 1464 | ||
30 × 3000 nt | 4 | 1476 | 878 | 642 | 632 | 368 |
35 | 842 | 424 | 217 | 123 |
Accuracy
BRaliBase accuracy
Sequence | Program | Ref. basepairs | Pred. basepairs | Sensitivity, % | Selectivity, % | Correlation, % |
---|---|---|---|---|---|---|
tRNA (M) | PPfold | 20 | 21 | 100.0 | 100.0 | 100.0 |
tRNA (M) | pfold | 20 | 21 | 100.0 | 100.0 | 100.0 |
tRNA (H) | PPfold | 20 | 21 | 100.0 | 100.0 | 100.0 |
tRNA (H) | pfold | 20 | 21 | 100.0 | 100.0 | 100.0 |
RNaseP (M) | PPfold | 110 | 110 | 86.4 | 96.0 | 91.2 |
RNaseP (M) | pfold | 110 | 110 | 86.4 | 96.0 | 91.2 |
RNaseP (H) | PPfold | 110 | 69 | 43.6 | 71.6 | 57.6 |
RNaseP (H) | pfold | 110 | 69 | 43.6 | 71.6 | 57.6 |
SSU (M) | PPfold | 468 | 420 | 74.4 | 86.1 | 80.2 |
SSU (M) | pfold | 468 | 0 | 0.0 | 0.0 | 0.0 |
SSU (H) | PPfold | 468 | 373 | 68.4 | 89.1 | 78.8 |
SSU (H) | pfold | 468 | 373 | 68.4 | 89.1 | 78.8 |
LSU (M) | PPfold | 839 | 830 | 58.2 | 62.5 | 60.3 |
LSU (M) | pfold | 839 | 0 | 0.0 | 0.0 | 0.0 |
LSU (H) | PPfold | 839 | 754 | 52.2 | 61.0 | 56.6 |
LSU (H) | pfold | 839 | 0 | 0.0 | 0.0 | 0.0 |
Folding of the HIV-1 genome
Numerical stability
Numerical stability
Number of divisions | P(S→ C_{1}...C_{ n }) |
---|---|
20 | 1.7616063 × 2^{-148446} |
35 | 1.7617458 × 2^{-148446} |
60 | 1.7618935 × 2^{-148446} |
Conclusions
PPfold is a new, multithreaded version of the pfold algorithm, capable of obtaining high-quality, phylogenetically supported structures for large RNA alignments in a practically reasonable time, which has not been possible previously. PPfold distributes both the phylogenetic and the inside-outside calculations of pfold, and our tests show that its speed of execution scales well with the number of executing cores. Using PPfold, we have been able to obtain a high-quality prediction for an alignment of 30 full HIV-1 genomes in 65 minutes on an 8-core computer. We anticipate that our algorithm will be used also for the prediction of other long RNA alignments, such as viral genomes and genomic transcripts.
Methods
Summary of the pfold model
with initialization conditions E_{ i,i } = P_{ s } (i) for all i, where the basepair probabilities P_{ d } and unpaired base probabilities P_{ s } are obtained from the inside-outside variables. The final structure returned to the user is obtained by backtracking in this matrix. The reader is advised to consult references [10] and [11] for more details on the pfold model and algorithm.
Notes
Acknowledgements
ZS would like to thank Paul Gardner for assistance with the BRaliBase tests, Mikkel Nygaard Ravn for work on the CLC Workbenches that aided integration, Alex Andersen for practical help and Attila Sükösd, Paula Tataru and Rasmus Schøler Sørensen for useful discussions. This research has been funded by the Danish Agency for Science, Technology, and Innovation under the project "PC Mini-Grids for Prediction of Viral RNA Structure and Evolution", #09-061856.
Supplementary material
References
- 1.Andersen ES: Prediction and design of DNA and RNA structures. New Biotechnol 2010, 27: 184–193. 10.1016/j.nbt.2010.02.012CrossRefGoogle Scholar
- 2.Hofacker IL, Fontana W, Stadler PF, Bonhoeffer LS, Tacker M, Schuster P: Fast folding and comparison of RNA secondary structures (The Vienna RNA Package). Monatshefte für Chemie 1994, 125: 167–188. 10.1007/BF00818163CrossRefGoogle Scholar
- 3.Hofacker IL, Huynen MA, Stadler PF, Stolorz PE: RNA Folding on Parallel Computers - The Minimum Free Energy Structures of Complete HIV Genomes. 1996.Google Scholar
- 4.Fekete M, Hofacker IL, Stadler PF: Prediction of RNA base pairing probabilities on massively parallel computers. Journal of Computational Biology 2000, 7: 171–182. 10.1089/10665270050081441CrossRefPubMedGoogle Scholar
- 5.Mathuriya A, Bader DA, Heitsch CE, Harvey SC: GTfold: a scalable multicore code for RNA secondary structure prediction. In SAC '09: Proceedings of the 2009 ACM symposium on Applied Computing. New York, NY, USA: ACM; 2009:981–988.CrossRefGoogle Scholar
- 6.Liu T, Schmidt B: Parallel RNA secondary structure prediction using stochastic context-free grammars. Concurrency Computat: Pract Exper 2005, 17: 1669–1685. 10.1002/cpe.952CrossRefGoogle Scholar
- 7.Shapiro BA, Wu JC, Bengali D, Potts MJ: The massively parallel genetic algorithm for RNA folding: MIMD implementation and population variation. Bioinformatics 2001, 17: 137–148. 10.1093/bioinformatics/17.2.137CrossRefPubMedGoogle Scholar
- 8.Xia F, Dou Y, Zhou D, Li X: Fine-grained parallel RNA secondary structure prediction using SCFGs on FPGA. Parallel Comput 2010, 36: 516–530. 10.1016/j.parco.2010.05.005CrossRefGoogle Scholar
- 9.Chang DJ, Kimmer C, Ouyang M: Accelerating the Nussinov RNA folding algorithm with CUDA/GPU. Proceedings of IEEE International Symposium on Signal Processing and Information Technology (ISSPIT 2010) 2010, 120–125.Google Scholar
- 10.Knudsen B, Hein J: RNA secondary structure prediction using stochastic context-free grammars and evolutionary history. Bioinformatics 1999, 15: 446–454. 10.1093/bioinformatics/15.6.446CrossRefPubMedGoogle Scholar
- 11.Knudsen B, Hein J: Pfold: RNA secondary structure prediction using stochastic context-free grammars. Nuc Acids Res 2003, 31: 3423–3428. 10.1093/nar/gkg614CrossRefGoogle Scholar
- 12.Gardner PP, Giegerich R: A comprehensive comparison of comparative RNA structure prediction approaches. BMC Bioinformatics 2004, 5: 140. 10.1186/1471-2105-5-140PubMedCentralCrossRefPubMedGoogle Scholar
- 13.Bardram J, Venkataraman N: The Mini-Grid Framework: Application Programming Support for Ad-Hoc, Peer-to-Peer Volunteer Grids. In Advances in Grid and Pervasive Computing, Volume 6104 of Lecture Notes in Computer Science. Edited by: Bellavista P, Chang RS, Chao HC, Lin SF, Sloot P. Springer Berlin/Heidelberg; 2010:69–80.Google Scholar
- 14.Watts JM, Dang KK, Gorelick RJ, Leonard CW, Bess JW Jr, Swanstrom R, Burch CL, Weeks KM: Architecture and secondary structure of an entire HIV-1 RNA genome. Nature 2009, 460: 711–716. 10.1038/nature08237PubMedCentralCrossRefPubMedGoogle Scholar
- 15.Felsenstein J: Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol 1981, 17: 368–376. 10.1007/BF01734359CrossRefPubMedGoogle Scholar
- 16.Lari K, Young SJ: The estimation of stochastic context-free grammars using the Inside-Outside algorithm. Computer Speech & Language 1990, 4: 35–56. 10.1016/0885-2308(90)90022-XCrossRefGoogle Scholar
- 17.Wiese KC, Glen E: jViz.Rna - An Interactive Graphical Tool for Visualizing RNA Secondary Structure Including Pseudoknots. Proceedings of the 19th IEEE Symposium on Computer-Based Medical Systems Washington, DC, USA: IEEE Computer Society; 2006, 659–664. [http://portal.acm.org/citation.cfm?id=1152999.1153025]Google Scholar
Copyright information
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.