Abstract
The de novo assembly of DNA sequences is increasingly important for biological research in the genomic era. More than a decade after the Human Genome Project, challenges remain, and new solutions are being explored to improve the de novo assembly of genomes. The String Graph Assembler (SGA), based on string graph theory, is a recent tool developed to address these challenges. In this paper, based on an in-depth analysis of SGA, we prove that SGA-based de novo sequence assembly is an NP-complete problem. According to our analysis, SGA outperforms other similar methods and tools in memory consumption but takes much more time, of which 60–70% is spent on index construction. Based on this analysis, we introduce a hybrid parallel optimization algorithm and implement it in the parallel framework of TianHe-2. Simulations are performed with different datasets. For small datasets the optimized solution is 3.06 times faster than the original, and for medium-sized datasets it is 1.60 times faster. The results demonstrate an evident performance improvement, with linear scalability for parallel FM-index construction. These results thus contribute significantly to improving the efficiency of de novo assembly of DNA sequences.
Acknowledgments
We greatly appreciate the valuable support and help of the authors of SGA: Dr. Jared Simpson of the Sanger Institute in the UK, who provided the source code and data. We thank Xiaodong FANG, Lin FANG, Guixin GUO and Zhe LIN at BGI for their help and support. This work is supported by NSFC Grants 61272056, U1435222 and 1133005, and by Guangzhou SC Grant 1488064512003.
Additional information
Feng Zhang, Xiangke Liao, Shaoliang Peng, Yingbo Cui, Bingqiang Wang, Xiaoqian Zhu, Jie Liu have contributed equally to this work.
Cite this article
Zhang, F., Liao, X., Peng, S. et al. A Hybrid Parallel Strategy Based on String Graph Theory to Improve De Novo DNA Assembly on the TianHe-2 Supercomputer. Interdiscip Sci Comput Life Sci 8, 169–176 (2016). https://doi.org/10.1007/s12539-015-0127-6
DOI: https://doi.org/10.1007/s12539-015-0127-6