A Hybrid Parallel Strategy Based on String Graph Theory to Improve De Novo DNA Assembly on the TianHe-2 Supercomputer

  • Original Research Article
Interdisciplinary Sciences: Computational Life Sciences

Abstract

The de novo assembly of DNA sequences is increasingly important for biological research in the genomic era. More than a decade after the Human Genome Project, challenges remain, and new solutions are being explored to improve de novo genome assembly. String Graph Assembler (SGA), based on string graph theory, is a new tool developed to address these challenges. In this paper, based on an in-depth analysis of SGA, we prove that SGA-based de novo sequence assembly is an NP-complete problem. According to our analysis, SGA outperforms similar tools in memory consumption but takes much more time, of which 60–70% is spent on index construction. Based on this analysis, we introduce a hybrid parallel optimization algorithm and implement it in the parallel framework of TianHe-2. Simulations are performed with different datasets. For small datasets the optimized solution is 3.06 times faster than the unoptimized version, and for medium-sized datasets it is 1.60 times faster. The results demonstrate an evident performance improvement, with linear scalability for parallel FM-index construction. These results thus contribute significantly to improving the efficiency of de novo assembly of DNA sequences.
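
To make the index-construction step concrete, the following is a minimal serial sketch of what an FM-index contains and how it is queried. It is an illustration only: it sorts suffixes naively, and it is neither SGA's construction algorithm nor the hybrid parallel algorithm introduced in this paper. The function names `build_fm_index` and `count_occurrences` are hypothetical.

```python
def build_fm_index(text):
    """Build a naive FM-index of `text` (illustrative sketch, not SGA's method).

    Returns the suffix array, the Burrows-Wheeler transform, the C table,
    and a full occurrence table.
    """
    assert "$" not in text, "'$' is reserved as the end-of-text sentinel"
    s = text + "$"  # '$' sorts before letters such as A, C, G, T

    # Suffix array by direct sorting of suffixes: simple but slow,
    # acceptable only for small illustrative inputs.
    sa = sorted(range(len(s)), key=lambda i: s[i:])

    # BWT: the character preceding each sorted suffix (wraps around to '$').
    bwt = "".join(s[i - 1] for i in sa)

    # C[c] = number of characters in s that are strictly smaller than c.
    alphabet = sorted(set(s))
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += s.count(c)

    # occ[c][k] = number of occurrences of c in bwt[:k].
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for k, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][k + 1] = occ[c][k] + (1 if ch == c else 0)

    return sa, bwt, C, occ


def count_occurrences(pattern, C, occ, n):
    """Count occurrences of `pattern` by backward search.

    `n` is the length of the BWT (text length plus one for the sentinel).
    """
    lo, hi = 0, n  # half-open range of suffix-array rows
    for ch in reversed(pattern):
        if ch not in C:
            return 0
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo
```

For example, `build_fm_index("banana")` yields the transform `annb$aa`, and `count_occurrences("ana", C, occ, len(bwt))` then counts both overlapping occurrences. The suffix sort dominates the cost of this sketch, which is consistent with index construction being the natural target for parallelization.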

Acknowledgments

We greatly appreciate the valuable support and help of the author of SGA, Dr. Jared Simpson of the Sanger Institute in the UK, who provided the source code and data. We thank Xiaodong FANG, Lin FANG, Guixin GUO and Zhe LIN of BGI for their help and support. This work is supported by NSFC Grants 61272056, U1435222, 1133005, and Guangzhou SC Grant 1488064512003.

Author information

Corresponding authors

Correspondence to Shaoliang Peng or Xiaoqian Zhu.

Additional information

Feng Zhang, Xiangke Liao, Shaoliang Peng, Yingbo Cui, Bingqiang Wang, Xiaoqian Zhu and Jie Liu contributed equally to this work.

About this article

Cite this article

Zhang, F., Liao, X., Peng, S. et al. A Hybrid Parallel Strategy Based on String Graph Theory to Improve De Novo DNA Assembly on the TianHe-2 Supercomputer. Interdiscip Sci Comput Life Sci 8, 169–176 (2016). https://doi.org/10.1007/s12539-015-0127-6

  • DOI: https://doi.org/10.1007/s12539-015-0127-6
