Accelerating De Novo Assembler WTDBG2 on Commodity Servers

Dun, Ming; Li, Yunchun; You, Xin; Sun, Qingxiao; Luan, Zerong; Yang, Hailong

doi:10.1007/978-3-030-60245-1_16

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 12452)

Included in the following conference series:

International Conference on Algorithms and Architectures for Parallel Processing

Abstract

De novo genome assembly reconstructs chromosomes from massive numbers of relatively short, fragmented reads and is fundamental to studying new species for which no reference genome exists. Wtdbg2 is a de novo assembler for long reads of up to hundreds of kilobases. It is based on the fuzzy-Bruijn graph (FBG) and is ten times faster than cutting-edge assemblers such as Canu. However, the performance of wtdbg2 still needs improvement: 1) it requires up to terabytes of memory to compute an assembly, which makes it infeasible to run on a commodity server; 2) it requires tens of hours to assemble large datasets such as Homo sapiens genomes. To address these drawbacks, we propose several optimization techniques for accelerating wtdbg2 on a commodity server, including a memory auto-tuning scheme, sequence alignment optimization, and elimination of intermediate results in the output procedure. We compare the optimized wtdbg2 with the original implementation and two cutting-edge assemblers on real-world datasets. The experimental results demonstrate that the optimized wtdbg2 achieves maximum and average speedups of 2.31× and 1.54×, respectively. In addition, the proposed optimizations reduce the memory usage of wtdbg2 by 39.5% without affecting correctness.
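As a reading aid, the headline metrics in the abstract can be sketched as follows. This is a minimal illustration, assuming that speedup is the ratio of original to optimized runtime, that the average is an arithmetic mean over datasets, and that memory reduction is the relative drop in peak memory; the abstract does not state these definitions. The function names and all numbers in the usage lines are hypothetical and are not taken from the paper.

```python
from statistics import mean


def speedup(baseline_seconds, optimized_seconds):
    """Speedup as baseline runtime divided by optimized runtime (assumed definition)."""
    return baseline_seconds / optimized_seconds


def memory_reduction(baseline_bytes, optimized_bytes):
    """Relative drop in peak memory, e.g. 0.395 for a 39.5% reduction (assumed definition)."""
    return 1.0 - optimized_bytes / baseline_bytes


def summarize(runs):
    """Return (maximum, arithmetic-mean) speedup over (baseline, optimized) runtime pairs."""
    ratios = [speedup(base, opt) for base, opt in runs]
    return max(ratios), mean(ratios)


# Hypothetical runtimes in seconds for three datasets (illustrative only).
example_runs = [(100.0, 50.0), (90.0, 60.0), (80.0, 80.0)]
max_speedup, avg_speedup = summarize(example_runs)
print(max_speedup, avg_speedup)
```

Under these assumptions, a reported maximum speedup describes the best-case dataset, while the average summarizes all datasets, so the two figures can differ substantially.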

Acknowledgment

This work is supported by the National Key Research and Development Program of China (Grant No. 2016YFB1000304), the National Natural Science Foundation of China (Grant No. 61502019), and the Open Project Program of the State Key Laboratory of Mathematical Engineering and Advanced Computing (Grant No. 2019A12).

Author information

Authors and Affiliations

School of Cyber Science and Technology, Beihang University, Beijing, 100191, China
Ming Dun & Yunchun Li
School of Computer Science and Engineering, Beihang University, Beijing, 100191, China
Yunchun Li, Xin You, Qingxiao Sun & Hailong Yang
State Key Laboratory of Mathematical Engineering and Advanced Computing, Beijing University of Technology, Beijing, 100083, China
Hailong Yang
College of Life Sciences and Bioengineering, Beijing University of Technology, Beijing, 100083, China
Zerong Luan

Corresponding author

Correspondence to Hailong Yang.

Editor information

Editors and Affiliations

Columbia University, New York, NY, USA
Meikang Qiu

Cite this paper

Dun, M., Li, Y., You, X., Sun, Q., Luan, Z., Yang, H. (2020). Accelerating De Novo Assembler WTDBG2 on Commodity Servers. In: Qiu, M. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2020. Lecture Notes in Computer Science, vol 12452. Springer, Cham. https://doi.org/10.1007/978-3-030-60245-1_16

DOI: https://doi.org/10.1007/978-3-030-60245-1_16
Published: 29 September 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-60244-4
Online ISBN: 978-3-030-60245-1
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)