Gerbil: A Fast and Memory-Efficient k-mer Counter with GPU-Support

Erbert, Marius; Rechner, Steffen; Müller-Hannemann, Matthias

doi:10.1007/978-3-319-43681-4_12

Marius Erbert¹⁵,
Steffen Rechner¹⁵ &
Matthias Müller-Hannemann¹⁵

Part of the book series: Lecture Notes in Computer Science ((LNBI,volume 9838))

Included in the following conference series:

International Workshop on Algorithms in Bioinformatics

1404 Accesses
2 Citations

Abstract

A basic task in bioinformatics is the counting of k-mers in genome strings. The k-mer counting problem is to build a histogram of all substrings of length k in a given genome sequence. We present the open source k-mer counting software Gerbil that has been designed for the efficient counting of k-mers for \(k\ge 32\). Given the technology trend towards long reads of next-generation sequencers, support for large k becomes increasingly important. While existing k-mer counting tools suffer from excessive memory resource consumption or degrading performance for large k, Gerbil is able to efficiently support large k without much loss of performance. Our software implements a two-disk approach. In the first step, DNA reads are loaded from disk and distributed to temporary files that are stored at a working disk. In a second step, the temporary files are read again, split into k-mers and counted via a hash table approach. In addition, Gerbil can optionally use GPUs to accelerate the counting step. For large k, we outperform state-of-the-art open source k-mer counting tools by up to a factor of 4 for large genome data sets.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 39.99; Price excludes VAT (USA)

Softcover Book: USD 54.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Notes

1.
The tool is named Gerbil because of its modest resource requirements, which it has in common with the name-giving mammal.

References

Chikhi, R., Medvedev, P.: Informed and automated \(k\)-mer size selection for genome assembly. Bioinformatics 30(1), 31–37 (2014)
Article Google Scholar
Deorowicz, S., Debudaj-Grabysz, A., Grabowski, S.: Disk-based \(k\)-mer counting on a PC. BMC Bioinform. 14(1), 1–12 (2013)
Article Google Scholar
Deorowicz, S., Kokot, M., Grabowski, S., Debudaj-Grabysz, A.: KMC 2: fast and resource-frugal \(k\)-mer counting. Bioinformatics 31(10), 1569–1576 (2015)
Article Google Scholar
Li, Y., et al.: MSPKmerCounter: a fast and memory efficient approach for\(k\)-mer counting (2015). arXiv preprint arXiv:1505.06550
Marçais, G., Kingsford, C.: A fast, lock-free approach for efficient parallel counting of occurrences of \(k\)-mers. Bioinformatics 27(6), 764–770 (2011)
Article Google Scholar
Melsted, P., Pritchard, J.K.: Efficient counting of \(k\)-mers in DNA sequences using a bloom filter. BMC Bioinform. 12(1), 1–7 (2011)
Article Google Scholar
Rizk, G., Lavenier, D., Chikhi, R.: DSK: \(k\)-mer counting with very low memory usage. Bioinformatics 29(5), 652–653 (2013)
Article Google Scholar
Roberts, M., Hayes, W., Hunt, B.R., Mount, S.M., Yorke, J.A.: Reducing storage requirements for biological sequence comparison. Bioinformatics 20(18), 3363–3369 (2004)
Article Google Scholar
Roberts, M., Hunt, B.R., Yorke, J.A., Bolanos, R.A., Delcher, A.L.: A preprocessor for shotgun assembly of large genomes. J. Comput. Biol. 11(4), 734–752 (2004)
Article Google Scholar
Roy, R.S., Bhattacharya, D., Schliep, A.: Turtle: identifying frequent \(k\)-mers with cache-efficient algorithms. Bioinformatics 30(14), 1950–1957 (2014)
Article Google Scholar
Sameith, K., Roscito, J.G., Hiller, M.: Iterative error correction of long sequencing reads maximizes accuracy and improves contig assembly. Briefings in Bioinformatics (2016). http://dx.org/10.1093/bib/bbw003
Suzuki, S., Kakuta, M., Ishida, T., Akiyama, Y.: Accelerating identification of frequent \(k\)-mers in DNA sequences with GPU. In: GTC 2014 (2014)
Google Scholar
Xavier, B.B., Sabirova, J., Pieter, M., Hernalsteens, J.P., de Greve, H., Goossens, H., Malhotra-Kumar, S.: Employing whole genome mapping for optimal de novo assembly of bacterial genomes. BMC Res. Notes 7(1), 1–4 (2014)
Article Google Scholar

Download references

Author information

Authors and Affiliations

Institute of Computer Science, Martin Luther University Halle-Wittenberg, Halle (Saale), Germany
Marius Erbert, Steffen Rechner & Matthias Müller-Hannemann

Authors

Marius Erbert
View author publications
You can also search for this author in PubMed Google Scholar
Steffen Rechner
View author publications
You can also search for this author in PubMed Google Scholar
Matthias Müller-Hannemann
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Steffen Rechner .

Editor information

Editors and Affiliations

AIST and University of Tokyo , Tokyo, Japan
Martin Frith
Aarhus University, Aarhus, Denmark
Christian Nørgaard Storm Pedersen

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Erbert, M., Rechner, S., Müller-Hannemann, M. (2016). Gerbil: A Fast and Memory-Efficient k-mer Counter with GPU-Support. In: Frith, M., Storm Pedersen, C. (eds) Algorithms in Bioinformatics. WABI 2016. Lecture Notes in Computer Science(), vol 9838. Springer, Cham. https://doi.org/10.1007/978-3-319-43681-4_12

Download citation

DOI: https://doi.org/10.1007/978-3-319-43681-4_12
Published: 06 August 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-43680-7
Online ISBN: 978-3-319-43681-4
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics