Improving Bloom Filter Performance on Sequence Data Using $k$-mer Bloom Filters

Pellow, David; Filippova, Darya; Kingsford, Carl

doi:10.1007/978-3-319-31957-5_10

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 9649)

Included in the following conference series:

International Conference on Research in Computational Molecular Biology

Abstract

Using a sequence’s $k$-mer content rather than the full sequence directly has enabled significant performance improvements in several sequencing applications, such as metagenomic species identification, estimation of transcript abundances, and alignment-free comparison of sequencing data. Since $k$-mer sets often reach hundreds of millions of elements, traditional data structures are impractical for $k$-mer set storage, and Bloom filters and their variants are used instead. Bloom filters reduce the memory footprint required to store millions of $k$-mers while allowing for fast set containment queries, at the cost of a low false positive rate. We show that, because $k$-mers are derived from sequencing reads, the information about $k$-mer overlap in the original sequence can be used to reduce the false positive rate up to $30\times$ with little or no additional memory and with set containment queries that are only 1.3–1.6 times slower. Alternatively, we can leverage $k$-mer overlap information to store $k$-mer sets in about half the space while maintaining the original false positive rate. We consider several variants of such $k$-mer Bloom filters (kBF), derive theoretical upper bounds for their false positive rate, and discuss their range of applications and limitations. We provide a reference implementation of kBF at https://github.com/Kingsford-Group/kbf/.
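The overlap idea in the abstract can be illustrated with a minimal Python sketch. It is not the reference implementation linked above. The class `BloomFilter`, the helpers `kmers`, `neighbors`, `build`, and `contains_with_overlap`, the BLAKE2b-based hashing, and the rule "accept a $k$-mer only if at least one $k$-mer overlapping it by $k-1$ bases is also reported present" are all assumptions made for illustration; the paper's own variants may differ in detail.

```python
import hashlib


class BloomFilter:
    """Plain Bloom filter over strings (hypothetical helper for this sketch)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # One salted BLAKE2b digest per hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def kmers(read, k):
    """Yield every length-k substring of a read, left to right."""
    return (read[i:i + k] for i in range(len(read) - k + 1))


def neighbors(kmer):
    """Yield the 8 k-mers that overlap `kmer` by k-1 bases on either side."""
    for base in "ACGT":
        yield base + kmer[:-1]  # possible predecessor in a read
        yield kmer[1:] + base   # possible successor in a read


def build(reads, k, num_bits, num_hashes):
    """Insert all k-mers of all reads into a fresh Bloom filter."""
    bf = BloomFilter(num_bits, num_hashes)
    for read in reads:
        for km in kmers(read, k):
            bf.add(km)
    return bf


def contains_with_overlap(bf, kmer):
    """Report `kmer` as present only if it and at least one neighbor hit."""
    if kmer not in bf:
        return False
    return any(n in bf for n in neighbors(kmer))


if __name__ == "__main__":
    bf = build(["ACGTACGGA"], k=4, num_bits=1 << 16, num_hashes=3)
    print(contains_with_overlap(bf, "CGTA"))  # stored k-mer with stored neighbors
```

A random false positive in the plain filter is unlikely to also have a neighbor that hits, which is the intuition behind the lower false positive rate. The extra check costs up to eight additional lookups per query. This sketch also assumes every read is longer than $k$, so that each stored $k$-mer has at least one stored neighbor; a read of length exactly $k$ would otherwise be wrongly rejected.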

D. Pellow—Work performed at the Language Technologies Institute, School of Computer Science, Carnegie Mellon University, 5000 Forbes Ave., Pittsburgh, PA.

Acknowledgments

The authors want to thank Dr. Geet Duggal and Hao Wang for the many helpful discussions. This research is funded in part by the Gordon and Betty Moore Foundation’s Data-Driven Discovery Initiative through Grant GBMF4554 to Carl Kingsford, by the US National Science Foundation (CCF-1256087, CCF-1319998) and by the US National Institutes of Health (R21HG006913, R01HG007104). C.K. received support as an Alfred P. Sloan Research Fellow.

Author information

Authors and Affiliations

The Blavatnik School of Computer Science, Tel Aviv University, 69978, Tel Aviv, Israel
David Pellow
Computational Biology Department, School of Computer Science, Carnegie Mellon University, 5000 Forbes Ave., Pittsburgh, PA, USA
Darya Filippova & Carl Kingsford

Corresponding author

Correspondence to Carl Kingsford.

Editor information

Editors and Affiliations

Princeton University, Princeton, New Jersey, USA
Mona Singh

About this paper

Cite this paper

Pellow, D., Filippova, D., Kingsford, C. (2016). Improving Bloom Filter Performance on Sequence Data Using $k$-mer Bloom Filters. In: Singh, M. (ed.) Research in Computational Molecular Biology. RECOMB 2016. Lecture Notes in Computer Science, vol 9649. Springer, Cham. https://doi.org/10.1007/978-3-319-31957-5_10

DOI: https://doi.org/10.1007/978-3-319-31957-5_10
Published: 08 April 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-31956-8
Online ISBN: 978-3-319-31957-5
eBook Packages: Computer Science, Computer Science (R0)
