AllSome Sequence Bloom Trees

Sun, Chen; Harris, Robert S.; Chikhi, Rayan; Medvedev, Paul

doi:10.1007/978-3-319-56970-3_17

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 10229)

Included in the following conference series:

International Conference on Research in Computational Molecular Biology

Abstract

The ubiquity of next-generation sequencing has transformed the size and nature of many databases, pushing the boundaries of current indexing and searching methods. One particular example is a database of 2,652 human RNA-seq experiments uploaded to the Sequence Read Archive (SRA). Recently, Solomon and Kingsford proposed the Sequence Bloom Tree data structure and demonstrated how it can be used to accurately identify SRA samples in which a transcript of interest is potentially expressed. In this paper, we propose an improvement called the AllSome Sequence Bloom Tree. Results show that our new data structure significantly improves performance, reducing the tree construction time by 52.7% and query time by 39–85%, at the cost of up to 3x memory consumption during queries. Notably, it can process a batch of 198,074 queries in under 8 h (compared to around two days previously) and a whole set of \(k\)-mers from a sequencing experiment (about 27 million \(k\)-mers) in under 11 min.

C. Sun and R.S. Harris—Contributed equally to the work.

The full version of this paper is available at [33].

Notes

1. SBT-ALSO GitHub repository: https://github.com/medvedevgroup/bloomtree-allsome.

References

1. SBT-SK software and data. http://www.cs.cmu.edu/%7Eckingsf/software/bloomtree/. Accessed 01 July 2016
2. Baier, U., Beller, T., Ohlebusch, E.: Graphical pan-genome analysis with compressed suffix trees and the Burrows-Wheeler transform. Bioinformatics 32, 497–504 (2015)
3. Bloom, B.H.: Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13(7), 422–426 (1970)
4. Bray, N.L., Pimentel, H., Melsted, P., Pachter, L.: Near-optimal probabilistic RNA-seq quantification. Nat. Biotechnol. 34(5), 525–527 (2016)
5. Chambi, S., Lemire, D., Kaser, O., Godin, R.: Better bitmap performance with roaring bitmaps. Softw. Pract. Exp. 46(5), 709–719 (2015)
6. Chikhi, R., Rizk, G.: Space-efficient and exact de Bruijn graph representation based on a bloom filter. Algorithms Mol. Biol. 8(1), 1 (2013)
7. Consortium, C.P.G., et al.: Computational pan-genomics: status, promises and challenges. Brief. Bioinform. bbw089 (2016)
8. Crainiceanu, A., Lemire, D.: Bloofi: multidimensional bloom filters. Inf. Syst. 54, 311–324 (2015)
9. Dolle, D.D., Liu, Z., Cotten, M.L., Simpson, J.T., Iqbal, Z., Durbin, R., McCarthy, S., Keane, T.: Using reference-free compressed data structures to analyse sequencing reads from thousands of human genomes. Genome Res. 27, 300–309 (2016)
10. Ernst, C., Rahmann, S.: PanCake: a data structure for pangenomes. Ger. Conf. Bioinform. 34, 35–45 (2013)
11. Gog, S., Beller, T., Moffat, A., Petri, M.: From theory to practice: plug and play with succinct data structures. In: Gudmundsson, J., Katajainen, J. (eds.) SEA 2014. LNCS, vol. 8504, pp. 326–337. Springer, Cham (2014). doi:10.1007/978-3-319-07959-2_28
12. Heo, Y., Wu, X.L., Chen, D., Ma, J., Hwu, W.M.: BLESS: bloom filter-based error correction solution for high-throughput sequencing reads. Bioinformatics 30, 1354–1362 (2014)
13. Holley, G., Wittler, R., Stoye, J.: Bloom filter trie – a data structure for pan-genome storage. In: Pop, M., Touzet, H. (eds.) WABI 2015. LNCS, vol. 9289, pp. 217–230. Springer, Heidelberg (2015). doi:10.1007/978-3-662-48221-6_16
14. de Hoon, M.J., Imoto, S., Nolan, J., Miyano, S.: Open source clustering software. Bioinformatics 20(9), 1453–1454 (2004)
15. Leinonen, R., Sugawara, H., Shumway, M.: The sequence read archive. Nucleic Acids Res. 39, D19–D21 (2010)
16. Liu, B., Zhu, D., Wang, Y.: deBWT: parallel construction of Burrows-Wheeler Transform for large collection of genomes with de Bruijn-branch encoding. Bioinformatics 32(12), i174–i182 (2016)
17. Loh, P.R., Baym, M., Berger, B.: Compressive genomics. Nat. Biotechnol. 30(7), 627–630 (2012)
18. Mäkinen, V., Belazzougui, D., Cunial, F., Tomescu, A.I.: Genome-Scale Algorithm Design. Cambridge University Press, Cambridge (2015)
19. Marçais, G., Kingsford, C.: A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics 27(6), 764–770 (2011)
20. Marchet, C., Limasset, A., Bittner, L., Peterlongo, P.: A resource-frugal probabilistic dictionary and applications in (meta) genomics (2016). arXiv preprint: arXiv:1605.08319
21. Marcus, S., Lee, H., Schatz, M.C.: SplitMEM: a graphical algorithm for pan-genome analysis with suffix skips. Bioinformatics 30(24), 3476–3483 (2014)
22. Melsted, P., Pritchard, J.K.: Efficient counting of k-mers in DNA sequences using a bloom filter. BMC Bioinform. 12(1), 333 (2011)
23. Minkin, I., Pham, S., Medvedev, P.: TwoPaCo: an efficient algorithm to build the compacted de Bruijn graph from many complete genomes. Bioinformatics btw609 (2016)
24. Murray, K.D., Webers, C., Ong, C.S., Borevitz, J.O., Warthmann, N.: kWIP: the k-mer weighted inner product, a de novo estimator of genetic similarity (2016). bioRxiv: 075481
25. Navarro, G., De Moura, E.S., Neubert, M., Ziviani, N., Baeza-Yates, R.: Adding compression to block addressing inverted indexes. Inf. Retr. 3(1), 49–77 (2000)
26. Nellore, A., Collado-Torres, L., Jaffe, A.E., Alquicira-Hernández, J., Wilks, C., Pritt, J., Morton, J., Leek, J.T., Langmead, B.: Rail-RNA: scalable analysis of RNA-seq splicing and coverage. Bioinformatics btw575 (2016)
27. Patro, R., Mount, S.M., Kingsford, C.: Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms. Nat. Biotechnol. 32(5), 462–464 (2014)
28. Raman, R., Raman, V., Rao, S.S.: Succinct indexable dictionaries with applications to encoding k-ary trees and multisets. In: Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 233–242. Society for Industrial and Applied Mathematics (2002)
29. Rozov, R., Shamir, R., Halperin, E.: Fast lossless compression via cascading bloom filters. BMC Bioinform. 15(9), 1 (2014)
30. Salikhov, K., Sacomoto, G., Kucherov, G.: Using cascading bloom filters to improve the memory usage for de Brujin graphs. In: Darling, A., Stoye, J. (eds.) WABI 2013. LNCS, vol. 8126, pp. 364–376. Springer, Heidelberg (2013). doi:10.1007/978-3-642-40453-5_28
31. Solomon, B., Kingsford, C.: Fast search of thousands of short-read sequencing experiments. Nat. Biotechnol. 34(3), 300–302 (2016)
32. Stranneheim, H., Käller, M., Allander, T., Andersson, B., Arvestad, L., Lundeberg, J.: Classification of DNA sequences using bloom filters. Bioinformatics 26(13), 1595–1600 (2010)
33. Sun, C., Harris, R.S., Chikhi, R., Medvedev, P.: Allsome sequence bloom trees. bioRxiv (2016). http://biorxiv.org/content/early/2016/12/02/090464
34. Trapnell, C., Roberts, A., Goff, L., Pertea, G., Kim, D., Kelley, D.R., Pimentel, H., Salzberg, S.L., Rinn, J.L., Pachter, L.: Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat. Protoc. 7(3), 562–578 (2012)
35. Yu, Y.W., Daniels, N.M., Danko, D.C., Berger, B.: Entropy-scaling search of massive biological data. Cell Syst. 1(2), 130–140 (2015)
36. Ziviani, N., de Moura, E.S., Navarro, G., Baeza-Yates, R.: Compression: a key for next-generation text retrieval systems. IEEE Comput. 33(11), 37–44 (2000)

Acknowledgements

This work has been supported in part by NSF awards DBI-1356529, CCF-551439057, IIS-1453527, and IIS-1421908 to PM.

Author information

Authors and Affiliations

Department of Computer Science and Engineering, The Pennsylvania State University, University Park, USA
Chen Sun & Paul Medvedev
Department of Biology, The Pennsylvania State University, University Park, USA
Robert S. Harris
CNRS, CRIStAL, University of Lille, Lille, France
Rayan Chikhi
Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, USA
Paul Medvedev
Genome Sciences Institute of the Huck, The Pennsylvania State University, University Park, USA
Paul Medvedev

Corresponding author

Correspondence to Paul Medvedev.

Editor information

Editors and Affiliations

Indiana University Bloomington, Bloomington, Indiana, USA
S. Cenk Sahinalp

About this paper

Cite this paper

Sun, C., Harris, R.S., Chikhi, R., Medvedev, P. (2017). AllSome Sequence Bloom Trees. In: Sahinalp, S. (eds) Research in Computational Molecular Biology. RECOMB 2017. Lecture Notes in Computer Science, vol 10229. Springer, Cham. https://doi.org/10.1007/978-3-319-56970-3_17

DOI: https://doi.org/10.1007/978-3-319-56970-3_17
Published: 12 April 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-56969-7
Online ISBN: 978-3-319-56970-3
eBook Packages: Computer Science, Computer Science (R0)
