AllSome Sequence Bloom Trees

  • Conference paper
Research in Computational Molecular Biology (RECOMB 2017)

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 10229)

Abstract

The ubiquity of next-generation sequencing has transformed the size and nature of many databases, pushing the boundaries of current indexing and searching methods. One particular example is a database of 2,652 human RNA-seq experiments uploaded to the Sequence Read Archive (SRA). Recently, Solomon and Kingsford proposed the Sequence Bloom Tree data structure and demonstrated how it can be used to accurately identify SRA samples in which a transcript of interest is potentially expressed. In this paper, we propose an improvement called the AllSome Sequence Bloom Tree. Results show that our new data structure significantly improves performance, reducing the tree construction time by 52.7% and query time by 39–85%, at the cost of up to 3x memory consumption during queries. Notably, it can process a batch of 198,074 queries in under 8 h (compared to around two days previously) and a whole set of \(k\)-mers from a sequencing experiment (about 27 million \(k\)-mers) in under 11 min.

C. Sun and R.S. Harris contributed equally to the work.

The full version of this paper is available at [33].

Notes

  1. SBT-ALSO GitHub repository: https://github.com/medvedevgroup/bloomtree-allsome.

References

  1. SBT-SK software and data. http://www.cs.cmu.edu/%7Eckingsf/software/bloomtree/. Accessed 01 July 2016

  2. Baier, U., Beller, T., Ohlebusch, E.: Graphical pan-genome analysis with compressed suffix trees and the Burrows-Wheeler transform. Bioinformatics 32, 497–504 (2015)

  3. Bloom, B.H.: Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13(7), 422–426 (1970)

  4. Bray, N.L., Pimentel, H., Melsted, P., Pachter, L.: Near-optimal probabilistic RNA-seq quantification. Nat. Biotechnol. 34(5), 525–527 (2016)

  5. Chambi, S., Lemire, D., Kaser, O., Godin, R.: Better bitmap performance with roaring bitmaps. Softw. Pract. Exp. 46(5), 709–719 (2015)

  6. Chikhi, R., Rizk, G.: Space-efficient and exact de Bruijn graph representation based on a bloom filter. Algorithms Mol. Biol. 8(1), 1 (2013)

  7. Consortium, C.P.G., et al.: Computational pan-genomics: status, promises and challenges. Brief. Bioinform. bbw089 (2016)

  8. Crainiceanu, A., Lemire, D.: Bloofi: multidimensional bloom filters. Inf. Syst. 54, 311–324 (2015)

  9. Dolle, D.D., Liu, Z., Cotten, M.L., Simpson, J.T., Iqbal, Z., Durbin, R., McCarthy, S., Keane, T.: Using reference-free compressed data structures to analyse sequencing reads from thousands of human genomes. Genome Res. 27, 300–309 (2016)

  10. Ernst, C., Rahmann, S.: PanCake: a data structure for pangenomes. Ger. Conf. Bioinform. 34, 35–45 (2013)

  11. Gog, S., Beller, T., Moffat, A., Petri, M.: From theory to practice: plug and play with succinct data structures. In: Gudmundsson, J., Katajainen, J. (eds.) SEA 2014. LNCS, vol. 8504, pp. 326–337. Springer, Cham (2014). doi:10.1007/978-3-319-07959-2_28

  12. Heo, Y., Wu, X.L., Chen, D., Ma, J., Hwu, W.M.: BLESS: bloom filter-based error correction solution for high-throughput sequencing reads. Bioinformatics 30, 1354–1362 (2014)

  13. Holley, G., Wittler, R., Stoye, J.: Bloom filter trie – a data structure for pan-genome storage. In: Pop, M., Touzet, H. (eds.) WABI 2015. LNCS, vol. 9289, pp. 217–230. Springer, Heidelberg (2015). doi:10.1007/978-3-662-48221-6_16

  14. de Hoon, M.J., Imoto, S., Nolan, J., Miyano, S.: Open source clustering software. Bioinformatics 20(9), 1453–1454 (2004)

  15. Leinonen, R., Sugawara, H., Shumway, M.: The sequence read archive. Nucleic Acids Res. 39, D19–D21 (2010)

  16. Liu, B., Zhu, D., Wang, Y.: deBWT: parallel construction of Burrows-Wheeler Transform for large collection of genomes with de Bruijn-branch encoding. Bioinformatics 32(12), i174–i182 (2016)

  17. Loh, P.R., Baym, M., Berger, B.: Compressive genomics. Nat. Biotechnol. 30(7), 627–630 (2012)

  18. Mäkinen, V., Belazzougui, D., Cunial, F., Tomescu, A.I.: Genome-Scale Algorithm Design. Cambridge University Press, Cambridge (2015)

  19. Marçais, G., Kingsford, C.: A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics 27(6), 764–770 (2011)

  20. Marchet, C., Limasset, A., Bittner, L., Peterlongo, P.: A resource-frugal probabilistic dictionary and applications in (meta) genomics (2016). arXiv preprint: arXiv:1605.08319

  21. Marcus, S., Lee, H., Schatz, M.C.: SplitMEM: a graphical algorithm for pan-genome analysis with suffix skips. Bioinformatics 30(24), 3476–3483 (2014)

  22. Melsted, P., Pritchard, J.K.: Efficient counting of k-mers in DNA sequences using a bloom filter. BMC Bioinform. 12(1), 333 (2011)

  23. Minkin, I., Pham, S., Medvedev, P.: TwoPaCo: an efficient algorithm to build the compacted de Bruijn graph from many complete genomes. Bioinformatics btw609 (2016)

  24. Murray, K.D., Webers, C., Ong, C.S., Borevitz, J.O., Warthmann, N.: kWIP: the k-mer weighted inner product, a de novo estimator of genetic similarity (2016). bioRxiv: 075481

  25. Navarro, G., De Moura, E.S., Neubert, M., Ziviani, N., Baeza-Yates, R.: Adding compression to block addressing inverted indexes. Inf. Retr. 3(1), 49–77 (2000)

  26. Nellore, A., Collado-Torres, L., Jaffe, A.E., Alquicira-Hernández, J., Wilks, C., Pritt, J., Morton, J., Leek, J.T., Langmead, B.: Rail-RNA: scalable analysis of RNA-seq splicing and coverage. Bioinformatics btw575 (2016)

  27. Patro, R., Mount, S.M., Kingsford, C.: Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms. Nat. Biotechnol. 32(5), 462–464 (2014)

  28. Raman, R., Raman, V., Rao, S.S.: Succinct indexable dictionaries with applications to encoding k-ary trees and multisets. In: Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 233–242. Society for Industrial and Applied Mathematics (2002)

  29. Rozov, R., Shamir, R., Halperin, E.: Fast lossless compression via cascading bloom filters. BMC Bioinform. 15(9), 1 (2014)

  30. Salikhov, K., Sacomoto, G., Kucherov, G.: Using cascading bloom filters to improve the memory usage for de Brujin graphs. In: Darling, A., Stoye, J. (eds.) WABI 2013. LNCS, vol. 8126, pp. 364–376. Springer, Heidelberg (2013). doi:10.1007/978-3-642-40453-5_28

  31. Solomon, B., Kingsford, C.: Fast search of thousands of short-read sequencing experiments. Nat. Biotechnol. 34(3), 300–302 (2016)

  32. Stranneheim, H., Käller, M., Allander, T., Andersson, B., Arvestad, L., Lundeberg, J.: Classification of DNA sequences using bloom filters. Bioinformatics 26(13), 1595–1600 (2010)

  33. Sun, C., Harris, R.S., Chikhi, R., Medvedev, P.: Allsome sequence bloom trees. bioRxiv (2016). http://biorxiv.org/content/early/2016/12/02/090464

  34. Trapnell, C., Roberts, A., Goff, L., Pertea, G., Kim, D., Kelley, D.R., Pimentel, H., Salzberg, S.L., Rinn, J.L., Pachter, L.: Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat. Protoc. 7(3), 562–578 (2012)

  35. Yu, Y.W., Daniels, N.M., Danko, D.C., Berger, B.: Entropy-scaling search of massive biological data. Cell Syst. 1(2), 130–140 (2015)

  36. Ziviani, N., de Moura, E.S., Navarro, G., Baeza-Yates, R.: Compression: a key for next-generation text retrieval systems. IEEE Comput. 33(11), 37–44 (2000)

Acknowledgements

This work has been supported in part by NSF awards DBI-1356529, CCF-551439057, IIS-1453527, and IIS-1421908 to PM.

Author information

Corresponding author

Correspondence to Paul Medvedev.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Sun, C., Harris, R.S., Chikhi, R., Medvedev, P. (2017). AllSome Sequence Bloom Trees. In: Sahinalp, S. (eds) Research in Computational Molecular Biology. RECOMB 2017. Lecture Notes in Computer Science, vol 10229. Springer, Cham. https://doi.org/10.1007/978-3-319-56970-3_17

  • DOI: https://doi.org/10.1007/978-3-319-56970-3_17

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-56969-7

  • Online ISBN: 978-3-319-56970-3

  • eBook Packages: Computer Science, Computer Science (R0)
