Abstract
The ubiquity of next generation sequencing has transformed the size and nature of many databases, pushing the boundaries of current indexing and searching methods. One particular example is a database of 2,652 human RNA-seq experiments uploaded to the Sequence Read Archive. Recently, Solomon and Kingsford proposed the Sequence Bloom Tree data structure and demonstrated how it can be used to accurately identify SRA samples that have a transcript of interest potentially expressed. In this paper, we propose an improvement called the AllSome Sequence Bloom Tree. Results show that our new data structure significantly improves performance, reducing the tree construction time by 52.7% and query time by 39–85%, with a price of up to 3x memory consumption during queries. Notably, it can query a batch of 198,074 queries in under 8 h (compared to around two days previously) and a whole set of \(k\)-mers from a sequencing experiment (about 27 mil \(k\)-mers) in under 11 min.
C. Sun and R.S. Harris—Contributed equally to the work.
The full version of this paper is available at [33].
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Notes
- 1.
SBT-ALSO GitHub repository: https://github.com/medvedevgroup/bloomtree-allsome.
References
SBT-SK software and data. http://www.cs.cmu.edu/%7Eckingsf/software/bloomtree/. Accessed 01 July 2016
Baier, U., Beller, T., Ohlebusch, E.: Graphical pan-genome analysis with compressed suffix trees and the Burrows-Wheeler transform. Bioinformatics 32, 497–504 (2015)
Bloom, B.H.: Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13(7), 422–426 (1970)
Bray, N.L., Pimentel, H., Melsted, P., Pachter, L.: Near-optimal probabilistic RNA-seq quantification. Nat. Biotechnol. 34(5), 525–527 (2016)
Chambi, S., Lemire, D., Kaser, O., Godin, R.: Better bitmap performance with roaring bitmaps. Softw. Pract. Exp. 46(5), 709–719 (2015)
Chikhi, R., Rizk, G.: Space-efficient and exact de Bruijn graph representation based on a bloom filter. Algorithms Mol. Biol. 8(1), 1 (2013)
Consortium, C.P.G., et al: Computational pan-genomics: status, promises and challenges. Brief. Bioinform. bbw089 (2016)
Crainiceanu, A., Lemire, D.: Bloofi: multidimensional bloom filters. Inf. Syst. 54, 311–324 (2015)
Dolle, D.D., Liu, Z., Cotten, M.L., Simpson, J.T., Iqbal, Z., Durbin, R., McCarthy, S., Keane, T.: Using reference-free compressed data structures to analyse sequencing reads from thousands of human genomes. Genome Res. 27, 300–309 (2016)
Ernst, C., Rahmann, S.: PanCake: a data structure for pangenomes. Ger. Conf. Bioinform. 34, 35–45 (2013)
Gog, S., Beller, T., Moffat, A., Petri, M.: From theory to practice: plug and play with succinct data structures. In: Gudmundsson, J., Katajainen, J. (eds.) SEA 2014. LNCS, vol. 8504, pp. 326–337. Springer, Cham (2014). doi:10.1007/978-3-319-07959-2_28
Heo, Y., Wu, X.L., Chen, D., Ma, J., Hwu, W.M.: BLESS: bloom filter-based error correction solution for high-throughput sequencing reads. Bioinformatics 30, 1354–1362 (2014)
Holley, G., Wittler, R., Stoye, J.: Bloom filter trie – a data structure for pan-genome storage. In: Pop, M., Touzet, H. (eds.) WABI 2015. LNCS, vol. 9289, pp. 217–230. Springer, Heidelberg (2015). doi:10.1007/978-3-662-48221-6_16
de Hoon, M.J., Imoto, S., Nolan, J., Miyano, S.: Open source clustering software. Bioinformatics 20(9), 1453–1454 (2004)
Leinonen, R., Sugawara, H., Shumway, M.: The sequence read archive. Nucleic Acids Res. 39, D19–D21 (2010)
Liu, B., Zhu, D., Wang, Y.: deBWT: parallel construction of Burrows-Wheeler Transform for large collection of genomes with de Bruijn-branch encoding. Bioinformatics 32(12), i174–i182 (2016)
Loh, P.R., Baym, M., Berger, B.: Compressive genomics. Nat. Biotechnol. 30(7), 627–630 (2012)
Mäkinen, V., Belazzougui, D., Cunial, F., Tomescu, A.I.: Genome-Scale Algorithm Design. Cambridge University Press, Cambridge (2015)
Marçais, G., Kingsford, C.: A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics 27(6), 764–770 (2011)
Marchet, C., Limasset, A., Bittner, L., Peterlongo, P.: A resource-frugal probabilistic dictionary and applications in (meta) genomics (2016). arXiv preprint: arXiv:1605.08319
Marcus, S., Lee, H., Schatz, M.C.: SplitMEM: a graphical algorithm for pan-genome analysis with suffix skips. Bioinformatics 30(24), 3476–3483 (2014)
Melsted, P., Pritchard, J.K.: Efficient counting of k-mers in DNA sequences using a bloom filter. BMC Bioinform. 12(1), 333 (2011)
Minkin, I., Pham, S., Medvedev, P.: TwoPaCo: an efficient algorithm to build the compacted de Bruijn graph from many complete genomes. Bioinformatics btw609 (2016)
Murray, K.D., Webers, C., Ong, C.S., Borevitz, J.O., Warthmann, N.: kWIP: the k-mer weighted inner product, a de novo estimator of genetic similarity (2016). bioRxiv: 075481
Navarro, G., De Moura, E.S., Neubert, M., Ziviani, N., Baeza-Yates, R.: Adding compression to block addressing inverted indexes. Inf. Retr. 3(1), 49–77 (2000)
Nellore, A., Collado-Torres, L., Jaffe, A.E., Alquicira-Hernández, J., Wilks, C., Pritt, J., Morton, J., Leek, J.T., Langmead, B.: Rail-RNA: scalable analysis of RNA-seq splicing and coverage. Bioinformatics btw575 (2016)
Patro, R., Mount, S.M., Kingsford, C.: Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms. Nat. Biotechnol. 32(5), 462–464 (2014)
Raman, R., Raman, V., Rao, S.S.: Succinct indexable dictionaries with applications to encoding k-ary trees and multisets. In: Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 233–242. Society for Industrial and Applied Mathematics (2002)
Rozov, R., Shamir, R., Halperin, E.: Fast lossless compression via cascading bloom filters. BMC Bioinform. 15(9), 1 (2014)
Salikhov, K., Sacomoto, G., Kucherov, G.: Using cascading bloom filters to improve the memory usage for de Brujin graphs. In: Darling, A., Stoye, J. (eds.) WABI 2013. LNCS, vol. 8126, pp. 364–376. Springer, Heidelberg (2013). doi:10.1007/978-3-642-40453-5_28
Solomon, B., Kingsford, C.: Fast search of thousands of short-read sequencing experiments. Nat. Biotechnol. 34(3), 300–302 (2016)
Stranneheim, H., Käller, M., Allander, T., Andersson, B., Arvestad, L., Lundeberg, J.: Classification of DNA sequences using bloom filters. Bioinformatics 26(13), 1595–1600 (2010)
Sun, C., Harris, R.S., Chikhi, R., Medvedev, P.: Allsome sequence bloom trees. bioRxiv (2016). http://biorxiv.org/content/early/2016/12/02/090464
Trapnell, C., Roberts, A., Goff, L., Pertea, G., Kim, D., Kelley, D.R., Pimentel, H., Salzberg, S.L., Rinn, J.L., Pachter, L.: Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat. Protoc. 7(3), 562–578 (2012)
Yu, Y.W., Daniels, N.M., Danko, D.C., Berger, B.: Entropy-scaling search of massive biological data. Cell Syst. 1(2), 130–140 (2015)
Ziviani, N., de Moura, E.S., Navarro, G., Baeza-Yates, R.: Compression: a key for next-generation text retrieval systems. IEEE Comput. 33(11), 37–44 (2000)
Acknowledgements
This work has been supported in part by NSF awards DBI-1356529, CCF-551439057, IIS-1453527, and IIS-1421908 to PM.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Sun, C., Harris, R.S., Chikhi, R., Medvedev, P. (2017). AllSome Sequence Bloom Trees. In: Sahinalp, S. (eds) Research in Computational Molecular Biology. RECOMB 2017. Lecture Notes in Computer Science(), vol 10229. Springer, Cham. https://doi.org/10.1007/978-3-319-56970-3_17
Download citation
DOI: https://doi.org/10.1007/978-3-319-56970-3_17
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-56969-7
Online ISBN: 978-3-319-56970-3
eBook Packages: Computer ScienceComputer Science (R0)