An Efficient, Scalable and Exact Representation of High-Dimensional Color Information Enabled via de Bruijn Graph Search

  • Fatemeh Almodaresi
  • Prashant Pandey
  • Michael Ferdman
  • Rob Johnson
  • Rob Patro
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11467)


The colored de Bruijn graph (cdbg) and its variants have become an important combinatorial structure used in numerous areas of genomics, such as population-level variation detection in metagenomic samples, large-scale sequence search, and cdbg-based reference sequence indices. As samples or genomes are added to the cdbg, the color information comes to dominate the space required to represent this data structure.
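To make the notion of color information concrete, the following minimal sketch maps each k-mer to a bit vector recording which samples contain it. It is an illustration only: the function names (`kmers`, `build_color_classes`) and the toy sequences are hypothetical, reverse complements are ignored, and plain Python dictionaries stand in for the succinct structures that real tools use.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield every length-k substring of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_color_classes(samples, k):
    """Map each k-mer to a bit vector (stored as an int) of the samples containing it.

    Bit i is set when the k-mer occurs in samples[i]. The distinct bit
    vectors that arise are the color classes.
    """
    kmer_to_color = defaultdict(int)
    for i, seq in enumerate(samples):
        for km in kmers(seq, k):
            kmer_to_color[km] |= 1 << i
    return dict(kmer_to_color)


# Toy input: three short "samples" (hypothetical data).
samples = ["ACGTAC", "CGTACG", "TTACGT"]
colors = build_color_classes(samples, k=3)
color_classes = sorted(set(colors.values()))
```

Each k-mer carries one bit per sample, so the color table gains a column with every added sample while the k-mer set itself grows more slowly; this is why the color information comes to dominate.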

In this paper, we show how to represent the color information efficiently by adopting a hierarchical encoding that exploits correlations among color classes (patterns of color occurrence) present in the de Bruijn graph (dbg). A major challenge in deriving an efficient encoding of the color information that takes advantage of such correlations is determining which color classes are close to each other in the high-dimensional space of possible color patterns. We demonstrate that the dbg itself can be used as an efficient mechanism to search for approximate nearest neighbors in this space. While our approach reduces the encoding size of the color information even for relatively small cdbgs (hundreds of experiments), the gains are particularly consequential as the number of potential colors (i.e., samples or references) grows into the thousands.
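One way such a hierarchical encoding could be realized is sketched below. The abstract does not spell out the construction, so the details here are assumptions: color classes of k-mers that are adjacent in the dbg are treated as candidate neighbors, a minimum spanning tree is built over those candidates (with an all-zero root as a fallback), and each class is stored as the XOR delta against its parent. All names (`build_delta_tree`, `decode`, the toy table) are hypothetical.

```python
def popcount(x):
    """Number of set bits in a bit vector stored as an int."""
    return bin(x).count("1")


def build_delta_tree(kmer_to_color):
    """Return (vectors, parent, delta) for a spanning tree over color classes.

    vectors[0] is an all-zero root. For i > 0,
    vectors[i] == vectors[parent[i]] ^ delta[i].
    """
    vectors = [0] + sorted(set(kmer_to_color.values()) - {0})
    ident = {v: i for i, v in enumerate(vectors)}

    # Candidate edges: classes carried by k-mers adjacent in the dbg
    # (assumed here to be likely near neighbors in color space).
    pairs = set()
    for km, c in kmer_to_color.items():
        for base in "ACGT":
            d = kmer_to_color.get(km[1:] + base)
            if d is not None and d != c:
                a, b = ident[c], ident[d]
                pairs.add((min(a, b), max(a, b)))
    # Fallback edges from the root keep the candidate graph connected.
    pairs.update((0, i) for i in range(1, len(vectors)))
    edges = sorted((popcount(vectors[a] ^ vectors[b]), a, b) for a, b in pairs)

    # Kruskal's algorithm with a small union-find.
    rep = list(range(len(vectors)))

    def find(x):
        while rep[x] != x:
            rep[x] = rep[rep[x]]
            x = rep[x]
        return x

    adj = [[] for _ in vectors]
    for _, a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            rep[ra] = rb
            adj[a].append(b)
            adj[b].append(a)

    # Orient the tree away from the root and record the XOR deltas.
    parent = [0] * len(vectors)
    delta = [0] * len(vectors)
    seen, stack = {0}, [0]
    while stack:
        p = stack.pop()
        for c in adj[p]:
            if c not in seen:
                seen.add(c)
                parent[c] = p
                delta[c] = vectors[p] ^ vectors[c]
                stack.append(c)
    return vectors, parent, delta


def decode(i, parent, delta):
    """Recover class i by XOR-ing the deltas on the path up to the root."""
    v = 0
    while i != 0:
        v ^= delta[i]
        i = parent[i]
    return v


# Toy k-mer-to-color table (hypothetical data, 3 samples).
toy = {"ACG": 0b111, "CGT": 0b111, "GTA": 0b011, "TAC": 0b111, "TTA": 0b100}
vectors, parent, delta = build_delta_tree(toy)
```

In this sketch the cost of a class is the number of bits in its delta rather than in the full vector, so closely related classes become cheap to store, at the price of walking a path to the root at decode time. The candidate edges come only from dbg adjacency, which avoids comparing all pairs of color classes.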

We apply this encoding in the context of two different applications: the implicit cdbg used for a large-scale sequence search index, Mantis, and the encoding of color information used in population-level variation detection by tools such as Vari and Rainbowfish. Our results show significant improvements in the overall size and scalability of the representation of the color information. In our experiment on 10,000 samples, we achieved more than \(11\times\) better compression compared to RRR.


Acknowledgments and Declarations

This work was supported by the US National Science Foundation grants BIO-1564917, CCF-1439084, CCF-1716252, and CNS-1408695, and by National Institutes of Health grant R01HG009937. The experiments were conducted with equipment purchased through NSF CISE Research Infrastructure Grant Number 1405641. RP is a co-founder of Ocean Genomics.



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Fatemeh Almodaresi (1)
  • Prashant Pandey (1, corresponding author)
  • Michael Ferdman (1)
  • Rob Johnson (1, 2)
  • Rob Patro (1)

  1. Computer Science Department, Stony Brook University, Stony Brook, USA
  2. VMware Research, Palo Alto, USA
