Recoloring the Colored de Bruijn Graph

  • Conference paper
String Processing and Information Retrieval (SPIRE 2018)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 11147)

Abstract

The colored de Bruijn graph, an extension of the de Bruijn graph, is routinely applied to variant calling, genotyping, genome assembly, and various other tasks [11]. In this data structure, the edges are labeled with one or more colors from a set \(\{c_1, \dots , c_{\alpha } \}\), and the labels are stored as an \(m \times \alpha \) matrix, where m is the number of edges. Recently, there has been a significant amount of work on compact representations of this color matrix, but all existing methods focus on compressing the matrix itself [3, 10, 12, 14]. In this paper, we explore the problem of recoloring the graph in order to reduce the number of colors and thus decrease the size of the color matrix. We show that finding the minimum number of colors needed for recoloring is not only NP-hard but also difficult to approximate within a reasonable factor. These hardness results motivate the recoloring heuristic that we present in this paper. Our results show that this heuristic reduces the number of colors by between one and two orders of magnitude. More specifically, when the number of colors is large (more than 5,000,000), our heuristic reduces it by a factor of 136. An implementation of this heuristic is publicly available at https://github.com/baharpan/cosmo/tree/Recoloring.
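
To make the idea of recoloring concrete, the following is a minimal sketch, not the heuristic from the paper. It assumes, purely for illustration, that two original colors may share a new color whenever they label no common edge, and it applies a generic first-fit greedy pass. The function name `greedy_recolor` and the input layout (one set of edge ids per original color, i.e. one column of the \(m \times \alpha \) matrix) are hypothetical.

```python
from typing import List, Set


def greedy_recolor(edge_sets: List[Set[int]]) -> List[int]:
    """First-fit recoloring sketch (illustrative, not the paper's method).

    edge_sets[c] holds the ids of the edges labeled with original color c.
    Assumed merging rule: two original colors can share a new color only
    if their edge sets are disjoint.
    Returns a list mapping each original color to its new color id.
    """
    merged: List[Set[int]] = []  # edges already covered by each new color
    mapping: List[int] = []
    for edges in edge_sets:
        for new_id, covered in enumerate(merged):
            if covered.isdisjoint(edges):
                covered |= edges          # reuse an existing new color
                mapping.append(new_id)
                break
        else:
            merged.append(set(edges))     # no compatible color: open a new one
            mapping.append(len(merged) - 1)
    return mapping


# Toy input: 4 original colors over 5 edges.
columns = [{0, 1}, {2, 3}, {1, 2}, {4}]
mapping = greedy_recolor(columns)
print(mapping)                # [0, 0, 1, 0]
print(len(set(mapping)))      # 2 new colors instead of 4
```

On this toy input the color matrix shrinks from four columns to two. A first-fit pass gives no optimality guarantee, which is consistent with the hardness results stated above; whether merged colors remain usable for a given downstream query is outside the scope of this sketch.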

Notes

  1. ZPP is the complexity class of problems solvable by a randomized algorithm in expected polynomial time.

References

  1. The 100,000 Genomes Project Protocol v3 (2017). https://doi.org/10.6084/m9.figshare.4530893.v2

  2. Alipanahi, B., et al.: Resistome SNP calling via read colored de Bruijn graphs. In: RECOMB-Seq (2018)

  3. Almodaresi, F., Pandey, P., Patro, R.: Rainbowfish: a succinct colored de Bruijn graph representation. In: WABI, pp. 251–256 (2017)

  4. Belazzougui, D., Gagie, T., Mäkinen, V., Previtali, M.: Fully dynamic de Bruijn graphs. In: Inenaga, S., Sadakane, K., Sakai, T. (eds.) SPIRE 2016. LNCS, vol. 9954, pp. 145–152. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46049-9_14

  5. Bermond, J.C., Hell, P.: On even factorizations and the chromatic index of the Kautz and de Bruijn digraphs. J. Graph Theory 17(5), 647–655 (1993)

  6. Burrows, M., Wheeler, D.J.: A block sorting lossless data compression algorithm. Technical report 124, Digital Equipment Corporation (1994)

  7. Elias, P.: Efficient storage and retrieval by content and address of static files. J. ACM (JACM) 21(2), 246–260 (1974)

  8. Feige, U., Kilian, J.: Zero knowledge and the chromatic number. In: Conference on Computational Complexity, pp. 278–287 (1996)

  9. Gog, S., Beller, T., Moffat, A., Petri, M.: From theory to practice: plug and play with succinct data structures. In: Gudmundsson, J., Katajainen, J. (eds.) SEA 2014. LNCS, vol. 8504, pp. 326–337. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-07959-2_28

  10. Holley, G.: Bloom filter Trie: an alignment-free and reference-free data structure for pan-genome storage. Algorithms Mol. Biol. 11, 3 (2016)

  11. Iqbal, Z.: De novo assembly and genotyping of variants using colored de Bruijn graphs. Nat. Genet. 44(2), 226–232 (2012)

  12. Marcus, S.: Splitmem: a graphical algorithm for pan-genome analysis with suffix skips. Bioinformatics 30, 3476–3483 (2014)

  13. Mario, F.R.: On the number of bits required to implement an associative memory. Massachusetts Institute of Technology, Project MAC (1971)

  14. Muggli, M.D., et al.: Succinct colored de Bruijn graphs. Bioinformatics 33, 3181–3187 (2017)

  15. Okanohara, D., Sadakane, K.: Practical entropy-compressed rank/select dictionary. In: Proceedings of the Meeting on Algorithm Engineering & Experiments, pp. 60–70 (2007)

  16. Sánchez-Arroyo, A.: Determining the total colouring number is NP-hard. Discret. Math. 78, 315–319 (1989)

  17. Simpson, J.T., Durbin, R.: Efficient construction of an assembly string graph using the FM-index. Bioinformatics 26(12), i367–i373 (2010)

Author information

Corresponding author

Correspondence to Bahar Alipanahi.

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Cite this paper

Alipanahi, B., Kuhnle, A., Boucher, C. (2018). Recoloring the Colored de Bruijn Graph. In: Gagie, T., Moffat, A., Navarro, G., Cuadros-Vargas, E. (eds) String Processing and Information Retrieval. SPIRE 2018. Lecture Notes in Computer Science, vol 11147. Springer, Cham. https://doi.org/10.1007/978-3-030-00479-8_1

  • DOI: https://doi.org/10.1007/978-3-030-00479-8_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-00478-1

  • Online ISBN: 978-3-030-00479-8

  • eBook Packages: Computer Science, Computer Science (R0)
