Pan-Genome Storage and Analysis Techniques

Part of the book series: Methods in Molecular Biology ((MIMB,volume 1704))

Abstract

Computational pan-genome analysis has emerged from the rapid increase of available genome sequencing data. Starting from a microbial pan-genome, the concept has spread to a variety of species, such as plants or viruses. Characterizing a pan-genome provides insights into intra-species evolution, functions, and diversity. However, researchers face challenges such as processing and maintaining large datasets while providing accurate and efficient analysis approaches. Comparative genomics methods are required for detecting conserved and unique regions across a set of genomes. This chapter gives an overview of tools available for indexing pan-genomes, identifying the sub-regions of a pan-genome, and offering a variety of downstream analysis methods. These tools are categorized into two groups, gene-based and sequence-based, according to the pan-genome identification method. We highlight the differences between the tools, together with their advantages and disadvantages, and provide information about the general workflow, methodology of pan-genome identification, covered functionalities, usability, and availability of the tools.
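
As a rough illustration of what identifying the sub-regions of a pan-genome means in the gene-based setting, the sketch below partitions gene families into a core set (present in every genome), a unique set (present in exactly one genome), and a dispensable set (everything in between). It assumes that genes have already been clustered into families by some orthology method; the function name, strain names, and family identifiers are hypothetical and do not correspond to any tool covered in the chapter.

```python
from typing import Dict, Set


def partition_pan_genome(gene_families: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Partition gene families by how many genomes contain them.

    gene_families maps a genome name to the set of gene-family
    identifiers found in that genome (hypothetical input format).
    """
    genomes = list(gene_families.values())
    if not genomes:
        return {"pan": set(), "core": set(), "dispensable": set(), "unique": set()}

    # Pan-genome: every family seen in at least one genome.
    pan = set().union(*genomes)
    # Core genome: families shared by all genomes.
    core = set.intersection(*genomes)
    # Number of genomes in which each family occurs.
    counts = {family: sum(family in genome for genome in genomes) for family in pan}
    # Unique families occur in exactly one genome; subtracting the core keeps
    # the three subsets disjoint when only a single genome is given.
    unique = {family for family, n in counts.items() if n == 1} - core
    # Dispensable families occur in some, but not all, genomes.
    dispensable = pan - core - unique

    return {"pan": pan, "core": core, "dispensable": dispensable, "unique": unique}


if __name__ == "__main__":
    # Hypothetical toy input: three strains and five gene families.
    example = {
        "strainA": {"g1", "g2", "g3"},
        "strainB": {"g1", "g2", "g4"},
        "strainC": {"g1", "g4", "g5"},
    }
    for name, members in partition_pan_genome(example).items():
        print(name, sorted(members))
```

Sequence-based approaches aim at the same kind of partition but work on genomic sequence directly rather than on annotated genes, so this set arithmetic would only apply once shared regions have been identified by other means.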

Author information

Corresponding author

Correspondence to Jens Stoye.

Copyright information

© 2018 Springer Science+Business Media LLC

About this protocol

Cite this protocol

Zekic, T., Holley, G., Stoye, J. (2018). Pan-Genome Storage and Analysis Techniques. In: Setubal, J., Stoye, J., Stadler, P. (eds) Comparative Genomics. Methods in Molecular Biology, vol 1704. Humana Press, New York, NY. https://doi.org/10.1007/978-1-4939-7463-4_2

  • DOI: https://doi.org/10.1007/978-1-4939-7463-4_2

  • Publisher Name: Humana Press, New York, NY

  • Print ISBN: 978-1-4939-7461-0

  • Online ISBN: 978-1-4939-7463-4

  • eBook Packages: Springer Protocols
