Abstract
Computational pan-genome analysis has emerged from the rapid increase of available genome sequencing data. Starting from the microbial pan-genome, the concept has spread to a variety of other groups, such as plants and viruses. Characterizing a pan-genome provides insights into intra-species evolution, functions, and diversity. However, researchers face challenges such as processing and maintaining large datasets while providing accurate and efficient analysis approaches. Comparative genomics methods are required for detecting conserved and unique regions across a set of genomes. This chapter gives an overview of tools available for indexing pan-genomes, identifying the sub-regions of a pan-genome, and offering a variety of downstream analysis methods. These tools are categorized into two groups, gene-based and sequence-based, according to the pan-genome identification method. We highlight the differences between the tools, along with their advantages and disadvantages, and provide information about the general workflow, methodology of pan-genome identification, covered functionalities, usability, and availability of the tools.
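As a rough illustration of what "identifying the sub-regions of a pan-genome" can mean in the gene-based setting, the sketch below partitions gene families by how many genomes contain them. It is a minimal example under stated assumptions, not the method of any tool covered in the chapter: the function name `partition_pan_genome`, the labels "core", "accessory" and "unique", and the toy strain and family identifiers are all hypothetical, and the input is assumed to be an already computed mapping from each genome to its set of gene families.

```python
from typing import Dict, Set


def partition_pan_genome(presence: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Split gene families into core, accessory and unique sets.

    `presence` maps a genome name to the set of gene-family identifiers
    found in that genome (hypothetical input format, assumed precomputed).
    """
    genomes = list(presence)
    if not genomes:
        return {"pan": set(), "core": set(), "accessory": set(), "unique": set()}

    # The pan-genome is the union of all families seen in any genome.
    pan = set().union(*presence.values())
    # Core families are present in every genome (the conserved part).
    core = set.intersection(*(presence[g] for g in genomes))
    # Count in how many genomes each family occurs.
    counts = {fam: sum(fam in presence[g] for g in genomes) for fam in pan}
    # Unique families occur in exactly one genome; with a single genome
    # everything is core, so nothing is reported as unique.
    unique = {fam for fam, n in counts.items() if n == 1 and len(genomes) > 1}
    # Everything else is shared by some, but not all, genomes.
    accessory = pan - core - unique
    return {"pan": pan, "core": core, "accessory": accessory, "unique": unique}


# Toy data (made up for illustration only).
toy = {
    "strainA": {"fam1", "fam2", "fam3"},
    "strainB": {"fam1", "fam2", "fam4"},
    "strainC": {"fam1", "fam5"},
}
result = partition_pan_genome(toy)
for label in ("core", "accessory", "unique"):
    print(label, sorted(result[label]))
```

Gene-based tools differ mainly in how they arrive at the family assignment that this sketch takes as given (for example, via homology search and clustering), whereas sequence-based approaches work on the genome sequences directly and do not need such a table.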
Copyright information
© 2018 Springer Science+Business Media LLC
About this protocol
Cite this protocol
Zekic, T., Holley, G., Stoye, J. (2018). Pan-Genome Storage and Analysis Techniques. In: Setubal, J., Stoye, J., Stadler, P. (eds) Comparative Genomics. Methods in Molecular Biology, vol 1704. Humana Press, New York, NY. https://doi.org/10.1007/978-1-4939-7463-4_2
DOI: https://doi.org/10.1007/978-1-4939-7463-4_2
Publisher Name: Humana Press, New York, NY
Print ISBN: 978-1-4939-7461-0
Online ISBN: 978-1-4939-7463-4
eBook Packages: Springer Protocols