Pan-Genome Storage and Analysis Techniques

Zekic, Tina; Holley, Guillaume; Stoye, Jens

doi:10.1007/978-1-4939-7463-4_2

Protocol
First Online: 26 December 2017

Part of the book series: Methods in Molecular Biology (MIMB, volume 1704)

Abstract

Computational pan-genome analysis has emerged from the rapid increase in available genome sequencing data. Originating with microbial pan-genomes, the concept has since spread to other groups of organisms, such as plants and viruses. Characterizing a pan-genome provides insights into intra-species evolution, functions, and diversity. However, researchers face challenges such as processing and maintaining large datasets while providing accurate and efficient analysis approaches. Comparative genomics methods are required for detecting conserved and unique regions among a set of genomes. This chapter gives an overview of tools available for indexing pan-genomes, identifying the sub-regions of a pan-genome, and offering a variety of downstream analysis methods. These tools are categorized into two groups, gene-based and sequence-based, according to the pan-genome identification method. We highlight the differences between the tools and their respective advantages and disadvantages, and provide information about the general workflow, methodology of pan-genome identification, covered functionalities, usability, and availability of the tools.
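As an illustration of what identifying the sub-regions of a pan-genome can mean in the gene-based setting, the following minimal Python sketch partitions gene families into those shared by all genomes (core), those shared by some (accessory), and those found in only one (unique). It is not taken from any of the tools surveyed in the chapter. The function name, the input format (a mapping from genome name to a set of gene-family identifiers), and the example strains and genes are all assumptions made for illustration; real gene-based tools first have to derive such families, typically by clustering homologous genes.

```python
from collections import defaultdict


def partition_pan_genome(genomes):
    """Split gene families into core, accessory, and unique sets.

    `genomes` maps a genome name to the set of gene-family identifiers
    found in it. This input format is a hypothetical simplification.
    """
    # Record which genomes carry each gene family.
    carriers = defaultdict(set)
    for name, families in genomes.items():
        for family in families:
            carriers[family].add(name)

    n = len(genomes)
    # Core: present in every genome.
    core = {f for f, g in carriers.items() if len(g) == n}
    # Unique: present in exactly one genome (only meaningful for n > 1).
    unique = {f for f, g in carriers.items() if len(g) == 1 and n > 1}
    # Accessory: everything in between.
    accessory = set(carriers) - core - unique
    return core, accessory, unique


# Made-up example data: three strains and four gene families.
example = {
    "strainA": {"dnaA", "gyrB", "tetR"},
    "strainB": {"dnaA", "gyrB", "blaZ"},
    "strainC": {"dnaA", "gyrB", "tetR"},
}
core, accessory, unique = partition_pan_genome(example)
print(sorted(core), sorted(accessory), sorted(unique))
```

The pan-genome itself is the union of the three sets. Sequence-based approaches address the same question without relying on gene annotation, so this sketch covers only the gene-based view.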

Author information

Authors and Affiliations

Faculty of Technology, Bielefeld University, Bielefeld, Germany
Tina Zekic, Guillaume Holley & Jens Stoye
Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld, Germany
Tina Zekic, Guillaume Holley & Jens Stoye
International Research Training Group 1906, Bielefeld University, Bielefeld, Germany
Tina Zekic, Guillaume Holley & Jens Stoye

Corresponding author

Correspondence to Jens Stoye.

Editor information

Editors and Affiliations

Department of Biochemistry, Institute of Chemistry, University of São Paulo, São Paulo, São Paulo, Brazil
João C. Setubal
Faculty of Technology and Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld, Germany
Jens Stoye
Bioinformatics Group, Department of Computer Science, Interdisciplinary Center for Bioinformatics, University of Leipzig, Leipzig, Germany
Peter F. Stadler

Cite this protocol

Zekic, T., Holley, G., Stoye, J. (2018). Pan-Genome Storage and Analysis Techniques. In: Setubal, J., Stoye, J., Stadler, P. (eds) Comparative Genomics. Methods in Molecular Biology, vol 1704. Humana Press, New York, NY. https://doi.org/10.1007/978-1-4939-7463-4_2

DOI: https://doi.org/10.1007/978-1-4939-7463-4_2
Published: 26 December 2017
Publisher Name: Humana Press, New York, NY
Print ISBN: 978-1-4939-7461-0
Online ISBN: 978-1-4939-7463-4
eBook Packages: Springer Protocols
