Abstract
This chapter covers the theory and practice of ortholog gene set computation. In the theoretical part we give detailed and formal descriptions of the relevant concepts. We also cover the topic of graph-based clustering as a tool to compute ortholog gene sets. In the second part we provide an overview of practical considerations intended for researchers who need to determine orthologous genes from a collection of annotated genomes, briefly describing some of the most popular programs and resources currently available for this task.
Notes
1. Certain papers in the literature use the term co-ortholog instead of simply ortholog, to emphasize that this is not always a one-to-one relation. However, the literature also records the usage of co-orthologs in the sense of paralogs [24]. To avoid confusion, we use the simpler term.
References
Fitch WM (1970) Distinguishing homologous from analogous proteins. Syst Zool 19:99–113
Petsko GA (2001) Homologuephobia. Genome Biol 2:comment1002
Koonin EV (2001) An apology for orthologs – or brave new memes. Genome Biol 2:comment1005
Gerlt JA, Babbitt PC (2000) Can sequence determine function? Genome Biol 1:R5
Koonin EV (2005) Orthologs, paralogs, and evolutionary genomics. Annu Rev Genet 39:309–338
Innan H, Kondrashov F (2010) The evolution of gene duplications: classifying and distinguishing between models. Nat Rev Genet 11:97–108
Altenhoff AM, Studer RA, Robinson-Rechavi M, Dessimoz C (2012) Resolving the ortholog conjecture: orthologs tend to be weakly, but significantly, more similar in function than paralogs. PLoS Comput Biol 8:e1002514
Studer RA, Robinson-Rechavi M (2009) How confident can we be that orthologs are similar, but paralogs differ? Trends Genet 25:210–216
Nehrt NL, Clark WT, Radivojac P, Hahn MW (2011) Testing the ortholog conjecture with comparative functional genomic data from mammals. PLoS Comput Biol 7:e1002073
Gabaldon T, Koonin EV (2013) Functional and evolutionary implications of gene orthology. Nat Rev Genet 14:360–366
Sonnhammer EL, Gabaldón T, Sousa da Silva AW, Martin M, Robinson-Rechavi M, Boeckmann B, Thomas P, Dessimoz C, and the Quest for Orthologs consortium (2014) Big data and other challenges in the quest for orthologs. Bioinformatics 30(21):2993–2998
Maddison WP (1997) Gene trees in species trees. Syst Biol 46:523–536
Vernot B, Stolzer M, Goldman A, Durand D (2008) Reconciliation with non-binary species trees. J Comput Biol 15:981–1006
Zhang L (1997) On a Mirkin-Muchnik-Smith conjecture for comparing molecular phylogenies. J Comput Biol 4:177–187
Hernandez-Rosales M, Hellmuth M, Wieseke N, Huber KT, Moulton V, Stadler PF (2012) From event-labeled gene trees to species trees. BMC Bioinf 13(Suppl. 19):S6
Doyon J-P, Chauve C, Hamel S (2008) Algorithms for exploring the space of gene tree/species tree reconciliations. In: Nelson CE, Vialette S (eds) Comparative genomics; international workshop, RECOMB-CG 2008. Lecture notes in computer science, vol 5267. Springer, New York, pp 1–13
Doyon J-P, Ranwez V, Daubin V, Berry V (2011) Models, algorithms and programs for phylogeny reconciliation. Brief Bioinform 12:392–400
Page R (1994) Maps between trees and cladistic analysis of historical associations among genes. Syst Biol 43:58–77
Bonizzoni P, Della Vedova G, Dondi R (2005) Reconciling a gene tree to a species tree under the duplication cost model. Theor Comput Sci 347:36–53
Górecki P, Tiuryn J (2006) DLS-trees: a model of evolutionary scenarios. Theor Comput Sci 359:378–399
Guigó R, Muchnik I, Smith TF (1996) Reconstruction of ancient molecular phylogeny. Mol Phylogenet Evol 6:189–213
Page RDM, Charleston MA (1997) From gene to organismal phylogeny: reconciled trees and the gene tree/species tree problem. Mol Phylogenet Evol 7:231–240
Lafond M, Semeria M, Swenson KM, Tannier E, El-Mabrouk N (2013) Gene tree correction guided by orthology. BMC Bioinf 14(S15):S5
Kriventseva EV, Tegenfeldt F, Petty TJ, Waterhouse RM, Simão FA, Pozdnyakov IA, Zdobnov EM (2015) OrthoDB v8: update of the hierarchical catalog of orthologs and the underlying free software. Nucleic Acids Res 43:D250–D256
Sonnhammer ELL, Koonin EV (2002) Orthology, paralogy and proposed classification for paralog subtypes. Trends Genet 18:619–620
Doyon JP, Chauve C, Hamel S (2009) Space of gene/species trees reconciliations and parsimonious models. J Comput Biol 16:1399–1418
Page RDM (2000) Extracting species trees from complex gene trees: reconciled trees and vertebrate phylogeny. Mol Phylogenet Evol 14:89–106
Ma B, Li M, Zhang L (2000) From gene trees to species trees. SIAM J Comput 30:729–752
Arvestad L, Berglund AC, Lagergren J, Sennblad B (2003) Bayesian gene/species tree reconciliation and orthology analysis using MCMC. Bioinformatics 19:i7–i15
Arvestad L, Lagergren J, Sennblad B (2009) The gene evolution model and computing its associated probabilities. J ACM 56:1–44
Górecki P, Burleigh GJ, Eulenstein O (2011) Maximum likelihood models and algorithms for gene tree evolution with duplications and losses. BMC Bioinf 12:S15
Böcker S, Dress AWM (1998) Recovering symbolically dated, rooted trees from symbolic ultrametrics. Adv Math 138:105–125
Hellmuth M, Hernandez-Rosales M, Huber KT, Moulton V, Stadler PF, Wieseke N (2013) Orthology relations, symbolic ultrametrics, and cographs. J Math Biol 66:399–420
Lafond M, El-Mabrouk N (2014) Orthology and paralogy constraints: satisfiability and consistency. BMC Genomics 15(S6):S12
Lafond M, Dondi R, El-Mabrouk N (2016) The link between orthology relations and gene trees: a correction perspective. Algorithms Mol Biol 11:4
Krishnamurthy N, Brown D, Kirshner D, Sjölander K (2006) PhyloFacts: an online structural phylogenomic encyclopedia for protein functional and structural classification. Genome Biol 7:R83
Sjölander K, Datta R, Shen Y, Shoffner G (2011) Ortholog identification in the presence of domain architecture rearrangement. Brief Bioinform 12(5):413–422
Pryszcz LP, Huerta-Cepas J, Gabaldon T (2011) MetaPhOrs: orthology and paralogy predictions from multiple phylogenetic evidence using a consistency-based confidence score. Nucleic Acids Res 39:e32
Afrasiabi C, Samad B, Dineen D, Meacham C, Sjölander K (2013) PhyloFacts FAT-CAT web server: ortholog identification and function prediction using fast approximate tree classification. Nucleic Acids Res 41(W1):W242–W248
Huerta-Cepas J, Capella-Gutierrez S, Pryszcz LP, Marcet-Houben M, Gabaldon T (2014) PhylomeDB v4: zooming into the plurality of evolutionary histories of a genome. Nucleic Acids Res 42:D897–D902
Tatusov RL, Koonin EV, Lipman DJ (1997) A genomic perspective on protein families. Science 278:631–637
Wolf YI, Koonin EV (2012) A tight link between orthologs and bidirectional best hits in bacterial and archaeal genomes. Genome Biol Evol 4:1286–1294
Lechner M, Findeiß S, Steiner L, Marz M, Stadler PF, Prohaska SJ (2011) Proteinortho: detection of (co-)orthologs in large-scale analysis. BMC Bioinf 12:124
Roth ACJ, Gonnet GH, Dessimoz C (2008) Algorithm of OMA for large-scale orthology inference. BMC Bioinf 9:518
Dessimoz C, Boeckmann B, Roth ACJ, Gonnet GH (2006) Detecting non-orthology in the COGs database and other approaches grouping orthologs using genome-specific best hits. Nucleic Acids Res 34:3309–3316
Liu Y, Wang J, Guo J, Chen J (2012) Complexity and parameterized algorithms for cograph editing. Theor Comput Sci 461:45–54
Hellmuth M, Fritz A, Wieseke N, Stadler PF (2015) Techniques for the cograph editing problem: module merge is equivalent to edit P4’s (submitted). arXiv 1509.06983v2
Gao Y, Hare DR, Nastos J (2013) The cluster deletion problem for cographs. Discret Math 313:2763–2771
Rahmann S, Wittkop T, Baumbach J, Martin M, Truß A, Böcker S (2007) Exact and heuristic algorithms for weighted cluster editing. In: Proceedings of the 6th LSS conference on computational systems bioinformatics (CSB2007). Life Sciences Society, pp 391–401
Falls C, Powell B, Snoeyink J (2008) Computing high-stringency COGs using Turán-type graphs. Technical Report, University of North Carolina
Nguyen TH, Ranwez V, Pointet S, Chifolleau AMA, Doyon J-P, Berry V (2013) Reconciliation and local gene tree rearrangement can be of mutual profit. Algorithms Mol Biol 8:12
Doyon J-P, Scornavacca C, Gorbunov KY, Szöllősi G, Ranwez V, Berry V (2010) An efficient algorithm for gene/species trees parsimonious reconciliation with losses, duplications and transfers. In: Tannier E (ed) Comparative genomics. Lecture notes in computer science, vol 6398. Springer, Heidelberg, pp 93–108
Wieseke N, Bernt M, Middendorf M (2013) Unifying parsimonious tree reconciliation. In: Darling A, Stoye J (eds) Algorithms in bioinformatics WABI 2013. Lecture notes in computer science, vol 8126. Springer, Heidelberg, pp 200–214
Donati B, Baudet C, Sinaimeri B, Crescenzi P, Sagot M-F (2015) EUCALYPT: efficient tree reconciliation enumerator. Algorithms Mol Biol 10:3
Hallett MT, Lagergren J (2001) Efficient algorithms for lateral gene transfer problems. In: Lengauer T (ed) Proceedings of the fifth annual international conference on computational biology (RECOMB). ACM, New York, pp 149–156
Fablet M, Bueno M, Potrzebowski L, Kaessmann H (2009) Evolutionary origin and functions of retrogene introns. Mol Biol Evol 26:2147–2156
Hellmuth M, Stadler PF, Wieseke N (2017) The mathematics of xenology: di-cographs, symbolic ultrametrics, 2-structures and tree-representable systems of binary relations. J Math Biol 75:199–237
Fitch WM (2000) Homology: a personal view on some of the problems. Trends Genet 16:227–231
Jensen RA (2001) Orthologs and paralogs – we need to get it right. Genome Biol 2:8
Kristensen DM, Kannan L, Coleman MK, Wolf YI, Sorokin A, Koonin EV, Mushegian A (2010) A low-polynomial algorithm for assembling clusters of orthologous groups from intergenomic symmetric best matches. Bioinformatics 26:1481–1487
Holm L, Heger A (2014) Automated sequence-based approaches for identifying domain families. In: Orengo CA, Bateman A (eds) Protein Families: relating protein sequence, structure, and function. Wiley series in peptide and protein science. Wiley, New York, pp 3–24
Trachana K, Larsson TA, Powell S, Chen W-H, Doerks T, Muller T, Bork P (2011) Orthology prediction methods: a quality assessment using curated protein families. Bioessays 33(10):769–780
Altenhoff AM, Boeckmann B, Capella-Gutierrez S, Dalquen DA, DeLuca T, Forslund K, Huerta-Cepas J, Linard B, Pereira C, Pryszcz LP, Schreiber F, da Silva F, Szklarczyk D, Train CM, Bork P, Lecompte O, von Mering C, Xenarios I, Sjölander K, Jensen LJ, Martin MJ, Muffato M, Gabaldón T, Lewis SE, Thomas PD, Sonnhammer E, Dessimoz C (2016) Standardized benchmarking in the quest for orthologs. Nat Methods 13(5):425–430
Trachana K, Forslund K, Larsson T, Powell S, Doerks T, von Mering C, Bork P (2014) A phylogeny-based benchmarking test for orthology inference reveals the limitations of function-based validation. PLoS ONE 9:e111122
Li L, Stoeckert CJ Jr, Roos DS (2003) OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res 13:2178–2189
Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25:3389–3402
van Dongen S (2000) Graph clustering by flow simulation. PhD Thesis, University of Utrecht, Utrecht
Enright AJ, Van Dongen S, Ouzounis CA (2002) An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res 30:1575–1584
Contreras-Moreira B, Vinuesa P (2013) GET_HOMOLOGUES, a versatile software package for scalable and robust microbial pangenome analysis. Appl Environ Microbiol 79:7696–7701
Galperin MY, Makarova KS, Wolf YI, Koonin EV (2015) Expanded microbial genome coverage and improved protein family annotation in the COG database. Nucleic Acids Res 43:D261–D269
Lechner M, Hernandez-Rosales M, Doerr D, Wieseke N, Thévenin A, Stoye J, Hartmann RK, Prohaska SJ, Stadler PF (2014) Orthology detection combining clustering and synteny for very large datasets. PLoS ONE 9:e105015
Doerr D, Thévenin A, Stoye J (2012) Gene family assignment-free comparative genomics. BMC Bioinf 13(Suppl 19):S3
Hellmuth M, Wieseke N, Lechner M, Lenhof H-P, Middendorf M, Stadler PF (2015) Phylogenetics from paralogs. Proc Natl Acad Sci USA 112:2058–2063
Orengo CA, Bateman A (eds) (2014) Protein Families: relating protein sequence, structure, and function. Wiley series in peptide and protein science. Wiley, New York
The UniProt Consortium (2015) UniProt: a hub for protein information. Nucleic Acids Res 43:D204–D212
Pruitt KD, Tatusova T, Maglott DR (2005) NCBI reference sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res 33(Suppl. 1):D501–D504
Huerta-Cepas J, Szklarczyk D, Forslund K, Cook H, Heller D, Walter MC, Rattei T, Mende DR, Sunagawa S, Kuhn M, Jensen LJ, von Mering C, Bork P (2016) eggNOG 4.5: a hierarchical orthology framework with improved functional annotations for eukaryotic, prokaryotic and viral sequences. Nucleic Acids Res 44(D1):D286–D293
Kanehisa M, Sato Y, Kawashima M, Furumichi M, Tanabe M (2016) KEGG as a reference resource for gene and protein annotation. Nucleic Acids Res 44:D457–D462
Kanehisa M, Goto S, Sato Y, Kawashima M, Furumichi M, Tanabe M (2014) Data, information, knowledge and principle: back to metabolism in KEGG. Nucleic Acids Res 42:D199–D205
Moriya Y, Itoh M, Okuda S, Yoshizawa AC, Kanehisa M (2007) KAAS: an automatic genome annotation and pathway reconstruction server. Nucleic Acids Res 35:W182–W185
Suzuki S, Kakuta M, Ishida T, Akiyama Y (2014) GHOSTX: an improved sequence homology search algorithm using a query suffix array and a database suffix array. PLoS ONE 9(8):e103833
Eddy SR (2004) What is a hidden Markov model? Nat Biotechnol 22:1315–1316
Eddy SR (2011) Accelerated profile HMM searches. PLoS Comput Biol 7:e1002195
Finn RD, Coggill P, Eberhardt RY, Eddy SR, Mistry J, Mitchell AL, Potter SC, Punta M, Qureshi M, Sangrador-Vegas A, Salazar GA, Tate J, Bateman A (2016) The Pfam protein families database: towards a more sustainable future. Nucleic Acids Res 44:D279–D285
Haft DH, Selengut JD, Richter RA, Harkins D, Basu MK, Beck E (2013) TIGRFAMs and genome properties in 2013. Nucleic Acids Res 41:D387–D395
Thomas PD, Campbell MJ, Kejariwal A, Mi H, Karlak B, Daverman R, Diemer K, Muruganujan A, Narechania A (2003) PANTHER: a library of protein families and subfamilies indexed by function. Genome Res 13:2129–2141
Mi H, Lazareva-Ulitsky B, Loo R, Kejariwal A, Vandergriff J, Rabkin S, Guo N, Muruganujan A, Doremieux O, Campbell MJ, Kitano H, Thomas PD (2005) The PANTHER database of protein families, subfamilies, functions and pathways. Nucleic Acids Res 33:D284–D288
Mi H, Guo N, Kejariwal A, Thomas PD (2007) PANTHER version 6: protein sequence and function evolution data with expanded representation of biological pathways. Nucleic Acids Res 35:D247–D252
Mi H, Muruganujan A, Casagrande JT, Thomas PD (2013) Large-scale gene function analysis with the PANTHER classification system. Nat Protoc 8(8):1551–1566
Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G (2000) Gene ontology: tool for the unification of biology. Nat Genet 25(1):25–29
Pedruzzi I, Rivoire C, Auchincloss AH, Coudert E, Keller G, de Castro E, Baratin D, Cuche BA, Bougueleret L, Poux S, Redaschi N, Xenarios I, Bridge A (2015) HAMAP in 2015: updates to the protein family classification and annotation system. Nucleic Acids Res 43:D1064–D1070
Acknowledgements
This work was funded in part by the Deutsche Forschungsgemeinschaft (Proj. No. MI439/14-1) (to PFS) and a CNPq fellowship (to JCS).
Copyright information
© 2018 Springer Science+Business Media LLC
About this protocol
Cite this protocol
Setubal, J.C., Stadler, P.F. (2018). Gene Phylogenies and Orthologous Groups. In: Setubal, J., Stoye, J., Stadler, P. (eds) Comparative Genomics. Methods in Molecular Biology, vol 1704. Humana Press, New York, NY. https://doi.org/10.1007/978-1-4939-7463-4_1
DOI: https://doi.org/10.1007/978-1-4939-7463-4_1
Publisher Name: Humana Press, New York, NY
Print ISBN: 978-1-4939-7461-0
Online ISBN: 978-1-4939-7463-4
eBook Packages: Springer Protocols