Gene Phylogenies and Orthologous Groups

  • Protocol
Comparative Genomics

Part of the book series: Methods in Molecular Biology (MIMB, volume 1704)

Abstract

This chapter covers the theory and practice of ortholog gene set computation. In the theoretical part we give detailed and formal descriptions of the relevant concepts. We also cover the topic of graph-based clustering as a tool to compute ortholog gene sets. In the second part we provide an overview of practical considerations intended for researchers who need to determine orthologous genes from a collection of annotated genomes, briefly describing some of the most popular programs and resources currently available for this task.

Notes

  1. Certain papers in the literature use the term co-ortholog instead of simply ortholog, to emphasize that this is not always a one-to-one relation. However, the literature also records the usage of co-orthologs in the sense of paralogs [24]. To avoid confusion, we use the simpler term.

References

  1. Fitch WM (1970) Distinguishing homologous from analogous proteins. Syst Zool 19:99–113

  2. Petsko GA (2001) Homologuephobia. Genome Biol 2:comment1002

  3. Koonin EV (2001) An apology for orthologs – or brave new memes. Genome Biol 2:comment1005

  4. Gerlt JA, Babbitt PC (2000) Can sequence determine function? Genome Biol 1:R5

  5. Koonin E (2005) Orthologs, paralogs, and evolutionary genomics. Ann Rev Genet 39:309–338

  6. Innan H, Kondrashov F (2010) The evolution of gene duplications: classifying and distinguishing between models. Nat Rev Genet 11:97–108

  7. Altenhoff AM, Studer RA, Robinson-Rechavi M, Dessimoz C (2012) Resolving the ortholog conjecture: orthologs tend to be weakly, but significantly, more similar in function than paralogs. PLoS Comput Biol 8:e1002514

  8. Studer RA, Robinson-Rechavi M (2009) How confident can we be that orthologs are similar, but paralogs differ? Trends Genet 25:210–216

  9. Nehrt NL, Clark WT, Radivojac P, Hahn MW (2011) Testing the ortholog conjecture with comparative functional genomic data from mammals. PLoS Comput Biol 7:e1002073

  10. Gabaldon T, Koonin EV (2013) Functional and evolutionary implications of gene orthology. Nat Rev Genet 14:360–366

  11. Sonnhammer EL, Gabaldón T, Sousa da Silva AW, Martin M, Robinson-Rechavi M, Boeckmann B, Thomas P, Dessimoz C, and the Quest for Orthologs consortium (2014) Big data and other challenges in the quest for orthologs. Bioinformatics 30(21):2993–2998

  12. Maddison WP (1997) Gene trees in species trees. Syst Biol 46:523–536

  13. Vernot B, Stolzer M, Goldman A, Durand D (2008) Reconciliation with non-binary species trees. J Comput Biol 15:981–1006

  14. Zhang L (1997) On a Mirkin-Muchnik-Smith conjecture for comparing molecular phylogenies. J Comput Biol 4:177–187

  15. Hernandez-Rosales M, Hellmuth M, Wieseke N, Huber KT, Moulton V, Stadler PF (2012) From event-labeled gene trees to species trees. BMC Bioinf 13(Suppl. 19):S6

  16. Doyon J-P, Chauve C, Hamel S (2008) Algorithms for exploring the space of gene tree/species tree reconciliations. In: Nelson CE, Vialette S (eds) Comparative genomics; international workshop, RECOMB-CG 2008. Lecture notes in computer science, vol 5267. Springer, New York, pp 1–13

  17. Doyon J-P, Ranwez V, Daubin V, Berry V (2011) Models, algorithms and programs for phylogeny reconciliation. Brief Bioinform 12:392–400

  18. Page R (1994) Maps between trees and cladistic analysis of historical associations among genes. Syst Biol 43:58–77

  19. Bonizzoni P, Della Vedova G, Dondi R (2005) Reconciling a gene tree to a species tree under the duplication cost model. Theor Comput Sci 347:36–53

  20. Górecki P, Tiuryn J (2006) DLS-trees: a model of evolutionary scenarios. Theor Comput Sci 359:378–399

  21. Guigó R, Muchnik I, Smith TF (1996) Reconstruction of ancient molecular phylogeny. Mol Phylogenet Evol 6:189–213

  22. Page RDM, Charleston MA (1997) From gene to organismal phylogeny: reconciled trees and the gene tree/species tree problem. Mol Phylogenet Evol 7:231–240

  23. Lafond M, Semeria M, Swenson KM, Tannier E, El-Mabrouk N (2013) Gene tree correction guided by orthology. BMC Bioinf 14(S15):S5

  24. Kriventseva EV, Tegenfeldt F, Petty TJ, Waterhouse RM, Simão FA, Pozdnyakov IA, Zdobnov EM (2015) OrthoDB v8: update of the hierarchical catalog of orthologs and the underlying free software. Nucleic Acids Res 43:D250–D256, Database issue

  25. Sonnhammer ELL, Koonin EV (2002) Orthology, paralogy and proposed classification for paralog subtypes. Trends Genet 18:619–620

  26. Doyon JP, Chauve C, Hamel S (2009) Space of gene/species trees reconciliations and parsimonious models. J Comput Biol 16:1399–1418

  27. Page RDM (2000) Extracting species trees from complex gene trees: reconciled trees and vertebrate phylogeny. Mol Phylogenet Evol 14:89–106

  28. Ma B, Li M, Zhang L (2000) From gene trees to species trees. SIAM J Comput 30:729–752

  29. Arvestad L, Berglund AC, Lagergren J, Sennblad B (2003) Bayesian gene/species tree reconciliation and orthology analysis using MCMC. Bioinformatics 19:i7–i15

  30. Arvestad L, Lagergren J, Sennblad B (2009) The gene evolution model and computing its associated probabilities. J ACM 56:1–44

  31. Górecki P, Burleigh GJ, Eulenstein O (2011) Maximum likelihood models and algorithms for gene tree evolution with duplications and losses. BMC Bioinf 12:S15

  32. Böcker S, Dress AWM (1998) Recovering symbolically dated, rooted trees from symbolic ultrametrics. Adv Math 138:105–125

  33. Hellmuth M, Hernandez-Rosales M, Huber KT, Moulton V, Stadler PF, Wieseke N (2013) Orthology relations, symbolic ultrametrics, and cographs. J Math Biol 66:399–420

  34. Lafond M, El-Mabrouk N (2014) Orthology and paralogy constraints: satisfiability and consistency. BMC Genomics 15(S6):S12

  35. Lafond M, Dondi R, El-Mabrouk N (2016) The link between orthology relations and gene trees: a correction perspective. Algorithms Mol Biol 11:4

  36. Krishnamurthy N, Brown D, Kirshner D, Sjölander K (2006) Phylofacts: an online structural phylogenomic encyclopedia for protein functional and structural classification. Genome Biol 7:R83

  37. Sjölander K, Datta R, Shen Y, Shoffner G (2011) Ortholog identification in the presence of domain architecture rearrangement. Brief Bioinform 12(5):413–422

  38. Pryszcz LP, Huerta-Cepas J, Gabaldon T (2011) MetaPhOrs: orthology and paralogy predictions from multiple phylogenetic evidence using a consistency-based confidence score. Nucleic Acids Res 17(39):e32

  39. Afrasiabi C, Samad B, Dineen D, Meacham C, Sjölander K (2013) Phylofacts fat-cat webserver: ortholog identification and function prediction using fast approximate tree classification. Nucleic Acids Res 41(W1):W242–W248

  40. Huerta-Cepas J, Capella-Gutierrez S, Pryszcz LP, Marcet-Houben M, Gabaldon T (2014) PhylomeDB v4: zooming into the plurality of evolutionary histories of a genome. Nucleic Acids Res 18(42):897–902

  41. Tatusov RL, Koonin EV, Lipman DJ (1997) A genomic perspective on protein families. Science 278:631–637

  42. Wolf YI, Koonin EV (2012) A tight link between orthologs and bidirectional best hits in bacterial and archaeal genomes. Genome Biol Evol 4:1286–1294

  43. Lechner M, Findeiß S, Steiner L, Marz M, Stadler PF, Prohaska SJ (2011) Proteinortho: detection of (co-)orthologs in large-scale analysis. BMC Bioinf 12:124

  44. Roth ACJ, Gonnet GH, Dessimoz C (2008) Algorithm of OMA for large-scale orthology inference. BMC Bioinf 9:518

  45. Dessimoz C, Boeckmann B, Roth ACJ, Gonnet GH (2006) Detecting non-orthology in the COGs database and other approaches grouping orthologs using genome-specific best hits. Nucleic Acids Res 34:3309–3316

  46. Liu Y, Wang J, Guo J, Chen J (2012) Complexity and parameterized algorithms for cograph editing. Theor Comput Sci 461:45–54

  47. Hellmuth M, Fritz A, Wieseke N, Stadler PF (2015) Techniques for the cograph editing problem: module merge is equivalent to edit P4’s (submitted). arXiv 1509.06983v2

  48. Gao Y, Hare DR, Nastos J (2013) The cluster deletion problem for cographs. Discret Math 313:2763–2771

  49. Rahmann S, Wittkop T, Baumbach J, Martin M, Truß A, Böcker S (2007) Exact and heuristic algorithms for weighted cluster editing. In: Proceedings of the 6th LSS conference on computational systems bioinformatics (CSB2007). Life Sciences Society, pp 391–401

  50. Falls C, Powell B, Snoeyink J (2008) Computing high-stringency COGs using Turán-type graphs. Technical Report, University of North Carolina

  51. Nguyen TH, Ranwez V, Pointet S, Chifolleau AMA, Doyon J-P, Berry V (2013) Reconciliation and local gene tree rearrangement can be of mutual profit. Algorithms Mol Biol 8:12

  52. Doyon J-P, Scornavacca C, Gorbunov KY, Szöllősi G, Ranwez V, Berry V (2010) An efficient algorithm for gene/species trees parsimonious reconciliation with losses, duplications and transfers. In: Tannier E (ed) Comparative genomics. Lecture notes in computer science, vol 6398. Springer, Heidelberg, pp 93–108

  53. Wieseke N, Bernt M, Middendorf M (2013) Unifying parsimonious tree reconciliation. In: Darling A, Stoye J (eds) Algorithms in bioinformatics WABI 2013. Lecture notes in computer science, vol 8126. Springer, Heidelberg, pp 200–214

  54. Donati B, Baudet C, Sinaimeri B, Crescenzi P, Sagot M-F (2015) EUCALYPT: efficient tree reconciliation enumerator. Algorithms Mol Biol 10:3

  55. Hallett MT, Lagergren J (2001) Efficient algorithms for lateral gene transfer problems. In Lengauer T (ed) Proceedings of the fifth annual international conference on computational biology (RECOMB). ACM, New York, pp 149–156

  56. Fablet M, Bueno M, Potrzebowski L, Kaessmann H (2009) Evolutionary origin and functions of retrogene introns. Mol Biol Evol 26:2147–2156

  57. Hellmuth M, Stadler PF, Wieseke N (2017) The mathematics of xenology: di-cographs, symbolic ultrametrics, 2-structures and tree-representable systems of binary relations. J Math Biol 75:199–237

  58. Fitch WM (2000) Homology: a personal view on some of the problems. Trends Genet 16:227–231

  59. Jensen RA (2001) Orthologs and paralogs – we need to get it right. Genome Biol 2:8

  60. Kristensen DM, Kannan L, Coleman MK, Wolf YI, Sorokin A, Koonin EV, Mushegian A (2010) A low-polynomial algorithm for assembling clusters of orthologous groups from intergenomic symmetric best matches. Bioinformatics 26:1481–1487

  61. Holm L, Heger A (2014) Automated sequence-based approaches for identifying domain families. In: Orengo CA, Bateman A (eds) Protein Families: relating protein sequence, structure, and function. Wiley series in peptide and protein science. Wiley, New York, pp 3–24

  62. Trachana K, Larsson TA, Powell S, Chen W-H, Doerks T, Muller T, Bork P (2011) Orthology prediction methods: a quality assessment using curated protein families. Bioessays 33(10):769–780

  63. Altenhoff AM, Boeckmann B, Capella-Gutierrez S, Dalquen DA, DeLuca T, Forslund K, Huerta-Cepas J, Linard B, Pereira C, Pryszcz LP, Schreiber F, da Silva F, Szklarczyk D, Train CM, Bork P, Lecompte O, von Mering C, Xenarios I, Sjölander K, Jensen LJ, Martin MJ, Muffato M, Gabaldón T, Lewis SE, Thomas PD, Sonnhammer E, Dessimoz C (2016) Standardized benchmarking in the quest for orthologs. Nat Methods 13(5):425–430

  64. Trachana K, Forslund K, Larsson T, Powell S, Doerks T, Mering C, Bork P (2014) A phylogeny-based benchmarking test for orthology inference reveals the limitations of function-based validation. PLoS One 9:e111122

  65. Li L, Stoeckert CJ Jr, Roos DS (2003) OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res 13:2178–2189

  66. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25:3389–3402

  67. van Dongen S (2000) Graph clustering by flow simulation. PhD Thesis, University of Utrecht, Utrecht

  68. Enright AJ, Van Dongen S, Ouzounis CA (2002) An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res 30:1575–1584

  69. Contreras-Moreira B, Vinuesa P (2013) GET_HOMOLOGUES, a versatile software package for scalable and robust microbial pangenome analysis. Appl Environ Microbiol 79:7696–7701

  70. Galperin MY, Makarova KS, Wolf YI, Koonin EV (2015) Expanded microbial genome coverage and improved protein family annotation in the COG database. Nucleic Acids Res 43:D261–D269

  71. Lechner M, Hernandez-Rosales M, Doerr D, Wieseke N, Thévenin A, Stoye J, Hartmann RK, Prohaska SJ, Stadler PF (2014) Orthology detection combining clustering and synteny for very large datasets. PLoS ONE 9:e105015

  72. Doerr D, Thévenin A, Stoye J (2012) Gene family assignment-free comparative genomics. BMC Bioinf 13(Suppl 19):S3

  73. Hellmuth M, Wieseke N, Lechner M, Lenhof H-P, Middendorf M, Stadler PF (2015) Phylogenetics from paralogs. Proc Natl Acad Sci USA 112:2058–2063

  74. Orengo CA, Bateman A (eds) (2014) Protein Families: relating protein sequence, structure, and function. Wiley series in peptide and protein science. Wiley, New York

  75. The UniProt Consortium (2015) UniProt: a hub for protein information. Nucleic Acids Res 43:D204–D212

  76. Pruitt KD, Tatusova T, Maglott DR (2005) NCBI reference sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res 33(Suppl. 1):D501–D504

  77. Huerta-Cepas J, Szklarczyk D, Forslund K, Cook H, Heller D, Walter MC, Rattei T, Mende DR, Sunagawa S, Kuhn M, Jensen LJ, von Mering C, Bork P (2016) eggNOG 4.5: a hierarchical orthology framework with improved functional annotations for eukaryotic, prokaryotic and viral sequences. Nucleic Acids Res 44(D1):D286–D293

  78. Kanehisa M, Sato Y, Kawashima M, Furumichi M, Tanabe M (2016) KEGG as a reference resource for gene and protein annotation. Nucleic Acids Res 44:D457–D462

  79. Kanehisa M, Goto S, Sato Y, Kawashima M, Furumichi M, Tanabe M (2014) Data, information, knowledge and principle: back to metabolism in KEGG. Nucleic Acids Res 42:D199–D205

  80. Moriya Y, Itoh M, Okuda S, Yoshizawa AC, Kanehisa M (2007) KAAS: an automatic genome annotation and pathway reconstruction server. Nucleic Acids Res 35:W182–W185

  81. Suzuki S, Kakuta M, Ishida T, Akiyama Y (2014) GHOSTX: an improved sequence homology search algorithm using a query suffix array and a database suffix array. PLoS ONE 9(8):e103833

  82. Eddy SR (2004) What is a hidden Markov model? Nat Biotechnol 22:1315–1316

  83. Eddy SR (2011) Accelerated profile HMM searches. PLoS Comput Biol 7:e1002195

  84. Finn RD, Coggill P, Eberhardt RY, Eddy SR, Mistry J, Mitchell AL, Potter SC, Punta M, Qureshi M, Sangrador-Vegas A, Salazar GA, Tate J, Bateman A (2016) The Pfam protein families database: towards a more sustainable future. Nucleic Acids Res 44:D279–D285

  85. Haft DH, Selengut JD, Richter RA, Harkins D, Basu MK, Beck E (2013) TIGRFAMs and genome properties in 2013. Nucleic Acids Res 41:D387–D395; Database issue

  86. Thomas PD, Campbell MJ, Kejariwal A, Mi H, Karlak B, Daverman R, Diemer K, Muruganujan A, Narechania A (2003) PANTHER: a library of protein families and subfamilies indexed by function. Genome Res 13:2129–2141

  87. Mi H, Lazareva-Ulitsky B, Loo R, Kejariwal A, Vandergriff J, Rabkin S, Guo N, Muruganujan A, Doremieux O, Campbell MJ, Kitano H, Thomas PD (2005) The panther database of protein families, subfamilies, functions and pathways. Nucleic Acids Res 33:D284–D288; Database issue

  88. Mi H, Guo N, Kejariwal A, Thomas PD (2007) PANTHER version 6: protein sequence and function evolution data with expanded representation of biological pathways. Nucleic Acids Res 16(35):D247–D252

  89. Mi H, Muruganujan A, Casagrande JT, Thomas PD (2013) Large-scale gene function analysis with the panther classification system. Nat Protoc 8(8):1754–2189

  90. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G (2000) Gene ontology: tool for the unification of biology. Nat Genet 25(1):25–29

  91. Pedruzzi I, Rivoire C, Auchincloss AH, Coudert E, Keller G, de Castro E, Baratin D, Cuche BA, Bougueleret L, Poux S, Redaschi N, Xenarios I, Bridge A (2015) HAMAP in 2015: updates to the protein family classification and annotation system. Nucleic Acids Res 43:D1064–D1070

Acknowledgements

This work was funded in part by the Deutsche Forschungsgemeinschaft (Proj. No. MI439/14-1) (to PFS) and a CNPq fellowship (to JCS).

Corresponding author

Correspondence to João C. Setubal.

Copyright information

© 2018 Springer Science+Business Media LLC

About this protocol

Cite this protocol

Setubal, J.C., Stadler, P.F. (2018). Gene Phylogenies and Orthologous Groups. In: Setubal, J., Stoye, J., Stadler, P. (eds) Comparative Genomics. Methods in Molecular Biology, vol 1704. Humana Press, New York, NY. https://doi.org/10.1007/978-1-4939-7463-4_1

  • DOI: https://doi.org/10.1007/978-1-4939-7463-4_1

  • Publisher Name: Humana Press, New York, NY

  • Print ISBN: 978-1-4939-7461-0

  • Online ISBN: 978-1-4939-7463-4

  • eBook Packages: Springer Protocols
