Abstract
Phylogenetic analysis has become a common step in the characterization of gene and protein sequences. However, despite the availability of numerous affordable and more or less intuitive software tools, constructing biologically relevant, informative phylogenetic trees remains a process with several critical steps that are inherently non-algorithmic, i.e., dependent on decisions made by the user. These steps include, but are not limited to, setting the aims of the phylogenetic study, choosing the sequences to be analyzed, and selecting the methods used to construct the sequence alignment, as well as the algorithms and parameters used to construct the actual phylogenetic tree. This review aims to provide guidance for these decisions and to illustrate common pitfalls and problems that occur during phylogenetic analysis of plant gene sequences.
Abbreviations
- BLAST: basic local alignment search tool
- CD-search: conserved domain search
- COBALT: constraint-based multiple protein alignment tool
- DDBJ: DNA data bank of Japan
- ENA: European nucleotide archive
- INSDC: international nucleotide sequence database collaboration
- MACAW: multiple alignment construction and analysis workbench
- MAFFT: multiple alignment using fast Fourier transform
- MEGA: molecular evolutionary genetics analysis
- ML: maximum likelihood
- MUSCLE: multiple sequence comparison by log-expectation
- NCBI: National Center for Biotechnology Information
- NJ: neighbor-joining
- PAUP: phylogenetic analysis using parsimony
- PHYLIP: phylogeny inference package
- SMART: simple modular architecture research tool
- T-REX: tree and reticulogram reconstruction
Acknowledgments: I thank the many generations of students of my Introduction to Bioinformatics undergraduate course for providing continuous feedback that helped to shape the ideas presented here, Anton Markoš, Vojtěch Žárský and Shigehiro Kuraku for critical reading of this manuscript, and the Ministry of Education of the Czech Republic for financial support from the NPUI LO1417 project.
Cite this article
Cvrčková, F. A plant biologists’ guide to phylogenetic analysis of biological macromolecule sequences. Biol Plant 60, 619–627 (2016). https://doi.org/10.1007/s10535-016-0649-8
DOI: https://doi.org/10.1007/s10535-016-0649-8