Large-Scale Multiple Sequence Alignment and Phylogeny Estimation

Warnow, Tandy

doi:10.1007/978-1-4471-5298-9_6

Chapter

Part of the book series: Computational Biology (COBO, volume 19)

Abstract

With the advent of next-generation sequencing technologies, alignment and phylogeny estimation are being attempted on datasets with thousands of sequences. To address these challenges, new algorithmic approaches have been developed that provide substantial improvements over standard methods. This paper focuses on new approaches for ultra-large tree estimation, including methods for co-estimation of alignments and trees, methods for estimating trees without a full sequence alignment, and phylogenetic placement. While the main focus is on methods with empirical performance advantages, we also discuss the theoretical guarantees of methods under Markov models of evolution. Finally, we include a discussion of the future of large-scale phylogenetic analysis.

Notes

1. The famous quote by Dobzhansky, “Nothing in biology makes sense except in the light of evolution” [1], echoes the lesser-known quote by the Jesuit priest Pierre Teilhard de Chardin [2], who wrote “Evolution is a light which illuminates all facts, a curve that all lines must follow.”
2. Morgan Price, personal communication, 1 May 2013.
3. Alexis Stamatakis, personal communication, 1 May 2013.
4. The p-distance between two aligned sequences is the number of positions at which the two sequences differ, normalized to give a number between 0 and 1.
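
The p-distance defined in the last note can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `p_distance`, the `gap_chars` parameter, and the choice to skip columns containing a gap or missing-data symbol are assumptions made here, not details given in the chapter. The normalization used is division by the number of compared positions.

```python
def p_distance(seq_a: str, seq_b: str, gap_chars: str = "-?") -> float:
    """Fraction of compared alignment columns at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    mismatches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        # Assumption: columns where either sequence has a gap or
        # missing-data symbol are excluded from the comparison.
        if a in gap_chars or b in gap_chars:
            continue
        compared += 1
        if a != b:
            mismatches += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    # Normalizing by the number of compared positions gives a value in [0, 1].
    return mismatches / compared
```

For example, `p_distance("ACGT", "ACGA")` compares four positions and finds one difference. Other gap conventions (for instance, counting a gap against a residue as a difference) would give different values on gapped alignments.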

References

Dobzhansky, T.: Nothing in biology makes sense except in the light of evolution. Am. Biol. Teach. 35, 125–129 (1973)
de Chardin, P.T.: Le Phénomène Humain. Harper Perennial, New York (1959)
Eisen, J.A.: Phylogenomics: improving functional predictions for uncharacterized genes by evolutionary analysis. Genome Res. 8, 163–167 (1998)
Wang, L.-S., Leebens-Mack, J., Wall, K., Beckmann, K., de Pamphilis, C., et al.: The impact of protein multiple sequence alignment on phylogeny estimation. IEEE/ACM Trans. Comput. Biol. Bioinform. 8, 1108–1119 (2011)
Simmons, M., Freudenstein, J.: The effects of increasing genetic distance on alignment of, and tree construction from, rDNA internal transcribed spacer sequences. Mol. Phylogenet. Evol. 26, 444–451 (2003)
Liu, K., Linder, C.R., Warnow, T.: Multiple sequence alignment: a major challenge to large-scale phylogenetics. PLoS Currents: Tree of Life (2010)
Hall, B.G.: Comparison of the accuracies of several phylogenetic methods using protein and DNA sequences. Mol. Biol. Evol. 22, 792–802 (2005)
Kumar, S., Filipski, A.: Multiple sequence alignment: in pursuit of homologous DNA positions. Genome Res. 17, 127–135 (2007)
Ogden, T., Rosenberg, M.: Multiple sequence alignment accuracy and phylogenetic inference. Syst. Biol. 55, 314–328 (2006)
Liu, K., Raghavan, S., Nelesen, S., Linder, C.R., Warnow, T.: Rapid and accurate large-scale coestimation of sequence alignments and phylogenetic trees. Science 324, 1561–1564 (2009)
Morrison, D.: Multiple sequence alignment for phylogenetic purposes. Aust. Syst. Bot. 19, 479–539 (2006)
Graybeal, A.: Is it better to add taxa or characters to a difficult phylogenetic problem? Syst. Biol. 47, 9–17 (1998)
Pollock, D., Zwickl, D., McGuire, J., Hillis, D.: Increased taxon sampling is advantageous for phylogenetic inference. Syst. Biol. 51, 664–671 (2002)
Zwickl, D., Hillis, D.: Increased taxon sampling greatly reduces phylogenetic error. Syst. Biol. 51, 588–598 (2002)
Hillis, D.: Inferring complex phylogenies. Nature 383, 130–131 (1996)
Felsenstein, J.: Inferring Phylogenies. Sinauer Associates, Sunderland (2003)
Kim, J., Warnow, T.: Tutorial on phylogenetic tree estimation. Presented at the ISMB 1999 Conference (1999). Available on-line at http://www.cs.utexas.edu/users/tandy/tutorial.ps
Linder, C.R., Warnow, T.: An overview of phylogeny reconstruction. In: Aluru, S. (ed.) Handbook of Computational Molecular Biology. Chapman and Hall/CRC Computer and Information Science Series, vol. 9. CRC Press, Boca Raton (2005)
Semple, C., Steel, M.: Phylogenetics. Oxford University Press, London (2003)
Hillis, D., Moritz, C., Mable, B. (eds.): Molecular Systematics. Sinauer Associates, Sunderland (1996)
Ortuno, F., Valenzuela, O., Pomares, H., Rojas, F., Florido, J., et al.: Predicting the accuracy of multiple sequence alignment algorithms by using computational intelligent techniques. Nucleic Acids Res. 41 (2013)
Whelan, S., Lin, P., Goldman, N.: Molecular phylogenetics: state-of-the-art methods for looking into the past. Trends Genet. 17, 262–272 (2001)
Goldman, N., Yang, Z.: Introduction: statistical and computational challenges in molecular phylogenetics and evolution. Philos. Trans. R. Soc. Lond. B, Biol. Sci. 363, 3889–3892 (2008)
Kemena, C., Notredame, C.: Upcoming challenges for multiple sequence alignment methods in the high-throughput era. Bioinformatics 25, 2455–2465 (2009)
Do, C., Katoh, K.: Protein multiple sequence alignment. In: Methods in Molecular Biology: Functional Proteomics, Methods and Protocols, vol. 484, pp. 379–413. Humana Press, Clifton (2008)
Mokaddem, A., Elloumi, M.: Algorithms for the alignment of biological sequences. In: Elloumi, M., Zomaya, A. (eds.) Algorithms in Computational Molecular Biology. Wiley, New York (2011). doi:10.1002/9780470892107.ch12
Pei, J.: Multiple protein sequence alignment. Curr. Opin. Struct. Biol. 18, 382–386 (2008)
Sievers, F., Wilm, A., Dineen, D., Gibson, T., Karplus, K., et al.: Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol. Syst. Biol. 7 (2011)
Katoh, K., Toh, H.: PartTree: an algorithm to build an approximate tree from a large number of unaligned sequences. Bioinformatics 23(3), 372–374 (2007)
Nelesen, S., Liu, K., Wang, L.S., Linder, C.R., Warnow, T.: DACTAL: divide-and-conquer trees (almost) without alignments. Bioinformatics 28, i274–i282 (2012)
Larkin, M.A., Blackshields, G., Brown, N.P., Chenna, R., McGettigan, P.A., et al.: ClustalW and ClustalX version 2.0. Bioinformatics 23, 2947–2948 (2007)
Lassmann, T., Frings, O., Sonnhammer, E.: Kalign2: high-performance multiple alignment of protein and nucleotide sequences allowing external features. Nucleic Acids Res. 37, 858–865 (2009)
Neuwald, A.: Rapid detection, classification, and accurate alignment of up to a million or more related protein sequences. Bioinformatics 25, 1869–1875 (2009)
Price, M.N., Dehal, P.S., Arkin, A.P.: FastTree-2: approximately maximum-likelihood trees for large alignments. PLoS ONE 5, e9490 (2010). doi:10.1371/journal.pone.0009490
Smith, S., Beaulieu, J., Stamatakis, A., Donoghue, M.: Understanding angiosperm diversification using small and large phylogenetic trees. Am. J. Bot. 98, 404–414 (2011)
Stamatakis, A.: RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics 22, 2688–2690 (2006)
Goloboff, P.A., Catalano, S.A., Mirande, J.M., Szumik, C.A., Arias, J.S., et al.: Phylogenetic analysis of 73,060 taxa corroborates major eukaryotic groups. Cladistics 25, 211–230 (2009)
Goloboff, P., Farris, J., Nixon, K.: TNT, a free program for phylogenetic analysis. Cladistics 24, 774–786 (2008)
Liu, K., Warnow, T., Holder, M., Nelesen, S., Yu, J., et al.: SATé-II: very fast and accurate simultaneous estimation of multiple sequence alignments and phylogenetic trees. Syst. Biol. 61, 90–106 (2011)
Maddison, W.: Gene trees in species trees. Syst. Biol. 46, 523–536 (1997)
Delsuc, F., Brinkmann, H., Philippe, H.: Phylogenomics and the reconstruction of the tree of life. Nat. Rev. Genet. 6, 361–375 (2005)
Edwards, S.V.: Is a new and general theory of molecular systematics emerging? Evolution 63, 1–19 (2009)
Dunn, C.W., Hejnol, A., Matus, D.Q., Pang, K., Browne, W.E., et al.: Broad phylogenomic sampling improves resolution of the animal tree of life. Nature 452, 745–749 (2008)
Wu, D., Hugenholtz, P., Mavromatis, K., Pukall, R., Dalin, E., et al.: A phylogeny-driven genomic encyclopedia of bacteria and archaea. Nature 462, 1056–1060 (2009)
Eisen, J., Fraser, C.: Phylogenomics: intersection of evolution and genomics. Science 300, 1706–1707 (2003)
Bininda-Emonds, O. (ed.): Phylogenetic Supertrees: Combining Information to Reveal the Tree of Life. Kluwer Academic, Dordrecht (2004)
Baum, B., Ragan, M.A.: The MRP method. In: Bininda-Emonds, O.R.P. (ed.) Phylogenetic Supertrees: Combining Information to Reveal the Tree of Life, pp. 17–34. Kluwer Academic, Dordrecht (2004)
Chen, D., Eulenstein, O., Fernández-Baca, D., Sanderson, M.: Minimum-flip supertrees: complexity and algorithms. IEEE/ACM Trans. Comput. Biol. Bioinform. 3, 165–173 (2006)
Bininda-Emonds, O.R.P.: The evolution of supertrees. Trends Ecol. Evol. 19, 315–322 (2004)
Snir, S., Rao, S.: Quartets MaxCut: a divide and conquer quartets algorithm. IEEE/ACM Trans. Comput. Biol. Bioinform. 7, 704–718 (2010)
Steel, M., Rodrigo, A.: Maximum likelihood supertrees. Syst. Biol. 57, 243–250 (2008)
Swenson, M., Suri, R., Linder, C., Warnow, T.: An experimental study of quartets MaxCut and other supertree methods. Algorithms Mol. Biol. 6(1), 7 (2011)
Swenson, M., Suri, R., Linder, C., Warnow, T.: SuperFine: fast and accurate supertree estimation. Syst. Biol. 61, 214–227 (2012)
Nguyen, N., Mirarab, S., Warnow, T.: MRL and SuperFine+MRL: new supertree methods. Algorithms Mol. Biol. 7(3) (2012)
Than, C.V., Nakhleh, L.: Species tree inference by minimizing deep coalescences. PLoS Comput. Biol. 5 (2009)
Boussau, B., Szollosi, G., Duret, L., Gouy, M., Tannier, E., et al.: Genome-scale co-estimation of species and gene trees. Genome Res. 23(2), 323–330 (2013)
Degnan, J.H., Rosenberg, N.A.: Gene tree discordance, phylogenetic inference and the multispecies coalescent. Trends Ecol. Evol. 24, 332–340 (2009)
Chaudhary, R., Bansal, M.S., Wehe, A., Fernández-Baca, D., Eulenstein, O.: IGTP: a software package for large-scale gene tree parsimony analysis. BMC Bioinform. 11, 574 (2010)
Larget, B., Kotha, S.K., Dewey, C.N., Ané, C.: BUCKy: gene tree/species tree reconciliation with the Bayesian concordance analysis. Bioinformatics 26, 2910–2911 (2010)
Yu, Y., Warnow, T., Nakhleh, L.: Algorithms for MDC-based multi-locus phylogeny inference: beyond rooted binary gene trees on single alleles. J. Comput. Biol. 18, 1543–1559 (2011)
Yang, J., Warnow, T.: Fast and accurate methods for phylogenomic analyses. BMC Bioinform. 12(Suppl 9), S4 (2011). doi:10.1186/1471-2105-12-S9-S4
Liu, L., Yu, L., Edwards, S.: A maximum pseudo-likelihood approach for estimating species trees under the coalescent model. BMC Evol. Biol. 10, 302 (2010)
Chauve, C., Doyon, J.P., El-Mabrouk, N.: Gene family evolution by duplication, speciation, and loss. J. Comput. Biol. 15, 1043–1062 (2008)
Hallett, M.T., Lagergren, J.: New algorithms for the duplication-loss model. In: Proceedings RECOMB 2000, pp. 138–146. ACM Press, New York (2000)
Doyon, J.P., Chauve, C.: Branch-and-bound approach for parsimonious inference of a species tree from a set of gene family trees. Adv. Exp. Med. Biol. 696, 287–295 (2011)
Ma, B., Li, M., Zhang, L.: From gene trees to species trees. SIAM J. Comput. 30, 729–752 (2000)
Zhang, L.: From gene trees to species trees II: species tree inference by minimizing deep coalescence events. IEEE/ACM Trans. Comput. Biol. Bioinform. 8, 1685–1691 (2011)
Arvestad, L., Berglund, A.C., Lagergren, J., Sennblad, B.: Gene tree reconstruction and orthology analysis based on an integrated model for duplications and sequence evolution. In: Bininda-Emonds, O. (ed.) Proc. RECOMB 2004, pp. 238–252 (2004)
Sennblad, B., Lagergren, J.: Probabilistic orthology analysis. Syst. Biol. 58, 411–424 (2009)
Edwards, S., Liu, L., Pearl, D.: High-resolution species trees without concatenation. Proc. Natl. Acad. Sci. USA 104, 5936–5941 (2007)
Heled, J., Drummond, A.J.: Bayesian inference of species trees from multilocus data. Mol. Biol. Evol. 27, 570–580 (2010)
Roch, S.: An analytical comparison of multilocus methods under the multispecies coalescent: the three-taxon case. In: Proc. Pacific Symposium on Biocomputing, vol. 18, pp. 297–306 (2013)
Kopelman, N.M., Stone, L., Gascuel, O., Rosenberg, N.A.: The behavior of admixed populations in neighbor-joining inference of population trees. In: Proc. Pacific Symposium on Biocomputing, vol. 18 (2013)
Degnan, J.H.: Evaluating variations on the STAR algorithm for relative efficiency and sample sizes needed to reconstruct species trees. In: Proc. Pacific Symposium on Biocomputing, vol. 18, pp. 262–272 (2013)
Bayzid, M., Mirarab, S., Warnow, T.: Inferring optimal species trees under gene duplication and loss. In: Proc. Pacific Symposium on Biocomputing, vol. 18, pp. 250–261 (2013)
Pei, J., Grishin, N.: PROMALS: towards accurate multiple sequence alignments of distantly related proteins. Bioinformatics 23, 802–808 (2007)
Edgar, R.C., Sjölander, K.: SATCHMO: sequence alignment and tree construction using hidden Markov models. Bioinformatics 19, 1404–1411 (2003)
Hagopian, R., Davidson, J., Datta, R., Jarvis, G., Sjölander, K.: SATCHMO-JS: a webserver for simultaneous protein multiple sequence alignment and phylogenetic tree construction. Nucleic Acids Res. 38(Web Server Issue), W29–W34 (2010)
O’Sullivan, O., Suhre, K., Abergel, C., Higgins, D., Notredame, C.: 3DCoffee: combining protein sequences and structure within multiple sequence alignments. J. Mol. Biol. 340, 385–395 (2004)
Zhou, H., Zhou, Y.: SPEM: improving multiple sequence alignment with sequence profiles and predicted secondary structures. Bioinformatics 21, 3615–3621 (2005)
Deng, X., Cheng, J.: MSACompro: protein multiple sequence alignment using predicted secondary structure, solvent accessibility, and residue-residue contacts. BMC Bioinform. 12, 472 (2011)
Roshan, U., Livesay, D.R.: Probalign: multiple sequence alignment using partition function posterior probabilities. Bioinformatics 22, 2715–2721 (2006)
Roshan, U., Chikkagoudar, S., Livesay, D.R.: Searching for RNA homologs within large genomic sequences using partition function posterior probabilities. BMC Bioinform. 9, 61 (2008)
Do, C., Mahabhashyam, M., Brudno, M., Batzoglou, S.: PROBCONS: probabilistic consistency-based multiple sequence alignment of amino acid sequences. Software available at http://probcons.stanford.edu/download.html (2006)
Nawrocki, E.P., Kolbe, D.L., Eddy, S.R.: Infernal 1.0: inference of RNA alignments. Bioinformatics 25, 1335–1337 (2009)
Nawrocki, E.P.: Structural RNA homology search and alignment using covariance models. Ph.D. thesis, Washington University in Saint Louis, School of Medicine (2009)
Gardner, D., Xu, W., Miranker, D., Ozer, S., Cannonne, J., et al.: An accurate scalable template-based alignment algorithm. In: Proc. International Conference on Bioinformatics and Biomedicine, 2012, pp. 237–243 (2012)
Edgar, R.C.: MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinform. 5, 113 (2004)
Mirarab, S., Warnow, T.: FastSP: linear-time calculation of alignment accuracy. Bioinformatics 27, 3250–3258 (2011)
Blackburne, B., Whelan, S.: Measuring the distance between multiple sequence alignments. Bioinformatics 28, 495–502 (2012)
Stojanovic, N., Florea, L., Riemer, C., Gumucio, D., Slightom, J., et al.: Comparison of five methods for finding conserved sequences in multiple alignments of gene regulatory regions. Nucleic Acids Res. 27, 3899–3910 (1999)
Edgar, R.: Quality measures for protein alignment benchmarks. Nucleic Acids Res. 38(7), 2145–2153 (2010)
Thompson, J.D., Plewniak, F., Poch, O.: A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Res. 27, 2682–2690 (1999)
Thompson, J., Plewniak, F., Poch, O.: BAliBASE: a benchmark alignments database for the evaluation of multiple sequence alignment programs. Bioinformatics 15, 87–88 (1999)
Raghava, G., Searle, S.M., Audley, P.C., Barber, J.D., Barton, G.J.: Oxbench: a benchmark for evaluation of protein multiple sequence alignment accuracy. BMC Bioinform. 4, 47 (2003)
Gardner, P., Wilm, A., Washietl, S.: A benchmark of multiple sequence alignment programs upon structural RNAs. Nucleic Acids Res. 33, 2433–2439 (2005)
Walle, I.L.V., Wyns, L.: SABmark-a benchmark for sequence alignment that covers the entire known fold space. Bioinformatics 21, 1267–1268 (2005)
Carroll, H., Beckstead, W., O’Connor, T., Ebbert, M., Clement, M., et al.: DNA reference alignment benchmarks based on tertiary structure of encoded proteins. Bioinformatics 23, 2648–2649 (2007)
Blazewicz, J., Formanowicz, P., Wojciechowski, P.: Some remarks on evaluating the quality of the multiple sequence alignment based on the BAliBASE benchmark. Int. J. Appl. Math. Comput. Sci. 19, 675–678 (2009)
Iantorno, S., Gori, K., Goldman, N., Gil, M., Dessimoz, C.: Who watches the watchmen? An appraisal of benchmarks for multiple sequence alignment. arXiv:1211.2160 [q-bio.QM] (2012)
Aniba, M., Poch, O., Thompson, J.D.: Issues in bioinformatics benchmarking: the case study of multiple sequence alignment. Nucleic Acids Res. 38, 7353–7363 (2010)
Morrison, D.A.: Why would phylogeneticists ignore computerized sequence alignment? Syst. Biol. 58, 150–158 (2009)
Reeck, G., de Haen, C., Teller, D., Doolittle, R., Fitch, W., et al.: “Homology” in proteins and nucleic acids: a terminology muddle and a way out of it. Cell 50, 667 (1987)
Galperin, M., Koonin, E.: Divergence and convergence in enzyme evolution. J. Biol. Chem. 287, 21–28 (2012)
Sjölander, K.: Getting started in structural phylogenomics. PLoS Comput. Biol. 6, e1000621 (2010)
Katoh, K., Kuma, K., Miyata, T., Toh, H.: Improvement in the accuracy of multiple sequence alignment program MAFFT. Genome Inf. 16, 22–33 (2005)
Do, C., Mahabhashyam, M., Brudno, M., Batzoglou, S.: PROBCONS: probabilistic consistency-based multiple sequence alignment. Genome Res. 15, 330–340 (2005)
Löytynoja, A., Goldman, N.: An algorithm for progressive multiple alignment of sequences with insertions. Proc. Natl. Acad. Sci. USA 102, 10557–10562 (2005)
Nelesen, S., Liu, K., Zhao, D., Linder, C.R., Warnow, T.: The effect of the guide tree on multiple sequence alignments and subsequent phylogenetic analyses. In: Proc. Pacific Symposium on Biocomputing, vol. 13, pp. 15–24 (2008)
Fletcher, W., Yang, Z.: The effect of insertions, deletions, and alignment errors on the branch-site test of positive selection. Mol. Biol. Evol. 27, 2257–2267 (2010)
Penn, O., Privman, E., Landan, G., Graur, D., Pupko, T.: An alignment confidence score capturing robustness to guide tree uncertainty. Mol. Biol. Evol. 27, 1759–1767 (2010)
Toth, A., Hausknecht, A., Krisai-Greilhuber, I., Papp, T., Vagvolgyi, C., et al.: Iteratively refined guide trees help improving alignment and phylogenetic inference in the mushroom family Bolbitiaceae. PLoS ONE 8, e56143 (2013)
Capella-Gutiérrez, S., Gabaldón, T.: Measuring guide-tree dependency of inferred gaps for progressive aligners. Bioinformatics 29(8), 1011–1017 (2013)
Pruesse, E., Quast, C., Knittel, K., Fuchs, B., Ludwig, W., et al.: SILVA: a comprehensive online resource for quality checked and aligned ribosomal RNA sequence data compatible with ARB. Nucleic Acids Res. 35, 7188–7196 (2007)
DeSantis, T., Hugenholtz, P., Keller, K., Brodie, E., Larsen, N., et al.: NAST: a multiple sequence alignment server for comparative analysis of 16S rRNA genes. Nucleic Acids Res. 34, W394–W399 (2006)
Löytynoja, A., Vilella, A.J., Goldman, N.: Accurate extension of multiple sequence alignments using a phylogeny-aware graph algorithm. Bioinformatics 28, 1685–1691 (2012)
Papadopoulos, J.S., Agarwala, R.: COBALT: constraint-based alignment tool for multiple protein sequences. Bioinformatics 23, 1073–1079 (2007)
Berger, S.A., Stamatakis, A.: Aligning short reads to reference alignments and trees. Bioinformatics 27, 2068–2075 (2011)
Sievers, F., Dineen, D., Wilm, A., Higgins, D.G.: Making automated multiple alignments of very large numbers of protein sequences. Bioinformatics 29(8), 989–995 (2013)
Smith, S., Beaulieu, J., Donoghue, M.: Mega-phylogeny approach for comparative biology: an alternative to supertree and supermatrix approaches. BMC Evol. Biol. 9, 37 (2009)
Saitou, N., Nei, M.: The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4, 406–425 (1987)
Roquet, C., Thuiller, W., Lavergne, S.: Building megaphylogenies for macroecology: taking up the challenge. Ecography 36, 013–026 (2013)
Steel, M.A.: Recovering a tree from the leaf colourations it generates under a Markov model. Appl. Math. Lett. 7, 19–24 (1994)
Evans, S., Warnow, T.: Unidentifiable divergence times in rates-across-sites models. IEEE/ACM Trans. Comput. Biol. Bioinform. 1, 130–134 (2005)
Tavaré, S.: Some probabilistic and statistical problems in the analysis of DNA sequences. In: Lectures on Mathematics in the Life Sciences, vol. 17, pp. 57–86 (1986)
Dayhoff, M., Schwartz, R., Orcutt, B.: A model of evolutionary change in proteins. In: Dayhoff, M. (ed.) Atlas of Protein Sequence and Structure. National Biomedical Research Foundation, pp. 345–352 (1978)
Lakner, C., Holder, M., Goldman, N., Naylor, G.: What’s in a likelihood? Simple models of protein evolution and the contribution of structurally viable reconstructions to the likelihood. Syst. Biol. 60, 161–174 (2011)
Le, S., Gascuel, O.: An improved general amino acid replacement matrix. Mol. Biol. Evol. 25, 1307–1320 (2008)
Whelan, S., Goldman, N.: A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol. Biol. Evol. 18, 691–699 (2001)
Kosiol, C., Goldman, N.: Different versions of the Dayhoff rate matrix. Mol. Biol. Evol. 22, 193–199 (2005)
Thorne, J.: Models of protein sequence evolution and their applications. Curr. Opin. Genet. Dev. 10, 602–605 (2000)
Thorne, J., Goldman, N.: Probabilistic models for the study of protein evolution. In: Balding, D., Bishop, M., Cannings, C. (eds.) Handbook of Statistical Genetics, pp. 209–226. Wiley, New York (2003)
Adachi, J., Hasegawa, M.: Model of amino acid substitution in proteins encoded by mitochondrial DNA. J. Mol. Evol. 42, 459–468 (1996)
Goldman, N., Yang, Z.: A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol. Biol. Evol. 11, 725–736 (1994)
Scherrer, M., Meyer, A., Wilke, C.: Modeling coding-sequence evolution within the context of residue solvent accessibility. BMC Evol. Biol. 12, 179 (2012)
Mayrose, I., Doron-Faigenbom, A., Bacharach, E., Pupko, T.: Towards realistic codon models: among site variability and dependency of synonymous and non-synonymous rates. Bioinformatics 23, i319–i327 (2007)
Abascal, F., Zardoya, R., Posada, D.: ProtTest: selection of best-fit models of protein evolution. Bioinformatics 21, 2104–2105 (2005)
Wilke, C.: Bringing molecules back into molecular evolution. PLoS Comput. Biol. 8, e1002572 (2012)
Liberles, D., Teichmann, S., et al.: The interface of protein structure, protein biophysics, and molecular evolution. Protein Sci. 21, 769–785 (2012)
Lopez, P., Casane, D., Philippe, H.: Heterotachy, an important process of protein evolution. Mol. Biol. Evol. 19, 1–7 (2002)
Whelan, S.: Spatial and temporal heterogeneity in nucleotide sequence evolution. Mol. Biol. Evol. 25, 1683–1694 (2008)
Tuffley, C., Steel, M.: Links between maximum likelihood and maximum parsimony under a simple model of site substitution. Bull. Math. Biol. 59, 581–607 (1997)
Steel, M.A.: Can we avoid ‘SIN’ in the house of ‘No Common Mechanism’? Syst. Biol. 60, 96–109 (2011)
Lobkovsky, A., Wolf, Y., Koonin, E.: Gene frequency distributions reject a neutral model of genome evolution. Genome Biol. Evol. 5, 233–242 (2013)
Galtier, N., Gouy, M.: Inferring pattern and process: maximum-likelihood implementation of a nonhomogeneous model of DNA sequence evolution for phylogenetic analysis. Mol. Biol. Evol. 15, 871–879 (1998)
Foulds, L.R., Graham, R.L.: The Steiner problem in phylogeny is NP-complete. Adv. Appl. Math. 3, 43–49 (1982)
Felsenstein, J.: Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17, 368–376 (1981)
Allman, E.S., Ané, C., Rhodes, J.: Identifiability of a Markovian model of molecular evolution with gamma-distributed rates. Adv. Appl. Probab. 40, 229–249 (2008)
Allman, E.S., Rhodes, J.: Identifying evolutionary trees and substitution parameters for the general Markov model with invariable sites. Math. Biosci. 211, 18–33 (2008)
Allman, E.S., Rhodes, J.A.: The identifiability of tree topology for phylogenetic models, including covariant and mixture models. J. Comput. Biol. 13, 1101–1113 (2006)
Atteson, K.: The performance of neighbor-joining methods of phylogenetic reconstruction. Algorithmica 25, 251–278 (1999)
Chang, J.: Full reconstruction of Markov models on evolutionary trees: identifiability and consistency. Math. Biosci. 137, 51–73 (1996)
Steel, M.A.: Consistency of Bayesian inference of resolved phylogenetic trees. arXiv:1001.2864 [q-bio.PE] (2010)
Felsenstein, J.: Cases in which parsimony or compatibility methods will be positively misleading. Syst. Zool. 27, 401–410 (1978)
Chang, J.T.: Inconsistency of evolutionary tree topology reconstruction methods when substitution rates vary across characters. Math. Biosci. 134, 189–215 (1996)
Matsen, F., Steel, M.: Phylogenetic mixtures on a single tree can mimic a tree of another topology. Syst. Biol. 56, 767–775 (2007)
Allman, E., Rhodes, J., Sullivant, S.: When do phylogenetic mixture models mimic other phylogenetic models? Syst. Biol. 61, 1049–1059 (2012)
Erdős, P., Steel, M., Székely, L., Warnow, T.: Local quartet splits of a binary tree infer all quartet splits via one dyadic inference rule. Comput. Artif. Intell. 16, 217–227 (1997)
Erdős, P., Steel, M., Székely, L., Warnow, T.: A few logs suffice to build (almost) all trees (i). Random Struct. Algorithms 14, 153–184 (1999)
Erdős, P., Steel, M., Székely, L., Warnow, T.: A few logs suffice to build (almost) all trees (ii). Theor. Comput. Sci. 221, 77–118 (1999)
Lacey, M.R., Chang, J.T.: A signal-to-noise analysis of phylogeny estimation by neighbor-joining: insufficiency of polynomial length sequences. Math. Biosci. 199, 188–215 (2006)
Csűrös, M., Kao, M.Y.: Recovering evolutionary trees through harmonic greedy triplets. Proc. SODA 99, 261–270 (1999)
Csűrös, M.: Fast recovery of evolutionary trees with thousands of nodes. J. Comput. Biol. 9, 277–297 (2002)
Huson, D., Nettles, S., Warnow, T.: Disk-covering, a fast converging method for phylogenetic tree reconstruction. J. Comput. Biol. 6, 369–386 (1999)
Steel, M.A., Székely, L.A.: Inverting random functions. Ann. Comb. 3, 103–113 (1999)
Steel, M.A., Székely, L.A.: Inverting random functions—II: explicit bounds for discrete maximum likelihood estimation, with applications. SIAM J. Discrete Math. 15, 562–575 (2002)
King, V., Zhang, L., Zhou, Y.: On the complexity of distance-based evolutionary tree reconstruction. In: SODA: Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, pp. 444–453 (2003)
Mossel, E., Roch, S.: Learning nonsingular phylogenies and hidden Markov models. In: Proc. 37th Symp. on the Theory of Computing (STOC’05), pp. 366–376 (2005)
Mossel, E., Roch, S.: Learning nonsingular phylogenies and hidden Markov models. Ann. Appl. Probab. 16, 583–614 (2006)
Daskalakis, C., Mossel, E., Roch, S.: Optimal phylogenetic reconstruction. In: STOC’06: Proceedings of the 38th Annual ACM Symposium on Theory of Computing, pp. 159–168 (2006)
Daskalakis, C., Hill, C., Jaffe, A., Mihaescu, R., Mossel, E., et al.: Maximal accurate forests from distance matrices. In: RECOMB, pp. 281–295 (2006)
Mossel, E.: Distorted metrics on trees and phylogenetic forests. IEEE/ACM Trans. Comput. Biol. Bioinform. 4, 108–116 (2007)
Gronau, I., Moran, S., Snir, S.: Fast and reliable reconstruction of phylogenetic trees with very short edges. In: SODA (ACM/SIAM Symp. Disc. Alg), pp. 379–388 (2008)
Roch, S.: Sequence-length requirement for distance-based phylogeny reconstruction: breaking the polynomial barrier. In: FOCS (Foundations of Computer Science), pp. 729–738 (2008)
Daskalakis, C., Mossel, E., Roch, S.: Phylogenies without branch bounds: contracting the short, pruning the deep. In: RECOMB, pp. 451–465 (2009)
Lin, Y., Rajan, V., Moret, B.: A metric for phylogenetic trees based on matching. IEEE/ACM Trans. Comput. Biol. Bioinform. 9, 1014–1022 (2012)
Rannala, B., Huelsenbeck, J., Yang, Z., Nielsen, R.: Taxon sampling and the accuracy of large phylogenies. Syst. Biol. 47, 702–710 (1998)
Robinson, D., Foulds, L.: Comparison of phylogenetic trees. Math. Biosci. 53, 131–147 (1981)
Huelsenbeck, J., Hillis, D.: Success of phylogenetic methods in the four-taxon case. Syst. Biol. 42, 247–265 (1993)
Hillis, D.: Taxonomic sampling, phylogenetic accuracy, and investigator bias. Syst. Biol. 47, 3–8 (1998)
Nakhleh, L., Moret, B., Roshan, U., St John, K., Sun, J., et al.: The accuracy of fast phylogenetic methods for large datasets. In: Proc. 7th Pacific Symposium on BioComputing, pp. 211–222. World Scientific, Singapore (2002)
Zwickl, D.J., Hillis, D.M.: Increased taxon sampling greatly reduces phylogenetic error. Syst. Biol. 51, 588–598 (2002)
Pollock, D.D., Zwickl, D.J., McGuire, J.A., Hillis, D.M.: Increased taxon sampling is advantageous for phylogenetic inference. Syst. Biol. 51, 664–671 (2002)
Wiens, J.: Missing data and the design of phylogenetic analyses. J. Biomed. Inform. 39, 36–42 (2006)
Lemmon, A., Brown, J., Stanger-Hall, K., Lemmon, E.: The effect of ambiguous data on phylogenetic estimates obtained by maximum-likelihood and Bayesian inference. Syst. Biol. 58, 130–145 (2009)
Wiens, J., Morrill, M.: Missing data in phylogenetic analysis: reconciling results from simulations and empirical data. Syst. Biol. 60, 719–731 (2011)
Simmons, M.: Misleading results of likelihood-based phylogenetic analyses in the presence of missing data. Cladistics 28, 208–222 (2012)
Moret, B., Roshan, U., Warnow, T.: Sequence-length requirements for phylogenetic methods. In: Guigo, R., Gusfield, D. (eds.) Proc. 2nd International Workshop on Algorithms in Bioinformatics. Lecture Notes in Computer Science, vol. 2452, pp. 343–356. Springer, Berlin (2002)
Gascuel, O.: BIONJ: an improved version of the NJ algorithm based on a simple model of sequence data. Mol. Biol. Evol. 14, 685–695 (1997)
Bruno, W.J., Socci, N.D., Halpern, A.L.: Weighted neighbor joining: a likelihood-based approach to distance-based phylogeny reconstruction. Mol. Biol. Evol. 17, 189–197 (2000)
Wheeler, T.: Large-scale neighbor-joining with NINJA. In: Proc. Workshop Algorithms in Bioinformatics (WABI), vol. 5724, pp. 375–389 (2009)
Desper, R., Gascuel, O.: Fast and accurate phylogeny reconstruction algorithm based on the minimum-evolution principle. J. Comput. Biol. 9, 687–705 (2002)
Price, M., Dehal, P., Arkin, A.: FastTree: computing large minimum evolution trees with profiles instead of a distance matrix. Mol. Biol. Evol. 26(7), 1641–1650 (2009)
Brown, D., Truszkowski, J.: Towards a practical O(n log n) phylogeny algorithm. In: Proc. Workshop Algorithms in Bioinformatics (WABI), pp. 14–25 (2011)
Rice, K., Warnow, T.: Parsimony is hard to beat! In: Jiang, T., Lee, D. (eds.) Proceedings, Third Annual International Conference of Computing and Combinatorics (COCOON), pp. 124–133 (1997)
Hillis, D., Huelsenbeck, J., Swofford, D.: Hobgoblin of phylogenetics. Nature 369, 363–364 (1994)
Swofford, D.: PAUP*: Phylogenetic Analysis Using Parsimony (and Other Methods), Version 4.0. Sinauer Associates, Sunderland (1996)
Roch, S.: A short proof that phylogenetic tree reconstruction by maximum likelihood is hard. IEEE/ACM Trans. Comput. Biol. Bioinform. 3, 92–94 (2006)
Guindon, S., Gascuel, O.: A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst. Biol. 52, 696–704 (2003)
Zwickl, D.: Genetic algorithm approaches for the phylogenetic analysis of large biological sequence datasets under the maximum likelihood criterion. Ph.D. thesis, The University of Texas at Austin (2006)
Liu, K., Linder, C., Warnow, T.: RAxML and FastTree: comparing two methods for large-scale maximum likelihood phylogeny estimation. PLoS ONE 6, e27731 (2011)
Claesson, M.J., Cusack, S., O’Sullivan, O., Greene-Diniz, R., de Weerd, H., et al.: Composition, variability, and temporal stability of the intestinal microbiota of the elderly. Proc. Natl. Acad. Sci. 108, 4586–4591 (2011)
McDonald, D., Price, M.N., Goodrich, J., Nawrocki, E.P., DeSantis, T.Z., et al.: An improved Greengenes taxonomy with explicit ranks for ecological and evolutionary analyses of bacteria and archaea. ISME J. 6, 610–618 (2012)
Boussau, B., Gouy, M.: Efficient likelihood computations with non-reversible models of evolution. Syst. Biol. 55, 756–768 (2006)
Whelan, S., Money, D.: The prevalence of multifurcations in tree-space and their implications for tree-search. Mol. Biol. Evol. 27, 2674–2677 (2010)
Whelan, S., Money, D.: Characterizing the phylogenetic tree-search problem. Syst. Biol. 61, 228–239 (2012)
Ronquist, F., Huelsenbeck, J.: MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics 19, 1572–1574 (2003)
Drummond, A., Rambaut, A.: BEAST: Bayesian evolutionary analysis by sampling trees. BMC Evol. Biol. 7, 214 (2007)
Lartillot, N., Philippe, H.: A Bayesian mixture model for across-site heterogeneities in the amino acid replacement process. Mol. Biol. Evol. 21 (2004)
Foster, P.: Modeling compositional heterogeneity. Syst. Biol. 53, 485–495 (2004)
Pagel, M., Meade, A.: A phylogenetic mixture model for detecting pattern heterogeneity in gene sequence or character state data. Syst. Biol. 53, 571–581 (2004)
Huelsenbeck, J., Ronquist, F.: MrBayes: Bayesian inference of phylogeny. Bioinformatics 17, 754–755 (2001)
Ronquist, F., Deans, A.: Bayesian phylogenetics and its influence on insect systematics. Annu. Rev. Entomol. 55, 189–206 (2010)
Huelsenbeck, J.P., Ronquist, F., Nielsen, R., Bollback, J.P.: Bayesian inference of phylogeny and its impact on evolutionary biology. Science 294, 2310–2314 (2001)
Holder, M., Lewis, P.: Phylogeny estimation: traditional and Bayesian approaches. Nat. Rev. Genet. 4, 275–284 (2003)
Lewis, P., Holder, M., Holsinger, K.: Polytomies and Bayesian phylogenetic inference. Syst. Biol. 54, 241–253 (2005)
Ganapathy, G., Ramachandran, V., Warnow, T.: On contract-and-refine-transformations between phylogenetic trees. In: ACM/SIAM Symposium on Discrete Algorithms (SODA’04), pp. 893–902. SIAM Press, Philadelphia (2004)
Ganapathy, G., Ramachandran, V., Warnow, T.: Better hill-climbing searches for parsimony. In: Proceedings of the Third International Workshop on Algorithms in Bioinformatics (WABI), pp. 245–258 (2003)
Bonet, M., Steel, M., Warnow, T., Yooseph, S.: Faster algorithms for solving parsimony and compatibility. J. Comput. Biol. 5, 409–422 (1999)
Nixon, K.C.: The parsimony ratchet, a new method for rapid parsimony analysis. Cladistics 15, 407–414 (1999)
Vos, R.: Accelerated likelihood surface exploration: the likelihood ratchet. Syst. Biol. 52, 368–373 (2003)
Warnow, T., Moret, B.M.E., St John, K.: Absolute phylogeny: true trees from short sequences. In: Proc. 12th Ann. ACM/SIAM Symp. on Discr. Algs., SODA01, pp. 186–195. SIAM Press, Philadelphia (2001)
Nakhleh, L., Roshan, U., St John, K., Sun, J., Warnow, T.: Designing fast converging phylogenetic methods. Bioinformatics 17, 190–198 (2001)
Warnow, T.: Large-scale phylogenetic reconstruction. In: Aluru, S. (ed.) Handbook of Computational Molecular Biology. Chapman and Hall/CRC Computer and Information Science Series, vol. 9. CRC Press, Boca Raton (2005)
Roshan, U., Moret, B., Williams, T., Warnow, T.: Rec-I-DCM3: a fast algorithmic technique for reconstructing large phylogenetic trees. In: Proc. 3rd Computational Systems Biology Conf. (CSB’05). Proceedings of the IEEE, pp. 98–109 (2004)
Steel, M.: The maximum likelihood point for a phylogenetic tree is not unique. Syst. Biol. 43, 560–564 (1994)
Blair, C., Murphy, R.: Recent trends in molecular phylogenetic analysis: where to next? J. Heredity 102, 130–138 (2011)
Nagy, L., Kocsube, S., Csanadi, Z., Kovacs, G., Petkovits, T., et al.: Re-mind the gap! Insertion and deletion data reveal neglected phylogenetic potential of the nuclear ribosomal internal transcribed spacer (ITS) of fungi. PLoS ONE 7, e49794 (2012)
Barriel, V.: Molecular phylogenies and nucleotide insertion-deletions. C. R. Acad. Sci. III 317(7), 693–701 (1994)
Young, N., Healy, J.: GapCoder automates the use of indel characters in phylogenetic analysis. BMC Bioinform. 4 (2003)
Muller, K.: Incorporating information from length-mutational events into phylogenetic analysis. Mol. Phylogenet. Evol. 38, 667–676 (2006)
Ogden, T., Rosenberg, M.: How should gaps be treated in parsimony? A comparison of approaches using simulation. Mol. Phylogenet. Evol. 42, 817–826 (2007)
Dwivedi, B., Gadagkar, S.: Phylogenetic inference under varying proportions of indel-induced alignment gaps. BMC Evol. Biol. 9, 211 (2009)
Dessimoz, C., Gil, M.: Phylogenetic assessment of alignments reveals neglected tree signal in gaps. Genome Biol. 11, R37 (2010)
Yuri, T., Kimball, R.T., Harshman, J., Bowie, R.C.K., Braun, M.J., et al.: Parsimony and model-based analyses of indels in avian nuclear genes reveal congruent and incongruent phylogenetic signals. Biology 2, 419–444 (2013)
Warnow, T.: Standard maximum likelihood analyses of alignments with gaps can be statistically inconsistent. PLoS Currents Tree of Life (2012)
Daskalakis, C., Roch, S.: Alignment-free phylogenetic reconstruction. In: Berger, B. (ed.) Proc. RECOMB 2010. Lecture Notes in Computer Science, vol. 6044, pp. 123–137. Springer, Berlin (2010). http://dx.doi.org/10.1007/978-3-642-12683-3_9
Thatte, B.: Invertibility of the TKF model of sequence evolution. Math. Biosci. 200, 58–75 (2006)
Hartmann, S., Vision, T.: Using ESTs for phylogenomics: can one accurately infer a phylogenetic tree from a gappy alignment? BMC Evol. Biol. 8, 95 (2008)
Mirarab, S., Nguyen, N., Warnow, T.: SEPP: SATé-enabled phylogenetic placement. In: Pacific Symposium on Biocomputing, pp. 247–258 (2012)
Matsen, F.A., Kodner, R.B., Armbrust, E.V.: pplacer: linear time maximum-likelihood and Bayesian phylogenetic placement of sequences onto a fixed reference tree. BMC Bioinform. 11, 538 (2010)
Berger, S.A., Krompass, D., Stamatakis, A.: Performance, accuracy, and web server for evolutionary placement of short sequence reads under maximum likelihood. Syst. Biol. 60, 291–302 (2011)
Eddy, S.: A new generation of homology search tools based on probabilistic inference. Genome Inform. 23, 205–211 (2009)
Finn, R., Clements, J., Eddy, S.: HMMER web server: interactive sequence similarity searching. Nucleic Acids Res. 39, W29–W37 (2011)
Brown, D.G., Truszkowski, J.: LSHPlace: fast phylogenetic placement using locality-sensitive hashing. In: Pacific Symposium on Biocomputing, vol. 18, pp. 310–319 (2013)
Stark, M., Berger, S., Stamatakis, A., von Mering, C.: MLTreeMap—accurate maximum likelihood placement of environmental DNA sequences into taxonomic and functional reference phylogenies. BMC Genomics 11, 461 (2010)
Droge, J., McHardy, A.: Taxonomic binning of metagenome samples generated by next-generation sequencing technologies. Brief. Bioinform. (2012)
Giribet, G.: Exploring the behavior of POY, a program for direct optimization of molecular data. Cladistics 17, S60–S70 (2001)
Hartigan, J.: Minimum mutation fits to a given tree. Biometrics 29, 53–65 (1973)
Sankoff, D.: Minimal mutation trees of sequences. SIAM J. Appl. Math. 28, 35–42 (1975)
Sankoff, D., Cedergren, R.J.: Simultaneous comparison of three or more sequences related by a tree. In: Sankoff, D., Kruskal, J.B. (eds.) Time Warps, String Edits and Macromolecules: The Theory and Practice of Sequence Comparison, pp. 253–263. Addison Wesley, New York (1993)
Wang, L., Jiang, T.: On the complexity of multiple sequence alignment. J. Comput. Biol. 1, 337–348 (1994)
Wang, L., Jiang, T., Lawler, E.: Approximation algorithms for tree alignment with a given phylogeny. Algorithmica 16, 302–315 (1996)
Wang, L., Gusfield, D.: Improved approximation algorithms for tree alignment. J. Algorithms 25(2), 255–273 (1997)
Wang, L., Jiang, T., Gusfield, D.: A more efficient approximation scheme for tree alignment. SIAM J. Comput. 30(1), 283–299 (2000)
Liu, K., Warnow, T.: Treelength optimization for phylogeny estimation. PLoS ONE 7, e33104 (2012)
Varón, A., Vinh, L., Bomash, I., Wheeler, W.: POY software. Documentation by Varon, A., Vinh, L.S., Bomash, I., Wheeler, W., Pickett, K., Temkin, I., Faivovich, J., Grant, T., Smith, W.L. Available for download at http://research.amnh.org/scicomp/projects/poy.php (2007)
Kjer, K., Gillespie, J., Ober, K.: Opinions on multiple sequence alignment, and an empirical comparison on repeatability and accuracy between POY and structural alignment. Syst. Biol. 56, 133–146 (2007)
Ogden, T.H., Rosenberg, M.: Alignment and topological accuracy of the direct optimization approach via POY and traditional phylogenetics via ClustalW+PAUP*. Syst. Biol. 56, 182–193 (2007)
Yoshizawa, K.: Direct optimization overly optimizes data. Syst. Entomol. 35, 199–206 (2010)
Wheeler, W., Giribet, G.: Phylogenetic hypotheses and the utility of multiple sequence alignment. In: Rosenberg, M. (ed.) Sequence Alignment: Methods, Models, Concepts and Strategies, pp. 95–104. University of California Press, Berkeley (2009)
Lehtonen, S.: Phylogeny estimation and alignment via POY versus Clustal + PAUP*: a response to Ogden and Rosenberg. Syst. Biol. 57, 653–657 (2008)
Liu, K., Nelesen, S., Raghavan, S., Linder, C., Warnow, T.: Barking up the wrong treelength: the impact of gap penalty on alignment and tree accuracy. IEEE/ACM Trans. Comput. Biol. Bioinform. 6, 7–21 (2009)
Gu, X., Li, W.H.: The size distribution of insertions and deletions in human and rodent pseudogenes suggests the logarithmic gap penalty for sequence alignment. J. Mol. Evol. 40, 464–473 (1995)
Altschul, S.F.: Generalized affine gap costs for protein sequence alignment. Proteins, Struct. Funct. Genomics 32, 88–96 (1998)
Gill, O., Zhou, Y., Mishra, B.: Aligning sequences with non-affine gap penalty: PLAINS algorithm, a practical implementation, and its biological applications in comparative genomics. In: Proc. ICBA 2004 (2004)
Qian, B., Goldstein, R.: Distribution of indel lengths. Proteins 45, 102–104 (2001)
Chang, M., Benner, S.: Empirical analysis of protein insertions and deletions determining parameters for the correct placement of gaps in protein sequence alignments. J. Mol. Biol. 341, 617–631 (2004)
Thorne, J.L., Kishino, H., Felsenstein, J.: An evolutionary model for maximum likelihood alignment of DNA sequences. J. Mol. Evol. 33, 114–124 (1991)
Thorne, J.L., Kishino, H., Felsenstein, J.: Inching toward reality: an improved likelihood model of sequence evolution. J. Mol. Evol. 34, 3–16 (1992)
Thorne, J.L., Kishino, H., Felsenstein, J.: Erratum, an evolutionary model for maximum likelihood alignment of DNA sequences. J. Mol. Evol. 34, 91–92 (1992)
Rivas, E.: Evolutionary models for insertions and deletions in a probabilistic modeling framework. BMC Bioinform. 6, 30 (2005)
Rivas, E., Eddy, S.: Probabilistic phylogenetic inference with insertions and deletions. PLoS Comput. Biol. 4, e1000172 (2008)
Holmes, I., Bruno, W.J.: Evolutionary HMMs: a Bayesian approach to multiple alignment. Bioinformatics 17, 803–820 (2001)
Miklós, I., Lunter, G.A., Holmes, I.: A “long indel model” for evolutionary sequence alignment. Mol. Biol. Evol. 21, 529–540 (2004)
Redelings, B., Suchard, M.: Joint Bayesian estimation of alignment and phylogeny. Syst. Biol. 54, 401–418 (2005)
Suchard, M.A., Redelings, B.D.: BAli-Phy: simultaneous Bayesian inference of alignment and phylogeny. Bioinformatics 22, 2047–2048 (2006)
Redelings, B., Suchard, M.: Incorporating indel information into phylogeny estimation for rapidly emerging pathogens. BMC Evol. Biol. 7, 40 (2007)
Fleissner, R., Metzler, D., von Haeseler, A.: Simultaneous statistical multiple alignment and phylogeny reconstruction. Syst. Biol. 54, 548–561 (2005)
Novák, A., Miklós, I., Lyngso, R., Hein, J.: StatAlign: an extendable software package for joint Bayesian estimation of alignments and evolutionary trees. Bioinformatics 24, 2403–2404 (2008)
Lunter, G.A., Miklós, I., Song, Y.S., Hein, J.: An efficient algorithm for statistical multiple alignment on arbitrary phylogenetic trees. J. Comput. Biol. 10, 869–889 (2003)
Lunter, G., Miklós, I., Drummond, A., Jensen, J.L., Hein, J.: Bayesian phylogenetic inference under a statistical indel model. In: Benson, G., Page, R. (eds.) Third International Workshop (WABI 2003). Lecture Notes in Bioinformatics vol. 2812, pp. 228–244. Springer, Berlin (2003)
Lunter, G., Drummond, A., Miklós, I., Hein, J.: Statistical alignment: recent progress, new applications, and challenges. In: Nielsen, R. (ed.) Statistical Methods in Molecular Evolution (Statistics for Biology and Health), pp. 375–406. Springer, Berlin (2005)
Metzler, D.: Statistical alignment based on fragment insertion and deletion models. Bioinformatics 19, 490–499 (2003)
Miklós, I.: Algorithm for statistical alignment of sequences derived from a Poisson sequence length distribution. Discrete Appl. Math. 127, 79–84 (2003)
Arunapuram, P., Edvardsson, I., Golden, M., Anderson, J., Novak, A., et al.: StatAlign 2.0: combining statistical alignment with RNA secondary structure prediction. Bioinformatics 29(5), 654–655 (2013)
Lunter, G., Miklós, I., Drummond, A., Jensen, J.L., Hein, J.: Bayesian coestimation of phylogeny and sequence alignment. BMC Bioinform. 6, 83 (2005)
Bouchard-Côté, A., Jordan, M.I.: Evolutionary inference via the Poisson indel process. Proc. Natl. Acad. Sci. 110, 1160–1166 (2013)
Brown, D., Krishnamurthy, N., Sjolander, K.: Automated protein subfamily identification and classification. PLoS Comput. Biol. 3, e160 (2007)
Vinga, S., Almeida, J.: Alignment-free sequence comparison—a review. Bioinformatics 19, 513–523 (2003)
Chan, C., Ragan, M.: Next-generation phylogenomics. Biol. Direct 8 (2013)
Blaisdell, B.: A measure of the similarity of sets of sequences not requiring sequence alignment. Proc. Natl. Acad. Sci. USA 83, 5155–5159 (1986)
Sims, G., Jun, S.R., Wu, G., Kim, S.H.: Alignment-free genome comparison with feature frequency profiles (FFP) and optimal resolutions. Proc. Natl. Acad. Sci. USA 106, 2677–2682 (2009)
Jun, S.R., Sims, G., Wu, G., Kim, S.H.: Whole-proteome phylogeny of prokaryotes by feature frequency profiles: an alignment-free method with optimal feature resolution. Proc. Natl. Acad. Sci. USA 107, 133–138 (2010)
Liu, X., Wan, L., Li, J., Reinert, G., Waterman, M., et al.: New powerful statistics for alignment-free sequence comparison under a pattern transfer model. J. Theor. Biol. 284, 106–116 (2011)
Yang, K., Zhang, L.: Performance comparison between k-tuple distance and four model-based distances in phylogenetic tree reconstruction. Nucleic Acids Res. 36, e33 (2008)
Roshan, U., Moret, B.M.E., Williams, T.L., Warnow, T.: Performance of supertree methods on various dataset decompositions. In: Bininda-Emonds, O.R.P. (ed.) Phylogenetic Supertrees: Combining Information to Reveal the Tree of Life, pp. 301–328. Kluwer Academic, Dordrecht (2004)
Nelesen, S.: Improved methods for phylogenetics. Ph.D. thesis, The University of Texas at Austin (2009)
Swenson, M.: Phylogenetic supertree methods. Ph.D. thesis, The University of Texas at Austin (2008)
Neves, D., Warnow, T., Sobral, J., Pingali, K.: Parallelizing SuperFine. In: 27th Symposium on Applied Computing (ACM-SAC) (2012)
Cannone, J., Subramanian, S., Schnare, M., Collett, J., D’Souza, L., et al.: The comparative RNA web (CRW) site: an online database of comparative sequence and structure information for ribosomal, intron and other RNAs. BMC Bioinform. 3 (2002)
Roch, S.: Towards extracting all phylogenetic information from matrices of evolutionary distances. Science 327, 1376–1379 (2010)
Darling, A., Mau, B., Blatter, F., Perna, N.: Mauve: multiple alignment of conserved genomic sequence with rearrangements. Genome Res. 14, 1394–1403 (2004)
Darling, A., Mau, B., Perna, N.: progressiveMauve: multiple genome alignment with gene gain, loss and rearrangement. PLoS ONE 5, e11147 (2010)
Raphael, B., Zhi, D., Tang, H., Pevzner, P.: A novel method for multiple alignment of sequences with repeated and shuffled elements. Genome Res. 14, 2336–2346 (2004)
Dubchak, I., Poliakov, A., Kislyuk, A., Brudno, M.: Multiple whole-genome alignments without a reference organism. Genome Res. 19, 682–689 (2009)
Brudno, M., Do, C., Cooper, G., Kim, M., Davydov, E., et al.: LAGAN and multi-LAGAN: efficient tools for large-scale multiple alignment of genomic DNA. Genome Res. 13, 721–731 (2003)
Phuong, T., Do, C., Edgar, R., Batzoglou, S.: Multiple alignment of protein sequences with repeats and rearrangements. Nucleic Acids Res. 34, 5932–5942 (2006)
Paten, B., Earl, D., Nguyen, N., Diekhans, M., Zerbino, D., et al.: Cactus: algorithms for genome multiple sequence alignment. Genome Res. 21, 1512–1528 (2011)
Angiuoli, S., Salzberg, S.: Mugsy: fast multiple alignment of closely related whole genomes. Bioinformatics (2011). doi:10.1093/bioinformatics/btq665
Agren, J., Sundstrom, A., Hafstrom, T., Segerman, B.: Gegenees: fragmented alignment of multiple genomes for determining phylogenomic distances and genetic signatures unique for specified target groups. PLoS ONE 7, e39107 (2012)
Gogarten, J., Doolittle, W., Lawrence, J.: Prokaryotic evolution in light of gene transfer. Mol. Biol. Evol. 19, 2226–2238 (2002)
Gogarten, J., Townsend, J.: Horizontal gene transfer, genome innovation and evolution. Nat. Rev. Microbiol. 3, 679–687 (2005)
Bergthorsson, U., Richardson, A., Young, G., Goertzen, L., Palmer, J.: Massive horizontal transfer of mitochondrial genes from diverse land plant donors to basal angiosperm Amborella. Proc. Natl. Acad. Sci. USA 101, 17747–17752 (2004)
Bergthorsson, U., Adams, K., Thomason, B., Palmer, J.: Widespread horizontal transfer of mitochondrial genes in flowering plants. Nature 424, 197–201 (2003)
Wolf, Y., Rogozin, I., Grishin, N., Koonin, E.: Genome trees and the tree of life. Trends Genet. 18, 472–478 (2002)
Koonin, E., Makarova, K., Aravind, L.: Horizontal gene transfer in prokaryotes: quantification and classification. Annu. Rev. Microbiol. 55, 709–742 (2001)
Linder, C., Rieseberg, L.: Reconstructing patterns of reticulate evolution in plants. Am. J. Bot. 91, 1700–1708 (2004)
Sessa, E., Zimmer, E., Givnish, T.: Reticulate evolution on a global scale: a nuclear phylogeny for New World Dryopteris (Dryopteridaceae). Mol. Phylogenet. Evol. 64, 563–581 (2012)
Moody, M., Rieseberg, L.: Sorting through the chaff, nDNA gene trees for phylogenetic inference and hybrid identification of annual sunflowers (Helianthus sect. Helianthus). Mol. Phylogenet. Evol. 64, 145–155 (2012)
Mindell, D.: The tree of life: metaphor, model, and heuristic device. Syst. Biol. 62(3), 479–489 (2013)
Warnow, T., Evans, S., Ringe, D., Nakhleh, L.: A stochastic model of language evolution that incorporates homoplasy and borrowing. In: Phylogenetic Methods and the Prehistory of Languages, pp. 75–90. Cambridge University Press, Cambridge (2006)
Nakhleh, L., Ringe, D.A., Warnow, T.: Perfect phylogenetic networks: a new methodology for reconstructing the evolutionary history of natural languages. Language 81, 382–420 (2005)
Huson, D., Rupp, R., Scornovacca, C.: Phylogenetic Networks: Concepts, Algorithms and Applications. Cambridge University Press, Cambridge (2010)
Morrison, D.: Introduction to Phylogenetic Networks. RJR Productions, Uppsala (2011)
Nakhleh, L.: Evolutionary phylogenetic networks: models and issues. In: Problem Solving Handbook in Computational Biology and Bioinformatics, pp. 125–158. Springer, Berlin (2011)
van Iersel, L., Kelk, S., Rupp, R., Huson, D.: Phylogenetic networks do not need to be complex: using fewer reticulations to represent conflicting clusters. Bioinformatics 26, i124–i131 (2010)
Wu, Y.: An algorithm for constructing parsimonious hybridization networks with multiple phylogenetic trees. In: Proc. RECOMB (2013)
Jin, G., Nakhleh, L., Snir, S., Tuller, T.: Maximum likelihood of phylogenetic networks. Bioinformatics 22, 2604–2611 (2006)
Jin, G., Nakhleh, L., Snir, S., Tuller, T.: Inferring phylogenetic networks by the maximum parsimony criterion: a case study. Mol. Biol. Evol. 24, 324–337 (2007)
Nakhleh, L., Warnow, T., Linder, C.: Reconstructing reticulate evolution in species—theory and practice. In: Proc. 8th Conf. Comput. Mol. Biol. (RECOMB’04), pp. 337–346. ACM Press, New York (2004)
Nakhleh, L., Ruths, D., Wang, L.S.: RIATA-HGT: a fast and accurate heuristic for reconstructing horizontal gene transfer. In: Proc. 11th Conf. Computing and Combinatorics (COCOON’05). Lecture Notes in Computer Science. Springer, Berlin (2005)
Yu, Y., Than, C., Degnan, J., Nakhleh, L.: Coalescent histories on phylogenetic networks and detection of hybridization despite incomplete lineage sorting. Syst. Biol. 60, 138–149 (2011)
Lapierre, P., Lasek-Nesselquist, E., Gogarten, J.: The impact of HGT on phylogenomic reconstruction methods. Brief. Bioinform. (2012). doi:10.1093/bib/bbs050
Roch, S., Snir, S.: Recovering the tree-like trend of evolution despite extensive lateral genetic transfer: a probabilistic analysis. In: Proceedings RECOMB 2012 (2012)
Gerard, D., Gibbs, H., Kubatko, L.: Estimating hybridization in the presence of coalescence using phylogenetic intraspecific sampling. BMC Evol. Biol. 11, 291 (2011)
Yu, Y., Degnan, J., Nakhleh, L.: The probability of a gene tree topology within a phylogenetic network with applications to hybridization detection. PLoS Genet. 8, e1002660 (2012)
Chowdhury, R., Ramachandran, V.: Cache-oblivious dynamic programming. In: Proc. ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 591–600 (2006)

Acknowledgements

While writing this paper, I was a Program Director at the National Science Foundation working on the BigData program; however, the research discussed in this paper took place over a span of many years. This research was therefore supported by the U.S. National Science Foundation, Microsoft New England, the Guggenheim Foundation, the David and Lucile Packard Foundation, the Radcliffe Institute for Advanced Study, the Program for Evolutionary Dynamics at Harvard, the David Bruton Jr. Centennial Professorship in Computer Sciences at U.T. Austin, and two Faculty Research Assignments from the University of Texas at Austin.

It makes sense now to tell how some of the work in this paper came about. I was working with Randy Linder (UT-Austin Integrative Biology) on various problems, including large-scale alignment and phylogeny estimation. During our initial attempts to design a fast and accurate co-estimation method, we began by trying to come up with a better solution to the treelength optimization problem. Our interest in treelength optimization convinced a colleague, Vijaya Ramachandran (UT-Austin Computer Science), to develop a fast exact median calculator [338], which led to an improved treelength estimator; however, our subsequent studies [263] suggested that improving the treelength would not lead to improved alignments and trees. This led us to look for other approaches to obtain more accurate alignments and trees from large datasets. Our next attempts considered the impact of guide trees, which gave a small benefit [109], but even iterating in this manner did not lead to substantial improvements. Finally, we developed SATé, the co-estimation method described earlier. In a very real sense, therefore, much of the work in this chapter was inspired by David Sankoff, since he introduced the treelength optimization problem. And so, I end by thanking David Sankoff for this, as well as many other things.

Author information

Authors and Affiliations

Department of Computer Science, University of Texas at Austin, Austin, TX, USA
Tandy Warnow

Authors

Tandy Warnow

Corresponding author

Correspondence to Tandy Warnow.

Editor information

Editors and Affiliations

Department of Mathematics, Simon Fraser University, Burnaby, British Columbia, Canada
Cedric Chauve
Computer Science and Operations Research, University of Montreal, Montreal, Québec, Canada
Nadia El-Mabrouk
Biometry and Evolutionary Biology, INRIA Rhône-Alpes, University of Lyon, Villeurbanne, France
Eric Tannier

About this chapter

Cite this chapter

Warnow, T. (2013). Large-Scale Multiple Sequence Alignment and Phylogeny Estimation. In: Chauve, C., El-Mabrouk, N., Tannier, E. (eds) Models and Algorithms for Genome Evolution. Computational Biology, vol 19. Springer, London. https://doi.org/10.1007/978-1-4471-5298-9_6

DOI: https://doi.org/10.1007/978-1-4471-5298-9_6
Publisher Name: Springer, London
Print ISBN: 978-1-4471-5297-2
Online ISBN: 978-1-4471-5298-9
eBook Packages: Computer Science, Computer Science (R0)
