Large-Scale Multiple Sequence Alignment and Tree Estimation Using SATé

Liu, Kevin; Warnow, Tandy

doi:10.1007/978-1-62703-646-7_15

Part of the book series: Methods in Molecular Biology (MIMB, volume 1079)

Abstract

SATé is a method for estimating multiple sequence alignments and trees that has been shown to produce highly accurate results for datasets with large numbers of sequences. Running SATé using its default settings is very simple, but improved accuracy can be obtained by modifying its algorithmic parameters. We provide a detailed introduction to the algorithmic approach used by SATé, and instructions for running a SATé analysis using the GUI under default settings. We also provide a discussion of how to modify these settings to obtain improved results, and how to use SATé in a phylogenetic analysis pipeline.

Acknowledgments

This work was supported by a training fellowship to KL from the Keck Center of the Gulf Coast Consortia, on the NLM Training Program in Biomedical Informatics, National Library of Medicine (NLM) T15LM007093. This work was also partially supported by NSF grant DEB 0733029 to TW.

Author information

Authors and Affiliations

Department of Computer Science, Rice University, Houston, TX, USA
Kevin Liu
Department of Computer Science, The University of Texas at Austin, Austin, TX, USA
Tandy Warnow

Editor information

Editors and Affiliations

Dept. of Electrical Engineering, University of Nebraska-Lincoln, Lincoln, Nebraska, USA
David J Russell

About this protocol

Cite this protocol

Liu, K., Warnow, T. (2014). Large-Scale Multiple Sequence Alignment and Tree Estimation Using SATé. In: Russell, D. (ed) Multiple Sequence Alignment Methods. Methods in Molecular Biology, vol 1079. Humana Press, Totowa, NJ. https://doi.org/10.1007/978-1-62703-646-7_15

DOI: https://doi.org/10.1007/978-1-62703-646-7_15
Published: 23 August 2013
Publisher Name: Humana Press, Totowa, NJ
Print ISBN: 978-1-62703-645-0
Online ISBN: 978-1-62703-646-7
eBook Packages: Springer Protocols
