Abstract
In recent years, the availability of genomic sequence data from thousands of different species has led to hopes that a phylogenetic tree of all life might be achievable. Yet the most accurate methods for estimating phylogenies are heuristics for NP-hard optimization problems, and many of them are too computationally intensive to use on large datasets. To address scalability, divide-and-conquer approaches have been proposed that divide the species into subsets, construct trees on the subsets, and then merge the subset trees. Prior approaches have divided species sets into overlapping subsets and used supertree methods to merge the subset trees, but limitations in supertree methods suggest that this kind of divide-and-conquer approach is unlikely to scale to ultra-large datasets. Recently, a new approach has been developed that divides the species set into disjoint subsets, computes trees on the subsets, and then combines the subset trees using auxiliary information (e.g., a distance matrix). Here, we describe these strategies and their theoretical properties, present open problems, and discuss opportunities for impact in large-scale phylogenetic estimation using these and similar approaches.
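As a rough illustration of the disjoint-subset strategy summarized above, the following Python sketch walks through the three stages (divide, build subset trees, merge using a distance matrix) on a toy dataset. It is not the method of the chapter. The function names (`divide`, `average_linkage`, `divide_and_conquer`), the greedy nearest-neighbour decomposition, and the use of average-linkage clustering as both the base tree method and the merger are all assumptions made for illustration. In particular, this toy merger keeps every subset tree as an intact clade of the output, which is a strong simplification.

```python
from itertools import combinations


def divide(taxa, d, max_size):
    """Greedy stand-in for a decomposition: disjoint subsets of at most max_size."""
    remaining = list(taxa)
    subsets = []
    while remaining:
        seed = remaining[0]
        # Group the seed with its nearest still-unassigned taxa.
        remaining.sort(key=lambda t: d[seed][t])
        subsets.append(remaining[:max_size])
        remaining = remaining[max_size:]
    return subsets


def average_linkage(clusters, d):
    """Join (tree, members) clusters by smallest average distance.

    Returns a single (tree, members) pair; trees are nested tuples of names.
    """
    clusters = list(clusters)
    while len(clusters) > 1:
        def avg(pair):
            (_, a), (_, b) = clusters[pair[0]], clusters[pair[1]]
            return sum(d[x][y] for x in a for y in b) / (len(a) * len(b))

        i, j = min(combinations(range(len(clusters)), 2), key=avg)
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0]


def divide_and_conquer(taxa, d, max_size):
    """Divide into disjoint subsets, build subset trees, merge with distances."""
    subsets = divide(taxa, d, max_size)
    # Base method (assumed): average linkage on each subset independently.
    subset_trees = [average_linkage([(t, [t]) for t in s], d) for s in subsets]
    # Toy merger (assumed): treat each subset tree as a unit and join the
    # units using the same distance matrix as auxiliary information.
    tree, _ = average_linkage(subset_trees, d)
    return subsets, tree


def leaves(tree):
    """List the leaf names of a nested-tuple tree."""
    if isinstance(tree, str):
        return [tree]
    return [name for child in tree for name in leaves(child)]


# Toy distance matrix: two tight groups of three taxa, far from each other.
taxa = ["A", "B", "C", "D", "E", "F"]
close = {("A", "B"): 2.0, ("A", "C"): 4.0, ("B", "C"): 4.0,
         ("D", "E"): 2.0, ("D", "F"): 4.0, ("E", "F"): 4.0}
d = {x: {y: 0.0 for y in taxa} for x in taxa}
for x, y in combinations(taxa, 2):
    d[x][y] = d[y][x] = close.get((x, y), 10.0)

subsets, tree = divide_and_conquer(taxa, d, max_size=3)
print(subsets)
print(tree)
```

The point of the sketch is only the shape of the pipeline: because the subsets share no taxa, the subset trees alone say nothing about how they relate to one another, so the merge step has to draw on auxiliary information such as the distance matrix.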
Supported by the University of Illinois at Urbana-Champaign.
Acknowledgments
This work was supported in part by NSF grant CCF-1535977. I also wish to thank Erin Molloy and Thien Le for helpful comments on the manuscript.
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Warnow, T. (2019). New Divide-and-Conquer Techniques for Large-Scale Phylogenetic Estimation. In: Holmes, I., Martín-Vide, C., Vega-Rodríguez, M. (eds) Algorithms for Computational Biology. AlCoB 2019. Lecture Notes in Computer Science, vol 11488. Springer, Cham. https://doi.org/10.1007/978-3-030-18174-1_1
DOI: https://doi.org/10.1007/978-3-030-18174-1_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-18173-4
Online ISBN: 978-3-030-18174-1
eBook Packages: Computer Science, Computer Science (R0)