New Divide-and-Conquer Techniques for Large-Scale Phylogenetic Estimation

  • Conference paper
Algorithms for Computational Biology (AlCoB 2019)

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 11488)

Abstract

In recent years, the availability of genomic sequence data from thousands of different species has led to hopes that a phylogenetic tree of all life might be achievable. Yet the most accurate methods for estimating phylogenies are heuristics for NP-hard optimization problems, and many of them are too computationally intensive to use on large datasets. To address scalability, divide-and-conquer approaches have been proposed that divide the species into subsets, construct trees on the subsets, and then merge the subset trees. Prior approaches have divided the species set into overlapping subsets and used supertree methods to merge the subset trees, but limitations in supertree methods suggest that this kind of divide-and-conquer approach is unlikely to scale to ultra-large datasets. Recently, a new approach has been developed that divides the species set into disjoint subsets, computes trees on the subsets, and then combines the subset trees using auxiliary information (e.g., a distance matrix). Here, we describe these strategies and their theoretical properties, present open problems, and discuss opportunities for impact in large-scale phylogenetic estimation using these and similar approaches.
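As a rough illustration of the disjoint-subset strategy described in the abstract, the following Python sketch divides the taxa into disjoint subsets, builds a tree on each subset, and then joins the subset trees using a distance matrix. It is a toy stand-in under simplifying assumptions and is not any of the published methods: the function names (`divide`, `agglomerate`, `divide_and_conquer`), the greedy seed-based decomposition, and the average-distance joining rule are all hypothetical choices made for this example, and the merge step here keeps every subset tree intact as a clade, which real merging methods need not do.

```python
from itertools import combinations


def leaves(tree):
    """Return the leaf names of a tree stored as nested tuples of strings."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]


def avg_dist(a, b, dist):
    """Average pairwise distance between two lists of leaf names."""
    return sum(dist[x][y] for x in a for y in b) / (len(a) * len(b))


def agglomerate(trees, dist):
    """Repeatedly join the two closest trees until a single tree remains.

    Used both to build each subset tree (toy stand-in for a real tree
    estimator) and to merge subset trees (toy stand-in for a real merger).
    """
    trees = list(trees)
    while len(trees) > 1:
        i, j = min(
            combinations(range(len(trees)), 2),
            key=lambda p: avg_dist(leaves(trees[p[0]]), leaves(trees[p[1]]), dist),
        )
        joined = (trees[i], trees[j])
        trees = [t for k, t in enumerate(trees) if k not in (i, j)] + [joined]
    return trees[0]


def divide(taxa, dist, max_size):
    """Greedy decomposition (an assumption of this sketch): take a seed
    taxon and its nearest remaining taxa as one subset, then repeat."""
    remaining = list(taxa)
    subsets = []
    while remaining:
        seed = remaining[0]
        remaining.sort(key=lambda t: dist[seed][t])
        subsets.append(remaining[:max_size])
        remaining = remaining[max_size:]
    return subsets


def divide_and_conquer(taxa, dist, max_size):
    """Divide into disjoint subsets, build subset trees, merge them."""
    subsets = divide(taxa, dist, max_size)
    subset_trees = [agglomerate(s, dist) for s in subsets]
    return agglomerate(subset_trees, dist)


# Made-up example data: two tight pairs that are far from each other.
taxa = ["A", "B", "C", "D"]
close = {("A", "B"): 1.0, ("C", "D"): 1.0}
dist = {
    x: {y: 0.0 if x == y else close.get((x, y), close.get((y, x), 4.0)) for y in taxa}
    for x in taxa
}
tree = divide_and_conquer(taxa, dist, max_size=2)
print(tree)  # (('A', 'B'), ('C', 'D'))
```

The subsets never share a taxon, so no supertree method is needed; the only information used to connect the subset trees is the distance matrix, which plays the role of the auxiliary information mentioned above.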

Supported by the University of Illinois at Urbana-Champaign.

Acknowledgments

This work was supported in part by NSF grant CCF-1535977. I also wish to thank Erin Molloy and Thien Le for helpful comments on the manuscript.

Author information

Corresponding author

Correspondence to Tandy Warnow.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Warnow, T. (2019). New Divide-and-Conquer Techniques for Large-Scale Phylogenetic Estimation. In: Holmes, I., Martín-Vide, C., Vega-Rodríguez, M. (eds) Algorithms for Computational Biology. AlCoB 2019. Lecture Notes in Computer Science, vol 11488. Springer, Cham. https://doi.org/10.1007/978-3-030-18174-1_1

  • DOI: https://doi.org/10.1007/978-3-030-18174-1_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-18173-4

  • Online ISBN: 978-3-030-18174-1

  • eBook Packages: Computer Science (R0)
