Hands-on Introduction to Sequence-Length Requirements in Phylogenetics

  • Sébastien Roch
Part of the Computational Biology book series (COBO, volume 29)


In this tutorial, through a series of analytical computations and numerical simulations, we review many known insights into a fundamental question: how much data is needed to reconstruct the Tree of Life? A Jupyter notebook and code for this tutorial are provided in Python.


Phylogenetics; Sequence-length requirements; Distance-based methods; Maximum likelihood estimation



This work is supported by NSF grants DMS-1149312 (CAREER), DMS-1614242, and CCF-1740707 (TRIPODS).

   When I was first introduced to the field of computational phylogenetics in graduate school, I had the privilege of being supported by the NSF-funded CIPRES project, of which Bernard Moret was a leader. That support had a significant impact on my early career.



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Department of Mathematics, University of Wisconsin, Madison, USA
