Multi-SpaM: A Maximum-Likelihood Approach to Phylogeny Reconstruction Using Multiple Spaced-Word Matches and Quartet Trees

Dencker, Thomas; Leimeister, Chris-André; Gerth, Michael; Bleidorn, Christoph; Snir, Sagi; Morgenstern, Burkhard

doi:10.1007/978-3-030-00834-5_13

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 11183)

Included in the following conference series: RECOMB International Conference on Comparative Genomics

Abstract

Word-based or ‘alignment-free’ methods for phylogeny reconstruction are much faster than traditional, alignment-based approaches, but they are generally less accurate. Most alignment-free methods calculate pairwise distances for a set of input sequences, for example from word frequencies, from so-called spaced-word matches or from the average length of common substrings. In this paper, we propose the first word-based phylogeny approach that is based on multiple sequence comparison and Maximum Likelihood. Our algorithm first samples small, gap-free alignments involving four taxa each. For each of these alignments, it then calculates a quartet tree and, finally, the program Quartet MaxCut is used to infer a super tree for the full set of input taxa from the calculated quartet trees. Experimental results show that trees calculated with our approach are of high quality.
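
The sampling step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the dictionary input format, the binary pattern `"1101"` and the uniform sampling scheme are all assumptions made for illustration. The sketch collects spaced words (the characters at the `1` positions of a pattern) from every sequence, keeps the words that occur in at least four taxa, and draws gap-free four-sequence blocks from them. Quartet-tree inference and supertree construction with Quartet MaxCut are left out.

```python
import random
from collections import defaultdict


def spaced_word(seq, pos, pattern):
    """Return the characters of seq at the '1' positions of pattern, starting at pos."""
    return "".join(seq[pos + i] for i, c in enumerate(pattern) if c == "1")


def sample_quartet_blocks(seqs, pattern, n_blocks, seed=0):
    """Sample gap-free four-sequence blocks that share a spaced word.

    seqs is a dict mapping taxon name -> sequence string (hypothetical
    input format). Each returned block maps four taxon names to an
    aligned segment of length len(pattern).
    """
    rng = random.Random(seed)

    # Index every spaced word: word -> {taxon: [start positions]}.
    occurrences = defaultdict(dict)
    for name, seq in seqs.items():
        for pos in range(len(seq) - len(pattern) + 1):
            word = spaced_word(seq, pos, pattern)
            occurrences[word].setdefault(name, []).append(pos)

    # Only words seen in at least four different taxa can form a block.
    candidates = sorted(w for w, occ in occurrences.items() if len(occ) >= 4)
    blocks = []
    if not candidates:
        return blocks

    for _ in range(n_blocks):
        word = rng.choice(candidates)
        occ = occurrences[word]
        taxa = rng.sample(sorted(occ), 4)  # four distinct taxa
        block = {}
        for taxon in taxa:
            start = rng.choice(occ[taxon])  # one occurrence per taxon
            block[taxon] = seqs[taxon][start:start + len(pattern)]
        blocks.append(block)
    return blocks


if __name__ == "__main__":
    toy = {"a": "ACGTAC", "b": "ACTTAC", "c": "ACGTAC", "d": "ACATAC"}
    for block in sample_quartet_blocks(toy, "1101", n_blocks=2):
        print(block)
```

Each sampled block is a small ungapped alignment of four sequences that agree at the match positions and may differ at the don't-care position. Under the approach summarised in the abstract, such a block would then be passed to a quartet-tree calculation.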

Funding

The project was funded by VW Foundation, project VWZN3157. We acknowledge support by the Open Access Publication Funds of Göttingen University.

Author information

Authors and Affiliations

Department of Bioinformatics, Institute of Microbiology and Genetics, University of Goettingen, Goldschmidtstr. 1, 37077, Goettingen, Germany
Thomas Dencker, Chris-André Leimeister & Burkhard Morgenstern
Institute for Integrative Biology, University of Liverpool, Biosciences Building, Crown Street, Liverpool, L69 7ZB, UK
Michael Gerth
Department of Animal Evolution and Biodiversity, University of Goettingen, Untere Karspüle 2, 37073, Goettingen, Germany
Christoph Bleidorn
Museo Nacional de Ciencias Naturales, Spanish National Research Council (CSIC), 28006, Madrid, Spain
Christoph Bleidorn
Institute of Evolution, Department of Evolutionary and Environmental Biology, University of Haifa, 199 Aba Khoushy Ave. Mount Carmel, Haifa, Israel
Sagi Snir
Goettingen Center of Molecular Biosciences (GZMB), Justus-von-Liebig-Weg 11, 37077, Goettingen, Germany
Burkhard Morgenstern

Corresponding author

Correspondence to Burkhard Morgenstern.

Editor information

Editors and Affiliations

McGill University, Montréal, QC, Canada
Mathieu Blanchette
Université de Sherbrooke, Sherbrooke, QC, Canada
Aïda Ouangraoua

Cite this paper

Dencker, T., Leimeister, C.-A., Gerth, M., Bleidorn, C., Snir, S., Morgenstern, B. (2018). Multi-SpaM: A Maximum-Likelihood Approach to Phylogeny Reconstruction Using Multiple Spaced-Word Matches and Quartet Trees. In: Blanchette, M., Ouangraoua, A. (eds) Comparative Genomics. RECOMB-CG 2018. Lecture Notes in Computer Science, vol 11183. Springer, Cham. https://doi.org/10.1007/978-3-030-00834-5_13

DOI: https://doi.org/10.1007/978-3-030-00834-5_13
Published: 08 September 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-00833-8
Online ISBN: 978-3-030-00834-5
eBook Packages: Computer Science, Computer Science (R0)