Alignment Free String Distances for Phylogeny

Guyon, Frédéric; Guénoche, Alain

doi:10.1007/978-3-642-10745-0_2

Conference paper
First Online: 01 January 2010

Part of the book series: Studies in Classification, Data Analysis, and Knowledge Organization (STUDIES CLASS)

Abstract

In this paper, we compare the accuracy of four string distances in recovering the correct phylogenies of complete genomes. These distances are based on common words shared by raw genomic sequences and do not require preliminary processing steps such as gene identification or sequence alignment. Moreover, they are computable in linear time. The first distance is based on Maximum Significant Matches (MSM). The second is computed from the frequencies of all words of length k. The third is based on the Average length of maximum Common Substrings at any position. The last is based on the Ziv-Lempel compression algorithm. We describe a simulated evolutionary process that generates a set of sequences evolving according to a random tree topology T. This process allows both base substitutions and fragment insertions and deletions, including horizontal gene transfers. The distances between the generated sequences are computed using the four distance formulas, and the corresponding trees T′ are reconstructed using Neighbor-Joining. Trees T and T′ are compared using three topological criteria. These comparisons show that the MSM distance outperforms the others regardless of the parameters used to generate the sequences.
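
As an illustration of the second distance (frequencies of all words of length k), the following is a minimal Python sketch. The abstract does not state the formula, so the Euclidean comparison of frequency vectors, the function names (kmer_frequencies, kmer_distance, distance_matrix), and the default k = 3 are assumptions made here for illustration, not the authors' definition.

```python
import math
from collections import Counter


def kmer_frequencies(seq, k):
    """Relative frequency of every overlapping word of length k in seq."""
    if len(seq) < k:
        raise ValueError("sequence is shorter than k")
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def kmer_distance(a, b, k=3):
    """Euclidean distance between two k-word frequency vectors.

    Assumption: the paper may use a different comparison of frequencies;
    Euclidean distance is chosen here only because it is simple.
    """
    fa, fb = kmer_frequencies(a, k), kmer_frequencies(b, k)
    words = set(fa) | set(fb)
    return math.sqrt(sum((fa.get(w, 0.0) - fb.get(w, 0.0)) ** 2 for w in words))


def distance_matrix(seqs, k=3):
    """Symmetric matrix of pairwise k-word distances."""
    n = len(seqs)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = kmer_distance(seqs[i], seqs[j], k)
    return d


if __name__ == "__main__":
    toy = ["ACGTACGT", "AAAAAAAA", "ACGTACGA"]  # toy sequences, not real genomes
    for row in distance_matrix(toy, k=2):
        print(["%.3f" % x for x in row])
```

For a fixed k, counting words takes time linear in the sequence length, which is consistent with the linear-time claim in the abstract. A matrix produced this way could then be passed to any Neighbor-Joining implementation to obtain a tree T′.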

References

Amir, A., & Keselman, D. (1997). Maximum agreement subtree in a set of evolutionary trees: metric and efficient algorithms. SIAM Journal on Computing, 26, 1656–1669.
Estabrook, G. F., et al. (1985). Comparison of undirected phylogenetic trees based on subtrees of four evolutionary units. Systematic Zoology, 34, 193–200.
Guyon, F., & Guénoche, A. (2010). An evolutionary distance based on maximal unique matches. Communications in Statistics, 39(3), 385–397.
Karlin, S., & Burge, C. (1995). Dinucleotide relative abundance extremes: A genomic signature. Trends in Genetics, 11, 283–290.
Kimura, M. (1980). A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. Journal of Molecular Evolution, 16, 111–120.
Kurtz, S., Phillippy, A., Delcher, A. L., Smoot, M., Shumway, M., Antonescu, C., & Salzberg, S. L. (2004). Versatile and open software for comparing large genomes. Genome Biology, 5, R12.
Otu, H. H., & Sayood, K. (2003). A new sequence distance measure for phylogenetic tree construction. Bioinformatics, 19(16), 2122–2130.
Qi, J., Wang, B., & Hao, B. I. (2004). Whole proteome prokaryote phylogeny without sequence alignment: A K-string composition approach. Journal of Molecular Evolution, 58(1), 1–11.
Robinson, D. F., & Foulds, L. R. (1981). Comparison of phylogenetic trees. Mathematical Biosciences, 53, 131–147.
Saitou, N., & Nei, M. (1987). The neighbor-joining method: A new method for reconstructing phylogenetic trees. Molecular Biology and Evolution, 4, 406–425.
Snel, B., Huynen, M. A., & Dutilh, B. E. (2005). Genome trees and the nature of genome evolution. Annual Review of Microbiology, 59, 191–209.
Ulitsky, I., Burnstein, D., Tuller, T., & Chor, B. (2006). The average common substring approach to phylogenomic reconstruction. Journal of Computational Biology, 13, 336–350.
Ziv, J., & Lempel, A. (1977). A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, 23, 337–343.

Author information

Authors and Affiliations

IML, CNRS, 163 Avenue de Luminy, Marseille, France
Alain Guénoche

Editor information

Editors and Affiliations

LS für BWL, insb. Finanzwirtschaft und, Finanzdienstleistungen, TU Dresden, Helmholtzstr. 10, Dresden, 01062, Germany
Hermann Locarek-Junge
FG Computergestützte Statistik, Univ. Dortmund, Vogelpothsweg 87, Dortmund, 44227, Germany
Claus Weihs

About this paper

Cite this paper

Guyon, F., Guénoche, A. (2010). Alignment Free String Distances for Phylogeny. In: Locarek-Junge, H., Weihs, C. (eds) Classification as a Tool for Research. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-10745-0_2

DOI: https://doi.org/10.1007/978-3-642-10745-0_2
Published: 03 May 2010
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-10744-3
Online ISBN: 978-3-642-10745-0
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)
