Alignment Free String Distances for Phylogeny

  • Conference paper

Abstract

In this paper, we compare the accuracy of four string distances in recovering the correct phylogenies of complete genomes. These distances are based on common words shared by raw genomic sequences and require no preliminary processing such as gene identification or sequence alignment. Moreover, all four are computable in linear time. The first distance is based on Maximum Significant Matches (MSM). The second is computed from the frequencies of all words of length k. The third is based on the Average length of maximum Common Substrings at any position. The fourth is based on the Ziv-Lempel compression algorithm. We describe a simulated evolutionary process that generates a set of sequences evolving along a random tree topology T. This process allows both base substitutions and fragment insertions and deletions, including horizontal gene transfers. The distances between the generated sequences are computed with each of the four string distances, and the corresponding trees T′ are reconstructed using Neighbor-Joining. Trees T and T′ are then compared using three topological criteria. These comparisons show that the MSM distance outperforms the other three regardless of the parameters used to generate the sequences.
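
As a rough illustration of the second distance in the abstract, the sketch below counts all overlapping words of length k in each sequence and compares the resulting frequency vectors. The abstract does not give the formula, so the Euclidean comparison, the function names (`kword_frequencies`, `kword_distance`, `distance_matrix`) and the default k = 3 are assumptions for illustration, not the authors' definition.

```python
from collections import Counter
from math import sqrt
from typing import Dict, List


def kword_frequencies(seq: str, k: int) -> Dict[str, float]:
    """Relative frequency of every overlapping word of length k in seq."""
    total = len(seq) - k + 1
    if total <= 0:
        raise ValueError("sequence is shorter than k")
    counts = Counter(seq[i:i + k] for i in range(total))
    return {word: count / total for word, count in counts.items()}


def kword_distance(a: str, b: str, k: int = 3) -> float:
    """Euclidean distance between k-word frequency vectors.

    Assumption: the Euclidean norm is a stand-in; the paper may
    weight or normalize the frequencies differently.
    """
    fa = kword_frequencies(a, k)
    fb = kword_frequencies(b, k)
    words = set(fa) | set(fb)
    return sqrt(sum((fa.get(w, 0.0) - fb.get(w, 0.0)) ** 2 for w in words))


def distance_matrix(seqs: List[str], k: int = 3) -> List[List[float]]:
    """Symmetric pairwise distance matrix with a zero diagonal."""
    n = len(seqs)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = kword_distance(seqs[i], seqs[j], k)
    return matrix
```

A matrix built this way is the kind of input a Neighbor-Joining implementation expects; the other three distances would slot into the same pairwise loop.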


© 2010 Springer-Verlag Berlin Heidelberg

Cite this paper

Guyon, F., Guénoche, A. (2010). Alignment Free String Distances for Phylogeny. In: Locarek-Junge, H., Weihs, C. (eds) Classification as a Tool for Research. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-10745-0_2
