Abstract
In this paper, we compare the accuracy of four string distances in recovering correct phylogenies from complete genomes. These distances are based on common words shared by raw genomic sequences and require no preliminary processing steps such as gene identification or sequence alignment. Moreover, they are computable in linear time. The first distance is based on Maximum Significant Matches (MSM). The second is computed from the frequencies of all words of length k. The third is based on the average length of maximum common substrings at any position. The fourth is based on the Ziv-Lempel compression algorithm. We describe a simulated evolutionary process that generates a set of sequences evolving along a random tree topology T. This process allows both base substitutions and fragment insertions and deletions, including horizontal gene transfers. The distances between the generated sequences are computed using the four string distances, and the corresponding trees T′ are reconstructed using Neighbor-Joining. Trees T and T′ are compared using three topological criteria. These comparisons show that the MSM distance outperforms the others regardless of the parameters used to generate the sequences.
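To make the second distance more concrete, here is a minimal Python sketch of a distance computed from the frequencies of all words of length k. The abstract does not give the formula used in the paper, so the Euclidean distance between frequency vectors is an assumption made for illustration, as are the function names `kmer_frequencies` and `kmer_distance` and the toy sequences. The sketch also slices each word explicitly, so it is not the linear-time computation the abstract mentions.

```python
from collections import Counter
import math


def kmer_frequencies(seq, k):
    """Relative frequency of every overlapping word of length k in seq."""
    if len(seq) < k:
        raise ValueError("sequence is shorter than the word length k")
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def kmer_distance(a, b, k=3):
    """Euclidean distance between the k-word frequency vectors of a and b.

    Assumption: the paper may combine the frequencies differently;
    Euclidean distance is used here only as a simple stand-in.
    """
    fa, fb = kmer_frequencies(a, k), kmer_frequencies(b, k)
    words = set(fa) | set(fb)  # words absent from one sequence count as 0
    return math.sqrt(sum((fa.get(w, 0.0) - fb.get(w, 0.0)) ** 2 for w in words))


# Hypothetical toy sequences, not data from the paper.
genomes = {
    "g1": "ACGTACGTACGTTTGA",
    "g2": "ACGTACGAACGTTTGA",
    "g3": "TTTTGGGGCCCCAAAA",
}

# Pairwise distance matrix, the kind of input a Neighbor-Joining
# implementation would take to reconstruct a tree.
names = sorted(genomes)
matrix = [[kmer_distance(genomes[x], genomes[y], k=3) for y in names] for x in names]
```

The resulting `matrix` is symmetric with a zero diagonal, and no alignment or gene annotation is needed to build it, which is the property the abstract emphasizes.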
References
Amir, A., & Keselman, D. (1997). Maximum agreement subtree in a set of evolutionary trees: metric and efficient algorithms. SIAM Journal on Computing, 26, 1656–1669.
Estabrook, G. F., et al. (1985). Comparison of undirected phylogenetic trees based on subtrees of four evolutionary units. Systematic Zoology, 34, 193–200.
Guyon, F., & Guénoche, A. (2010). An evolutionary distance based on maximal unique matches. Communications in Statistics, 39(3), 385–397.
Karlin, S., & Burge, C. (1995). Dinucleotide relative abundance extremes: A genomic signature. Trends in Genetics, 11, 283–290.
Kimura, M. (1980). A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. Journal of Molecular Evolution, 16, 111–120.
Kurtz, S., Phillippy, A., Delcher, A. L., Smoot, M., Shumway, M., Antonescu, C., & Salzberg, S. L. (2004). Versatile and open software for comparing large genomes. Genome Biology, 5, R12.
Otu, H. H., & Sayood, K. (2003). A new sequence distance measure for phylogenetic tree construction. Bioinformatics, 19(16), 2122–2130.
Qi, J., Wang, B., & Hao, B. I. (2004). Whole proteome prokaryote phylogeny without sequence alignment: A K-string composition approach. Journal of Molecular Evolution, 58(1), 1–11.
Robinson, D. F., & Foulds, L. R. (1981). Comparison of phylogenetic trees. Mathematical Biosciences, 53, 131–147.
Saitou, N., & Nei, M. (1987). The neighbor-joining method: A new method for reconstructing phylogenetic trees. Molecular Biology and Evolution, 4, 406–425.
Snel, B., Huynen, M. A., & Dutilh, B. E. (2005). Genome trees and the nature of genome evolution. Annual Review of Microbiology, 59, 191–209.
Ulitsky, I., Burnstein, D., Tuller, T., & Chor, B. (2006). The average common substring approach to phylogenomic reconstruction. Journal of Computational Biology, 13, 336–350.
Ziv, J., & Lempel, A. (1977). A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, 23, 337–343.
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Guyon, F., Guénoche, A. (2010). Alignment Free String Distances for Phylogeny. In: Locarek-Junge, H., Weihs, C. (eds) Classification as a Tool for Research. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-10745-0_2
DOI: https://doi.org/10.1007/978-3-642-10745-0_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-10744-3
Online ISBN: 978-3-642-10745-0
eBook Packages: Mathematics and Statistics (R0)