Skip to main content

Information Theoretic Approaches to Whole Genome Phylogenies

(Extended Abstract)

  • Conference paper
Research in Computational Molecular Biology (RECOMB 2005)

Part of the book series: Lecture Notes in Computer Science ((LNBI,volume 3500))

Abstract

We describe a novel method for efficient reconstruction of phylogenetic trees, based on sequences of whole genomes or proteomes. The core of our method is a new measure of pairwise distances between sequences, whose lengths may greatly vary. This measure is based on information theoretic tools (Kullback-Leibler relative entropy). We present an algorithm for efficiently computing these distances. The algorithm uses suffix arrays to compute the distance of two ℓ long sequences in O(ℓ) time. It is fast enough to enable the construction of the phylogenomic tree for hundreds of species, and the phylogenomic forest for almost two thousand viruses. An initial analysis of the results exhibits a remarkable agreement with “acceptable phylogenetic truth”. To assess our approach, it was implemented together with a number of alternative approaches, including two that were previously published in the literature. Comparing their outcome to ours, using a “traditional” tree and a standard tree comparison method, our algorithm improved upon the “competition” by a substantial margin.

Research supported by ISF grant 418/00.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 84.99
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 109.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. Ben-Dor, A., Chor, B., Graur, D., Ophir, R., Pelleg, D.: Constructing phylogenies from quartets: elucidation of eutherian superordinal relationships. Journal of computational Biology 5, 377–390 (1998)

    Article  Google Scholar 

  2. Bininda-Emonds, O.: Phylogenetic Supertrees: Combining Information to Reveal the Tree of Life. Kluwer series in Computational Biology (2004)

    Google Scholar 

  3. Blusch, J.H., Seelmeir, S., Helm, K.V.: Molecular and enzymatic characterization of the porcine endogenous retrovirus protease. Virol 76(15), 7913–7917 (2002)

    Article  Google Scholar 

  4. Bourque, G., Pevzner, P.A.: Genome-scale evolution: Reconstructing gene order in ancestral species. Genome Research 12, 26–36 (2002)

    Google Scholar 

  5. Chen, X., Kwong, S., Li, M.: A compression algorithm for dna sequences and its applications in genome comparison. In: RECOMB, pp. 107–117 (2000)

    Google Scholar 

  6. Cover, T.M., Thomas, J.A.: Elements of Information Theory. J. Wiley and Sons, New York (1991)

    Book  MATH  Google Scholar 

  7. Darwin, C.: On the origin of species, 1st edn., November 24 (1859)

    Google Scholar 

  8. NCBI Taxonomy Database, http://www.ncbi.nlm.nih.gov/entrez/linkout/tutorial/taxtour.html

  9. The Universal Virus Database, http://www.ncbi.nlm.nih.gov/ictvdb/ictvdb/

  10. Downie, S., Palmer, J.: Use of chloroplast dna rearangements in reconstructing plant phylogeny. In: Soltis, P., Soltis, D., Doyle, J. (eds.) Plant Molecular Systematics, pp. 14–35. Chapman and Hall, Boca Raton (1992)

    Google Scholar 

  11. Durbin, R., Eddy, S.R., Krogh, A., Mitchison, G.: Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge University Press, Cambridge (1998)

    Book  MATH  Google Scholar 

  12. Endres, D.M., Schindelin, J.E.: A new metric for probability distribution. IEEE Tran. Inf. Theory. 49(7) (2003)

    Google Scholar 

  13. NCBI Genome Entrez, http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=genome

  14. Farach, M., Noordewier, M., Savari, S., Shepp, L., Wyner, A., Ziv, J.: On the entropy of dna: Algorithms and measurements based on memory and rapid. In: Symposium on Discrete Algorithms (1994)

    Google Scholar 

  15. Felsenstein, J.: Phylip (phylogeny inference package) version 3.5c. Distributed by the author. Department of Genetics, University of Washington, Seattle (1993)

    Google Scholar 

  16. Hannenhalli, S., Pevzner, P.: Transforming cabbage into turnip (polynomial algorithm for sorting signed permutations by reversals). In: Proc. 27th Annual ACM Symposium on the Theory of Computing, pp. 178–189 (1995)

    Google Scholar 

  17. Ribosomal Database Project II, http://rdp.cme.msu.edu/html/

  18. Lempel, A., Ziv, J.: On the complexity of finite sequences. IEEE Trans. Inf. Theory 22, 75–88 (1976)

    Article  MATH  MathSciNet  Google Scholar 

  19. Lempel, A., Ziv, J.: A universal algorithm for sequential data compression. IEEE Trans. Inf. Theory (1977)

    Google Scholar 

  20. Li, M., Badger, J., Chen, X., Kwong, S., Kearney, P., Zhang, H.: An information-based sequence distance and its application to whole mitochondrial genome phylogeny. Bioinformatics 17(2), 149–154 (2001)

    Article  Google Scholar 

  21. Ma, B., Li, M., Zhang, L.: From gene trees to species trees. SIAM, Philadelphia (1998)

    Google Scholar 

  22. Moret, B.M.E., Wang, L.S., Warnow, T., Wyman, S.K.: New approaches for reconstructing phylogenies from gene order data. bioinformatics 17, 165–173 (2001)

    Google Scholar 

  23. Nelson, M.: LZW Data Compression (1989)

    Google Scholar 

  24. Origins of viruses, http://www.mcb.uct.ac.za/tutorial/virorig.html

  25. Otu, H.H., Sayood, K.: A new sequence distance measure for phylogenetic tree construction. Bioinformatics 19(16) (2003)

    Google Scholar 

  26. Qi, J., Wang, B., Hao, B.: Whole proteome prokaryote phylogeny without sequence alignment: a k-string composition approach. J. Mol. Evol. 58(1), 1–11 (2004)

    Article  Google Scholar 

  27. Raul, P.T., Gordon, B., Oliver, E.: Quartet Supertrees. In: Bininda-Emonds, Olaf, R.P. (eds.) Phylogenetic Supertrees: Combining Information to Reveal the Tree of Life, pp. 173–191. Kluwer Academic, Dordrecht (2004) (in press)

    Google Scholar 

  28. Sacco, M.A., Flannery, D.M.J., Howes, K., Venugopal, K.: Avian endogenous retrovirus eav-hp shares regions of identity with avian leukosis virus subgroup j and the avian retrotransposon art-ch. J. Virol 74(3), 1296–1306 (2000)

    Article  Google Scholar 

  29. Saitou, N., Nei, M.: The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4, 406–425 (1987)

    Google Scholar 

  30. Sayood, K.: Introduction to data compression, 2nd edn.

    Google Scholar 

  31. Stefan, B., Kärkkäinen, J.: Fast lightweight suffix array construction and checking, pp. 55–69 (2003)

    Google Scholar 

  32. Stuart, G.W., Berry, M.W.: A comprehensive whole genome bacterial phylogeny using correlated peptide motive defined in a high dimensional vector space. Journal of Bioinformatics and Computational Biology 1(3), 475–493 (2003)

    Article  Google Scholar 

  33. Triyatnib, M., Ey, P.L., Tran, T., Mire, M.L., Qiao, M., Burrell, C.J., Jilbert, A.R.: Sequence comparison of an australian duck hepatitis b virus strain with other avian hepadnaviruses. Journal of General Virology 82, 373–378 (2001)

    Google Scholar 

  34. Waters, E., Hohn, M.J., Ahel, I., Graham, D.E., Adams, M.D., Barnstead, M., Beeson, K.Y., Bibbs, L., Bolanos, R., Keller, M., Kretz, K., Lin, X., Mathur, E., Ni, J., Podar, M., Richardson, T., Sutton, G.G., Simon, M., Söll, D., Stetter, K.O., Short, J.M., Noordewier, M.: The genome of nanoarchaeum equitans: Insights into early archaeal evolution and derived parasitism. Proc. Natl. Acad. Sci. USA 100(22), 12984–12988 (2003)

    Article  Google Scholar 

  35. Wyner, A.D., Wyner, A.J.: An improved version of lempel-ziv algorithm. IEEE Tran. Inf. Theory (1995)

    Google Scholar 

  36. Wyner, A.J.: String matching theorems and applications to data compression and statistics. Ph.d., Stanford (1993)

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Burstein, D., Ulitsky, I., Tuller, T., Chor, B. (2005). Information Theoretic Approaches to Whole Genome Phylogenies. In: Miyano, S., Mesirov, J., Kasif, S., Istrail, S., Pevzner, P.A., Waterman, M. (eds) Research in Computational Molecular Biology. RECOMB 2005. Lecture Notes in Computer Science(), vol 3500. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11415770_22

Download citation

  • DOI: https://doi.org/10.1007/11415770_22

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-25866-7

  • Online ISBN: 978-3-540-31950-4

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics