Journal of Mathematical Modelling and Algorithms, Volume 5, Issue 3, pp 291–308

Generation of the Exact Distribution and Simulation of Matched Nucleotide Sequences on a Phylogenetic Tree

  • Faisal Ababneh
  • Lars S. Jermiin
  • John Robinson


Nucleotide sequences are often generated by Monte Carlo simulations to address complex evolutionary or analytic questions, but the simulations are rarely described in sufficient detail to allow the research to be replicated. Here we briefly review the Markov processes of substitution in a pair of matching (homologous) nucleotide sequences and then extend them to k matching nucleotide sequences. We describe the calculation of the joint distribution of nucleotides of two matching sequences. Based on this distribution, we give a method for simulating the divergence matrix for n sites using the multinomial distribution. This is then extended to the joint distribution for k nucleotide sequences and the corresponding 4^k divergence array, generalizing Felsenstein (Journal of Molecular Evolution, 17, 368–376, 1981), who considered stationary, homogeneous and reversible processes on trees. We give a second method to generate matched sequences that begins with a random ancestral sequence and applies a continuous Markov process to each nucleotide site, as in Rambaut and Grassly (Computer Applications in the Biosciences, 13, 235–238, 1997); further, we relate this to an equivalent approach based on an embedded Markov chain. Finally, we describe an approximate method that was recently implemented in a program developed by Jermiin et al. (Applied Bioinformatics, 2, 159–163, 2003). The three methods presented here cater for different computational and mathematical limitations and are shown in an example to produce results close to those expected on theoretical grounds. All methods are implemented using functions in the S-PLUS or R languages.
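
The first method, for two sequences, can be illustrated with a minimal sketch. This is not the authors' S-PLUS/R code. It assumes, purely for illustration, that each edge follows an F81-type process, whose transition matrix has a simple closed form, and that the two edges may have different stationary vectors. The function names (`f81_transition`, `joint_distribution`, `simulate_divergence_matrix`) and all parameter values are hypothetical. The joint distribution of the two descendant nucleotides is obtained by summing over the ancestral state, and the 4 x 4 divergence matrix for n sites is then a single multinomial draw over the 16 cells.

```python
import math
import random
from collections import Counter

NUCLEOTIDES = "ACGT"


def f81_transition(pi, rate_time):
    """Transition matrix of an F81-type process (an assumption of this sketch).

    P[i][j] = exp(-rate_time) * [i == j] + (1 - exp(-rate_time)) * pi[j],
    where pi is the stationary vector and rate_time is rate multiplied by time.
    """
    decay = math.exp(-rate_time)
    return [[decay * (i == j) + (1.0 - decay) * pi[j] for j in range(4)]
            for i in range(4)]


def joint_distribution(f0, p_x, p_y):
    """Joint distribution F[a][b] = sum_i f0[i] * p_x[i][a] * p_y[i][b].

    f0 is the ancestral nucleotide distribution; p_x and p_y are the
    transition matrices along the two edges leading to the sequences.
    """
    return [[sum(f0[i] * p_x[i][a] * p_y[i][b] for i in range(4))
             for b in range(4)]
            for a in range(4)]


def simulate_divergence_matrix(joint, n_sites, rng):
    """Draw the divergence matrix for n_sites independent sites.

    Sampling n_sites cells with probabilities joint[a][b] and counting them
    is one draw from a multinomial distribution with 16 categories.
    """
    cells = [(a, b) for a in range(4) for b in range(4)]
    weights = [joint[a][b] for a, b in cells]
    counts = Counter(rng.choices(cells, weights=weights, k=n_sites))
    return [[counts[(a, b)] for b in range(4)] for a in range(4)]


# Hypothetical parameter values, chosen only to make the sketch runnable.
ancestral = [0.25, 0.25, 0.25, 0.25]
edge_x = f81_transition([0.1, 0.2, 0.3, 0.4], rate_time=0.5)
edge_y = f81_transition([0.4, 0.3, 0.2, 0.1], rate_time=0.8)
joint = joint_distribution(ancestral, edge_x, edge_y)
divergence = simulate_divergence_matrix(joint, n_sites=1000, rng=random.Random(1))
for label, row in zip(NUCLEOTIDES, divergence):
    print(label, row)
```

Because the two edges use different stationary vectors, the simulated sequences need not share the same nucleotide composition, which is the kind of non-stationary setting the abstract contrasts with Felsenstein's stationary, homogeneous and reversible case. Replacing the closed-form F81 matrices with transition matrices from any other substitution process leaves the other two functions unchanged.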

Mathematics Subject Classifications (2000):


Key words

  • Markov processes on trees
  • Monte Carlo simulations




References

  1. Benson, D. A., Karsch-Mizrachi, I., Lipman, D. J., Ostell, J., Rapp, B. A. and Wheeler, D. L.: GenBank, Nucleic Acids Res. 28 (2000), 15–18.
  2. Conant, G. C. and Lewis, P. O.: Effects of nucleotide compositional bias in the success of the parsimony criterion in phylogenetic inference, Mol. Biol. Evol. 18 (2001), 1024–1033.
  3. Felsenstein, J.: Evolutionary trees from DNA sequences: A maximum likelihood approach, J. Mol. Evol. 17 (1981), 368–376.
  4. Felsenstein, J.: Inferring Phylogenies, Sinauer, Sunderland, Massachusetts, USA, 2004.
  5. Felsenstein, J.: PHYLIP (Phylogeny Inference Package), version 3.62, Distributed by the author. Department of Genome Sciences, University of Washington, Seattle, 2004.
  6. Gaut, B. S. and Lewis, P. O.: Success of maximum likelihood phylogeny inference in the four-taxon case, Mol. Biol. Evol. 12 (1995), 152–162.
  7. Ho, S. Y. W. and Jermiin, L. S.: Tracing the decay of the historical signal in biological sequence data, Syst. Biol. 53 (2004), 623–637.
  8. Jermiin, L. S., Ho, S. Y. W., Ababneh, F., Robinson, J. and Larkum, A. W. D.: Hetero: A program to simulate the evolution of DNA on a four-taxon tree, Appl. Bioinformatics 2 (2003), 159–163.
  9. Jermiin, L. S., Ho, S. Y. W., Ababneh, F., Robinson, J. and Larkum, A. W. D.: The biasing effect of compositional heterogeneity on phylogenetic estimates may be underestimated, Syst. Biol. 53 (2004), 638–643.
  10. Lake, J. A.: Reconstructing evolutionary trees from DNA and protein sequences: Paralinear distances, Proc. Natl. Acad. Sci. USA 91 (1994), 1155–1159.
  11. Lockhart, P. J., Steel, M. A., Hendy, M. D. and Penny, D.: Recovering evolutionary trees under a more realistic model of sequence evolution, Mol. Biol. Evol. 11 (1994), 605–612.
  12. Rambaut, A. and Grassly, N. C.: Seq-Gen: An application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees, Comput. Appl. Biosci. 13 (1997), 235–238.
  13. Swofford, D. L., Olsen, G. J., Waddell, P. J. and Hillis, D. M.: Phylogenetic inference, in D. M. Hillis, D. Moritz and B. K. Mable (eds), Molecular Systematics, 2nd Edn., Sinauer, Sunderland, Massachusetts, USA, 1996, pp. 407–514.
  14. Tavaré, S.: Some probabilistic and statistical problems on the analysis of DNA sequences, Lect. Math. Life Sci. 17 (1986), 57–86.
  15. Van Den Bussche, R. A., Baker, R. J., Huelsenbeck, J. P. and Hillis, D. M.: Base compositional bias and phylogenetic analyses: A test of the “flying DNA” hypothesis, Mol. Phylogenet. Evol. 10 (1998), 408–416.
  16. Zharkikh, A.: Estimation of evolutionary distances between nucleotide sequences, J. Mol. Evol. 39 (1994), 315–329.

Copyright information

© Springer Science+Business Media, Inc. 2006

Authors and Affiliations

  • Faisal Ababneh (1)
  • Lars S. Jermiin (2, 3, 4)
  • John Robinson (1)

  1. School of Mathematics and Statistics, University of Sydney, Sydney, Australia
  2. School of Biological Sciences, University of Sydney, Sydney, Australia
  3. Sydney University Biological Informatics and Technology Centre, University of Sydney, Sydney, Australia
  4. Unité de Biologie Moléculaire de Gène chez les Extrêmophiles, Institut Pasteur, Paris Cedex, France
