Bayesian Approach to DNA Segmentation into Regions with Different Average Nucleotide Composition

  • Vsevolod Makeev
  • Vasily Ramensky
  • Mikhail Gelfand
  • Mikhail Roytberg
  • Vladimir Tumanyan
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2066)


We present a new method for segmenting nucleotide sequences into regions with different average composition. The sequence is modelled as a series of segments; within each segment the sequence is treated as a random sequence of independent and identically distributed variables. The partition algorithm has two stages. In the first stage the optimal partition is found, which maximises the overall product of the marginal likelihoods calculated for each segment. To prevent segmentation into short segments, a border insertion penalty may be introduced. In the second stage segments with close compositions are merged. Filtration is performed with the help of the partition function calculated for all possible subsets of boundaries that belong to the optimal partition. Long sequences can be segmented by dividing them into parts and segmenting those parts separately. The contextual effects of repeats, genes and other genomic elements are readily visualised.
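
The first stage described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a symmetric Dirichlet prior with parameter `alpha` over the four nucleotide frequencies (the abstract does not state the prior), so the marginal likelihood of a segment with letter counts n_1..n_4 and length N is Γ(4α)/Γ(N+4α) · Π Γ(n_i+α)/Γ(α). The names `log_marginal`, `segment`, `penalty` and `alpha` are hypothetical. The optimal partition is found by dynamic programming over all boundary positions, with `penalty` subtracted from the log score for every inserted border. The second stage (filtration through the partition function) is not shown.

```python
import math

ALPHABET = "ACGT"  # assumption: input contains only these four letters


def log_marginal(counts, alpha=1.0):
    """Log marginal likelihood of one segment, given its letter counts,
    for i.i.d. letters with a symmetric Dirichlet(alpha) prior (assumed)."""
    k = len(counts)
    n = sum(counts)
    result = math.lgamma(k * alpha) - math.lgamma(n + k * alpha)
    for c in counts:
        result += math.lgamma(c + alpha) - math.lgamma(alpha)
    return result


def segment(seq, penalty=0.0, alpha=1.0):
    """Return the sorted interior boundary positions of the partition that
    maximises the sum of segment log marginal likelihoods minus `penalty`
    per boundary. Runs in O(len(seq)**2) time."""
    n = len(seq)
    # prefix[j][a] = number of occurrences of letter a in seq[:j]
    prefix = [[0] * len(ALPHABET)]
    for ch in seq:
        row = prefix[-1][:]
        row[ALPHABET.index(ch)] += 1  # raises ValueError on other letters
        prefix.append(row)

    def seg_score(i, j):
        counts = [prefix[j][a] - prefix[i][a] for a in range(len(ALPHABET))]
        return log_marginal(counts, alpha)

    # best[j] = best score of any partition of seq[:j]
    best = [0.0] + [-math.inf] * n
    back = [0] * (n + 1)  # back[j] = start of the last segment in that partition
    for j in range(1, n + 1):
        for i in range(j):
            score = best[i] + seg_score(i, j) - (penalty if i > 0 else 0.0)
            if score > best[j]:
                best[j] = score
                back[j] = i

    # Trace back through the segment starts to collect interior boundaries.
    bounds = []
    j = n
    while back[j] > 0:
        j = back[j]
        bounds.append(j)
    return sorted(bounds)


if __name__ == "__main__":
    toy = "A" * 30 + "C" * 30
    print(segment(toy))                 # boundary between the two blocks
    print(segment(toy, penalty=100.0))  # a large penalty suppresses the border
```

The quadratic running time of this sketch is one reason why long sequences would be divided into parts and segmented separately, as the abstract notes.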


Keywords: Partition Function, Hidden Markov Model, Random Sequence, Bayesian Approach, Marginal Likelihood
These keywords were added by machine and not by the authors.





Copyright information

© Springer-Verlag Berlin Heidelberg 2001

Authors and Affiliations

  • Vsevolod Makeev (1)
  • Vasily Ramensky (1)
  • Mikhail Gelfand (2)
  • Mikhail Roytberg (3)
  • Vladimir Tumanyan (1)

  1. Engelhardt Institute of Molecular Biology, Moscow, Russia
  2. VNIIGENETIKA, Moscow, Russia
  3. Institute of Mathematical Problems of Biology, Moscow Region, Russia
