The Peres-Shields Order Estimator for Fixed and Variable Length Markov Models with Applications to DNA Sequence Similarity

  • Daniel Dalevi
  • Devdatt Dubhashi
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3692)


Recently, Peres and Shields discovered a new method for estimating the order of a stationary fixed-order Markov chain [15]. They showed that the estimator is consistent by proving a threshold result. While this threshold is valid asymptotically, it is not very useful for DNA sequence analysis, where data sizes are moderate. In this paper we give a novel interpretation of the Peres-Shields estimator as a sharp transition phenomenon. This yields a precise and powerful estimator that quickly identifies the core dependencies in data. We show that it compares favorably to other estimators, especially in the presence of noise and/or variable dependencies. Motivated by this last point, we extend the Peres-Shields estimator to Variable Length Markov Chains. We give an application to the problem of detecting DNA sequence similarity using genomic signatures.

Abbreviations: Mk = Fixed order Markov model of order k, PST = Prediction suffix tree, MC = Markov chain, VLMC = Variable length Markov chain.
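
To make the order-estimation problem concrete, the following is a minimal sketch of one of the standard baseline estimators, the BIC order estimator (see [6] and [19]). It is not the Peres-Shields estimator and not the authors' code. The function names `log_likelihood` and `bic_order`, the default alphabet size of 4 (DNA), and the choice to score every candidate order on the same positions (skipping the first `max_order` symbols) are assumptions made for illustration.

```python
import math
from collections import Counter


def log_likelihood(seq, k, start):
    """Maximised log-likelihood of an order-k Markov model (Mk),
    evaluated on positions start .. len(seq) - 1 (requires start >= k)."""
    ctx_sym = Counter()  # counts of (context, next symbol)
    ctx = Counter()      # counts of context followed by any symbol
    for i in range(start, len(seq)):
        c = seq[i - k:i]  # the k symbols preceding position i ("" when k == 0)
        ctx_sym[(c, seq[i])] += 1
        ctx[c] += 1
    # Sum of N(c, a) * log(N(c, a) / N(c)) over observed pairs
    return sum(n * math.log(n / ctx[c]) for (c, _), n in ctx_sym.items())


def bic_order(seq, max_order, alphabet_size=4):
    """Return the order k in 0..max_order that minimises
    -log-likelihood + (free parameters / 2) * log(n)."""
    n = len(seq) - max_order  # number of scored transitions (assumed > 1)
    best_k, best_score = 0, float("inf")
    for k in range(max_order + 1):
        free_params = (alphabet_size ** k) * (alphabet_size - 1)
        score = -log_likelihood(seq, k, max_order) + 0.5 * free_params * math.log(n)
        if score < best_score:
            best_k, best_score = k, score
    return best_k
```

For example, `bic_order("ACGT" * 50, 3)` selects order 1, because each symbol in that periodic sequence is fully determined by its predecessor. The penalty grows exponentially in k, which is one reason penalised-likelihood estimators can behave poorly on moderate-length sequences.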


Keywords (added by machine, not by the authors): Markov Chain; Akaike Information Criterion; Bayesian Information Criterion; Sharp Transition; Lower Order Model




References

  1. Akaike, H.: A new look at the statistical model identification. IEEE Trans. Auto. Cont. 19, 716–723 (1974)
  2. Bejerano, G., Yona, G.: Variations on probabilistic suffix trees: statistical modeling and prediction of protein families. Bioinformatics 17(1), 23–43 (2001)
  3. Borodovsky, M., McIninch, J.: Recognition of genes in DNA sequence with ambiguities. Biosystems 30, 161–171 (1993)
  4. Bühlmann, P., Wyner, A.: Variable length Markov chains. Ann. Statist. 27(2), 480–513 (1999)
  5. Bühlmann, P., Wyner, A.: Model selection for variable length Markov chains and tuning the context algorithm. Annals of the Inst. of Stat. Math. 52(2), 287–315 (2000)
  6. Csiszár, I., Shields, P.: The Consistency of the BIC Markov Order Estimator. The Annals of Statistics 28(6), 1601–1619 (2000)
  7. Dalevi, D., Dubhashi, D.: Bayesian Classifiers for Detecting HGT Using Fixed and Variable Length Markov Chains (submitted)
  8. Durbin, R., Eddy, S., Krogh, A., Mitchison, G.: Biological Sequence Analysis. Cambridge University Press, Cambridge (2004)
  9. Ellrott, K., Yang, C., Sladek, M., Jiang, T.: Identifying transcription factor binding sites through Markov chain optimization. Bioinformatics 18(2), 100–109 (2002)
  10. Fan, T.-H., Tsai, C.: A Bayesian Method in Determining the Order of a Finite State Markov Chain. Comm. Statist. Theory and Methods 28(7), 1711–1730 (1999)
  11. Forsdyke, D.: Different Biological Species “Broadcast” Their DNAs at Different (G+C)% “Wavelengths”. J. Theor. Biol. 178, 405–417 (1996)
  12. Karlin, S., Burge, C.: Dinucleotide relative abundance extremes: a genomic signature. Trends Genet. 11(7), 283–290 (1995)
  13. Mächler, M., Bühlmann, P.: Variable Length Markov Chains: Methodology, Computing, and Software. J. Comp. Graph. Stat. 13(2), 435–455 (2004)
  14. McDiarmid, C.: Concentration. In: Habib, M., McDiarmid, C., Ramirez-Alfonsin, J., Reed, B. (eds.) Probabilistic Methods for Algorithmic Discrete Mathematics. Algorithms and Combinatorics, vol. 16, pp. 195–248. Springer, Berlin (1998)
  15. Peres, Y., Shields, P.: Two New Markov Order Estimators, to appear, see,
  16. Pride, D., Meinersmann, R., Wassenaar, T., Blaser, M.: Evolutionary implications of microbial genome tetranucleotide frequency biases. Genome Res. 13, 145–158 (2003)
  17. Ron, D., Singer, Y., Tishby, N.: The Power of Amnesia: Learning Probabilistic Automata with Variable Memory Length. Machine Learning 25(2-3), 117–149 (1996)
  18. Sandberg, R., Winberg, G., Branden, C.I., Kaske, A., Ernberg, I., Coster, J.: Capturing whole-genome characteristics in short sequences using a naïve Bayesian classifier. Genome Res. 11(8), 1404–1409 (2001)
  19. Schwarz, G.: Estimating the dimension of a model. Annals of Statistics 6, 461–464 (1978)
  20. Zhao, X., Huang, H., Speed, T.: Finding Short DNA motifs using Permuted Markov models. In: RECOMB, pp. 68–75 (2004)

Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Daniel Dalevi (1)
  • Devdatt Dubhashi (1)
  1. Department of Computing Science, Chalmers University, Göteborg, Sweden
