Stochastic Modelling over a Finite Alphabet and Algorithms for Finding Genes from Genomes

  • Conference paper

Part of the book series: Lecture Notes in Control and Information Science (LNCIS, volume 329)

Abstract

In this paper, we study the problem of constructing models for a stationary stochastic process \(\{\mathcal{Y}_t\}\) assuming values in a finite set \(\mathcal{M} := \{1, \ldots, m\}\). It is assumed that only a finite-length sample path of the process is known, and not the full statistics of the process. Two kinds of problems are studied, namely modelling for prediction and modelling for classification. For the prediction problem, it is shown in a companion paper that a well-known approach of modelling the given process as a multi-step Markov process is in fact the only solution satisfying certain nonnegativity constraints. In the present paper, accuracy and confidence bounds are derived for the parameters of this multi-step Markov model. So far as the author is aware, such bounds have not been published previously. For the classification problem, it is assumed that two distinct sets of sample paths of two separate stochastic processes are available; call them \(\{u_1, \ldots, u_r\}\) and \(\{v_1, \ldots, v_s\}\). The objective here is to develop not one but two models, called \(\mathcal{C}\) and \(\mathcal{NC}\) respectively, such that the strings \(u_i\) have much larger likelihoods under the model \(\mathcal{C}\) than under the model \(\mathcal{NC}\), and the opposite is true for the strings \(v_j\). A new string \(w\) is then classified into the set \(\mathcal{C}\) or \(\mathcal{NC}\) according to whether its likelihood is larger under the model \(\mathcal{C}\) or the model \(\mathcal{NC}\). For the classification problem, we develop a new algorithm called the 4M (Mixed Memory Markov Model) algorithm, which is an improvement over variable length Markov models. We then apply the 4M algorithm to the problem of finding genes from the genome. The performance of the 4M algorithm is compared against that of the popular Glimmer algorithm. In most of the test cases studied, the 4M algorithm correctly classifies both coding and non-coding regions more than 90% of the time. Moreover, the accuracy of the 4M algorithm compares well with that of Glimmer. At the same time, the 4M algorithm is amenable to statistical analysis.
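
The two-model classification rule described in the abstract can be illustrated with a minimal sketch. The code below is not the 4M algorithm, whose details are not given here. It assumes a plain fixed-order (multi-step) Markov model fitted by frequency counts, with add-one smoothing as a further assumption. The function names, the toy training strings, and the order of 2 are all hypothetical choices made for illustration only.

```python
import math
from collections import defaultdict


def fit_markov(strings, order, alphabet, pseudocount=1.0):
    """Estimate smoothed transition probabilities of a fixed-order Markov model.

    Returns a function prob(context, symbol) giving the estimated probability
    of `symbol` following the length-`order` string `context`.
    The pseudocount smoothing is an assumption, not taken from the paper.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for s in strings:
        for t in range(order, len(s)):
            counts[s[t - order:t]][s[t]] += 1.0

    def prob(context, symbol):
        ctx = counts.get(context, {})  # unseen contexts fall back to uniform
        total = sum(ctx.values()) + pseudocount * len(alphabet)
        return (ctx.get(symbol, 0.0) + pseudocount) / total

    return prob


def log_likelihood(prob, s, order):
    """Log-likelihood of s, conditioned on its first `order` symbols."""
    return sum(math.log(prob(s[t - order:t], s[t]))
               for t in range(order, len(s)))


def classify(w, prob_c, prob_nc, order):
    """Assign w to whichever model gives it the larger likelihood."""
    ll_c = log_likelihood(prob_c, w, order)
    ll_nc = log_likelihood(prob_nc, w, order)
    return "C" if ll_c > ll_nc else "NC"


# Toy data (hypothetical): u-strings train model C, v-strings train model NC.
ALPHABET = "ACGT"
ORDER = 2
u_strings = ["ATGATGATGATG", "ATGATGATG"]
v_strings = ["CCGGCCGGCCGG", "GGCCGGCC"]

prob_c = fit_markov(u_strings, ORDER, ALPHABET)
prob_nc = fit_markov(v_strings, ORDER, ALPHABET)

label_1 = classify("ATGATG", prob_c, prob_nc, ORDER)
label_2 = classify("CCGGCC", prob_c, prob_nc, ORDER)
```

On this toy data, `label_1` is `"C"` and `label_2` is `"NC"`, since each test string repeats the pattern seen only in the corresponding training set. Real genomic data would need far longer sample paths and a principled choice of model order.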

Editor information

Bruce A. Francis, Malcolm C. Smith, Jan C. Willems

About this paper

Cite this paper

Vidyasagar, M. Stochastic Modelling over a Finite Alphabet and Algorithms for Finding Genes from Genomes. In: Francis, B.A., Smith, M.C., Willems, J.C. (eds) Control of Uncertain Systems: Modelling, Approximation, and Design. Lecture Notes in Control and Information Science, vol 329. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11664550_18

  • DOI: https://doi.org/10.1007/11664550_18

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-31754-8

  • Online ISBN: 978-3-540-31755-5

  • eBook Packages: Engineering, Engineering (R0)
