Counting Patterns in Degenerated Sequences

  • Grégory Nuel
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5780)


Biological sequences such as DNA or proteins are always obtained through a sequencing process that may introduce some uncertainty. As a result, such sequences are usually written in a degenerated alphabet in which some symbols may correspond to several possible letters (e.g., the IUPAC DNA alphabet). When counting patterns in such degenerated sequences, a question naturally arises: how should degenerated positions be handled? Since most positions (usually 99%) are not degenerated, it is considered harmless to discard the degenerated positions in order to get an observation, but the exact consequences of such a practice are unclear. In this paper, we introduce a rigorous method to take into account the uncertainty of sequencing for biological sequences (DNA, proteins). We first introduce a Forward-Backward approach to compute the marginal distribution of the constrained sequence, and use it both to perform an Expectation-Maximization estimation of the parameters and to derive a heterogeneous Markov distribution for the constrained sequence. This distribution is then used along with known DFA-based pattern approaches to obtain the exact distribution of the pattern count under the constraints. As an illustration, we consider an EST dataset from the EMBL database. Although only 1% of the positions in this dataset are degenerated, we show that ignoring these positions can lead to erroneous observations, which further demonstrates the value of our approach.
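To make the Forward-Backward step more concrete, the following is a minimal sketch and not the paper's implementation. It assumes a homogeneous first-order Markov model with a strictly positive transition matrix, treats each IUPAC symbol as a set of allowed letters, and uses a backward recursion to build the position-dependent (heterogeneous) transitions of the constrained sequence. All names (`backward`, `constrained_chain`, `marginals`, `mu`, `pi`) are hypothetical, and no rescaling is done, so long sequences would underflow.

```python
# Illustrative sketch only: the function names and the toy model below are
# assumptions made for this example, not taken from the paper.

# IUPAC nucleotide codes mapped to the letters they allow.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
LETTERS = "ACGT"


def backward(seq, pi):
    """B[i][a] = P(positions i+1..n-1 satisfy their constraints | X_i = a)."""
    n = len(seq)
    B = [dict.fromkeys(LETTERS, 1.0) for _ in range(n)]
    for i in range(n - 2, -1, -1):
        for a in LETTERS:
            B[i][a] = sum(pi[a][b] * B[i + 1][b] for b in IUPAC[seq[i + 1]])
    return B


def constrained_chain(seq, mu, pi):
    """Start distribution and per-position transitions given the constraints.

    Assumes every entry of pi is > 0, so no backward quantity is zero.
    """
    B = backward(seq, pi)
    weights = {a: mu[a] * B[0][a] for a in IUPAC[seq[0]]}
    total = sum(weights.values())
    start = {a: w / total for a, w in weights.items()}
    trans = []
    for i in range(len(seq) - 1):
        step = {}
        for a in IUPAC[seq[i]]:
            # Transition from position i to i+1, restricted to allowed letters.
            step[a] = {b: pi[a][b] * B[i + 1][b] / B[i][a]
                       for b in IUPAC[seq[i + 1]]}
        trans.append(step)
    return start, trans


def marginals(start, trans):
    """Posterior distribution of each position, propagated forward."""
    result = [start]
    for step in trans:
        nxt = {}
        for a, p in result[-1].items():
            for b, q in step[a].items():
                nxt[b] = nxt.get(b, 0.0) + p * q
        result.append(nxt)
    return result


if __name__ == "__main__":
    mu = {a: 0.25 for a in LETTERS}
    pi = {a: {b: 0.25 for b in LETTERS} for a in LETTERS}
    start, trans = constrained_chain("ACRNT", mu, pi)
    for symbol, dist in zip("ACRNT", marginals(start, trans)):
        print(symbol, dist)
```

Each row of the resulting transitions sums to one by construction, because the normalizing term is exactly the backward quantity of the previous position. A pattern-counting automaton could then be run over this position-dependent chain instead of over a sequence with the degenerated positions discarded.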


Forward-Backward algorithm; Expectation-Maximization algorithm; Markov chain embedding; Deterministic Finite state Automaton



Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Grégory Nuel (1)
  1. MAP5, CNRS 8145, University Paris Descartes, Paris, France
