Abstract
Biological sequences like DNA or proteins, are always obtained through a sequencing process which might produce some uncertainty. As a result, such sequences are usually written in a degenerated alphabet where some symbols may correspond to several possible letters (ex: IUPAC DNA alphabet). When counting patterns in such degenerated sequences, the question that naturally arises is: how to deal with degenerated positions ? Since most (usually 99%) of the positions are not degenerated, it is considered harmless to discard the degenerated positions in order to get an observation, but the exact consequences of such a practice are unclear. In this paper, we introduce a rigorous method to take into account the uncertainty of sequencing for biological sequences (DNA, Proteins). We first introduce a Forward-Backward approach to compute the marginal distribution of the constrained sequence and use it both to perform a Expectation-Maximization estimation of parameters, as well as deriving a heterogeneous Markov distribution for the constrained sequence. This distribution is hence used along with known DFA-based pattern approaches to obtain the exact distribution of the pattern count under the constraints. As an illustration, we consider a EST dataset from the EMBL database. Despite the fact that only 1% of the positions in this dataset are degenerated, we show that not taking into account these positions might lead to erroneous observations, further proving the interest of our approach.
Chapter PDF
Similar content being viewed by others
Keywords
References
IUPAC: International Union of Pure and Applied Chemistry (2009), http://www.iupac.org
EMBL: European Molecular Biology Laboratory Nucleotide Sequence Database (2009), http://www.ebi.ac.uk/embl/
Baum, L.E., Petrie, T., Soules, G., Weiss, N.: A maximization technique occurring in the statistical analysis of probabilistic functions of markov chains. Ann. Math. Statist. 41(1), 164–171 (1970)
Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Stat. Society. Series B 39(1), 1–38 (1977)
Nicodème, P., Salvy, B., Flajolet, P.: Motif statistics. Theoretical Com. Sci. 287(2), 593–617 (2002)
Crochemore, M., Stefanov, V.: Waiting time and complexity for matching patterns with automata. Info. Proc. Letters 87(3), 119–125 (2003)
Lladser, M.E.: Mininal markov chain embeddings of pattern problems. In: Information Theory and Applications Workshop, pp. 251–255 (2007)
Nuel, G.: Pattern markov chains: optimal markov chain embedding through deterministic finite automata. J. of Applied Prob. 45(1), 226–243 (2008)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Nuel, G. (2009). Counting Patterns in Degenerated Sequences. In: Kadirkamanathan, V., Sanguinetti, G., Girolami, M., Niranjan, M., Noirel, J. (eds) Pattern Recognition in Bioinformatics. PRIB 2009. Lecture Notes in Computer Science(), vol 5780. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-04031-3_20
Download citation
DOI: https://doi.org/10.1007/978-3-642-04031-3_20
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-04030-6
Online ISBN: 978-3-642-04031-3
eBook Packages: Computer ScienceComputer Science (R0)