Counting Patterns in Degenerated Sequences

Nuel, Grégory

doi:10.1007/978-3-642-04031-3_20

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 5780)

Included in the following conference series:

IAPR International Conference on Pattern Recognition in Bioinformatics

Abstract

Biological sequences such as DNA or proteins are always obtained through a sequencing process that may introduce some uncertainty. As a result, such sequences are usually written in a degenerated alphabet in which some symbols may correspond to several possible letters (e.g., the IUPAC DNA alphabet). When counting patterns in such degenerated sequences, a question naturally arises: how should degenerated positions be handled? Since most positions (usually 99%) are not degenerated, it is generally considered harmless to discard the degenerated positions in order to get an observation, but the exact consequences of this practice are unclear. In this paper, we introduce a rigorous method to take into account the uncertainty of sequencing for biological sequences (DNA, proteins). We first introduce a Forward-Backward approach to compute the marginal distribution of the constrained sequence, and use it both to perform an Expectation-Maximization estimation of the parameters and to derive a heterogeneous Markov distribution for the constrained sequence. This distribution is then used along with known DFA-based pattern approaches to obtain the exact distribution of the pattern count under the constraints. As an illustration, we consider an EST dataset from the EMBL database. Although only 1% of the positions in this dataset are degenerated, we show that ignoring these positions can lead to erroneous observations, which further demonstrates the interest of our approach.
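The Forward-Backward step mentioned in the abstract can be illustrated with a minimal sketch. It is not the chapter's implementation: it assumes a homogeneous first-order Markov chain over {A, C, G, T} with strictly positive transition probabilities, treats each IUPAC symbol as a hard constraint on the set of allowed letters, and uses hypothetical names (`constraint_matrix`, `posterior_marginals`, `start`, `trans`) chosen only for illustration.

```python
import numpy as np

# IUPAC nucleotide codes mapped to the letters each symbol allows.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
LETTERS = "ACGT"


def constraint_matrix(seq):
    """Row i is a 0/1 indicator of the letters compatible with seq[i]."""
    return np.array(
        [[1.0 if b in IUPAC[s] else 0.0 for b in LETTERS] for s in seq]
    )


def posterior_marginals(seq, start, trans):
    """Return P(X_i = a | every X_j lies in the set allowed by seq[j]).

    Assumes (hypothetically) a first-order Markov chain with initial
    distribution `start` (length 4) and transition matrix `trans` (4x4,
    rows sum to 1, all entries positive).
    """
    e = constraint_matrix(seq)
    n = len(seq)
    fwd = np.zeros((n, 4))
    bwd = np.ones((n, 4))

    # Forward pass: propagate, apply the constraint, rescale for stability.
    fwd[0] = start * e[0]
    fwd[0] /= fwd[0].sum()
    for i in range(1, n):
        fwd[i] = (fwd[i - 1] @ trans) * e[i]
        fwd[i] /= fwd[i].sum()

    # Backward pass: probability of satisfying all later constraints.
    for i in range(n - 2, -1, -1):
        bwd[i] = trans @ (e[i + 1] * bwd[i + 1])
        bwd[i] /= bwd[i].sum()

    # Combine both passes and normalize each position.
    post = fwd * bwd
    return post / post.sum(axis=1, keepdims=True)


if __name__ == "__main__":
    start = np.full(4, 0.25)
    trans = np.full((4, 4), 0.25)
    trans[0] = [0.1, 0.1, 0.7, 0.1]  # illustrative: after A, G is favoured
    print(posterior_marginals("AR", start, trans))
```

In this toy run, the degenerated symbol R (A or G) following an A is resolved to G with probability 0.875 rather than being discarded. Summing such posterior quantities over positions is the kind of expected count an Expectation-Maximization step would use, under the assumptions stated above.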

References

IUPAC: International Union of Pure and Applied Chemistry (2009), http://www.iupac.org
EMBL: European Molecular Biology Laboratory Nucleotide Sequence Database (2009), http://www.ebi.ac.uk/embl/
Baum, L.E., Petrie, T., Soules, G., Weiss, N.: A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Ann. Math. Statist. 41(1), 164–171 (1970)
Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Stat. Society. Series B 39(1), 1–38 (1977)
Nicodème, P., Salvy, B., Flajolet, P.: Motif statistics. Theoretical Com. Sci. 287(2), 593–617 (2002)
Crochemore, M., Stefanov, V.: Waiting time and complexity for matching patterns with automata. Info. Proc. Letters 87(3), 119–125 (2003)
Lladser, M.E.: Minimal Markov chain embeddings of pattern problems. In: Information Theory and Applications Workshop, pp. 251–255 (2007)
Nuel, G.: Pattern Markov chains: optimal Markov chain embedding through deterministic finite automata. J. of Applied Prob. 45(1), 226–243 (2008)

Author information

Authors and Affiliations

MAP5, CNRS 8145, University Paris Descartes, 45 rue des Saint-Pères, F-75006, Paris, France
Grégory Nuel

Editor information

Editors and Affiliations

Department of Automatic Control and Systems Engineering, University of Sheffield, Mappin Street, S1 3JD, Sheffield, UK
Visakan Kadirkamanathan
Department of Computer Science and Department of Chemical and Process Engineering, University of Sheffield, Mappin Street, S1 3JD, Sheffield, UK
Guido Sanguinetti
University of Glasgow, Department of Computing Science, Sir Alwyn Williams Building, Lilybank Gardens, Glasgow, G12 8QQ, UK, and, University of Glasgow, Department of Statistics, 14 University Gardens, Glasgow, G12 8QQ, UK
Mark Girolami
School of Electronics and Computer Science, University of Southampton, SO17 1BJ, Southampton, UK
Mahesan Niranjan
Department of Chemical and Process Engineering, University of Sheffield, Mappin Street, S1 3JD, Sheffield, UK
Josselin Noirel

About this paper

Cite this paper

Nuel, G. (2009). Counting Patterns in Degenerated Sequences. In: Kadirkamanathan, V., Sanguinetti, G., Girolami, M., Niranjan, M., Noirel, J. (eds) Pattern Recognition in Bioinformatics. PRIB 2009. Lecture Notes in Computer Science, vol 5780. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-04031-3_20

DOI: https://doi.org/10.1007/978-3-642-04031-3_20
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-04030-6
Online ISBN: 978-3-642-04031-3
eBook Packages: Computer Science, Computer Science (R0)