Abstract
Nowadays, Next-Generation Sequencing (NGS) produces a huge number of reads, which are combined using multiple alignment techniques to produce sequences. During this process, many sequencing errors are corrected, but the resulting sequences nevertheless contain a marginal level of uncertainty in the form of ∼0.1 % or less of degenerated positions (such as the letter “N”, which stands for any nucleotide).
Previous work by Nuel (Pattern Recognition in Bioinformatics. Springer, New York, 2009) showed that these degenerated letters may lead to erroneous counts when performing pattern matching on these sequences. An algorithm based on Deterministic Finite Automata (DFA) and Markov Chain Embedding (MCE) was suggested to deal with this problem.
In this paper, we introduce a new version of this algorithm which uses Nondeterministic Finite Automata (NFA) rather than DFA to perform what we call “lazy MCE.” This new approach proves much faster than the previous one, and we illustrate its usefulness on two NGS datasets and a selection of regular expressions.
Software implementing this algorithm, countmotif, is available at http://www.math-info.univ-paris5.fr/~delosvin/index.php?choix=4.
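The idea of exploring only the automaton states actually reached by the sequence can be illustrated with a small Python sketch. This is not the countmotif implementation and not necessarily the authors' exact algorithm: it assumes a single motif written with IUPAC codes (not a general regular expression), counts overlapping occurrences, and assumes that each degenerated letter is uniformly distributed over its compatible nucleotides. The function names `step` and `count_distribution` are hypothetical.

```python
from collections import defaultdict

# IUPAC nucleotide codes and the nucleotides each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def step(active, letter, motif):
    """Advance the set of active NFA states by one nucleotide.

    `active` holds the lengths of the motif prefixes currently matched.
    State 0 is always added so that an occurrence may start anywhere.
    Returns the new state set and whether an occurrence just ended.
    """
    nxt, hit = set(), False
    for k in active | {0}:
        if letter in IUPAC[motif[k]]:
            if k + 1 == len(motif):
                hit = True
            else:
                nxt.add(k + 1)
    return frozenset(nxt), hit

def count_distribution(seq, motif):
    """Distribution of the number of motif occurrences in `seq`.

    Assumption (not from the source): a degenerated letter is uniform
    over its compatible nucleotides, independently at each position.
    Only the (state set, count) pairs actually reached are stored,
    which is the "lazy" part of this sketch.
    """
    dist = {(frozenset(), 0): 1.0}
    for x in seq:
        choices = IUPAC[x]
        weight = 1.0 / len(choices)
        new = defaultdict(float)
        for (active, n), p in dist.items():
            for a in choices:
                nxt, hit = step(active, a, motif)
                new[(nxt, n + int(hit))] += p * weight
        dist = new
    counts = defaultdict(float)
    for (_, n), p in dist.items():
        counts[n] += p
    return dict(counts)
```

For instance, `count_distribution("ANG", "AC")` gives probability 0.25 to one occurrence (the case where N is C) and 0.75 to none. Tracking sets of NFA states on the fly avoids building the full deterministic automaton in advance, which is the motivation the abstract gives for moving from DFA to NFA.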
References
Allauzen, C., Mohri, M.: A unified construction of the Glushkov, follow, and Antimirov automata. In: Královic, R., Urzyczyn, P. (eds.) Mathematical Foundations of Computer Science 2006. Lecture Notes in Computer Science, vol. 4162, pp. 110–121. Springer, Berlin/Heidelberg (2006)
Gilles, A., Meglécz, E., Pech, N., Ferreira, S., Malausa, T., Martin, J.-F.: Accuracy and quality assessment of 454 GS-FLX Titanium pyrosequencing. BMC Genomics 12(1), 245 (2011)
Hopcroft, J.E., Motwani, R., Ullman, J.D.: Introduction to Automata Theory, Languages, and Computation, 3rd edn. Addison-Wesley, Boston (2006)
IUPAC: International Union of Pure and Applied Chemistry (2009). http://www.iupac.org
Lladser, M.E.: Minimal Markov chain embeddings of pattern problems. In: Information Theory and Applications Workshop, pp. 251–255 (2007)
Nuel, G.: Pattern Markov chains: optimal Markov chain embedding through deterministic finite automata. J. Appl. Probab. 45(1), 226–243 (2008)
Nuel, G.: Counting patterns in degenerated sequences. In: Pattern Recognition in Bioinformatics, pp. 222–232. Springer, New York (2009)
Zagordi, O., Klein, R., Däumer, M., Beerenwinkel, N.: Error correction of next-generation sequencing data and reliable estimation of HIV quasispecies. Nucleic Acids Res. 38(21), 7400–7409 (2010)
Acknowledgements
This work received the financial support of Sorbonne Paris-Cité in the context of the project “SA-Flex” (Structural Alphabet taking into account protein Flexibility).
Copyright information
© 2016 Springer Science+Business Media New York
About this paper
Cite this paper
Nuel, G., Delos, V. (2016). Counting Regular Expressions in Degenerated Sequences Through Lazy Markov Chain Embedding. In: Chen, K., Ravindran, A. (eds) Forging Connections between Computational Mathematics and Computational Geometry. Springer Proceedings in Mathematics & Statistics, vol 124. Springer, Cham. https://doi.org/10.5176/2251-1911_CMCGS14.28_20
DOI: https://doi.org/10.5176/2251-1911_CMCGS14.28_20
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-16138-9
Online ISBN: 978-3-319-16139-6
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)