Counting Regular Expressions in Degenerated Sequences Through Lazy Markov Chain Embedding

  • Conference paper
Forging Connections between Computational Mathematics and Computational Geometry

Part of the book series: Springer Proceedings in Mathematics & Statistics (PROMS, volume 124)

Abstract

Next-Generation Sequencing (NGS) produces a huge number of reads, which are combined using multiple alignment techniques to produce sequences. During this process, many sequencing errors are corrected, but the resulting sequences nevertheless contain a marginal level of uncertainty in the form of ∼0.1% or less of degenerated positions (like the letter “N”, which corresponds to any nucleotide).

A previous work (Nuel, Pattern Recognition in Bioinformatics. Springer, New York, 2009) showed that these degenerated letters might lead to erroneous counts when performing pattern matching on these sequences. An algorithm based on Deterministic Finite Automata (DFA) and Markov Chain Embedding (MCE) was suggested to deal with this problem.
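To make the counting problem concrete, here is a minimal brute-force sketch (not the authors' algorithm) that enumerates every way of resolving the degenerated letters of a short sequence and tallies how many pattern occurrences each resolution contains. The function name `count_distribution`, the example sequence, and the choice to count occurrences by start position are assumptions made for illustration; the letter table follows the standard IUPAC nucleotide codes.

```python
from collections import Counter
from itertools import product
import re

# Standard IUPAC nucleotide codes: each letter maps to the bases it may stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def count_distribution(seq, pattern):
    """Return {occurrence count: number of resolutions} for a degenerated sequence.

    Hypothetical helper: occurrences are counted as the number of start
    positions at which `pattern` (a Python regular expression) matches.
    """
    # A lookahead lets overlapping occurrences be counted separately.
    regex = re.compile("(?=(?:%s))" % pattern)
    dist = Counter()
    # Enumerate every concrete sequence compatible with the degenerated one.
    for letters in product(*(IUPAC[c] for c in seq)):
        resolved = "".join(letters)
        dist[len(regex.findall(resolved))] += 1
    return dict(dist)

# The single "N" makes the count of "GAT" ambiguous: 1 or 2 occurrences.
print(count_distribution("GANGAT", "GAT"))
```

The enumeration grows exponentially with the number of degenerated positions, which is why an automaton-based approach is needed for real sequences.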

In this paper, we introduce a new version of this algorithm which uses Nondeterministic Finite Automata (NFA) rather than DFA to perform what we call “lazy MCE.” This new approach proves much faster than the previous one, and we illustrate its usefulness on two NGS datasets and a selection of regular expressions.
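The idea of working with an NFA and only building the states that are actually reached can be sketched as follows. This is an illustrative simplification, not the implementation from the paper or from countmotif: the pattern is restricted to a fixed-length string of IUPAC letters, each degenerated letter of the sequence is assumed to resolve uniformly and independently among its allowed bases, and the names `step` and `lazy_count_distribution` are hypothetical. The chain's state is the pair (set of active NFA states, occurrences so far), and such pairs are created lazily as the sequence is read.

```python
from collections import defaultdict
from fractions import Fraction

# Standard IUPAC nucleotide codes: each letter maps to the bases it may stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def step(active, letter, classes):
    """Advance the set of active NFA states by one concrete letter."""
    nxt = {0}  # state 0 loops on every letter: an occurrence may start anywhere
    for s in active:
        if s < len(classes) and letter in classes[s]:
            nxt.add(s + 1)
    return frozenset(nxt)

def lazy_count_distribution(seq, pattern):
    """Return {occurrence count: probability} under the uniform assumption."""
    classes = [IUPAC[c] for c in pattern]
    final = len(classes)  # reaching this state means an occurrence just ended
    # Only (state set, count) pairs that are actually reached are ever stored.
    dist = {(frozenset({0}), 0): Fraction(1)}
    for symbol in seq:
        choices = IUPAC[symbol]
        weight = Fraction(1, len(choices))  # assumed uniform resolution
        nxt_dist = defaultdict(Fraction)
        for (active, count), p in dist.items():
            for letter in choices:
                nxt = step(active, letter, classes)
                nxt_dist[(nxt, count + (final in nxt))] += p * weight
        dist = nxt_dist
    # Marginalise over the automaton state to keep only the count.
    result = defaultdict(Fraction)
    for (_, count), p in dist.items():
        result[count] += p
    return dict(result)

print(lazy_count_distribution("GANGAT", "GAT"))
```

Because the state sets are generated on demand, the full subset construction of a DFA is never carried out; only the subsets that the sequence can actually reach are visited.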

Software implementing this algorithm is available: countmotif, http://www.math-info.univ-paris5.fr/~delosvin/index.php?choix=4.

References

  1. Allauzen, C., Mohri, M.: A unified construction of the Glushkov, follow, and Antimirov automata. In: Královic, R., Urzyczyn, P. (eds.) Mathematical Foundations of Computer Science 2006. Lecture Notes in Computer Science, vol. 4162, pp. 110–121. Springer, Berlin/Heidelberg (2006)

  2. Gilles, A., Meglécz, E., Pech, N., Ferreira, S., Malausa, T., Martin, J.-F.: Accuracy and quality assessment of 454 GS-FLX Titanium pyrosequencing. BMC Genomics 12(1), 245 (2011)

  3. Hopcroft, J.E., Motwani, R., Ullman, J.D.: Introduction to Automata Theory, Languages, and Computation, 3rd edn. Addison-Wesley, Boston (2006)

  4. IUPAC: International Union of Pure and Applied Chemistry (2009). http://www.iupac.org

  5. Lladser, M.E.: Minimal Markov chain embeddings of pattern problems. In: Information Theory and Applications Workshop, pp. 251–255 (2007)

  6. Nuel, G.: Pattern Markov chains: optimal Markov chain embedding through deterministic finite automata. J. Appl. Probab. 45(1), 226–243 (2008)

  7. Nuel, G.: Counting patterns in degenerated sequences. In: Pattern Recognition in Bioinformatics, pp. 222–232. Springer, New York (2009)

  8. Zagordi, O., Klein, R., Däumer, M., Beerenwinkel, N.: Error correction of next-generation sequencing data and reliable estimation of HIV quasispecies. Nucleic Acids Res. 38(21), 7400–7409 (2010)

Acknowledgements

This work received the financial support of Sorbonne Paris-Cité in the context of the project “SA-Flex” (Structural Alphabet taking into account protein Flexibility).

Author information

Corresponding author

Correspondence to G. Nuel.

Copyright information

© 2016 Springer Science+Business Media New York

About this paper

Cite this paper

Nuel, G., Delos, V. (2016). Counting Regular Expressions in Degenerated Sequences Through Lazy Markov Chain Embedding. In: Chen, K., Ravindran, A. (eds) Forging Connections between Computational Mathematics and Computational Geometry. Springer Proceedings in Mathematics & Statistics, vol 124. Springer, Cham. https://doi.org/10.5176/2251-1911_CMCGS14.28_20
