International Workshop on Combinatorial Algorithms

IWOCA 2014: Combinatorial Algorithms, pp. 364–375

Lossless Seeds for Searching Short Patterns with High Error Rates

  • Christophe Vroland
  • Mikaël Salson
  • Hélène Touzet
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8986)


We address the problem of approximate pattern matching under the Levenshtein distance: given a text T and a pattern P, find all locations in T that match P with at most k errors. For that purpose, we propose a filtration algorithm based on a novel type of seed that combines exact parts with parts containing a fixed number of errors. Experimental tests show that the method is particularly well suited to short patterns with a large number of errors.
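To make the problem statement concrete, the following minimal sketch reports every position in T at which an occurrence of P with at most k errors ends. It uses the classical dynamic-programming recurrence for approximate matching as a brute-force baseline; it is not the seed-based filtration proposed in the paper, whose details are not given in the abstract. The function name `approximate_occurrences` and the example strings are assumptions chosen for illustration.

```python
from typing import List


def approximate_occurrences(text: str, pattern: str, k: int) -> List[int]:
    """Return every end position j (1-based) such that some substring of
    `text` ending at j is within Levenshtein distance k of `pattern`.

    Illustrative baseline only (hypothetical helper, O(|T| * |P|) time);
    it is NOT the filtration method of the paper.
    """
    m = len(pattern)
    # column[i] = smallest edit distance between pattern[:i] and any
    # substring of the text ending at the current text position.
    column = list(range(m + 1))
    ends = []
    if column[m] <= k:
        ends.append(0)  # the empty substring already matches (only if m <= k)
    for j, c in enumerate(text, start=1):
        prev_diag = column[0]
        column[0] = 0  # an occurrence may start at any text position
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == c else 1
            new_value = min(
                prev_diag + cost,   # match or substitution
                column[i] + 1,      # extra character in the text
                column[i - 1] + 1,  # pattern character left unmatched
            )
            prev_diag = column[i]
            column[i] = new_value
        if column[m] <= k:
            ends.append(j)
    return ends


if __name__ == "__main__":
    print(approximate_occurrences("GATTACA", "ATTA", 0))  # [5]
    print(approximate_occurrences("GATTACA", "ATTA", 1))  # [4, 5, 6]
```

A filtration algorithm such as the one described in the abstract aims to avoid running this kind of verification over the whole text: seeds first select candidate regions, and only those regions are verified.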


Keywords (machine-generated, not supplied by the authors): High error rate · Levenshtein distance · Text index · Space consumption · Short pattern



Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Christophe Vroland (1, 2, 3)
  • Mikaël Salson (1, 2)
  • Hélène Touzet (1, 2)

  1. LIFL, UMR CNRS 8022, Université Lille 1, Villeneuve d’Ascq, France
  2. Inria Lille Nord-Europe, Villeneuve d’Ascq, France
  3. Laboratoire Génétique et Evolution des Populations Végétales, UMR CNRS 8198, Université Lille 1, Villeneuve d’Ascq, France
