A Randomized Numerical Aligner (rNA)

  • Alberto Policriti
  • Alexandru I. Tomescu
  • Francesco Vezzi
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6031)

Abstract

With the advent of new sequencing technologies able to produce an enormous quantity of short genomic sequences, new tools have emerged that search for these sequences inside a reference genome sequence. Because of chemical reading errors and the variability between organisms, one is interested in finding not only exact occurrences, but also occurrences with up to k mismatches. The contribution of this paper is twofold. On the one hand, we present a generalization of the classical Rabin-Karp string matching algorithm to solve the k-mismatch problem, with average complexity \(\mathcal{O}(n+m)\). On the other hand, we show how to employ this idea in conjunction with an index over the text, which allows searching for a pattern, with up to k mismatches, in time proportional to its length. This novel tool, rNA (randomized Numerical Aligner), outperforms available tools like SOAP2, BWA, and BOWTIE, processing up to 10 times more patterns per second on texts of (practically) significant lengths.
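
To make the idea of a Rabin-Karp-style search with mismatches concrete, the following Python sketch shows one possible way to extend fingerprint matching to the k-mismatch setting. It is an illustration under stated assumptions, not the construction used in rNA: the base-4 encoding `CODE`, the modulus `Q`, and the helper names `fingerprint`, `difference_set`, `hamming`, and `k_mismatch_search` are all hypothetical choices made for this example. The sketch relies on the observation that, if a text window differs from the pattern in at most k positions, the difference of their fingerprints must belong to a set of values that can be precomputed from the pattern length alone. Windows that pass this filter are then verified by a direct Hamming-distance check.

```python
from itertools import combinations, product

# Assumed encoding of the DNA alphabet as base-4 digits (illustrative choice).
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
Q = 2**31 - 1  # assumed modulus; any sufficiently large prime would do


def fingerprint(s, q=Q):
    """Value of s read as a base-4 number, reduced modulo q."""
    h = 0
    for ch in s:
        h = (h * 4 + CODE[ch]) % q
    return h


def difference_set(m, k, q=Q):
    """All fingerprint differences that at most k substitutions can cause.

    A substitution at position p changes the fingerprint by d * 4^(m-1-p)
    with d in {-3, ..., 3} and d != 0. The set grows combinatorially with k,
    so this brute-force enumeration is only suitable for small m and k.
    """
    weights = [pow(4, m - 1 - i, q) for i in range(m)]
    deltas = [d for d in range(-3, 4) if d != 0]
    z = {0}
    for j in range(1, k + 1):
        for positions in combinations(range(m), j):
            for ds in product(deltas, repeat=j):
                z.add(sum(d * weights[p] for d, p in zip(ds, positions)) % q)
    return z


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def k_mismatch_search(text, pattern, k, q=Q):
    """Start positions where pattern occurs in text with at most k mismatches."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    z = difference_set(m, k, q)
    fp = fingerprint(pattern, q)
    fw = fingerprint(text[:m], q)
    top = pow(4, m - 1, q)  # weight of the leading digit of a window
    hits = []
    for i in range(n - m + 1):
        # Fingerprint filter first, then explicit verification of candidates.
        if (fw - fp) % q in z and hamming(text[i:i + m], pattern) <= k:
            hits.append(i)
        if i + m < n:
            # Rolling update: drop text[i], shift by one digit, append text[i+m].
            fw = ((fw - CODE[text[i]] * top) * 4 + CODE[text[i + m]]) % q
    return hits


print(k_mismatch_search("ACGTACGTTTGA", "ACGA", 1))
```

The filter never discards a true occurrence, because every window within k mismatches has a fingerprint difference in the precomputed set, and the verification step removes spurious candidates caused by collisions modulo `Q`. The sketch makes no claim about matching the \(\mathcal{O}(n+m)\) average complexity stated in the abstract, nor does it cover the index-based search.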

Keywords

String Match Reference Sequence Genome Processor Word Residue Number System Average Complexity 
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.


Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Alberto Policriti (1, 2)
  • Alexandru I. Tomescu (1, 3)
  • Francesco Vezzi (1, 2)
  1. Dipartimento di Matematica e Informatica, Università di Udine, Udine, Italy
  2. Istituto di Genomica Applicata (IGA), Udine, Italy
  3. Faculty of Mathematics and Computer Science, University of Bucharest, Bucharest, Romania
