Gapped Local Similarity Search with Provable Guarantees

Narayanan, Manikandan; Karp, Richard M.

doi:10.1007/978-3-540-30219-3_7

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 3240)

Included in the following conference series:

International Workshop on Algorithms in Bioinformatics

Abstract

We present a program qhash, based on q-gram filtration and high-dimensional search, to find gapped local similarities between two sequences. Our approach differs from past q-gram-based approaches in two main aspects. Our filtration step uses algorithms for a sparse all-pairs problem, while past studies use suffix-tree-like structures and counters. Our program works in sequence-sequence mode, while most past ones (except QUASAR) work in pattern-database mode.

We leverage existing research in high-dimensional proximity search to discuss sparse all-pairs algorithms, and show them to be subquadratic under certain reasonable input assumptions. Our qhash program has provable sensitivity (even on worst-case inputs) and average-case performance guarantees. It is significantly faster than a fully sensitive dynamic-programming-based program for strong similarity search on long sequences.
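
As a rough illustration of what q-gram filtration means, the sketch below applies the classical q-gram lemma: two strings within edit distance e of each other must share at least max(|a|, |b|) - q + 1 - q*e q-grams, counted with multiplicity. This is not the qhash implementation. The function names are hypothetical, and the shared q-grams are counted by brute force for a single pair of windows rather than with the sparse all-pairs algorithms the abstract refers to.

```python
from collections import Counter


def qgram_profile(s, q):
    """Return the multiset of all length-q substrings of s."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def shared_qgrams(a, b, q):
    """Count q-grams common to a and b, with multiplicity."""
    pa, pb = qgram_profile(a, q), qgram_profile(b, q)
    # Counter intersection keeps the minimum count of each q-gram.
    return sum((pa & pb).values())


def passes_filter(a, b, q, max_errors):
    """Keep the pair (a, b) only if it could lie within max_errors edits.

    Assumption: the threshold is the classical q-gram lemma bound,
    not necessarily the bound used by qhash.
    """
    threshold = max(len(a), len(b)) - q + 1 - q * max_errors
    return shared_qgrams(a, b, q) >= threshold


if __name__ == "__main__":
    x = "ACGTACGTAC"
    y = "ACGTTCGTAC"  # x with one substitution
    print(shared_qgrams(x, y, 3))
    print(passes_filter(x, y, 3, max_errors=1))
```

A pair that fails the test cannot be within the error budget, so only the surviving pairs need to be checked by a slower exact method such as dynamic programming. A pair that passes may still be a false positive.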

Author information

Authors and Affiliations

Computer Science Division, University of California, Berkeley, CA, 94720, USA
Manikandan Narayanan & Richard M. Karp
International Computer Science Institute, Berkeley, CA, 94704, USA
Richard M. Karp

Editor information

Editors and Affiliations

Department of Informatics and Computational Biology Unit, HIB, University of Bergen, 5020, Bergen, Norway
Inge Jonassen
Department of Biology, Penn Center for Bioinformatics, Penn Genomics Institute, 415 S. University Ave., Philadelphia, PA 19104, USA
Junhyong Kim

About this paper

Cite this paper

Narayanan, M., Karp, R.M. (2004). Gapped Local Similarity Search with Provable Guarantees. In: Jonassen, I., Kim, J. (eds) Algorithms in Bioinformatics. WABI 2004. Lecture Notes in Computer Science, vol 3240. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30219-3_7

DOI: https://doi.org/10.1007/978-3-540-30219-3_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23018-2
Online ISBN: 978-3-540-30219-3
eBook Packages: Springer Book Archive
