Hardness of String Similarity Search and Other Indexing Problems

  • S. Cenk Sahinalp
  • Andrey Utis
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3142)

Abstract

Similarity search is a fundamental problem in computer science. Given a set of points A={A1,...,Ap} from a universe U and a distance measure D, one can pose similarity search queries on a point Q in the form of nearest neighbors (for example, find the string in A that has the smallest edit distance to a query string) or in the form of furthest neighbors (for example, find the string in A that has the longest common subsequence with a query string).
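The two query types above can be illustrated with a brute-force sketch. This is not the paper's method and makes no use of an index; it simply scans every string in A. The function names (`edit_distance`, `lcs_length`, `nearest_by_edit_distance`, `best_by_lcs`) are hypothetical and chosen only for this illustration.

```python
from typing import Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]


def lcs_length(a: str, b: str) -> int:
    """Length of a longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, 1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def nearest_by_edit_distance(points: Sequence[str], query: str) -> str:
    """Linear scan: the string in `points` with the smallest edit distance to `query`."""
    return min(points, key=lambda s: edit_distance(s, query))


def best_by_lcs(points: Sequence[str], query: str) -> str:
    """Linear scan: the string in `points` with the longest common subsequence with `query`."""
    return max(points, key=lambda s: lcs_length(s, query))
```

Each query costs one quadratic-time distance computation per stored string, which is the baseline that indexing schemes try to beat.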

Exact similarity search appears to be a very hard problem for most application domains; available solutions require either preprocessing time or space exponential in p, or query time exponential in |Q|. For such problems, approximate solutions have recently attracted considerable attention. Approximate nearest (furthest) neighbor search aims to find a point in A whose distance to the query point Q is within a small multiplicative factor of the distance between Q and its nearest (furthest) neighbor.
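The approximation guarantee described above can be written out as follows. The symbol c for the multiplicative factor (with c ≥ 1) is an assumption introduced here for illustration; the abstract does not name it.

```latex
% c-approximate nearest neighbor: return some A_i \in A such that
D(A_i, Q) \;\le\; c \cdot \min_{1 \le j \le p} D(A_j, Q)

% c-approximate furthest neighbor: return some A_i \in A such that
D(A_i, Q) \;\ge\; \frac{1}{c} \cdot \max_{1 \le j \le p} D(A_j, Q)
```

Setting c = 1 recovers the exact versions of both problems.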

In this paper, we study the hardness of several important similarity search problems for strings as well as other combinatorial objects, for which exact solutions have proven to be very difficult to achieve. We show here that even the approximate versions of these problems are quite hard; more specifically, they are as hard as exact similarity search in Hamming space. Thus available cell probe lower bounds for exact similarity search in Hamming space apply to approximate similarity search in string spaces (under Levenshtein edit distance and longest common subsequence) as well as other spaces.
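For reference, the target problem of the reductions, exact nearest neighbor search in Hamming space, can be stated as a short brute-force sketch. It only fixes the problem definition and says nothing about the reductions or the cell probe bounds themselves; the names `hamming_distance` and `exact_hamming_nearest` are hypothetical.

```python
from typing import Sequence


def hamming_distance(x: str, y: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance requires equal-length inputs")
    return sum(cx != cy for cx, cy in zip(x, y))


def exact_hamming_nearest(points: Sequence[str], query: str) -> str:
    """Linear scan for the point with the smallest Hamming distance to `query`."""
    return min(points, key=lambda p: hamming_distance(p, query))
```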

As a consequence of our reductions we also make observations about pairwise approximate distance computations. One such observation gives a simple linear time 2-approximation algorithm for permutation edit distance.

Copyright information

© Springer-Verlag Berlin Heidelberg 2004

Authors and Affiliations

  • S. Cenk Sahinalp (1)
  • Andrey Utis (2)
  1. School of Computing Science, Simon Fraser University, Canada
  2. Department of Computer Science, University of Maryland, College Park, USA
