Locality-Sensitive Hashing Without False Negatives for \(l_p\)

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9797)

Abstract

In this paper, we show a construction of locality-sensitive hash functions without false negatives, i.e., functions that ensure a collision for every pair of points within a given radius R in d-dimensional space equipped with the \(l_p\) norm, where \(p \in [1,\infty ]\). Furthermore, we show how to use these hash functions to solve the c-approximate nearest neighbor search problem without false negatives. Namely, if there is a point at distance R, we will certainly report it, and points at distance greater than cR will not be reported, for \(c=\varOmega (\max \{\sqrt{d},d^{1-\frac{1}{p}}\})\). The constructed algorithms work:

  • with preprocessing time \(\mathcal {O}(n \log (n))\) and sublinear expected query time,

  • with preprocessing time \(\mathcal {O}(\mathrm {poly}(n))\) and expected query time \(\mathcal {O}(\log (n))\).

Our paper reports progress on answering the open problem presented by Pagh [8], who considered the nearest neighbor search without false negatives for the Hamming distance.

Notes

  1. \(||\cdot ||_p\) denotes the standard \(l_p\) norm for fixed p.

  2. However, one may try to obtain a tighter bound (e.g., \(c = d^{1/2}/\log (d)\)) or show that for every \(\epsilon > 0\), the approximation factor \(c=d^{1/2}-\epsilon \) does not work.

References

  1. Andoni, A., Indyk, P.: Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Commun. ACM 51(1), 117–122 (2008)

  2. Andoni, A., Razenshteyn, I.: Optimal data-dependent hashing for approximate near neighbors. In: Servedio, R.A., Rubinfeld, R. (eds.) Proceedings of the Forty-Seventh Annual ACM on Symposium on Theory of Computing, STOC 2015, Portland, OR, USA, 14–17 June 2015, pp. 793–801. ACM (2015)

  3. Bentley, J.L.: K-d trees for semidynamic point sets. In: Proceedings of the Sixth Annual Symposium on Computational Geometry, SCG 1990, pp. 187–197. ACM, New York (1990)

  4. Datar, M., Indyk, P.: Locality-sensitive hashing scheme based on p-stable distributions. In: Proceedings of the Twentieth Annual Symposium on Computational Geometry, SCG 2004, pp. 253–262. ACM Press (2004)

  5. Haagerup, U.: The best constants in the Khintchine inequality. Stud. Math. 70(3), 231–283 (1981)

  6. Hoeffding, W.: Probability inequalities for sums of bounded random variables. J. Am. Stat. Assoc. 58(301), 13–30 (1963)

  7. Indyk, P., Motwani, R.: Approximate nearest neighbors: towards removing the curse of dimensionality. In: Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, STOC 1998, pp. 604–613. ACM, New York (1998)

  8. Pagh, R.: Locality-sensitive hashing without false negatives. In: Krauthgamer, R. (ed.) Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2016, Arlington, VA, USA, 10–12 January 2016, pp. 1–9. SIAM (2016)

  9. Veraar, M.: On Khintchine inequalities with a weight. Proc. Am. Math. Soc. 138, 4119–4121 (2010)

  10. Williams, R.: A new algorithm for optimal 2-constraint satisfaction and its implications. Theor. Comput. Sci. 348(2), 357–365 (2005)

Acknowledgments

This work was supported by ERC PoC project PAAl-POC 680912 and FET project MULTIPLEX 317532. We would also like to thank Rafał Latała for meaningful discussions.

Author information

Corresponding author

Correspondence to Piotr Wygocki.

Appendices

A Proof of Observation 1

Proof

We will use the fact that for any \(x,y \in \mathbb {R}\) we have \(| \left\lfloor x \right\rfloor - \left\lfloor y \right\rfloor | \le 1 \Rightarrow |x-y| < 2\). Then the following implications hold:

$$\begin{aligned} | h_p(x) - h_p(y) | \le 1&\iff \bigg | \left\lfloor \frac{\left\langle x , v \right\rangle }{\rho _pr} \right\rfloor - \left\lfloor \frac{\left\langle y , v \right\rangle }{\rho _pr} \right\rfloor \bigg | \le 1 \\&\implies \Big | \frac{\left\langle x , v \right\rangle }{\rho _pr} - \frac{\left\langle y , v \right\rangle }{\rho _pr} \Big | < 2 \\&\iff | \left\langle x-y , v \right\rangle | < 2 \rho _pr. \end{aligned}$$

So, by the monotonicity of probability:

$$\begin{aligned} \mathrm {if} \; A \subset B \; \mathrm {then} \; \mathbb {P}\left[ A \right] \le \mathbb {P}\left[ B \right] , \end{aligned}$$

the inequality of the probabilities holds.    \(\square \)

B Proof of Observation 2

Proof

We will use the fact that for any \(x,y \in \mathbb {R}\) we have \(| x-y | < 1 \Rightarrow | \left\lfloor x \right\rfloor - \left\lfloor y \right\rfloor | \le 1\). Then:

$$\begin{aligned} \Big |\left\langle x-y , v \right\rangle \Big | < \rho _pr&\iff \Big | \frac{\left\langle x , v \right\rangle }{\rho _pr} - \frac{\left\langle y , v \right\rangle }{\rho _pr} \Big | < 1 \\&\implies \bigg | \left\lfloor \frac{\left\langle x , v \right\rangle }{\rho _pr} \right\rfloor - \left\lfloor \frac{\left\langle y , v \right\rangle }{\rho _pr} \right\rfloor \bigg | \le 1 \\&\iff | h_p(x) - h_p(y) | \le 1. \end{aligned}$$

   \(\square \)
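
The two floor-function implications used in Appendices A and B can be spot-checked numerically. The following is a minimal sketch, not the authors' code: it assumes that v is a random sign vector and that \(\rho _p = d^{1-\frac{1}{p}}\) (the value the equalities in Appendix C are consistent with); the function names are hypothetical.

```python
import math
import random

def dot(a, b):
    """Inner product <a, b> of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))

def check_floor_implications(d=8, p=2.0, r=1.0, trials=20_000, seed=0):
    """Count violations of the implications behind Observations 1 and 2."""
    rng = random.Random(seed)
    scale = d ** (1.0 - 1.0 / p) * r  # rho_p * r, ASSUMING rho_p = d^(1 - 1/p)
    violations_obs1 = 0
    violations_obs2 = 0
    for _ in range(trials):
        v = [rng.choice((-1.0, 1.0)) for _ in range(d)]  # ASSUMED sign vector
        x = [rng.uniform(-3.0, 3.0) for _ in range(d)]
        y = [xi + rng.uniform(-1.0, 1.0) for xi in x]
        a = dot(x, v) / scale  # <x, v> / (rho_p r)
        b = dot(y, v) / scale  # <y, v> / (rho_p r)
        hash_gap = abs(math.floor(a) - math.floor(b))  # |h_p(x) - h_p(y)|
        # Observation 1: hash gap <= 1 implies |<x - y, v>| < 2 rho_p r.
        # A tiny slack absorbs floating-point rounding in a - b.
        if hash_gap <= 1 and abs(a - b) >= 2.0 + 1e-12:
            violations_obs1 += 1
        # Observation 2: |<x - y, v>| < rho_p r implies hash gap <= 1.
        if abs(a - b) < 1.0 and hash_gap > 1:
            violations_obs2 += 1
    return violations_obs1, violations_obs2
```

A call such as `check_floor_implications()` returns the pair of violation counts; a random test of this kind can only fail to find a counterexample, it does not replace the proofs.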

C Proof of Observation 4

Proof

For every \(0 < b \le a\), every vector \(z \in \mathbb {R}^d\) satisfies the inequality:

$$\begin{aligned} ||z ||_a \le ||z ||_b \le d^{(\frac{1}{b} - \frac{1}{a})} ||z ||_a . \end{aligned} \tag{1}$$

For \(p>2\) we have \(\max \{ d^\frac{1}{2} , d^{1-\frac{1}{p}} \} = d^{1-\frac{1}{p}}\). Then, using inequality (1) with \(a=p\) and \(b=2\), together with \(\rho _p = d^{1-\frac{1}{p}}\), we have:

$$\begin{aligned} ||z ||_2 \ge ||z ||_p = \frac{\rho _p}{d^{1-\frac{1}{p}}} ||z ||_p = \frac{\rho _p}{\max \{d^\frac{1}{2}, d^{1 - \frac{1}{p} } \} } ||z ||_p \end{aligned}$$

For \(1 \le p \le 2\) we have \(\max \{ d^\frac{1}{2}, d^{1-\frac{1}{p}} \} = d^\frac{1}{2}\). Analogously, using inequality (1) with \(a = 2\) and \(b=p\):

$$\begin{aligned} ||z ||_p \le d^{\frac{1}{p} - \frac{1}{2}} ||z ||_2 = ||z ||_2 \frac{d^{\frac{1}{2}}}{\rho _p} \end{aligned}$$

Hence, dividing both sides by \(d^{\frac{1}{2}}/\rho _p\), we have:

$$\begin{aligned} ||z ||_p \frac{\rho _p}{\max \{ d^\frac{1}{2}, d^{1-\frac{1}{p}} \}} \le ||z ||_2 \end{aligned}$$

   \(\square \)
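
As a sanity check of inequality (1) and of the bound derived from it, the sketch below evaluates both on random vectors. It is an illustration only: it assumes \(\rho _p = d^{1-\frac{1}{p}}\), which is what the equalities in the proof require, and the helper names are hypothetical.

```python
import random

def lp_norm(z, p):
    """Standard l_p norm of z for finite p > 0."""
    return sum(abs(zi) ** p for zi in z) ** (1.0 / p)

def check_norm_bounds(d=16, trials=2_000, seed=1):
    """Return True if inequality (1) and the final bound hold on all samples."""
    rng = random.Random(seed)
    slack = 1.0 + 1e-9  # relative slack for floating-point rounding
    for _ in range(trials):
        z = [rng.gauss(0.0, 1.0) for _ in range(d)]
        b = rng.uniform(1.0, 4.0)
        a = rng.uniform(b, 6.0)  # guarantees 0 < b <= a
        norm_a, norm_b = lp_norm(z, a), lp_norm(z, b)
        # Inequality (1): ||z||_a <= ||z||_b <= d^(1/b - 1/a) ||z||_a
        if norm_a > norm_b * slack:
            return False
        if norm_b > d ** (1.0 / b - 1.0 / a) * norm_a * slack:
            return False
        # Final bound: ||z||_p * rho_p / max(d^(1/2), d^(1 - 1/p)) <= ||z||_2
        p = rng.uniform(1.0, 6.0)
        rho_p = d ** (1.0 - 1.0 / p)  # ASSUMPTION, see the lead-in
        factor = rho_p / max(d ** 0.5, d ** (1.0 - 1.0 / p))
        if lp_norm(z, p) * factor > lp_norm(z, 2.0) * slack:
            return False
    return True
```

The check covers both cases of the proof, since p is drawn from both sides of 2.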

D Hoeffding Bound

Here we give the technical details used in the proof in Sect. 3.1. We start with the Hoeffding inequality. Let \(X_1, \ldots , X_d\) be independent bounded random variables with \(a_i \le X_i \le b_i\), and let \(\overline{X} = \sum _{i=1}^{d}X_i / d\) be their mean. Theorem 2 of Hoeffding [6] states:

$$\begin{aligned} \mathbb {P}\left[ | \overline{X} - \mathbb {E}\left[ \overline{X} \right] |\ge t \right]&\le 2 \cdot \exp \left( -\frac{2d^2t^2}{\sum _{i=1}^d(b_i - a_i)^2} \right) . \end{aligned}$$

In our case, \(D_1, \ldots , D_d\) satisfy \(a_i = -1 \le D_i \le 1 = b_i\) and \(\mathbb {E}D_i=0\), so \(\sum _{i=1}^d(b_i - a_i)^2 = 4d\) and the Hoeffding inequality implies:

$$\begin{aligned} \mathbb {P}\left[ \left| \frac{\sum _{i=1}^{d} D_i}{d} \right| \ge t \right]&\le 2 \cdot \exp \left( -\frac{2d^2t^2}{\sum _{i=1}^d(b_i - a_i)^2} \right) = 2 \cdot \exp \left( -\frac{dt^2}{2} \right) . \end{aligned}$$

Taking \(t=d^{-1/2 +\epsilon }\) we get the claim:

$$\begin{aligned} \mathbb {P}\left[ \bigg |\frac{\sum _{i=1}^{d} D_i}{d} \bigg | \ge d^{-1/2 +\epsilon } \right]&\le 2 \cdot \exp \left( -\frac{d^{2 \epsilon }}{2} \right) . \end{aligned}$$
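
To see how loose or tight this tail bound is in practice, one can compare it with a simulated tail probability. The sketch below is illustrative only: the appendix states just that the \(D_i\) are bounded in \([-1, 1]\) with mean zero, so drawing them as independent uniform signs is an assumption, and the function names are hypothetical.

```python
import math
import random

def hoeffding_bound(d, eps):
    """Right-hand side 2 * exp(-d^(2*eps) / 2), i.e. the bound for t = d^(-1/2 + eps)."""
    return 2.0 * math.exp(-(d ** (2.0 * eps)) / 2.0)

def empirical_tail(d, eps, trials=5_000, seed=2):
    """Estimate P[|mean(D_1..D_d)| >= d^(-1/2 + eps)] by simulation."""
    rng = random.Random(seed)
    t = d ** (-0.5 + eps)
    hits = 0
    for _ in range(trials):
        # ASSUMPTION: D_i are independent uniform signs in {-1, +1}.
        mean = sum(rng.choice((-1, 1)) for _ in range(d)) / d
        if abs(mean) >= t:
            hits += 1
    return hits / trials
```

For example, `empirical_tail(100, 0.25)` can be compared against `hoeffding_bound(100, 0.25)`, which equals \(2e^{-5}\); the simulated frequency should stay below the bound.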

E Preprocessing Complexity Bounds for the Distributions Introduced in Lemma 1

By Lemma 1, we have: \({\textsf {p}_{\textsf {fp}}}_1 = 1-\frac{(1-\frac{\tau _1^2}{c^2})^2}{3}\), so:

$$\begin{aligned} \lim _{c \rightarrow \infty } {\gamma }=\lim _{c \rightarrow \infty } \frac{\ln {3}}{-\ln {{\textsf {p}_{\textsf {fp}}}_1}} = {\frac{\ln {3}}{\ln {1.5}}} \approx {2.71} . \end{aligned}$$

If we omit terms polynomial in d, the preprocessing time of the algorithm from Theorem 2 converges to \(\mathcal {O}(n^{3.71})\).
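
The limit above is easy to reproduce numerically. In the sketch below, `tau` stands for \(\tau _1\) from Lemma 1; its actual value is not restated in this appendix, so the value used in the usage note is a placeholder, and the function names are hypothetical. The limit does not depend on \(\tau _1\) as long as it stays fixed while c grows.

```python
import math

def false_positive_prob(c, tau):
    """p_fp1 = 1 - (1 - tau^2 / c^2)^2 / 3, as stated via Lemma 1."""
    return 1.0 - (1.0 - tau ** 2 / c ** 2) ** 2 / 3.0

def gamma(c, tau):
    """gamma = ln 3 / (-ln p_fp1) for approximation factor c."""
    return math.log(3.0) / -math.log(false_positive_prob(c, tau))

def gamma_limit():
    """Limit of gamma as c grows: ln 3 / ln 1.5."""
    return math.log(3.0) / math.log(1.5)
```

Evaluating `gamma(c, tau)` with a placeholder such as `tau = 1.0` for increasing c shows the values approaching `gamma_limit()`, which rounds to 2.71; the exponent 3.71 quoted above appears to correspond to \(1 + \gamma \) in this limit.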

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Pacuk, A., Sankowski, P., Wegrzycki, K., Wygocki, P. (2016). Locality-Sensitive Hashing Without False Negatives for \(l_p\). In: Dinh, T., Thai, M. (eds) Computing and Combinatorics. COCOON 2016. Lecture Notes in Computer Science, vol 9797. Springer, Cham. https://doi.org/10.1007/978-3-319-42634-1_9

  • DOI: https://doi.org/10.1007/978-3-319-42634-1_9

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-42633-4

  • Online ISBN: 978-3-319-42634-1

  • eBook Packages: Computer Science, Computer Science (R0)
