
The Power of Distance Distributions: Cost Models and Scheduling Policies for Quality-Controlled Similarity Queries

  • Conference paper
Similarity Search and Applications (SISAP 2017)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10609)


Abstract

Approximate similarity queries are a practical way to obtain good, yet suboptimal, results from large data sets without having to pay high execution costs. In this paper we analyze the problem of understanding how the strategy for searching through an index tree, also called scheduling policy, can influence costs. We consider quality-controlled similarity queries, in which the user sets a quality (distance) threshold \(\theta \) and the system halts as soon as it finds k objects in the data set at distance \(\le \theta \) from the query object. After providing experimental evidence that the scheduling policy might indeed have a high impact on paid costs, we characterize the policies’ behavior through an analytical cost model, in which a major role is played by parameterized local distance distributions. Such distributions are also the key to derive new scheduling policies, which we show to be optimal in a simplified, yet relevant, scenario.


Notes

  1. Here we assume, for simplicity, that indicator \(\varPsi \) is of discrete type. For continuous types, our results still apply when density functions are used in place of probabilities.

  2. In [PC09] we introduced only the query-independent cost model for approximate queries.

References

  1. Arya, S., Mount, D.M., et al.: An optimal algorithm for approximate nearest neighbor searching. JACM 45(6), 891–923 (1998)


  2. Berchtold, S., Böhm, C., et al.: A cost model for nearest neighbor search in high-dimensional data space. In: Proceedings of PODS 1997, Tucson, AZ, pp. 78–86 (1997)


  3. Bennett, K.P., Fayyad, U.M., Geiger, D.: Density-based indexing for approximate nearest-neighbor queries. In: Proceedings of KDD 1999, San Diego, CA, pp. 233–243 (1999)


  4. Bustos, B., Navarro, G.: Probabilistic proximity searching algorithms based on compact partitions. JDA 2(1), 115–134 (2004)


  5. Chaudhuri, S., Ding, B., Kandula, S.: Approximate query processing: no silver bullet. In: Proceedings of SIGMOD 2017, Chicago, IL (2017, to appear)


  6. Chávez, E., Navarro, G., et al.: Proximity searching in metric spaces. ACM Comp. Sur. 33(3), 273–321 (2001)


  7. Ciaccia, P., Patella, M.: PAC nearest neighbor queries: approximate and controlled search in high-dimensional and metric spaces. In: Proceedings of ICDE 2000, San Diego, CA, pp. 244–255 (2000)


  8. Ciaccia, P., Patella, M.: Approximate and probabilistic methods. SIGSPATIAL Spec. 2(2), 16–19 (2010)


  9. Ciaccia, P., Patella, M., Zezula, P.: M-tree: an efficient access method for similarity search in metric spaces. In: Proceedings of VLDB 1997, Athens, Greece, pp. 426–435 (1997)


  10. Ciaccia, P., Patella, M., Zezula, P.: A cost model for similarity queries in metric spaces. In: Proceedings of PODS 1998, Seattle, WA, pp. 59–68 (1998)


  11. Haas, P.J., Hellerstein, J.M.: Ripple joins for online aggregation. In: Proceedings of SIGMOD 1999, New York, NY, pp. 287–298 (1999)


  12. Hellerstein, J.M., Naughton, J.F., Pfeffer, A.: Generalized search trees for database systems. In: Proceedings of VLDB 1995, Zurich, Switzerland, pp. 562–573 (1995)


  13. Hjaltason, G.R., Samet, H.: Index-driven similarity search in metric spaces. ACM TODS 28(4), 517–580 (2003)


  14. Patella, M., Ciaccia, P.: Approximate similarity search: a multi-faceted problem. JDA 7(1), 36–48 (2009)


  15. Zezula, P., Amato, G., et al.: Similarity Search: The Metric Space Approach. Springer, Heidelberg (2006)


  16. Zezula, P., Savino, P., et al.: Approximate similarity retrieval with M-trees. VLDBJ 7(4), 275–293 (1998)



Acknowledgments

The authors thank Dr. Alessandro Linari for helping with the experiments.

Author information


Corresponding author

Correspondence to Marco Patella.


A Guaranteeing Quality of Results


In this Appendix, we show how the threshold \(\theta \) of a quality-controlled query can be chosen so as to provide probabilistic guarantees on the quality of results, extending the results presented in [CP00] to the case \(k>1\). The approximate result \(\widetilde{\mathcal {R}}\) of a query is a list of \(k\) objects that may, however, not be the \(k\) closest ones to the query \(q\). A possible way to define the quality of \(\widetilde{\mathcal {R}}\) is in terms of the relative error \(Err\) with respect to the exact result \(\mathcal {R}\). Let us denote by \(p^{i}_{X} \left( q \right) \) the i-th nearest neighbor (NN) of a query \(q\) in a set of objects \(X\), and by \(\widetilde{p}^{i}_{X} \left( q \right) \) the i-th NN of \(q\) in \(\widetilde{\mathcal {R}}\). In the (simplest) case presented in [CP00], when \(k=1\), the error is computed as:

$$\begin{aligned} Err= \frac{d\left( q, \widetilde{p}^{1}_{X} \left( q \right) \right) }{d\left( q, p^{1}_{X} \left( q \right) \right) } - 1 \end{aligned}$$
(11)

This can be extended to the case \(k> 1\) by using the error on the \(k\)-th nearest neighbor (see also [AM+98, ZS+98]):

$$\begin{aligned} \boxed { Err\mathop {=}\limits ^{\text {def}}\frac{d\left( q, \widetilde{p}^{k}_{X} \left( q \right) \right) }{d\left( q, p^{k}_{X} \left( q \right) \right) } - 1 } \end{aligned}$$
(12)

which reduces to Eq. 11 when \(k=1\).
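As a small illustration of Eq. 12, the following sketch computes the relative error on the \(k\)-th nearest neighbor by brute force. The function name `kth_nn_relative_error`, the one-dimensional toy data, and the absolute-difference distance are assumptions made only for this example; they are not taken from the paper.

```python
def kth_nn_relative_error(query, data, approx_result, k, dist):
    """Relative error of Eq. 12: d(q, approx k-th NN) / d(q, exact k-th NN) - 1.

    `data` is the whole object set X, `approx_result` is the approximate
    result list (at least k objects), `dist` is the distance function d.
    """
    # Exact k-th NN distance, found by sorting all distances (brute force).
    exact_kth = sorted(dist(query, p) for p in data)[k - 1]
    # k-th smallest distance among the objects of the approximate result.
    approx_kth = sorted(dist(query, p) for p in approx_result)[k - 1]
    return approx_kth / exact_kth - 1.0


# Hypothetical toy example on the real line with d(a, b) = |a - b|.
def abs_dist(a, b):
    return abs(a - b)

data = [1.0, 2.0, 4.0, 8.0]
err_approx = kth_nn_relative_error(0.0, data, [1.0, 4.0], k=2, dist=abs_dist)
err_exact = kth_nn_relative_error(0.0, data, [1.0, 2.0], k=2, dist=abs_dist)
```

In this toy case the exact 2nd NN lies at distance 2 and the approximate one at distance 4, so the error is 1 (the approximate 2nd NN is twice as far); an exact result yields error 0.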

The type of guarantee provided on the quality of results in [CP00] has the form: “with probability at least \(1 - \delta \) the error does not exceed \(\epsilon \)”, that is \(\Pr \left\{ \mathbf {Err} \le \epsilon \right\} \ge 1 - \delta \), where \(\epsilon \ge 0\) is an accuracy parameter, \(\delta \in [0,1)\) is a confidence parameter, and \(\mathbf {Err}\) is the random variable obtained from Eq. 12 applied to a random query \(\mathbf {q}\). In [CP00], this is computed by using \(G_{}^{} \left( x \right) \), i.e., the distance distribution of the 1-NN, which can be obtained from the distance distribution \(F_{} \left( \cdot \right) \) as:

$$\begin{aligned} G_{}^{} \left( x \right) \mathop {=}\limits ^{\text {def}}\Pr \left\{ d\left( \mathbf {q}, p^{1}_{X} \left( \mathbf {q} \right) \right) \le x \right\} = 1 - \left( 1 - F_{} \left( x \right) \right) ^N\end{aligned}$$
(13)

For \(k\ge 1\), this generalizes to \(G_{}^{k} \left( x \right) \), i.e., the probability to find at least \(k\) objects at a distance \(\le x\), which can be computed as:

$$\begin{aligned} G_{}^{k} \left( x \right) \mathop {=}\limits ^{\text {def}}\Pr \left\{ d\left( \mathbf {q}, p^{k}_{X} \left( \mathbf {q} \right) \right) \le x \right\} = 1 - \sum _{j=0}^{k-1} \left( {\begin{array}{c}N\\ j\end{array}}\right) \cdot F_{} \left( x \right) ^j \cdot \left( 1 - F_{} \left( x \right) \right) ^{N- j} \end{aligned}$$
(14)
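Eq. 14 can be evaluated directly as one minus a binomial tail. The sketch below is a minimal, unoptimized transcription; the function name `knn_distance_cdf` is hypothetical, and the value \(F(x)\) is assumed to be supplied by the caller.

```python
from math import comb


def knn_distance_cdf(F_x, n, k):
    """G^k(x) of Eq. 14, given F_x = F(x), the number of objects n, and k.

    Returns the probability that at least k of the n objects lie
    within distance x of a random query.
    """
    # Probability that fewer than k objects fall within distance x.
    fewer_than_k = sum(
        comb(n, j) * F_x**j * (1.0 - F_x) ** (n - j) for j in range(k)
    )
    return 1.0 - fewer_than_k


# For k = 1 the sum has a single term and Eq. 13 is recovered.
g1 = knn_distance_cdf(0.5, 2, 1)  # 1 - (1 - 0.5)^2
g2 = knn_distance_cdf(0.5, 2, 2)  # both objects within distance x
```

For large \(N\) the direct sum may lose precision; a numerically careful implementation would likely rely on a library routine for the binomial distribution instead.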

The probability that the result of an approximate \(k\)-NN query \(q\) is correct, i.e., that the approximate k-th NN, \(\widetilde{p}^{k}_{X} \left( q \right) \), is indeed the correct one, is given by \(G_{}^{k} \left( d\left( q, \widetilde{p}^{k}_{X} \left( q \right) \right) \right) \). Since we want to bound the error with confidence \(1-\delta \), we obtain the following guarantee on the error:

$$\begin{aligned} Err\le \epsilon = \frac{d\left( q, \widetilde{p}^{k}_{X} \left( q \right) \right) }{\sup \left\{ x | G_{}^{k} \left( x \right) \le \delta \right\} } - 1 \end{aligned}$$

because the probability that \(d\left( q, p^{k}_{X} \left( q \right) \right) < \sup \left\{ x | G_{}^{k} \left( x \right) \le \delta \right\} \) is less than \(\delta \). The (one-sided) confidence interval corresponding to \(1-\delta \) is therefore \([0,\epsilon ]\). Thus, the threshold \(\theta \) for a quality-controlled query can be obtained as \((1+\epsilon ) \sup \left\{ x | G_{}^{k} \left( x \right) \le \delta \right\} \). Clearly, if \(G_{}^{k} \left( \cdot \right) \) is invertible, then \(\sup \left\{ x | G_{}^{k} \left( x \right) \le \delta \right\} = (G^{k})^{-1}(\delta )\).
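The threshold computation above can be sketched numerically. Since \(G^{k}\) is non-decreasing in \(x\), the supremum can be approximated by bisection. Everything named here is an assumption for illustration: the function names, the bound `x_max` on the distance domain, the requirement that \(F(0)=0\), and the uniform distance distribution on \([0,1]\) used in the example.

```python
from math import comb


def knn_distance_cdf(F_x, n, k):
    """G^k(x) of Eq. 14, given F_x = F(x)."""
    return 1.0 - sum(
        comb(n, j) * F_x**j * (1.0 - F_x) ** (n - j) for j in range(k)
    )


def quality_threshold(F, n, k, eps, delta, x_max, tol=1e-9):
    """Approximate theta = (1 + eps) * sup{x | G^k(x) <= delta} by bisection.

    Assumes F is a non-decreasing distance distribution with F(0) = 0,
    so that G^k(0) = 0 <= delta, and that distances are bounded by x_max.
    """
    lo, hi = 0.0, x_max
    if knn_distance_cdf(F(hi), n, k) <= delta:
        return (1.0 + eps) * hi  # the whole domain satisfies the bound
    # Invariant: G^k(lo) <= delta < G^k(hi).
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if knn_distance_cdf(F(mid), n, k) <= delta:
            lo = mid
        else:
            hi = mid
    return (1.0 + eps) * lo


# Hypothetical example: distances uniformly distributed on [0, 1].
def uniform_F(x):
    return min(max(x, 0.0), 1.0)

theta = quality_threshold(uniform_F, n=100, k=1, eps=0.5, delta=0.05, x_max=1.0)
```

With \(k=1\) and a uniform \(F\), \(G(x) = 1 - (1-x)^{N}\) is invertible, so the bisection result can be checked against the closed form \((1+\epsilon )\,(1 - (1-\delta )^{1/N})\).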


Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Ciaccia, P., Patella, M. (2017). The Power of Distance Distributions: Cost Models and Scheduling Policies for Quality-Controlled Similarity Queries. In: Beecks, C., Borutta, F., Kröger, P., Seidl, T. (eds) Similarity Search and Applications. SISAP 2017. Lecture Notes in Computer Science, vol 10609. Springer, Cham. https://doi.org/10.1007/978-3-319-68474-1_1


  • DOI: https://doi.org/10.1007/978-3-319-68474-1_1


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-68473-4

  • Online ISBN: 978-3-319-68474-1

  • eBook Packages: Computer Science, Computer Science (R0)
