The Power of Distance Distributions: Cost Models and Scheduling Policies for Quality-Controlled Similarity Queries

Ciaccia, Paolo; Patella, Marco

doi:10.1007/978-3-319-68474-1_1

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10609)

Included in the following conference series: International Conference on Similarity Search and Applications

Abstract

Approximate similarity queries are a practical way to obtain good, yet suboptimal, results from large data sets without paying high execution costs. In this paper we analyze how the strategy for searching through an index tree, also called the scheduling policy, can influence costs. We consider quality-controlled similarity queries, in which the user sets a quality (distance) threshold $\theta$ and the system halts as soon as it finds $k$ objects in the data set at distance $\le \theta$ from the query object. After providing experimental evidence that the scheduling policy can indeed have a high impact on costs, we characterize the policies’ behavior through an analytical cost model, in which a major role is played by parameterized local distance distributions. Such distributions are also the key to deriving new scheduling policies, which we show to be optimal in a simplified, yet relevant, scenario.

Notes

1. Here we assume, for simplicity, that indicator $\varPsi$ is of discrete type. For continuous types, our results still apply when density functions are used in place of probabilities.
2. In [PC09] we introduced only the query-independent cost model for approximate queries.

References

Arya, S., Mount, D.M., et al.: An optimal algorithm for approximate nearest neighbor searching. JACM 45(6), 891–923 (1998)
Berchtold, S., Böhm, C., et al.: A cost model for nearest neighbor search in high-dimensional data space. In: Proceedings of PODS 1997, Tucson, AZ, pp. 78–86 (1997)
Bennett, K.P., Fayyad, U.M., Geiger, D.: Density-based indexing for approximate nearest-neighbor queries. In: Proceedings of KDD 1999, San Diego, CA, pp. 233–243 (1999)
Bustos, B., Navarro, G.: Probabilistic proximity searching algorithms based on compact partitions. JDA 2(1), 115–134 (2004)
Chaudhuri, S., Ding, B., Kandula, S.: Approximate query processing: no silver bullet. In: Proceedings of SIGMOD 2017, Chicago, IL (2017, to appear)
Chávez, E., Navarro, G., et al.: Proximity searching in metric spaces. ACM Comp. Sur. 33(3), 273–321 (2001)
Ciaccia, P., Patella, M.: PAC nearest neighbor queries: approximate and controlled search in high-dimensional and metric spaces. In: Proceedings of ICDE 2000, San Diego, CA, pp. 244–255 (2000)
Ciaccia, P., Patella, M.: Approximate and probabilistic methods. SIGSPATIAL Spec. 2(2), 16–19 (2010)
Ciaccia, P., Patella, M., Zezula, P.: M-tree: an efficient access method for similarity search in metric spaces. In: Proceedings of VLDB 1997, Athens, Greece, pp. 426–435 (1997)
Ciaccia, P., Patella, M., Zezula, P.: A cost model for similarity queries in metric spaces. In: Proceedings of PODS 1998, Seattle, WA, pp. 59–68 (1998)
Haas, P.J., Hellerstein, J.M.: Ripple joins for online aggregation. In: Proceedings of SIGMOD 1999, New York, NY, pp. 287–298 (1999)
Hellerstein, J.M., Naughton, J.F., Pfeffer, A.: Generalized search trees for database systems. In: Proceedings of VLDB 1995, Zurich, Switzerland, pp. 562–573 (1995)
Hjaltason, G.R., Samet, H.: Index-driven similarity search in metric spaces. ACM TODS 28(4), 517–580 (2003)
Patella, M., Ciaccia, P.: Approximate similarity search: a multi-faceted problem. JDA 7(1), 36–48 (2009)
Zezula, P., Amato, G., et al.: Similarity Search: The Metric Space Approach. Springer, Heidelberg (2006)
Zezula, P., Savino, P., et al.: Approximate similarity retrieval with M-trees. VLDBJ 7(4), 275–293 (1998)

Acknowledgments

The authors thank Dr. Alessandro Linari for helping with the experiments.

Author information

Authors and Affiliations

DISI, University of Bologna, Bologna, Italy
Paolo Ciaccia & Marco Patella

Corresponding author

Correspondence to Marco Patella.

Editor information

Editors and Affiliations

Fraunhofer Institute for Applied Information Technology, Sankt Augustin, Germany
Christian Beecks
Ludwig-Maximilians-Universität München, Munich, Germany
Felix Borutta
Ludwig-Maximilians-Universität München, Munich, Germany
Peer Kröger
Ludwig-Maximilians-Universität München, Munich, Germany
Thomas Seidl

Appendix A: Guaranteeing Quality of Results

In this Appendix, we show how the threshold $\theta$ of a quality-controlled query can be chosen so as to provide probabilistic guarantees on the quality of results, extending the results presented in [CP00] to the case $k>1$. The approximate result $\widetilde{\mathcal{R}}$ of a query is a list of $k$ objects that may, however, not be the $k$ closest ones to the query $q$. A possible way to define the quality of $\widetilde{\mathcal{R}}$ is in terms of the relative error $Err$ with respect to the exact result $\mathcal{R}$. Let us denote the $i$-th NN of a query $q$ in a set of objects $X$ as $p^{i}_{X}(q)$, and the $i$-th NN of $q$ in $\widetilde{\mathcal{R}}$ as $\widetilde{p}^{i}_{X}(q)$. In the (simplest) case presented in [CP00], when $k=1$, the error is computed as:

$$Err = \frac{d\left(q, \widetilde{p}^{1}_{X}(q)\right)}{d\left(q, p^{1}_{X}(q)\right)} - 1 \tag{11}$$

This can be extended to the case $k> 1$ by using the error on the $k$-th nearest neighbor (see also [AM+98, ZS+98]):

$$\boxed{Err \mathop{=}\limits^{\text{def}} \frac{d\left(q, \widetilde{p}^{k}_{X}(q)\right)}{d\left(q, p^{k}_{X}(q)\right)} - 1} \tag{12}$$

which reduces to Eq. 11 when $k=1$.
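As an illustration of Eq. 12, the following minimal Python sketch computes the relative error on the $k$-th nearest neighbor from two lists of distances. The function name `kth_nn_relative_error` and the representation of results as plain lists of distances to $q$ are assumptions made for this example, not part of the original paper.

```python
def kth_nn_relative_error(approx_dists, exact_dists, k):
    """Relative error on the k-th nearest neighbor, as in Eq. 12.

    approx_dists: distances from q to the objects in the approximate result.
    exact_dists:  distances from q to the objects in the exact result.
    Both lists are hypothetical inputs and must hold at least k values.
    """
    if k < 1 or k > min(len(approx_dists), len(exact_dists)):
        raise ValueError("k must be between 1 and the size of both results")
    d_approx = sorted(approx_dists)[k - 1]  # distance of the approximate k-th NN
    d_exact = sorted(exact_dists)[k - 1]    # distance of the exact k-th NN
    if d_exact == 0:
        raise ValueError("relative error is undefined when the exact distance is 0")
    return d_approx / d_exact - 1.0


# Approximate 3rd NN at distance 3.0, exact 3rd NN at distance 2.0:
# 3.0 / 2.0 - 1 = 0.5
err = kth_nn_relative_error([1.0, 2.0, 3.0], [1.0, 2.0, 2.0], k=3)
```

With `k=1` the same function evaluates Eq. 11.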

The type of guarantee provided on the quality of results in [CP00] has the form: “with probability at least $1 - \delta$ the error does not exceed $\epsilon$”, that is, $\Pr\left\{\mathbf{Err} \le \epsilon\right\} \ge 1 - \delta$, where $\epsilon \ge 0$ is an accuracy parameter, $\delta \in [0,1)$ is a confidence parameter, and $\mathbf{Err}$ is the random variable obtained from Eq. 12 applied to a random query $\mathbf{q}$. In [CP00], this is computed by using $G(x)$, i.e., the distance distribution of the 1-NN, which can be obtained from the distance distribution $F(\cdot)$ as:

$$G(x) \mathop{=}\limits^{\text{def}} \Pr\left\{d\left(\mathbf{q}, p^{1}_{X}(\mathbf{q})\right) \le x\right\} = 1 - \left(1 - F(x)\right)^{N} \tag{13}$$
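The step behind Eq. 13 can be spelled out as follows. This sketch assumes that $N$ denotes the number of objects in $X$ and that their distances from $\mathbf{q}$ are independent and distributed according to $F$; neither assumption is stated explicitly in this appendix.

$$\begin{aligned} \Pr\left\{d\left(\mathbf{q}, p^{1}_{X}(\mathbf{q})\right) > x\right\} &= \Pr\left\{\text{no object of } X \text{ lies within distance } x \text{ of } \mathbf{q}\right\} = \left(1 - F(x)\right)^{N} \\ G(x) &= 1 - \Pr\left\{d\left(\mathbf{q}, p^{1}_{X}(\mathbf{q})\right) > x\right\} = 1 - \left(1 - F(x)\right)^{N} \end{aligned}$$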

For $k \ge 1$, this generalizes to $G^{k}(x)$, i.e., the probability to find at least $k$ objects at a distance $\le x$, which can be computed as:

$$G^{k}(x) \mathop{=}\limits^{\text{def}} \Pr\left\{d\left(\mathbf{q}, p^{k}_{X}(\mathbf{q})\right) \le x\right\} = 1 - \sum_{j=0}^{k-1} \binom{N}{j} \cdot F(x)^{j} \cdot \left(1 - F(x)\right)^{N-j} \tag{14}$$
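Eq. 14 is one minus the probability that fewer than $k$ of the $N$ objects fall within distance $x$. A minimal Python sketch of this sum is given below; the function name `knn_distance_cdf` and the choice of passing the value $F(x)$ directly are assumptions of this example. It is intended for small $k$, since the binomial coefficients grow quickly with $j$.

```python
from math import comb


def knn_distance_cdf(F_x, n, k):
    """G^k(x) as in Eq. 14, given F_x = F(x), n = N objects, and k.

    Returns the probability that at least k of the n objects lie
    within distance x of the query.
    """
    if not 0.0 <= F_x <= 1.0:
        raise ValueError("F(x) must be a probability")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Probability that exactly j objects are within distance x, for j < k.
    fewer_than_k = sum(
        comb(n, j) * F_x**j * (1.0 - F_x) ** (n - j) for j in range(k)
    )
    return 1.0 - fewer_than_k


# With n = 2 and F(x) = 0.5, both objects are within x with probability 0.25.
g2 = knn_distance_cdf(0.5, n=2, k=2)
# With k = 1 the sum has a single term and the result matches Eq. 13.
g1 = knn_distance_cdf(0.1, n=10, k=1)
```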

The probability that the result of an approximate $k$-NN query $q$ is correct, i.e., that the approximate $k$-th NN, $\widetilde{p}^{k}_{X}(q)$, is indeed the correct one, is given by $G^{k}\left(d\left(q, \widetilde{p}^{k}_{X}(q)\right)\right)$. Since we want to bound the error with confidence $1-\delta$, we obtain the following guarantee on the error:

$$Err \le \epsilon = \frac{d\left(q, \widetilde{p}^{k}_{X}(q)\right)}{\sup\left\{x \mid G^{k}(x) \le \delta\right\}} - 1$$

because the probability that $d\left(q, p^{k}_{X}(q)\right) < \sup\left\{x \mid G^{k}(x) \le \delta\right\}$ is less than $\delta$. The (one-sided) confidence interval corresponding to $1-\delta$ is therefore $[0,\epsilon]$. Thus, the threshold $\theta$ for a quality-controlled query can be obtained as $(1+\epsilon) \sup\left\{x \mid G^{k}(x) \le \delta\right\}$. Clearly, if $G^{k}(\cdot)$ is invertible, then $\sup\left\{x \mid G^{k}(x) \le \delta\right\} = (G^{k})^{-1}(\delta)$.

About this paper

Cite this paper

Ciaccia, P., Patella, M. (2017). The Power of Distance Distributions: Cost Models and Scheduling Policies for Quality-Controlled Similarity Queries. In: Beecks, C., Borutta, F., Kröger, P., Seidl, T. (eds) Similarity Search and Applications. SISAP 2017. Lecture Notes in Computer Science, vol 10609. Springer, Cham. https://doi.org/10.1007/978-3-319-68474-1_1

DOI: https://doi.org/10.1007/978-3-319-68474-1_1
Published: 28 September 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-68473-4
Online ISBN: 978-3-319-68474-1
eBook Packages: Computer Science, Computer Science (R0)
