Abstract
Approximate similarity queries are a practical way to obtain good, yet suboptimal, results from large data sets without having to pay high execution costs. In this paper we analyze the problem of understanding how the strategy for searching through an index tree, also called scheduling policy, can influence costs. We consider quality-controlled similarity queries, in which the user sets a quality (distance) threshold \(\theta \) and the system halts as soon as it finds k objects in the data set at distance \(\le \theta \) from the query object. After providing experimental evidence that the scheduling policy might indeed have a high impact on the costs incurred, we characterize the policies’ behavior through an analytical cost model, in which a major role is played by parameterized local distance distributions. Such distributions are also the key to deriving new scheduling policies, which we show to be optimal in a simplified, yet relevant, scenario.
Notes
1. Here we assume, for simplicity, that indicator \(\varPsi \) is of discrete type. For continuous types, our results still apply when density functions are used in place of probabilities.
2. In [PC09] we introduced only the query-independent cost model for approximate queries.
References
Arya, S., Mount, D.M., et al.: An optimal algorithm for approximate nearest neighbor searching. JACM 45(6), 891–923 (1998)
Berchtold, S., Böhm, C., et al.: A cost model for nearest neighbor search in high-dimensional data space. In: Proceedings of PODS 1997, Tucson, AZ, pp. 78–86 (1997)
Bennett, K.P., Fayyad, U.M., Geiger, D.: Density-based indexing for approximate nearest-neighbor queries. In: Proceedings of KDD 1999, San Diego, CA, pp. 233–243 (1999)
Bustos, B., Navarro, G.: Probabilistic proximity searching algorithms based on compact partitions. JDA 2(1), 115–134 (2004)
Chaudhuri, S., Ding, B., Kandula, S.: Approximate query processing: no silver bullet. In: Proceedings of SIGMOD 2017, Chicago, IL (2017, to appear)
Chávez, E., Navarro, G., et al.: Proximity searching in metric spaces. ACM Comp. Sur. 33(3), 273–321 (2001)
Ciaccia, P., Patella, M.: PAC nearest neighbor queries: approximate and controlled search in high-dimensional and metric spaces. In: Proceedings of ICDE 2000, San Diego, CA, pp. 244–255 (2000)
Ciaccia, P., Patella, M.: Approximate and probabilistic methods. SIGSPATIAL Spec. 2(2), 16–19 (2010)
Ciaccia, P., Patella, M., Zezula, P.: M-tree: an efficient access method for similarity search in metric spaces. In: Proceedings of VLDB 1997, Athens, Greece, pp. 426–435 (1997)
Ciaccia, P., Patella, M., Zezula, P.: A cost model for similarity queries in metric spaces. In: Proceedings of PODS 1998, Seattle, WA, pp. 59–68 (1998)
Haas, P.J., Hellerstein, J.M.: Ripple joins for online aggregation. In: Proceedings of SIGMOD 1999, New York, NY, pp. 287–298 (1999)
Hellerstein, J.M., Naughton, J.F., Pfeffer, A.: Generalized search trees for database systems. In: Proceedings of VLDB 1995, Zurich, Switzerland, pp. 562–573 (1995)
Hjaltason, G.R., Samet, H.: Index-driven similarity search in metric spaces. ACM TODS 28(4), 517–580 (2003)
Patella, M., Ciaccia, P.: Approximate similarity search: a multi-faceted problem. JDA 7(1), 36–48 (2009)
Zezula, P., Amato, G., et al.: Similarity Search: The Metric Space Approach. Springer, Heidelberg (2006)
Zezula, P., Savino, P., et al.: Approximate similarity retrieval with M-trees. VLDBJ 7(4), 275–293 (1998)
Acknowledgments
The authors thank Dr. Alessandro Linari for helping with the experiments.
A Guaranteeing Quality of Results
In this Appendix, we show how the threshold \(\theta \) of a quality-controlled query can be chosen so as to provide probabilistic guarantees on the quality of results, extending the results presented in [CP00] to the case \(k>1\). The approximate result \(\widetilde{\mathcal {R}}\) of a query is a list of \(k\) objects that may, however, not be the \(k\) closest ones to the query \(q\). A possible way to define the quality of \(\widetilde{\mathcal {R}}\) is in terms of the relative error \(Err\) with respect to the exact result \(\mathcal {R}\). Let \(p^{i}_{X} \left( q \right) \) denote the \(i\)-th nearest neighbor (NN) of a query \(q\) in a set of objects \(X\), and \(\widetilde{p}^{i}_{X} \left( q \right) \) the \(i\)-th NN of \(q\) in \(\widetilde{\mathcal {R}}\). In the (simplest) case presented in [CP00], when \(k=1\), the error is computed as:
\[ Err = \frac{d\left( q, \widetilde{p}^{1}_{X} \left( q \right) \right) }{d\left( q, p^{1}_{X} \left( q \right) \right) } - 1 \tag{11} \]
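This can be extended to the case \(k> 1\) by using the error on the \(k\)-th nearest neighbor (see also [AM+98, ZS+98]):
\[ Err = \frac{d\left( q, \widetilde{p}^{k}_{X} \left( q \right) \right) }{d\left( q, p^{k}_{X} \left( q \right) \right) } - 1 \tag{12} \]
which reduces to Eq. 11 when \(k=1\).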
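The error on the \(k\)-th nearest neighbor can be illustrated with a minimal Python sketch. The function name `kth_nn_error`, the Euclidean distance, and the toy points are assumptions made only for this example; the sketch also assumes that the exact \(k\)-th NN lies at a strictly positive distance from the query.

```python
import math

def kth_nn_error(query, exact_result, approx_result, dist):
    """Relative error of an approximate k-NN result, measured on the k-th neighbor.

    Hypothetical helper: both results are lists of at least k objects,
    where k is taken to be the size of the approximate result.
    """
    k = len(approx_result)
    if len(exact_result) < k:
        raise ValueError("exact result must contain at least k objects")
    # Distance of the k-th closest object in each result list.
    d_exact = sorted(dist(query, p) for p in exact_result)[k - 1]
    d_approx = sorted(dist(query, p) for p in approx_result)[k - 1]
    if d_exact == 0.0:
        raise ValueError("error is undefined when the exact k-th NN is at distance 0")
    return d_approx / d_exact - 1.0

# Toy example in the plane with the Euclidean distance (an assumption).
q = (0.0, 0.0)
exact = [(1.0, 0.0), (0.0, 2.0)]    # true 2-NN: distances 1 and 2
approx = [(1.0, 0.0), (0.0, 3.0)]   # approximate 2-NN: distances 1 and 3
err = kth_nn_error(q, exact, approx, math.dist)
```

In this toy case the second neighbor is found at distance 3 instead of 2, so the relative error is 3/2 - 1 = 0.5; an exact result always yields an error of 0.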
The type of guarantee provided on the quality of results in [CP00] has the form: “with probability at least \(1 - \delta \) the error does not exceed \(\epsilon \)”, that is \(\Pr \left\{ \mathbf {Err} \le \epsilon \right\} \ge 1 - \delta \), where \(\epsilon \ge 0\) is an accuracy parameter, \(\delta \in [0,1)\) is a confidence parameter, and \(\mathbf {Err}\) is the random variable obtained from Eq. 12 applied to a random query \(\mathbf {q}\). In [CP00], this is computed by using \(G \left( x \right) \), i.e., the distance distribution of the 1-NN, which can be obtained from the distance distribution \(F \left( \cdot \right) \) as:
\[ G \left( x \right) = 1 - \left( 1 - F \left( x \right) \right) ^{n} \]
where \(n = |X|\) denotes the number of objects in the data set.
For \(k\ge 1\), this generalizes to \(G^{k} \left( x \right) \), i.e., the probability to find at least \(k\) objects at a distance \(\le x\), which can be computed as:
\[ G^{k} \left( x \right) = 1 - \sum _{i=0}^{k-1} \binom{n}{i} F \left( x \right) ^{i} \left( 1 - F \left( x \right) \right) ^{n-i} \]
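As a minimal sketch, the probability of finding at least \(k\) objects within distance \(x\) can be evaluated from the value \(F(x)\) with a binomial sum. The function name `g_k` and the parameter `n` (the number of objects in the data set) are assumptions introduced here, and the sketch relies on the usual simplification that the distances of the \(n\) objects from the query are independent and identically distributed according to \(F\).

```python
from math import comb

def g_k(f_x, n, k):
    """Probability that at least k of n objects lie within distance x of the query.

    f_x is the value F(x) of the distance distribution at x.
    Assumes i.i.d. distances, so the number of objects within x is Binomial(n, F(x)).
    """
    if not 0.0 <= f_x <= 1.0:
        raise ValueError("F(x) must be a probability")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Probability of the complementary event: fewer than k objects within x.
    fewer_than_k = sum(
        comb(n, i) * f_x**i * (1.0 - f_x) ** (n - i) for i in range(k)
    )
    return 1.0 - fewer_than_k

# For k = 1 the sum has a single term and the result is 1 - (1 - F(x))^n.
p_one = g_k(0.1, n=10, k=1)
p_two = g_k(0.5, n=2, k=2)   # both objects within x: F(x)^2
```

Since requiring more objects within the same distance can only lower the probability, `g_k` is non-increasing in `k` for fixed `f_x` and `n`.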
The probability that the result of an approximate \(k\)-NN query with threshold \(\theta \) is correct, i.e., that the approximate \(k\)-th NN, \(\widetilde{p}^{k}_{X} \left( q \right) \), is indeed the correct one, is given by \(G^{k} \left( d\left( q, \widetilde{p}^{k}_{X} \left( q \right) \right) \right) \). Since we want to bound the error with confidence \(1-\delta \), we obtain the following guarantee on the error:
\[ Err = \frac{d\left( q, \widetilde{p}^{k}_{X} \left( q \right) \right) }{d\left( q, p^{k}_{X} \left( q \right) \right) } - 1 \le \frac{\theta }{\sup \left\{ x \mid G^{k} \left( x \right) \le \delta \right\} } - 1 = \epsilon \]
because \(d\left( q, \widetilde{p}^{k}_{X} \left( q \right) \right) \le \theta \) holds for every result of a quality-controlled query, and the probability that \(d\left( q, p^{k}_{X} \left( q \right) \right) < \sup \left\{ x \mid G^{k} \left( x \right) \le \delta \right\} \) is at most \(\delta \). The (one-sided) confidence interval corresponding to \(1-\delta \) is therefore \([0,\epsilon ]\). Thus, the threshold \(\theta \) for a quality-controlled query can be obtained as \((1+\epsilon ) \sup \left\{ x \mid G^{k} \left( x \right) \le \delta \right\} \). Clearly, if \(G^{k} \left( \cdot \right) \) is invertible, then \(\sup \left\{ x \mid G^{k} \left( x \right) \le \delta \right\} = (G^{k})^{-1}(\delta )\).
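The choice of the threshold can be sketched numerically as follows. This is only an illustration under stated assumptions: `quality_threshold`, `g_k`, `uniform_F`, `x_max` and `tol` are hypothetical names, \(F\) is assumed to be non-decreasing (so that \(G^{k}\) is non-decreasing as well and bisection applies), object distances are assumed i.i.d., and the uniform distribution on \([0,1]\) is used purely as a toy example.

```python
from math import comb

def g_k(f_x, n, k):
    """P(at least k of n objects within x), given f_x = F(x), under i.i.d. distances."""
    fewer_than_k = sum(
        comb(n, i) * f_x**i * (1.0 - f_x) ** (n - i) for i in range(k)
    )
    return 1.0 - fewer_than_k

def quality_threshold(F, n, k, eps, delta, x_max, tol=1e-9):
    """Approximate theta = (1 + eps) * sup{x in [0, x_max] : G^k(x) <= delta}.

    F is a non-decreasing distance distribution (assumption); the sup is
    located by bisection, up to an absolute tolerance tol.
    """
    if g_k(F(0.0), n, k) > delta:
        return 0.0                      # no admissible x in [0, x_max]
    if g_k(F(x_max), n, k) <= delta:
        return (1.0 + eps) * x_max      # the whole search range is admissible
    lo, hi = 0.0, x_max                 # invariant: G^k(lo) <= delta < G^k(hi)
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if g_k(F(mid), n, k) <= delta:
            lo = mid
        else:
            hi = mid
    return (1.0 + eps) * lo

def uniform_F(x):
    """Toy distance distribution: distances uniform on [0, 1] (illustration only)."""
    return min(max(x, 0.0), 1.0)

theta_1 = quality_threshold(uniform_F, n=1000, k=1, eps=0.1, delta=0.01, x_max=1.0)
theta_5 = quality_threshold(uniform_F, n=1000, k=5, eps=0.1, delta=0.01, x_max=1.0)
```

For \(k=1\) and this toy \(F\), the sup has the closed form \(1 - (1-\delta )^{1/n}\), which the bisection reproduces up to `tol`. Asking for more neighbors with the same \(\epsilon \) and \(\delta \) yields a larger threshold, since \(G^{k}\) decreases with \(k\).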
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Ciaccia, P., Patella, M. (2017). The Power of Distance Distributions: Cost Models and Scheduling Policies for Quality-Controlled Similarity Queries. In: Beecks, C., Borutta, F., Kröger, P., Seidl, T. (eds) Similarity Search and Applications. SISAP 2017. Lecture Notes in Computer Science, vol 10609. Springer, Cham. https://doi.org/10.1007/978-3-319-68474-1_1
DOI: https://doi.org/10.1007/978-3-319-68474-1_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-68473-4
Online ISBN: 978-3-319-68474-1
eBook Packages: Computer Science, Computer Science (R0)