Abstract
In modern database applications, the similarity of complex objects is examined by performing distance-based queries (e.g. nearest neighbour search) on high-dimensional data. Most multidimensional indexing methods fail to support these queries efficiently in arbitrary high-dimensional datasets, owing to the curse of dimensionality. Similarity join and K closest pairs queries are the most representative distance join queries, in which two high-dimensional datasets are combined. These queries are very expensive in terms of response time and I/O activity in high-dimensional spaces. On the other hand, the filtering-based approach, as applied by the VA-file, has turned out to be a very promising alternative for nearest neighbour search. In general, the filtering-based approach represents vectors as compact approximations; by scanning these approximations first, only a small fraction of the real vectors needs to be visited. Here, we elaborate on VA-files and develop VA-file based algorithms for answering similarity join and K closest pairs queries on high-dimensional data. We also compare the performance of VA-files and R*-trees (a structure that has proven robust) for answering these queries. The results of the comparison do not lead to a clear winner.
Supported by the ARCHIMEDES project 2.2.14, «Management of Moving Objects and the WWW», of the Technological Educational Institute of Thessaloniki (EPEAEK II), co-funded by the Greek Ministry of Education and Religious Affairs and the European Union, INDALOG TIC2002-03968 project «A Database Language Based on Functional Logic Programming» of the Spanish Ministry of Science and Technology under FEDER funds, and the framework of the Greek-Serbian bilateral protocol.
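The filter-then-refine idea described in the abstract can be illustrated with a minimal sketch. It is not the algorithm developed in the paper: it assumes vectors in [0, 1)^d, a uniform grid with a hypothetical `bits` parameter per dimension, and a plain nested scan over all pairs of approximations. The function names (`approximate`, `lower_bound`, `similarity_join`, `k_closest_pairs`) are illustrative only.

```python
import heapq
import math
from itertools import product


def approximate(vector, bits=3):
    """Map a vector in [0, 1)^d to grid-cell indices (a VA-style approximation)."""
    cells = 1 << bits
    return tuple(min(int(x * cells), cells - 1) for x in vector)


def lower_bound(cell_a, cell_b, bits=3):
    """Smallest possible Euclidean distance between points lying in the two cells."""
    width = 1.0 / (1 << bits)
    gaps = (max(abs(i - j) - 1, 0) * width for i, j in zip(cell_a, cell_b))
    return math.sqrt(sum(g * g for g in gaps))


def similarity_join(P, Q, eps, bits=3):
    """Return all index pairs (i, j) with dist(P[i], Q[j]) <= eps."""
    ap = [approximate(p, bits) for p in P]
    aq = [approximate(q, bits) for q in Q]
    result = []
    for i, j in product(range(len(P)), range(len(Q))):
        # Filter: discard the pair if even the cell-level bound exceeds eps.
        if lower_bound(ap[i], aq[j], bits) > eps:
            continue
        # Refine: only now touch the real vectors.
        if math.dist(P[i], Q[j]) <= eps:
            result.append((i, j))
    return result


def k_closest_pairs(P, Q, k, bits=3):
    """Return the k closest pairs as (distance, i, j) and the number of pairs refined."""
    if k < 1:
        return [], 0
    ap = [approximate(p, bits) for p in P]
    aq = [approximate(q, bits) for q in Q]
    # Filter: bound every pair using approximations only, cheapest bound first.
    candidates = sorted(
        (lower_bound(ap[i], aq[j], bits), i, j)
        for i, j in product(range(len(P)), range(len(Q)))
    )
    best = []  # max-heap (negated distances) holding at most k pairs
    refined = 0
    for lb, i, j in candidates:
        if len(best) == k and lb > -best[0][0]:
            break  # no remaining pair can beat the current k-th distance
        refined += 1
        d = math.dist(P[i], Q[j])
        if len(best) < k:
            heapq.heappush(best, (-d, i, j))
        elif d < -best[0][0]:
            heapq.heapreplace(best, (-d, i, j))
    return sorted((-nd, i, j) for nd, i, j in best), refined
```

Calling `k_closest_pairs(P, Q, 5)` on two lists of equal-length tuples returns the five closest pairs together with the count of pairs whose real vectors were actually compared. That count is the quantity a filtering-based method tries to keep small. A real VA-file would also store the approximations compactly on disk and avoid enumerating every pair, which this sketch does not attempt.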
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Corral, A., D’Ermiliis, A., Manolopoulos, Y., Vassilakopoulos, M. (2005). VA-Files vs. R*-Trees in Distance Join Queries. In: Eder, J., Haav, HM., Kalja, A., Penjam, J. (eds) Advances in Databases and Information Systems. ADBIS 2005. Lecture Notes in Computer Science, vol 3631. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11547686_12
DOI: https://doi.org/10.1007/11547686_12
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-28585-4
Online ISBN: 978-3-540-31895-8
eBook Packages: Computer Science, Computer Science (R0)