Abstract
A data object is broad if it is one of the k-Nearest Neighbors (k-NN) of many data objects. We introduce a new database primitive called Generalized Nearest Neighbor (GNN) to express data broadness. We also develop three strategies to answer GNN queries efficiently for large datasets of multidimensional objects. The R*-tree-based search algorithm generates candidate pages and ranks them by distance. Our first algorithm, Fetch All (FA), fetches as many candidate pages as possible. Our second algorithm, Fetch One (FO), fetches one candidate page at a time. Our third algorithm, Fetch Dynamic (FD), dynamically decides how many pages need to be fetched. We also propose three optimizations, Column Filter, Row Filter, and Adaptive Filter, to eliminate pages from each dataset. Column Filter prunes the pages that are guaranteed to be non-broad. Row Filter prunes the pages whose removal does not change the broadness of any data point. Adaptive Filter prunes the search space dynamically along each dimension to eliminate unpromising objects. Our experiments show that FA is the fastest when the buffer size is large and FO is the fastest when the buffer size is small. FD is always either the fastest or very close to the faster of FA and FO. FD is also significantly faster than the existing methods adapted to the GNN problem.
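The notion of broadness defined above can be illustrated with a naive, in-memory baseline. The sketch below is not the FA, FO, or FD algorithm and uses no R*-tree or filters; it only counts, for every point, how many other points list it among their k nearest neighbors. The function names, the use of a single dataset, Euclidean distance, the exclusion of a point from its own neighbor list, and the index-based tie-breaking are all assumptions made for illustration.

```python
import math
from typing import List, Sequence


def knn_indices(points: Sequence[Sequence[float]], query_idx: int, k: int) -> List[int]:
    """Return the indices of the k nearest neighbors of points[query_idx].

    Assumptions: Euclidean distance, the query point is excluded from its
    own neighbor list, and ties are broken by the smaller index.
    """
    q = points[query_idx]
    others = [i for i in range(len(points)) if i != query_idx]
    others.sort(key=lambda i: (math.dist(q, points[i]), i))
    return others[:k]


def broadness(points: Sequence[Sequence[float]], k: int) -> List[int]:
    """Count, for each point, how many other points have it in their k-NN set.

    This is a brute-force O(n^2 log n) baseline (hypothetical helper),
    intended only to make the definition of broadness concrete.
    """
    counts = [0] * len(points)
    for i in range(len(points)):
        for j in knn_indices(points, i, k):
            counts[j] += 1
    return counts


if __name__ == "__main__":
    data = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
    print(broadness(data, k=1))
```

Under these assumptions, a point with a high count is broad, and the counts always sum to n times k when the dataset has more than k points. The disk-based strategies in the abstract target the same quantity for datasets too large to process this way.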
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Venkateswaran, J., Kahveci, T., Camoglu, O. (2006). Finding Data Broadness Via Generalized Nearest Neighbors. In: Ioannidis, Y., et al. Advances in Database Technology - EDBT 2006. EDBT 2006. Lecture Notes in Computer Science, vol 3896. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11687238_39
DOI: https://doi.org/10.1007/11687238_39
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-32960-2
Online ISBN: 978-3-540-32961-9
eBook Packages: Computer Science, Computer Science (R0)