Finding Data Broadness Via Generalized Nearest Neighbors

Venkateswaran, Jayendra; Kahveci, Tamer; Camoglu, Orhan

doi:10.1007/11687238_39

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3896)

Abstract

A data object is broad if it is one of the k-Nearest Neighbors (k-NN) of many data objects. We introduce a new database primitive called Generalized Nearest Neighbor (GNN) to express data broadness. We also develop three strategies to answer GNN queries efficiently for large datasets of multidimensional objects. The R*-Tree based search algorithm generates candidate pages and ranks them based on their distances. Our first algorithm, Fetch All (FA), fetches as many candidate pages as possible. Our second algorithm, Fetch One (FO), fetches one candidate page at a time. Our third algorithm, Fetch Dynamic (FD), dynamically decides on the number of pages that need to be fetched. We also propose three optimizations, Column Filter, Row Filter and Adaptive Filter, to eliminate pages from each dataset. Column Filter prunes the pages that are guaranteed to be non-broad. Row Filter prunes the pages whose removal does not change the broadness of any data point. Adaptive Filter prunes the search space dynamically along each dimension to eliminate unpromising objects. Our experiments show that FA is the fastest when the buffer size is large and FO is the fastest when the buffer size is small. FD is always either the fastest or very close to the faster of FA and FO. FD is significantly faster than the existing methods adapted to the GNN problem.
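To make the notion of broadness concrete, the following is a minimal brute-force sketch in Python. It is not the FA, FO, or FD algorithm and uses no R*-tree or page filtering; it only counts, for each point, how many query points list it among their k nearest neighbors. The function names `broadness` and `broad_points`, the two-dataset signature, the Euclidean metric, the index-based tie-breaking, and the `threshold` parameter are all assumptions made for illustration and do not come from the paper.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def broadness(queries, targets, k):
    """Count, for each target point, how many query points have it
    among their k nearest neighbors (brute force, O(|Q| * |T| log |T|)).

    Assumption: distances are Euclidean and ties are broken by index.
    """
    counts = [0] * len(targets)
    for q in queries:
        # Rank all target indices by distance to q.
        ranked = sorted(range(len(targets)),
                        key=lambda i: (dist(q, targets[i]), i))
        # Each of the k closest targets gains one unit of broadness.
        for i in ranked[:k]:
            counts[i] += 1
    return counts


def broad_points(queries, targets, k, threshold):
    """Return indices of targets whose broadness is at least `threshold`.

    `threshold` is a hypothetical parameter standing in for "many".
    """
    counts = broadness(queries, targets, k)
    return [i for i, c in enumerate(counts) if c >= threshold]


if __name__ == "__main__":
    targets = [(0, 0), (10, 10), (5, 5)]
    queries = [(1, 1), (0, 1), (9, 9), (4, 4)]
    print(broadness(queries, targets, k=1))
    print(broad_points(queries, targets, k=1, threshold=2))
```

A sketch like this scans every target for every query, which is exactly the cost that index-based candidate generation and page pruning are meant to avoid on large datasets.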

Author information

Authors and Affiliations

CISE Department, University of Florida, Gainesville, FL, 32611
Jayendra Venkateswaran & Tamer Kahveci
University of California, Santa Barbara, CA, 93106
Orhan Camoglu

Editor information

Editors and Affiliations

University of Athens, Greece
Yannis Ioannidis
University of Konstanz, P.O.Box D188, 78457, Konstanz, Germany
Marc H. Scholl
Sustainable Content Logistics Centre, Hamburg, Germany
Joachim W. Schmidt
Chair of Software Engineering for Business Information Systems, Technische Universität München, Boltzmannstraße 3, 85748, Garching b. München
Florian Matthes
Department of Informatics, University of Athens Panepistimiopolis, 15771, Athens, Greece
Mike Hatzopoulos
IPD, Universität Karlsruhe, Am Fasanengarten 5, 76131, Karlsruhe
Klemens Boehm
TU München, D-85748, Garching, Germany
Alfons Kemper
Technische Universität München, Germany
Torsten Grust
Institute for Computer Science, Ludwig-Maximilians Universität München
Christian Boehm

About this paper

Cite this paper

Venkateswaran, J., Kahveci, T., Camoglu, O. (2006). Finding Data Broadness Via Generalized Nearest Neighbors. In: Ioannidis, Y., et al. Advances in Database Technology - EDBT 2006. EDBT 2006. Lecture Notes in Computer Science, vol 3896. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11687238_39

DOI: https://doi.org/10.1007/11687238_39
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-32960-2
Online ISBN: 978-3-540-32961-9
eBook Packages: Computer Science, Computer Science (R0)
