Advertisement

High-dimensional similarity searches using query driven dynamic quantization and distributed indexing

  • Gheorghi GuzunEmail author
  • Guadalupe Canahuate
Article
  • 23 Downloads
Part of the following topical collections:
  1. Special Issue on Extending Database Technology

Abstract

The concept of similarity is used as the basis for many data exploration and data mining tasks. Nearest neighbor (NN) queries identify the most similar items, or in terms of distance the closest points to a query point. Similarity is traditionally characterized using a distance function between multi-dimensional feature vectors. However, when the data is high-dimensional, traditional distance functions fail to significantly distinguish between the closest and furthest points, as few dissimilar dimensions dominate the distance function. Localized similarity functions, i.e. functions that only consider dimensions close to the query, quantize each dimension independently and only compute similarity for the dimensions where the query and the points fall into the same bin. These quantizations are query-agnostic and there is potential to improve accuracy when a query-dependent quantization is used. In this work we propose a query dependent equi-depth (QED) on-the-fly quantization method to improve high-dimensional similarity searches. The quantization is done for each dimension at query time and localized scores are generated for the closest p fraction of the points while a constant penalty is applied for the rest of the points. QED not only improves the quality of the distance metric, but also improves query time performance by filtering out non relevant data. We propose a distributed indexing and query algorithm to efficiently compute QED. Our experimental results show improvements in classification accuracy as well as query performance up to one order of magnitude faster than Manhattan-based sequential scan NN queries over datasets with hundreds of dimensions. Furthermore, similarity searches with QED show linear or better scalability in relation to the number of dimensions, and the number of compute nodes.

Keywords

Similarity searches High-dimensional data Indexing Bit-vector Query aware quantization QED Query optimization Distributed and parallel algorithms 

Notes

Acknowledgements

This work was partially supported by the NIH National Cancer Institute/Big Data to Knowledge (BD2K) Program under Grant R01CA214825 and joint NSF/NIH Initiative on Quantitative Approaches to Biomedical Big Data (QuBDD) (R01) Grant R01CA225190.

References

  1. 1.
    Aggarwal, C.C., Yu, P.S.: The igrid index: reversing the dimensionality curse for similarity indexing in high dimensional space. In: Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, pp. 119–129 (2000)Google Scholar
  2. 2.
    Amato, G., Esuli, A., Falchi, F.: A comparison of pivot selection techniques for permutation-based indexing. Inf. Syst. 52, 176–188 (2015)CrossRefGoogle Scholar
  3. 3.
    Baldi, P., Sadowski, P., Whiteson, D.: Searching for exotic particles in high-energy physics with deep learning. Nat. Commun. 5, 4308 (2014)CrossRefGoogle Scholar
  4. 4.
    Barrios, J.M., Bustos, B., Skopal, T.: Analyzing and dynamically indexing the query set. Inf. Syst. 45, 37–47 (2014)CrossRefGoogle Scholar
  5. 5.
    Beyer, K., Goldstein, J., Ramakrishnan, R., Shaft, U.: When is “nearest neighbor” meaningful? In: International Conference on Database Theory, Springer, New York, pp. 217–235 (1999)Google Scholar
  6. 6.
    Blake, C., Merz, C.: Uci repository of machine learning databases. Department of Information and Computer Science, University of California, Irvine, CA. http://www.ics.uci.edu/~mlearn/mlrepository.html (1998)
  7. 7.
    Boiman, O., Shechtman, E., Irani, M.: In defense of nearest-neighbor based image classification. In: IEEE Conference on Computer Vision and Pattern Recognition, 2008. CVPR 2008, pp. 1–8 (2008)Google Scholar
  8. 8.
    Chambi, S., Lemire, D., Kaser, O., Godin, R.: Better bitmap performance with roaring bitmaps (2014). arXiv:1402.6407 Google Scholar
  9. 9.
    Chen, L., Gao, Y., Zheng, B., Jensen, C.S., Yang, H., Yang, K.: Pivot-based metric indexing. Proc. VLDB Endow. 10(10), 1058–1069 (2017)CrossRefGoogle Scholar
  10. 10.
    Donoho, D.L.: High-dimensional data analysis: the curses and blessings of dimensionality. AMS Math. Chall. Lect. 1, 32 (2000)Google Scholar
  11. 11.
    Ester, M., Kriegel, H.-P., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. KDD 96, 226–231 (1996)Google Scholar
  12. 12.
    Fayyad, U., Irani, K.: Multi-interval discretization of continuous-valued attributes for classification learning. In: Joint Conference on Artificial Intelligence, pp. 1022–1027 (1993)Google Scholar
  13. 13.
    Gao, J., Jagadish, H.V., Lu, W., Ooi, B.C.: Dsh: data sensitive hashing for high-dimensional k-nn search. In: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, ACM, pp. 1127–1138 (2014)Google Scholar
  14. 14.
    García, S., Luengo, J., Sáez, J.A., López, V., Herrera, F.: A survey of discretization techniques: taxonomy and empirical analysis in supervised learning. IEEE Trans. Knowl. Data Eng. 25(4), 734–750 (2013)CrossRefGoogle Scholar
  15. 15.
    Gionis, A., Indyk, P., Motwani, R.: Similarity search in high dimensions via hashing. In: Proceedings of the 25th International Conference on Very Large Data Bases, Morgan Kaufmann Publishers Inc., pp. 518–529 (1999)Google Scholar
  16. 16.
    Guttman, A.: R-trees: a dynamic index structure for spatial searching. In: Proceedings of the 1984 ACM SIGMOD International Conference on Management of Data (New York, NY, USA), SIGMOD ’84, ACM, pp. 47–57 (1984)Google Scholar
  17. 17.
    Guzun, G., Canahuate, G.: Hybrid query optimization for hard-to-compress bit-vectors. VLDB J 25, 339–354 (2015)CrossRefGoogle Scholar
  18. 18.
    Guzun, G., Canahuate, G.: Supporting dynamic quantization for high-dimensional data analytics. In: Proceedings of the ExploreDB’17 (New York, NY, USA), ExploreDB ’17, ACM, pp. 6:1–6:6 (2017)Google Scholar
  19. 19.
    Guzun, G., Canahuate, G.: Distributed query-aware quantization for high-dimensional similarity searches. In: Advances in Database Technology: Proceedings. International Conference on Extending Database Technology, vol. 2018, pp. 373–384 (2018)Google Scholar
  20. 20.
    Guzun, G., Canahuate, G., Chiu, D., Sawin, J.: A tunable compression framework for bitmap indices. In: IEEE 30th International Conference on Data Engineering (ICDE), IEEE, 2014, pp. 484–495 (2014)Google Scholar
  21. 21.
    Guzun, G., Tosado, J., Canahuate, G.: Slicing the dimensionality: top-k query processing for high-dimensional spaces. In: Transactions on Large-Scale Data-and Knowledge-Centered Systems XIV. Springer, New York, pp. 26–50 (2014)Google Scholar
  22. 22.
    Guzun, G., Canahuate, G., Chiu, D.: A two-phase Mapreduce algorithm for scalable preference queries over high-dimensional data. In: Proceedings of the 20th International Database Engineering & Applications Symposium, IDEAS (2016)Google Scholar
  23. 23.
    Guzun, G., McClurg, J.C., Canahuate, G., Mudumbai, R.: Power efficient big data analytics algorithms through low-level operations. In: 2016 IEEE International Conference on Big Data (Big Data), IEEE, pp. 355–361 (2016)Google Scholar
  24. 24.
    Hamming, R.W.: Error detecting and error correcting codes. Bell Syst. Tech/ J. 29(2), 147–160 (1950)MathSciNetCrossRefzbMATHGoogle Scholar
  25. 25.
    Huang, Q., Feng, J., Zhang, Y., Fang, Q., Ng, W.: Query-aware locality-sensitive hashing for approximate nearest neighbor search. Proc. VLDB Endow. 9(1), 1–12 (2015)CrossRefGoogle Scholar
  26. 26.
    Jegou, H., Douze, M., Schmid, C.: Product quantization for nearest neighbor search. IEEE Trans. Pattern Anal. Mach. Intell. 33(1), 117–128 (2011)CrossRefGoogle Scholar
  27. 27.
    Johnson, J., Douze, M., Jégou, H.: Billion-scale similarity search with gpus (2017). arXiv:1702.08734 Google Scholar
  28. 28.
    Kamel, I., Faloutsos, C.: Hilbert r-tree: an improved r-tree using fractals. In: Proceedings of the 20th International Conference on Very Large Data Bases (San Francisco, CA, USA), VLDB ’94, Morgan Kaufmann Publishers Inc., pp. 500–509 (1994)Google Scholar
  29. 29.
    Katayama, N., Satoh, S.: The sr-tree: an index structure for high-dimensional nearest neighbor queries. SIGMOD Rec. 26(2), 369–380 (1997)CrossRefGoogle Scholar
  30. 30.
    Kerber, R.: Chimerge: discretization of numeric attributes. In: Proceedings of the 10th National Conference on Artificial Intelligence. San Jose, CA, July 12–16, pp. 123–128 (1992)Google Scholar
  31. 31.
    Kurgan, L.A., Cios, K.J.: Caim discretization algorithm. IEEE Trans. Knowl. Data Eng. 16(2), 145–153 (2004)CrossRefGoogle Scholar
  32. 32.
    Lemire, D., Kaser, O., Gutarra, E.: Reordering rows for better compression: beyond the lexicographic order. ACM Trans. Datab. Syst. 37(3), 20:1–20:29 (2012)Google Scholar
  33. 33.
    Leskovec, J., Rajaraman, A., Ullman, J.D.: Mining of Massive Datasets. Cambridge University Press, Cambridge (2014)CrossRefGoogle Scholar
  34. 34.
    Liu, F., Lee, H.J.: Use of social network information to enhance collaborative filtering performance. Expert Syst. Appl. 37(7), 4772–4778 (2010)CrossRefGoogle Scholar
  35. 35.
    O’Neil, P., Quass, D.: Improved query performance with variant indexes. In: Proceedings of the 1997 ACM SIGMOD International Conference on Management of data, ACM Press, pp. 38–49 (1997)Google Scholar
  36. 36.
    Pareto, V.: Manual of Political Economy. Augustus M. Kelley Publishers, New York (1906)Google Scholar
  37. 37.
    Phung, S.L., Bouzerdoum, A., Chai, D.: Skin segmentation using color pixel classification: analysis and comparison. IEEE Trans. Pattern Anal. Mach. Intell. 27(1), 148–154 (2005)CrossRefGoogle Scholar
  38. 38.
    Rinfret, D.: Answering preference queries with bit-sliced index arithmetic. In: Proceedings of the 2008 C3S2E Conference, ACM, pp. 173–185 (2008)Google Scholar
  39. 39.
    Rinfret, D., O’Neil, P., O’Neil, E.: Bit-sliced index arithmetic. In: ACM SIGMOD Record, vol. 30, ACM, pp. 47–57 (2001)Google Scholar
  40. 40.
    Samet, H.: The Design and Analysis of Spatial Data Structures, vol. 199. Addison-Wesley, Reading, MA (1990)Google Scholar
  41. 41.
    Shy Goh, K., Li, B., Chang, E.: Dyndex: a dynamic and non-metric space indexer. In: In ACM Multimedia, pp. 466–475 (2002)Google Scholar
  42. 42.
    Tsai, C., Lee, C., Yang, W.: A discretization algorithm based on class-attribute contingency coefficient. Inf. Sci. 178(3), 714–731 (2008)CrossRefGoogle Scholar
  43. 43.
    Tung, A., K.H., Zhang, R., Koudas, N., Ooi, B.C.: Similarity search: a matching based approach. In: VLDB’2006: Proceedings of the 32nd International Conference on Very Large Data Bases, VLDB Endowment, pp. 631–642 (2006)Google Scholar
  44. 44.
    Weber, R.: Parallel va-file. In: Proceeding of ECDL, Springer, New York, pp. 83–92 (2000)Google Scholar
  45. 45.
    Weber, R., Blott, S.: An approximation based data structure for similarity search. Technical Report (1997)Google Scholar
  46. 46.
    Weber, R., Schek, H.-J., Blott, S.: A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In: Proceedings of the 24rd International Conference on Very Large Data Bases (San Francisco, CA, USA), VLDB ’98, Morgan Kaufmann Publishers Inc., pp. 194–205 (1998)Google Scholar
  47. 47.
    Wu, K., Otoo, E.J., Shoshani, A., Nordberg, H.: Notes on design and implementation of compressed bit vectors. Technical Report LBNL/PUB-3161, Lawrence Berkeley National Laboratory (2001)Google Scholar
  48. 48.
    Wu, K., Otoo, E.J., Shoshani, A.: Compressing bitmap indexes for faster search operations. In: Proceedings of the 2002 International Conference on Scientific and Statistical Database Management Conference (SSDBM’02), pp. 99–108 (2002)Google Scholar
  49. 49.
    Zipf, G.K.: Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology. Ravenio Books, Cambridge (2016)Google Scholar

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. 1.Department of Computer EngineeringSan Jose State UniversitySan JoseUSA
  2. 2.Department of Electrical and Computer EngineeringThe University of IowaIowaUSA

Personalised recommendations