ISIS: A New Approach for Efficient Similarity Search in Sparse Databases

  • Bin Cui
  • Jiakui Zhao
  • Gao Cong
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5982)

Abstract

High-dimensional sparse data is prevalent in many real-life applications. In this paper, we propose ISIS (Indexing Sparse databases using Inverted fileS), a novel index structure for accelerating similarity search in high-dimensional sparse databases. ISIS clusters a dataset and converts the original high-dimensional space into a new space in which each dimension represents a cluster; the key values in the new space are then used by inverted-file indexes. We also propose an extension of ISIS, named ISIS+, which partitions the data space into lower-dimensional subspaces and clusters the data within each subspace. An extensive experimental study demonstrates the superiority of our approaches in high-dimensional sparse databases.
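The abstract does not say how the key values are computed or how the inverted files are laid out, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' algorithm. It assumes that a point's key is its Euclidean distance to its nearest cluster centroid, that each cluster owns one posting list sorted by key, and that the clustering step has already produced the centroids (here they are hand-picked). The names `dist`, `build_index`, and `range_query` are hypothetical.

```python
import math
from bisect import bisect_left, bisect_right
from collections import defaultdict


def dist(a, b):
    """Euclidean distance between sparse vectors stored as {dimension: value}."""
    dims = set(a) | set(b)
    return math.sqrt(sum((a.get(d, 0.0) - b.get(d, 0.0)) ** 2 for d in dims))


def build_index(points, centroids):
    """Assign each point to its nearest centroid and record (key, point_id)
    in that cluster's posting list. ASSUMPTION: key = distance to the centroid."""
    postings = defaultdict(list)
    for pid, p in enumerate(points):
        dists = [dist(p, c) for c in centroids]
        cid = min(range(len(centroids)), key=dists.__getitem__)
        postings[cid].append((dists[cid], pid))
    for plist in postings.values():
        plist.sort()  # sorted by key so a key range can be found by binary search
    return postings


def range_query(q, radius, points, centroids, postings):
    """Return ids of all points within `radius` of q.
    By the triangle inequality, a qualifying point p in cluster c satisfies
    |dist(q, c) - dist(p, c)| <= radius, so only that key range is scanned."""
    result = []
    for cid, plist in postings.items():
        dq = dist(q, centroids[cid])
        lo = bisect_left(plist, (dq - radius, -1))
        hi = bisect_right(plist, (dq + radius, len(points)))
        for _key, pid in plist[lo:hi]:
            if dist(q, points[pid]) <= radius:  # verify each candidate exactly
                result.append(pid)
    return sorted(result)


# Toy sparse data: only the non-zero ("active") dimensions are stored.
points = [{0: 1.0}, {0: 1.2, 5: 0.1}, {7: 3.0}, {7: 3.1, 9: 0.2}]
centroids = [{0: 1.0}, {7: 3.0}]  # hand-picked stand-ins for a clustering step
postings = build_index(points, centroids)
hits = range_query({0: 1.1}, 0.5, points, centroids, postings)
```

Storing vectors as dictionaries keeps the distance computation proportional to the number of active dimensions rather than the full dimensionality, which is the property that makes sparse data a natural fit for inverted-file style indexing.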

Keywords

Active Dimension; Near Neighbor; Query Point; Subspace Cluster; Query Object
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Bin Cui (1)
  • Jiakui Zhao (2)
  • Gao Cong (3)
  1. Department of Computer Science & Key Laboratory of High Confidence Software Technologies (Ministry of Education), Peking University
  2. China Electric Power Research Institute, China
  3. Aalborg University, Denmark
