Journal of Intelligent Information Systems

Volume 42, Issue 3, pp 567–594

Token list based information search in a multi-dimensional massive database



Finding proximity information is crucial for massive database search. Locality Sensitive Hashing (LSH) is a method for finding nearest neighbors of a query point in a high-dimensional space. It classifies high-dimensional data according to data similarity. However, the “curse of dimensionality” makes LSH insufficiently effective in finding similar data and insufficiently efficient in terms of memory resources and search delays. The contribution of this work is threefold. First, we study a Token List based information Search scheme (TLS) as an alternative to LSH. TLS builds a token list table containing all the unique tokens from the database, and clusters data records having the same token together in one group. Querying is conducted in a small number of groups of relevant data records instead of searching the entire database. Second, in order to decrease the searching time of the token list, we further propose the Optimized Token list based Search schemes (OTS) based on index-tree and hash table structures. An index-tree structure orders the tokens in the token list and constructs an index table based on the tokens. Searching the token list starts from the entry of the token list supplied by the index table. A hash table structure assigns a hash ID to each token. A query token can be directly located in the token list according to its hash ID. Third, since a single-token based method leads to high overhead in the results refinement given a required similarity, we further investigate how a Multi-Token List Search scheme (MTLS) improves the performance of database proximity search. We conducted experiments on the LSH-based searching scheme, TLS, OTS, and MTLS using a massive customer data integration database. The comparative experimental results show that TLS is more efficient than an LSH-based searching scheme, and OTS improves the search efficiency of TLS. Further, MTLS performs better than TLS when the number of tokens is appropriately chosen, and a two-token adjacent token list achieves the shortest query delay in our testing dataset.
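
To make the token list idea concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes whitespace tokenization, uses a Python dict as the hash table that locates a token directly, and uses Jaccard similarity as a stand-in for the refinement step; the function names, the sample records, and the threshold are all hypothetical. Setting `n=1` corresponds to a single-token list and `n=2` to a two-token adjacent token list. The index-tree variant is not shown.

```python
from collections import defaultdict


def tokenize(record):
    """Split a record into lowercase tokens (whitespace tokenization is an assumption)."""
    return record.lower().split()


def adjacent_tuples(tokens, n):
    """Return the n-token adjacent tuples of a token sequence (n=1 gives single tokens)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_token_list(records, n=1):
    """Map each unique n-token tuple to the set of record IDs that contain it.

    A dict is a hash table, so a query key is located directly by its hash
    instead of by scanning the whole token list.
    """
    table = defaultdict(set)
    for record_id, record in enumerate(records):
        for key in adjacent_tuples(tokenize(record), n):
            table[key].add(record_id)
    return table


def jaccard(a, b):
    """Jaccard similarity of two token collections (hypothetical refinement metric)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def search(query, records, table, n=1, threshold=0.5):
    """Gather records sharing a key with the query, then refine by similarity.

    Returns the matching record IDs and the number of candidates refined.
    """
    query_tokens = tokenize(query)
    candidates = set()
    for key in adjacent_tuples(query_tokens, n):
        candidates |= table.get(key, set())
    results = [
        record_id
        for record_id in sorted(candidates)
        if jaccard(query_tokens, tokenize(records[record_id])) >= threshold
    ]
    return results, len(candidates)


# Hypothetical customer records, for illustration only.
records = [
    "ann smith 12 oak street",
    "ann smyth 12 oak street",
    "bob smith 99 pine road",
    "carol jones 12 elm avenue",
]
single_table = build_token_list(records, n=1)
pair_table = build_token_list(records, n=2)

query = "ann smith 12 oak street"
single_hits, single_candidates = search(query, records, single_table, n=1)
pair_hits, pair_candidates = search(query, records, pair_table, n=2)
print(single_hits, single_candidates)
print(pair_hits, pair_candidates)
```

On this toy data both tables return the same two records, but the two-token table refines fewer candidates than the single-token table, which illustrates why a multi-token list can reduce refinement overhead. Whether that holds on a real dataset depends on the token distribution and the chosen number of tokens.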


Similarity data search; Proximity search; Locality sensitive hash; Database



This research was supported in part by U.S. NSF grants IIS-1354123, CNS-1254006, CNS-1249603, OCI-1064230, CNS-1049947, CNS-0917056, and CNS-1025652, Microsoft Research Faculty Fellowship 8300751, and the United States Department of Defense 238866. Early versions of this work were presented in the Proceedings of DMIN’08 (Li et al. 2008) and ICCIT’08 (Shen et al. 2008). We would like to thank Mr. Yuhua Lin for his valuable comments in addressing the review feedback.


  1. Aberer, K., Cudré-Mauroux, P., Hauswirth, M. (2003). The chatty web: emergent semantics through gossiping. In Proceedings of the 12th international world wide web conference.
  2. Alimohammadi, D. (2003). Meta-tag: a means to control the process of web indexing. Online Information Review, 27(4), 238–242.
  3. Andoni, A. (2005). Lsh algorithm and implementation (e2lsh).
  4. Andoni, A., & Indyk, P. (2005). E2lsh 0.1 user manual.
  5. Arya, S., Mount, D.M., Netanyahu, N.S., Silverman, R., Wu, A. (1994). An optimal algorithm for approximate nearest neighbor searching. In Proceedings 5th ACM-SIAM symposium discrete algorithms.
  6. Bayer, R., & McCreight, E. (1970). Organization and maintenance of large ordered indices. In Proceedings of ACM-SIGFIDET workshop on data description and access (pp. 107–141).
  7. Beckmann, N., Kriegel, H., Schneider, R., Seeger, B. (1990). The r*-tree: an efficient and robust access method for points and rectangles. In Proceedings of the ACM SIGMOD international conference on management of data (pp. 322–331).
  8. Bennett, K.P., Fayyad, U., Geiger, D. (1999). Density-based indexing for approximate nearest-neighbor queries. In Proceedings of KDD.
  9. Bentley, J.L., Friedman, J.H., Finkel, R.A. (1977). An algorithm for finding best matches in logarithmic expected time. ACM Transactions on Mathematical Software, 3(3), 209–226.
  10. Berchtold, S., Keim, D.A., Kriegel, H.-P. (1996). The x-tree: an index structure for high-dimensional data. In Proceedings of the 22nd international conference on very large databases (pp. 28–39).
  11. Berrani, S.A., Amsaleg, L., Gros, P. (2003). Approximate searches: k-neighbors + precision. In Proceedings of CIKM.
  12. Berry, M.W., Drmac, Z., Jessup, E.R. (1999). Matrices, vector spaces, and information retrieval. SIAM Review, 41(2), 335–362.
  13. Blachman, N. (2007). Google guide, making searching even easier.
  14. Bohm, C., Berchtold, S., Keim, D.A. (2001). Searching in high-dimensional spaces: index structures for improving the performance of multimedia databases. ACM Computing Surveys, 33(3), 322–373.
  15. Brin, S. (1995). Near neighbor search in large metric space. In Proceedings of the 21st international conference on VLDB.
  16. Chaudhuri, S., Church, K., Konig, A., Sui, L. (2007). Heavy-tailed distributions and multi-keyword queries. In Proceedings of SIGIR.
  17. Chen, H., Jin, H., Wang, J., Chen, L., Liu, Y., Ni, L. (2008). Efficient multi-keyword search over p2p web. In Proceedings of WWW (pp. 989–998).
  18. Chen, H., Yan, J., Jin, H., Liu, Y., Ni, L. (2010). TSS: efficient term set search in large peer-to-peer textual collections. TC, 59(7), 969–980.
  19. Comer, D. (1979). The ubiquitous B-tree. Computing Surveys, 11(2), 121–138.
  20. Cover, T., & Hart, P. (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory, IT-13(1), 21–27.
  21. Datar, M., Immorlica, N., Indyk, P., Mirrokni, V.S. (2003). Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of DIMACS workshop on streaming data analysis and mining.
  22. Datar, M., Immorlica, N., Indyk, P., Mirrokni, V.S. (2004). Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of the 20th annual symposium on computational geometry (SCG).
  23. Deerwester, S., Dumais, S.T., Landauer, T.K., Furnas, G.W., Harshman, R.A. (1990). Indexing by latent semantic analysis. Journal of the Society for Information Science, 41(6), 391–407.
  24. Fagin, R. (1998). Fuzzy queries in multimedia database systems. In Proceedings ACM symposium on principles of database systems.
  25. Filho, R.F.S., Traina, A.J.M., Traina, J.C., Faloutsos, C. (2001). Similarity search without tears: the omni family of all-purpose access methods. In Proceedings of ICDE.
  26. Fu, A., Chan, P.M.S., Cheung, Y.L., Moon, Y.S. (2000). Dynamic vp-tree indexing for n-nearest neighbor search given pair-wise distances. VLDB Journal, 9(2), 154–173.
  27. Gionis, A., Indyk, P., Motwani, R. (1999). Similarity search in high dimensions via hashing. In Proceedings of international conference on very large data bases (VLDB) (pp. 518–529).
  28. Grossman, D.A., & Frieder, O. (2004). iFlow: information retrieval. The Netherlands: Springer.
  29. Guttman, A. (1984). R-trees: a dynamic index structure for spatial searching. In Proceedings of the SIGMOD conference (pp. 47–57).
  30. Halevy, A.Y., Ives, Z.G., Mork, P., Tatarinov, I. (2003). Piazza: data management infrastructure for semantic web applications. In Proceedings of the 12th international world wide web conference.
  31. Hu, J.J., Tang, C.J., Peng, J., Li, C., Yuan, C.A., Chen, A.L. (2005). A clustering algorithm based absorbing nearest neighbors. In 6th International conference of WAIM.
  32. Indyk, P., & Motwani, R. (1998). Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the 30th annual ACM symposium on theory of computing.
  33. Kleinberg, J.M. (1997). Two algorithms for nearest-neighbor search in high dimensions. In Proceedings of ACM symposium on theory of computing (STOC).
  34. Kruskal, J.B., & Wish, M. (1978). Multidimensional scaling. Beverly Hills: SAGE publication.
  35. Kulkarni, S., & Orlandic, R. (2006). High-dimensional similarity search using data sensitive space partitioning. Lecture Notes in Computer Science (LNCS), 4080(2006), 738–750.
  36. Lam, H., Perego, R., Quan, N., Silvestri, F. (2009). Entry pairing in inverted file. In Proceedings of WISE (Vol. 5802, pp. 511–522).
  37. Li, C., Chang, E., Garcia-Molina, H., Wiederhold, G. (2002). Clustering for approximate similarity search in high-dimensional spaces. IEEE Transactions on Knowledge and Data Engineering, 14(4), 792–808.
  38. Li, T., Shen, H., Rosequist, A. (2008). Token list based data searching in a multi-dimensional massive database. In Proceedings of The 4th international conference on data mining (DMIN).
  39. Loccoz, N.M. (2005). High-dimensional access methods for efficient similarity queries. Technical Report TR-2005-05-05, Universite De GENEVE.
  40. Long, X., & Suel, T. (2005). Three-level caching for efficient query processing in large Web search engines. In Proceedings of WWW (pp. 257–266).
  41. Luu, T., Skobeltsyn, G., Klemm, F., Puh, M., Zarko, I., Rajman, M., Aberer, K. (2008). AlvisP2P: scalable peer-to-peer text retrieval in a structured P2P network. PVLDB, 1(2), 1424–1427.
  42. Nejdl, W., Siberski, W., Wolpers, M., Schmitz, C. (2003). Routing and clustering in schema-based super peer networks. In Proceedings of IPTPS.
  43. Nejdl, W., Wolpers, M., Siberski, W., Löser, A., Bruckhorst, I., Schlosser, M., Schmitz, C. (2003). Super-peer-based routing and clustering strategies for rdf-based peer-to-peer networks. In Proceedings of the 12th international world wide web conference.
  44. Niblack, C.W., Barber, R., Equitz, W., Flickner, M.D., Glasman, E.H., Petkovic, D., Yanker, P., Faloutsos, C., Taubin, G. (1993). The QBIC project: querying images by content using color, texture and shape. In Proceedings of SPIE: storage and retrieval for image and video database.
  45. Panigrahy, R. (2006). Nearest neighbor search using kd-trees. Technical report, Stanford University.
  46. Qi, X., & Davison, B. (2009). Web page classification: features and algorithms. ACM Computing Surveys, 41(2), 1–31.
  47. Salton, G., & McGill, M. (1983). Introduction to modern information retrieval. International Student Edition, McGraw-Hill.
  48. Sellis, T., Roussopoulos, N., Faloutsos, C. (1997). Multidimensional access methods: trees have grown everywhere. In Proceedings of the 23rd international conference on very large data bases.
  49. Shen, H., Li, Z., Li, T. (2008). An investigation on multi-token list based proximity search in multi-dimensional massive database. In Proceedings of the international conference on convergence and hybrid information technology (ICCIT).
  50. Skobeltsyn, G., Luu, T., Zarko, I., Rajman, M., Aberer, K. (2009). Query-driven indexing for scalable peer-to-peer text retrieval. Future Generation Computing Systems, 25(1), 89–99.
  51. Weth, C., & Datta, A. (2012). Multiterm keyword search in NoSQL systems. IEEE Internet Computing, 16(1), 34–42.
  52. White, D.A., & Jain, R. (1996). Algorithm and strategies for similarity retrieval. Technical Report VCL-96-101, University of California.
  53. Yianilos, P.N. (1993). Data structures and algorithms for nearest neighbor search in general metric spaces. In Proceedings of the fourth annual ACM-SIAM symposium on discrete algorithms.
  54. Zolotarev, V.M. (1986). One-dimensional stable distributions. American Mathematical Society.

Copyright information

© Springer Science+Business Media New York 2013

Authors and Affiliations

  1. Department of Electrical and Computer Engineering, Clemson University, Clemson, USA
  2. MicroStrategy, Fairfax, USA
  3. Wal-mart Stores Inc., Bentonville, USA
