# Token list based information search in a multi-dimensional massive database


## Abstract

Finding proximity information is crucial for massive database search. Locality Sensitive Hashing (LSH) is a method for finding nearest neighbors of a query point in a high-dimensional space. It classifies high-dimensional data according to data similarity. However, the “curse of dimensionality” makes LSH insufficiently effective in finding similar data and insufficiently efficient in terms of memory resources and search delays. The contribution of this work is threefold. First, we study a Token List based information Search scheme (TLS) as an alternative to LSH. TLS builds a token list table containing all the unique tokens from the database, and clusters data records having the same token together in one group. Querying is conducted in a small number of groups of relevant data records instead of searching the entire database. Second, in order to decrease the searching time of the token list, we further propose the Optimized Token list based Search schemes (OTS) based on index-tree and hash table structures. An index-tree structure orders the tokens in the token list and constructs an index table based on the tokens. Searching the token list starts from the entry of the token list supplied by the index table. A hash table structure assigns a hash ID to each token. A query token can be directly located in the token list according to its hash ID. Third, since a single-token based method leads to high overhead in the results refinement given a required similarity, we further investigate how a Multi-Token List Search scheme (MTLS) improves the performance of database proximity search. We conducted experiments on the LSH-based searching scheme, TLS, OTS, and MTLS using a massive customer data integration database. The comparison experimental results show that TLS is more efficient than an LSH-based searching scheme, and OTS improves the search efficiency of TLS. 
Further, MTLS performs better than TLS when the number of tokens is appropriately chosen, and a two-token adjacent token list achieves the shortest query delay in our testing dataset.
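
The token-list idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the whitespace tokenizer, the Jaccard similarity used for result refinement, the `threshold` value, and all function names are assumptions made for illustration. A Python `dict` stands in for the hash-table variant of OTS, and `n=2` approximates a two-token adjacent token list in the spirit of MTLS.

```python
from collections import defaultdict


def tokenize(record):
    """Split a record into lowercase whitespace-delimited tokens (assumed tokenizer)."""
    return record.lower().split()


def build_token_list(records, n=1):
    """Map every run of n adjacent tokens to the set of record IDs containing it.

    n=1 gives a single-token list; n=2 gives a two-token adjacent token list.
    The dict hashes each key, so a query token is located directly.
    """
    table = defaultdict(set)
    for rec_id, record in enumerate(records):
        tokens = tokenize(record)
        for i in range(len(tokens) - n + 1):
            table[tuple(tokens[i:i + n])].add(rec_id)
    return table


def similarity(a, b):
    """Jaccard similarity over token sets (an assumed similarity measure)."""
    sa, sb = set(tokenize(a)), set(tokenize(b))
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def search(query, records, table, n=1, threshold=0.5):
    """Collect candidates from the matching groups only, then refine by similarity."""
    tokens = tokenize(query)
    candidates = set()
    for i in range(len(tokens) - n + 1):
        # Only groups that share a token (or token pair) with the query are visited.
        candidates |= table.get(tuple(tokens[i:i + n]), set())
    scored = [(rid, similarity(query, records[rid])) for rid in sorted(candidates)]
    return [(rid, score) for rid, score in scored if score >= threshold]


if __name__ == "__main__":
    # Hypothetical customer records, invented for this example.
    records = [
        "ann smith 12 oak street",
        "ann smyth 12 oak street",
        "bob jones 9 elm road",
    ]
    single = build_token_list(records, n=1)
    pairs = build_token_list(records, n=2)
    print(search("ann smith 12 oak st", records, single, n=1))
    print(search("ann smith 12 oak st", records, pairs, n=2))
```

In this sketch, the refinement step computes similarity only for records that share at least one token (or adjacent token pair) with the query, which is the source of the savings over scanning the whole database. Longer token runs make each group smaller but risk missing records that differ in one token of every run.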

## Keywords

Similarity data search, Proximity search, Locality sensitive hash, Database

## Notes

### Acknowledgments

This research was supported in part by U.S. NSF grants IIS-1354123, CNS-1254006, CNS-1249603, OCI-1064230, CNS-1049947, CNS-0917056, and CNS-1025652, Microsoft Research Faculty Fellowship 8300751, and the United States Department of Defense 238866. Early versions of this work were presented in the Proceedings of DMIN’08 (Li et al. 2008) and ICCIT’08 (Shen et al. 2008). We would like to thank Mr. Yuhua Lin for his valuable comments in addressing the review feedback.

## References

- Aberer, K., Cudré-Mauroux, P., Hauswirth, M. (2003). The chatty web: emergent semantics through gossiping. In *Proceedings of the 12th international world wide web conference*.
- Alimohammadi, D. (2003). Meta-tag: a means to control the process of web indexing. *Online Information Review*, *27*(4), 238–242.
- Andoni, A. (2005). Lsh algorithm and implementation (e2lsh). http://web.mit.edu/andoni/www/LSH/index.html.
- Andoni, A., & Indyk, P. (2005). E2lsh 0.1 user manual. http://web.mit.edu/andoni/www/LSH/index.html.
- Arya, S., Mount, D.M., Netanyahu, N.S., Silverman, R., Wu, A. (1994). An optimal algorithm for approximate nearest neighbor searching. In *Proceedings 5th ACM-SIAM symposium discrete algorithms*.
- Bayer, R., & McCreight, E. (1970). Organization and maintenance of large ordered indices. In *Proceedings of ACM-SIGFIDET workshop on data description and access* (pp. 107–141).
- Beckmann, N., Kriegel, H., Schneider, R., Seeger, B. (1990). The r\*-tree: an efficient and robust access method for points and rectangles. In *Proceedings of the ACM SIGMOD international conference on management of data* (pp. 322–331).
- Bennett, K.P., Fayyad, U., Geiger, D. (1999). Density-based indexing for approximate nearest-neighbor queries. In *Proceedings of KDD*.
- Bentley, J.L., Friedman, J.H., Finkel, R.A. (1977). An algorithm for finding best matches in logarithmic expected time. *ACM Transactions on Mathematical Software*, *3*(3), 209–226.
- Berchtold, S., Keim, D.A., Kriegel, H.-P. (1996). The x-tree: an index structure for high-dimensional data. In *Proceedings of the 22nd international conference on very large databases* (pp. 28–39).
- Berrani, S.A., Amsaleg, L., Gros, P. (2003). Approximate searches: k-neighbors + precision. In *Proceedings of CIKM*.
- Berry, M.W., Drmac, Z., Jessup, E.R. (1999). Matrices, vector spaces, and information retrieval. *SIAM Review*, *41*(2), 335–362.
- Blachman, N. (2007). Google guide, making searching even easier. http://www.googleguide.com/google_works.html.
- Bohm, C., Berchtold, S., Keim, D.A. (2001). Searching in high-dimensional spaces: index structures for improving the performance of multimedia databases. *ACM Computing Surveys*, *33*(3), 322–373.
- Brin, S. (1995). Near neighbor search in large metric space. In *Proceedings of the 21st international conference on VLDB*.
- Chaudhuri, S., Church, K., Konig, A., Sui, L. (2007). Heavy-tailed distributions and multi-keyword queries. In *Proceedings of SIGIR*.
- Chen, H., Jin, H., Wang, J., Chen, L., Liu, Y., Ni, L. (2008). Efficient multi-keyword search over p2p web. In *Proceedings of WWW* (pp. 989–998).
- Chen, H., Yan, J., Jin, H., Liu, Y., Ni, L. (2010). TSS: efficient term set search in large peer-to-peer textual collections. *TC*, *59*(7), 969–980.
- Comer, D. (1979). The ubiquitous B-tree. *Computing Surveys*, *11*(2), 121–138.
- Cover, T., & Hart, P. (1967). Nearest neighbor pattern classification. *IEEE Transactions of Information Theory*, *IT-13*(1), 21–27.
- Datar, M., Immorlica, N., Indyk, P., Mirrokni, V.S. (2003). Locality-sensitive hashing scheme based on p-stable distributions. In *Proceedings of DIMACS workshop on streaming data analysis and mining*.
- Datar, M., Immorlica, N., Indyk, P., Mirrokni, V.S. (2004). Locality-sensitive hashing scheme based on p-stable distributions. In *Proceedings of the 20th annual symposium on computational geometry (SCG)*.
- Deerwester, S., Dumais, S.T., Landauer, T.K., Furnas, G.W., Harshman, R.A. (1990). Indexing by latent semantic analysis. *Journal of the Society for Information Science*, *41*(6), 391–407.
- Fagin, R. (1998). Fuzzy queries in multimedia database systems. In *Proceedings ACM symposium on principles of database systems*.
- Filho, R.F.S., Traina, A.J.M., Traina, J.C., Faloutsos, C. (2001). Similarity search without tears: the omni family of all-purpose access methods. In *Proceedings of ICDE*.
- Fu, A., Chan, P.M.S., Cheung, Y.L., Moon, Y.S. (2000). Dynamic vp-tree indexing for n-nearest neighbor search given pair-wise distances. *VLDB Journal*, *9*(2), 154–173.
- Gionis, A., Indyk, P., Motwani, R. (1999). Similarity search in high dimensions via hashing. In *Proceedings of international conference on very large data bases (VLDB)* (pp. 518–529).
- Grossman, D.A., & Frieder, O. (2004). *iFlow: information retrieval*. The Netherlands: Springer.
- Guttman, A. (1984). R-trees: a dynamic index structure for spatial searching. In *Proceedings of the SIGMOD conference* (pp. 47–57).
- Halevy, A.Y., Ives, Z.G., Mork, P., Tatarinov, I. (2003). Piazza: data management infrastructure for semantic web applications. In *Proceedings of the 12th international world wide web conference*.
- Hu, J.J., Tang, C.J., Peng, J., Li, C., Yuan, C.A., Chen, A.L. (2005). A clustering algorithm based absorbing nearest neighbors. In *6th International conference of WAIM*.
- Indyk, P., & Motwani, R. (1998). Approximate nearest neighbors: towards removing the curse of dimensionality. In *Proceedings of the 30th annual ACM symposium on theory of computing*.
- Kleinberg, J.M. (1997). Two algorithms for nearest-neighbor search in high dimensions. In *Proceedings of ACM symposium on theory of computing (STOC)*.
- Kruskal, J.B., & Wish, M. (1978). *Multidimensional scaling*. Beverly Hills: SAGE publication.
- Kulkarni, S., & Orlandic, R. (2006). High-dimensional similarity search using data sensitive space partitioning. *Lecture Notes in Computer Science (LNCS)*, *4080*(2006), 738–750.
- Lam, H., Perego, R., Quan, N., Silvestri, F. (2009). Entry pairing in inverted file. In *Proceedings of WISE* (Vol. 5802, pp. 511–522).
- Li, C., Chang, E., Garcia-Molina, H., Wiederhold, G. (2002). Clustering for approximate similarity search in high-dimensional spaces. *IEEE Transactions of Knowledge and Data Engineering*, *14*(4), 792–808.
- Li, T., Shen, H., Rosequist, A. (2008). Token list based data searching in a multi-dimensional massive database. In *Proceedings of The 4th international conference on data mining (DMIN)*.
- Loccoz, N.M. (2005). *High-dimensional access methods for efficient similarity queries*. Technical Report TR-2005-05-05, Universite De GENEVE.
- Long, X., & Suel, T. (2005). Three-level caching for efficient query processing in large Web search engines. In *Proceedings of WWW* (pp. 257–266).
- Luu, T., Skobeltsyn, G., Klemm, F., Puh, M., Zarko, I., Rajman, M., Aberer, K. (2008). AlvisP2P: scalable peer-to-peer text retrieval in a structured P2P network. *PVLDB*, *1*(2), 1424–1427.
- Nejdl, W., Siberski, W., Wolpers, M., Schmitz, C. (2003). Routing and clustering in schema-based super peer networks. In *Proceedings of IPTPS*.
- Nejdl, W., Wolpers, M., Siberski, W., Löser, A., Bruckhorst, I., Schlosser, M., Schmitz, C. (2003). Super-peer-based routing and clustering strategies for rdf-based peer-to-peer networks. In *Proceedings of the 12th international world wide web conference*.
- Niblack, C.W., Barber, R., Equitz, W., Flickner, M.D., Glasman, E.H., Petkovic, D., Yanker, P., Faloutsos, C., Taubin, G. (1993). The QBIC project: querying images by content using color, texture and shape. In *Proceedings of SPIE: storage and retrieval for image and video database*.
- Panigrahy, R. (2006). *Nearest neighbor search using kd-trees*. Technical report, Stanford University.
- Qi, X., & Davison, B. (2009). Web page classification: features and algorithms. *ACM Computing Surveys*, *41*(2), 1–31.
- Salton, G., & McGill, M. (1983). *Introduction to modern information retrieval*. International Student Edition, McGraw-Hill.
- Sellis, T., Roussopoulos, N., Faloutsos, C. (1997). Multidimensional access methods: trees have grown everywhere. In *Proceedings of the 23rd international conference on very large data bases*.
- Shen, H., Li, Z., Li, T. (2008). An investigation on multi-token list based proximity search in multi-dimensional massive database. In *Proceedings of the international conference on convergence and hybrid information technology (ICCIT)*.
- Skobeltsyn, G., Luu, T., Zarko, I., Rajman, M., Aberer, K. (2009). Query-driven indexing for scalable peer-to-peer text retrieval. *Future Generation Computing Systems*, *25*(1), 89–99.
- Weth, C., & Datta, A. (2012). Multiterm keyword search in NoSQL systems. *IEEE Internet Computing*, *16*(1), 34–42.
- White, D.A., & Jain, R. (1996). *Algorithm and strategies for similarity retrieval*. Technical Report VCL-96-101, University of California.
- Yianilos, P.N. (1993). Data structures and algorithms for nearest neighbor search in general metric spaces. In *Proceedings of the fourth annual ACM-SIAM symposium on discrete algorithms*.
- Zolotarev, V.M. (1986). *One-dimensional stable distributions*. American Mathematical Society.