Abstract
This work presents an index partitioning technique for large-scale text-based search engines. Large e-commerce sites contain millions of products visited by millions of users. Textual similarity search has many uses on such sites, for instance in building recommendation engines. However, the size of the corpus makes naive approaches prohibitively slow for real-time search. To reduce response times, the search is executed within a small subset of the most related documents. To this end, documents are clustered using k-means. Because the vectors used for k-means clustering are very high dimensional, random indexing is applied to reduce their dimensionality. We accelerate these steps with GPUs to reduce preprocessing overhead. Once the clusters are built, text queries are executed only within the closest clusters. Our experiments on a large document collection for a recommendation scenario reveal that performing only 28% of the search operations in the inverted index costs only a 1.7% loss in recommendation precision.
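The pipeline summarized in the abstract (random indexing, then k-means, then routing a query to its closest clusters) can be illustrated with a minimal CPU-only sketch. This is not the paper's implementation: the function names, the toy corpus, the sparse ternary index vectors, the Euclidean assignment rule, and the `n_probe` parameter are all assumptions made for illustration, and both GPU acceleration and the inverted-index search itself are omitted.

```python
import numpy as np

def random_index_matrix(n_terms, dim, nonzeros, rng):
    """Assumed scheme: one sparse ternary index vector per term
    (a few +1/-1 entries, zeros elsewhere)."""
    R = np.zeros((n_terms, dim))
    for t in range(n_terms):
        pos = rng.choice(dim, size=nonzeros, replace=False)
        R[t, pos] = rng.choice([-1.0, 1.0], size=nonzeros)
    return R

def normalize(X):
    """Scale each row to unit length, guarding against zero rows."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.maximum(norms, 1e-12)

def sq_dists(X, centroids):
    """Squared Euclidean distance from every row of X to every centroid."""
    return ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)

def kmeans(X, k, iters, rng):
    """Plain Lloyd's k-means with Euclidean distance."""
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        labels = np.argmin(sq_dists(X, centroids), axis=1)
        for c in range(k):
            members = X[labels == c]
            if len(members):  # keep the old centroid if a cluster empties
                centroids[c] = members.mean(axis=0)
    labels = np.argmin(sq_dists(X, centroids), axis=1)
    return centroids, labels

def candidate_docs(query_vec, centroids, labels, n_probe):
    """Ids of documents in the n_probe clusters closest to the query."""
    d = sq_dists(query_vec[None, :], centroids)[0]
    closest = np.argsort(d)[:n_probe]
    return np.flatnonzero(np.isin(labels, closest))

# Toy corpus (hypothetical): 200 documents over a 1000-term vocabulary.
rng = np.random.default_rng(0)
docs_tf = rng.poisson(0.3, size=(200, 1000)).astype(float)

# Random indexing: project term-frequency vectors down to 64 dimensions.
R = random_index_matrix(n_terms=1000, dim=64, nonzeros=4, rng=rng)
X = normalize(docs_tf @ R)

# Cluster the reduced vectors, then route a query to its 2 closest clusters.
centroids, labels = kmeans(X, k=8, iters=10, rng=rng)
query = X[0]  # reuse document 0 as the query
cands = candidate_docs(query, centroids, labels, n_probe=2)
print(f"searching {len(cands)} of {len(X)} documents")
```

In a full system the candidate ids would restrict which postings of the inverted index are searched; raising `n_probe` trades more search work for a smaller risk of missing relevant documents.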
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Cevahir, A. (2014). Scalable Textual Similarity Search on Large Document Collections Through Random Indexing and K-means Clustering. In: Peng, WC., et al. Trends and Applications in Knowledge Discovery and Data Mining. PAKDD 2014. Lecture Notes in Computer Science, vol 8643. Springer, Cham. https://doi.org/10.1007/978-3-319-13186-3_22
DOI: https://doi.org/10.1007/978-3-319-13186-3_22
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-13185-6
Online ISBN: 978-3-319-13186-3
eBook Packages: Computer Science, Computer Science (R0)