Scalable Textual Similarity Search on Large Document Collections Through Random Indexing and K-means Clustering

Cevahir, Ali

doi:10.1007/978-3-319-13186-3_22

Conference paper
First Online: 26 November 2014

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8643)

Abstract

This work presents an index partitioning technique for large-scale text-based search engines. Large e-commerce sites contain millions of products visited by millions of users. Textual similarity search has many uses on such sites, for instance in building recommendation engines. However, the size of the corpus makes naive approaches prohibitively slow for real-time search. To reduce response times, the search is executed within a small subset of the most related documents. To obtain these subsets, documents are clustered using k-means. Because the vectors used for k-means clustering are very high dimensional, random indexing is applied to reduce their dimensionality. We accelerate both steps with GPUs to reduce preprocessing overhead. Once the clusters are built, text queries are executed only within the closest clusters. Our experiments on a large document collection for a recommendation scenario show that recommendation precision drops by only 1.7 % when only 28 % of the search operations in the inverted index are performed.
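The pipeline described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the dimensionality (DIM), the number of non-zero entries per index vector (NONZEROS), the cosine-based (spherical) k-means variant, the toy documents, and all function names are hypothetical. The shared-term count is only a stand-in for scoring in a real inverted index, and nothing here runs on a GPU.

```python
import math
import random

DIM = 64       # reduced dimensionality (assumed value)
NONZEROS = 4   # non-zero entries per index vector (assumed value)

def index_vector(term):
    """Sparse ternary index vector, deterministic for a given term."""
    rng = random.Random("ri:" + term)  # string seed keeps it reproducible
    vec = [0.0] * DIM
    for pos in rng.sample(range(DIM), NONZEROS):
        vec[pos] = rng.choice((-1.0, 1.0))
    return vec

def normalise(vec):
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def embed(text):
    """Random indexing: sum the index vectors of the terms, then L2-normalise."""
    vec = [0.0] * DIM
    for term in text.lower().split():
        for i, v in enumerate(index_vector(term)):
            vec[i] += v
    return normalise(vec)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def closest(vec, centroids, n=1):
    """Indices of the n centroids most similar to vec (cosine similarity)."""
    order = sorted(range(len(centroids)), key=lambda c: -dot(vec, centroids[c]))
    return order[:n]

def kmeans(vectors, k, iters=10, seed=0):
    """Spherical k-means on unit vectors (an assumed variant of k-means)."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iters):
        assign = [closest(v, centroids)[0] for v in vectors]
        for c in range(k):
            members = [v for v, a in zip(vectors, assign) if a == c]
            if members:  # keep the old centroid if a cluster empties
                mean = [sum(col) / len(members) for col in zip(*members)]
                centroids[c] = normalise(mean)
    # Final assignment so it matches the returned centroids.
    assign = [closest(v, centroids)[0] for v in vectors]
    return centroids, assign

def shared_terms(a, b):
    """Stand-in for inverted-index scoring: number of terms in common."""
    return len(set(a.lower().split()) & set(b.lower().split()))

def search(query, docs, centroids, assign, probe=1, top=3):
    """Score only the documents that belong to the `probe` closest clusters."""
    clusters = set(closest(embed(query), centroids, probe))
    candidates = [d for d, a in zip(docs, assign) if a in clusters]
    return sorted(candidates, key=lambda d: -shared_terms(query, d))[:top]

# Toy collection (hypothetical data).
docs = [
    "red running shoes for men",
    "blue running shoes for women",
    "leather office shoes",
    "digital camera with zoom lens",
    "compact digital camera case",
    "camera tripod stand",
]
vectors = [embed(d) for d in docs]
centroids, assign = kmeans(vectors, k=2)
print(search("red running shoes", docs, centroids, assign, probe=1))
```

Raising probe trades speed for recall: with probe equal to the number of clusters, every document is scored and the result matches an exhaustive search, while smaller values skip whole clusters at the risk of missing some relevant documents.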

Notes

1. http://www.rakuten.co.jp/

Author information

Authors and Affiliations

Rakuten Institute of Technology, Rakuten Inc., Tokyo, Japan
Ali Cevahir

Corresponding author

Correspondence to Ali Cevahir.

Editor information

Editors and Affiliations

National Chiao Tung University, Hsinchu, Taiwan
Wen-Chih Peng
Google Research, Mountain View, California, USA
Haixun Wang
University of Melbourne, Melbourne, Victoria, Australia
James Bailey
National Cheng Kung University, Tainan, Taiwan
Vincent S. Tseng
Japan Advanced Institute of Science and Technology, Nomi City, Japan
Tu Bao Ho
Nanjing University, Nanjing, China
Zhi-Hua Zhou
National Chengchi University, Taipei, Taiwan
Arbee L.P. Chen

About this paper

Cite this paper

Cevahir, A. (2014). Scalable Textual Similarity Search on Large Document Collections Through Random Indexing and K-means Clustering. In: Peng, WC., et al. Trends and Applications in Knowledge Discovery and Data Mining. PAKDD 2014. Lecture Notes in Computer Science, vol 8643. Springer, Cham. https://doi.org/10.1007/978-3-319-13186-3_22

DOI: https://doi.org/10.1007/978-3-319-13186-3_22
Published: 26 November 2014
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-13185-6
Online ISBN: 978-3-319-13186-3
eBook Packages: Computer Science, Computer Science (R0)
