Scalable Textual Similarity Search on Large Document Collections Through Random Indexing and K-means Clustering

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8643)

Abstract

This work presents an index partitioning technique for large-scale text-based search engines. Large e-commerce sites contain millions of products visited by millions of users. Textual similarity search has many uses on such sites, for instance in building recommendation engines. However, the size of the corpus makes naive approaches prohibitively slow for real-time search. To reduce response times, each search is executed within a small subset of the most related documents. To this end, documents are clustered using k-means. Because the vectors used for k-means clustering are very high dimensional, random indexing is applied to reduce their dimensionality. We accelerate these steps with GPUs to reduce the preprocessing overhead. Once the clusters are built, text queries are executed only within the closest clusters. Our experiments on a large document collection for a recommendation scenario show that performing only 28% of the search operations in the inverted index costs only a 1.7% loss in recommendation precision.
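The pipeline described in the abstract can be illustrated with a minimal, CPU-only sketch. It is not the paper's implementation: it omits the GPU acceleration and the inverted index, scores candidates directly on the reduced vectors, and uses a sparse ternary random projection in the spirit of random indexing. All function names, the dimension `dim=64`, the number of nonzeros per index vector, and the toy documents are assumptions made for illustration.

```python
import math
import random
from collections import Counter

def index_vector(term, dim=64, nonzeros=4):
    """Sparse ternary index vector for a term (hypothetical parameters)."""
    rng = random.Random(term)  # seeding by the term keeps the mapping stable
    vec = [0.0] * dim
    for pos in rng.sample(range(dim), nonzeros):
        vec[pos] = rng.choice((-1.0, 1.0))
    return vec

def unit(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def project(text, dim=64):
    """Reduced document vector: term-frequency-weighted sum of index vectors."""
    vec = [0.0] * dim
    for term, tf in Counter(text.lower().split()).items():
        for i, x in enumerate(index_vector(term, dim)):
            vec[i] += tf * x
    return unit(vec)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def closest(vec, centroids):
    """Index of the centroid with the highest cosine similarity to vec."""
    return max(range(len(centroids)), key=lambda c: dot(vec, centroids[c]))

def kmeans(vectors, k, iters=10, seed=0):
    """Plain Lloyd's k-means using cosine similarity on unit vectors."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iters):
        assign = [closest(v, centroids) for v in vectors]
        for c in range(k):
            members = [v for v, a in zip(vectors, assign) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = unit([sum(col) / len(members) for col in zip(*members)])
    return centroids, [closest(v, centroids) for v in vectors]

def search(query, docs, vectors, centroids, assign, n_clusters=1, top=3):
    """Score only the documents assigned to the clusters closest to the query."""
    q = project(query, len(centroids[0]))
    order = sorted(range(len(centroids)), key=lambda c: -dot(q, centroids[c]))
    nearest = set(order[:n_clusters])
    candidates = [i for i, a in enumerate(assign) if a in nearest]
    ranked = sorted(candidates, key=lambda i: -dot(q, vectors[i]))[:top]
    return [docs[i] for i in ranked], len(candidates)

# Toy collection (made up for this sketch).
docs = [
    "red running shoes for men",
    "blue running shoes for women",
    "lightweight trail running shoes",
    "stainless steel kitchen knife set",
    "ceramic kitchen knife with cover",
    "wooden kitchen cutting board",
]
vectors = [project(d) for d in docs]
centroids, assign = kmeans(vectors, k=2)
results, scanned = search("running shoes", docs, vectors, centroids, assign)
print(f"scanned {scanned} of {len(docs)} documents:", results)
```

Raising `n_clusters` trades speed for recall: with `n_clusters` equal to the number of clusters, every document is scanned and the search degenerates to the exhaustive case.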


Notes

  1. http://www.rakuten.co.jp/


Author information

Corresponding author

Correspondence to Ali Cevahir.


Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Cevahir, A. (2014). Scalable Textual Similarity Search on Large Document Collections Through Random Indexing and K-means Clustering. In: Peng, WC., et al. Trends and Applications in Knowledge Discovery and Data Mining. PAKDD 2014. Lecture Notes in Computer Science, vol 8643. Springer, Cham. https://doi.org/10.1007/978-3-319-13186-3_22

  • DOI: https://doi.org/10.1007/978-3-319-13186-3_22

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-13185-6

  • Online ISBN: 978-3-319-13186-3

  • eBook Packages: Computer Science, Computer Science (R0)
