CuAPSS: A Hybrid CUDA Solution for AllPairs Similarity Search

  • Yilin Feng
  • Jie Tang
  • Chongjun Wang
  • Junyuan Xie
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11334)

Abstract

Given a set of high-dimensional sparse vectors, a similarity function, and a threshold, AllPairs Similarity Search (APSS) finds all pairs of vectors whose similarity is greater than or equal to the threshold. APSS has been studied in many fields of computer science, including information retrieval, data mining, and databases. It is a crucial part of many applications, such as near-duplicate document detection, collaborative filtering, query refinement, and clustering. For cosine similarity, many serial algorithms have been proposed that solve the problem by pruning the set of similarity candidates for each query object. However, the efficiency of these serial algorithms degrades badly as the threshold decreases. Other parallel implementations of APSS, based on OpenMP or MapReduce, also adopt the pruning policy and do not fully solve the problem. In this context, we introduce CuAPSS, which solves the all-pairs cosine similarity search problem in the CUDA environment on GPUs. Our method takes a hybrid approach that uses both the forward list and the inverted list in APSS, striking a compromise between memory accesses and dot-product computation. Experimental results show that our method solves the problem much faster than existing methods on several benchmark datasets with hundreds of millions of non-zero values, achieving a speedup of 1.5x–23x over the state-of-the-art parallel algorithm while keeping a relatively stable running time across different threshold values.
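To make the problem statement concrete, the following is a minimal serial Python sketch of APSS under cosine similarity. It is not the CuAPSS algorithm and contains no GPU code or pruning. It only contrasts the two data layouts the abstract mentions: a forward list (each vector stores its own non-zero features) and an inverted list (each feature stores the vectors that contain it). The function names `normalize`, `apss_forward`, and `apss_inverted`, and the dict-based sparse representation, are assumptions made for illustration.

```python
import math
from collections import defaultdict


def normalize(vec):
    """Scale a sparse vector (dict: feature -> weight) to unit L2 norm,
    so that the dot product of two vectors equals their cosine similarity."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {f: w / norm for f, w in vec.items()}


def apss_forward(vectors, threshold):
    """Forward-list variant: compute the sparse dot product of every pair."""
    result = set()
    for i in range(len(vectors)):
        for j in range(i):
            a, b = vectors[i], vectors[j]
            if len(a) > len(b):
                a, b = b, a  # iterate over the shorter vector
            dot = sum(w * b[f] for f, w in a.items() if f in b)
            if dot >= threshold:
                result.add((j, i))
    return result


def apss_inverted(vectors, threshold):
    """Inverted-list variant: accumulate partial dot products feature by
    feature, touching only vectors that share a feature with the query."""
    index = defaultdict(list)  # feature -> list of (vector id, weight)
    result = set()
    for i, vec in enumerate(vectors):
        scores = defaultdict(float)
        for f, w in vec.items():
            for j, wj in index[f]:
                scores[j] += w * wj
        for j, s in scores.items():
            if s >= threshold:
                result.add((j, i))
        # Index the vector only after querying, so each pair is found once.
        for f, w in vec.items():
            index[f].append((i, w))
    return result
```

Both functions expect unit-normalized input, for example `apss_inverted([normalize(v) for v in raw], 0.5)`, and return pairs `(j, i)` with `j < i`. The forward variant does regular per-pair work but visits every pair, while the inverted variant skips pairs with no shared feature at the cost of scattered accumulator updates. This is roughly the tension between dot-product computation and memory access that the abstract describes.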

Keywords

Similarity search · Load balancing · CUDA

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Yilin Feng (1)
  • Jie Tang (1)
  • Chongjun Wang (1)
  • Junyuan Xie (1)
  1. State Key Laboratory for Novel Software Technology, Department of Computer Science and Technology, Nanjing University, Nanjing, China
