Accelerating Set Similarity Joins Using GPUs

Transactions on Large-Scale Data- and Knowledge-Centered Systems XXVIII

Part of the book series: Lecture Notes in Computer Science (TLDKS, volume 9940)

Abstract

We propose a scheme for efficient set similarity joins on Graphics Processing Units (GPUs). Due to the rapid growth and diversification of data, there is an increasing demand for fast execution of set similarity joins in applications ranging from data integration to plagiarism detection. To tackle this problem, our solution takes advantage of the massively parallel processing offered by GPUs. We also employ MinHash to estimate the Jaccard similarity between two sets. By exploiting the high parallelism of GPUs and the space efficiency provided by MinHash, we can achieve high performance without sacrificing accuracy. Experimental results show that our proposed method is more than two orders of magnitude faster than a serial CPU implementation and 25 times faster than a parallel CPU implementation, while generating highly precise query results.
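The MinHash estimate of Jaccard similarity mentioned in the abstract can be illustrated with a minimal CPU-side sketch. It is not the chapter's GPU implementation. The function names, the use of BLAKE2b as the hash family, the signature length of 256, and the restriction to non-negative integer element IDs are all assumptions made for this illustration.

```python
import hashlib
import struct


def jaccard(a, b):
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(a & b) / len(union)


def _hash(i, x):
    """Hypothetical i-th hash function: BLAKE2b over the pair (i, x).

    Assumes i and x are integers in [0, 2**64). Any well-mixed hash
    family could be substituted here.
    """
    digest = hashlib.blake2b(struct.pack("<QQ", i, x), digest_size=8).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(elements, num_hashes=256):
    """Fixed-length signature: the minimum hash value per hash function."""
    elements = list(elements)
    if not elements:
        raise ValueError("MinHash is undefined for an empty set")
    return [min(_hash(i, x) for x in elements) for i in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions where the two minima agree."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(1 for u, v in zip(sig_a, sig_b) if u == v)
    return matches / len(sig_a)


if __name__ == "__main__":
    a = set(range(0, 300))
    b = set(range(100, 400))
    print("exact:   ", jaccard(a, b))
    print("estimate:", estimate_jaccard(minhash_signature(a), minhash_signature(b)))
```

Each set is reduced to a signature of fixed length regardless of its size, which is the source of MinHash's space efficiency. Each signature entry depends on only one set and one hash function, so the entries can be computed independently. That independence makes the computation a natural fit for parallel hardware, although this sketch does not show how the chapter maps it onto a GPU.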

Acknowledgments

We thank the editors and the reviewers for their remarks and suggestions. This research was partly supported by the Grant-in-Aid for Scientific Research (B) (#26280037) from the Japan Society for the Promotion of Science.

Author information

Corresponding author

Correspondence to Mateus S. H. Cruz.

Copyright information

© 2016 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Cruz, M.S.H., Kozawa, Y., Amagasa, T., Kitagawa, H. (2016). Accelerating Set Similarity Joins Using GPUs. In: Hameurlain, A., Küng, J., Wagner, R., Chen, Q. (eds) Transactions on Large-Scale Data- and Knowledge-Centered Systems XXVIII. Lecture Notes in Computer Science, vol 9940. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-53455-7_1

  • DOI: https://doi.org/10.1007/978-3-662-53455-7_1

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-662-53454-0

  • Online ISBN: 978-3-662-53455-7

  • eBook Packages: Computer Science, Computer Science (R0)
