Accelerating Set Similarity Joins Using GPUs

Cruz, Mateus S. H.; Kozawa, Yusuke; Amagasa, Toshiyuki; Kitagawa, Hiroyuki

doi:10.1007/978-3-662-53455-7_1

Part of the book series: Lecture Notes in Computer Science (TLDKS, volume 9940)

Abstract

We propose a scheme for efficient set similarity joins on Graphics Processing Units (GPUs). Due to the rapid growth and diversification of data, there is an increasing demand for fast execution of set similarity joins in applications ranging from data integration to plagiarism detection. To tackle this problem, our solution takes advantage of the massively parallel processing offered by GPUs. Additionally, we employ MinHash to estimate the Jaccard similarity between two sets. By exploiting the high parallelism of GPUs and the space efficiency provided by MinHash, we can achieve high performance without sacrificing accuracy. Experimental results show that our proposed method is more than two orders of magnitude faster than a serial CPU implementation and 25 times faster than a parallel CPU implementation, while generating highly precise query results.
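
As a rough illustration of the MinHash idea mentioned in the abstract, the following is a minimal CPU-side sketch in Python. It is not the chapter's GPU implementation. The function names, the linear hash family `(a*x + b) mod PRIME`, the use of CRC32 to map tokens to integers, and the signature length of 256 are all assumptions chosen for illustration. The sketch assumes non-empty input sets.

```python
import random
import zlib

PRIME = (1 << 61) - 1  # Mersenne prime used as the hash modulus (assumption)


def make_hash_params(num_hashes, seed=0):
    """Draw (a, b) pairs for hash functions h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]


def minhash_signature(tokens, params):
    """Return the MinHash signature: one minimum hash value per function."""
    # CRC32 gives a deterministic integer id per token (illustrative choice).
    ids = [zlib.crc32(t.encode("utf-8")) for t in set(tokens)]
    return [min((a * x + b) % PRIME for x in ids) for a, b in params]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which the two signatures agree."""
    matches = sum(1 for u, v in zip(sig_a, sig_b) if u == v)
    return matches / len(sig_a)


def exact_jaccard(a, b):
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B|, for comparison."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    params = make_hash_params(256)
    x = "the quick brown fox jumps over the lazy dog".split()
    y = "the quick brown fox leaps over a lazy dog".split()
    sig_x = minhash_signature(x, params)
    sig_y = minhash_signature(y, params)
    print("estimated:", estimate_jaccard(sig_x, sig_y))
    print("exact:    ", exact_jaccard(x, y))
```

Each set is reduced to a fixed-length signature regardless of its size, which is the space efficiency the abstract refers to. Comparing two signatures position by position is independent across positions and across pairs, so the work is naturally data-parallel.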

Acknowledgments

We thank the editors and the reviewers for their remarks and suggestions. This research was partly supported by the Grant-in-Aid for Scientific Research (B) (#26280037) from the Japan Society for the Promotion of Science.

Author information

Authors and Affiliations

Graduate School of Systems and Information Engineering, University of Tsukuba, Tsukuba, Japan
Mateus S. H. Cruz & Yusuke Kozawa
Faculty of Engineering, Information and Systems, University of Tsukuba, Tsukuba, Japan
Toshiyuki Amagasa & Hiroyuki Kitagawa

Corresponding author

Correspondence to Mateus S. H. Cruz.

Editor information

Editors and Affiliations

IRIT, Paul Sabatier University, Toulouse, France
Abdelkader Hameurlain
FAW, University of Linz, Linz, Austria
Josef Küng
FAW, University of Linz, Linz, Austria
Roland Wagner
HP Labs, Sunnyvale, California, USA
Qimin Chen

About this chapter

Cite this chapter

Cruz, M.S.H., Kozawa, Y., Amagasa, T., Kitagawa, H. (2016). Accelerating Set Similarity Joins Using GPUs. In: Hameurlain, A., Küng, J., Wagner, R., Chen, Q. (eds) Transactions on Large-Scale Data- and Knowledge-Centered Systems XXVIII. Lecture Notes in Computer Science, vol 9940. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-53455-7_1

DOI: https://doi.org/10.1007/978-3-662-53455-7_1
Published: 10 September 2016
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-53454-0
Online ISBN: 978-3-662-53455-7
eBook Packages: Computer Science, Computer Science (R0)
