The VLDB Journal, Volume 27, Issue 4, pp 471–495

Efficient set containment join

  • Jianye Yang
  • Wenjie Zhang
  • Shiyu Yang
  • Ying Zhang
  • Xuemin Lin
  • Long Yuan
Regular Paper

Abstract

In this paper, we study the problem of set containment join. Given two collections \(\mathcal {R}\) and \(\mathcal {S}\) of records, the set containment join \(\mathcal {R} \bowtie _\subseteq \mathcal {S}\) retrieves all record pairs \((r,s) \in \mathcal {R}\times \mathcal {S}\) such that \(r \subseteq s\). This problem has been extensively studied in the literature and has many important applications in commercial and scientific fields. Recent research focuses on in-memory set containment join algorithms, and several techniques have been developed following the intersection-oriented or union-oriented computing paradigms. Nevertheless, we observe that both computing paradigms have their limits due to the nature of the intersection and union operators. In particular, the intersection-oriented method relies on the intersection of the relevant inverted lists built on the elements of \(\mathcal {S}\). A nice property of the intersection-oriented method is that the join computation is verification-free. However, the number of records explored during the join process may be large because each record in \(\mathcal {S}\) has multiple replicas in the index. On the other hand, the union-oriented method generates a signature for each record in \(\mathcal {R}\), and the candidate pairs are obtained by the union of the inverted lists of the relevant signatures. The candidate size of the union-oriented method is usually small because each record contributes only one replica to the index. Unfortunately, the union-oriented method needs to verify the candidate pairs, which may be expensive, especially when the join result is large. As a matter of fact, the state-of-the-art union-oriented solution is not competitive with the intersection-oriented ones.
In this paper, we propose a new union-oriented method, namely TT-Join, which not only enhances the advantage of the previous union-oriented methods but also integrates the strengths of intersection-oriented methods by imposing a variant of the prefix tree structure. We conduct extensive experiments on 20 real-life datasets and synthetic datasets, comparing our method with 7 existing methods. The experimental results demonstrate that TT-Join significantly outperforms the existing algorithms on most of the datasets and can achieve up to two orders of magnitude speedup. Furthermore, to support large-scale datasets, we extend our techniques to distributed systems on top of the MapReduce framework. With the help of carefully designed load-aware distribution mechanisms, our distributed join algorithm can achieve up to an order of magnitude speedup over the baseline methods.
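
To make the two computing paradigms described in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation and not TT-Join itself. The function names (`naive_join`, `intersection_join`, `union_join`) are hypothetical, records are modelled as Python sets of integers, and the choice of the least frequent element as the signature of a record in \(\mathcal {R}\) is an assumption made only for illustration.

```python
from collections import defaultdict


def naive_join(R, S):
    """Reference definition: all index pairs (i, j) with R[i] a subset of S[j]."""
    return {(i, j) for i, r in enumerate(R) for j, s in enumerate(S) if r <= s}


def intersection_join(R, S):
    """Intersection-oriented sketch: inverted lists are built on the elements
    of S, so every record s is replicated once per element. No verification
    is needed because the intersection is exactly the answer for r."""
    inverted = defaultdict(set)
    for j, s in enumerate(S):
        for e in s:
            inverted[e].add(j)
    result = set()
    for i, r in enumerate(R):
        if not r:
            # The empty set is contained in every record of S.
            result.update((i, j) for j in range(len(S)))
            continue
        lists = [inverted.get(e, set()) for e in r]
        result.update((i, j) for j in set.intersection(*lists))
    return result


def union_join(R, S):
    """Union-oriented sketch: each record r is indexed once under a single
    signature element (assumption: its least frequent element in S).
    Candidates for s are the union of the lists of the elements of s,
    and every candidate must be verified."""
    freq = defaultdict(int)
    for s in S:
        for e in s:
            freq[e] += 1
    signature_index = defaultdict(list)
    empty_records = []
    for i, r in enumerate(R):
        if r:
            sig = min(r, key=lambda e: (freq[e], e))
            signature_index[sig].append(i)  # one replica per record
        else:
            empty_records.append(i)
    result = set()
    for j, s in enumerate(S):
        result.update((i, j) for i in empty_records)
        for e in s:
            for i in signature_index.get(e, ()):
                if R[i] <= s:  # verification step
                    result.add((i, j))
    return result


if __name__ == "__main__":
    R = [{1, 2}, {3}, {2, 4}]
    S = [{1, 2, 3}, {2, 4, 5}, {3, 4}]
    print(sorted(intersection_join(R, S)))
    print(sorted(union_join(R, S)))
```

In this sketch the trade-off stated in the abstract is visible in the index sizes: `inverted` stores one entry per element of each record in \(\mathcal {S}\), whereas `signature_index` stores exactly one entry per record in \(\mathcal {R}\) but pays for it with the subset check on every candidate.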

Keywords

  • Set containment join
  • Prefix tree
  • Data partitioning
  • MapReduce framework

Notes

Acknowledgements

Ying Zhang is supported by ARC FT170100128 and DP180103096. Wenjie Zhang is supported by ARC DP180103096. Xuemin Lin is supported by NSFC 61672235, DP170101628, and DP180103096. Shiyu Yang is sponsored by Shanghai Sailing Program.

Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2018

Authors and Affiliations

  • Jianye Yang (1)
  • Wenjie Zhang (2)
  • Shiyu Yang (3)
  • Ying Zhang (4)
  • Xuemin Lin (2)
  • Long Yuan (2)

  1. Alibaba Group, Hangzhou, China
  2. The University of New South Wales, Sydney, Australia
  3. East China Normal University, Shanghai, China
  4. CAI, School of Software, University of Technology Sydney, Sydney, Australia
