
An Experimental Survey of MapReduce-Based Similarity Joins

  • Yasin N. Silva
  • Jason Reed
  • Kyle Brown
  • Adelbert Wadsworth
  • Chuitian Rong
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9939)

Abstract

In recent years, Big Data systems and their main data processing framework, MapReduce, have been introduced to efficiently process and analyze massive amounts of data. One of the key data processing and analysis operations is the Similarity Join (SJ), which finds similar pairs of objects between two datasets. The study of SJ techniques for Big Data systems has emerged as a key topic in the database community, and several research teams have published techniques to solve the SJ problem on Big Data systems. However, many of these techniques were not experimentally compared against alternative approaches. This is partly because some of these techniques were developed in parallel, while others were not implemented, even as part of their original publications. Consequently, there is no clear understanding of how these techniques compare to each other or which technique to use in specific scenarios. This paper addresses this problem by focusing on the study, classification, and comparison of previously proposed MapReduce-based SJ algorithms. The contributions of this paper include the classification of SJs based on the supported data types and distance functions, and an extensive set of experimental results. Furthermore, the authors have made available their open-source implementation of many SJ algorithms to enable other researchers and practitioners to apply and extend these algorithms.
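
To make the operation concrete, the following is a minimal single-machine sketch of a distance-threshold similarity join, assuming a user-supplied distance function and a threshold `eps`. It is a naive nested-loop illustration only, not one of the MapReduce-based algorithms surveyed in the paper; the names `similarity_join`, `abs_diff`, and `eps` are hypothetical and chosen for this example.

```python
from itertools import product

def similarity_join(r, s, distance, eps):
    """Naive nested-loop similarity join (illustrative assumption, not a surveyed algorithm).

    Returns every pair (a, b), with a from r and b from s,
    whose distance is at most eps.
    """
    return [(a, b) for a, b in product(r, s) if distance(a, b) <= eps]

def abs_diff(a, b):
    """Distance between two numbers: absolute difference."""
    return abs(a - b)

# Two small numeric datasets; pairs within distance 1.0 are reported.
pairs = similarity_join([1.0, 5.0, 9.0], [1.5, 8.0, 20.0], abs_diff, eps=1.0)
print(pairs)
```

This sketch compares every pair of records, so its cost grows with the product of the two dataset sizes. That quadratic cost is what makes partitioned, distributed approaches of interest for large inputs.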

Keywords

Similarity joins, Big Data systems, Performance evaluation, MapReduce

Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  • Yasin N. Silva
    • 1
  • Jason Reed
    • 1
  • Kyle Brown
    • 1
  • Adelbert Wadsworth
    • 1
  • Chuitian Rong
    • 1
  1. Arizona State University, Glendale, USA
