A Theoretical and Experimental Comparison of Filter-Based Equijoins in MapReduce

  • Thuong-Cang Phan
  • Laurent d’Orazio
  • Philippe Rigaux
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9620)


MapReduce has become an increasingly popular framework for large-scale data processing. However, complex operations such as joins are quite expensive and require sophisticated techniques. In this paper, we review state-of-the-art strategies for joining several relations in a MapReduce environment and study their extension with filter-based approaches. The general objective of filters is to eliminate non-matching data as early as possible in order to reduce the I/O, communication and CPU costs. We examine the impact of systematically adding filters as early as possible in MapReduce join algorithms, both analytically with cost models and practically with evaluations. The study covers binary joins, multi-way joins and recursive joins, and addresses the case of large inputs that gives rise to the most intricate challenges.
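The idea of filtering out non-matching data early can be illustrated with a minimal single-machine sketch. It is not the authors' algorithm or cost model: the `BloomFilter` class, the `filtered_equijoin` function, and the parameters `num_bits` and `num_hashes` are hypothetical names chosen for illustration, and the "map side" and "reduce side" are simulated with plain Python lists. The sketch builds a Bloom filter on the join keys of one relation, uses it to discard tuples of the other relation that cannot match, and then runs an ordinary hash join on the survivors.

```python
import hashlib


class BloomFilter:
    """A tiny Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for simplicity

    def _positions(self, key):
        # Derive num_hashes bit positions by salting a SHA-256 digest.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def __contains__(self, key):
        return all(self.bits[pos] for pos in self._positions(key))


def filtered_equijoin(r, s):
    """Join r and s on their first field, pre-filtering s with a filter on r's keys.

    Returns the joined tuples and the number of s tuples that passed the filter.
    """
    bf = BloomFilter()
    for key, _ in r:
        bf.add(key)

    # Simulated map side: drop tuples of s whose key cannot match any key of r.
    survivors = [(k, v) for k, v in s if k in bf]

    # Simulated reduce side: a plain hash join, which also discards
    # any false positives that slipped through the filter.
    index = {}
    for k, v in r:
        index.setdefault(k, []).append(v)
    joined = [(k, rv, sv) for k, sv in survivors for rv in index.get(k, [])]
    return joined, len(survivors)


if __name__ == "__main__":
    r = [(1, "a"), (2, "b")]
    s = [(2, "x"), (3, "y"), (2, "z")]
    joined, kept = filtered_equijoin(r, s)
    print(joined, kept)
```

Because a Bloom filter can report false positives but never false negatives, the final join result is exact; only the number of tuples that survive the filter (and hence the data that would be shuffled) depends on the filter's size and hash count.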


Keywords: Big data, Cloud computing, Big data analysis, MapReduce, Equijoin, Bloom filter, Intersection Bloom filter



Copyright information

© Springer-Verlag Berlin Heidelberg 2016

Authors and Affiliations

  • Thuong-Cang Phan (1)
  • Laurent d’Orazio (1)
  • Philippe Rigaux (2)

  1. Blaise Pascal University, CNRS-UMR 6158-LIMOS, Clermont-Ferrand, France
  2. CNAM, CEDRIC, Paris, France
