Encyclopedia of Big Data Technologies

2019 Edition
Editors: Sherif Sakr, Albert Y. Zomaya

Parallel Join Algorithms in MapReduce

  • Spyros Blanas
Reference work entry
DOI: https://doi.org/10.1007/978-3-319-77525-8_206

Definitions

The MapReduce framework is often used to analyze large volumes of unstructured and semi-structured data. A common analysis pattern combines a massive file that describes events (commonly in the form of a log) with much smaller reference datasets. This analytical operation corresponds to a parallel join. Parallel joins have been studied extensively in data management research, and many algorithms for relational database management systems are tailored to exploit properties of the input or of the analysis. However, the MapReduce framework was designed to operate on a single input and is cumbersome for join processing, which combines multiple inputs. As a consequence, a new class of parallel join algorithms has been designed, implemented, and optimized specifically for the MapReduce framework.
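The join pattern described above can be illustrated with a minimal sketch of one common approach, a reduce-side (repartition-style) join. The sketch simulates the map, shuffle, and reduce phases in a single Python process; it does not use the Hadoop API, and it is not necessarily the algorithm this entry goes on to describe. The record layouts, the tags "L" and "R", and all function names are hypothetical and chosen only for illustration. Because the framework reads a single input, each record is tagged with its source so that the reducer can tell log records from reference records.

```python
from collections import defaultdict
from itertools import product

# Hypothetical inputs: a large event log and a small reference table.
LOG = [
    {"user_id": 1, "url": "/home"},
    {"user_id": 2, "url": "/cart"},
    {"user_id": 1, "url": "/checkout"},
    {"user_id": 3, "url": "/home"},  # no matching reference row
]
USERS = [
    {"user_id": 1, "country": "US"},
    {"user_id": 2, "country": "GR"},
]


def map_phase(records, tag, key):
    """Emit (join key, (source tag, record)) so both inputs share one job."""
    for rec in records:
        yield rec[key], (tag, rec)


def shuffle(pairs):
    """Group map output by join key, as happens between map and reduce."""
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups


def reduce_phase(tagged):
    """Split one key's records by source tag and emit every matching pair."""
    log_rows = [rec for tag, rec in tagged if tag == "L"]
    ref_rows = [rec for tag, rec in tagged if tag == "R"]
    for log_rec, ref_rec in product(log_rows, ref_rows):
        yield {**log_rec, **ref_rec}


def repartition_join(log, ref, key):
    """Inner equi-join of log and ref on key, simulated in one process."""
    pairs = list(map_phase(log, "L", key)) + list(map_phase(ref, "R", key))
    out = []
    for tagged in shuffle(pairs).values():
        out.extend(reduce_phase(tagged))
    return out


joined = repartition_join(LOG, USERS, "user_id")
```

In this sketch every record of both inputs passes through the shuffle, even though the reference dataset is small. That cost is one reason join processing in MapReduce is considered cumbersome and why specialized algorithms have been developed.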

Overview

Since its introduction, the MapReduce framework (Dean and Ghemawat 2004) has become extremely popular for analyzing large datasets. The success of MapReduce stems from...

References

  1. Abouzeid A, Bajda-Pawlikowski K, Abadi DJ, Rasin A, Silberschatz A (2009) HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads. PVLDB 2(1):922–933. http://www.vldb.org/pvldb/2/vldb09-861.pdf
  2. Abouzied A, Abadi DJ, Silberschatz A (2013) Invisible loading: access-driven data transfer from raw files into database systems. In: EDBT
  3. Afrati FN, Ullman JD (2010) Optimizing joins in a map-reduce environment. In: Proceedings of the 13th international conference on extending database technology, EDBT ’10. ACM, New York, pp 99–110. http://doi.acm.org/10.1145/1739041.1739056
  4. Alagiannis I, Borovica R, Branco M, Idreos S, Ailamaki A (2012) NoDB: efficient query execution on raw data files. In: SIGMOD
  5. AsterixDB (2017) Apache AsterixDB. https://asterixdb.apache.org/. Accessed Dec 2017
  6. Avro (2017) Apache Avro. https://avro.apache.org/. Accessed Dec 2017
  7. Bajda-Pawlikowski K, Abadi DJ, Silberschatz A, Paulson E (2011) Efficient processing of data warehousing queries in a split execution environment. In: Proceedings of the 2011 ACM SIGMOD international conference on management of data, SIGMOD ’11. ACM, New York, pp 1165–1176. http://doi.acm.org/10.1145/1989323.1989447
  8. Bhatotia P, Wieder A, Rodrigues R, Acar UA, Pasquin R (2011) Incoop: MapReduce for incremental computations. In: Proceedings of the 2nd ACM symposium on cloud computing, SOCC ’11. ACM, New York, pp 7:1–7:14. http://doi.acm.org/10.1145/2038916.2038923
  9. Blanas S, Patel JM, Ercegovac V, Rao J, Shekita EJ, Tian Y (2010) A comparison of join algorithms for log processing in MapReduce. In: ACM SIGMOD. http://doi.acm.org/10.1145/1807167.1807273
  10. Blanas S, Wu K, Byna S, Dong B, Shoshani A (2014) Parallel data analysis directly on scientific file formats. In: SIGMOD
  11. Carbone P, Katsifodimos A, Ewen S, Markl V, Haridi S, Tzoumas K (2015) Apache Flink: stream and batch processing in a single engine. IEEE Data Eng Bull 38(4):28–38. http://sites.computer.org/debull/A15dec/p28.pdf
  12. Cheng Y, Rusu F (2015) SCANRAW: a database meta-operator for parallel in-situ processing and loading. TODS 40(3)
  13. Dean J, Ghemawat S (2004) MapReduce: simplified data processing on large clusters. In: Proceedings of the 6th conference on symposium on operating systems design & implementation – volume 6, OSDI’04. USENIX Association, Berkeley, pp 10–10. http://dl.acm.org/citation.cfm?id=1251254.1251264
  14. DeWitt DJ, Halverson A, Nehme R, Shankar S, Aguilar-Saborit J, Avanes A, Flasza M, Gramling J (2013) Split query processing in Polybase. In: Proceedings of the 2013 ACM SIGMOD international conference on management of data, SIGMOD ’13. ACM, New York, pp 1255–1266. http://doi.acm.org/10.1145/2463676.2463709
  15. Drill (2017) Apache Drill. https://drill.apache.org/. Accessed Dec 2017
  16. Eltabakh MY, Tian Y, Özcan F, Gemulla R, Krettek A, McPherson J (2011) CoHadoop: flexible data placement and its exploitation in Hadoop. PVLDB 4(9):575–585. http://www.vldb.org/pvldb/vol4/p575-eltabakh.pdf
  17. Floratou A, Patel JM, Shekita EJ, Tata S (2011) Column-oriented storage techniques for MapReduce. PVLDB 4(7):419–429. http://www.vldb.org/pvldb/vol4/p419-floratou.pdf
  18. Hadoop (2017) Apache Hadoop. https://hadoop.apache.org/. Accessed Dec 2017
  19. He Y, Lee R, Huai Y, Shao Z, Jain N, Zhang X, Xu Z (2011) RCFile: a fast and space-efficient data placement structure in MapReduce-based warehouse systems. In: Proceedings of the 27th international conference on data engineering, ICDE 2011, 11–16 Apr 2011, Hannover, pp 1199–1208. https://doi.org/10.1109/ICDE.2011.5767933
  20. Herodotou H, Babu S (2011) Profiling, what-if analysis, and cost-based optimization of MapReduce programs. PVLDB 4(11):1111–1122. http://www.vldb.org/pvldb/vol4/p1111-herodotou.pdf
  21. Hive (2017) Apache Hive. https://hive.apache.org/. Accessed Dec 2017
  22. Impala (2017) Apache Impala. https://impala.apache.org/. Accessed Dec 2017
  23. Liu F, Blanas S (2015) Forecasting the cost of processing multi-join queries via hashing for main-memory databases. In: Proceedings of the sixth ACM symposium on cloud computing, SoCC 2015, Kohala Coast, 27–29 Aug 2015, pp 153–166. http://doi.acm.org/10.1145/2806777.2806944
  24. Okcan A, Riedewald M (2011) Processing theta-joins using MapReduce. In: Proceedings of the 2011 ACM SIGMOD international conference on management of data, SIGMOD ’11. ACM, New York, pp 949–960. http://doi.acm.org/10.1145/1989323.1989423
  25. Parquet (2017) Apache Parquet. https://parquet.apache.org. Accessed Dec 2017
  26. Quickstep (2017) Apache Quickstep. https://quickstep.incubator.apache.org/. Accessed Dec 2017
  27. Spark (2017) Apache Spark. https://spark.apache.org/. Accessed Dec 2017
  28. Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, Franklin MJ, Shenker S, Stoica I (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX conference on networked systems design and implementation, NSDI’12. USENIX Association, Berkeley, pp 2–12. http://dl.acm.org/citation.cfm?id=2228298.2228301

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Computer Science and Engineering, The Ohio State University, Columbus, USA