Advertisement

Improvement of Join Algorithms for Low-Selectivity Joins on MapReduce

  • Akiyoshi MatonoEmail author
  • Hirotaka Ogawa
  • Isao Kojima
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9093)

Abstract

So far, many studies on join operations on MapReduce have already been proposed. Some of those studies focus on the low-selectivity joins that are frequently used in query processing for datasets among several management domains, such as those used with Linked Open Data. We found there is room for improvement of the state-of-the-art approach for the low-selectivity join on MapReduce called the per-split semi-join[5]. In this paper, we first thus extend the per-split semi-join to improve performance. Our approach can reduce three costs for low-selectivity joins: the amount of network traffic, the number of jobs, and the amount of disk I/O. Moreover, when the number of input relations is large, the selectivity becomes low and thus the effect of our proposed reductions is maximized. Therefore, we also propose an extension of the reduce-side join, which can easily apply three or more inputs, based on the extension of the per-split semi-join. In our experiments, we evaluated two comparisons: the per-split semi-join and our extension of it and the reduce-side join and our extension of it. The first experiment shows that our extension is better than its competitor in any case. In the second experiment, we found that our extension is superior to its competitor when the selectivity is low and the number of inputs is three or more.

Keywords

Query Processing Hash Table Computing Node Link Open Data Management Domain 
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. 1.
    Afrati, F.N., Ullman, J.D.: Optimizing joins in a map-reduce environment. In: Proceedings of the 13th International Conference on Extending Database Technology, EDBT 2010, pp. 99–110. ACM, New York (2010). http://doi.acm.org/10.1145/1739041.1739056
  2. 2.
    Andreas, C.: Designing a Parallel Query Engine over Map/Reduce. Master’s thesis, Informatics MSc, School of Informatics, University of Edinburgh (2010)Google Scholar
  3. 3.
    Atta, F.: Implementation and Analysis of Join Algorithms to handle skew for the Hadoop Map/Reduce Framework. Master’s thesis, Informatics MSc, School of Informatics, University of Edinburgh (2010)Google Scholar
  4. 4.
    Berners-Lee, T.: Linked data (August 27, 2006). http://www.w3.org/DesignIssues/LinkedData.html
  5. 5.
    Blanas, S., Patel, J.M., Ercegovac, V., Rao, J., Shekita, E.J., Tian, Y.: A comparison of join algorithms for log processing in mapreduce. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, SIGMOD 2010, pp. 975–986. ACM, New York (2010). http://doi.acm.org/10.1145/1807167.1807273
  6. 6.
    Bloom, B.H.: Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13(7), 422–426 (1970). http://doi.acm.org/10.1145/362686.362692
  7. 7.
    Chandar, J.: Join Algorithms using Map/Reduce. Master’s thesis, Informatics MSc, School of Informatics, University of Edinburgh (2010)Google Scholar
  8. 8.
    Dean, J., Ghemawat, S.: Mapreduce: Simplified data processing on large clusters. In: Proceedings of the 6th Symposium on Operating System Design & Implementation, OSDI 2004, vol. 6, p. 10. USENIX Association, Berkeley (2004). http://dl.acm.org/citation.cfm?id=1251254.1251264
  9. 9.
    DeWitt, D.J., Gerber, R.H.: Multiprocessor hash-based join algorithms. In: Proceedings of the 11th International Conference on Very Large Data Bases, VLDB 1985, vol. 11, pp. 151–164. VLDB Endowment (1985). http://dl.acm.org/citation.cfm?id=1286760.1286774
  10. 10.
    Gates, A.: Programming Pig. O’Reilly, Media (September 2011)Google Scholar
  11. 11.
    Jadhav, V., Aghav, J., Dorwani, S.: Join algorithms using mapreduce: a survey. In: International Conference on Electrical Engineering and Computer Science, ICETACS 2013, Coimbatore, pp. 40–44, April 2013Google Scholar
  12. 12.
    Radu, V.: Application. In: Radu, V. (ed.) Stochastic Modeling of Thermal Fatigue Crack Growth. ACM, vol. 1, pp. 63–70. Springer, Heidelberg (2015) Google Scholar
  13. 13.
    Lee, K.H., Lee, Y.J., Choi, H., Chung, Y.D., Moon, B.: Parallel data processing with mapreduce: A survey. SIGMOD Rec. 40(4), 11–20 (2012). http://doi.acm.org/10.1145/2094114.2094118
  14. 14.
    Lee, T., Kim, K., Kim, H.J.: Join processing using bloom filter in mapreduce. In: Proceedings of the 2012 ACM Research in Applied Computation Symposium, RACS 2012, pp. 100–105. ACM, New York (2012). http://doi.acm.org/10.1145/2401603.2401626
  15. 15.
    Luo, G., Dong, L.: Adaptive join plan generation in hadoop. Tech. rep., Duke University (2010), cPS296.1 Course ProjectGoogle Scholar
  16. 16.
    Pigul, A.: Generalized parallel join algorithms and designing cost models. In: Proceedings of the Spring Young Researcher’s Colloquium On Database and Information Systems, SYRCoDIS 2012, pp. 29–40 (2012)Google Scholar
  17. 17.
    White, T.: Hadoop: The Definitive Guide. O’Reilly Media / Yahoo Press, 3rd edn. (May 2012)Google Scholar
  18. 18.
    Zhang, C., Wu, L., Li, J.: Efficient processing distributed joins with bloomfilter using mapreduce. International Journal of Grid and Distributed Computing 6(3), 43–58 (2013)Google Scholar

Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  1. 1.National Institute of Advanced Industrial Science and TechnologyIbarakiJapan

Personalised recommendations