Abstract
MapReduce is an important programming model for processing big data with parallel, distributed algorithms on a cluster. In big data analytics applications, the equi-join is an important operation. However, performing an equi-join in MapReduce is inefficient when multiple datasets are involved in the join. In this paper, a time cost evaluation model for equi-joins is extended to take the time cost of calculation into account. In addition, the sub-joins of an equi-join are classified into star pattern sub-joins on a single attribute and chain pattern sub-joins. Based on the extended model, optimization methods are presented, and the equi-join plan with the lower time cost is chosen. The optimization methods are as follows: first, the star pattern sub-joins on a single attribute are processed; next, the chain pattern sub-join with the smallest scale of intermediate results (i.e., the number of tuples in the intermediate results) is processed; finally, each chain pattern sub-join is decomposed into either several MapReduce jobs or a single MapReduce job by dynamic programming to obtain an optimal scheme for that sub-join. We conducted extensive experiments, and the results show that our method is more efficient than existing methods such as MDMJ, Hive, and Pig.
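The dynamic-programming step for a chain pattern sub-join can be illustrated with a minimal sketch. It is not the paper's algorithm or cost model, neither of which is given in the abstract. As a stand-in, it assumes that the cost of a two-way join job is the number of tuples it reads, that the size of an intermediate result can be estimated from per-join selectivities, and that only cascades of two-way jobs are considered (the single-job option mentioned above is left out). The names `plan_chain_join`, `sizes`, and `selectivities` are hypothetical.

```python
from functools import lru_cache


def plan_chain_join(sizes, selectivities):
    """Pick a split order for a chain join R0 - R1 - ... - Rn-1.

    sizes[i]         -- number of tuples in dataset i (assumed known)
    selectivities[i] -- assumed fraction of sizes[i] * sizes[i + 1] pairs
                        that survive the join between datasets i and i + 1
    Returns (estimated_cost, plan_string).
    """
    n = len(sizes)
    if len(selectivities) != n - 1:
        raise ValueError("need one selectivity per adjacent pair")

    def result_size(i, j):
        # Estimated tuples produced by joining datasets i..j (inclusive).
        size = sizes[i]
        for k in range(i + 1, j + 1):
            size *= sizes[k] * selectivities[k - 1]
        return size

    @lru_cache(maxsize=None)
    def best(i, j):
        # Cheapest way to join the sub-chain i..j under the toy cost model.
        if i == j:
            return 0.0, str(i)
        candidates = []
        for k in range(i, j):
            left_cost, left_plan = best(i, k)
            right_cost, right_plan = best(k + 1, j)
            # Toy cost of the final job: tuples read from both inputs.
            job_cost = result_size(i, k) + result_size(k + 1, j)
            total = left_cost + right_cost + job_cost
            candidates.append((total, f"({left_plan} JOIN {right_plan})"))
        return min(candidates)

    return best(0, n - 1)
```

Calling `plan_chain_join([100, 1000, 10], [0.5, 0.25])` compares the two possible split points for a three-dataset chain and returns the cheaper one under this toy model; a real plan would substitute the extended time cost evaluation model for `job_cost`.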
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Zhu, H., Xia, L., Xie, M., Yan, K. (2014). Equi-join for Multiple Datasets Based on Time Cost Evaluation Model. In: Sun, Xh., et al. Algorithms and Architectures for Parallel Processing. ICA3PP 2014. Lecture Notes in Computer Science, vol 8631. Springer, Cham. https://doi.org/10.1007/978-3-319-11194-0_10
DOI: https://doi.org/10.1007/978-3-319-11194-0_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-11193-3
Online ISBN: 978-3-319-11194-0
eBook Packages: Computer Science (R0)