Equi-join for Multiple Datasets Based on Time Cost Evaluation Model

Zhu, Hong; Xia, Libo; Xie, Mieyi; Yan, Ke

doi:10.1007/978-3-319-11194-0_10

Hong Zhu²⁵,
Libo Xia²⁵,
Mieyi Xie²⁵ &
…
Ke Yan²⁵

Part of the book series: Lecture Notes in Computer Science ((LNTCS,volume 8631))

Included in the following conference series:

International Conference on Algorithms and Architectures for Parallel Processing

2626 Accesses

Abstract

MapReduce is an important programming model for processing big data with a parallel, distributed algorithm on a cluster. In big data analytic application, equi-join is an important operation. However, it is inefficient to perform equi-join operations in MapReduce when multiple datasets are involved in the join. In this paper, a time cost evaluation model is extended for an equi-join by considering the time cost of calculation. In addition, the sub-joins in an equi-join are classified into star pattern sub-joins on single attribute and chain pattern sub-joins. Based on the extended model, optimization methods are presented and an equi-join plan with lower time cost is chosen for the equi-join. The optimization methods include: the star pattern sub-joins on one attribute are first processed; next, a chain pattern sub-join with minimal scale of intermediate results (i.e. the number of tuples in intermediate results) is processed; at last, a chain pattern sub-join is decomposed into several MapReduce jobs or single MapReduce job by dynamic programming to obtain an optimal scheme for the chain pattern sub-join. We conducted extensive experiments, and the results show that our method is more efficient than those methods such as MDMJ, Hive and Pig.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 39.99; Price excludes VAT (USA)

Softcover Book: USD 54.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

References

Afrati, F.N., Ullman, J.D.: Optimizing joins in a map-reduce environment. In: Proceedings of the 13th International Conference on Extending Database Technology, pp. 99–110. ACM (2010)
Google Scholar
Blanas, S., Patel, J.M., Ercegovac, V., Rao, J., Shekita, E.J., Tian, Y.: A comparison of join algorithms for log processing in mapreduce. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, pp. 975–986. ACM (2010)
Google Scholar
Dean, J., Ghemawat, S.: Mapreduce: simplified data processing on large clusters. Communications of the ACM 51(1), 107–113 (2008)
Article Google Scholar
Garcia-Molina, H., Ullman, J.D., Widom, J.: Database system implementation, vol. 654. Prentice Hall, Upper Saddle River (2000)
Google Scholar
Lee, K.H., Lee, Y.J., Choi, H., Chung, Y.D., Moon, B.: Parallel data processing with mapreduce: A survey. AcM sIGMoD Record 40(4), 11–20 (2012)
Article Google Scholar
Lee, T., Kim, K., Kim, H.J.: Exploiting bloom filters for efficient joins in mapreduce. Information an International Interdisciplinary Journal 16(8), 5869–5885 (2013)
Google Scholar
Olston, C., Reed, B., Srivastava, U., Kumar, R., Tomkins, A.: Pig latin: A not-so-foreign language for data processing. In: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pp. 1099–1110. ACM (2008)
Google Scholar
Slagter, K., Hsu, C.-H., Chung, Y.-C., Park, J.H.: Network-aware multiway join for mapreduce. In: Park, J.J(J.H.), Arabnia, H.R., Kim, C., Shi, W., Gil, J.-M. (eds.) GPC 2013. LNCS, vol. 7861, pp. 73–80. Springer, Heidelberg (2013)
Chapter Google Scholar
Thusoo, A., Sarma, J.S., Jain, N., Shao, Z., Chakka, P., Anthony, S., Liu, H., Wyckoff, P., Murthy, R.: Hive: A warehousing solution over a map-reduce framework. Proceedings of the VLDB Endowment 2(2), 1626–1629 (2009)
Article Google Scholar
Yang, H.C., Dasdan, A., Hsiao, R.L., Parker, D.S.: Map-reduce-merge: Simplified relational data processing on large clusters. In: Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data, pp. 1029–1040. ACM (2007)
Google Scholar
Zhang, X., Chen, L., Wang, M.: Efficient multi-way theta-join processing using mapreduce. Proceedings of the VLDB Endowment 5(11), 1184–1195 (2012)
Article MathSciNet Google Scholar

Download references

Author information

Authors and Affiliations

School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei, 430074, P.R. China
Hong Zhu, Libo Xia, Mieyi Xie & Ke Yan

Authors

Hong Zhu
View author publications
You can also search for this author in PubMed Google Scholar
Libo Xia
View author publications
You can also search for this author in PubMed Google Scholar
Mieyi Xie
View author publications
You can also search for this author in PubMed Google Scholar
Ke Yan
View author publications
You can also search for this author in PubMed Google Scholar

Editor information

Editors and Affiliations

Department of Computer Science, Illinois Institute of Technology, 60616-3793, Chicago, IL, USA
Xian-he Sun
School of Computer Science and Technology, Dalian Maritime University, 1 Linghai Road, 116026, Dalian, China
Wenyu Qu
SEECS, University of Ottawa, 8, King Edward Ave, K1N 6N5, Ottawa, ON, Canada
Ivan Stojmenovic
Deakin University, 221 Burwood Highway, 3125, Burwood, VIC, Australia
Wanlei Zhou
Dalian Maritime University, NO.1 Linhai Road Dailian, 116026, China
Zhiyang Li
BeiHang University, XueYuan Road No.37, HaiDian District, Beijing, China
Hua Guo
University of Bradford, BD7 1DP, Bradford, West Yorkshire, United Kingdom
Geyong Min
Dalian Maritime University, NO.1 Linhai Road Dailian, China, 116026
Tingting Yang
Computer Network Information Center, Chinese Academy of Sciences, 100190, Beijing, China
Yulei Wu
Shandong University, 27 Shanda Nanlu, 250100, Jinan City, Shandong Province, China
Lei Liu

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Zhu, H., Xia, L., Xie, M., Yan, K. (2014). Equi-join for Multiple Datasets Based on Time Cost Evaluation Model. In: Sun, Xh., et al. Algorithms and Architectures for Parallel Processing. ICA3PP 2014. Lecture Notes in Computer Science, vol 8631. Springer, Cham. https://doi.org/10.1007/978-3-319-11194-0_10

Download citation

DOI: https://doi.org/10.1007/978-3-319-11194-0_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-11193-3
Online ISBN: 978-3-319-11194-0
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics