An Adaptive Skew Handling Join Algorithm for Large-scale Data Analysis

Wu, Di; Wang, Tengjiao; Chen, Yuxin; Li, Shun; Li, Hongyan; Lei, Kai

doi:10.1007/978-3-319-21042-1_35

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9098)

Included in the following conference series:

International Conference on Web-Age Information Management

Abstract

Joins play an essential role in large-scale data analysis, but their performance is severely degraded by data skew. Existing approaches cannot handle data skew adaptively while also reducing communication cost. To address these problems, we first propose a mixed data structure comprising a Bloom filter and a histogram (BFH). Based on BFH, we propose Bloom Filter and Histogram Join (BFHJ) to handle data skew adaptively. BFHJ reduces communication cost by filtering out unnecessary records. Furthermore, BFHJ adopts a heuristic partitioning strategy to balance the workload. Experiments on TPC-H demonstrate that BFHJ outperforms state-of-the-art methods in terms of communication cost, load balance, and query time.
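The abstract does not spell out how BFH or the partitioning heuristic is built, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a Bloom filter built over the join keys of the smaller relation, a per-key histogram over the records that pass the filter, and a simple greedy "heaviest key to the least-loaded worker" assignment standing in for the heuristic. All names (`BloomFilter`, `filter_and_count`, `greedy_partition`) and parameters (`num_bits`, `num_hashes`) are hypothetical.

```python
import hashlib
from collections import Counter


class BloomFilter:
    """Fixed-size Bloom filter; hash positions are derived from SHA-256."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, key):
        # One position per seed; the seed is mixed into the hashed string.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def __contains__(self, key):
        # May return a false positive, never a false negative.
        return all(self.bits[pos] for pos in self._positions(key))


def filter_and_count(records, bloom):
    """Drop records whose key cannot match; count the surviving keys."""
    kept = [rec for rec in records if rec[0] in bloom]
    histogram = Counter(key for key, _ in kept)
    return kept, histogram


def greedy_partition(histogram, num_workers):
    """Assumed heuristic: heaviest key first, to the least-loaded worker."""
    loads = [0] * num_workers
    assignment = {}
    ordered = sorted(histogram.items(), key=lambda kv: (-kv[1], str(kv[0])))
    for key, count in ordered:
        target = loads.index(min(loads))
        assignment[key] = target
        loads[target] += count
    return assignment, loads


# Toy data: (join_key, payload) pairs; key 1 is skewed, key 99 has no match.
small = [(1, "a"), (2, "b"), (3, "c")]
large = [(1, "x")] * 6 + [(2, "y")] * 3 + [(3, "z")] * 3 + [(99, "w")] * 4

bloom = BloomFilter()
for key, _ in small:
    bloom.add(key)

kept, histogram = filter_and_count(large, bloom)
assignment, loads = greedy_partition(histogram, num_workers=2)
```

In this sketch a whole key always goes to a single worker, so one extremely hot key can still dominate a worker's load. Splitting such a key across workers would need additional logic that the abstract does not describe.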

This work was supported by the Natural Science Foundation of China (Grant No. 61300003), the Specialized Research Fund for the Doctoral Program of Higher Education (Grant No. 20130001120001), and the Ministry of Education & China Mobile Joint Research Fund Program (MCM20130361).

Author information

Authors and Affiliations

School of Electronics and Computer Engineering (ECE), Peking University, Shenzhen, 518055, China
Di Wu, Tengjiao Wang & Kai Lei
School of Electronics Engineering and Computer Science, Peking University, Beijing, 100871, China
Tengjiao Wang, Yuxin Chen & Hongyan Li
Key Laboratory of High Confidence Software Technologies, Peking University, Ministry of Education, Beijing, 100871, China
Di Wu, Tengjiao Wang & Yuxin Chen
Key Laboratory of Machine Perception, Peking University, Ministry of Education, Beijing, 100871, China
Hongyan Li
University of International Relations, Beijing, 100091, China
Shun Li

Corresponding author

Correspondence to Tengjiao Wang.

Editor information

Editors and Affiliations

Google, CA, USA
Xin Luna Dong
Postdoc Apartments (Hong Lou) 4-1-4, Shandong University, Li Cheng, Jinan, China
Xiaohui Yu
Tsinghua University, Beijing, China
Jian Li
Northeastern University, Boston, USA
Yizhou Sun

Cite this paper

Wu, D., Wang, T., Chen, Y., Li, S., Li, H., Lei, K. (2015). An Adaptive Skew Handling Join Algorithm for Large-scale Data Analysis. In: Dong, X., Yu, X., Li, J., Sun, Y. (eds) Web-Age Information Management. WAIM 2015. Lecture Notes in Computer Science, vol 9098. Springer, Cham. https://doi.org/10.1007/978-3-319-21042-1_35

DOI: https://doi.org/10.1007/978-3-319-21042-1_35
Published: 06 June 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-21041-4
Online ISBN: 978-3-319-21042-1
eBook Packages: Computer Science, Computer Science (R0)
