An Adaptive Skew Handling Join Algorithm for Large-scale Data Analysis

  • Conference paper
Web-Age Information Management (WAIM 2015)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9098)

Abstract

Joins play an essential role in large-scale data analysis, but their performance is severely degraded by data skew. Existing approaches cannot both handle data skew adaptively and reduce communication cost. To address these problems, we first propose a mixed data structure comprising a Bloom filter and a histogram (BFH). Based on BFH, we propose Bloom Filter and Histogram Join (BFHJ) to handle data skew adaptively. BFHJ reduces communication cost by filtering out unnecessary records. Furthermore, BFHJ adopts a heuristic partitioning strategy to balance workload. Experiments on TPC-H demonstrate that BFHJ outperforms state-of-the-art methods in terms of communication cost, load balance, and query time.
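
The abstract gives only a high-level description, so the following is a minimal illustrative sketch of the general idea and not the authors' BFHJ implementation. It assumes a summary made of a Bloom filter plus a key-frequency histogram built over one relation, uses the filter to drop records of the other relation that cannot join, and uses a greedy "heaviest key to least-loaded partition" rule as a stand-in for the unspecified heuristic partitioning strategy. All names (`BloomFilter`, `build_summary`, `filter_records`, `assign_partitions`) and parameters (filter size, number of hashes) are hypothetical.

```python
import hashlib
from collections import Counter


class BloomFilter:
    """Fixed-size Bloom filter using salted SHA-256 hashes (sizes are assumptions)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, key):
        # Derive num_hashes bit positions by salting the key with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def __contains__(self, key):
        # True for every added key; may also be True for others (false positives).
        return all(self.bits[pos] for pos in self._positions(key))


def build_summary(records, key_of):
    """Build a Bloom filter and a key-frequency histogram over one relation."""
    bloom = BloomFilter()
    histogram = Counter()
    for rec in records:
        key = key_of(rec)
        bloom.add(key)
        histogram[key] += 1
    return bloom, histogram


def filter_records(records, key_of, bloom):
    """Drop records whose join key is definitely absent from the other relation."""
    return [rec for rec in records if key_of(rec) in bloom]


def assign_partitions(histogram, num_partitions):
    """Greedy stand-in heuristic: heaviest key first, to the least-loaded partition."""
    loads = [0] * num_partitions
    assignment = {}
    ordered = sorted(histogram.items(), key=lambda kv: (-kv[1], str(kv[0])))
    for key, count in ordered:
        target = loads.index(min(loads))
        assignment[key] = target
        loads[target] += count
    return assignment, loads


if __name__ == "__main__":
    # Toy skewed relation: key "c1" dominates.
    orders = [("c1", i) for i in range(6)] + [("c2", 10), ("c3", 11), ("c4", 12)]
    customers = [("c1", "Ann"), ("c2", "Bo"), ("c9", "Zed")]
    bloom, hist = build_summary(orders, lambda r: r[0])
    print(filter_records(customers, lambda r: r[0], bloom))
    print(assign_partitions(hist, 2))
```

Note that this greedy rule never splits a single heavy key across partitions, so one dominant key still bounds the achievable balance; whether and how BFHJ handles that case is not stated in the abstract.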

This work was supported by the Natural Science Foundation of China (Grant No. 61300003), the Specialized Research Fund for the Doctoral Program of Higher Education (Grant No. 20130001120001), and the Ministry of Education & China Mobile Joint Research Fund Program (MCM20130361).

Corresponding author

Correspondence to Tengjiao Wang.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Wu, D., Wang, T., Chen, Y., Li, S., Li, H., Lei, K. (2015). An Adaptive Skew Handling Join Algorithm for Large-scale Data Analysis. In: Dong, X., Yu, X., Li, J., Sun, Y. (eds) Web-Age Information Management. WAIM 2015. Lecture Notes in Computer Science, vol 9098. Springer, Cham. https://doi.org/10.1007/978-3-319-21042-1_35

  • DOI: https://doi.org/10.1007/978-3-319-21042-1_35

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-21041-4

  • Online ISBN: 978-3-319-21042-1

  • eBook Packages: Computer Science, Computer Science (R0)
