Research on the Optimization of Spark Big Table Equal Join

Wang, Suzhen; Zhang, Lu; Zhang, Yanpiao

doi:10.1007/978-3-030-24265-7_37

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 11633)

Included in the following conference series:

International Conference on Artificial Intelligence and Security

Abstract

The equal join (equi-join) of big tables is one of the key operations Spark performs when processing large-scale data. However, when Spark joins large tables, both the network transmission overhead and the I/O cost are high. This paper therefore proposes an optimized Spark large table join method. First, it proposes a Split Compressed Bloom Filter algorithm that is suitable for filtering data sets of unknown volume. Then, a Maxdiff histogram is used to analyze the data distribution of the tables to be joined and to identify the skewed data in the data set. The RDD is split according to the statistical results, each part is joined with a suitable join algorithm, and the sub-results are combined to obtain the final result. Experiments show that the proposed method has clear advantages over Spark's original method in shuffle write, shuffle read, and task running time.
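
The steps named in the abstract (filter, detect skew, split, join, combine) can be illustrated with a small Spark-free sketch in plain Python. It is not the authors' implementation: a plain Bloom filter stands in for the Split Compressed Bloom Filter, a simple frequency threshold stands in for the Maxdiff histogram, and in-memory lists of (key, value) pairs stand in for RDDs. All names (`BloomFilter`, `hash_join`, `optimized_equi_join`, `skew_threshold`) are hypothetical.

```python
import hashlib
from collections import Counter, defaultdict


class BloomFilter:
    """Plain Bloom filter (assumption: NOT the paper's Split Compressed variant)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, key):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means "definitely absent"; True may be a false positive.
        return all(self.bits[pos] for pos in self._positions(key))


def hash_join(left, right):
    """Equi-join two lists of (key, value) pairs on key."""
    index = defaultdict(list)
    for key, value in right:
        index[key].append(value)
    return [(k, lv, rv) for k, lv in left for rv in index.get(k, [])]


def optimized_equi_join(left, right, skew_threshold=3):
    # Step 1: build a filter over the right table's keys and drop
    # left rows whose key cannot have a match.
    bloom = BloomFilter()
    for key, _ in right:
        bloom.add(key)
    left = [(k, v) for k, v in left if bloom.might_contain(k)]

    # Step 2: find skewed keys. Assumption: a frequency threshold
    # replaces the Maxdiff histogram used in the paper.
    counts = Counter(k for k, _ in left)
    skewed = {k for k, c in counts.items() if c >= skew_threshold}

    # Step 3: split both tables into a skewed part and the rest.
    left_skew = [r for r in left if r[0] in skewed]
    left_rest = [r for r in left if r[0] not in skewed]
    right_skew = [r for r in right if r[0] in skewed]
    right_rest = [r for r in right if r[0] not in skewed]

    # Step 4: join each part and combine the sub-results. A real system
    # might choose a different join strategy per part; this sketch uses
    # the same in-memory hash join for both.
    return hash_join(left_skew, right_skew) + hash_join(left_rest, right_rest)
```

Because the hash join rechecks every key, Bloom filter false positives only cost extra work and never add wrong rows, so the combined result matches a direct join of the two inputs.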

Acknowledgements

This paper is partially supported by the Education Technology Foundation of the Ministry of Education (No. 2017A01020), the Major Project of the Hebei Province Education Department (No. 2017GJJG083), and the Graduate Innovation Program of Hebei University of Economics and Business in 2018.

Author information

Authors and Affiliations

Hebei University of Economics and Business, Shijiazhuang, 050061, Hebei, China
Suzhen Wang, Lu Zhang & Yanpiao Zhang

Corresponding author

Correspondence to Suzhen Wang.

Editor information

Editors and Affiliations

Nanjing University of Information Science and Technology, Nanjing, China
Xingming Sun
Nanjing University of Information Science and Technology, Nanjing, China
Zhaoqing Pan
Purdue University, West Lafayette, IN, USA
Elisa Bertino

About this paper

Cite this paper

Wang, S., Zhang, L., Zhang, Y. (2019). Research on the Optimization of Spark Big Table Equal Join. In: Sun, X., Pan, Z., Bertino, E. (eds) Artificial Intelligence and Security. ICAIS 2019. Lecture Notes in Computer Science, vol 11633. Springer, Cham. https://doi.org/10.1007/978-3-030-24265-7_37

DOI: https://doi.org/10.1007/978-3-030-24265-7_37
Published: 11 July 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-24264-0
Online ISBN: 978-3-030-24265-7
eBook Packages: Computer Science, Computer Science (R0)
