A Method to Identify Spark Important Parameters Based on Machine Learning

Li, Tianyu; Shi, Shengfei; Luo, Jizhou; Wang, Hongzhi

doi:10.1007/978-981-13-2203-7_42

Part of the book series: Communications in Computer and Information Science (CCIS, volume 901)

Included in the following conference series:

International Conference of Pioneering Computer Scientists, Engineers and Educators

Abstract

Apache Spark is one of the most popular open-source frameworks today; it processes large-scale data through an in-memory abstraction, the Resilient Distributed Dataset (RDD). Research on performance prediction and optimization for the Spark platform has grown rapidly. However, in most of this work the selection of important configuration parameters still depends on the experience of domain experts. Selecting configuration parameters with machine learning algorithms is therefore a non-trivial research problem. This paper proposes ISIP, a machine-learning-based method to identify important Spark parameters. By providing a relatively important subset of configuration parameters, ISIP reduces the parameter space for performance tuning on Spark, saving users and researchers time and effort. ISIP uses the Mean-shift algorithm to cluster applications from Spark MLlib by their workload characteristics. A regression algorithm then models the relationship between performance and the configuration parameters. For each type of application, ISIP provides a list of parameters ranked by importance; the parameters at the front of the list form the subset of most important configuration parameters. Experimental results show that tuning only the subset provided by ISIP has almost the same effect as tuning the complete parameter set.
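
The two stages named in the abstract (clustering applications by workload, then ranking parameters with a regression model) can be illustrated with a minimal scikit-learn sketch. This is not the authors' implementation: the data is synthetic, the choice of Lasso as the regression algorithm is an assumption (the abstract only says "Regression Algorithm"), the helper `rank_parameters` is hypothetical, and the four Spark parameter names are used purely as labels for synthetic columns.

```python
import numpy as np
from sklearn.cluster import MeanShift
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Stage 1: cluster applications by workload characteristics.
# Synthetic 2-D workload features forming two well-separated groups.
workload = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.1, size=(10, 2)),
    rng.normal(loc=[5.0, 5.0], scale=0.1, size=(10, 2)),
])
labels = MeanShift(bandwidth=1.0).fit_predict(workload)

# Stage 2: for one application type, model runtime as a function of the
# configuration parameters and rank parameters by coefficient magnitude.
PARAMS = [
    "spark.executor.memory",
    "spark.executor.cores",
    "spark.default.parallelism",
    "spark.shuffle.compress",
]

def rank_parameters(configs, runtimes, names, alpha=0.1):
    """Hypothetical helper: rank parameters by |Lasso coefficient|."""
    # Standardize so coefficient magnitudes are comparable across parameters.
    X = StandardScaler().fit_transform(configs)
    model = Lasso(alpha=alpha).fit(X, runtimes)
    importance = np.abs(model.coef_)
    order = np.argsort(importance)[::-1]  # most important first
    return [(names[i], float(importance[i])) for i in order]

# Synthetic runs: runtime depends strongly on column 0, weakly on column 2,
# and not at all on columns 1 and 3 (an assumption made for illustration).
configs = rng.uniform(0.0, 1.0, size=(200, 4))
runtimes = 10.0 * configs[:, 0] + 3.0 * configs[:, 2] + rng.normal(0.0, 0.1, 200)

ranking = rank_parameters(configs, runtimes, PARAMS)
top_subset = [name for name, _ in ranking[:2]]  # front of the ranked list
```

In a setting like the one the abstract describes, the second stage would be repeated once per cluster found in the first stage, giving one ranked list per application type. Lasso is convenient here because it shrinks the coefficients of uninformative parameters toward zero, but any regression model that exposes a per-feature importance could be substituted.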

Acknowledgement

This work is supported by the National Key Research and Development Program under No. 2016YFB1000703.

Author information

Authors and Affiliations

Harbin Institute of Technology, Xidazhi Str. 92, Harbin, 150001, China
Tianyu Li, Shengfei Shi, Jizhou Luo & Hongzhi Wang

Corresponding author

Correspondence to Shengfei Shi.

Editor information

Editors and Affiliations

Zhengzhou University, Zhengzhou, Henan, China
Qinglei Zhou
Zhengzhou University of Light Industry, Zhengzhou, Henan, China
Yong Gan
Northeast Forestry University, Harbin, China
Weipeng Jing
Harbin University of Science and Technology, Harbin, China
Xianhua Song
Zhengzhou Institute of Technology, Zhengzhou, China
Yan Wang
National Academy of Guo Ding Institute of Data Science, Beijing, China
Zeguang Lu

About this paper

Cite this paper

Li, T., Shi, S., Luo, J., Wang, H. (2018). A Method to Identify Spark Important Parameters Based on Machine Learning. In: Zhou, Q., Gan, Y., Jing, W., Song, X., Wang, Y., Lu, Z. (eds) Data Science. ICPCSEE 2018. Communications in Computer and Information Science, vol 901. Springer, Singapore. https://doi.org/10.1007/978-981-13-2203-7_42

DOI: https://doi.org/10.1007/978-981-13-2203-7_42
Published: 09 September 2018
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-2202-0
Online ISBN: 978-981-13-2203-7
eBook Packages: Computer Science, Computer Science (R0)
