A Method to Identify Spark Important Parameters Based on Machine Learning

  • Conference paper
Data Science (ICPCSEE 2018)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 901)

Abstract

Apache Spark is the most popular open-source framework today; it uses an in-memory-oriented abstraction, the Resilient Distributed Dataset (RDD), to process large-scale data. Research on performance prediction and optimization for the Spark platform has recently grown rapidly. However, in most of this work the selection of important configuration parameters still depends on the experience of domain experts. Selecting configuration parameters with machine learning algorithms is therefore a non-trivial research issue. This paper proposes ISIP, a machine-learning-based method to identify important Spark parameters. By providing a relatively important subset of configuration parameters, ISIP reduces the parameter space for performance tuning on Spark, saving the time and effort of users and researchers. ISIP uses the Mean-shift algorithm to cluster applications from Spark MLlib based on their workload characteristics. It then models the relationship between performance and the configuration parameters with a regression algorithm and provides, for each type of application, a list of parameters ranked by importance. The parameters at the front of the list form the subset of most important configuration parameters. The experimental results show that tuning only the subset of relatively important configuration parameters provided by ISIP has almost the same effect as tuning the complete parameter set.
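The two stages described in the abstract (cluster applications by workload characteristics, then rank parameters with a regression model) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a flat-kernel mean shift and, because the abstract does not name the regression algorithm, an L1-regularized (Lasso) linear regression solved by coordinate descent. The workload features, the parameter names `p0`, `p1`, `p2`, and the run times are all synthetic and hypothetical.

```python
import math
import random


def mean_shift(points, bandwidth, iters=50):
    """Flat-kernel mean shift: move each mode to the mean of the points
    within `bandwidth`, then merge modes that ended up close together."""
    modes = [list(p) for p in points]
    for _ in range(iters):
        for i, m in enumerate(modes):
            near = [p for p in points if math.dist(p, m) <= bandwidth]
            if near:
                modes[i] = [sum(c) / len(near) for c in zip(*near)]
    centers, labels = [], []
    for m in modes:
        for k, c in enumerate(centers):
            if math.dist(m, c) <= bandwidth / 2:
                labels.append(k)
                break
        else:
            centers.append(m)
            labels.append(len(centers) - 1)
    return labels


def lasso_rank(X, y, alpha=0.1, iters=200):
    """Fit a Lasso model on standardized columns by coordinate descent and
    return (column indices ordered by decreasing |coefficient|, coefficients)."""
    n, d = len(X), len(X[0])
    Z = []  # standardized columns: zero mean, unit variance
    for j in range(d):
        col = [row[j] for row in X]
        mu = sum(col) / n
        sd = math.sqrt(sum((v - mu) ** 2 for v in col) / n) or 1.0
        Z.append([(v - mu) / sd for v in col])
    y_mean = sum(y) / n
    resid = [v - y_mean for v in y]  # residual for w = 0
    w = [0.0] * d
    for _ in range(iters):
        for j in range(d):
            zj = Z[j]
            # Correlation of column j with the residual that excludes column j.
            rho = sum(z * (r + w[j] * z) for z, r in zip(zj, resid)) / n
            # Soft-thresholding step (valid because each column has unit variance).
            new = math.copysign(max(abs(rho) - alpha, 0.0), rho)
            delta = new - w[j]
            if delta:
                resid = [r - delta * z for r, z in zip(resid, zj)]
                w[j] = new
    order = sorted(range(d), key=lambda j: -abs(w[j]))
    return order, w


# Stage 1: cluster applications by (hypothetical) 2-D workload features.
apps = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.15), (5.0, 5.1), (5.1, 4.9), (4.9, 5.0)]
labels = mean_shift(apps, bandwidth=1.0)

# Stage 2: for one cluster, rank parameters by their effect on run time.
# Synthetic data: run time depends strongly on p0, weakly on p1, not on p2.
random.seed(0)
names = ["p0", "p1", "p2"]
X = [[random.random() for _ in names] for _ in range(60)]
y = [10 * a + 2 * b + random.gauss(0, 0.1) for a, b, _ in X]
order, w = lasso_rank(X, y)
ranked = [names[j] for j in order]
print(labels, ranked)
```

In this sketch the front of `ranked` plays the role of the "subset of most important configuration parameters"; how many entries to keep is a choice the abstract leaves open.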

Acknowledgement

This work is supported by the National Key Research and Development Program under No. 2016YFB1000703.

Author information

Corresponding author

Correspondence to Shengfei Shi.

Copyright information

© 2018 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Li, T., Shi, S., Luo, J., Wang, H. (2018). A Method to Identify Spark Important Parameters Based on Machine Learning. In: Zhou, Q., Gan, Y., Jing, W., Song, X., Wang, Y., Lu, Z. (eds) Data Science. ICPCSEE 2018. Communications in Computer and Information Science, vol 901. Springer, Singapore. https://doi.org/10.1007/978-981-13-2203-7_42

  • DOI: https://doi.org/10.1007/978-981-13-2203-7_42

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-13-2202-0

  • Online ISBN: 978-981-13-2203-7

  • eBook Packages: Computer Science, Computer Science (R0)