Scalable Performance Modeling and Evaluation of MapReduce Applications

  • Conference paper
High-Performance Computing and Big Data Analysis (TopHPC 2019)

Abstract

Big Data frameworks are becoming complex systems that must cope with the increasing rate and diversity of data production in today's applications. This increases the number of variables and parameters that must be set for a framework to perform well. An accurate performance model is therefore necessary to estimate the execution time before actually running the application. Hadoop and Spark are two prominent Big Data frameworks for which multiple performance models have been proposed in the literature. Unfortunately, these models do not scale well enough to keep pace with the increasing size and complexity of the frameworks and of the underlying infrastructures used in production environments. In this paper we propose a scalable Lumped SRN model to predict the execution time of multi-stage MapReduce and Spark applications, and validate the model against experiments on the TPC-DS benchmark using the CINECA Italian supercomputing center. Results show that the proposed model enables the analysis of multiple simultaneous jobs, with multiple users and multiple stages per job, in reasonable time, and predicts the execution time of an application with an average error of about 14.5%.

Acknowledgment

The results of this work have been partially funded by the European DICE H2020 research project (grant agreement no. 644869).

Author information

Corresponding author

Correspondence to Soroush Karimian-Aliabadi.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Karimian-Aliabadi, S., Ardagna, D., Entezari-Maleki, R., Movaghar, A. (2019). Scalable Performance Modeling and Evaluation of MapReduce Applications. In: Grandinetti, L., Mirtaheri, S., Shahbazian, R. (eds) High-Performance Computing and Big Data Analysis. TopHPC 2019. Communications in Computer and Information Science, vol 891. Springer, Cham. https://doi.org/10.1007/978-3-030-33495-6_34

  • DOI: https://doi.org/10.1007/978-3-030-33495-6_34

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-33494-9

  • Online ISBN: 978-3-030-33495-6

  • eBook Packages: Computer Science (R0)
