Abstract
Big Data frameworks are becoming complex systems that must cope with the increasing rate and diversity of data production in today's applications. This implies a growing number of variables and parameters that must be set for the framework to perform well. Therefore, an accurate performance model is necessary to estimate the execution time before actually running the application. Hadoop and Spark are two prominent Big Data frameworks for which multiple performance models have been proposed in the literature. Unfortunately, these models do not scale well enough to keep pace with the increasing size and complexity of the frameworks and of the underlying infrastructures used in production environments. In this paper we propose a scalable Lumped SRN model to predict the execution time of multi-stage MapReduce and Spark applications, and validate the model against experiments on the TPC-DS benchmark run at the CINECA Italian supercomputing center. Results show that the proposed model enables the analysis of multiple simultaneous jobs, with multiple users and multiple stages per job, in reasonable time, and predicts the execution time of an application with an average error of about 14.5%.
Acknowledgment
The results of this work have been partially funded by the European DICE H2020 research project (grant agreement no. 644869).
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Karimian-Aliabadi, S., Ardagna, D., Entezari-Maleki, R., Movaghar, A. (2019). Scalable Performance Modeling and Evaluation of MapReduce Applications. In: Grandinetti, L., Mirtaheri, S., Shahbazian, R. (eds) High-Performance Computing and Big Data Analysis. TopHPC 2019. Communications in Computer and Information Science, vol 891. Springer, Cham. https://doi.org/10.1007/978-3-030-33495-6_34
DOI: https://doi.org/10.1007/978-3-030-33495-6_34
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-33494-9
Online ISBN: 978-3-030-33495-6
eBook Packages: Computer Science, Computer Science (R0)