Designing a MapReduce performance model in distributed heterogeneous platforms based on benchmarking approach


The MapReduce framework is an effective method for parallel processing of big data. Improving the performance of MapReduce clusters and reducing job execution time are fundamental challenges for this approach. Two problems arise in particular: how to maximize the execution overlap between jobs, and how to construct an optimal job schedule. Both depend on a precise model for estimating job execution time, which is difficult to build because of the large number and volume of submitted jobs, the limited resources available, and the need for proper Hadoop configuration. This paper presents a model, based on the MapReduce phases, for predicting the execution time of jobs in a heterogeneous cluster. It also introduces a novel heuristic method that significantly reduces the makespan of the jobs. In this method, a job profiling tool first extracts the execution details of the MapReduce phases through log analysis. Machine learning methods and statistical analysis are then used to build a model that predicts runtime. Finally, a second tool, the job submission and monitoring tool, is used to calculate the makespan. Experiments were conducted on the benchmarks under identical conditions for all jobs. The results show that the proposed method achieves, on average, a makespan speedup over the unoptimized case.
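The pipeline described in the abstract (profile the phases, fit a runtime model, then evaluate the makespan of a job order) can be illustrated with a minimal Python sketch. This is not the authors' model or heuristic: it assumes a simple per-phase linear fit of duration against input size, and it swaps in a classic greedy list-scheduling calculation for the makespan. The phase names, the sample measurements in PROFILES, and all function names are hypothetical and exist only for illustration.

```python
import heapq
from statistics import mean

# Hypothetical profiling output: for each phase, (input size in GB, seconds).
# Real values would come from log analysis of completed jobs.
PROFILES = {
    "map":     [(1, 10.0), (2, 20.0), (4, 40.0)],
    "shuffle": [(1, 5.0), (2, 7.0), (4, 11.0)],
    "reduce":  [(1, 4.0), (2, 8.0), (4, 16.0)],
}


def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b * x; returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b


def fit_phase_models(profiles):
    """Fit one linear model per phase (an assumed, simplified model form)."""
    return {phase: fit_line(*zip(*samples)) for phase, samples in profiles.items()}


def predict_job_time(models, input_gb):
    """Predicted job time is the sum of the predicted phase durations."""
    return sum(a + b * input_gb for a, b in models.values())


def makespan(durations, slots):
    """Greedy list scheduling: each job, in the given order, starts on the
    slot that becomes free earliest. Returns the finish time of the last job."""
    free_at = [0.0] * slots
    heapq.heapify(free_at)
    for d in durations:
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + d)
    return max(free_at)


if __name__ == "__main__":
    models = fit_phase_models(PROFILES)
    print("predicted time for 3 GB:", predict_job_time(models, 3.0))
    jobs = [1, 2, 3, 4]  # hypothetical predicted job durations
    print("submission order as given:", makespan(jobs, slots=2))
    print("longest job first:", makespan(sorted(jobs, reverse=True), slots=2))
```

The last two calls show why a runtime model matters for scheduling: with predicted durations in hand, the same set of jobs can be reordered before submission, and the resulting makespan can differ between orders.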





Author information

Correspondence to Ali Movaghar.


Cite this article

Gandomi, A., Movaghar, A., Reshadi, M. et al. Designing a MapReduce performance model in distributed heterogeneous platforms based on benchmarking approach. J Supercomput (2020).



  • MapReduce
  • YARN
  • Hadoop
  • Scheduling
  • Modeling
  • Makespan