Performance Evaluation of Big Data Frameworks: MapReduce and Spark

  • Conference paper

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 624)

Abstract

Spark and MapReduce are two prominent open-source distributed computing frameworks for big data processing and analytics. Both frameworks offer simple programming APIs to new users and hide the complexity and fault-tolerance handling of distributed tasks. Most Internet companies deploy these frameworks widely to process their massive data. Other large communities are also adopting them, because high-performance data analytics is required to solve big data problems. To find an efficient framework for processing and analyzing large amounts of data, researchers today compare the two frameworks. (1) This paper evaluates the performance of MapReduce and Spark on PageRank, sort, and word count. The PageRank and sort evaluations draw on existing research. (2) We provide an in-depth analysis of task execution time for the word count algorithm in both frameworks through detailed experiments, and quantify performance across different dataset sizes. Overall, the experimental results show that Spark is faster than MapReduce. The prime causes of the speedup in Spark are the reduced disk and CPU overheads due to RDD caching.
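
For readers unfamiliar with the word count benchmark named in the abstract, the following is a minimal single-process Python sketch of its map, shuffle, and reduce phases. It is an illustration only and not the code used in the paper's experiments: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are hypothetical, the lowercasing and whitespace tokenization are assumptions, and neither Hadoop nor Spark is involved.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map step: emit a (word, 1) pair for every whitespace-separated token."""
    # Lowercasing is an assumption made for this sketch.
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle step: group all emitted counts by word."""
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups


def reduce_phase(groups):
    """Reduce step: sum the grouped counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}


def word_count(lines):
    """Run the three phases in sequence over an iterable of text lines."""
    pairs = chain.from_iterable(map_phase(line) for line in lines)
    return reduce_phase(shuffle_phase(pairs))


if __name__ == "__main__":
    sample = ["spark and mapreduce", "spark caches data in memory"]
    print(word_count(sample))
```

In a real cluster the three phases run on different machines. The sketch keeps all intermediate pairs in memory in one process, so it says nothing about the disk and CPU overheads that the paper measures.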

Author information

Corresponding author

Correspondence to Jaspreet Singh.

Copyright information

© 2018 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Singh, J., Panda, S.N., Kaushal, R. (2018). Performance Evaluation of Big Data Frameworks: MapReduce and Spark. In: Singh, R., Choudhury, S., Gehlot, A. (eds) Intelligent Communication, Control and Devices. Advances in Intelligent Systems and Computing, vol 624. Springer, Singapore. https://doi.org/10.1007/978-981-10-5903-2_167

  • DOI: https://doi.org/10.1007/978-981-10-5903-2_167

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-10-5902-5

  • Online ISBN: 978-981-10-5903-2

  • eBook Packages: Engineering, Engineering (R0)
