Performance Evaluation of Big Data Frameworks: MapReduce and Spark

  • Jaspreet Singh
  • S. N. Panda
  • Rajesh Kaushal
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 624)

Abstract

Spark and MapReduce are two prominent open-source distributed computing frameworks for big data processing and analytics. These frameworks offer simple programming APIs for new users and hide the complexity and fault-tolerance handling of distributed tasks. Most Internet companies deploy these frameworks widely to process their massive data. Furthermore, other large communities are adopting these frameworks for high-performance computing (HPC), because high-performance data analytics is required to solve big data problems. To provide an efficient framework for processing and analyzing large amounts of data, today’s researchers compare the two frameworks. (1) This paper evaluates the performance of MapReduce and Spark on PageRank, sort, and word count. The PageRank and sort algorithms are evaluated in these frameworks on the basis of existing research. (2) We provide an in-depth analysis of task execution time for the word count algorithm in both frameworks through detailed experiments, and quantify the performance for different dataset sizes. Overall, the experimental results show that Spark is faster than MapReduce. The main causes of the speedup in Spark are the reduced disk and CPU overheads due to RDD caching.
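
As a rough illustration of the word count workload named in the abstract, the following single-process Python sketch mimics the map, shuffle, and reduce phases. It is not the authors' benchmark code and uses neither Hadoop nor Spark; the function names and the sample input are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every whitespace-separated token."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group the emitted counts by word (the key)."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped


def reduce_phase(grouped):
    """Reduce: sum the grouped counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_count(lines):
    """Run the three phases in sequence over an iterable of text lines."""
    pairs = chain.from_iterable(map_phase(line) for line in lines)
    return reduce_phase(shuffle_phase(pairs))


# Hypothetical sample input, not data from the paper.
sample_lines = ["spark and mapreduce", "spark caches data in memory"]
counts = word_count(sample_lines)
```

In a real cluster the map and reduce phases would run in parallel across nodes; this sketch only shows the shape of the computation that both frameworks execute for this benchmark.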

Keywords

Hadoop, Spark, MapReduce, HDFS, Data analytics


Copyright information

© Springer Nature Singapore Pte Ltd. 2018

Authors and Affiliations

  1. Department of Computer Science and Engineering, Chitkara University Institute of Engineering and Technology, Chitkara University, Rajpura, India