SparkBench is a flexible framework for benchmarking, simulating, comparing, and testing versions of Apache Spark and Spark applications. It offers three levels of parallelism and a variety of built-in data generators and workloads, allowing users to fine-tune their setup and obtain the benchmarking results they need.
A framework for benchmarking Apache Spark.
Apache Spark began in 2010 as a research project by Matei Zaharia and others at the Berkeley AMPLab. Following the landmark success of "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing" by Zaharia et al. (2012), Spark gained popularity and usage as its performance gains over traditional MapReduce workflows became evident. Its scope grew as well, introducing Python and R APIs, machine learning, graph computation, SQL, and...
- AMPLab Big Data Benchmark. https://amplab.cs.berkeley.edu/benchmark. Accessed 23 Feb 2018
- Apache Airflow. http://airbnb.io/projects/airflow/. Accessed 23 Feb 2018
- Apache Spark. https://spark.apache.org/. Accessed 23 Feb 2018
- Apache Zeppelin. https://zeppelin.apache.org/. Accessed 23 Feb 2018
- Azkaban. https://azkaban.github.io/. Accessed 23 Feb 2018
- HOCON (Human-Optimized Config Object Notation). https://github.com/lightbend/config/blob/master/HOCON.md. Accessed 23 Feb 2018
- IBM Spark-Tracing. https://github.com/CODAI/spark-tracing. Accessed 23 Feb 2018
- Intel HiBench Suite. https://github.com/intel-hadoop/HiBench. Accessed 23 Feb 2018
- Li M et al (2015) SparkBench: a comprehensive benchmarking suite for in memory data analytic platform Spark. https://research.spec.org/fileadmin/user_upload/documents/wg_bd/BD-20150401-spark_benchmark-v1.3-spec.pdf. Accessed 23 Feb 2018
- Project Jupyter. http://jupyter.org/. Accessed 23 Feb 2018
- TPC Decision Support Benchmark. http://www.tpc.org/tpcds/default.asp. Accessed 23 Feb 2018
- YourKit Java Profiler. https://www.yourkit.com/java/profiler/features/. Accessed 23 Feb 2018
- Zaharia M et al (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf. Accessed 23 Feb 2018