Distributed Data Processing on Microcomputers with Ascheduler and Apache Spark
Modern data acquisition and processing architectures often rely on low-cost, low-power devices that can be combined into a distributed infrastructure. In this paper we survey ways to organize a distributed computing testbed built from microcomputers similar to the Raspberry Pi and Intel Edison. The goal of the research is to investigate and develop a scheduler for orchestrating distributed data processing and general-purpose computations on such unreliable and resource-constrained hardware. We also consider integrating the scheduler with Apache Spark, a well-known distributed data processing framework. We outline a project, carried out in collaboration with Siemens LLC, that compares different hardware and software deployment configurations and evaluates the performance and applicability of the tools on the testbed.
Keywords: Microcomputers, Scheduling, Apache Spark, Raspberry Pi, Fault tolerance, High availability
The research was supported by Siemens LLC.