Abstract
Big Data has become a buzzword in today's technology landscape, and various frameworks are available to process it. Hadoop and Spark are both open-source frameworks for processing Big Data. Hadoop provides batch processing, while Spark supports both batch and stream processing, which makes it a hybrid processing framework. Each framework has its own advantages and drawbacks. This paper provides a comparative analysis of Hadoop MapReduce and Apache Spark. We also propose a scalable graph processing architecture that could be used to overcome the traditional limitations of the Hadoop framework.
Copyright information
© 2019 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Talan, P.P., Sharma, K.U., Nawade, P.P., Talan, K.P. (2019). An Overview of Hadoop MapReduce, Spark, and Scalable Graph Processing Architecture. In: Kalita, J., Balas, V., Borah, S., Pradhan, R. (eds) Recent Developments in Machine Learning and Data Analytics. Advances in Intelligent Systems and Computing, vol 740. Springer, Singapore. https://doi.org/10.1007/978-981-13-1280-9_3
DOI: https://doi.org/10.1007/978-981-13-1280-9_3
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-1279-3
Online ISBN: 978-981-13-1280-9
eBook Packages: Intelligent Technologies and Robotics (R0)