An Overview of Hadoop MapReduce, Spark, and Scalable Graph Processing Architecture

Talan, Pooja P.; Sharma, Kartik U.; Nawade, Pratiksha P.; Talan, Karishma P.

doi:10.1007/978-981-13-1280-9_3

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 740)

Abstract

Big Data has become a buzzword, and various frameworks are available to process it. Hadoop and Spark are both open-source frameworks for processing Big Data. Hadoop provides batch processing, while Spark supports both batch and stream processing, i.e., it is a hybrid processing framework. Each framework has its own advantages and drawbacks. This paper contributes a comparative analysis of Hadoop MapReduce and Apache Spark. It also proposes a scalable graph processing architecture that could be used to overcome traditional limitations of the Hadoop framework.
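As a rough illustration of the batch model that the abstract attributes to Hadoop MapReduce, the sketch below runs a word count through explicit map, shuffle, and reduce steps. It is not taken from the chapter and uses neither the Hadoop nor the Spark API: it is plain single-process Python, and the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the sample input are hypothetical, chosen only to show the shape of the computation.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence over a finite (batch) input."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical sample input; a real job would read from distributed storage.
counts = word_count(["big data big frameworks", "spark and hadoop process big data"])
```

In a real cluster the map and reduce steps would run in parallel on many machines and the shuffle would move data between them; the sketch keeps only the data flow, which is the same regardless of scale.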

Author information

Authors and Affiliations

Prof Ram Meghe College of Engineering & Management, Badnera-Amravati, India
Pooja P. Talan, Kartik U. Sharma & Pratiksha P. Nawade
KPIT Technologies Ltd, Thane, Mumbai, India
Karishma P. Talan

Corresponding author

Correspondence to Pooja P. Talan.

Editor information

Editors and Affiliations

College of Engineering and Applied Science, University of Colorado Colorado Springs, Colorado Springs, CO, USA
Jugal Kalita
Automation and Applied Informatics, Aurel Vlaicu University of Arad, Arad, Romania
Valentina Emilia Balas
Department of Computer Applications, Sikkim Manipal University, Sikkim, India
Samarjeet Borah
Department of Computer Applications, Sikkim Manipal University, Sikkim, India
Ratika Pradhan

About this paper

Cite this paper

Talan, P.P., Sharma, K.U., Nawade, P.P., Talan, K.P. (2019). An Overview of Hadoop MapReduce, Spark, and Scalable Graph Processing Architecture. In: Kalita, J., Balas, V., Borah, S., Pradhan, R. (eds) Recent Developments in Machine Learning and Data Analytics. Advances in Intelligent Systems and Computing, vol 740. Springer, Singapore. https://doi.org/10.1007/978-981-13-1280-9_3

DOI: https://doi.org/10.1007/978-981-13-1280-9_3
Published: 12 September 2018
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-1279-3
Online ISBN: 978-981-13-1280-9
eBook Packages: Intelligent Technologies and Robotics (R0)
