An Overview of Hadoop MapReduce, Spark, and Scalable Graph Processing Architecture

  • Conference paper
Recent Developments in Machine Learning and Data Analytics

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 740)

Abstract

In today’s technology era, Big Data has become a buzzword, and various frameworks are available to process it. Hadoop and Spark are both open-source frameworks for processing Big Data. Hadoop provides batch processing, whereas Spark supports both batch and stream processing, which makes it a hybrid processing framework. Each framework has its own advantages and drawbacks. The contribution of this paper is a comparative analysis of Hadoop MapReduce and Apache Spark. We also propose a scalable graph processing architecture that could be used to overcome traditional limitations of the Hadoop framework.

Author information

Corresponding author

Correspondence to Pooja P. Talan.

Copyright information

© 2019 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Talan, P.P., Sharma, K.U., Nawade, P.P., Talan, K.P. (2019). An Overview of Hadoop MapReduce, Spark, and Scalable Graph Processing Architecture. In: Kalita, J., Balas, V., Borah, S., Pradhan, R. (eds) Recent Developments in Machine Learning and Data Analytics. Advances in Intelligent Systems and Computing, vol 740. Springer, Singapore. https://doi.org/10.1007/978-981-13-1280-9_3
