Fast Recovery MapReduce (FAR-MR) to accelerate failure recovery in big data applications

Zhu, Yongqing; Samsudin, Juniarto; Kanagavelu, Renuga; Zhang, Weiwen; Wang, Long; Aye, Theint Theint; Goh, Rick Siow Mong

doi:10.1007/s11227-018-2716-8

Fast Recovery MapReduce (FAR-MR) to accelerate failure recovery in big data applications

Published: 15 December 2018

Volume 76, pages 3572–3588, (2020)
Cite this article

The Journal of Supercomputing Aims and scope Submit manuscript

Yongqing Zhu ORCID: orcid.org/0000-0001-6111-8297¹,
Juniarto Samsudin¹,
Renuga Kanagavelu¹,
Weiwen Zhang¹,
Long Wang¹,
Theint Theint Aye¹ &
…
Rick Siow Mong Goh¹

323 Accesses
8 Citations
Explore all metrics

Abstract

Existing Hadoop MapReduce fault tolerance strategy causes the computing jobs suffering from high performance penalty during failure recovery. In this paper, we propose Fast Recovery MapReduce (FAR-MR) to improve MapReduce performance in failure recovery. FAR-MR includes a novel fault tolerance strategy that combines distributed checkpointing and proactive push mechanism to support fast recovery from task failure and node failure. With distributed checkpointing, computing progress of each task is recorded as checkpoints periodically and kept in distributed data storage. The recovered task can obtain the last progress of the failed task from the distributed storage during failure recovery. In addition, the proactive push mechanism enables the computing results of map tasks to be proactively transmitted to the nodes hosting reduce tasks of the same computing job. When a failure happens, the partial output results being pushed to the reducer nodes can be used by the reduce tasks without the necessity of re-compute. FAR-MR allows a failed task to be recovered efficiently at any node in the cluster. The performance evaluation has shown that the proposed FAR-MR can improve computing job performance by up to 62% and 45% compared to Hadoop MapReduce in the case of task failure recovery and node failure recovery, respectively.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Analyzing the impact of various parameters on job scheduling in the Google cluster dataset

Article 29 March 2024

Danyal Shahmirzadi, Navid Khaledian & Amir Masoud Rahmani

Accelerating erasure coding by exploiting multiple repair paths in distributed storage systems

Article 12 April 2024

Chanki Kim & Kang-Wook Chon

Supporting efficient video file streaming in P2P cloud storage

Article 04 April 2024

Jinsung Kim & Eunsam Kim

References

Raghupathi W, Raghupathi V (2014) Big data analytics in healthcare: promise and potential. Health Inf Sci Syst 2:3
Article Google Scholar
Cattaneo G, Petrillo UF, Giancarlo R et al (2017) An effective extension of the applicability of alignment-free biological sequence comparison algorithms with Hadoop. J Supercomput 73(4):1467–1483. https://doi.org/10.1007/s11227-016-1835-3
Article Google Scholar
Cardenas AA, Manadhata PK, Rajan SP (2013) Big data analytics for security. IEEE Secur Priv 11(6):74–76
Article Google Scholar
Zhu Y, Juniarto S, Shi H, Wang J (2015) VH-DSI: speeding up data visualization via a heterogeneous distributed storage infrastructure. In: Proceedings of the 21st IEEE International Conference on Parallel and Distributed Systems (ICPADS 2015), pp 658–665
Lin KC, Zhang KY, Huang YH et al (2016) Feature selection based on an improved cat swarm optimization algorithm for big data classification. J Supercomput 72(8):3210–3221. https://doi.org/10.1007/s11227-016-1631-0
Article Google Scholar
Dean J, Ghemawat S (2008) Map-Reduce: simplified data processing on large clusters. Commun ACM 51(1):107–113
Article Google Scholar
Apache Hadoop YARN. http://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/YARN.html. Accessed 2012
Rahman MT, Gabriel E, Subhlok J (2017) Performance implications of failures on MapReduce applications. In: Proceedings of 2017 IEEE International Conference on Cluster Computing, pp 741–748
Yang C, Yen C, Tan C, Madden SR (2010) Osprey: implementing MapReduce-style fault tolerance in a shared-nothing distributed database. In: Proceedings of IEEE ICDE, pp 657–668
Wang G, Butt AR, Pandey P, Gupta K (2009) A simulation approach to evaluating design decisions in MapReduce setups. In: Proceedings of IEEE/ACM MASCOTS, pp 1–11
Khalil S, Salem SA, Nassar S, Saad EM (2013) MapReduce performance in heterogeneous environments: a review. Int J Sci Eng Res 4(4):410–416
Google Scholar
Carlson JL (2013) Redis in action. Manning Publications, Greenwich
Google Scholar
Fitzpatrick B (2004) Distributed caching with memcached. Linux J 2004(124):72–78
Google Scholar
Chervenak A, Foster I, Kesselman C, Salisbury C, Tuecke S (2000) The data grid: towards an architecture for the distributed management and analysis of large scientific data sets. J Netw Comput Appl 23:187
Article Google Scholar
Cui X, Zhu P, Yang X et al (2014) Optimized big data K-means clustering using MapReduce. J Supercomput 70(3):1249–1259. https://doi.org/10.1007/s11227-014-1225-7
Article Google Scholar
Choi H, Lee KH, Lee YJ (2014) Parallel labeling of massive XML data with MapReduce. J Supercomput 67(2):408–437. https://doi.org/10.1007/s11227-013-1008-6
Article Google Scholar
Slagter K, Hsu CH, Chung YC et al (2013) An improved partitioning mechanism for optimizing massive data analysis using MapReduce. J Supercomput 66(1):539–555. https://doi.org/10.1007/s11227-013-0924-9
Article Google Scholar
Treaster M (2005) A survey of Fault-tolerance and Fault-recovery techniques in parallel systems. Technical Report cs.DC/0501002, ACM Computing Research Repository (CoRR)
Zaharia M, Konwinski A, Joseph AD, Katz R, Stoica I (2008) Improving MapReduce performance in heterogeneous environments. In: Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation, OSDI’08, USA, pp 29–42
Chen Q, Zhang D, Guo M, Deng Q, Guo S (2010) SAMR: a selfadaptive MapReduce scheduling algorithm in heterogeneous environment. In: Proceedings of the IEEE 10th International Conference on Computer and Information Technology, pp 2736–2743
Ananthanarayanan G, Kandula S, Greenberg A, Stoica I, Lu Y, Saha B, Harris E (2010) Reining in the outliers in map-reduce clusters using Mantri. In: Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation, OSDI’10, USA, pp 1–16
Wang Y, Fu H, Yu W (2015) Cracking down MapReduce failure amplification through analytics logging and migration. In: Proceedings of IEEE International Parallel and Distributed Processing Symposium (IPDPS’15), pp 261–270
Gates A et al (2009) Building a highlevel dataflow system on top of MapReduce: the pig experience. PVLDB 2(2):1414
Google Scholar
Thusoo A et al (2009) Hive—a warehousing solution over a Map-Reduce framework. PVLDB 2(2):1626
Google Scholar
Balazinska M, Balakrishnan H, Madden SR, Stonebraker M (2008) Fault-tolerance in the borealis distributed stream processing system. ACM Trans Database Syst 33(1):3
Article Google Scholar
Hwang J-H, Xing Y, Cetintemel U, Zdonik S (2007) A cooperative, self-configuring high-availability solution for stream processing. In: Proceedings of the IEEE 23rd International Conference on Data Engineering, pp 176–185
Liedes A-P, Wolski A (2006) SIREN: a memory-conserving, snapshot-consistent checkpoint algorithm for in-memory databases. In: Proceedings of the 22nd International Conference on Data Engineering, pp 99–99
Quiané-Ruiz J-A, Pinkel C, Schad J (2011) RAFTing MapReduce: fast recovery on the RAFT. In: Proceedings of the IEEE 27th International Conference on Data Engineering (ICDE’11), pp 589–600
Lin C-Y, Chen T-H, Cheng Y-N (2013) On improving fault tolerance for heterogeneous Hadoop MapReduce clusters. In: Proceedings of 2013 IEEE International Conference on Cloud Computing and Big Data, pp 38–43
Wang H, Chen H, Zhenwei D, Fei H (2016) BeTL: MapReduce checkpoint tactics beneath the task level. IEEE Trans Serv Comput 9:84–95
Google Scholar
Wang H, Chen H, Hu F (2014) Rect: improving MapReduce performance under failures with resilient checkpointing tactics. In: Proceedings of the IEEE International Conference Big Data (Big Data), pp 27–32

Download references

Acknowledgements

We would like to thank the anonymous reviewers for their careful reading of our paper despite the busy schedules. We really appreciate their valuable comments and suggestions which for sure will improve the clarity and quality of our manuscript.

Author information

Authors and Affiliations

Institute of High Performance Computing, Singapore, Singapore
Yongqing Zhu, Juniarto Samsudin, Renuga Kanagavelu, Weiwen Zhang, Long Wang, Theint Theint Aye & Rick Siow Mong Goh

Authors

Yongqing Zhu
View author publications
You can also search for this author in PubMed Google Scholar
Juniarto Samsudin
View author publications
You can also search for this author in PubMed Google Scholar
Renuga Kanagavelu
View author publications
You can also search for this author in PubMed Google Scholar
Weiwen Zhang
View author publications
You can also search for this author in PubMed Google Scholar
Long Wang
View author publications
You can also search for this author in PubMed Google Scholar
Theint Theint Aye
View author publications
You can also search for this author in PubMed Google Scholar
Rick Siow Mong Goh
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Yongqing Zhu.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Zhu, Y., Samsudin, J., Kanagavelu, R. et al. Fast Recovery MapReduce (FAR-MR) to accelerate failure recovery in big data applications. J Supercomput 76, 3572–3588 (2020). https://doi.org/10.1007/s11227-018-2716-8

Download citation

Published: 15 December 2018
Issue Date: May 2020
DOI: https://doi.org/10.1007/s11227-018-2716-8

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Fast Recovery MapReduce (FAR-MR) to accelerate failure recovery in big data applications

Abstract

Access this article

Similar content being viewed by others

Analyzing the impact of various parameters on job scheduling in the Google cluster dataset

Accelerating erasure coding by exploiting multiple repair paths in distributed storage systems

Supporting efficient video file streaming in P2P cloud storage

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Fast Recovery MapReduce (FAR-MR) to accelerate failure recovery in big data applications

Abstract

Access this article

Similar content being viewed by others

Analyzing the impact of various parameters on job scheduling in the Google cluster dataset

Accelerating erasure coding by exploiting multiple repair paths in distributed storage systems

Supporting efficient video file streaming in P2P cloud storage

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation