ReHRS: A Hybrid Redundant System for Improving MapReduce Reliability and Availability

  • Jia-Chun Lin
  • Fang-Yie Leu
  • Ying-ping Chen
Part of the Modeling and Optimization in Science and Technologies book series (MOST, volume 4)


MapReduce is a parallel programming framework proposed by Google that has recently become a popular technology for data-intensive applications. However, current MapReduce implementations provide insufficient redundancy for their master servers: when a master server fails unexpectedly, its services stop and no job can proceed or complete. To solve this problem, this chapter proposes a master-server redundancy mechanism called the Reliable Hybrid Redundant System (ReHRS for short). In ReHRS, a hot-standby server maintains the latest metadata of the master server so that it can take over quickly, and a warm-standby server further enhances system reliability and extends the operation of MapReduce when neither the master server nor the hot-standby server can work properly. We propose a failure detection algorithm to detect failures of the master server and the hot-standby server, and provide appropriate takeover processes to continue their operations. Additionally, we introduce a dynamic warmup mechanism that lets the warm-standby server warm itself up so that it can quickly act as the hot-standby server when necessary. Extensive simulation and experiment results show that ReHRS significantly speeds up the takeover process compared with three state-of-the-art schemes.
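
The role hierarchy described above (master, hot-standby, warm-standby) can be illustrated with a minimal sketch. The abstract does not specify the chapter's failure detection algorithm or warmup mechanism, so the code below substitutes a generic heartbeat timeout and a simple metadata copy; the names `Server`, `Cluster`, `detect_and_take_over`, and the `timeout` parameter are hypothetical and are not taken from ReHRS.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Server:
    name: str
    last_heartbeat: float      # time of the most recent heartbeat, in seconds
    metadata_version: int = 0  # how recent this server's metadata copy is


@dataclass
class Cluster:
    master: Optional[Server]
    hot: Optional[Server]
    warm: Optional[Server]


def alive(server: Optional[Server], now: float, timeout: float) -> bool:
    # Assumption: a server counts as failed once its heartbeat is older
    # than `timeout`. This is a generic stand-in, not the chapter's algorithm.
    return server is not None and now - server.last_heartbeat <= timeout


def detect_and_take_over(c: Cluster, now: float, timeout: float = 3.0) -> Cluster:
    # Keep the survivors in order of readiness, then shift them up so that
    # the most up-to-date survivor fills the highest vacant role.
    survivors = [s for s in (c.master, c.hot, c.warm) if alive(s, now, timeout)]
    padded = survivors + [None] * (3 - len(survivors))
    result = Cluster(*padded)
    # Simplified "warmup": whichever server now holds the hot-standby role
    # copies the current master's metadata version.
    if result.master is not None and result.hot is not None:
        result.hot.metadata_version = result.master.metadata_version
    return result
```

For example, if the master's heartbeat is stale while both standbys are alive, the hot-standby becomes the master and the warm-standby is promoted to hot-standby after copying the metadata version. If both the master and the hot-standby are stale, the warm-standby becomes the master with whatever metadata it already holds.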


MapReduce, single-point-of-failure, reliability, availability, reliable hybrid redundant system





Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  1. Department of Computer Science, National Chiao Tung University, Hsinchu City, Taiwan
  2. Department of Computer Science, TungHai University, Taichung City, Taiwan
