Abstract
Fault tolerance is an essential technology for high-performance computing systems. Checkpoint/Restart (C/R) is the most popular fault-tolerance technique: programs save their state to stable storage, typically a global file system, and recover from the last checkpoint after a failure. Because of the high cost of the global file system, checkpoint techniques based on node-local storage are attracting growing interest; they save checkpoints in local storage such as DRAM. To overcome the volatility of node-local storage, compute nodes are typically divided into groups and, depending on the cross-node redundancy scheme, each node's checkpoint data is either saved redundantly on one designated partner node or distributed among all other nodes in the same group. As a result, multiple simultaneous failures within one group often cannot be tolerated, so the node grouping strategy is very important: it directly affects the probability of a multi-node failure within one group. In this paper, we propose a novel node allocation model that takes the topological structure of high-performance computing systems into account and can greatly reduce the probability of a multi-node failure within a group, compared with traditional architecture-neutral grouping algorithms. Experimental results obtained from a simulation system based on the TianHe-2 supercomputer show that our method is very effective on randomly generated simulated instances.
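The effect of grouping on multi-node failure can be illustrated with a toy model. The sketch below is not the clique-based algorithm proposed in the paper; it only contrasts an architecture-neutral grouping (consecutive node IDs) with a simple topology-aware one (members spread across failure domains). The notion of a "failure domain" of fixed size (for example, nodes sharing a board), the parameter `domain_size`, and all function names are assumptions made for illustration, and the number of nodes is assumed to be divisible by the group size.

```python
from collections import Counter


def contiguous_groups(num_nodes, group_size):
    """Architecture-neutral grouping: consecutive node IDs form a group."""
    return [list(range(start, start + group_size))
            for start in range(0, num_nodes, group_size)]


def strided_groups(num_nodes, group_size):
    """Toy topology-aware grouping: members are spaced num_groups apart."""
    num_groups = num_nodes // group_size
    return [list(range(g, num_nodes, num_groups)) for g in range(num_groups)]


def domain_of(node, domain_size):
    """Hypothetical topology: nodes with consecutive IDs share a failure domain."""
    return node // domain_size


def fatal_domains(groups, domain_size):
    """Count failure domains whose loss removes two or more nodes of one group."""
    fatal = set()
    for group in groups:
        counts = Counter(domain_of(node, domain_size) for node in group)
        fatal.update(d for d, c in counts.items() if c >= 2)
    return len(fatal)


if __name__ == "__main__":
    num_nodes, group_size, domain_size = 16, 4, 4
    neutral = contiguous_groups(num_nodes, group_size)
    aware = strided_groups(num_nodes, group_size)
    print("contiguous:", fatal_domains(neutral, domain_size), "fatal domains")
    print("strided:   ", fatal_domains(aware, domain_size), "fatal domains")
```

With 16 nodes, groups of 4, and failure domains of 4 nodes, every domain is fatal for some group under contiguous grouping, while the strided grouping places each group's members in distinct domains. Real interconnect topologies are far less regular than this, which is the case a topology-aware allocation model has to handle.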
Acknowledgment
This work is supported by the National High Technology Research and Development Program of China (863 Program) under Grant Nos. 2012AA01A301 and 2012AA01A309.
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Liao, X., Yang, C., Quan, Z., Tang, T., Chen, C. (2015). An Efficient Clique-Based Algorithm of Compute Nodes Allocation for In-memory Checkpoint System. In: Kunkel, J., Ludwig, T. (eds) High Performance Computing. ISC High Performance 2015. Lecture Notes in Computer Science, vol 9137. Springer, Cham. https://doi.org/10.1007/978-3-319-20119-1_15
DOI: https://doi.org/10.1007/978-3-319-20119-1_15
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-20118-4
Online ISBN: 978-3-319-20119-1
eBook Packages: Computer Science, Computer Science (R0)