An Efficient Clique-Based Algorithm of Compute Nodes Allocation for In-memory Checkpoint System

  • Conference paper
High Performance Computing (ISC High Performance 2015)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9137)

Abstract

Fault tolerance is essential for high-performance computing systems. Checkpoint/Restart (C/R) is the most popular fault-tolerance technique: programs save their state to stable storage, typically a global file system, and recover from the last checkpoint after a failure. Because of the high cost of the global file system, checkpoint techniques based on node-local storage are attracting growing interest; these save checkpoints in local storage such as DRAM. Typically, computing nodes are divided into groups and, depending on the cross-node redundancy scheme, each node's checkpoint data is either saved redundantly on another designated node or distributed among all other nodes in the same group, to overcome the volatility of node-local storage. As a result, multiple simultaneous failures within one group often cannot be tolerated, so the node-grouping strategy is very important: it directly affects the probability of a multi-node failure within one group. In this paper, we propose a novel node allocation model that takes the topological structure of high-performance computing systems into account and, compared with traditional architecture-neutral grouping algorithms, can greatly reduce the probability of a multi-node failure within a group. Experimental results obtained from a simulation system based on the TianHe-2 supercomputer show that our method is very effective on randomly generated simulation instances.
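To make the grouping problem concrete, the following is a minimal sketch, not the clique-based algorithm proposed in the paper. It assumes a toy machine in which consecutive node IDs share a failure domain (for example, a board or a rack) and each group tolerates at most one failed node. The function names, the domain size, and the two grouping strategies are all hypothetical and chosen only for illustration.

```python
# Toy illustration only: NOT the paper's clique-based allocation algorithm.
# Assumption: nodes 0..N-1 are laid out so that each run of NODES_PER_DOMAIN
# consecutive IDs shares one failure domain (e.g. a board or a rack).

NODES_PER_DOMAIN = 4  # hypothetical failure-domain size


def contiguous_groups(num_nodes, group_size):
    """Architecture-neutral grouping: consecutive node IDs form a group."""
    return [list(range(start, start + group_size))
            for start in range(0, num_nodes, group_size)]


def striped_groups(num_nodes, group_size):
    """Topology-aware toy grouping: members are spread with a fixed stride,
    so that (when group_size equals the number of domains) each group takes
    one node from every failure domain."""
    num_groups = num_nodes // group_size
    return [list(range(g, num_nodes, num_groups)) for g in range(num_groups)]


def domain_nodes(domain, nodes_per_domain=NODES_PER_DOMAIN):
    """All node IDs that belong to one failure domain."""
    first = domain * nodes_per_domain
    return range(first, first + nodes_per_domain)


def groups_lost(groups, failed_nodes, tolerated=1):
    """Count groups with more simultaneous failures than they can tolerate."""
    failed = set(failed_nodes)
    return sum(1 for group in groups
               if sum(node in failed for node in group) > tolerated)


if __name__ == "__main__":
    failed = domain_nodes(0)  # a whole failure domain goes down at once
    print("contiguous:", groups_lost(contiguous_groups(16, 4), failed))
    print("striped:   ", groups_lost(striped_groups(16, 4), failed))
```

In this toy setting, the contiguous grouping places an entire failure domain inside one group, so a single domain failure exceeds that group's tolerance, whereas the striped grouping spreads the same failure across groups. Real systems such as TianHe-2 have a richer topology than this one-level model.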

Acknowledgment

This work is supported by the National High Technology Research and Development Program of China (863 Program) under Grant Nos. 2012AA01A301 and 2012AA01A309.

Author information

Corresponding author

Correspondence to Xiangke Liao.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Liao, X., Yang, C., Quan, Z., Tang, T., Chen, C. (2015). An Efficient Clique-Based Algorithm of Compute Nodes Allocation for In-memory Checkpoint System. In: Kunkel, J., Ludwig, T. (eds) High Performance Computing. ISC High Performance 2015. Lecture Notes in Computer Science, vol 9137. Springer, Cham. https://doi.org/10.1007/978-3-319-20119-1_15

  • DOI: https://doi.org/10.1007/978-3-319-20119-1_15

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-20118-4

  • Online ISBN: 978-3-319-20119-1

  • eBook Packages: Computer Science, Computer Science (R0)
