
In-Memory Checkpointing for MPI Programs by XOR-Based Double-Erasure Codes

  • Conference paper
Recent Advances in Parallel Virtual Machine and Message Passing Interface (EuroPVM/MPI 2009)

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 5759)

Abstract

Today, high-performance computing (HPC) systems are larger than ever, which makes fault tolerance a growing challenge. MPI (Message Passing Interface) is one of the most important programming tools for HPC. There are quite a few fault-tolerant extensions for MPI, such as MPICH-V, StarFish, and FT-MPI, and most of them are based on on-disk checkpointing. In this paper, we apply two kinds of XOR-based double-erasure codes, RDP (Row-Diagonal Parity) and B-Code, to in-memory checkpointing for MPI programs. We develop scalable checkpointing/recovery algorithms that embed erasure-code encoding/decoding computation into MPI collective communication operations. The experiments show that the scalable algorithms decrease communication overhead and balance computation effectively. Our approach provides highly reliable, fast in-memory checkpointing for MPI programs.
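To make the idea of an XOR-based double-erasure code concrete, the following is a minimal single-process sketch of RDP encoding and recovery in plain Python. It is an illustration under stated assumptions, not the paper's MPI algorithm: each grid column stands in for the in-memory checkpoint of one process, each checkpoint is split into `p - 1` integer blocks for a prime `p`, and the names `rdp_encode`, `rdp_recover`, and `demo` are hypothetical. The decoder is a simple peeling loop, which repeatedly solves any parity equation with exactly one unknown block, rather than the collective-communication scheme the abstract describes.

```python
from typing import List, Tuple


def rdp_encode(data: List[List[int]], p: int) -> Tuple[List[List[int]], List[int]]:
    """Encode a (p-1) x (p-1) block array with RDP; p is assumed prime.

    Returns (grid, diag): grid is the data with a row-parity column appended
    (column index p-1), and diag holds the p-1 stored diagonal parities.
    """
    rows = p - 1
    grid = [row[:] + [0] for row in data]
    for r in range(rows):
        for c in range(p - 1):
            grid[r][p - 1] ^= grid[r][c]          # row parity
    diag = [0] * (p - 1)
    for r in range(rows):
        for c in range(p):                        # data + row-parity columns
            d = (r + c) % p
            if d != p - 1:                        # diagonal p-1 is not stored
                diag[d] ^= grid[r][c]
    return grid, diag


def rdp_recover(grid: List[List[int]], diag: List[int],
                lost: List[int], p: int) -> List[List[int]]:
    """Rebuild up to two lost columns of grid (indices 0..p-1) in place."""
    rows = p - 1
    unknown = {(r, c) for r in range(rows) for c in lost}
    # Row equations XOR to 0; diagonal equations XOR to the stored parity.
    equations = [([(r, c) for c in range(p)], 0) for r in range(rows)]
    for d in range(p - 1):
        equations.append(([(r, (d - r) % p) for r in range(rows)], diag[d]))
    while unknown:
        progress = False
        for cells, const in equations:
            missing = [cell for cell in cells if cell in unknown]
            if len(missing) != 1:
                continue
            value = const
            for r, c in cells:
                if (r, c) not in unknown:
                    value ^= grid[r][c]
            r, c = missing[0]
            grid[r][c] = value                    # the single unknown is solved
            unknown.remove((r, c))
            progress = True
        if not progress:
            raise ValueError("erasure pattern not recoverable by peeling")
    return grid


def demo() -> bool:
    """Lose two 'process' columns and check that they are rebuilt exactly."""
    p = 5
    data = [[(7 * r + 3 * c + 1) & 0xFF for c in range(p - 1)]
            for r in range(p - 1)]
    grid, diag = rdp_encode(data, p)
    damaged = [row[:] for row in grid]
    for r in range(p - 1):
        damaged[r][1] = 0
        damaged[r][3] = 0
    return rdp_recover(damaged, diag, lost=[1, 3], p=p) == grid


if __name__ == "__main__":
    print(demo())
```

Every step of both encoding and recovery is a bitwise XOR of blocks. In an MPI setting each such XOR would be a natural fit for a reduction with the standard `MPI_BXOR` operator, but how the paper maps the code onto collectives is not shown in the abstract, so that mapping is left out of the sketch.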

This work was supported in part by the National High Technology Research and Development Program of China (2008AA01Z401), RFDP of China (20070055054), and the Science and Technology Development Plan of Tianjin (08JCYBJC13000).




Copyright information

© 2009 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Wang, G., Liu, X., Li, A., Zhang, F. (2009). In-Memory Checkpointing for MPI Programs by XOR-Based Double-Erasure Codes. In: Ropo, M., Westerholm, J., Dongarra, J. (eds) Recent Advances in Parallel Virtual Machine and Message Passing Interface. EuroPVM/MPI 2009. Lecture Notes in Computer Science, vol 5759. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-03770-2_15


  • DOI: https://doi.org/10.1007/978-3-642-03770-2_15

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-03769-6

  • Online ISBN: 978-3-642-03770-2

  • eBook Packages: Computer Science, Computer Science (R0)
