PartialRC: A Partial Recomputing Method for Efficient Fault Recovery on GPGPUs

  • Regular Paper
Journal of Computer Science and Technology

Abstract

GPGPUs are increasingly being used as performance accelerators for HPC (High Performance Computing) applications in CPU/GPU heterogeneous computing systems, including TianHe-1A, the world's fastest supercomputer in the TOP500 list, built at NUDT (National University of Defense Technology) last year. However, despite their performance advantages, GPGPUs do not provide built-in fault-tolerant mechanisms to offer the reliability guarantees required by many HPC applications. By analyzing the SIMT (single-instruction, multiple-thread) characteristics of programs running on GPGPUs, we have developed PartialRC, a new checkpoint-based, compiler-directed partial recomputing method that achieves efficient fault recovery by leveraging the phenomenal computing power of GPGPUs. In this paper, we introduce our PartialRC method, which recovers from errors detected in a code region by partially recomputing the region, describe a checkpoint-based fault-tolerance framework built on PartialRC, and discuss an implementation on the CUDA platform. Validation using a range of representative CUDA programs on NVIDIA GPGPUs against FullRC (a traditional full-recomputing Checkpoint-Rollback-Restart fault recovery method for CPUs) shows that PartialRC significantly reduces the fault recovery overheads incurred by FullRC: by 73.5% on average when errors occur earlier during execution and by 74.6% on average when errors occur later. In addition, PartialRC reduces the error detection overheads incurred by FullRC during fault recovery, while incurring negligible performance overheads when no fault occurs.
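The contrast between full and partial recomputing can be illustrated with a toy model. The sketch below is not the paper's implementation: the abstract does not describe the compiler analysis, the checkpoint format, or the error detector, so everything here is an assumption made for illustration. It assumes that a code region consists of independent logical threads (one output element each), that a detector can name the threads whose results are wrong, and that recovery cost is counted as the number of thread executions. The names `execute`, `detect`, `recover`, and `demo` are hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Set


def execute(kernel: Callable[[int], int], data: List[int],
            tids: Iterable[int], out: List[int], corrupt: Set[int]) -> int:
    """Run `kernel` once per logical thread id and store each result in `out`.

    Thread ids listed in `corrupt` get a deliberately wrong result, which
    stands in for a transient fault. Returns the number of thread executions.
    """
    count = 0
    for t in tids:
        value = kernel(data[t])
        out[t] = value + 1 if t in corrupt else value
        count += 1
    return count


def detect(kernel: Callable[[int], int], data: List[int],
           out: List[int]) -> List[int]:
    """Hypothetical detector: an oracle that compares against a reference.

    A real detector would be far cheaper; this one only exists so the toy
    model can name the faulty threads.
    """
    return [t for t in range(len(data)) if out[t] != kernel(data[t])]


def recover(kernel: Callable[[int], int], data: List[int], out: List[int],
            faulty: List[int], partial: bool) -> int:
    """Re-execute only the faulty threads (partial) or the whole region (full)."""
    tids = faulty if partial else range(len(data))
    return execute(kernel, data, tids, out, corrupt=set())


def demo() -> Dict[str, int]:
    """Inject the same two faults and compare the recovery cost of both modes."""
    data = list(range(8))
    kernel = lambda x: x * x
    costs: Dict[str, int] = {}
    for name, partial in (("FullRC", False), ("PartialRC", True)):
        out = [0] * len(data)
        execute(kernel, data, range(len(data)), out, corrupt={2, 5})
        faulty = detect(kernel, data, out)
        costs[name] = recover(kernel, data, out, faulty, partial)
        # Both strategies must end with a fully correct output.
        assert out == [kernel(x) for x in data]
    return costs


if __name__ == "__main__":
    print(demo())
```

In this model both strategies restore a correct result, but the full strategy re-executes all eight threads while the partial one re-executes only the two affected threads. The savings reported in the abstract come from measurements on real CUDA programs and are not reproduced by this sketch.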



Author information

Corresponding author

Correspondence to Xin-Hai Xu.

Additional information

This work was supported by the National Natural Science Foundation of China under Grant Nos. 60921062, 61003087, 61120106005 and 61170049.

About this article

Cite this article

Xu, XH., Yang, XJ., Xue, JL. et al. PartialRC: A Partial Recomputing Method for Efficient Fault Recovery on GPGPUs. J. Comput. Sci. Technol. 27, 240–255 (2012). https://doi.org/10.1007/s11390-012-1220-5

  • DOI: https://doi.org/10.1007/s11390-012-1220-5
