Assessing General-Purpose Algorithms to Cope with Fail-Stop and Silent Errors

Benoit, Anne; Cavelan, Aurélien; Robert, Yves; Sun, Hongyang

doi:10.1007/978-3-319-17248-4_11

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8966)

Included in the following conference series:

International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems

Abstract

In this paper, we combine the traditional checkpointing and rollback recovery strategies with verification mechanisms to address both fail-stop and silent errors. The objective is to minimize either makespan or energy consumption. While DVFS is a popular approach for reducing the energy consumption, using lower speeds/voltages can increase the number of errors, thereby complicating the problem. We consider an application workflow whose dependence graph is a chain of tasks, and we study three execution scenarios: (i) a single speed is used during the whole execution; (ii) a second, possibly higher speed is used for any potential re-execution; (iii) different pairs of speeds can be used throughout the execution. For each scenario, we determine the optimal checkpointing and verification locations (and the optimal speeds for the third scenario) to minimize either objective. The different execution scenarios are then assessed and compared through an extensive set of experiments.
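
To make the checkpoint-placement question concrete, here is a minimal sketch of a dynamic program over a chain of tasks. It is not the paper's algorithm: it ignores silent errors, verifications, speeds, and energy, and handles only fail-stop errors at a single speed. It assumes exponentially distributed failures with a hypothetical rate `rate`, a fixed checkpoint cost `ckpt`, recovery cost `recovery`, and downtime `downtime`, and it uses the standard closed form from the checkpointing literature for the expected time to complete a segment of work followed by a checkpoint. All function and parameter names are illustrative.

```python
import math


def expected_segment_time(work, ckpt, recovery, downtime, rate):
    """Expected time to run `work` units plus one checkpoint of cost `ckpt`.

    Assumes exponentially distributed fail-stop errors with rate `rate`;
    each failure costs `downtime`, then a recovery of cost `recovery`
    (which can itself be hit by a failure), then the segment restarts.
    """
    if rate == 0:
        # No failures: the segment simply takes its nominal duration.
        return work + ckpt
    return (math.exp(rate * recovery)
            * (1.0 / rate + downtime)
            * math.expm1(rate * (work + ckpt)))


def place_checkpoints(weights, ckpt, recovery, downtime, rate):
    """Choose after which tasks of a chain to checkpoint.

    `weights[k]` is the duration of task k+1. A checkpoint is always
    taken after the last task (a simplifying assumption). Returns the
    minimum expected makespan and the 1-based indices of the tasks
    that are followed by a checkpoint.
    """
    n = len(weights)
    prefix = [0.0]
    for w in weights:
        prefix.append(prefix[-1] + w)

    # best[j]: minimum expected time to finish tasks 1..j with a
    # checkpoint right after task j; prev[j]: previous checkpoint.
    best = [0.0] + [math.inf] * n
    prev = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(j):
            segment = prefix[j] - prefix[i]
            cost = best[i] + expected_segment_time(
                segment, ckpt, recovery, downtime, rate)
            if cost < best[j]:
                best[j] = cost
                prev[j] = i

    # Walk back through prev[] to list the chosen checkpoint positions.
    positions = []
    j = n
    while j > 0:
        positions.append(j)
        j = prev[j]
    positions.reverse()
    return best[n], positions
```

Calling `place_checkpoints([100, 100, 100], ckpt=1, recovery=1, downtime=0, rate=0.01)` evaluates every way of cutting the chain into checkpointed segments in O(n²) time. Adding verifications or per-segment speeds, as the three scenarios in the abstract do, would require extra state in the recurrence; the sketch leaves that out.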

Author information

Authors and Affiliations

CNRS and INRIA, École Normale Supérieure de Lyon, Lyon, France
Anne Benoit, Aurélien Cavelan, Yves Robert & Hongyang Sun
University of Tennessee Knoxville, Knoxville, USA
Yves Robert

Corresponding author

Correspondence to Aurélien Cavelan.

Editor information

Editors and Affiliations

University of Warwick, Coventry, United Kingdom
Stephen A. Jarvis
University of Warwick, Coventry, United Kingdom
Steven A. Wright
Sandia National Laboratories CSRI, Albuquerque, New Mexico, USA
Simon D. Hammond

About this paper

Cite this paper

Benoit, A., Cavelan, A., Robert, Y., Sun, H. (2015). Assessing General-Purpose Algorithms to Cope with Fail-Stop and Silent Errors. In: Jarvis, S., Wright, S., Hammond, S. (eds) High Performance Computing Systems. Performance Modeling, Benchmarking, and Simulation. PMBS 2014. Lecture Notes in Computer Science, vol 8966. Springer, Cham. https://doi.org/10.1007/978-3-319-17248-4_11

DOI: https://doi.org/10.1007/978-3-319-17248-4_11
Published: 18 April 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-17247-7
Online ISBN: 978-3-319-17248-4
eBook Packages: Computer Science; Computer Science (R0)
