Name: Fault-Tolerance Techniques for High-Performance Computing
ISBN: 978-3-319-20943-2

Overview

Editors:

Thomas Herault (University of Tennessee, Knoxville, USA)
Yves Robert (Ecole Normale Supérieure de Lyon, Lyon, France)

The first complete overview of this increasingly important field
Presents a unique, rigorous approach based on the design of analytical models to predict performance
Provides a coherent collection of insights from internationally renowned experts
Includes supplementary material: sn.pub/extras

Part of the book series: Computer Communications and Networks (CCN)


About this book

This timely text presents a comprehensive overview of fault-tolerance techniques for high-performance computing (HPC). The text opens with a detailed introduction to checkpoint protocols and scheduling algorithms, prediction, replication, and silent error detection and correction, together with some application-specific techniques such as algorithm-based fault tolerance (ABFT). Emphasis is placed on analytical performance models. This is followed by a review of general-purpose techniques, including several checkpoint and rollback recovery protocols. Relevant execution scenarios are also evaluated and compared through quantitative models.

Features: provides a survey of resilience methods and performance models; examines the various sources of errors and faults in large-scale systems; reviews the spectrum of techniques that can be applied to design a fault-tolerant MPI; investigates different approaches to replication; discusses the challenge of energy consumption of fault-tolerance methods in extreme-scale systems.

Table of contents (5 chapters)

Front Matter
  Pages i-ix

General Overview
Front Matter
  Pages 1-1
1. Fault Tolerance Techniques for High-Performance Computing
  Jack Dongarra, Thomas Herault, Yves Robert
  Pages 3-85

Technical Contributions
Front Matter
  Pages 87-87
2. Errors and Faults
  Ana Gainaru, Franck Cappello
  Pages 89-144
3. Fault-Tolerant MPI
  Aurélien Bouteiller
  Pages 145-228
4. Using Replication for Resilience on Exascale Systems
  Henri Casanova, Frédéric Vivien, Dounia Zaidouni
  Pages 229-278
5. Energy-Aware Checkpointing Strategies
  Guillaume Aupy, Anne Benoit, Mohammed El Mehdi Diouri, Olivier Glück, Laurent Lefèvre
  Pages 279-317

Back Matter
  Pages 319-320

Bibliographic Information

Book Title: Fault-Tolerance Techniques for High-Performance Computing
Editors: Thomas Herault, Yves Robert
Series Title: Computer Communications and Networks
DOI: https://doi.org/10.1007/978-3-319-20943-2
Publisher: Springer Cham
eBook Packages: Computer Science, Computer Science (R0)
Copyright Information: Springer International Publishing Switzerland 2015
Hardcover ISBN: 978-3-319-20942-5 (Published: 15 July 2015)
Softcover ISBN: 978-3-319-35560-3 (Published: 15 October 2016)
eBook ISBN: 978-3-319-20943-2 (Published: 01 July 2015)
Series ISSN: 1617-7975
Series E-ISSN: 2197-8433
Edition Number: 1
Number of Pages: IX, 320
Number of Illustrations: 113 b/w illustrations
Topics: System Performance and Evaluation, Performance and Reliability, Numeric Computing
Industry Sectors: Aerospace, Electronics, IT & Software, Telecommunications
