Fault-Tolerance Techniques for High-Performance Computing

  • Thomas Herault
  • Yves Robert

Part of the Computer Communications and Networks book series (CCN)

Table of contents

  1. Front Matter
  2. General Overview
    1. Front Matter
    2. Jack Dongarra, Thomas Herault, Yves Robert
  3. Technical Contributions
    1. Front Matter
    2. Ana Gainaru, Franck Cappello
    3. Aurélien Bouteiller
    4. Henri Casanova, Frédéric Vivien, Dounia Zaidouni
    5. Guillaume Aupy, Anne Benoit, Mohammed El Mehdi Diouri, Olivier Glück, Laurent Lefèvre
  4. Back Matter

About this book

Introduction

This timely text/reference presents a comprehensive overview of fault-tolerance techniques for high-performance computing (HPC).

The text opens with a detailed introduction to checkpoint protocols and scheduling algorithms, prediction, replication, and silent error detection and correction, together with application-specific techniques such as algorithm-based fault tolerance. Emphasis is placed on analytical performance models. This is followed by a review of general-purpose techniques, including several checkpoint and rollback recovery protocols. Relevant execution scenarios are also evaluated and compared through quantitative models.

Topics and features:

  • Includes self-contained contributions from an international selection of preeminent experts
  • Provides a survey of resilience methods and performance models
  • Examines the various sources of errors and faults in large-scale systems, detailing their characteristics, with a focus on modeling, detection and prediction
  • Reviews the spectrum of techniques that can be applied to design a fault-tolerant message passing interface
  • Investigates different approaches to replication, comparing these to the traditional checkpoint-recovery approach
  • Discusses the challenge of energy consumption of fault-tolerance methods in extreme-scale systems, proposing a methodology to estimate such energy consumption

This authoritative volume is essential reading for all researchers and graduate students involved in high-performance computing.

Dr. Thomas Herault is a Research Scientist in the Innovative Computing Laboratory (ICL) at the University of Tennessee, Knoxville, TN, USA. Dr. Yves Robert is a Professor in the Laboratory of Parallel Computing at the Ecole Normale Supérieure de Lyon, France, and a Visiting Research Scholar in the ICL.

Keywords

  • Algorithm-Based Fault Tolerance
  • Fault Predictors
  • Fault-Tolerance
  • High-Performance Computing
  • Resilience
  • Silent Errors

Editors and affiliations

  • Thomas Herault (1)
  • Yves Robert (2)

  1. University of Tennessee, Knoxville, USA
  2. Ecole Normale Supérieure de Lyon, Lyon, France

Bibliographic information

  • DOI: https://doi.org/10.1007/978-3-319-20943-2
  • Copyright Information: Springer International Publishing Switzerland 2015
  • Publisher Name: Springer, Cham
  • eBook Packages: Computer Science
  • Print ISBN: 978-3-319-20942-5
  • Online ISBN: 978-3-319-20943-2
  • Series Print ISSN: 1617-7975
  • Series Online ISSN: 2197-8433