DRIFT: Decoupled CompileR-Based Instruction-Level Fault-Tolerance

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8664)

Abstract

Compiler-based error detection methodologies replicate the instructions of the program and insert checks wherever they are needed. The checks evaluate code correctness and decide whether or not an error has occurred. The replicated instructions and the checks cause a large slowdown. In this work, we focus on reducing the error detection overhead and improving the system’s performance without degrading fault coverage. DRIFT achieves this by decoupling the execution of the code (original and replicated) from the checks.

The checks are compare and jump instructions. The jumps serialize the code and prevent the compiler from performing aggressive instruction scheduling optimizations. We call this phenomenon basic-block fragmentation. DRIFT reduces the impact of basic-block fragmentation by breaking the synchronized execute-check-confirm-execute cycle. In this way, DRIFT generates scheduler-friendly code with more instruction-level parallelism (ILP). As a result, it reduces the performance overhead to 1.29× on average and outperforms the state of the art by up to 29.7% while retaining the same fault coverage. The evaluation was performed on an Itanium 2 by running the Mediabench II and SPEC2000 benchmark suites.
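The difference between a synchronized execute-check-confirm-execute cycle and a decoupled one can be illustrated with a small source-level analogy. The sketch below is not DRIFT's actual compiler transformation, which operates on machine instructions; it only assumes one plausible reading of the abstract. All names (`FaultDetected`, `run_synchronized`, `run_decoupled`, `fault_at`) and the single-bit-flip fault model are hypothetical and chosen for illustration.

```python
class FaultDetected(Exception):
    """Raised when an original value and its replica disagree."""


def _corrupt(value, inject):
    # Illustrative fault model (an assumption): flip the lowest bit.
    return value ^ 1 if inject else value


def run_synchronized(a, b, c, fault_at=None):
    """Each step is followed immediately by its compare-and-jump check."""
    x = _corrupt(a + b, fault_at == 0)   # original instruction
    x_dup = a + b                        # replicated instruction
    if x != x_dup:                       # check: the branch splits the block
        raise FaultDetected("step 0")
    y = _corrupt(x * c, fault_at == 1)
    y_dup = x_dup * c
    if y != y_dup:                       # second check, second branch
        raise FaultDetected("step 1")
    return y


def run_decoupled(a, b, c, fault_at=None):
    """Execution continues past each comparison; one deferred branch."""
    x = _corrupt(a + b, fault_at == 0)
    x_dup = a + b
    mismatch = x != x_dup                # compare only, no jump yet
    y = _corrupt(x * c, fault_at == 1)
    y_dup = x_dup * c
    mismatch |= y != y_dup               # accumulate comparison results
    if mismatch:                         # single deferred check
        raise FaultDetected("deferred check")
    return y
```

In this toy model both variants detect the same injected faults, but the decoupled variant leaves one straight-line region with a single branch at the end, which is the kind of code a scheduler can reorder more freely. How DRIFT groups and places its checks in practice is described in the paper itself, not here.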

This work was supported in part by the EC under grant ERA 249059 (FP7).

Author information

Corresponding author

Correspondence to Konstantina Mitropoulou.

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Mitropoulou, K., Porpodas, V., Cintra, M. (2014). DRIFT: Decoupled CompileR-Based Instruction-Level Fault-Tolerance. In: Cașcaval, C., Montesinos, P. (eds) Languages and Compilers for Parallel Computing. LCPC 2013. Lecture Notes in Computer Science, vol 8664. Springer, Cham. https://doi.org/10.1007/978-3-319-09967-5_13

  • DOI: https://doi.org/10.1007/978-3-319-09967-5_13

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-09966-8

  • Online ISBN: 978-3-319-09967-5

  • eBook Packages: Computer Science, Computer Science (R0)
