Transactional Encoding for Tolerating Transient Hardware Errors

  • Conference paper
Stabilization, Safety, and Security of Distributed Systems (SSS 2013)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8255)

Abstract

The decreasing feature size of integrated circuits leads to less reliable hardware with a higher likelihood of errors. Without additional failure detection and masking mechanisms, the next generations of CPUs would be unfit at least for executing mission- and safety-critical applications. One common approach is the replicated execution of programs on redundant cores, which is increasingly difficult because most programs are non-deterministic. To detect and mask execution errors, one typically needs to execute three copies of each thread.

In this paper, we propose and evaluate transactional encoding, a novel approach to detect and mask transient hardware errors such that one can build safe applications on top of unreliable components. Transactional encoding combines arithmetic codes for detecting transient hardware errors with transactional memory for recovering from and tolerating them. We present a prototype software implementation that encodes applications using an LLVM-based compiler and executes them with a customized software transactional memory algorithm. Our evaluation shows that our system can successfully survive between 90% and 96% of transient hardware errors.
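The combination described above can be illustrated with a minimal sketch. It is not the authors' implementation, which relies on an LLVM-based compiler and a software transactional memory runtime. It assumes a plain AN code (each value x is stored as A·x, and any value that is not a multiple of A signals an error) and a toy transaction that checks the code before committing and re-executes on failure. The constant `A` and the names `encode`, `decode`, `is_valid`, `encoded_sum`, and `run_transaction` are hypothetical and chosen only for illustration.

```python
# Illustrative sketch only: all names and the constant A are assumptions,
# not taken from the paper.

A = 58659  # arbitrary odd constant; a single bit flip changes a value by
           # +/- 2**k, which is never a multiple of an odd A > 1


class DetectedError(Exception):
    """Raised when an encoded value is not a multiple of A."""


def encode(x):
    return x * A


def is_valid(xc):
    return xc % A == 0


def decode(xc):
    if not is_valid(xc):
        raise DetectedError(xc)
    return xc // A


def encoded_sum(values, flip_bit=None):
    """Add AN-encoded values; optionally flip one bit to mimic a transient fault."""
    acc = encode(0)
    for v in values:
        acc += encode(v)  # A*x + A*y == A*(x + y), so sums stay valid code words
    if flip_bit is not None:
        acc ^= 1 << flip_bit  # injected fault on the encoded result
    return acc


def run_transaction(state, values, faults):
    """Re-execute the update until its result passes the code check, then commit.

    `faults` lists the bit to flip on each successive attempt; once it is
    empty, attempts run fault-free. Returns the number of attempts made.
    """
    attempts = 0
    while True:
        attempts += 1
        flip = faults.pop(0) if faults else None
        try:
            result = decode(encoded_sum(values, flip))  # check before commit
        except DetectedError:
            continue  # abort: state was never modified, so simply retry
        state["total"] = result  # commit
        return attempts
```

For example, calling `run_transaction(state, [1, 2, 3], [4])` with `state = {"total": 0}` injects a bit flip during the first attempt, which the code check detects; the second attempt runs without a fault and commits the sum. Because the error is transient, re-execution is enough to mask it, so no replicated copies of the computation are needed in this sketch.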

Copyright information

© 2013 Springer International Publishing Switzerland

About this paper

Cite this paper

Wamhoff, JT., Schwalbe, M., Faqeh, R., Fetzer, C., Felber, P. (2013). Transactional Encoding for Tolerating Transient Hardware Errors. In: Higashino, T., Katayama, Y., Masuzawa, T., Potop-Butucaru, M., Yamashita, M. (eds) Stabilization, Safety, and Security of Distributed Systems. SSS 2013. Lecture Notes in Computer Science, vol 8255. Springer, Cham. https://doi.org/10.1007/978-3-319-03089-0_1

  • DOI: https://doi.org/10.1007/978-3-319-03089-0_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-03088-3

  • Online ISBN: 978-3-319-03089-0

  • eBook Packages: Computer Science, Computer Science (R0)