Abstract
The decreasing feature size of integrated circuits leads to less reliable hardware with a higher likelihood of errors. Without additional failure detection and masking mechanisms, the next generations of CPUs would be unfit at least for executing mission- and safety-critical applications. One common approach is the replicated execution of programs on redundant cores, which is increasingly difficult because most programs are non-deterministic. To detect and mask execution errors, one typically needs to execute three copies of each thread.
In this paper, we propose and evaluate transactional encoding, a novel approach to detect and mask transient hardware errors such that one can build safe applications on top of unreliable components. Transactional encoding relies on a combination of arithmetic codes for detecting transient hardware errors and transactional memory for recovery and tolerance of transient errors. We present a prototype software implementation that encodes applications using an LLVM-based compiler and executes them with a customized software transactional memory algorithm. Our evaluation shows that our system can successfully survive between 90% and 96% of transient hardware errors.
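The combination described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a plain AN-code (every value is stored as a multiple of a constant A, so a corrupted value is very likely no longer divisible by A) and a toy single-threaded "transaction" that works on a private copy of the state and retries when a check fails. The constant A and the names encode, check, decode, run_transaction, and make_transfer are all hypothetical and chosen only for illustration.

```python
# Illustrative sketch only: AN-code detection plus transactional retry.
# A is a hypothetical encoding constant; the abstract does not state one.
A = 58659


class DetectedError(Exception):
    """Raised when a value is not a valid code word (not a multiple of A)."""


def encode(x):
    # A valid code word is the plain value multiplied by A.
    return x * A


def check(v):
    # A transient bit flip almost always breaks divisibility by A.
    if v % A != 0:
        raise DetectedError(v)
    return v


def decode(v):
    return check(v) // A


def encoded_add(a, b):
    # The sum of two multiples of A is again a multiple of A,
    # so addition can be performed directly on encoded values.
    return a + b


def run_transaction(body, state, max_retries=3):
    """Run body on a private copy of state; commit only if all values check.

    Returns the number of aborted attempts before the successful commit.
    """
    for attempt in range(max_retries):
        working = dict(state)          # private write buffer
        try:
            body(working)
            for v in working.values():
                check(v)               # validate before commit
        except DetectedError:
            continue                   # abort: discard the buffer and retry
        state.update(working)          # commit
        return attempt
    raise RuntimeError("error persisted across retries")


def make_transfer(amount, inject_fault):
    """Build a transaction body; optionally flip one bit on the first run."""
    runs = {"n": 0}

    def body(s):
        runs["n"] += 1
        src = encoded_add(s["src"], encode(-amount))
        if inject_fault and runs["n"] == 1:
            src ^= 1 << 3              # simulated transient bit flip
        s["src"] = src
        s["dst"] = encoded_add(s["dst"], encode(amount))

    return body
```

In this sketch the injected fault corrupts only the private copy, so the aborted attempt leaves the committed state untouched and the retry runs against clean data. A real system, as the abstract notes, applies the encoding through a compiler and relies on a software transactional memory algorithm rather than a dictionary copy.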
Copyright information
© 2013 Springer International Publishing Switzerland
About this paper
Cite this paper
Wamhoff, JT., Schwalbe, M., Faqeh, R., Fetzer, C., Felber, P. (2013). Transactional Encoding for Tolerating Transient Hardware Errors. In: Higashino, T., Katayama, Y., Masuzawa, T., Potop-Butucaru, M., Yamashita, M. (eds) Stabilization, Safety, and Security of Distributed Systems. SSS 2013. Lecture Notes in Computer Science, vol 8255. Springer, Cham. https://doi.org/10.1007/978-3-319-03089-0_1
DOI: https://doi.org/10.1007/978-3-319-03089-0_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-03088-3
Online ISBN: 978-3-319-03089-0
eBook Packages: Computer Science, Computer Science (R0)