Abstract
Compiler-based error detection methodologies replicate the instructions of the program and insert checks wherever they are needed. The checks evaluate code correctness and decide whether or not an error has occurred. The replicated instructions and the checks cause a large slowdown. In this work, we focus on reducing the error detection overhead and improving the system’s performance without degrading fault coverage. DRIFT achieves this by decoupling the execution of the code (original and replicated) from the checks.
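The replicate-and-check scheme described above can be modelled at a high level as follows. This is only an illustrative sketch in Python, not the instruction-level transformation that a compiler performs: the names `checked_sum`, `FaultDetected`, and the `inject_fault_at` parameter are hypothetical and exist only to simulate a bit flip in one copy of the computation.

```python
class FaultDetected(Exception):
    """Raised when the original and replicated values disagree."""


def checked_sum(values, inject_fault_at=None):
    """Sum `values` twice (original + replica) with a check after every step.

    `inject_fault_at` is a hypothetical test hook: at that step, one bit of
    the original accumulator is flipped to simulate a transient fault.
    """
    total = 0      # original computation
    total_dup = 0  # replicated computation
    for i, v in enumerate(values):
        total += v
        total_dup += v
        if inject_fault_at is not None and i == inject_fault_at:
            total ^= 1  # simulated bit flip affecting the original copy only
        # The check: a compare followed by a conditional jump.
        if total != total_dup:
            raise FaultDetected(f"mismatch detected at step {i}")
    return total
```

In this model every step pays for one extra operation (the replica) and one compare-and-branch (the check), which is the source of the slowdown that the abstract refers to.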
The checks are compare and jump instructions. The jump instructions sequentialize the code and prevent the compiler from performing aggressive instruction scheduling optimizations. We call this phenomenon basic-block fragmentation. DRIFT reduces the impact of basic-block fragmentation by breaking the synchronized execute-check-confirm-execute cycle. In this way, DRIFT generates scheduler-friendly code with more instruction-level parallelism (ILP). As a result, it reduces the performance overhead to 1.29× on average and outperforms the state of the art by up to 29.7% while retaining the same fault coverage. The evaluation was performed on an Itanium 2, running the Mediabench II and SPEC2000 benchmark suites.
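One way to picture breaking the execute-check-confirm-execute cycle is to record comparison results without branching and to branch on them once, later. The abstract does not specify how DRIFT implements the decoupling, so the sketch below is an assumption made for illustration, not the authors' method; `decoupled_sum`, `FaultDetected`, and `inject_fault_at` are hypothetical names.

```python
class FaultDetected(Exception):
    """Raised when the original and replicated values disagree."""


def decoupled_sum(values, inject_fault_at=None):
    """Sum `values` twice, deferring the error branch to the end.

    Assumption for illustration: mismatches are accumulated into a flag
    with branch-free operations, and a single check runs after the loop.
    `inject_fault_at` is a hypothetical hook that simulates a bit flip.
    """
    total = 0      # original computation
    total_dup = 0  # replicated computation
    mismatch = 0   # accumulated comparison results
    for i, v in enumerate(values):
        total += v
        total_dup += v
        if inject_fault_at is not None and i == inject_fault_at:
            total ^= 1  # simulated bit flip affecting the original copy only
        # Compare without jumping: any differing bit makes the flag non-zero.
        mismatch |= total ^ total_dup
    # Single deferred check instead of one branch per step.
    if mismatch != 0:
        raise FaultDetected("mismatch detected in region")
    return total
```

Because the loop body contains no check branch, a scheduler is free to interleave the original and replicated operations across steps. The trade-off in this sketch is that an error is reported at the end of the region rather than at the step where it occurred.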
This work was supported in part by the EC under grant ERA 249059 (FP7).
© 2014 Springer International Publishing Switzerland
Mitropoulou, K., Porpodas, V., Cintra, M. (2014). DRIFT: Decoupled CompileR-Based Instruction-Level Fault-Tolerance. In: Cașcaval, C., Montesinos, P. (eds) Languages and Compilers for Parallel Computing. LCPC 2013. Lecture Notes in Computer Science(), vol 8664. Springer, Cham. https://doi.org/10.1007/978-3-319-09967-5_13
DOI: https://doi.org/10.1007/978-3-319-09967-5_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-09966-8
Online ISBN: 978-3-319-09967-5
eBook Packages: Computer Science, Computer Science (R0)