Recovery Preparation

  • Igor Schagaev
  • Eugene Zouev
  • Kaegi Thomas


In the last section, we showed how the hardware integrity of a computing system can be efficiently ensured using hardware-checking schemes and system software testing procedures and their sequences. To recover from faults, however, it is also necessary to eliminate the effects the error had on the computation, i.e., on the software code and data space. In GAFT, this corresponds to preparation for recovery. We now show how software has to be organized to support its own recovery. In other words, we review the strategies by which software can, after an error is detected, ensure that the error did not affect the software state or, if this cannot be ensured, the precautions software has to take to be able to re-establish a correct software state. First, we review the state of the art; then we introduce a new technology and show its power and limitations. Next, we show how hardware can assist software in the process of recovery preparation. All generic approaches to recovery preparation need so-called stable storage, i.e., storage that is nonvolatile, reliable, and fast. If no direct hardware support is available, stable storage must be implemented in software, and we present one possible software implementation of it.



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. IT-ACS Ltd, Stevenage, UK
  2. Department of Informatics, Technopolis, Innopolis, Kazan, Russia
