Proposal on Network-Wide Rollback Scheme for Fast Recovery from Operator Errors
This paper proposes a new network-wide rollback scheme for fast recovery from operator errors, with the aim of high availability of networks and services. Operators are the major cause of failure: because devices and services depend on one another across the network, a typical management task requires an operator to manipulate one or more diverse devices and services. The lack of systems or tools that fully address this issue motivated us to develop a new scheme. The underlying idea is that any operational device or service exhibits identical observable behavior whenever the same settings are configured. High availability can thus be achieved by rolling back the settings that may have caused an abnormal state through an operator error to past settings with which the devices and services were stable. We identify policies for network-wide rollback and present a prototype implementation and preliminary results.
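The rollback idea can be illustrated with a minimal sketch. It is not the prototype described in the paper; the names `Snapshot`, `ConfigHistory`, and `plan_rollback`, the device names, and the representation of settings as plain strings are all assumptions made for illustration. The sketch records network-wide snapshots of settings, marks each as stable or not, and computes which devices or services must be reconfigured to return to the most recent stable snapshot.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Snapshot:
    """Network-wide settings at one point in time (hypothetical structure)."""
    timestamp: int
    settings: Dict[str, str]  # device or service name -> configuration text
    stable: bool              # True if the network was observed to be stable


@dataclass
class ConfigHistory:
    """Ordered history of network-wide snapshots (hypothetical structure)."""
    snapshots: List[Snapshot] = field(default_factory=list)

    def record(self, timestamp: int, settings: Dict[str, str], stable: bool) -> None:
        # Copy the settings so later changes do not alter the stored snapshot.
        self.snapshots.append(Snapshot(timestamp, dict(settings), stable))

    def last_stable(self) -> Optional[Snapshot]:
        # Search from the newest snapshot backwards for a stable one.
        for snap in reversed(self.snapshots):
            if snap.stable:
                return snap
        return None


def plan_rollback(current: Dict[str, str], history: ConfigHistory) -> Dict[str, str]:
    """Return the settings to re-apply, only for devices that differ
    from the most recent stable snapshot."""
    target = history.last_stable()
    if target is None:
        return {}  # no known stable state, nothing to roll back to
    return {
        device: config
        for device, config in target.settings.items()
        if current.get(device) != config
    }


# Example: an operator error changes router1 while dns1 stays the same.
history = ConfigHistory()
history.record(1, {"router1": "ospf area 0", "dns1": "zone example.com"}, stable=True)
history.record(2, {"router1": "ospf area 1", "dns1": "zone example.com"}, stable=False)

current = {"router1": "ospf area 1", "dns1": "zone example.com"}
plan = plan_rollback(current, history)
print(plan)
```

Only the devices whose settings differ from the stable snapshot appear in the plan, which reflects the assumption stated in the abstract that identical settings yield identical observable behavior. A real scheme would also need rollback policies, for example the order in which dependent devices and services are restored.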