1 Introduction

Managing the configuration of large computer systems is a huge challenge in the age of web-scale distributed applications, quickly growing cloud adoption, and high-performance computing systems approaching the exaflop scale. With O(1000)–O(10000) nodes in a data center, or even in a single computer, scalable automation for controlling the system configuration is crucial.

In this paper we present an approach to procedural system management that goes beyond the existing tools for automating the deployment and configuration management of nodes and groups of similar nodes. Our approach can express and implement complex dependencies between managed components and resolve ordering and synchronization constraints that cross node boundaries. The framework can revert changes and remove managed components cleanly from the system, operations that may require, for example, stopping or restarting entire chains of dependent components. In the context of the Autonomic Computing challenge [1] our development falls into the category of self-configuration, with some elements of self-healing.

In Sect. 2 we discuss concepts of and approaches to large systems configuration as well as previous work and related tools. The following two sections sketch the distributed state in large systems and introduce the change objects (COBs) that describe and manage the state. Implementation details are discussed in Sect. 5 and the application to managing parallel file system setups in Sect. 6. We conclude the paper with an outlook to future work.

2 Large Systems Configuration

The problem of setting up and managing large, distributed computer systems emerged shortly after the appearance of the first Beowulf clusters in the mid 1990s. The data center landscape was then filled with mainframes, proprietary UNIX flavors and a few special purpose vector or massively parallel machines. The hassle of managing various UNIX systems with incompatible shell scripts led to the emergence of tools like CFEngine [2], which abstract common small system management tasks into a declarative, domain-specific language (DSL). CFEngine maps configuration and management rules to groups of similar hosts and enables administrators to manage large numbers of nodes in a simple and centralized way. The configuration changes and automated administration steps are performed on each node, which applies the procedures assigned to its group. CFEngine is modeled with the help of Promise Theory [3], a “model of voluntary cooperation between individual, autonomous actors or agents who publish their intentions to one another in the form of promises” [4].

The homogeneous Beowulf clusters built from identical off-the-shelf components were well served by centralized cluster management systems like OSCAR [5], Warewulf [6], SystemImager [7], SystemInstaller, and SystemConfigurator. They manage one or several compute node images that get deployed to groups of identical nodes within a cluster. The installation and configuration tasks are done centrally, on the node images. This saves time because the tasks have to be done only once instead of on every node. Image deployment is the method of keeping the cluster node OSes synchronized. Interactive administration tasks are mostly handled by scalable parallel execution tools like C3, pdsh or clush. While these allow quickly executing commands or scripts on a large set of nodes, they do not deal with failed nodes, so changes issued centrally are not guaranteed to be applied to all nodes of the system.

For today’s high-performance computing (HPC) and cloud systems, image-based provisioning is still the preferred way of node deployment, but it is often complemented by procedural management methods using tools similar to CFEngine, e.g. Puppet [8], Chef [10], Salt [11], Ansible [9]. These tools often come with a large set of administration “recipes”, ready to be grouped and assigned to sets of hosts. They work very well for setups where nodes act more or less independently of each other, like data centers with specialized nodes for specific (e.g. web-oriented) services. They can handle dependencies between the “recipes” of one particular node, but they lack the complex synchronization needed for deploying or managing complex distributed applications.

3 Distributed Systems State

The state of a large distributed system is very complex and can be represented as a vector in a high-dimensional space, as depicted in Fig. 1. The simple example, which is representable in three dimensions, involves the states of two services and the version of one configuration file linked to one of the services. Managing the system, that is, moving it from one state into another, is equivalent to moving the state vector from one point to another along a possibly complex path that is imposed by the dependencies among the various components.

Fig. 1. Example of a state transition in a system of two dependent services svc A and svc B, and the configuration of service A: cfg A. Changing the version of cfg A requires svc A to be stopped, which itself requires svc B to be stopped. The dependencies enforce that updating the version of cfg A happens correctly by moving the state from one yellow diamond to the other along the path of the red arrows.

It is possible to represent the state of a complex system by parallel coordinates or related methods, but the state transition path would still be very difficult to visualize in an intuitive way. Therefore we focus on describing the entities contributing to the system state and the graph that is built by linking them with arrows representing the dependencies. While state-contributing entities are mostly attributed to a particular node, the dependencies can link them across node boundaries. The global distributed system state is thus one big graph, with nodes carrying the state and links representing dependencies. When moving from one state to another, the state of the graph nodes can change as well as the graph topology, e.g. through added or removed state-contributing entities.
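
To make this graph picture concrete, the following minimal sketch models state-carrying entities and their (possibly cross-node) dependency edges as plain data structures. All names are illustrative and not part of our implementation.

```python
# Minimal sketch of the global state graph: entities carry a state and are
# attributed to a host, edges represent dependencies (possibly across hosts).
from dataclasses import dataclass, field


@dataclass
class StateEntity:
    name: str    # e.g. "svc A"
    host: str    # node the entity is attributed to
    state: str   # e.g. "up", "down", "version 1"


@dataclass
class StateGraph:
    entities: dict = field(default_factory=dict)    # name -> StateEntity
    depends_on: dict = field(default_factory=dict)  # name -> set of entity names

    def add(self, entity, requires=()):
        self.entities[entity.name] = entity
        self.depends_on[entity.name] = set(requires)


g = StateGraph()
g.add(StateEntity("cfg A", "node1", "version 1"))
g.add(StateEntity("svc A", "node1", "up"), requires={"cfg A"})
g.add(StateEntity("svc B", "node2", "up"), requires={"svc A"})  # crosses a node boundary
```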

4 Change Objects: COBs

In our approach the global system state is represented and controlled by entities that correspond to elementary system components which can be configured and managed separately. We call them Change OBjects: COBs. A COB carries a change for one of the system components. Examples are:

  • Service COBs: ensure that a particular service is enabled and started.

  • File tree COBs: transport files that are copied over or merged into the system, e.g. configuration files, empty directories or symbolic links. File tree COBs are revertible and can be merged with user changes by using a Git repository and a 3-way merge underneath.

  • High availability resource COBs: represent a HA resource managed by Pacemaker.

  • Script COBs: represent changes made by executing a script. Whether these changes are reversible depends on the scripts.

COBs can have arbitrary states and offer transition operations between the states. For example, a service COB can have the states up (the service is running), down (the service is not running) and unknown (the service state is not known or an error was encountered during a transition operation). The transition operations are start, stop and restart.
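
A minimal sketch of such a service COB is given below. It is illustrative only; the actual COB classes differ, and the use of systemctl is just one possible way to probe and control a service.

```python
# Illustrative sketch of a service COB with the states up/down/unknown and
# the transition operations start/stop/restart (not the actual implementation).
import subprocess


class ServiceCob:
    STATES = ("up", "down", "unknown")

    def __init__(self, service):
        self.service = service
        self.state = "unknown"

    def check(self):
        # Determine the true state, here by asking the service manager.
        rc = subprocess.call(["systemctl", "is-active", "--quiet", self.service])
        self.state = "up" if rc == 0 else "down"
        return self.state

    def start(self):
        rc = subprocess.call(["systemctl", "start", self.service])
        self.state = "up" if rc == 0 else "unknown"

    def stop(self):
        rc = subprocess.call(["systemctl", "stop", self.service])
        self.state = "down" if rc == 0 else "unknown"

    def restart(self):
        self.stop()
        if self.state == "down":
            self.start()
```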

Whenever possible, restore information is saved with the COBs so that an operation can be reverted. For example, configuration files are stored in a Git repository such that an old state can be restored when switching to a new one fails. With script COBs this is obviously not always possible in a simple way; revertible script COB operations require more complex scripts.
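
The following sketch illustrates the idea of keeping restore information in a Git repository: the new file version is committed before being activated, and the previous version is checked out again if the activation fails. The repository layout, file names and the activate() callback are hypothetical.

```python
# Sketch: commit a new configuration file version, then restore the previous
# one from the repository history if activating it fails. Paths are hypothetical.
import shutil
import subprocess


def git(repo, *args):
    subprocess.run(["git", "-C", repo, *args], check=True)


def apply_config(repo, tracked_file, new_content, target_path, activate):
    with open(f"{repo}/{tracked_file}", "w") as f:
        f.write(new_content)
    git(repo, "add", tracked_file)
    git(repo, "commit", "-m", "new configuration version")
    shutil.copy(f"{repo}/{tracked_file}", target_path)
    try:
        activate()  # e.g. restart the service that uses the file
    except Exception:
        # Roll back to the previous committed version.
        git(repo, "checkout", "HEAD~1", "--", tracked_file)
        shutil.copy(f"{repo}/{tracked_file}", target_path)
        raise
```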

Dependencies between system components/COBs are expressed through links. A link specifies that a COB requires a dependent COB to be in a particular state before executing a transition operation. Since COBs can have arbitrary states and transition operations, links can come in various flavors, depending on their end-points, and can carry multiple conditions. For example, the dependencies between two services can be:

  • COB B must be in state up when moving COB A to state up.

  • COB A must be in the state down when moving COB B to state down.

Dependencies and links can cross node boundaries and thus ensure globally the correct order of operations when moving the system between states.

When all COBs controlling a system have reached their desired state, the target system configuration has been achieved and all changes have been applied. This management method belongs to the class of procedural system management methods. Change objects are to some extent similar to the “manifests” used by Puppet or the “promises” of CFEngine, but their global dependencies and strict operation ordering are, to our knowledge, currently unique among procedural automated management systems.

The graph of all COBs in a system and their dependencies could in principle be solved centrally, with every action controlled from a central instance. However, that would limit scalability and raise issues related to recovery from failed nodes which did not finish their actions. In order to build a scalable automated management system we decided to divide the graph and distribute the COBs to the nodes which they affect. Each node’s cobd agent then drives the system, and its part of the graph, towards the desired state independently, and only interacts with the nodes which have COBs linked to its own. The architecture drawn in Fig. 2 shows the relevant components:

Fig. 2. Architecture sketch of a system configured with inter-dependent COBs, distributed over three nodes. The COBs are generated and/or stored in a central place. When a new COB generation is available, the COB daemons (cobd) pull the COBs that belong to them (green arrows). Intra-node dependencies are symbolised by black arrows, inter-node dependencies by red arrows.

5 Implementation of an Asynchronous Distributed Transition Engine

5.1 States and Transitions

Each change object is either in a specific state or in a transition towards another state. As mentioned in Sect. 4, COBs can have arbitrary states. Without loss of generality, and for the sake of simplicity, we discuss most of the following examples using a theoretical “simple” COB that has three states:

  • down: the COB has not been applied,

  • up: the COB has been applied,

  • outdated: the applied change is no longer valid.

One special state is hard-coded into each COB: unknown. It signals that the COB state is not known and that the COB-specific check() method needs to be called in order to determine the true state. It makes sense to follow some conventions when implementing COBs, like adding error states as possible exits from failed transitions, but they are not mandatory for the correct functioning of the engine.

A COB executes transition actions in order to move from one state to another, e.g. start() triggers the transition from the state down to up. Transitions need not link every state to every other state; for reaching a particular target state a COB can pass through multiple intermediate states.

Change objects can have a preferred target state.

5.2 Local State Machines

Each COB contains a state machine which decides about the next actions and coordinates them. The local state machine of a COB does not need information about the global state of the system. Dependencies might require the COB to communicate with others, but this interaction only happens with its nearest neighbors in the global dependency graph.

For defining the state machine of a particular change object it is sufficient to define the states that need to be managed and a sufficiently large set of transition actions, in addition to the check() method. The transitions must ensure that the desired states can be reached. The core of a COB’s state machine is its transition matrix, built from the transition actions. The matrix is usually sparse.

For the “simple” example COB defined in the previous section the matrix could be represented as follows:

From \ To    Outdated    Down         Up
Outdated     –           cleanup()    restart()
Down         –           –            start()
Up           –           stop()       –

In the example above the state outdated is not reachable explicitly via transitions. It is reached implicitly through the already mentioned check() method that is applied to COBs in the unknown state. The system triggers a check() whenever the COB content changes.
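
In code, such a sparse transition matrix can be kept as a simple mapping from (current state, target state) pairs to transition actions. The following sketch is illustrative only; the transition bodies are placeholders.

```python
# Sketch of the sparse transition matrix of the "simple" COB as a mapping
# (current state, target state) -> transition action. Placeholder actions only.
class SimpleCob:
    def __init__(self):
        self.state = "unknown"
        self.transitions = {
            ("outdated", "down"): self.cleanup,
            ("outdated", "up"): self.restart,
            ("down", "up"): self.start,
            ("up", "down"): self.stop,
        }

    def step_towards(self, target):
        if self.state == "unknown":
            self.check()  # resolve the true state first
        action = self.transitions.get((self.state, target))
        if action is not None:
            action()

    # Placeholder implementations; real COBs act on the system here.
    def check(self):
        self.state = "down"

    def cleanup(self):
        self.state = "down"

    def restart(self):
        self.state = "up"

    def start(self):
        self.state = "up"

    def stop(self):
        self.state = "down"
```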

5.3 Dependencies and Locking

To order the state transitions in all cases where it is necessary, dependencies between COBs can be defined. There are two different flavors of dependencies, called transition dependencies and lock dependencies:

  • a transition dependency states that when dependent object A wants to switch from state x to state y, it has to acquire a state lock of depending object B for state z.

    Notation: (A, x, y) ← (B, z)

  • a lock dependency states that when dependent object C wants to lock itself into state u (and subsequently grant state locks to other objects), it has to acquire a state-lock of depending object D for state v.

    Notation: (C, u) ← (D, v)

To execute an action or to grant a state lock to another COB, a COB must hold all locks which are specified for the intended transition or lock granting.

Dependencies are handled through entities called state locks. For example, COB A might request a lock from COB B for state up. This means that COB B gets the information that COB A wants it to be in state up and to stay there for a certain amount of time. If COB B is already in state up, and the other conditions that cause COB B to stay in that state are fulfilled, it will grant the lock to COB A. As a consequence, COB B will not switch state until COB A releases the lock.
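
A possible in-memory representation of the two dependency flavors and of the lock bookkeeping a COB needs is sketched below. The data structures and names are illustrative assumptions, not the actual implementation.

```python
# Sketch: transition dependencies (A, x, y) <- (B, z) and lock dependencies
# (C, u) <- (D, v) as records, plus the bookkeeping needed to decide whether
# a COB may act or grant a lock. All names are illustrative.
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TransitionDep:   # (A, x, y) <- (B, z), stored on A
    from_state: str
    to_state: str
    other_cob: str
    other_state: str


@dataclass(frozen=True)
class LockDep:         # (C, u) <- (D, v), stored on C
    own_state: str
    other_cob: str
    other_state: str


@dataclass
class LockTable:
    held: set = field(default_factory=set)        # (other_cob, state) locks we hold
    granted_to: set = field(default_factory=set)  # (other_cob, state) locks we granted

    def holds_all(self, deps):
        return all((d.other_cob, d.other_state) in self.held for d in deps)


# Example: svc A may only move from down to up while cfg A is locked in state "up".
dep = TransitionDep("down", "up", "cfg A", "up")
locks = LockTable(held={("cfg A", "up")})
assert locks.holds_all([dep])
```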

5.4 Transition Algorithm

The transitions between states are also driven by the state locks which are requested due to the dependencies between objects. A dependency causes the dependent object to request that the depending object lock the expected state if it is already there, or otherwise go into the specified state and lock it, until the lock is revoked.

Since every change object has a preferred state, it can decide about the next state transitions to be done and locks to be requested in a completely local way. The decision depends only on:

  • its own preferred state,

  • its own transition and lock dependencies (if it is the dependent object),

  • the lock requests it receives from other objects,

  • the state-locks it has been granted from other objects.

With properly built dependency graphs this method offers superior scalability compared to centralized management systems.

The algorithm deciding about the next state transitions and lock granting operates as follows:

  1. The target state of the object is computed. This is the state the object should go into next, if it can do so. It is determined as follows:

     • If lock requests from other objects are present, the requested state is chosen as the target state. If multiple requests for different states are present, the one with the highest priority is selected.

     • If no lock requests are present, the target state is the preferred state.

     For handling the first case, each COB needs a priority order of its possible states.

  2. If the object is already in the target state, it checks whether lock requests for this state are present. It issues lock requests itself for all its own lock dependencies for this state.

  3. If the object is already in the target state and already holds all needed state locks, it grants all state locks to other objects which are requesting this state.

  4. If the object is not in the target state, it issues lock requests for all its own transition dependencies from the current state to the target state.

  5. If the object is not in the target state and already holds all state locks for all its own transition dependencies from the current state to the target state, it triggers the transition.

  6. All locks held which are not needed for the transition to the target state are released.

  7. If the object is already in the target state and no lock requests for this state are present, the state is unlocked, all locks held are released, and all corresponding lock requests are revoked.

Steps 4 and 5 are skipped if no transition from the current state to the target state is possible.
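
The following sketch condenses these steps into a single local decision function of a COB. It is a simplification: the helper methods (pending_lock_requests(), lock_deps(), transition_deps(), and so on) are assumed to exist on the COB object and are not part of the actual implementation.

```python
# Sketch of the local decision step (steps 1-7); helper methods are assumed.
def decision_step(cob):
    # Step 1: compute the target state from pending lock requests or the preferred state.
    requests = cob.pending_lock_requests()  # states requested by other COBs
    if requests:
        target = max(requests, key=cob.state_priority)
    else:
        target = cob.preferred_state

    if cob.state == target:
        # Step 2: request locks for our own lock dependencies for this state.
        for dep in cob.lock_deps(target):
            cob.request_lock(dep.other_cob, dep.other_state)
        # Step 3: grant locks once all our own lock dependencies are satisfied.
        if cob.holds_all(cob.lock_deps(target)):
            for requester in cob.requesters_of(target):
                cob.grant_lock(requester, target)
        # Step 7: nothing requested for this state -> unlock and release everything.
        if not requests:
            cob.unlock_state()
            cob.release_all_locks()
            cob.revoke_lock_requests()
    elif cob.transition_exists(cob.state, target):
        deps = cob.transition_deps(cob.state, target)
        # Step 4: request the locks needed for the transition.
        for dep in deps:
            cob.request_lock(dep.other_cob, dep.other_state)
        # Step 5: trigger the transition once all needed locks are held.
        if cob.holds_all(deps):
            cob.transition(cob.state, target)
    # Step 6: release locks not needed for reaching the target state.
    cob.release_unneeded_locks(target)
```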

5.5 Cobd and Cob Proxies

Each change object belongs to the host on which its actions are executed. COBs representing the state of a distributed service which cannot be attributed to a particular host are nevertheless assigned to one host. All COBs belonging to one host are instantiated as objects in a daemon process, the cobd. Each host which can (potentially) have COBs needs to run its own instance of the cobd.

As COBs need to be able to request and grant locks both from other COBs residing in the same cobd and from COBs in a remote instance, an abstraction layer for communicating with other COBs is used. For each COB that a cobd process needs to communicate with, it holds a proxy object. This proxy object manages the forwarding and processing of messages from/to the other COB. For a local COB, methods on the COB instance are called directly; for a remote COB, messages are sent through ZeroMQ. In the current implementation each cobd keeps a permanent ZeroMQ connection open to every host that any of its COBs needs to communicate with.
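
A simplified sketch of such a proxy layer is shown below. The message format, port number and method names are illustrative assumptions and do not describe the actual wire protocol.

```python
# Sketch of the COB proxy abstraction: local COBs are called directly, remote
# COBs are reached via a ZeroMQ request socket. Message format, port and
# method names are illustrative assumptions.
import zmq


class LocalCobProxy:
    def __init__(self, cob):
        self.cob = cob

    def request_lock(self, state):
        return self.cob.handle_lock_request(state)


class RemoteCobProxy:
    def __init__(self, context, host, cob_name, port=5555):
        self.cob_name = cob_name
        self.socket = context.socket(zmq.REQ)
        # Permanent connection to the cobd on the remote host.
        self.socket.connect(f"tcp://{host}:{port}")

    def request_lock(self, state):
        self.socket.send_json({"cob": self.cob_name,
                               "op": "request_lock",
                               "state": state})
        return self.socket.recv_json()


# Usage (assuming a cobd is listening on node2):
#   ctx = zmq.Context()
#   proxy = RemoteCobProxy(ctx, "node2", "svc B")
#   proxy.request_lock("up")
```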

5.6 COB Generations

Configuration changes are implemented by changing the content of COBs, removing existing COBs or adding new ones. This is handled by generations of COBs. If the desired system configuration is changed, a new generation of COBs is created, reflecting the new desired system state.

If a COB’s content changes, it might be inconsistent with the currently implemented state. Thus, on every content change the COB is forced into the unknown state. If the check() operation finds that the system state matches the old content but not the new one, the COB is set to the outdated state. This means it needs to be restarted in order to reflect the new COB content.
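
This reaction to a content change can be sketched as follows; the real check() operations are COB-specific, so the comparison below is only illustrative.

```python
# Sketch: a content change forces the COB to "unknown"; check() then decides
# between up, down and outdated. Names and logic are illustrative.
class ContentCob:
    def __init__(self, content):
        self.content = content       # desired content of the new generation
        self.state = "unknown"

    def on_content_change(self, new_content):
        self.content = new_content
        self.state = "unknown"       # force a re-check

    def check(self, implemented_content):
        if implemented_content is None:
            self.state = "down"      # the change was never applied
        elif implemented_content == self.content:
            self.state = "up"        # already matches the new content
        else:
            self.state = "outdated"  # matches an older generation
        return self.state
```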

If existing COBs are removed, they are first requested to do a transition to a “pre-remove” state, e.g. the down state of our example COB. The new generation is not activated until all the COBs which are missing in the new generation have reached this state. An uncoordinated switch of COB generations on the cobds would lead to difficulties if different hosts had different active generations, as transitioning a COB to the “pre-remove” state might depend on a COB on a different host which has already disappeared. Therefore, when switching to a new generation of COBs, the dependency graphs are backtracked in order to find the order in which hosts may change to the new generation. It is possible that some hosts can only change the generation together, at the same time.

6 Application to Lustre System

The COB system is currently used for the setup of the NEC LXFS file system, a Lustre installation with a high-availability configuration and various monitoring and management services. For the deployment of LXFS, only a small data model with the relevant parameters, such as host names, IP addresses and RAID configuration, has to be created manually. From this data model the COBs of all servers are created and distributed to the servers, where they are processed by the running cobd instances.
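
To illustrate this step, the sketch below derives per-server COB descriptions from a few hypothetical data model entries. The field names, COB types and the dependency shown are examples only and do not reflect the actual LXFS data model or COB serialization.

```python
# Illustrative sketch only: hypothetical data model fields and COB types,
# not the actual LXFS data model or COB serialization.
data_model = {
    "servers": [
        {"host": "mds1", "ip": "10.0.0.1", "role": "mdt"},
        {"host": "oss1", "ip": "10.0.0.2", "role": "ost"},
    ],
}


def generate_cobs(model):
    cobs = []
    for srv in model["servers"]:
        cobs.append({"host": srv["host"], "type": "filetree",
                     "name": f"network-config-{srv['host']}"})
        # The MDT-before-OST ordering is expressed as a cross-host dependency.
        requires = [] if srv["role"] == "mdt" else [("lustre-mdt-mds1", "up")]
        cobs.append({"host": srv["host"], "type": "service",
                     "name": f"lustre-{srv['role']}-{srv['host']}",
                     "requires": requires})
    return cobs


for cob in generate_cobs(data_model):
    print(cob)
```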

Dependencies between COBs on different hosts are used heavily to order the installation steps. Altogether around 50 COBs per server are necessary for a complete setup, many of them having dependencies on other servers’ COBs. The configuration includes the complete network setup of the servers, setup of the BMC interfaces, storage array setup, Lustre configuration, Pacemaker HA configuration with STONITH, and a monitoring framework in HA configuration, as well as common services like Nagios, Ganglia with the needed web servers, and the Robinhood policy engine with its associated database server.

The high-availability setup using Pacemaker [12] requires certain steps to be done on all servers in the HA cluster before a specific HA service can be installed. The Lustre file system itself also requires ordering of operations across server boundaries; for example, the metadata targets (MDTs) need to be set up before the object storage targets (OSTs). All the setup steps involved in the complete LXFS system can be executed successfully in the right order by wrapping them in COBs.

7 Conclusion and Outlook

We designed and implemented an automated configuration management system that is able to manage the state of complex distributed systems by means of autonomously interacting, cooperative local agents. The system is able to handle complex dependencies which cross node boundaries and has been used for a production-level, highly available parallel file system setup.

In the current implementation the change objects are generated centrally from a data model; the serialization of the COBs was therefore chosen ad hoc, without much emphasis on user friendliness. Expecting that users might directly create and manipulate change objects, we plan to simplify the direct definition and editing of COBs by using a domain-specific language as well as a graphical user interface. In addition, we would like to allow mixing generated COBs with user-provided ones, and to investigate possibilities to re-use “recipes” from other related automated management systems.