
1 Introduction

The influence of artificial intelligence (AI) continues to permeate our daily lives at an ever-increasing pace. From personalized recommendations to autonomous vehicle navigation to smart personal assistants to health screening and diagnosis, AI has already proven to be effective on a day-to-day basis. But it can also be effective in tackling some of the world’s most challenging control problems – such as minimizing the power usage effectiveness (PUE) in a data center.

From server racks to large deployments, data centers are the backbone of delivering IT services, providing storage, communication, and networking to a growing number of users and businesses. With the emergence of technologies such as distributed cloud computing and social networking, data centers have an even bigger role to play in today’s world. In fact, the Cisco® Global Cloud Index, an ongoing effort to forecast the growth of data center and cloud-based traffic, estimates that global IP traffic will grow threefold over the next 5 years [2].

Naturally, data centers consume large amounts of energy, required primarily for cooling. While technologies such as virtualization and software-based architectures for optimizing the utilization and management of compute, storage, and network resources are constantly advancing [4], there is still substantial room to improve the energy efficiency of these systems using AI.

1.1 The Problem

In this paper, we tackle an instance of the aforementioned problem: optimizing the operation of a solid state drive (SSD) storage rack (see Fig. 1). The storage rack comprises 24 SSDs and a thermal management system with 5 cooling fans. It also has two 100G Ethernet ports for data Input/Output (I/O) between the client machines and the storage rack. Data I/O operations on the SSDs raise the rack’s temperature, which requires the fans to be turned on for cooling. Currently, the fan speeds are controlled by a simple tabular (rule-based) method: thermal sensors throughout the chassis, including one for each drive bay, record temperatures, and the fan speeds are varied according to a table that maps temperature thresholds to desired fan speeds. In contrast, our solution uses a deep reinforcement learning (DRL) control algorithm called the Advantage Actor-Critic to control the fans. Experimental results show significant performance gains over the rule-based current practice.

Fig. 1. The SSD storage rack we used for our experiments. It comprises 24 SSDs that are kept cool via 5 fans.

Ideally, if we could measure the PUE of the storage rack, it would be the precise function to minimize. However, the rack did not contain any hooks or sensors for directly measuring energy. As a proxy for the PUE, we instead designed a utility (or reward) function that represents the operational efficiency of the rack. Our reward function is explained in detail in Sect. 2.2; across the space of its arguments, it comprises contours of good as well as not-so-good values. The problem then becomes learning a control algorithm that optimizes the operational efficiency of the rack by always driving it towards the reward contours with good values.

1.2 Related Work and Our Contributions

Prior work on developing optimal control algorithms for data centers [4,5,6] builds approximate models to study the effects of thermal, electrical, and mechanical subsystem interactions in a data center. These model-based methods are sometimes inadequate and suffer from error propagation, which leads to sub-optimal control policies. Recently, Google DeepMind published a blog post [7] on using AI to reduce Google’s data centre cooling bill by 40%. In [8], the authors use the deep deterministic policy gradient technique on a simulation platform and achieve a low PUE as well as a 10% reduction in cooling energy costs.

Our novel contributions are three-fold. First, unlike prior model-based approaches, we formulate a model-free method that does not require any knowledge of the SSD server’s behavioral dynamics. Second, we train our DRL algorithm on the real system and do not require a simulator. Finally, since our SSD rack does not have sensors to quantify energy consumption, we devise a reward function that is used not only to quantify the system’s operational efficiency, but also as a control signal for training.

2 Our DRL-Based Solution

In this section, we provide the details of our solution to the operation optimization problem. We first introduce the reader to reinforcement learning (RL), and subsequently explain the algorithm and the deep network architecture used for our experiments.

2.1 Reinforcement Learning (RL) Preliminaries

RL is a field of machine learning that deals with how an agent (or algorithm) ought to take actions in an environment (or system) so as to maximize a certain cumulative reward function. It is gaining popularity due to its direct applicability to many practical problems in decision-making, control theory, and multi-dimensional optimization. RL problems are often modeled as a Markov Decision Process with the following typical notation. During any time slot t, the environment is described by its state \(s_t\). The RL agent interacts with the environment: it observes the state \(s_t\) and takes an action \(a_t\) from some set of actions according to its policy \(\pi (a_t|s_t)\), a probability density function that maps states to actions and is indicative of the agent’s behavior. In return, the environment provides an immediate reward \(r_t(s_t,a_t)\) (a function of \(s_t\) and \(a_t\)) and transitions to its next state \(s_{t + 1}\). This interaction loops in time until some terminating criterion is met (for example, until a time horizon H). The set of states, actions, and rewards the agent obtains while interacting with (or rolling out) the environment, \(\tau := \{(s_0, a_0, r_0), (s_1, a_1, r_1), \ldots , (s_{H-1}, a_{H-1}, r_{H-1}), s_H\}\), forms a trajectory. The cumulative reward observed in a trajectory \(\tau \) is called the return, \(\mathcal {R}(\tau ) = \sum _{t=0}^{H-1}\gamma ^{H-1-t}r_t(s_t,a_t)\), where \(\gamma \), \(0\le {}\gamma \le 1\), is a factor used to discount rewards over time. Figure 2 represents an archetypal setting of an RL problem.

Fig. 2. The classical RL setting. Upon observing the environment state \(s_t\), the agent takes an action \(a_t\). This results in an instantaneous reward \(r_t\), while the environment transitions to its next state \(s_{t+1}\). The objective of the agent is to maximize the cumulative reward over time.
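To make this interaction loop concrete, the following minimal Python sketch collects one trajectory of length H and computes the return \(\mathcal {R}(\tau )\) defined above. The `env`/`agent` interface (observe, step, act) is a hypothetical stand-in used only for illustration, not a specific library API.

```python
def rollout(env, agent, H, gamma):
    """Collect one trajectory of length H and compute its return.

    `env` and `agent` are hypothetical stand-ins exposing:
      env.observe() -> s_t,  env.step(a_t) -> (s_{t+1}, r_t),
      agent.act(s_t) -> a_t sampled from the policy pi(a_t | s_t).
    """
    trajectory = []
    s = env.observe()
    for t in range(H):
        a = agent.act(s)                 # a_t ~ pi(a_t | s_t)
        s_next, r = env.step(a)          # environment returns r_t and s_{t+1}
        trajectory.append((s, a, r))
        s = s_next

    # Return of the trajectory as defined in the text:
    # R(tau) = sum_{t=0}^{H-1} gamma^(H-1-t) * r_t(s_t, a_t)
    ret = sum(gamma ** (H - 1 - t) * r for t, (_, _, r) in enumerate(trajectory))
    return trajectory, ret
```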

In the above setting, the goal of the agent is to optimize the policy \(\pi \) so as to maximize the expected return \(\mathbb {E}_{\tau }[\mathcal {R}(\tau )]\), where the expectation is taken across several trajectories. Two functions related to the return are (a) the action-value function \(Q^{\pi }(s_t,a_t)\), which is the expected return for selecting action \(a_t\) in state \(s_t\) and then following the policy \(\pi \), and (b) the state-value function \(V^{\pi }(s_t)\), which measures the expected return from state \(s_t\) upon following the policy \(\pi \). The advantage of action \(a_t\) in state \(s_t\) is then defined as \(A^{\pi }(s_t,a_t)=Q^{\pi }(s_t,a_t)-V^{\pi }(s_t)\).

2.2 State, Action and Reward Formulations

In order to employ RL for solving our problem, we need to formulate state, action and reward representations.

State: We use a vector of length 7 for the state representation, comprising the following scalars (each averaged and then normalized):

  1. tps (transfers per second): the mean number of transfers per second issued to the SSDs. A transfer is an I/O request to the device and is of indeterminate size; multiple logical requests can be combined into a single I/O request to the device.

  2. kB_read_per_sec: the mean number of kilobytes read from an SSD per second.

  3. kB_written_per_sec: the mean number of kilobytes written to an SSD per second.

  4. kb_read: the mean number of kilobytes read from an SSD in the previous time slot.

  5. kb_written: the mean number of kilobytes written to an SSD in the previous time slot.

  6. temperature: the mean temperature recorded across the 24 SSD temperature sensors on the rack.

  7. fan_speed: the mean speed of the 5 cooling fans in revolutions per minute (rpm).

Recall that there are 24 SSDs in our rack (and hence 24 different values of transfers per second, bytes read, bytes written, etc.), but we simply use the values averaged over the 24 SSDs for our state representation.

To obtain features 1 through 5 above, we use the Linux system command iostat [20], which monitors input/output (I/O) device loading by observing the time the devices are active in relation to their average transfer rates. For features 6 and 7, we use the ipmi-sensors [21] command, which displays current sensor readings and sensor data repository information.

For normalizing features 1 through 7, we use the min-max strategy. Accordingly, for a feature X, an averaged value \(\bar{x}\) is transformed to \(\varGamma (\bar{x})\), where

$$\begin{aligned} \varGamma (\bar{x}) = \frac{\bar{x}-\text {minX}}{\text {maxX}-\text {minX}}, \end{aligned}$$
(1)

where minX and maxX are the minimum and maximum values set for the feature X. Table 1 lists the minimum and maximum values we used for normalizing the state features, chosen based on the empirically observed ranges of values.

Table 1. Minimum and maximum values of the various state variables used for the min-max normalization.
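As an illustration of how the state vector can be assembled, the sketch below shells out to iostat and ipmi-sensors and applies the min-max normalization of (1). The parsing details (device naming, column positions, sensor labels) and the MINMAX dictionary standing in for Table 1 are assumptions made for illustration; they would need to be adapted to the actual system.

```python
import subprocess
import numpy as np

FEATURES = ["tps", "kB_read_per_sec", "kB_written_per_sec",
            "kb_read", "kb_written", "temperature", "fan_speed"]

# Stand-ins for the Table 1 bounds; replace with the actual (min, max) values.
MINMAX = {f: (0.0, 1.0) for f in FEATURES}

def minmax(x, lo, hi):
    """Min-max normalization of Eq. (1)."""
    return (x - lo) / (hi - lo)

def read_iostat():
    """Average tps, kB_read/s, kB_wrtn/s, kB_read and kB_wrtn over the SSDs.

    Assumes NVMe device naming and the classic `iostat -d -k` column order;
    adjust the filter and column indices to the local iostat version.
    """
    out = subprocess.check_output(["iostat", "-d", "-k"], text=True)
    rows = [line.split() for line in out.splitlines() if line.startswith("nvme")]
    cols = np.array([[float(v) for v in r[1:6]] for r in rows])
    return cols.mean(axis=0)

def read_ipmi():
    """Mean SSD temperature and mean fan rpm via ipmi-sensors.

    The sensor labels and the position of the reading field are system
    dependent; the parsing below is schematic only.
    """
    out = subprocess.check_output(["ipmi-sensors"], text=True)
    temps, fans = [], []
    for line in out.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 4:
            continue
        if "Temp" in fields[1]:
            temps.append(float(fields[3]))
        elif "Fan" in fields[1]:
            fans.append(float(fields[3]))
    return float(np.mean(temps)), float(np.mean(fans))

def get_state():
    """Assemble the normalized 7-dimensional state vector."""
    raw = list(read_iostat()) + list(read_ipmi())
    return np.array([minmax(x, *MINMAX[f]) for f, x in zip(FEATURES, raw)])
```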

Action: The action in our problem is simply the setting of the fan speeds. In order to keep the action space manageable, we apply the same action (i.e., speed setting) to all 5 fans. To control the fan speeds, we use the ipmitool [22] command-line interface. We consider two separate scenarios:

  • raw action: the action space is discrete with values 0 through 6, where 0 maps to 6000 rpm and 6 maps to 18000 rpm. Accordingly, only 7 different rpm settings are allowed: 6000 through 18000 in steps of 2000 rpm. Note that consecutive actions can be very different from each other.

  • incremental action: the action space is discrete, taking on 3 values: 0, 1, or 2. An action of 1 indicates no change in the fan speed, while 0 and 2 refer to a decrement or increment of the current fan speed by 1000 rpm, respectively. This scenario allows for smoother action transitions. For this case, we allow 10 different rpm values: 9000 through 18000, in steps of 1000 rpm. Both encodings are summarized in the sketch after this list.
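The sketch below captures the two action encodings described above. The mapping from an action index to an rpm value follows the text, but the actual ipmitool raw command that pushes a speed to the fan controller is BMC/vendor specific, so the command bytes and the rpm-to-duty-cycle conversion shown here are placeholders only.

```python
import subprocess

RAW_RPMS = [6000 + 2000 * a for a in range(7)]   # raw actions 0..6 -> 6000..18000 rpm

def incremental_rpm(current_rpm, action):
    """Incremental actions: 0 = -1000 rpm, 1 = no change, 2 = +1000 rpm,
    clipped to the allowed 9000..18000 rpm range."""
    return max(9000, min(18000, current_rpm + (action - 1) * 1000))

def set_fan_speed(rpm):
    """Apply the same speed setting to all 5 fans via ipmitool.

    The raw command bytes that set a fan duty cycle are BMC/vendor specific;
    the bytes and the rpm-to-duty-cycle mapping below are placeholders.
    """
    duty = int(round(rpm / 18000 * 100))
    subprocess.run(["ipmitool", "raw", "0x30", "0x70", hex(duty)], check=False)
```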

Reward: We now design a reward function that serves as a proxy for the operational efficiency of the SSD rack. One of the most important components of an RL solution is reward shaping, the process of incorporating domain knowledge when engineering a reward function so as to better guide the agent towards its optimal behavior. Devising a good reward function is critical since it directly determines the expected return that needs to be maximized. We now list some desired properties of a meaningful reward function that help perform reward shaping in the context of our problem.

  • Keeping both the devices’ temperatures and the fan speeds low should yield the highest reward, since this scenario means the device operation is most efficient. Note, however, that this case is feasible only when the I/O loads are absent or very small.

  • Irrespective of the I/O load, a low temperature in conjunction with a high fan speed should yield a poor reward. Otherwise, the agent could always set the fan speed to its maximum value, which would not only consume a lot of energy but also increase the wear on the mechanical components of the system.

  • A high temperature in conjunction with a low fan speed should also yield a poor reward. If not, the agent may always choose to set the fan speed to its minimum value, which can overheat the system and potentially damage the SSDs, in particular when the I/O loads are high.

  • Finally, for different I/O loads, the optimal rewards should be similar. Otherwise, the RL agent may learn to overfit and perform well only on certain loads.

While there are several potential candidates for our desired reward function, we used the following mathematical function:

$$\begin{aligned} R = -\max \left( \frac{\varGamma (\bar{T})}{\varGamma (\bar{F})},\frac{\varGamma (\bar{F})}{\varGamma (\bar{T})}\right) , \end{aligned}$$
(2)

where \(\bar{T}\) and \(\bar{F}\) represent the averaged values of temperature (over the 24 SSDs) and fan speed (over the 5 fans), respectively. \(\varGamma (\cdot )\) is the normalizing transformation explained in (1), performed using the temperature and fan speed minimum and maximum values listed in Table 1. Note that while this reward function weights \(\bar{T}\) and \(\bar{F}\) equally, the relationship between them can be tweaked to meet other preferential tradeoffs that a system operator might find desirable. Nevertheless, the DRL algorithm will be able to optimize the policy for any designed reward function.
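For concreteness, the reward of (2) can be computed as in the following sketch. The (min, max) bounds are the temperature and fan speed entries of Table 1; the small epsilon guard against division by zero at the lower bounds is our addition and not part of (2).

```python
def reward(mean_temp, mean_fan, t_min, t_max, f_min, f_max, eps=1e-6):
    """Operational-efficiency reward of Eq. (2).

    mean_temp: temperature averaged over the 24 SSDs (deg C).
    mean_fan:  fan speed averaged over the 5 fans (rpm).
    The (min, max) bounds are the temperature and fan speed rows of Table 1;
    eps guards against division by zero at the lower bounds (our addition).
    """
    t = max(eps, (mean_temp - t_min) / (t_max - t_min))   # Gamma(T_bar), Eq. (1)
    f = max(eps, (mean_fan - f_min) / (f_max - f_min))    # Gamma(F_bar), Eq. (1)
    return -max(t / f, f / t)
```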

Figure 3 plots the reward function as a function of the mean temperature \(\bar{T}\) (in \(^\circ \)C) and mean fan speed \(\bar{F}\) (in rpm). Also shown on the temperature-fan speed plane are contours representing regions of similar rewards. The colorbar on the right shows that blue and green represent regions with poor rewards, while dark and light brown shades mark regions with high rewards. All the aforementioned desired properties are satisfied by this reward function: the reward is maximal when both the fan speed and temperature are low; when either of them becomes high, the reward drops; and there are regions of similar maximal rewards for different I/O loads (across the space of temperatures and fan speeds).

Fig. 3. Depiction of the reward (operational efficiency) versus temperature and fan speed. The reward contours are also plotted on the temperature-fan speed surface. The brown colored contour marks the regions of optimal reward. (Color figure online)

2.3 Algorithm: The Advantage Actor-Critic (A2C) Agent

Once we have formulated the state, action, and reward components, there are several methods in the RL literature for solving our problem. A classic approach is the policy gradient (PG) algorithm [9], which essentially uses gradient-based techniques to optimize the agent’s policy. PG algorithms have lately gained popularity over traditional RL approaches such as Q-learning [10] and SARSA [11] since they have better convergence properties and can be effective in high-dimensional and continuous action spaces.

While we experimented with several algorithms, including the Vanilla Policy Gradient [12] (and its variants [13, 14]) and Deep Q-Learning [15], the most encouraging results were obtained with the Advantage Actor-Critic (A2C) agent. A2C is essentially a synchronous, deterministic variant of the Asynchronous Advantage Actor-Critic (A3C) [16], which yields state-of-the-art performance on several Atari games as well as on a wide variety of continuous motor control tasks.

As the name suggests, actor-critic algorithms comprise two components, an actor and a critic. The actor determines the best action to perform in any given state, and the critic evaluates the action taken by the actor. Iteratively, the actor-critic network implements generalized policy iteration [11], alternating between a policy evaluation step and a policy improvement step. Architecturally, both the actor and the critic are best modeled via function approximators, such as deep neural networks.

2.4 Actor-Critic Network Architecture

For our experiments, we employed a dueling network architecture, similar to the one proposed in [17]. The exact architecture is depicted in Fig. 4: the state of the system is a vector of length 7 that is fed as input to a fully connected (FC) layer with 10 neurons, represented by trainable weights \(\theta \). The output of this FC layer branches out into two separate feed-forward networks: the policy (actor) network (depicted on the upper branch in Fig. 4) and the state-value function (critic) network (depicted on the lower branch). The parameters \(\theta \) are shared between the actor and critic networks, while additional parameters \(\alpha \) and \(\beta \) are specific to the policy and state-value networks, respectively. The policy network has a hidden layer with 5 neurons and a final softmax output layer for predicting the action probabilities (for the 7 raw or 3 incremental actions). The state-value function network comprises a hidden layer of size 10 that culminates in a scalar output estimating the value function of the input state. The actor aims to approximate the optimal policy \(\pi ^*\): \(\pi (a|s;\theta ,\alpha )\approx \pi ^*(a|s)\), while the critic aims to approximate the optimal state-value function: \(V(s;\theta ,\beta )\approx V^*(s)\).

Fig. 4. The employed dueling network architecture with shared parameters \(\theta \). \(\alpha \) and \(\beta \) are the actor- and critic-specific parameters, respectively. All layers are fully connected; the numbers shown represent the hidden layer dimensions. The policy network on the upper branch estimates the action probabilities via a softmax output layer, while the critic network on the lower branch approximates the state-value function.
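The architecture of Fig. 4 translates directly into a small network. The PyTorch sketch below follows the layer sizes given above (shared 10-unit FC layer, a 5-unit actor branch with softmax output, a 10-unit critic branch with scalar output); the choice of PyTorch and of ReLU activations is ours for illustration, since the paper does not specify them.

```python
import torch
import torch.nn as nn

class DuelingActorCritic(nn.Module):
    """Dueling actor-critic of Fig. 4: shared trunk (theta), policy head (alpha),
    value head (beta). ReLU activations and the use of PyTorch are our choices."""

    def __init__(self, state_dim=7, n_actions=3):   # n_actions = 7 (raw) or 3 (incremental)
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(state_dim, 10), nn.ReLU())       # theta
        self.actor = nn.Sequential(nn.Linear(10, 5), nn.ReLU(),
                                   nn.Linear(5, n_actions))                    # alpha
        self.critic = nn.Sequential(nn.Linear(10, 10), nn.ReLU(),
                                    nn.Linear(10, 1))                          # beta

    def forward(self, state):
        h = self.shared(state)
        logits = self.actor(h)                 # softmax over actions is applied when sampling
        value = self.critic(h).squeeze(-1)     # V(s; theta, beta)
        return torch.distributions.Categorical(logits=logits), value
```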

Prior DRL architectures for actor-critic methods [16, 18, 19] employ single-stream architectures in which the actor and critic networks do not share parameters. The advantage of our dueling network lies partly in its ability to compute both the policy and state-value functions with fewer trainable parameters than single-stream architectures. The sharing of parameters also helps mitigate overfitting one function over the other (between the policy and state-value functions). In other words, our dueling architecture is able to learn both the state-value and the policy estimates efficiently. With every update of the policy network parameters in the dueling architecture, the shared parameters \(\theta \) are updated as well; this contrasts with a single-stream architecture, in which an update of the policy parameters leaves the state-value function parameters untouched. The more frequent updating of the parameters \(\theta \) means a larger share of the learning effort is devoted to the shared representation, resulting in faster convergence as well as better function approximations.

The pseudocode for our A2C algorithm in the context of the dueling network architecture (see Fig. 4) is given in Algorithm 1. Note that R represents the Monte Carlo return and approximates the action-value function well. Accordingly, we use \(R-V(s;\theta ,\beta )\) as an approximation to the advantage function.

Algorithm 1. A2C training with the dueling network architecture.
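Since the algorithm listing is not reproduced here, the following sketch shows the core of one A2C parameter update as described above: the Monte Carlo return R stands in for the action value, and \(R-V(s;\theta ,\beta )\) is used as the advantage estimate. The loss coefficients, the squared-error critic loss, and the entropy bonus are illustrative defaults of ours rather than values taken from the paper.

```python
import torch

def a2c_update(model, optimizer, states, actions, returns,
               value_coef=0.5, entropy_coef=0.01):
    """One A2C parameter update on a batch of (s_t, a_t, R_t) tuples.

    `model` is the DuelingActorCritic sketched above; `returns` holds the
    Monte Carlo returns R. The loss coefficients are illustrative defaults.
    """
    dist, values = model(states)                  # pi(.|s; theta, alpha) and V(s; theta, beta)
    advantages = returns - values.detach()        # advantage estimate: R - V(s; theta, beta)

    policy_loss = -(dist.log_prob(actions) * advantages).mean()
    value_loss = (returns - values).pow(2).mean() # critic regresses towards R
    entropy = dist.entropy().mean()               # entropy bonus encourages exploration

    loss = policy_loss + value_coef * value_loss - entropy_coef * entropy
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```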

3 Experimental Setup and Results

3.1 Timelines

Time is divided into slots of 25 s. At the beginning of every time slot, the agent observes the state of the system and prescribes an action. The system is then allowed to stabilize, and the reward is recorded at the end of the time slot (which is also the beginning of the subsequent slot). By then, the system has transitioned to its next state, and the next action is prescribed. We use a time horizon of 10 slots (\(H=250\) s), and each iteration comprises \(N=2\) horizons, i.e., the network parameters \(\theta \), \(\alpha \) and \(\beta \) are updated every 500 s.
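In code, the slotted interaction amounts to an outer loop of the following form. The timing constants come from the text; the helper callables refer to the illustrative sketches given in Sect. 2.2 and above, not to our actual implementation.

```python
import time

SLOT_SECONDS = 25   # duration of one time slot
H = 10              # slots per trajectory (horizon)
N = 2               # trajectories per parameter update -> update every 500 s

def run_training(agent, get_state, apply_action, get_reward, steps=3000):
    """Slotted real-time training loop.

    `get_state`, `apply_action` and `get_reward` are callables wrapping the
    system interfaces (cf. the sketches in Sect. 2.2); `agent` exposes
    act(state) and update(batch) as in the A2C sketch above.
    """
    batch = []
    state = get_state()
    for _ in range(steps):
        action = agent.act(state)                    # prescribe an action at the slot boundary
        apply_action(action)
        time.sleep(SLOT_SECONDS)                     # let the system stabilize for one slot
        next_state = get_state()
        batch.append((state, action, get_reward()))  # reward recorded at the end of the slot
        state = next_state
        if len(batch) == N * H:                      # every N*H = 20 slots, i.e. 500 s
            agent.update(batch)                      # update theta, alpha and beta
            batch = []
```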

3.2 I/O Scenarios

We consider two different I/O loading scenarios for our experiments.

  • Simple periodic workload: We assume a periodic load in which, within each period, there is no I/O activity for roughly 1000 s, followed by heavy I/O loading for the same duration. I/O loading is performed using a combination of ‘read’, ‘write’ and ‘randread’ operations with block sizes ranging from 4 KB to 64 KB. A timeline of the periodic workload is depicted in Fig. 5 (left).

  • Complex stochastic workload: This is a more realistic workload in which, in every time window of 1000 s, the I/O load is chosen uniformly at random from three possibilities: no load, medium load, or heavy load. A sample realization of the stochastic workload is shown in Fig. 5 (right). A sketch of how such load phases can be generated is given after this list.
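The ‘read’, ‘write’ and ‘randread’ operations with 4-64 KB block sizes can be driven by a standard I/O generator such as fio; the sketch below shows one way to structure the stochastic workload. The paper does not name the tool it used, so fio itself, the device path, and the per-level job counts are assumptions made purely for illustration.

```python
import random
import subprocess
import time

def run_load_phase(level, device="/dev/nvme0n1", duration_s=1000):
    """Drive one 1000 s load phase at the given level ('none', 'medium', 'heavy').

    The use of fio, the device path, and the job counts per level are
    illustrative assumptions; only the operation types and block sizes
    follow the description in the text.
    """
    if level == "none":
        time.sleep(duration_s)                          # idle phase: no I/O issued
        return
    jobs = 4 if level == "medium" else 16               # assumed intensity per level
    op = random.choice(["read", "write", "randread"])   # operation mix from the text
    bs = random.choice(["4k", "16k", "64k"])            # block sizes between 4 KB and 64 KB
    subprocess.run(["fio", "--name=load", f"--filename={device}",
                    f"--rw={op}", f"--bs={bs}", f"--numjobs={jobs}",
                    "--direct=1", "--time_based", f"--runtime={duration_s}",
                    "--group_reporting"], check=True)

def stochastic_workload(n_windows=30):
    """Complex stochastic workload: pick a load level uniformly at random every 1000 s."""
    for _ in range(n_windows):
        run_load_phase(random.choice(["none", "medium", "heavy"]))
```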

Fig. 5. The simple periodic workload (left) with a period of 2000 s, and a realization of the more realistic stochastic workload (right). Histograms of the load types are also shown for clarity. While the simple load alternates periodically between no load and heavy load, the complex load chooses uniformly at random among the no load, medium load, and heavy load scenarios.

3.3 Hyperparameters

Table 2 lists some hyperparameters used during model training.

Table 2. Table of hyperparameters.

3.4 Results

In this section, we present our experimental results. Specifically, we consider three separate scenarios: (a) the periodic load with raw actions, and the stochastic load with both (b) raw and (c) incremental actions. In each case, we compare the performance of our A2C algorithm (after convergence) against the default policy (which we term the baseline). Recall that the baseline simply uses a tabular method to control the fan speeds based on temperature thresholds, as sketched below.
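For reference, the rule-based baseline amounts to a threshold lookup of the following form. The thresholds and speeds in the vendor’s table are not published, so the numbers below are placeholders chosen purely to illustrate the mechanism.

```python
# Placeholder threshold table: (temperature upper bound in deg C, fan speed in rpm).
BASELINE_TABLE = [(35, 6000), (45, 9000), (55, 12000), (float("inf"), 18000)]

def baseline_fan_speed(mean_temp):
    """Rule-based baseline: map the measured temperature to a fan speed via
    fixed thresholds (illustrative values, not the vendor's actual table)."""
    for threshold, rpm in BASELINE_TABLE:
        if mean_temp <= threshold:
            return rpm
```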

Fig. 6. Performance comparison of baseline and A2C for scenario 1.

(a) Scenario 1: Periodic Load with Raw Actions. We first experimented with the periodic I/O load shown in Fig. 5 (left). Figure 6 summarizes the results: it shows the I/O activity, normalized cumulative rewards, fan speeds, and temperature values over time for both the baseline and our method. Compared to the baseline, the A2C algorithm provided a cumulative reward uplift of \({\sim }33\%\) for similar I/O activity. The higher reward was obtained primarily because the A2C algorithm prescribed a higher fan speed under heavy I/O loading (roughly 16000 rpm versus 11000 rpm for the baseline), which resulted in a lower temperature (\(52\,^\circ \)C versus \(55\,^\circ \)C). To clarify this, Fig. 7 plots the contour regions of the temperatures and fan speeds for the baseline (left) and for the A2C algorithm at convergence (right). The black-colored blobs mark the operating points of the SSD rack for the two types of load. Evidently, the A2C method converges to a better reward contour than the baseline.

Fig. 7. For the no load scenario, the baseline policy settles at \(36\,^\circ \)C and 9 K rpm, while the A2C algorithm converges to \(35\,^\circ \)C and 10 K rpm. With heavy I/O loading, the corresponding numbers are \(55\,^\circ \)C and 11 K rpm versus \(52\,^\circ \)C and 16 K rpm. The A2C algorithm is seen to always settle at the innermost contour, as desired.

Fig. 8. Performance comparison of baseline and A2C for scenario 2. The mean values of the temperatures and fan speeds are shown using the black line.

(b) Scenario 2: Stochastic Load with Raw Actions. With the stochastic I/O load (see Fig. 8), the overall reward uplift obtained is smaller (only 12%, averaged over 3000 time steps) than in the periodic load case. Again, the A2C algorithm benefits by increasing the fan speeds to keep the temperatures lower. Upon looking more closely at the convergence contours (Fig. 9), we note that with the stochastic load, the A2C method sometimes settles on sub-optimal reward contours. We believe this happened due to insufficient exploration.

Fig. 9. Temperature and fan speed contours for the baseline (left) and the A2C method (right). With the stochastic load and 7 raw actions, the A2C method does not converge as well as in the periodic load case (cf. Fig. 7 (right)).

Fig. 10. Performance comparison of baseline and A2C for scenario 3.

Fig. 11. Temperature and fan speed contours for the A2C method under scenario 3. The left plot is taken during the early steps of training, while the right plot is taken at convergence. This illustrates that the A2C algorithm is able to start exploring from a completely random policy (black blobs everywhere) and learns to converge to the contour region with the best reward.

(c) Scenario 3: Stochastic Load with Incremental Actions. With raw actions, the action space is large to explore given the random nature of the I/O load, and this slows learning. To help the algorithm explore better, we study scenario 3, in which actions can take on only 3 possible values (as compared to 7 values in the prior scenario). With this modification, more promising results are observed: specifically, we observed a cumulative reward uplift of 32% (see Fig. 10). In fact, the A2C algorithm is able to start from a completely random policy (Fig. 11 (left)) and learn to converge to the contour region with the best reward (Fig. 11 (right)).

4 Concluding Remarks

In this paper, we tackle the problem of optimizing the operational efficiency of an SSD storage rack server using the A2C algorithm with a dueling network architecture. Experimental results demonstrate promising reward uplifts of over 30% across two different data I/O scenarios. We hope that this original work on applied deep reinforcement learning instigates interest in employing DRL for other industrial and manufacturing control problems. Interesting directions for future work include experimenting with other data I/O patterns and reward functions, and scaling this work up to train multiple server racks in parallel in a distributed fashion via a single agent or multiple agents.