1 Introduction

Many resource management systems use typical scheduling algorithms that rely only on the resource availability observed at scheduling time, and their ability to reliably distribute application tasks among cloud servers remains deficient. An analysis of Alibaba cluster data [3] shows that server load is significantly imbalanced both spatially (across machines) and temporally (over time). To address the limits of existing task scheduling methods, this paper proposes a balanced task scheduling strategy based on multi-resource prediction and allocation to achieve a better load balance among cloud servers.

The main contributions of this paper are: (i) From periodically sampled load feedback, we forecast the future load of the servers with a time series prediction model, Prophet [7]. We then use a multi-objective particle swarm optimization algorithm, OMOPSO [8], to determine the mapping between tasks and servers from the predicted load, the actual load, and the load threshold. (ii) We use the Alibaba cluster trace with 1,310 servers as the test dataset to evaluate the prediction accuracy, and we perform a load balance analysis to verify the effectiveness of the task scheduling strategy. Experimental results show that the proposed strategy achieves more balanced CPU and memory utilization.

2 Problem Description

Definition 1

Server and its resource utilization vector. The data center has n servers \(S_{i},i\in [1,n]\). Vector \(\overrightarrow{S_{i}^{cur}}=(S_{i,CPU}^{cur},S_{i,Mem}^{cur})\) represents the current resource utilization of server \(S_{i}\), where \(S_{i,CPU}^{cur}\) is its current CPU utilization and \(S_{i,Mem}^{cur}\) is its current memory utilization. Vector \(\overrightarrow{S_{i}^{nxt}}=(S_{i,CPU}^{nxt},S_{i,Mem}^{nxt})\) represents the predicted resource utilization of server \(S_{i}\) in the next period.

Definition 2

Batch task and its resource occupancy rate. At a given time, m batch tasks need to be deployed to the servers. \(B_{j},j\in [1,m]\) represents a batch task, where \(B_{j,CPU}\) is the CPU requirement of \(B_{j}\) and \(B_{j,Mem}\) is its memory requirement.

Definition 3

Batch tasks to servers deployment matrix. The deployment relationship between the batch tasks and servers can be expressed as a matrix \(E=(e_{ij})_{n\times m}\). When batch task \(B_{j}\) is deployed to server \(S_{i}\), \(e_{ij}=1\), otherwise \(e_{ij}=0\).

Definition 4

Server and its current utilization estimate. For server \(S_{i}\), its current CPU utilization estimate is the sum of \(S_{i,CPU}^{cur}\) and the CPU resource requested for all batch tasks deployed on it: \(EST_{i,CPU}^{cur}=S_{i,CPU}^{cur}+\sum _{j=1}^{m}e_{ij}B_{j,CPU}\). In the same way, its current memory utilization estimate is \(EST_{i,Mem}^{cur} = S_{i,Mem}^{cur} + \sum _{j=1}^{m}e_{ij}B_{j,Mem}\).

Definition 5

Server and its next-period utilization estimate. Assume that the batch tasks currently deployed do not finish within the next period. For server \(S_{i}\), its next-period CPU utilization estimate \(EST_{i,CPU}^{nxt}\) is the sum of \(S_{i,CPU}^{nxt}\) and the CPU resource requested by all the batch tasks currently deployed on it: \(EST_{i,CPU}^{nxt}=S_{i,CPU}^{nxt}+\sum _{j=1}^{m}e_{ij}B_{j,CPU}\). In the same way, its next-period memory utilization estimate is \(EST_{i,Mem}^{nxt}=S_{i,Mem}^{nxt}+\sum _{j=1}^{m}e_{ij}B_{j,Mem}\).
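
Definitions 4 and 5 share one formula and differ only in the base utilization vector. The following is a minimal sketch under that reading, not the implementation used in the experiments; the function name `estimate_utilization` and all toy values are illustrative assumptions.

```python
def estimate_utilization(base, demand, deploy):
    """Return EST_i = base_i + sum_j e_ij * B_j for every server i.

    base   -- n per-server utilizations (S^cur or S^nxt), in percent
    demand -- m per-task requirements (B_j) for the same resource, in percent
    deploy -- n x m 0/1 matrix E; deploy[i][j] == 1 if task j runs on server i
    """
    return [
        s_i + sum(e_ij * b_j for e_ij, b_j in zip(row, demand))
        for s_i, row in zip(base, deploy)
    ]


# Toy data (hypothetical): 2 servers, 3 batch tasks, CPU only.
cpu_cur = [30.0, 50.0]          # S^cur_{i,CPU}
cpu_nxt = [35.0, 45.0]          # S^nxt_{i,CPU}
task_cpu = [10.0, 5.0, 20.0]    # B_{j,CPU}
E = [[1, 0, 1],
     [0, 1, 0]]

est_cur = estimate_utilization(cpu_cur, task_cpu, E)   # Definition 4
est_nxt = estimate_utilization(cpu_nxt, task_cpu, E)   # Definition 5
```

The same function applies to memory by passing the memory vectors instead of the CPU vectors.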

Problem Model. With the above definitions, the server load balancing problem can be modeled as a multi-objective optimization problem whose objective functions are:

$$\begin{aligned} \min \left( K_{Res}^{cur}\right)&=\min \left( \sqrt{\frac{1}{n}\sum _{i=1}^{n}\left( EST_{i,Res}^{cur} -\frac{1}{n}\sum _{k=1}^{n}EST_{k,Res}^{cur}\right) ^{2}}\right) , \\ Res&\in \{CPU,Mem\} \end{aligned}$$
(1)

\(K_{Res}^{cur}\) is the standard deviation of the current resource utilization estimate for servers of the data center.
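
As an illustration of objective (1), the sketch below computes the population standard deviation of a list of per-server utilization estimates. The function name `imbalance` and the sample values are assumptions made for this example only.

```python
import math


def imbalance(est):
    """Population standard deviation of per-server estimates, as in Eq. (1)."""
    n = len(est)
    mean = sum(est) / n
    return math.sqrt(sum((x - mean) ** 2 for x in est) / n)


# One value per resource; the optimizer minimizes both at the same time.
k_cpu = imbalance([60.0, 55.0, 50.0])   # hypothetical CPU estimates
k_mem = imbalance([40.0, 40.0, 40.0])   # perfectly balanced memory load
```

A perfectly balanced cluster yields 0, and larger values indicate a wider spread of load across servers.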

The constraint functions are as follows:

$$\begin{aligned} \sum _{i=1}^{n}e_{ij}=1,\quad j=1,2,\ldots ,m \end{aligned}$$
(2)

indicating that each batch task is deployed on exactly one server.

$$\begin{aligned} EST_{i,Res}^{cur}=S_{i,Res}^{cur}+\sum _{j=1}^{m}e_{ij}B_{j,Res}<T_{i,Res} \end{aligned}$$
(3)
$$\begin{aligned} EST_{i,Res}^{nxt}=S_{i,Res}^{nxt}+\sum _{j=1}^{m}e_{ij}B_{j,Res}<T_{i,Res} \end{aligned}$$
(4)

Constraints (3) and (4) require that, once the batch tasks are deployed on the servers, both the current and the next-period resource utilization estimates stay below the server resource threshold. The resource threshold of server \(S_{i}\) is \(T_{i,Res}\).
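
Constraints (2), (3), and (4) can be checked for a candidate deployment matrix as in the following sketch. It handles one resource at a time; the function name `is_feasible` and the toy numbers are hypothetical and serve only to illustrate the constraints.

```python
def is_feasible(deploy, cur, nxt, demand, threshold):
    """Check constraints (2)-(4) for one resource (CPU or memory).

    deploy    -- n x m 0/1 matrix E
    cur, nxt  -- current and predicted per-server utilization, in percent
    demand    -- per-task requirement B_j, in percent
    threshold -- per-server threshold T_i, in percent
    """
    n, m = len(deploy), len(demand)
    # Constraint (2): every task is placed on exactly one server.
    for j in range(m):
        if sum(deploy[i][j] for i in range(n)) != 1:
            return False
    # Constraints (3) and (4): both estimates stay strictly below T_i.
    for i in range(n):
        load = sum(deploy[i][j] * demand[j] for j in range(m))
        if cur[i] + load >= threshold[i] or nxt[i] + load >= threshold[i]:
            return False
    return True


cur, nxt = [30.0, 50.0], [35.0, 45.0]
demand, limit = [10.0, 5.0, 20.0], [70.0, 70.0]
ok = is_feasible([[1, 0, 1], [0, 1, 0]], cur, nxt, demand, limit)
overloaded = is_feasible([[1, 1, 1], [0, 0, 0]], cur, nxt, demand, limit)
```

A deployment is acceptable only if this check passes for both CPU and memory.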

3 Experimental Evaluation

The cluster data released by Alibaba in 2017 is used as the experimental data. It contains 12 hours of trace information for 1,310 machines, including machine resource usage and batch task workload.

We use the logistic growth model of Prophet for prediction. The model parameters are as follows: capacity is 100%, changepoint_range is 100%, changepoint_prior_scale is 0.2, and n_changepoints is automatically set by the model. A sliding window mechanism is applied to predict the workload, and the length of the window is set to 8.
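
The sliding window mechanism can be sketched independently of the forecasting model. In the example below, each one-step-ahead prediction sees only the previous 8 samples. The default forecaster is a plain window mean that stands in for the Prophet model so that the sketch runs without external libraries; the name `sliding_forecast`, the stand-in forecaster, and the clipping to the range 0 to 100% are assumptions of this example.

```python
def sliding_forecast(series, window=8, forecaster=None):
    """One-step-ahead forecasts, each using only the last `window` samples.

    forecaster -- callable mapping a list of past values to one prediction;
                  the default (window mean) is a placeholder, not Prophet.
    """
    if forecaster is None:
        forecaster = lambda history: sum(history) / len(history)
    predictions = []
    for t in range(window, len(series)):
        history = series[t - window:t]          # samples t-window .. t-1
        value = forecaster(history)
        # Keep the prediction inside the 0-100% utilization range.
        predictions.append(min(100.0, max(0.0, value)))
    return predictions


load = [float(x) for x in range(10)]            # hypothetical utilization trace
preds = sliding_forecast(load)                  # predictions for t = 8 and t = 9
```

Any model that maps a history window to a single value, including a fitted time series model, can be passed as `forecaster`.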

We first verify the prediction accuracy of the proposed method. Figure 1 shows the actual load and the predicted load of a server (id = 600) over the sampling period. The predicted curve closely follows the fluctuation of the machine load.

Fig. 1. Actual and predicted load comparison of machine id 600

Then, we evaluate the effectiveness of the balanced scheduling strategy. We select 4 load sampling time periods from the Alibaba cluster data and, in each time period, reschedule the first 5,000 batch tasks across all servers.

We solve problem (1) with the OMOPSO algorithm under constraints (2), (3), and (4). By tracking 4 load sampling timestamps, we obtain the actual resource utilization \(S_{i,CPU}^{cur}\) and \(S_{i,Mem}^{cur}\) of the machines, and we obtain the predicted values \(S_{i,CPU}^{nxt}\) and \(S_{i,Mem}^{nxt}\) of future resource utilization through the Prophet model. The resource utilization thresholds \(T_{i,CPU}\) and \(T_{i,Mem}\) of server \(S_{i}\) are set to 70% and 90% respectively. The parameters for particle swarm optimization are set as follows: \(w=rand(0.1,0.5)\), \(c_{1},c_{2}=rand(1.5,2.0)\), \(r_{1},r_{2}=rand(0.0,1.0)\), \(populationSize=50\) and \(maxEvaluation=1000\).
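
To show how these parameters enter the search, the sketch below performs one standard particle swarm velocity and position update with coefficients drawn from the ranges listed above. It is not the full OMOPSO algorithm (leader archive, crowding, and mutation are omitted). The particle encoding, in which coordinate j holds a real value that is rounded to the server index of task j, is an assumption of this example, as are the names `update_particle` and `decode`.

```python
import random


def update_particle(position, velocity, personal_best, leader, n_servers, rng=random):
    """One PSO step; coordinate j of a particle encodes the server of task j."""
    # Coefficients are redrawn for every particle update from the stated ranges.
    w = rng.uniform(0.1, 0.5)
    c1, c2 = rng.uniform(1.5, 2.0), rng.uniform(1.5, 2.0)
    r1, r2 = rng.uniform(0.0, 1.0), rng.uniform(0.0, 1.0)
    new_position, new_velocity = [], []
    for x, v, p, g in zip(position, velocity, personal_best, leader):
        v_new = w * v + c1 * r1 * (p - x) + c2 * r2 * (g - x)
        # Clamp so that decoding always yields a valid server index.
        x_new = min(max(x + v_new, 0.0), float(n_servers - 1))
        new_velocity.append(v_new)
        new_position.append(x_new)
    return new_position, new_velocity


def decode(position):
    """Map a continuous position to one server index per task."""
    return [int(round(x)) for x in position]


random.seed(0)
pos, vel = update_particle([0.0, 2.0], [5.0, -5.0], [2.0, 0.0], [1.0, 1.0], n_servers=3)
assignment = decode(pos)    # one server index in {0, 1, 2} per task
```

Because every task maps to exactly one index, this encoding satisfies constraint (2) by construction, so only the threshold constraints (3) and (4) need to be checked for each decoded particle.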

The load balancing effect is measured by the standard deviation of the load of the cloud servers. The results are shown in Table 1, where \(K_{CPU}^{orig}\) and \(K_{Mem}^{orig}\) denote the standard deviation of the CPU load and memory load of the machines under the original scheduling strategy. With the proposed scheduling strategy, the load balance of every experimental group improves over the original scheduling strategy.

Table 1. Load balancing effect of two scheduling strategies

4 Related Work

Intelligent algorithms such as simulated annealing [9], genetic algorithms [6], and particle swarm optimization [4] are effective for solving the task scheduling problem under multi-resource constraints. LD et al. [1] propose HBB-LB, a dynamic load balancing algorithm based on bees’ foraging behavior that aims to balance load across VMs to maximize throughput. It considers the priority of tasks in a node's waiting queue to minimize their waiting time. Li et al. [2] propose a cloud task scheduling policy based on the Load Balancing Ant Colony Optimization (LBACO) algorithm. The algorithm selects the best resource for a task based on the resource state and the size of the task in the cloud environment. It balances the overall system load and minimizes the completion time of a given set of tasks. Ramezani et al. [5] propose a Task-based System Load Balancing method using Particle Swarm Optimization (TBSLB-PSO), which balances system load by transferring only the extra tasks from an overloaded VM instead of migrating the entire VM. This significantly reduces the time taken by the load balancing process.

5 Conclusion

To address the load balancing problem, this paper proposes a task scheduling strategy that combines multi-objective particle swarm optimization with a time series prediction model. The strategy aims to improve load balancing among cloud servers and takes both the current and the future load of the servers into account when scheduling tasks. Experiments on the Alibaba cluster trace with 1,310 servers show that the strategy allocates tasks reasonably and achieves more balanced resource utilization.