# Efficient scheduling for multi-stage coflows

- 57 Downloads

## Abstract

In data center networks (DCN), large scale flows produced by parallel computing frameworks form many coflows semantically. Most inter-coflow schedulers only focus on the remaining data of coflows and attempt to mimic Shortest Job First (SJF). However, a coflow may consist of multiple stages, where a coflow has different amounts of data to transmit. In this paper, we consider the Multi-stage Inter-Coflow Scheduling problem and try to give an efficient online scheduling scheme. We first explore a short-sighted algorithm, IAO, with the greedy strategy. This gives us an insight into utilizing the network resources. Based on that, we propose a far-sighted heuristic, MLBF, which schedules sub-coflows to occupy network bandwidth in turn. Furthermore, we remove the bijection assumption and propose a new practical heuristic, MPLBF. Through simulations in various network environments, we show that, compared to a state-of-the-art scheduler—Varys, a multi-stage aware scheduler can reduce the coflow completion time by up to 4.81 \(\times\) even though it is short-sighted. Moreover, the far-sighted scheduler MLBF can improve the performance by nearly 7.95 \(\times\) reduction. Last but not least, MPLBF can improve the performance by up to 8.03 \(\times\) reduction.

## Keywords

Coflow Scheduling Multi-stage## 1 Introduction

The huge demand for big data processing has accelerated the birth of cluster computing frameworks such as MapReduce (Dean and Ghemawat 2008), Spark (Zaharia et al. 2010), Dryad (Isard et al. 2007), CIEL (Murray et al. 2011) and so on. Most of them serve web search, social networking etc, which have a demanding delay requirement, and thus job processing delay is directly related to user experience (Alizadeh et al. 2011; Munir et al. 2013). During processing, a large amount of data is generated and needs to be transmitted in the form of flow in the data center network. Flow transmission accounts for a large proportion of job completion time (Chowdhury et al. 2011), which has become the focus of data center research in recent years. Previous researches Alizadeh et al. (2013), Hong et al. (2012), Al-Fares et al. (2010), Munir et al. (2015) and Wilson et al. (2011) have investigated job scheduling in an attempt to reduce flow completion time (FCT). However, the job of an application can create many interdependent flows. The isolated scheduling will lose its communication semantics.

The recently proposed *coflow* abstraction implements semantic perceptual scheduling (Chowdhury and Stoica 2012). Coflow is a collection of flows that share the same performance goal. Flows in a coflow serve the same job and depend on each other. The completion of a coflow requires all its flows to be completed. With this abstraction, scheduler is able to leverage the detailed information of data transmission for applications to optimize coflow completion time (CCT). As a consequence, the performance of applications can be improved in a more efficient way. Unfortunately, minimizing CCT through inter-coflow scheduling is NP-hard (Chowdhury et al. 2014), leading to the emergence of heuristic algorithms.

Since Shortest Job First (SJF) is the best scheduling scheme for minimizing the average FCT on a single link (Bai et al. 2015), it is emulated by many heuristics (Hong et al. 2012; Alizadeh et al. 2013; Munir et al. 2015; Chowdhury et al. 2014). Most of them prioritize coflow by calculating the cumulative data it sends. Although this approach plays an important role in distinguishing coflows, a phenomena deserves attention that a job tends to contain multiple phases, which means that its corresponding coflow will have multiple stages (Grandl et al. 2016). For a job, the amount of data that needs to be transmitted varies among different stages. Obviously, it is hard to accurately determine the priority of a coflow based solely on its remaining data. However, to our best knowledge, no reports have appeared concerning multi-stage coflow scheduling. The existing literature is limited to the study of single-stage coflow scheduling and we hope to provide a viable solution.

We use *sub-coflow* to represent the stage of a coflow. A sub-coflow consists of two phases: transmission and computation. During the transmission phase, all flows belonging to the coflow are transmitted at the allocated bandwidth. After all of these flows have completed, the computation phase begins immediately. All flows of the next stage cannot continue to be transmitted until the computation phase is complete. When the computation phase ends, the sub-coflow is finished.

In this paper, we formulate the schedule problem as the *Multi-stage Inter-Coflow Scheduling* (MICS) Problem and present an multi-stage aware online scheduling framework to solve it. With this framework, we convert this online scheduling problem to offline while retaining multi-stage awareness. Firstly we introduce a series of assumptions to help us focus on important factors, and then propose two efficient multi-stage aware coflow scheduling algorithm: the *Iteratively Approaching Optimal* (IAO) algorithm and the *Multi-stage Least Bottleneck First* (MLBF) algorithm. The former is beneficial to guarantee fairness among coflows while the latter focuses on reducing CCT. Secondly, our further analysis indicates that by appropriate modification to MLBF, a strong assumption can be removed and we propose an enhanced algorithm: *Maximizing Parallelism Least Bottleneck First* (MPLBF), which is more general and still efficient.

When designing IAO, we partition the MICS problem into multiple *Single-stage Inter-Coflow Scheduling* (SICS) problems. The SICS problem refers to minimizing the average CCT of sub-coflows of the same stage. Then we use a greedy strategy in an attempt to obtain a fair solution to MICS by optimizing the SICS problems of each stage. We observe and prove that the SICS problem is a convex optimization problem, which is easy to get the optimal solution. In virtue of the interior point method (Bazaraa et al. 2013), we design IAO to minimize the average sub-coflow completion time (SCCT). Experiments show that, because of the multi-stage awareness, IAO outperforms significantly the non multi-stage aware scheduler (e.g. Varys Chowdhury et al. 2014). As the size of coflows increases exponentially, IAO improves both average CCT (from 2 \(\times\) to 7.95 \(\times\)) and the completion time of all coflows (from 2.6 \(\times\) to 7.42 \(\times\)) compared to Varys.

It is evident that the scheduling of IAO is not optimal and the granularity of such simple partitioning method is coarse. The optimization goal of IAO is not the CCT, but the sub-coflow completion time. It brings fairness among coflows, but induces lack of foresight in terms of average CCT. IAO only focuses on the current sub-coflows, resulting in an optimal solution for SICS, but the simple summation of these optimal solutions is not good enough for MICS. To optimize SICS, IAO is prone to transfer all the current sub-coflows simultaneously, increasing the overlap of the computation stages of these coflows. As a consequence, these coflows will finish their transmission and enter computation phases nearly at the same time, which is a great waste of link bandwidth in the course of computation.

Under this point, we argue that a good scheduler is supposed to improve the network utilization as much as possible and occupying the network exclusively has an advantage over sharing. Based on SJF, we propose MLBF to schedule sub-coflows in a prioritized manner. In our experiments, it outperforms IAO from \(1\times\) to \(1.65\times\) in average CCT with the increase in the number of coflows.

Considering that IAO and MLBF depend on a strong assumption that any coflow is a bijection from ingress ports to egress ports, we conduct a further analysis and present an equally efficient but more general heuristic MPLBF. The MPLBF is modified from MLBF, but can work under circumstances that coflows are not bijective.

To take advantage of exclusive transmitting, MPLBF tries to maximize the parallelism of flow transmission. It first meets the needs of coflow with smallest bottleneck time by allocating full bandwidth to this coflow. Subsequently coflows obtain the full bandwidth of their ports that are not yet occupied in the order of priority one by one. In our experiments, MPLBF outperforms Varys from \(2.32\times\) to \(8.03\times\) in average CCT with the exponential increase in the number of coflows.

The rest of this paper is organized as follows. We specify the network model, coflow abstraction, and sub-coflow definition in Sect. 2. Secondly we formulate the MICS problem and prove its hardness in Sect. 3. Then we propose three heuristics to solve this problem in Sect. 4 and evaluate them in Sect. 5. Last but not least, we discuss our work in Sect. 6, present related works in Sect. 7 and conclude this paper in Sect. 8.

## 2 Background

In this section, we detail the background and concepts related to our problem MICS.

### 2.1 Coflow

A coflow is defined as a collection of parallel flows with a common performance goal. Flows within the same coflow serve the same job and have dependency on each other, so inter-coflow scheduling is more beneficial to a job than separately flow scheduling. Tuple \(f=(p, q)\) is used to present a flow whose source port is *p* and destination port is *q*. A coflow contains multiple flows, so a coflow *C* is a set of flows \(C=\{f_1, f_2, \dots \}\). For a multi-stage coflow, except that the volumes of flows vary with stages, flows defined by source-destination pair will not change in different stages. The set of coflows in data center network is \({\mathbb {C}}=\{C_1, C_2, C_3, \dots , C_n\}\).

### 2.2 Sub-coflow

According to some previous observations, we find that in most cases, a coflow has multiple stages. In each stage, flows belong to this coflow will transmit first, and after all flows finish, coflow cannot continue to transmit in that it has to wait the completion of the computation phase. In other words, a stage of coflow comprises two phases: transmission and computation. Once transmission finishes, computation will begin immediately. Only all the flows finish their transmission can computation begin. For convenience, we refer to a stage of coflow as a *sub-coflow* and let sub-coflow \(C_{i,j}\) indicate the *j*th stage of coflow \(C_i\).

In addition to the synchronization of flows in a sub-coflow, there are also dependencies between sub-coflows. A sub-coflow can only begin after its prior sub-coflows finish.

Notation \(t_{i,j}\) denotes the start time of \(C_{i,j}\) and we set \(t_{i,1}=0(\forall 1\le i\le n)\), namely, all coflows in network arrive at the same time 0. When \(C_{i,j}\) begins, it firstly enters the transmission phase, transmitting data of its flows. After all the flows are transmitted, \(C_{i,j}\) waits for computation. When the computation completes, \(C_{i,j}\) completes accordingly. The total time of transmission and computation of a single sub-coflow is sub-coflow completion time (SCCT).

### 2.3 Network model

## 3 Problem statement

Main notations for quick reference

Symbol | Meaning |
---|---|

\({\mathbb {C}}\) | The set of coflows in network |

| The number of coflows |

\(C_i\) | The |

| The total bandwidth of a port in network |

\(C_{i,j}\) | The |

\(t_{i,j}\) | The begin time of sub-coflow \(C_{i,j}\) |

\(T_{i,j}\) | The computing time of sub-coflow \(C_{i,j}\) |

\(V_{i,j,p,q}\) | The volume of flow \(f=(p,q)\in C_{i,j}\) |

\(r_{i,j,p,q}(t)\) | The rate of flow \(f=(p,q)\in C_{i,j}\) at time |

\(\tau _{i,j,p,q}\) | The transmission time of flow \(f=(p,q)\in C_{i,j}\) |

*t*. With this function, the entire transmission of flow

*f*can be formulated as follows.

*f*, and \(V_{i,j,p,q}\) is the volume of

*f*.

*Bandwidth constraint.*The total bandwidth of a port is limited to

*R*, so at any time

*t*, the total bandwidth allocated to flows of any port should not exceed its capacity:

*Dependency constraint*For every coflow, its sub-coflows must be processed according to their dependencies. Sub-coflow \(C_{i,j}\) cannot begin until all of sub-coflows it depends on finish. The time cost of a sub-coflow is equal to the transmission time of all its flows plus computation time. On account of the parallel transmitting of its flows, the transmission time is the longest transmitting time among flows in \(C_i\). The formulation is as follows:

*x*depends on these sub-coflows whose number is less than

*x*, so the only sub-coflow process sequence is 1, 2, 3, 4. This assumption ensures the process sequence is ascending by sub-coflow number: \(C_{i,1}, C_{i,2}, C_{i, 3}, \dots , C_{i, \phi }\). The original Eq. (6) can be converted into:

We model this scheduling problem with *Multi-stage Inter-Coflow Scheduling Problem* (MICS) and find it to be NP-hard.

### **Theorem 1**

*The MICS problem is NP-hard.*

### *Proof*

Chowdhury et al. (2014) have proved that in the same network model, scheduling of single-stage coflows to minimize average CCT is NP-hard. MICS problem is the scheduling of a combination of many single-stage coflows. As a result, minimizing average CCT of multi-stage coflows is NP-hard. \(\square\)

## 4 Online scheduling

During online scheduling, we use such a strategy: At time 0, all the rates of coflows are initialized by our scheduling algorithm. Since then, it will be invoked only when a new sub-coflow arrives or an old flow completes. During the interval between two invocations, rates of flows will not change. With this framework, resource allocation can be dynamically updated according to the network condition.

We first explore a simplified scheduling problem with the following assumption that a coflow \(C=\{f_1=(p_1,q_1), f_2=(p_2,q_2), \ldots \}\) is a bijection from the ingress port set \({\mathbb {P}}\) to the egress port set \({\mathbb {Q}}\). It means that each ingress port only sends data to one egress port and each egress port only receives data from one ingress port. Plus a flow is defined as a tuple of source and destination, so every coflow has no more than one flow in any port. As a consequence, the rate allocation of flows that have the same source but different destinations is not within our consideration. Strong though, it gives us an inspiration to solve the original problem and it will be removed later in this section.

### 4.1 Greedy strategy

*r*of \(r_{i,f}\) is enough, where \(r_{i,f}\) is the rate of flow \(f\in C_i\) before the next scheduling. Based on the simple greedy strategy, we try to minimize the average SCCT of current sub-coflows, in order to ensure fairness among coflows while approximating the optimal solution. Each time the scheduling algorithm is invoked, it will minimize the average completion time of the current sub-coflows. Coflows can be synchronized with scheduling, thus avoiding starvation. The current stage of \(C_i\) is \(s_i\), the completion time of current sub-coflow \(C_{i, s_i}\) is \(t_i\), and the remaining volume of flow \(f\in C_{i, s_i}\) is \(V_{i,f}\). Then Eq. (1) turns into:

In the above equation, flow is directly denoted as *f* rather than (*p*, *q*). This is because flow is a bijection so that there is only one flow for a given *p* or *q*.

*T*(

*r*) of the rate matrix

*r*, equal to the sum of SCCTs of all the current sub-coflows:

After aforementioned transformation, MICS is partitioned into multiple problems of rate allocation at a single schedule. We model the sub-problem as *Single-stage Inter-Coflow Scheduling Problem* (SICS). Fortunately, it is a convex optimization problem.

### **Theorem 2**

*The SICS problem is a convex optimization*.

### *Proof*

To prove convex optimization, we need prove two propositions: (1) the objective function *T*(*r*) is convex; (2) the solution space of *T*(*r*) is a convex set.

(1) The function *T*(*r*) is convex. Firstly, we will prove the function \(\max (\mathbf {x})\) is convex, in which \(\mathbf {x}\) is an n-dimensional vector.

So \(\max (\mathbf {x})\) is convex. Plus the function \(g(r_{i,f})=\frac{V_{i,f}}{r_{i,f}}\) is convex, \(h(\mathbf {r}_i)=\max \nolimits _f\ g(r_{i,f})\) is convex. Then \(T(r)=\sum \nolimits _{i=1}^n h(r_i)\) is convex.

(2) the solution space of *T*(*r*) is a convex set.

*r*is a \(n\times \omega\) matrix, the solution space is the set of all \(n\times \omega\) matrices that satisfy the following constraints for matrix

*X*

*A*and

*B*, let matrix \(C=\lambda A+(1-\lambda )B\), for all \(\lambda \in [0,1]\), any element \(C_{i,f}\) satisfies

*C*still belongs to the solution space, so the solution space is a convex set. In summary, the SICS problem is a problem of minimizing a convex function over a convex set, and it is a convex optimization. \(\square\)

### 4.2 A fair solution to SICS: IAO

Given that the SICS problem is a convex optimization, the local optimum must be global optimal. We present the Iteratively Approaching Optimal (IAO) algorithm as shown in Algorithm 1 in virtue of the interior point method (Bazaraa et al. 2013). In the iterative process of the algorithm, the result will gradually approximate the optimal solution.

*r*is always within the feasible solution space. It can do this because of its property: the value of \({\overline{F}}(r,\zeta _k)\) is close to

*T*(

*r*) and far from the boundary if

*r*is within the feasible solution space. But if not, \({\overline{F}}(r,\zeta _k)\) will be extremely large. Therefore, we can get the optimal solution of the objective function

*T*(

*r*) by iteratively finding the minimum value of the function \({\overline{F}}(r,\zeta _k)\). In an iteration, we use

*c*as the decreasing coefficient, and its range is [0.1, 0.5]. As the decreasing of the barrier factor \(\zeta _k\), \({\overline{F}}(r,\zeta _k)\) will get closer and closer to

*T*(

*r*), until the precision \(\epsilon\) is achieved.

Every time the network conditions change, for instance, a new sub-coflow is about to start or an old flow has just been completed, IAO will be invoked to allocate the rates and the allocation will not change before the next change. Initialized by WFS and updated by limited times, IAO performs like WFS, which will schedule the current sub-coflows to complete at the same time. As a result, scheduled by IAO, their complete time will be close as well. Since IAO focuses on the SICS problem of all current sub-coflows, IAO will synchronize current sub-coflows in such a manner whenever IAO is invoked. In the long term, transmission time of all sub-coflows will be close, which is fair to coflows. In our experiments, for a high workload, the improvement of fairness is significant. Also, the overall average CCT is improved owing to the multi-stage awareness.

### 4.3 A far-sighted heuristic: MLBF

In the matter of overall average CCT, the greedy scheduling scheme is short-sighted in that it only focuses on short-term SCCT minimization but lacks the awareness of global optimization.

It should be noted that if all of the current sub-coflows begin computing simultaneously, the network resource will be completely idle during this period, which is a great waste for link bandwidth. Unfortunately, initialized by WFS and updated by limited times, IAO generally performs like WFS. As a consequence, the greedy IAO is prone to trying to minimize the completion time of current sub-coflows and schedule them to enter transmission and computation phases at the same time.

According to the analysis above, a far-sighted schedule is supposed to take not only current sub-coflows but also subsequent sub-coflows into consideration. In the long term, if network links can be occupied by sub-coflows (not only current sub-coflows) one by one, simultaneous transmission and computation will be avoided, so that competition for bandwidth between sub-coflows during transmission can be reduced. Although this scheduling only performs on current sub-coflows, subsequent sub-coflows will benefit from it.

Under this circumstance, sub-coflows will be transmitted in a certain sequence and when the sub-coflows that have finished their transmission are in the computation phase, the network resource will be occupied by those who start transmission late. The demand of coflows for bandwidth is apportioned throughout the entire process, which reduces congestion during the peak period. Based on SJF, we set the aforementioned sequence by the bottleneck time of sub-coflow. The sub-coflows with less bottleneck time are able to finish with less time cost under the same network condition, so they deserve a high priority.

*f*is \(\varLambda _f\) and let \(D_f\) be the amount of data of flow \(f(f\in C)\), bottleneck \(\varGamma _C\) can be calculated by

We propose the Multi-stage Least Bottleneck First (MLBF) heuristic. Like IAO, MLBF also runs under the previously mentioned online scheduling framework. When a sub-coflow arrives or a flow completes, MLBF will be invoked to reallocate the bandwidth of ports. In Algorithm 2, we calculate the bottleneck of coflows firstly, and sort them by their bottlenecks in ascending order. Coflow with less bottleneck is supposed to have a higher priority, so we allocate bandwidth in ascending order. Then we set the rate of flows in a coflow so that it can complete just at the end of the bottleneck time. Last but not least, to utilize the network, we allocate the remaining bandwidth to the unfinished coflows averagely.

### 4.4 More general heuristic: MPLBF

So far, our multi-stage aware scheduling algorithms, both of IAO and MLBF are based on a basic assumption that every flow is a bijection from an ingress port to an egress port. This assumption may not be practical in all scenarios, thus, we propose a more general and equally far-sighted heuristic: Maximizing Parallelism Least Bottleneck First (MPLBF).

In the new condition, each ingress port can send data to any egress port and each egress port can receive data from any ingress port. Besides, there might exist multiple flows in a coflow at a port. For a single port, we need to consider both the total bandwidth allocation of a coflow and the bandwidth of flows in it.

Based on the analysis in the Sect. 4.3, we should process sub-coflows in a prioritized order to achieve high link utilization. The main problem to be solved is how to allocate bandwidth in the same coflow, for the reason that in a coflow, it is quite likely that multiple flows occupy the same port. If a flow gets a higher bandwidth at its source port but a lower bandwidth at its destination port, then its actual bandwidth is the lower one. In fact, the extra part at the source port is wasted. Nevertheless, this part may be useful to another flow at the source port.

Here we attach importance to the allocation of flows in the same coflow. For a sub-coflow \(C_{i,j}\), if it has multiple flows in the same port *p*, the sum volume of flows is *D*, and its available bandwidth is \(R_C\), then regardless of the order in which these flows are transmitted, the completion time of \(C_{i,j}\) at port *p* is \(D/R_C\). Therefore the transmitting order of flows at port *p* is immaterial for \(C_{i,j}\).

A transmission scheme helps scheduler get out of the dilemma above that a port only transmits a flow at any time. With this scheme, flows not interfering with each other can be transmitted in parallel between ingress ports and egress ports, and all of them can be sent at full speed. To reduce the latency of low priority flows, we adopt the principle of Least Remaining Time First (LRTF). Given that all the bandwidth of ports is *R*, LRTF is equivalent to SJF. That is, in the same sub-coflow, we will give the priority to the smaller flows and transmit them at full speed. We still take Fig. 5 as an example. If flows are prioritized by LRTF like Fig. 5c, sub-coflows will be processed in a criss-cross manner and before coflows finish, there will not exist link idleness. As a result, average CCT in Fig. 5c is 2 smaller than that in Fig. 5b.

## 5 Evaluation

In this section, we evaluate our algorithms including IAO, MLBF and MPLBF compared to a state-of-the-art coflow scheduler Varys. In Sect. 5.1, we evaluate the performances of IAO and MLBF and coflows in the simulation are bijective. In Sect. 5.2, we evaluate the performance of MPLBF and flows can be any mapping from ingress ports to egress ports.

### 5.1 Results of IAO and MLBF

We develop our testbed with MATLAB and the default parameters are configured as follows: There are 70 coflows in DCN and each coflow has 10 stages and 50 flows. The bandwidth of ports is 20 MB/s. The volumes of flows follow a uniform distribution in [0, 200 MB] and during computation phases of sub-coflows, the computing speed is 10 MB/s. We set the data large enough so that experiments can last long enough to avoid errors caused by special circumstances. Actually, every experiment will last at least 1000 simulated seconds.

### 5.2 Results of MPLBF

In consideration of the complexity of the simulation of non-bijective flows, we re-implement our testbed with Python and adjust default parameters as follows: There are 100 ingress ports and 100 egress ports, 20 coflows between them in DCN and each coflow has 10 stages and 25 flows. The bandwidth of ports is 20 MB/s. The volumes of flows follow a uniform distribution in [0, 200 MB] and during computation phases of sub-coflows, the computing speed is 10 MB/s. Like simulations in Sect. 5.1, we avoid possible errors by setting appropriate data.

*C*to be uniformly distributed from 0 to \(x_C\) and \(x_C (\forall C\in {\mathbb {C}})\) follows the distribution we want to test. To avoid potential bias, we set the mathematical expectations of all the distributions used in the experiments to 100 MB. Figure 12 shows the average CCT of Varys and MPLBF under uniform distribution (Uni.) between 10 and 200, exponential distribution (Exp.) with the rate parameter \(\lambda\) of 0.01, and normal distribution (Norm.) with the standard deviation of 20, 30, 40, 50. Under different distributions, MPLBF outperforms Varys from 3.91 \(\times\) to 5.24 \(\times\).

## 6 Discussion

**Sub-coflow Sequence** In the previous Sect. 3 we mentioned that the sub-coflow is numbered according to its dependency order and a sub-coflow depends on all its antecedent ones. In real-world scenarios, dependencies might be complicated much more. However, as long as the dependency graph is a directed acyclic graph, it must have a topological order and we can number sub-coflows by this order.

**Scheduling without prior knowledge** The information of coflows plays an important role in scheduling, but it is difficult to obtain in practical environment. Here we propose two methods that turn out to be effective in previous works: First, we can provide applications with registration APIs for coflows so that each time the coflow registers, the scheduler can obtain its full information (e.g. Varys; Chowdhury et al. 2014). Second, the information can be predicted by machine learning (e.g. CODA; Zhang et al. 2016). In this way, the scheduler can work without any modification to applications.

**Different numbers of stages and flows** In our simulation, each coflow has the same number of stages and flows. In real-world scenarios, up to 50% jobs have only no more than 20 stages (Grandl et al. 2016). Their difference is little and the performance will be similar to our experiment results. In addition, there will be some flows with volume 0 in our simulations, which equals to circumstances that coflows have different number of flows.

**Parallel stages** In this paper, we mainly study stages that have dependencies on each other. In a real environment, there might exist some parallel stages and their topological order is not deterministic. Our algorithm will randomly select one of all possible topological orders for processing. Of course, as far as this problem is concerned, there is still room for optimization, for example, we can process such stages in parallel, not in serial.

## 7 Related work

**Flow scheduling** Flow scheduling has been studied for a long time and there are quite a few works worth mentioning. Hedera (Al-Fares et al. 2010) is a dynamic scheduling system and it is designed to utilize aggregate network resources. Chowdhury et al. (2011) noticed the semantics between flows, and optimized the performance at a higher level than individual flows by Orchestra. Based on the insight that datacenter transport should decouple flow scheduling from rate control, Alizadeh et al. present pFabric (Alizadeh et al. 2013). Furthermore, Baraat (Dogar et al. 2014) promoted scheduling level to task, enabling schedulers to be task aware. Unlike pFabric, PIAS (Bai et al. 2015) does not need accurate per-flow information a priori, and aims to minimize FCT by mimicking SJF. Using expansion ratio, Zhang et al. (2017) designed MERP to archive efficient flow scheduling without starvation. Aemon Wang et al. (2017) is designed to address the challenges of mix-flow scheduling without flow sizes known as a prior.

In our work, performance is evaluated by coflow completion time instead of flow completion time (FCT). Because of the rich semantics carried by coflow, the optimization to CCT is more beneficial to the improve the performance of applications than FCT.

**Coflow scheduling** Coflow is an abstraction of a set of flows and with this abstraction, schedulers are able to optimize the performance of applications more efficiently. It is proposed by Chowdhury and Stoica (2012). Based on this, a lot works have emerged. Varys (Chowdhury et al. 2014) is the first coflow scheduler which leverages the characteristics of coflow. Zhao et al. present an observation that both routing and scheduling must be considered simultaneously (Zhao et al. 2015). Qiu et al. (2015) proposed a method to minimize the weighted CCT while scheduling coflows with release dates. Aalo (Chowdhury and Stoica 2015) provided a scheme without prior knowledge and took the dependencies of coflows into consideration.

The hallmark of our work is the multi-stage characteristic of coflow. It is not noticed before and we have used it to significantly improve overall performance.

**Data center network** Guo et al. (2016) applies the game-theoretic framework to the bandwidth allocation problem, so as to guarantee bandwidth for VMs according to their base bandwidth requirements and share residual bandwidth in proportion to the weights of VMs. Liu et al. pay attention to coping with highly dynamic traffic in DCN and propose *eBA* to provide end-to-end bandwidth guarantee for VMs under a number of short flows and massive bursty traffic. Considering that the lack of pricing for bandwidth guarantee causes a severe price-performance anomaly, SoftBW Guo et al. (2017) enables pricing bandwidth with over commitment on bandwidth guarantee. Wang et al. (2017) proposed an efficient online algorithm for dynamic SDN controller assignment to avoid long response time and high maintenance cost.

Our work is complementary. We focus on optimizing performance of coflows, instead of achieving fairness for VMs, guaranteeing bandwidth for VMs, pricing bandwidth or assigning SDN controllers.

## 8 Conclusion

In this paper, we concentrate on the multi-stage characteristic of coflow and leverage it to improve the performance of coflows. We first formulate the *Multi-stage Inter-Coflow* problem and prove it to be NP-hard. Then we propose an online scheduling framework to convert the online problem to offline and based on the greedy strategy, we divide the MICS into multiple SICS problems. We construct a solution IAO to optimize SICS and experiments show IAO improves average CCT by up to 4.81 \(\times\). However, we find that there is still a lot of room for optimization in the greedy strategy and propose a far-sighted heuristic MLBF, which outperforms IAO by up to 1.65 \(\times\). Furthermore, we remove the bijective flow assumption and propose a more general heuristic MPLBF, which improves average CCT by up to 8.03 \(\times\) compared to Varys.

## Notes

### Acknowledgements

This work was supported in part by National Key R&D Program of China (2017YFB1001801), NSFC (61872175), NSF of Jiangsu Province (BK20181252), CCF-Tencent Open Fund, and Collaborative Innovation Center of Novel Software Technology and Industrialization. On behalf of all authors, the corresponding author states that there is no conflict of interest.

## References

- Al-Fares, M., Radhakrishnan, S., Raghavan, B., Huang, N., Vahdat, A.: Hedera: dynamic flow scheduling fordata center networks. In: Nsdi, Vol. 10, pp. 89–92 (2010)Google Scholar
- Alizadeh, M., Edsall, T., Dharmapurikar, S., Vaidyanathan, R., Chu, K., Fingerhut, A., Matus, F., Pan, R., Yadav, N., Varghese, G., et al.: Conga: distributed congestion-aware load balancing for datacenters. In: Proceedings of the 2014 ACM Conference on the Special Interest Group on Data Communication (SIGCOMM 2014)Google Scholar
- Alizadeh, M., Greenberg, A., Maltz, D.A., Padhye, J., Patel, P., Prabhakar, B., Sengupta, S., Sridharan, M.: Data center tcp (dctcp). ACM SIGCOMM Comput. Commun. Rev.
**41**(4), 63–74 (2011)CrossRefGoogle Scholar - Alizadeh, M., Yang, S., Sharif, M., Katti, S., McKeown, N., Prabhakar, B., Shenker, S.: pfabric: Minimal near-optimal datacenter transport. ACM SIGCOMM Comput. Commun. Rev.
**43**, 435–446 (2013)CrossRefGoogle Scholar - Bai, W., Chen, K., Wang, H., Chen, L., Han, D., Tian, C.: Information-agnostic flow scheduling for commodity data centers. In: Proceedings of the 2015 USENIX Symposium on Networked Systems Design and Implementation (NSDI 2015)Google Scholar
- Ballani, H., Costa, P., Karagiannis, T., Rowstron, A.: Towards predictable datacenter networks. In: ACM SIGCOMM computer communication review, Vol. 41, pp. 242–253. ACM (2011)Google Scholar
- Bazaraa, M.S., Sherali, H.D., Shetty, C.M.: Nonlinear Programming: Theory and Algorithms. Wiley, Hoboken (2013)zbMATHGoogle Scholar
- Chowdhury, M., Stoica, I.: Coflow: a networking abstraction for cluster applications. In: Proceedings of the 11th ACM Workshop on Hot Topics in Networks, pp. 31–36. ACM (2012)Google Scholar
- Chowdhury, M., Stoica, I.: Efficient coflow scheduling without prior knowledge. In: Proceedings of the 2015 ACM Conference on the Special Interest Group on Data Communication (SIGCOMM 2015)Google Scholar
- Chowdhury, M., Zaharia, M., Ma, J., Jordan, M.I., Stoica, I.: Managing data transfers in computer clusters with orchestra. In: Proceedings of the 2011 ACM Conference on the Special Interest Group on Data Communication (SIGCOMM 2011)Google Scholar
- Chowdhury, M., Zhong, Y., Stoica, I.: Efficient coflow scheduling with varys. In: ACM SIGCOMM Computer Communication Review, Vol. 44, pp. 443–454. ACM (2014)Google Scholar
- Dean, J., Ghemawat, S.: Mapreduce: simplified data processing on large clusters. Commun. ACM
**51**(1), 107–113 (2008)CrossRefGoogle Scholar - Dogar, F.R., Karagiannis, T., Ballani, H., Rowstron, A.: Decentralized task-aware scheduling for data center networks. In: Proceedings of the 2014 ACM Conference on the Special Interest Group on Data Communication (SIGCOMM 2014)Google Scholar
- Grandl, R., Kandula, S., Rao, S., Akella, A., Kulkarni, J.: G: packing and dependency-aware scheduling for data-parallel clusters. In: Proceedings of OSDI’16: 12th USENIX Symposium on Operating Systems Design and Implementation (2016)Google Scholar
- Greenberg, Albert, Hamilton, James R., Jain, Navendu, Kandula, Srikanth, Kim, Changhoon, Lahiri, Parantap, Maltz, David A., Patel, Parveen, Sengupta, Sudipta: Vl2: a scalable and flexible data center network. In: Proceedings of the 2009 ACM Conference on the Special Interest Group on Data Communication (SIGCOMM 2009)Google Scholar
- Guo, J., Liu, F., Wang, T., Lui, J.C.S.: Pricing intra-datacenter networks with over-committed bandwidth guarantee. In: 2017 USENIX annual technical conference (USENIXATC 17), pp. 69–81 (2017)Google Scholar
- Guo, J., Liu, F., Lui, J.C.S., Jin, H.: Fair network bandwidth allocation in iaas datacenters via a cooperative game approach. IEEE/ACM Trans. Netw.
**24**(2), 873–886 (2016)CrossRefGoogle Scholar - Hong, C.-Y., Caesar, M., Godfrey, P.: Finishing flows quickly with preemptive scheduling. In: Proceedings of the ACM SIGCOMM 2012 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication, pp. 127–138. ACM (2012)Google Scholar
- Isard, M., Budiu, M., Yu, Y., Birrell, A., Fetterly, D.: Dryad: distributed data-parallel programs from sequential building blocks. In: ACM SIGOPS operating systems review, Vol. 41, pp. 59–72. ACM (2007)Google Scholar
- Kang, N., Liu, Z., Rexford, J., Walker, D.: Optimizing the one big switch abstraction in software-defined networks. In: Proceedings of the ninth ACM conference on Emerging networking experiments and technologies, pp. 13–24. ACM (2013)Google Scholar
- Munir, A., Qazi, I.A., Uzmi, Z.A., Mushtaq, A., Ismail, S.N., Iqbal, M.S., Khan, B.: Minimizing flow completion times in data centers. In: INFOCOM, 2013 Proceedings IEEE, pp. 2157–2165. IEEE (2013)Google Scholar
- Munir, A., Baig, G., Irteza, S.M., Qazi, I.A., Liu, A.X., Dogar, F.R.: Friends, not foes: synthesizing existing transport strategies for data center networks. CM SIGCOMM Comput. Commun. Rev.
**44**, 491–502 (2015). ACMCrossRefGoogle Scholar - Murray, D.G., Schwarzkopf, M., Smowton, C., Smith, S., Madhavapeddy, A., Hand, S.: Ciel: a universal execution engine for distributed data-flow computing. In: Proceedings of the 8th ACM/USENIX Symposium on Networked Systems Design and Implementation (NSDI 2011)Google Scholar
- Niranjan M., Radhika, P., Andreas, F., Nathan, H., Nelson, M., Pardis, R., Sivasankar, S., Vikram, V.A.: Portland: a scalable fault-tolerant layer 2 data center network fabric. In: ACM SIGCOMM Computer Communication Review, Vol. 39, pp. 39–50. ACM (2009)Google Scholar
- Popa, L., Kumar, G., Chowdhury, M., Krishnamurthy, A., Ratnasamy, S., Stoica, I.: Faircloud: sharing the network in cloud computing. ACM SIGCOMM Comput. Commun. Rev.
**42**(4), 187–198 (2012)CrossRefGoogle Scholar - Qiu, Z., Stein, C., Zhong, Y.: Minimizing the total weighted completion time of coflows in datacenter networks. In: Proceedings of the 27th ACM symposium on parallelism in algorithms and architectures (SPAA 2015)Google Scholar
- Wang, T., Xu, H., Liu, F.: Aemon: information-agnostic mix-flow scheduling in data center networks. In: Proceedings of the First Asia-Pacific Workshop on Networking, pp. 106–112. ACM (2017)Google Scholar
- Wang, T., Liu, F., Hong, X.: An efficient online algorithm for dynamic sdn controller assignment in data center networks. IEEE/ACM Trans. Netw.
**25**(5), 2788–2801 (2017)CrossRefGoogle Scholar - Wilson, C., Ballani, H., Karagiannis, T., Rowtron, A.: Better never than late: meeting deadlines in datacenter networks, Vol. 41, pp. 50–61 (2011) ACMGoogle Scholar
- Zaharia, M., Chowdhury, M., Franklin, M.J., Shenker, S., Stoica, I.: Spark: Cluster computing with working sets. 10: 95 (2010)Google Scholar
- Zhang, H., Chen, L., Yi, B., Chen, K., Chowdhury, M., Geng, Y.: Coda: toward automatically identifying and scheduling coflows in the dark. In: Proceedings of the 2016 ACM Conference on the Special Interest Group on Data Communication (SIGCOMM 2016)Google Scholar
- Zhang, S., Qian, Z., Hao, W., Sanglu, L.: Efficient data center flow scheduling without starvation using expansion ratio. IEEE Trans. Parallel Distrib. Syst.
**28**(11), 3157–3170 (2017)CrossRefGoogle Scholar - Zhao, Y., Chen, K., Bai, W., Yu, M., Tian, C., Geng, Y., Zhang, Y., Li, D., Wang, S.: Rapier: integrating routing and scheduling for coflow-aware data center networks. In: Proceedings of the 2015 IEEE Conference on Computer Communications (INFOCOM 2015)Google Scholar