Efficient scheduling for multi-stage coflows

  • Shuai Zhang
  • Sheng Zhang
  • Zhuzhong Qian
  • Xiaoda Zhang
  • Mingjun Xiao
  • Jie Wu
  • Jidong Ge
  • Xiaoliang Wang
Regular Paper


In data center networks (DCN), large-scale flows produced by parallel computing frameworks semantically form coflows. Most inter-coflow schedulers focus only on the remaining data of coflows and attempt to mimic Shortest Job First (SJF). However, a coflow may consist of multiple stages, with different amounts of data to transmit in each stage. In this paper, we consider the Multi-stage Inter-Coflow Scheduling problem and aim to give an efficient online scheduling scheme. We first explore a short-sighted greedy algorithm, IAO, which gives us insight into utilizing network resources. Based on that, we propose a far-sighted heuristic, MLBF, which schedules sub-coflows to occupy network bandwidth in turn. Furthermore, we remove the bijection assumption and propose a new practical heuristic, MPLBF. Through simulations in various network environments, we show that, compared to a state-of-the-art scheduler, Varys, a multi-stage aware scheduler can reduce the coflow completion time by up to 4.81\(\times\) even though it is short-sighted. Moreover, the far-sighted scheduler MLBF improves performance by a nearly 7.95\(\times\) reduction. Last but not least, MPLBF improves performance by up to an 8.03\(\times\) reduction.


Keywords: Coflow · Scheduling · Multi-stage

1 Introduction

The huge demand for big data processing has accelerated the birth of cluster computing frameworks such as MapReduce (Dean and Ghemawat 2008), Spark (Zaharia et al. 2010), Dryad (Isard et al. 2007), and CIEL (Murray et al. 2011). Most of them serve web search, social networking, and similar applications with demanding delay requirements, so job processing delay is directly related to user experience (Alizadeh et al. 2011; Munir et al. 2013). During processing, a large amount of data is generated and needs to be transmitted as flows in the data center network. Flow transmission accounts for a large proportion of job completion time (Chowdhury et al. 2011) and has become a focus of data center research in recent years. Previous studies (Alizadeh et al. 2013; Hong et al. 2012; Al-Fares et al. 2010; Munir et al. 2015; Wilson et al. 2011) have investigated job scheduling in an attempt to reduce flow completion time (FCT). However, the job of an application can create many interdependent flows, and scheduling them in isolation loses the communication semantics.

The recently proposed coflow abstraction enables semantics-aware scheduling (Chowdhury and Stoica 2012). A coflow is a collection of flows that share the same performance goal. Flows in a coflow serve the same job and depend on each other, and the completion of a coflow requires all its flows to complete. With this abstraction, a scheduler is able to leverage detailed information about an application's data transmission to optimize coflow completion time (CCT). As a consequence, the performance of applications can be improved more efficiently. Unfortunately, minimizing CCT through inter-coflow scheduling is NP-hard (Chowdhury et al. 2014), leading to the emergence of heuristic algorithms.

Since Shortest Job First (SJF) is the best scheduling scheme for minimizing the average FCT on a single link (Bai et al. 2015), it is emulated by many heuristics (Hong et al. 2012; Alizadeh et al. 2013; Munir et al. 2015; Chowdhury et al. 2014). Most of them prioritize a coflow by the cumulative amount of data it sends. Although this approach plays an important role in distinguishing coflows, a phenomenon deserves attention: a job tends to contain multiple phases, which means that its corresponding coflow will have multiple stages (Grandl et al. 2016). For a job, the amount of data that needs to be transmitted varies among stages. Obviously, it is hard to accurately determine the priority of a coflow based solely on its remaining data. However, to the best of our knowledge, no reports have appeared concerning multi-stage coflow scheduling. The existing literature is limited to single-stage coflow scheduling, and we hope to provide a viable solution.

We use sub-coflow to represent the stage of a coflow. A sub-coflow consists of two phases: transmission and computation. During the transmission phase, all flows belonging to the coflow are transmitted at the allocated bandwidth. After all of these flows have completed, the computation phase begins immediately. All flows of the next stage cannot continue to be transmitted until the computation phase is complete. When the computation phase ends, the sub-coflow is finished.

Figure 1 illustrates the lifespan of a job. The coflow \(C_i\) associated with it has 3 stages and is partitioned into 3 sub-coflows \(C_{i,1}, C_{i,2}, C_{i,3}\). The three sub-coflows are processed in turn and when the transmission of the last one \(C_{i,3}\) completes, the coflow \(C_i\) is considered to be finished. \(C_i\) has three flows \(f_1, f_2, f_3\) and they will be transmitted in each transmission phase. The computation requires all the data in the transmission phase, so it has to wait for the completion of all flows. In \(C_{i,1}\), \(f_2\) takes the longest transmission time so computation begins after the completion of \(f_2\).
Fig. 1

The job generates coflow \(C_i\) and \(C_i\) consists of 3 sub-coflows \(C_{i,1}, C_{i,2}, C_{i,3}\)

Considering that subsequent stages depend on antecedent stages in the jobs of parallel computing frameworks, the beginning of each sub-coflow also requires the completion of all its prior sub-coflows within a coflow. In Fig. 2, for example, sub-coflow 6 depends on 3 and 5, while 3 and 5 depend on 2 and 4. Since 2 depends on 1, sub-coflow 6 cannot begin until all of 1, 2, 3, 4, and 5 finish. Sub-coflows are processed in topological order of the directed acyclic graph of dependencies.
Fig. 2

This coflow consists of 7 sub-coflows. A\(\rightarrow\)B indicates that sub-coflow B depends on A
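Such a dependency-respecting order can be computed with a standard topological sort. The sketch below is ours, not the paper's; the dependency edges are read off the example of Fig. 2, and the exact edge set is an assumption:

```python
from collections import deque

def topological_order(num_nodes, edges):
    """Kahn's algorithm: return sub-coflow ids in an order that respects
    every dependency edge (a, b), meaning b depends on a."""
    indeg = {v: 0 for v in range(1, num_nodes + 1)}
    succ = {v: [] for v in range(1, num_nodes + 1)}
    for a, b in edges:
        succ[a].append(b)
        indeg[b] += 1
    ready = deque(v for v in indeg if indeg[v] == 0)
    order = []
    while ready:
        v = ready.popleft()
        order.append(v)
        for w in succ[v]:
            indeg[w] -= 1
            if indeg[w] == 0:
                ready.append(w)
    return order

# Assumed dependencies as in Fig. 2: 2 depends on 1; 3 and 5 depend on
# 2 and 4; 6 depends on 3 and 5; 7 depends on 6.
edges = [(1, 2), (2, 3), (4, 3), (2, 5), (4, 5), (3, 6), (5, 6), (6, 7)]
order = topological_order(7, edges)
```

Any such order guarantees that a sub-coflow is processed only after everything it depends on.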

In this paper, we formulate the scheduling problem as the Multi-stage Inter-Coflow Scheduling (MICS) Problem and present a multi-stage aware online scheduling framework to solve it. With this framework, we convert the online scheduling problem to an offline one while retaining multi-stage awareness. We first introduce a series of assumptions to help us focus on the important factors, and then propose two efficient multi-stage aware coflow scheduling algorithms: the Iteratively Approaching Optimal (IAO) algorithm and the Multi-stage Least Bottleneck First (MLBF) algorithm. The former helps guarantee fairness among coflows, while the latter focuses on reducing CCT. Further analysis indicates that, with appropriate modification to MLBF, a strong assumption can be removed; we thus propose an enhanced algorithm, Maximizing Parallelism Least Bottleneck First (MPLBF), which is more general and still efficient.

When designing IAO, we partition the MICS problem into multiple Single-stage Inter-Coflow Scheduling (SICS) problems. The SICS problem refers to minimizing the average CCT of sub-coflows of the same stage. We then use a greedy strategy in an attempt to obtain a fair solution to MICS by optimizing the SICS problem of each stage. We observe and prove that the SICS problem is a convex optimization problem, for which the optimal solution is easy to obtain. By virtue of the interior point method (Bazaraa et al. 2013), we design IAO to minimize the average sub-coflow completion time (SCCT). Experiments show that, because of its multi-stage awareness, IAO significantly outperforms a non-multi-stage-aware scheduler (e.g., Varys (Chowdhury et al. 2014)). As the number of coflows increases exponentially, IAO improves both the average CCT (from 2\(\times\) to 7.95\(\times\)) and the completion time of all coflows (from 2.6\(\times\) to 7.42\(\times\)) compared to Varys.

The scheduling of IAO is evidently not optimal, and the granularity of such a simple partitioning is coarse. The optimization goal of IAO is not the CCT but the sub-coflow completion time. This brings fairness among coflows but induces a lack of foresight in terms of average CCT. IAO focuses only on the current sub-coflows, producing an optimal solution for each SICS instance, but the simple concatenation of these optimal solutions is not good enough for MICS. To optimize SICS, IAO is prone to transferring all the current sub-coflows simultaneously, increasing the overlap of the computation phases of these coflows. As a consequence, these coflows finish their transmission and enter their computation phases at nearly the same time, which wastes a great deal of link bandwidth during computation.

From this observation, we argue that a good scheduler should improve network utilization as much as possible, and that occupying the network exclusively has an advantage over sharing it. Based on SJF, we propose MLBF to schedule sub-coflows in a prioritized manner. In our experiments, it outperforms IAO by \(1\times\) to \(1.65\times\) in average CCT as the number of coflows increases.

Considering that IAO and MLBF depend on a strong assumption that every coflow is a bijection from ingress ports to egress ports, we conduct a further analysis and present an equally efficient but more general heuristic, MPLBF. MPLBF is modified from MLBF but works even when coflows are not bijective.

To take advantage of exclusive transmission, MPLBF tries to maximize the parallelism of flow transmission. It first meets the needs of the coflow with the smallest bottleneck time by allocating full bandwidth to it. Subsequently, coflows obtain the full bandwidth of their not-yet-occupied ports, one by one in order of priority. In our experiments, MPLBF outperforms Varys by \(2.32\times\) to \(8.03\times\) in average CCT as the number of coflows increases exponentially.

The rest of this paper is organized as follows. We specify the network model, coflow abstraction, and sub-coflow definition in Sect. 2. We formulate the MICS problem and prove its hardness in Sect. 3. We then propose three heuristics to solve this problem in Sect. 4 and evaluate them in Sect. 5. Finally, we discuss our work in Sect. 6, present related work in Sect. 7, and conclude this paper in Sect. 8.

2 Background

In this section, we detail the background and concepts related to our problem MICS.

2.1 Coflow

A coflow is defined as a collection of parallel flows with a common performance goal. Flows within the same coflow serve the same job and depend on each other, so inter-coflow scheduling is more beneficial to a job than scheduling flows separately. The tuple \(f=(p, q)\) represents a flow whose source port is p and destination port is q. A coflow contains multiple flows, so a coflow C is a set of flows \(C=\{f_1, f_2, \dots \}\). For a multi-stage coflow, the volumes of flows vary with stages, but the flows themselves, defined by source-destination pairs, do not change across stages. The set of coflows in the data center network is \({\mathbb {C}}=\{C_1, C_2, C_3, \dots , C_n\}\).

2.2 Sub-coflow

According to previous observations, we find that in most cases a coflow has multiple stages. In each stage, the flows belonging to the coflow are transmitted first; after all flows finish, the coflow cannot continue transmitting, since it has to wait for the completion of the computation phase. In other words, a stage of a coflow comprises two phases: transmission and computation. Only after all flows finish their transmission can computation begin, and once transmission finishes, computation begins immediately. For convenience, we refer to a stage of a coflow as a sub-coflow and let sub-coflow \(C_{i,j}\) denote the jth stage of coflow \(C_i\).

In addition to the synchronization of flows in a sub-coflow, there are also dependencies between sub-coflows. A sub-coflow can only begin after its prior sub-coflows finish.

Notation \(t_{i,j}\) denotes the start time of \(C_{i,j}\), and we set \(t_{i,1}=0\ (\forall 1\le i\le n)\); that is, all coflows arrive in the network at time 0. When \(C_{i,j}\) begins, it first enters the transmission phase, transmitting the data of its flows. After all flows are transmitted, \(C_{i,j}\) enters the computation phase. When the computation completes, \(C_{i,j}\) completes accordingly. The total time of transmission and computation of a single sub-coflow is its sub-coflow completion time (SCCT).
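For concreteness, the SCCT of one sub-coflow can be computed as below; the volumes, rates, and computation time are hypothetical numbers in the spirit of Fig. 1, not values from the paper:

```python
def scct(volumes, rates, compute_time):
    """Sub-coflow completion time: the transmission phase lasts until the
    slowest flow finishes, then the computation phase runs to completion."""
    transmission = max(v / r for v, r in zip(volumes, rates))
    return transmission + compute_time

# Hypothetical sub-coflow with three flows (cf. C_{i,1} in Fig. 1):
# the 100-unit flow is the bottleneck, so computation starts when it ends.
t = scct(volumes=[40, 100, 60], rates=[10, 10, 10], compute_time=3)
```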

2.3 Network model

Similar to Varys (Chowdhury et al. 2014), we abstract the network as a giant non-blocking switch (Ballani et al. 2011; Kang et al. 2013; Popa et al. 2012) and consider the bandwidth of ingress and egress ports the sole source of transmission limits. This model is practical because of recent advances in data center fabrics (Alizadeh et al. 2014; Greenberg et al. 2009; Niranjan Mysore et al. 2009). What we need to pay attention to is the bandwidth allocation for the flows at each port. In Fig. 3, there are 4 flows at ingress port 2, and they belong to \(C_1, C_2, C_3, C_4\) respectively. \(C_1\) has flows at all three ingress ports while \(C_3\) only has flows at ports 2 and 3 (Table 1).
Fig. 3

We consider the network model as a giant non-blocking switch. There are 4 coflows in this network and multiple flows in the three ingress ports. Color indicates the coflow that a flow belongs to (color figure online)

3 Problem statement

Table 1

Main notations for quick reference

\({\mathbb {C}}\): the set of coflows in the network

\(n\): the number of coflows

\(C_i\): the ith coflow in \({\mathbb {C}}\)

\(R\): the total bandwidth of a port in the network

\(C_{i,j}\): the jth sub-coflow of coflow \(C_i\)

\(t_{i,j}\): the begin time of sub-coflow \(C_{i,j}\)

\(T_{i,j}\): the computing time of sub-coflow \(C_{i,j}\)

\(V_{i,j,p,q}\): the volume of flow \(f=(p,q)\in C_{i,j}\)

\(r_{i,j,p,q}(t)\): the rate of flow \(f=(p,q)\in C_{i,j}\) at time t

\(\tau _{i,j,p,q}\): the transmission time of flow \(f=(p,q)\in C_{i,j}\)

A coflow scheduler should decide the rates of uncompleted flows. We use the function \(r_{i,j,p,q}(t)\) to denote the solution; it describes the transmission rate of flow \(f=(p,q)\in C_{i,j}\) at time t. With this function, the entire transmission of flow f can be formulated as follows.
$$\begin{aligned} \int _0^{\tau _{i,j,p,q}} r_{i,j,p,q}(t)\, dt = V_{i,j,p,q} \end{aligned}$$
where \(\tau _{i,j,p,q}\) is the transmission time of f, and \(V_{i,j,p,q}\) is the volume of f.
We aim to minimize the average CCT of the set \({\mathbb {C}}\), which comprises all coflows in the network model. The objective is:
$$\begin{aligned} \min \ \frac{1}{n} \sum _{i=1}^n {\mathrm{CCT}}(C_i) \end{aligned}$$
For simplicity, we assume that all coflows have \(\phi\) stages. When the transmission phase of the last sub-coflow \(C_{i,\phi }\) completes, the entire transmission of the job related to \(C_i\) finishes, and coflow \(C_i\) is deemed finished at the same time. Note that \(t_{i,\phi }\) is the start time of \(C_{i,\phi }\), and the transmission time of \(C_{i,\phi }\) is the longest transmission time among its flows. Hence the detailed mathematical description of CCT is given by
$$\begin{aligned} {\mathrm{CCT}}(C_i)=t_{i,\phi }+\max _{(p,q)\in C_i} \tau _{i,\phi ,p,q} \end{aligned}$$
Coflow scheduling is subject to the following constraints.
Bandwidth constraint. The total bandwidth of a port is limited to R, so at any time t, the total bandwidth allocated to flows of any port should not exceed its capacity:
$$\begin{aligned}&\sum _{i,j,q} r_{i,j,p,q}(t) \le R,\quad \forall t \end{aligned}$$
$$\begin{aligned}&\sum _{i,j,p} r_{i,j,p,q}(t) \le R,\quad \forall t \end{aligned}$$
Dependency constraint. For every coflow, its sub-coflows must be processed according to their dependencies. Sub-coflow \(C_{i,j}\) cannot begin until all the sub-coflows it depends on finish. The time cost of a sub-coflow equals the transmission time of all its flows plus the computation time. Since its flows are transmitted in parallel, the transmission time is the longest transmission time among the flows in \(C_i\). The formulation is as follows:
$$\begin{aligned}&t_{i,h}+\max _{(p,q)\in C_i} \tau _{i,h,p,q} + T_{i,h} \le t_{i,j}, \nonumber \\&\quad \forall h\in \{x|C_{i,j}\ \mathrm{depends\ on}\ C_{i,x}\} \end{aligned}$$
where \(T_{i,h}\) is the computing time of sub-coflow \(C_{i,h}\).
To address the challenge of dependency, we number sub-coflows in the topological order of the directed acyclic graph of dependencies and assume that, for a coflow \(C_i\), sub-coflow \(C_{i,j}\) depends on exactly the sub-coflows \(C_{i,h}\ (1\le h <j)\). In this way, once the previous sub-coflow \(C_{i,j-1}\) completes, the next sub-coflow \(C_{i,j}\) can begin immediately. As shown in Fig. 4, each sub-coflow x depends on the sub-coflows whose numbers are less than x, so the only sub-coflow processing sequence is 1, 2, 3, 4. This assumption ensures the processing sequence is ascending by sub-coflow number: \(C_{i,1}, C_{i,2}, C_{i, 3}, \dots , C_{i, \phi }\). The original Eq. (6) can then be converted into:
$$\begin{aligned}&t_{i,h}+\max _{(p,q)\in C_i} \tau _{i,h,p,q} + T_{i,h} \le t_{i,j}, \nonumber \\&\quad \forall h\in \{x|1\le x<j, x\in Z\} \end{aligned}$$
The assumption is reasonable because in most cases, stages of a job are processed in a pre-defined and linear order.
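Under this linear-order assumption, the constraint collapses to a simple recurrence: each stage starts once the previous stage's slowest flow and its computation finish. A sketch with hypothetical per-stage transmission and computation times (our numbers, not the paper's):

```python
def stage_start_times(trans_times, comp_times):
    """Under the linear-order assumption, stage j starts as soon as stage
    j-1 finishes: t_{i,j} = t_{i,j-1} + max_f tau_{i,j-1,f} + T_{i,j-1}."""
    t = [0.0]  # t_{i,1} = 0
    for tau, T in zip(trans_times[:-1], comp_times[:-1]):
        t.append(t[-1] + max(tau) + T)
    return t

# Hypothetical 3-stage coflow: per-stage flow transmission times and
# per-stage computation times.
trans = [[4, 10, 6], [5, 5], [8]]
comp = [3, 2, 1]
starts = stage_start_times(trans, comp)
# CCT = start of the last stage plus its longest transmission time.
cct = starts[-1] + max(trans[-1])
```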
Fig. 4

Sub-coflows are numbered in topological order of the dependency graph. Each sub-coflow depends on the sub-coflows whose numbers are less than its own. A\(\rightarrow\)B indicates that B depends on A

We model this scheduling problem as the Multi-stage Inter-Coflow Scheduling (MICS) Problem and find it to be NP-hard.

Theorem 1

The MICS problem is NP-hard.


Proof Chowdhury et al. (2014) proved that, in the same network model, scheduling single-stage coflows to minimize average CCT is NP-hard. The MICS problem is the scheduling of a combination of many single-stage coflows. As a result, minimizing the average CCT of multi-stage coflows is NP-hard. \(\square\)

4 Online scheduling

During online scheduling, we use the following strategy: at time 0, the rates of all coflows are initialized by our scheduling algorithm. After that, the algorithm is invoked only when a new sub-coflow arrives or an old flow completes. During the interval between two invocations, the rates of flows do not change. With this framework, resource allocation can be dynamically updated according to network conditions.

We first explore a simplified scheduling problem under the assumption that a coflow \(C=\{f_1=(p_1,q_1), f_2=(p_2,q_2), \ldots \}\) is a bijection from the ingress port set \({\mathbb {P}}\) to the egress port set \({\mathbb {Q}}\). This means that each ingress port sends data to only one egress port and each egress port receives data from only one ingress port. Since a flow is defined as a tuple of source and destination, every coflow has at most one flow at any port. As a consequence, the rate allocation of flows that share the same source but have different destinations is not within our consideration. Strong as it is, this assumption gives us inspiration for solving the original problem, and it will be removed later in this section.

4.1 Greedy strategy

With the above online scheduling framework, there is no need to directly calculate the function \(r_{i,j,p,q}(t)\). A rate matrix r with entries \(r_{i,f}\) is enough, where \(r_{i,f}\) is the rate of flow \(f\in C_i\) before the next scheduling. Based on a simple greedy strategy, we try to minimize the average SCCT of the current sub-coflows, in order to ensure fairness among coflows while approximating the optimal solution. Each time the scheduling algorithm is invoked, it minimizes the average completion time of the current sub-coflows. Coflows can be synchronized through scheduling, thus avoiding starvation. The current stage of \(C_i\) is \(s_i\), the completion time of the current sub-coflow \(C_{i, s_i}\) is \(t_i\), and the remaining volume of flow \(f\in C_{i, s_i}\) is \(V_{i,f}\). Then Eq. (1) turns into:
$$\begin{aligned} t_i = \max _f \frac{V_{i,f}}{r_{i,f}} \end{aligned}$$

In the above equation, a flow is directly denoted as f rather than (p, q). This is because the coflow is a bijection, so there is only one flow for a given p or q.

Let our optimization goal be a function T(r) of the rate matrix r, equal to the sum of SCCTs of all the current sub-coflows:
$$\begin{aligned} T(r)= \ \sum _{i=1}^n t_i=\sum _{i=1}^n \max _f \frac{V_{i,f}}{r_{i,f}} \end{aligned}$$
The set of flows of all current sub-coflows is \(\bigcup _{i=1}^n C_{i,s_i}\). For convenience, we denote its size as \(\omega =|\bigcup _{i=1}^n C_{i,s_i}|\). Then the constraint of port bandwidth Eqs. (4) and (5) can be converted into:
$$\begin{aligned} \sum _{i=1}^n r_{i,f} \le R,\quad \forall 1\le f\le \omega \end{aligned}$$
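The objective and the port-capacity constraint can be checked mechanically. The sketch below uses hypothetical volumes and rates (our numbers, for illustration only):

```python
def average_scct_objective(V, r):
    """T(r) = sum_i max_f V[i][f] / r[i][f], the objective of SICS
    (row i = coflow, column f = port/flow, bijective coflows)."""
    return sum(max(Vif / rif for Vif, rif in zip(Vi, ri))
               for Vi, ri in zip(V, r))

def feasible(r, R):
    """Per-port bandwidth constraint: each column sum must not exceed R."""
    return all(sum(col) <= R for col in zip(*r))

# Two bijective coflows sharing two ports, R = 10 (hypothetical numbers).
V = [[30, 20], [10, 40]]
r = [[6, 4], [4, 6]]
assert feasible(r, 10)
T = average_scct_objective(V, r)
```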

After the aforementioned transformation, MICS is partitioned into multiple rate-allocation problems, one per scheduling event. We model the sub-problem as the Single-stage Inter-Coflow Scheduling Problem (SICS). Fortunately, it is a convex optimization problem.

Theorem 2

The SICS problem is a convex optimization problem.


Proof To prove convexity, we need to prove two propositions: (1) the objective function T(r) is convex; (2) the solution space of T(r) is a convex set.

(1) The function T(r) is convex. First, we prove that the function \(\max (\mathbf {x})\) is convex, where \(\mathbf {x}\) is an n-dimensional vector.

For any two n-dimensional vectors \(\mathbf {a}=(a_1,a_2,\dots ,a_n)\), \(\mathbf {b}=(b_1,b_2,\dots ,b_n)\) and any number \(t\in [0,1]\), let \(a_i=\max (\mathbf {a})\) and \(b_j=\max (\mathbf {b})\). Then we have
$$\begin{aligned} a_k \le a_i \Rightarrow ta_k\le ta_i,\quad \forall 1\le k \le n \end{aligned}$$
$$\begin{aligned} b_k \le b_j \Rightarrow (1-t)b_k\le (1-t)b_j,\quad \forall 1\le k \le n \end{aligned}$$
As a result,
$$\begin{aligned} ta_k+(1-t)b_k\le ta_i+(1-t)b_j,\quad \forall 1\le k\le n \end{aligned}$$
which is equivalent to:
$$\begin{aligned} \max (t\mathbf {a}+(1-t)\mathbf {b}) \le t\max (\mathbf {a})+(1-t)\max (\mathbf {b}) .\end{aligned}$$

So \(\max (\mathbf {x})\) is convex. Since the function \(g(r_{i,f})=\frac{V_{i,f}}{r_{i,f}}\) is convex for \(r_{i,f}>0\), \(h(\mathbf {r}_i)=\max \nolimits _f\ g(r_{i,f})\) is convex. Then \(T(r)=\sum \nolimits _{i=1}^n h(\mathbf {r}_i)\) is convex.

(2) The solution space of T(r) is a convex set.

Since r is an \(n\times \omega\) matrix, the solution space is the set of all \(n\times \omega\) matrices X that satisfy the following constraints
$$\begin{aligned} \begin{aligned} \left\{ \begin{matrix} X_{i,f}\in [0,R],\\ \sum \nolimits _{i=1}^n X_{i,f}\le R \end{matrix}\right. \end{aligned} (\forall 1\le i \le n, \quad \forall 1\le f \le \omega ) \end{aligned}$$
For any two matrices A and B in the solution space, let \(C=\lambda A+(1-\lambda )B\) for any \(\lambda \in [0,1]\); then any element \(C_{i,f}\) satisfies
$$\begin{aligned} C_{i,f}=\lambda A_{i,f}+(1-\lambda )B_{i,f} \in [0,R] \end{aligned}$$
$$\begin{aligned} \begin{aligned} \sum \limits _{i=1}^n C_{i,f}&=\sum \limits _{i=1}^n [\lambda A_{i,f}+(1-\lambda )B_{i,f}]\\&=\lambda \sum \limits _{i=1}^n A_{i,f}+(1-\lambda )\sum \limits _{i=1}^n B_{i,f} \\&\le R \end{aligned} \end{aligned}$$
Based on the above results, the new matrix C still belongs to the solution space, so the solution space is a convex set. In summary, the SICS problem is a problem of minimizing a convex function over a convex set, and it is a convex optimization. \(\square\)
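Theorem 2 can also be spot-checked numerically: for random feasible rate matrices, T evaluated at a convex combination should never exceed the corresponding combination of values. This is a sanity check under assumed random inputs, not a proof:

```python
import random

def T(V, r):
    """T(r) = sum_i max_f V[i][f] / r[i][f]."""
    return sum(max(v / x for v, x in zip(Vi, ri)) for Vi, ri in zip(V, r))

# Convexity check: T(lam*A + (1-lam)*B) <= lam*T(A) + (1-lam)*T(B).
random.seed(7)
V = [[random.uniform(1, 100) for _ in range(3)] for _ in range(2)]
for _ in range(1000):
    A = [[random.uniform(0.1, 5) for _ in range(3)] for _ in range(2)]
    B = [[random.uniform(0.1, 5) for _ in range(3)] for _ in range(2)]
    lam = random.random()
    C = [[lam * a + (1 - lam) * b for a, b in zip(ra, rb)]
         for ra, rb in zip(A, B)]
    assert T(V, C) <= lam * T(V, A) + (1 - lam) * T(V, B) + 1e-9
```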

4.2 A fair solution to SICS: IAO

Given that the SICS problem is a convex optimization problem, a local optimum must be globally optimal. We present the Iteratively Approaching Optimal (IAO) algorithm, shown in Algorithm 1, based on the interior point method (Bazaraa et al. 2013). In the iterative process, the result gradually approaches the optimal solution.

In Algorithm 1, we construct the barrier function \({\overline{F}}(r,\zeta _k)\) to ensure that the solution r always stays within the feasible region. It can do this because of its property: \({\overline{F}}(r,\zeta _k)\) is close to T(r) when r is inside the feasible region and far from its boundary, but becomes extremely large otherwise. Therefore, we can obtain the optimal solution of the objective T(r) by iteratively minimizing \({\overline{F}}(r,\zeta _k)\). In each iteration, we shrink the barrier factor by a decreasing coefficient c in the range [0.1, 0.5]. As \(\zeta _k\) decreases, \({\overline{F}}(r,\zeta _k)\) gets closer and closer to T(r), until the precision \(\epsilon\) is achieved.
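A minimal sketch of such a barrier function, assuming a standard logarithmic barrier over the port-capacity and positivity constraints (Algorithm 1 is not reproduced here, so this exact form is our assumption):

```python
import math

def barrier(T, r, R, zeta):
    """Log-barrier F(r, zeta): close to T(r) well inside the feasible
    region, grows near the boundary, and is +inf outside it."""
    port_slack = [R - sum(col) for col in zip(*r)]
    if any(s <= 0 for s in port_slack) or any(x <= 0 for row in r for x in row):
        return math.inf  # infeasible: barrier is unbounded
    return T(r) - zeta * (sum(math.log(s) for s in port_slack)
                          + sum(math.log(x) for row in r for x in row))

# Hypothetical objective on a 2x2 rate matrix with fixed volumes V.
V = [[30.0, 20.0], [10.0, 40.0]]
T = lambda r: sum(max(v / x for v, x in zip(Vi, ri)) for Vi, ri in zip(V, r))
R = 10.0
r_ok = [[4.0, 4.0], [4.0, 4.0]]    # columns sum to 8 <= R: feasible
r_bad = [[6.0, 6.0], [6.0, 6.0]]   # columns sum to 12 > R: infeasible
```

Each iteration would minimize this function and then shrink zeta by the coefficient c, so the barrier term vanishes and the minimizer approaches that of T.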
The interior point method is effective, but its time complexity is high. To avoid unnecessary initial search and speed it up, we set the initial allocation by Weighted Fair Sharing (WFS):
$$\begin{aligned} \frac{V_{1,f}}{r_{1,f}}=\frac{V_{2,f}}{r_{2,f}}=\dots =\frac{V_{n,f}}{r_{n,f}} \ (\forall 1\le f\le \omega ). \end{aligned}$$
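Combined with full port utilization \(\sum _i r_{i,f} = R\), the WFS condition has the closed form \(r_{i,f} = R\, V_{i,f} / \sum _k V_{k,f}\), so that all flows at a port finish at the same time. A sketch (the helper name and numbers are ours):

```python
def wfs_rates(V, R):
    """Weighted Fair Sharing: at each port f, give coflow i the share
    r[i][f] = R * V[i][f] / sum_k V[k][f], so every flow at port f
    finishes at the same time sum_k V[k][f] / R."""
    n, omega = len(V), len(V[0])
    return [[R * V[i][f] / sum(V[k][f] for k in range(n))
             for f in range(omega)] for i in range(n)]

R = 20.0
V = [[100, 60], [100, 140]]   # hypothetical remaining volumes
r = wfs_rates(V, R)
```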

Every time network conditions change, for instance, when a new sub-coflow is about to start or an old flow has just completed, IAO is invoked to reallocate rates, and the allocation does not change before the next event. Initialized by WFS and updated a limited number of times, IAO behaves like WFS, which schedules the current sub-coflows to complete at the same time. As a result, under IAO their completion times will be close as well. Since IAO solves the SICS problem over all current sub-coflows, it synchronizes them in this manner whenever it is invoked. In the long term, the transmission times of all sub-coflows will be close, which is fair to coflows. In our experiments, under high workload, the improvement in fairness is significant. The overall average CCT is also improved, owing to multi-stage awareness.

4.3 A far-sighted heuristic: MLBF

In the matter of overall average CCT, the greedy scheduling scheme is short-sighted: it focuses only on short-term SCCT minimization and lacks awareness of the global optimization.

It should be noted that if all of the current sub-coflows begin computing simultaneously, the network will be completely idle during this period, which is a great waste of link bandwidth. Unfortunately, initialized by WFS and updated a limited number of times, IAO generally performs like WFS. As a consequence, the greedy IAO tends to minimize the completion time of the current sub-coflows and schedule them to enter their transmission and computation phases at the same time.

According to the analysis above, a far-sighted scheduler should take not only the current sub-coflows but also subsequent sub-coflows into consideration. In the long term, if network links are occupied by sub-coflows (not only the current ones) one by one, simultaneous transmission and computation is avoided, so competition for bandwidth between sub-coflows during transmission is reduced. Although the scheduling acts only on current sub-coflows, subsequent sub-coflows benefit from it.

Under this scheme, sub-coflows are transmitted in a certain sequence: while the sub-coflows that have finished their transmission are in the computation phase, the network is occupied by those that started transmission later. The bandwidth demand of coflows is apportioned over the entire process, which reduces congestion during peak periods. Based on SJF, we set the aforementioned sequence by the bottleneck time of sub-coflows. Sub-coflows with smaller bottleneck times can finish at less cost under the same network conditions, so they deserve higher priority.

We define the bottleneck time of a coflow as the transmission time of its current sub-coflow if all the remaining bandwidth were allocated to it. Let \(\varLambda _f\) be the remaining bandwidth available to flow f and \(D_f\) the amount of data of flow \(f\ (f\in C)\); the bottleneck \(\varGamma _C\) can be calculated by
$$\begin{aligned} \varGamma _C = \max _f \frac{D_f}{\varLambda _f}. \end{aligned}$$
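In code, with hypothetical per-flow data volumes and remaining bandwidths:

```python
def bottleneck(data, remaining_bw):
    """Gamma_C = max_f D_f / Lambda_f: the transmission time of the
    current sub-coflow if all remaining bandwidth were given to it."""
    return max(D / L for D, L in zip(data, remaining_bw))

# Hypothetical coflow with three flows: the 90-unit flow and the 60-unit
# flow tie as the bottleneck at 3 time units.
g = bottleneck(data=[60, 90, 30], remaining_bw=[20, 30, 20])
```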

We propose the Multi-stage Least Bottleneck First (MLBF) heuristic. Like IAO, MLBF runs under the previously mentioned online scheduling framework: when a sub-coflow arrives or a flow completes, MLBF is invoked to reallocate the bandwidth of ports. In Algorithm 2, we first calculate the bottlenecks of coflows and sort them in ascending order. A coflow with a smaller bottleneck should have a higher priority, so we allocate bandwidth in ascending order of bottleneck. We then set the rates of the flows in each coflow so that it completes exactly at the end of its bottleneck time. Last but not least, to utilize the network fully, we allocate the remaining bandwidth evenly among the unfinished coflows.
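A sketch of this allocation step, based on our reading of the description above (Algorithm 2 itself is not reproduced here); the final even redistribution of leftover bandwidth is omitted for brevity, and the data layout is an assumption:

```python
def mlbf_allocate(coflows, R):
    """MLBF allocation sketch. Coflows are bijective, so each current
    sub-coflow has at most one flow per port.
    coflows: dict name -> {port: remaining volume}."""
    remaining = {p: R for c in coflows.values() for p in c}
    # Least bottleneck first, priorities computed at full bandwidth R.
    order = sorted(coflows,
                   key=lambda c: max(v / R for v in coflows[c].values()))
    rates = {}
    for c in order:
        if any(remaining[p] <= 0 for p in coflows[c]):
            continue  # a needed port is saturated; this coflow waits
        gamma = max(v / remaining[p] for p, v in coflows[c].items())
        # Pace every flow to finish exactly at the coflow's bottleneck time.
        rates[c] = {p: v / gamma for p, v in coflows[c].items()}
        for p in coflows[c]:
            remaining[p] -= rates[c][p]
    return rates

R = 10.0
coflows = {'C1': {1: 20, 2: 10}, 'C2': {2: 30, 3: 40}}  # hypothetical
rates = mlbf_allocate(coflows, R)
```

Here C1 (bottleneck 2) is served first at full speed on its bottleneck port, and C2 is paced over whatever bandwidth is left.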

Compared to IAO, MLBF solves an important problem: current sub-coflows tend to crowd together, transmitting and computing simultaneously, like traffic jams. By prioritizing current sub-coflows, some sub-coflows are accelerated and the jams are alleviated. As a consequence, the total completion time of all coflows is reduced.

4.4 More general heuristic: MPLBF

So far, both of our multi-stage aware scheduling algorithms, IAO and MLBF, rest on the basic assumption that every coflow is a bijection from ingress ports to egress ports. This assumption may not hold in all scenarios, so we propose a more general and equally far-sighted heuristic: Maximizing Parallelism Least Bottleneck First (MPLBF).

In the new setting, each ingress port can send data to any egress port and each egress port can receive data from any ingress port. Besides, a coflow may have multiple flows at a port. For a single port, we need to consider both the total bandwidth allocated to a coflow and the bandwidth of the flows within it.

Based on the analysis in Sect. 4.3, we should process sub-coflows in a prioritized order to achieve high link utilization. The main problem to solve is how to allocate bandwidth within the same coflow, since multiple flows of one coflow may occupy the same port. If a flow gets higher bandwidth at its source port but lower bandwidth at its destination port, its actual rate is the lower one; the extra share at the source port is wasted, even though it could be useful to another flow at that port.

Here we focus on the allocation among flows of the same coflow. For a sub-coflow \(C_{i,j}\) with multiple flows at the same port p, if the total volume of these flows is D and the available bandwidth is \(R_C\), then regardless of the order in which the flows are transmitted, the completion time of \(C_{i,j}\) at port p is \(D/R_C\). Therefore the transmission order of flows at port p is immaterial to \(C_{i,j}\) itself.

But it may be vital to other sub-coflows: proper rate allocation among sub-coflows can avoid link idleness. For example, in Fig. 5a, both \(C_1\) and \(C_2\) have 2 stages and 2 flows. If rates are distributed to flows proportionally at each port, the situation is as in Fig. 5b: the rates of the flows of \(C_1\) and \(C_2\) are 1 and 2, and \(C_1\) and \(C_2\) enter their transmission and computation phases simultaneously. During the period [3, 4], the link is idle because both coflows are waiting for computation, which wastes network resources.
Fig. 5

Comparison between simultaneous and criss-cross manners. a Shows a simplified case. In b and c, vertical axis indicates the allocation of bandwidth to flows at ports

A transmission scheme in which a port transmits only one flow at any time helps the scheduler out of the above dilemma. With this scheme, flows that do not interfere with each other can be transmitted in parallel between ingress and egress ports, and all of them can be sent at full speed. To reduce the latency of low-priority flows, we adopt the principle of Least Remaining Time First (LRTF). Given that every port has bandwidth R, LRTF is equivalent to SJF: within the same sub-coflow, we give priority to smaller flows and transmit them at full speed. Taking Fig. 5 as an example again, if flows are prioritized by LRTF as in Fig. 5c, sub-coflows are processed in a criss-cross manner and there is no link idleness before the coflows finish. As a result, the average CCT in Fig. 5c is smaller than that in Fig. 5b by 2 time units.

We design MPLBF as Algorithm 3. Like MLBF, we first prioritize coflows according to their bottlenecks, and coflows with smaller bottlenecks get priority to occupy ports exclusively. Within a coflow, the flow with the least remaining time gets all the bandwidth of both its source port and its destination port. Once a port is allocated to a flow, all of its bandwidth is given to that flow; a flow whose source or destination port has already been allocated gets no bandwidth.
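One allocation round of the scheme just described can be sketched as follows. This is our simplified reading of the description, not the paper's Algorithm 3; the function name and data layout are our own assumptions:

```python
def mplbf_round(coflows, R):
    """coflows: list of (bottleneck, flows); each flow is (src, dst, remaining).
    Returns {(src, dst): R} for the flows chosen in this round."""
    used_src, used_dst, alloc = set(), set(), {}
    # Coflows with smaller bottlenecks occupy ports first.
    for _, flows in sorted(coflows, key=lambda c: c[0]):
        # Inside a coflow, the flow with less remaining time goes first.
        for src, dst, rem in sorted(flows, key=lambda f: f[2]):
            # A flow is skipped if either endpoint is already allocated.
            if src in used_src or dst in used_dst:
                continue
            used_src.add(src)
            used_dst.add(dst)
            alloc[(src, dst)] = R  # the chosen flow gets the full bandwidth
    return alloc

coflows = [
    (40.0, [(1, 1, 40.0), (2, 2, 10.0)]),  # smaller bottleneck, goes first
    (90.0, [(1, 2, 90.0), (3, 3, 30.0)]),
]
alloc = mplbf_round(coflows, R=20.0)
# Ports 1 and 2 go to the first coflow; flow (1, 2) of the second coflow is
# blocked, but (3, 3) uses free ports, so three flows run at full rate.
assert set(alloc) == {(1, 1), (2, 2), (3, 3)}
```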

5 Evaluation

In this section, we evaluate our algorithms, including IAO, MLBF and MPLBF, against the state-of-the-art coflow scheduler Varys. In Sect. 5.1, we evaluate the performance of IAO and MLBF with bijective coflows. In Sect. 5.2, we evaluate the performance of MPLBF, where flows can form any mapping from ingress ports to egress ports.

5.1 Results of IAO and MLBF

We develop our testbed in MATLAB with the following default parameters: there are 70 coflows in the DCN, each with 10 stages and 50 flows; the bandwidth of each port is 20 MB/s; flow volumes follow a uniform distribution over [0, 200 MB]; and during the computation phases of sub-coflows, the computing speed is 10 MB/s. We set the data volumes large enough that every experiment lasts at least 1000 simulated seconds, which avoids errors caused by special circumstances.
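For reference, the default configuration above can be collected into a single structure so that each later experiment varies one knob at a time; the field names here are our own, not the paper's:

```python
# Default simulation parameters from Sect. 5.1 (field names are ours).
DEFAULTS = dict(
    n_coflows=70,
    stages_per_coflow=10,
    flows_per_coflow=50,
    port_bandwidth=20.0,          # MB/s
    flow_volume_range=(0.0, 200.0),  # MB, uniform distribution
    computing_speed=10.0,         # MB/s during computation phases
)
# e.g. the largest flow alone needs 200 / 20 = 10 simulated seconds
assert DEFAULTS["flow_volume_range"][1] / DEFAULTS["port_bandwidth"] == 10.0
```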

In Fig. 6, as the number of coflows in the DCN increases exponentially, the average network bandwidth available to each coflow narrows, leading to an increase in the average CCT of Varys. However, the multi-stage aware schedulers using IAO and MLBF handle this situation well: their improvements grow significantly with the increase in scale, so multi-stage aware schedulers are clearly more capable of tackling coflows at massive scale. In addition, when the number of coflows exceeds 40, the advantage of MLBF over IAO gradually appears. The improvements of IAO and MLBF over Varys increase from 2 \(\times\) to 4.81 \(\times\) and from 2.01 \(\times\) to 7.95 \(\times\), respectively.
Fig. 6

Influence of the scale of coflows

In real-world scenarios, the distribution of coflow volumes may vary considerably. In Fig. 7, we evaluate the sensitivity of the three scheduling schemes with coflows under uniform (Uni.), exponential (Exp.), and normal (Nm.) distributions, all configured with the same average data volume. The normal distribution is tested with standard deviations of 25, 50, 75, and 100. Figure 7 shows that the multi-stage aware schedulers perform steadily while Varys fluctuates. Under all distributions, MLBF and IAO outperform Varys, and MLBF keeps an advantage of 1.56 \(\times\) over IAO on average.
Fig. 7

Sensitivity to the distribution of coflows

Computation phases account for an important part of job processing. Network utilization is influenced by the computing speed, because overlapping computation with transmission reduces the idle time of the network. We test the three scheduling schemes with different computing speeds. Figure 8 shows a notable decline in the average CCT of Varys as the computing speed improves: Varys is not aware of the multi-stage characteristic, so its CCT depends directly on the computation time and is sensitive to the computing speed. Meanwhile, IAO and MLBF leverage this characteristic and perform from 1.75 \(\times\) to 4.34 \(\times\) and from 2.52 \(\times\) to 6.74 \(\times\) better than Varys, respectively. Owing to better network utilization, MLBF outperforms IAO by 1.49 \(\times\) on average.
Fig. 8

Sensitivity to the computing speed

Besides improving the average CCT, multi-stage awareness also brings forward the completion time of the latest coflow (Fig. 9). We observe that the latest coflow under the multi-stage aware schedulers has a much smaller CCT than under Varys, because the awareness gives the schedulers the insight to reduce unnecessary waiting time, potentially avoiding perpetual starvation.
Fig. 9

Multi-stage awareness potentially avoids perpetual starvation

However, IAO is more attractive than MLBF under high workloads because of its improvement in fairness among coflows. We define the improvement of average CCT (ACCT), the improvement of standard deviation (STD), and the factor of workload as follows:
$$\begin{aligned} {\mathrm{Improvement\ of\ ACCT}}= & {} \frac{{\mathrm{ACCT\ of\ Varys}}}{{\mathrm{ACCT\ of\ Alg.}}}\\ {\mathrm{Improvement\ of\ STD}}= & {} \frac{{\mathrm{STD\ of\ Varys}}}{{\mathrm{STD\ of\ Alg.}}}\\ {\mathrm{Factor\ of\ Workload}}= & {} \frac{{\mathrm{amount\ of\ data}}}{{\mathrm{maximum\ of\ bandwidth}}} \end{aligned}$$
In Fig. 10, as the workload increases, IAO achieves a substantially better STD improvement than MLBF, while its ACCT is only slightly worse. A smaller STD means fairer scheduling, which is important under high workloads.
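The two improvement metrics defined above can be computed directly from per-coflow CCT samples. The CCT values below are hypothetical, purely to exercise the formulas ("Alg." stands for IAO or MLBF):

```python
from statistics import mean, pstdev

def improvements(ccts_varys, ccts_alg):
    """Improvement of ACCT and of STD, per the definitions above."""
    acct_gain = mean(ccts_varys) / mean(ccts_alg)
    std_gain = pstdev(ccts_varys) / pstdev(ccts_alg)
    return acct_gain, std_gain

ccts_varys = [120.0, 300.0, 480.0]  # hypothetical per-coflow CCTs (s)
ccts_alg = [60.0, 90.0, 120.0]
acct_gain, std_gain = improvements(ccts_varys, ccts_alg)
# A std_gain above 1 means the algorithm spreads CCTs more fairly.
assert acct_gain > 1 and std_gain > 1
```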
Fig. 10

Performance of IAO and MLBF under different workloads; the a- prefix denotes ACCT and the s- prefix denotes STD

5.2 Results of MPLBF

In consideration of the complexity of simulating non-bijective flows, we re-implement our testbed in Python and adjust the default parameters as follows: there are 100 ingress ports and 100 egress ports with 20 coflows between them, and each coflow has 10 stages and 25 flows; the bandwidth of each port is 20 MB/s; flow volumes follow a uniform distribution over [0, 200 MB]; and during the computation phases of sub-coflows, the computing speed is 10 MB/s. As in Sect. 5.1, we avoid errors from special circumstances by setting sufficiently large data volumes.

Figure 11 compares the average CCT of MPLBF and Varys. By varying the number of coflows from 5 to 35, we investigate the sensitivity of the two algorithms to the scale of coflows. As the number of coflows increases, the average CCT of Varys rises proportionally while that of MPLBF is almost unchanged. Owing to its multi-stage awareness, MPLBF can reasonably arrange the priorities of different coflows in a criss-cross pattern, improving link utilization and reducing unnecessary waits. In contrast, Varys lacks multi-stage awareness and forces the larger coflows to wait for quite a few stages of the small coflows, which could be avoided. As a result, MPLBF outperforms Varys by 2.32 \(\times\) to 8.03 \(\times\).
Fig. 11

Comparison of average CCT between MPLBF and Varys with different number of coflows

The distribution of coflow data volumes has a remarkable influence on performance in practice. We simulate coflows under the uniform, exponential, and normal distributions, conducting six experiments with one distribution each. In each experiment, the flows within the same coflow C are uniformly distributed over [0, \(x_C\)], where \(x_C (\forall C\in {\mathbb {C}})\) follows the distribution under test. To avoid potential bias, we set the mathematical expectation of every distribution used in the experiments to 100 MB. Figure 12 shows the average CCT of Varys and MPLBF under a uniform distribution (Uni.) between 10 and 200, an exponential distribution (Exp.) with rate parameter \(\lambda = 0.01\), and normal distributions (Norm.) with standard deviations of 20, 30, 40, and 50. Under these distributions, MPLBF outperforms Varys by 3.91 \(\times\) to 5.24 \(\times\).
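The two-level volume generation described above can be sketched as follows. This is our reading of the setup, not the testbed code; the distribution parameters match the text (\(\lambda = 0.01\) gives mean 100 MB):

```python
import random

def coflow_volumes(n_flows, draw_xc):
    """Draw per-coflow bound x_C, then flow volumes uniform on [0, x_C]."""
    x_c = max(0.0, draw_xc())  # clamp: a normal draw can be negative
    return [random.uniform(0.0, x_c) for _ in range(n_flows)]

random.seed(0)
dists = {
    "Exp.": lambda: random.expovariate(0.01),    # rate 0.01 -> mean 100 MB
    "Norm.": lambda: random.gauss(100.0, 30.0),  # mean 100 MB, std 30
}
vols = coflow_volumes(25, dists["Exp."])  # one coflow with 25 flows
assert len(vols) == 25 and all(v >= 0.0 for v in vols)
```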
Fig. 12

Comparison of average CCT between MPLBF and Varys under different distributions

Computing time accounts for a non-negligible part of the overall coflow completion time, and an appropriate arrangement of the computing phases of sub-coflows can effectively improve the final performance. We conduct a series of experiments to show the improvement of MPLBF over Varys. As the computing speed grows exponentially, the CCTs of both Varys and MPLBF decrease accordingly. However, as shown in Fig. 13, MPLBF performs much better than Varys, with improvements from 4.42 \(\times\) to 5.10 \(\times\).
Fig. 13

Comparison of average CCT between MPLBF and Varys with different computing speeds

The largest CCT is the completion time of the last coflow to finish, and it indicates whether starvation occurs. In Fig. 14, as the scale of coflows expands, the largest CCT of Varys increases markedly while that of MPLBF remains stable; the improvement increases from 3.14 \(\times\) to 14.06 \(\times\). MPLBF not only improves the average CCT but also effectively avoids perpetual starvation.
Fig. 14

Comparison of largest CCT between MPLBF and Varys with different number of coflows

6 Discussion

Sub-coflow sequence In Sect. 3 we mentioned that sub-coflows are numbered according to their dependency order and that a sub-coflow depends on all its antecedents. In real-world scenarios, dependencies may be much more complicated. However, as long as the dependency graph is a directed acyclic graph, it has a topological order, and we can number sub-coflows by that order.
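Numbering sub-coflows from an arbitrary dependency DAG is exactly a topological sort; a standard way to compute one is Kahn's algorithm. The small graph below is a hypothetical example of ours:

```python
from collections import deque

def topological_order(n, edges):
    """n sub-coflows 0..n-1; an edge (u, v) means v depends on u."""
    indeg = [0] * n
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        indeg[v] += 1
    queue = deque(i for i in range(n) if indeg[i] == 0)
    order = []
    while queue:
        u = queue.popleft()       # u has no unfinished prerequisites
        order.append(u)
        for v in adj[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    if len(order) != n:
        raise ValueError("dependency graph contains a cycle")
    return order

# Diamond-shaped dependencies: 0 -> {1, 2} -> 3
order = topological_order(4, [(0, 1), (0, 2), (1, 3), (2, 3)])
assert order.index(0) < order.index(1) < order.index(3)
```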

Scheduling without prior knowledge Coflow information plays an important role in scheduling, but it is difficult to obtain in practical environments. We mention two methods that have proven effective in previous work. First, we can provide applications with registration APIs for coflows, so that each time a coflow registers, the scheduler obtains its full information (e.g., Varys; Chowdhury et al. 2014). Second, the information can be predicted by machine learning (e.g., CODA; Zhang et al. 2016), so that the scheduler can work without any modification to applications.

Different numbers of stages and flows In our simulations, each coflow has the same number of stages and flows. In real-world scenarios, up to 50% of jobs have no more than 20 stages (Grandl et al. 2016); the differences are small, so the performance should be similar to our experimental results. In addition, some flows in our simulations have volume 0, which is equivalent to coflows having different numbers of flows.

Parallel stages In this paper, we mainly study stages with dependencies on each other. In a real environment, there may also exist parallel stages, whose topological order is not deterministic; our algorithm randomly selects one of the possible topological orders for processing. There is still room for optimization here, for example, by processing such stages in parallel rather than serially.

7 Related work

Flow scheduling Flow scheduling has been studied for a long time, and quite a few works are worth mentioning. Hedera (Al-Fares et al. 2010) is a dynamic scheduling system designed to utilize aggregate network resources. Chowdhury et al. (2011) noticed the semantics between flows and, with Orchestra, optimized performance at a higher level than individual flows. Based on the insight that datacenter transport should decouple flow scheduling from rate control, Alizadeh et al. (2013) present pFabric. Baraat (Dogar et al. 2014) raised the scheduling level to tasks, enabling schedulers to be task aware. Unlike pFabric, PIAS (Bai et al. 2015) does not need accurate per-flow information a priori, and aims to minimize FCT by mimicking SJF. Using the expansion ratio, Zhang et al. (2017) designed MERP to achieve efficient flow scheduling without starvation. Aemon (Wang et al. 2017) is designed to address the challenges of mix-flow scheduling without flow sizes known a priori.

In our work, performance is evaluated by coflow completion time (CCT) instead of flow completion time (FCT). Because of the rich semantics carried by coflows, optimizing CCT is more beneficial to application performance than optimizing FCT.

Coflow scheduling A coflow is an abstraction of a set of flows; with this abstraction, schedulers can optimize application performance more effectively. The abstraction was proposed by Chowdhury and Stoica (2012), and many works have since emerged. Varys (Chowdhury et al. 2014) is the first coflow scheduler that leverages the characteristics of coflows. Zhao et al. (2015) observed that routing and scheduling must be considered simultaneously. Qiu et al. (2015) proposed a method to minimize the weighted CCT while scheduling coflows with release dates. Aalo (Chowdhury and Stoica 2015) provided a scheme without prior knowledge and took the dependencies of coflows into consideration.

The hallmark of our work is the multi-stage characteristic of coflows, which has not been noticed before; we use it to significantly improve overall performance.

Data center network Guo et al. (2016) apply a game-theoretic framework to the bandwidth allocation problem to guarantee bandwidth for VMs according to their base requirements and to share residual bandwidth in proportion to VM weights. Liu et al. focus on coping with highly dynamic traffic in DCNs and propose eBA to provide end-to-end bandwidth guarantees for VMs under numerous short flows and massive bursty traffic. Considering that the lack of pricing for bandwidth guarantees causes a severe price-performance anomaly, SoftBW (Guo et al. 2017) enables pricing bandwidth with overcommitment on bandwidth guarantees. Wang et al. (2017) proposed an efficient online algorithm for dynamic SDN controller assignment to avoid long response times and high maintenance costs.

Our work is complementary: we focus on optimizing the performance of coflows, rather than achieving fairness for VMs, guaranteeing bandwidth for VMs, pricing bandwidth, or assigning SDN controllers.

8 Conclusion

In this paper, we concentrate on the multi-stage characteristic of coflows and leverage it to improve coflow performance. We first formulate the Multi-stage Inter-Coflow Scheduling (MICS) problem and prove it NP-hard. We then propose an online scheduling framework that converts the online problem to an offline one and, based on a greedy strategy, divide MICS into multiple SICS problems. We construct IAO to optimize SICS; experiments show that IAO improves the average CCT by up to 4.81 \(\times\). Finding that the greedy strategy still leaves much room for optimization, we propose the far-sighted heuristic MLBF, which outperforms IAO by up to 1.65 \(\times\). Finally, we remove the bijective-flow assumption and propose the more general heuristic MPLBF, which improves the average CCT by up to 8.03 \(\times\) compared to Varys.



Acknowledgements This work was supported in part by National Key R&D Program of China (2017YFB1001801), NSFC (61872175), NSF of Jiangsu Province (BK20181252), CCF-Tencent Open Fund, and Collaborative Innovation Center of Novel Software Technology and Industrialization. On behalf of all authors, the corresponding author states that there is no conflict of interest.


References

  1. Al-Fares, M., Radhakrishnan, S., Raghavan, B., Huang, N., Vahdat, A.: Hedera: dynamic flow scheduling for data center networks. In: NSDI, Vol. 10, pp. 89–92 (2010)
  2. Alizadeh, M., Edsall, T., Dharmapurikar, S., Vaidyanathan, R., Chu, K., Fingerhut, A., Matus, F., Pan, R., Yadav, N., Varghese, G., et al.: CONGA: distributed congestion-aware load balancing for datacenters. In: Proceedings of the 2014 ACM Conference on the Special Interest Group on Data Communication (SIGCOMM 2014)
  3. Alizadeh, M., Greenberg, A., Maltz, D.A., Padhye, J., Patel, P., Prabhakar, B., Sengupta, S., Sridharan, M.: Data center TCP (DCTCP). ACM SIGCOMM Comput. Commun. Rev. 41(4), 63–74 (2011)
  4. Alizadeh, M., Yang, S., Sharif, M., Katti, S., McKeown, N., Prabhakar, B., Shenker, S.: pFabric: minimal near-optimal datacenter transport. ACM SIGCOMM Comput. Commun. Rev. 43, 435–446 (2013)
  5. Bai, W., Chen, K., Wang, H., Chen, L., Han, D., Tian, C.: Information-agnostic flow scheduling for commodity data centers. In: Proceedings of the 2015 USENIX Symposium on Networked Systems Design and Implementation (NSDI 2015)
  6. Ballani, H., Costa, P., Karagiannis, T., Rowstron, A.: Towards predictable datacenter networks. In: ACM SIGCOMM Computer Communication Review, Vol. 41, pp. 242–253. ACM (2011)
  7. Bazaraa, M.S., Sherali, H.D., Shetty, C.M.: Nonlinear Programming: Theory and Algorithms. Wiley, Hoboken (2013)
  8. Chowdhury, M., Stoica, I.: Coflow: a networking abstraction for cluster applications. In: Proceedings of the 11th ACM Workshop on Hot Topics in Networks, pp. 31–36. ACM (2012)
  9. Chowdhury, M., Stoica, I.: Efficient coflow scheduling without prior knowledge. In: Proceedings of the 2015 ACM Conference on the Special Interest Group on Data Communication (SIGCOMM 2015)
  10. Chowdhury, M., Zaharia, M., Ma, J., Jordan, M.I., Stoica, I.: Managing data transfers in computer clusters with Orchestra. In: Proceedings of the 2011 ACM Conference on the Special Interest Group on Data Communication (SIGCOMM 2011)
  11. Chowdhury, M., Zhong, Y., Stoica, I.: Efficient coflow scheduling with Varys. In: ACM SIGCOMM Computer Communication Review, Vol. 44, pp. 443–454. ACM (2014)
  12. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008)
  13. Dogar, F.R., Karagiannis, T., Ballani, H., Rowstron, A.: Decentralized task-aware scheduling for data center networks. In: Proceedings of the 2014 ACM Conference on the Special Interest Group on Data Communication (SIGCOMM 2014)
  14. Grandl, R., Kandula, S., Rao, S., Akella, A., Kulkarni, J.: Graphene: packing and dependency-aware scheduling for data-parallel clusters. In: Proceedings of OSDI '16: 12th USENIX Symposium on Operating Systems Design and Implementation (2016)
  15. Greenberg, A., Hamilton, J.R., Jain, N., Kandula, S., Kim, C., Lahiri, P., Maltz, D.A., Patel, P., Sengupta, S.: VL2: a scalable and flexible data center network. In: Proceedings of the 2009 ACM Conference on the Special Interest Group on Data Communication (SIGCOMM 2009)
  16. Guo, J., Liu, F., Wang, T., Lui, J.C.S.: Pricing intra-datacenter networks with over-committed bandwidth guarantee. In: 2017 USENIX Annual Technical Conference (USENIX ATC 17), pp. 69–81 (2017)
  17. Guo, J., Liu, F., Lui, J.C.S., Jin, H.: Fair network bandwidth allocation in IaaS datacenters via a cooperative game approach. IEEE/ACM Trans. Netw. 24(2), 873–886 (2016)
  18. Hong, C.-Y., Caesar, M., Godfrey, P.: Finishing flows quickly with preemptive scheduling. In: Proceedings of the ACM SIGCOMM 2012 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication, pp. 127–138. ACM (2012)
  19. Isard, M., Budiu, M., Yu, Y., Birrell, A., Fetterly, D.: Dryad: distributed data-parallel programs from sequential building blocks. In: ACM SIGOPS Operating Systems Review, Vol. 41, pp. 59–72. ACM (2007)
  20. Kang, N., Liu, Z., Rexford, J., Walker, D.: Optimizing the one big switch abstraction in software-defined networks. In: Proceedings of the Ninth ACM Conference on Emerging Networking Experiments and Technologies, pp. 13–24. ACM (2013)
  21. Munir, A., Qazi, I.A., Uzmi, Z.A., Mushtaq, A., Ismail, S.N., Iqbal, M.S., Khan, B.: Minimizing flow completion times in data centers. In: INFOCOM 2013 Proceedings, pp. 2157–2165. IEEE (2013)
  22. Munir, A., Baig, G., Irteza, S.M., Qazi, I.A., Liu, A.X., Dogar, F.R.: Friends, not foes: synthesizing existing transport strategies for data center networks. ACM SIGCOMM Comput. Commun. Rev. 44, 491–502 (2015)
  23. Murray, D.G., Schwarzkopf, M., Smowton, C., Smith, S., Madhavapeddy, A., Hand, S.: CIEL: a universal execution engine for distributed data-flow computing. In: Proceedings of the 8th ACM/USENIX Symposium on Networked Systems Design and Implementation (NSDI 2011)
  24. Niranjan Mysore, R., Pamboris, A., Farrington, N., Huang, N., Miri, P., Radhakrishnan, S., Subramanya, V., Vahdat, A.: PortLand: a scalable fault-tolerant layer 2 data center network fabric. In: ACM SIGCOMM Computer Communication Review, Vol. 39, pp. 39–50. ACM (2009)
  25. Popa, L., Kumar, G., Chowdhury, M., Krishnamurthy, A., Ratnasamy, S., Stoica, I.: FairCloud: sharing the network in cloud computing. ACM SIGCOMM Comput. Commun. Rev. 42(4), 187–198 (2012)
  26. Qiu, Z., Stein, C., Zhong, Y.: Minimizing the total weighted completion time of coflows in datacenter networks. In: Proceedings of the 27th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA 2015)
  27. Wang, T., Xu, H., Liu, F.: Aemon: information-agnostic mix-flow scheduling in data center networks. In: Proceedings of the First Asia-Pacific Workshop on Networking, pp. 106–112. ACM (2017)
  28. Wang, T., Liu, F., Xu, H.: An efficient online algorithm for dynamic SDN controller assignment in data center networks. IEEE/ACM Trans. Netw. 25(5), 2788–2801 (2017)
  29. Wilson, C., Ballani, H., Karagiannis, T., Rowstron, A.: Better never than late: meeting deadlines in datacenter networks. ACM SIGCOMM Comput. Commun. Rev. 41, 50–61 (2011)
  30. Zaharia, M., Chowdhury, M., Franklin, M.J., Shenker, S., Stoica, I.: Spark: cluster computing with working sets. In: Proceedings of HotCloud (2010)
  31. Zhang, H., Chen, L., Yi, B., Chen, K., Chowdhury, M., Geng, Y.: CODA: toward automatically identifying and scheduling coflows in the dark. In: Proceedings of the 2016 ACM Conference on the Special Interest Group on Data Communication (SIGCOMM 2016)
  32. Zhang, S., Qian, Z., Wu, H., Lu, S.: Efficient data center flow scheduling without starvation using expansion ratio. IEEE Trans. Parallel Distrib. Syst. 28(11), 3157–3170 (2017)
  33. Zhao, Y., Chen, K., Bai, W., Yu, M., Tian, C., Geng, Y., Zhang, Y., Li, D., Wang, S.: RAPIER: integrating routing and scheduling for coflow-aware data center networks. In: Proceedings of the 2015 IEEE Conference on Computer Communications (INFOCOM 2015)

Copyright information

© China Computer Federation (CCF) 2019

Authors and Affiliations

  1. State Key Lab for Novel Software Technology, Nanjing University, Nanjing, People's Republic of China
  2. School of Computer Science and Technology / Suzhou Institute for Advanced Study, University of Science and Technology of China, Hefei, People's Republic of China
  3. Center for Networked Computing, Temple University, Philadelphia, USA
