
Near Optimal Work-Stealing Tree Scheduler for Highly Irregular Data-Parallel Workloads

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8664)

Abstract

We present a work-stealing algorithm for runtime scheduling of data-parallel operations in the context of shared-memory architectures on data sets with highly irregular workloads that are not known a priori to the scheduler. This scheduler can parallelize loops and operations expressible with a parallel reduce or a parallel scan. The scheduler is based on the work-stealing tree data structure, which allows workers to decide on the work division in a lock-free, workload-driven manner and attempts to minimize the amount of communication between them. Considerable effort is devoted to showing that the algorithm has the least possible amount of overhead.

We provide an extensive experimental evaluation, comparing the advantages and shortcomings of different data-parallel schedulers in order to combine their strengths. We show specific workload distribution patterns appearing in practice for which different schedulers yield suboptimal speedup, explaining their drawbacks and demonstrating how the work-stealing tree scheduler overcomes them. We thus justify our design decisions experimentally, but also provide a theoretical background for our claims.


References

  1. Arora, N.S., Blumofe, R.D., Plaxton, C.G.: Thread scheduling for multiprogrammed multiprocessors. In: Proceedings of the Tenth Annual ACM Symposium on Parallel Algorithms and Architectures, SPAA ’98, pp. 119–129. ACM, New York (1998)
  2. Blumofe, R.D., Joerg, C.F., Kuszmaul, B.C., Leiserson, C.E., Randall, K.H., Zhou, Y.: Cilk: an efficient multithreaded runtime system. J. Parallel Distrib. Comput. 37(1), 55–69 (1996)
  3. Buss, A., Harshvardhan, Papadopoulos, I., Pearce, O., Smith, T., Tanase, G., Thomas, N., Xu, X., Bianco, M., Amato, N.M., Rauchwerger, L.: STAPL: standard template adaptive parallel library. In: Proceedings of the 3rd Annual Haifa Experimental Systems Conference, SYSTOR ’10, pp. 14:1–14:10. ACM, New York (2010)
  4. Chakravarty, M.M.T., Leshchinskiy, R., Peyton Jones, S., Keller, G., Marlow, S.: Data Parallel Haskell: a status report. In: Proceedings of the 2007 Workshop on Declarative Aspects of Multicore Programming, DAMP ’07, pp. 10–18. ACM, New York (2007)
  5. Chambers, C., Raniwala, A., Perry, F., Adams, S., Henry, R.R., Bradshaw, R., Weizenbaum, N.: FlumeJava: easy, efficient data-parallel pipelines. In: Proceedings of the 2010 ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’10, pp. 363–375. ACM, New York (2010)
  6. Chase, D., Lev, Y.: Dynamic circular work-stealing deque. In: Proceedings of the Seventeenth Annual ACM Symposium on Parallelism in Algorithms and Architectures, SPAA ’05, pp. 21–28. ACM, New York (2005)
  7. Cong, G., Kodali, S.B., Krishnamoorthy, S., Lea, D., Saraswat, V.A., Wen, T.: Solving large, irregular graph problems using adaptive work-stealing. In: ICPP, pp. 536–545 (2008)
  8. Frigo, M., Leiserson, C.E., Randall, K.H.: The implementation of the Cilk-5 multithreaded language. In: Proceedings of the ACM SIGPLAN 1998 Conference on Programming Language Design and Implementation, PLDI ’98, pp. 212–223. ACM, New York (1998)
  9. Herlihy, M., Shavit, N.: The Art of Multiprocessor Programming. Morgan Kaufmann Publishers, San Francisco (2008)
  10. Hummel, S.F., Schonberg, E., Flynn, L.E.: Factoring: a method for scheduling parallel loops. Commun. ACM 35(8), 90–101 (1992)
  11. Intel Software Network: Intel Cilk Plus. http://cilkplus.org/
  12. JáJá, J.: An Introduction to Parallel Algorithms. Addison-Wesley, Reading (1992)
  13. Koelbel, C., Mehrotra, P.: Compiling global name-space parallel loops for distributed execution. IEEE Trans. Parallel Distrib. Syst. 2(4), 440–451 (1991)
  14. Kruskal, C.P., Weiss, A.: Allocating independent subtasks on parallel processors. IEEE Trans. Softw. Eng. 11(10), 1001–1016 (1985)
  15. Lea, D.: A Java fork/join framework. In: Java Grande, pp. 36–43 (2000)
  16. Polychronopoulos, C.D., Kuck, D.J.: Guided self-scheduling: a practical scheduling scheme for parallel supercomputers. IEEE Trans. Comput. 36(12), 1425–1439 (1987)
  17. Prokopec, A., Bagwell, P., Rompf, T., Odersky, M.: A generic parallel collection framework. In: Jeannot, E., Namyst, R., Roman, J. (eds.) Euro-Par 2011, Part II. LNCS, vol. 6853, pp. 136–147. Springer, Heidelberg (2011)
  18. Reinders, J.: Intel Threading Building Blocks, 1st edn. O’Reilly & Associates, Sebastopol (2007)
  19. Sarkar, V.: Optimized unrolling of nested loops. In: Proceedings of the 14th International Conference on Supercomputing, ICS ’00, pp. 153–166. ACM, New York (2000)
  20. Tardieu, O., Wang, H., Lin, H.: A work-stealing scheduler for X10’s task parallelism with suspension. In: Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP ’12, pp. 267–276. ACM, New York (2012)
  21. Tzen, T.H., Ni, L.M.: Trapezoid self-scheduling: a practical scheduling scheme for parallel compilers. IEEE Trans. Parallel Distrib. Syst. 4(1), 87–98 (1993)


Author information

Correspondence to Aleksandar Prokopec.

A Appendix

This appendix further explains some of the concepts mentioned in the main paper that did not fit there. The information is provided for convenience – it is not necessary to read this section, but doing so may give useful insight.

A.1 Work-Stealing Reduction Tree

As mentioned, the work-stealing tree is a particularly effective data structure for a reduce operation. Parallel reduce is useful in the context of many other operations, such as finding the first element with a given property, finding the greatest element with respect to some ordering, filtering elements with a given property or computing an aggregate of all the elements (e.g. a sum).

Fig. 11. Reduction state diagram

There are two reasons why the work-stealing tree is amenable to implementing reductions. First, it preserves the order in which the work is split between processors, which allows using non-commutative operators for the reduce (e.g. computing the resulting transformation from a series of affine transformations can be parallelized by multiplying a sequence of matrices – the order is important in this case). Second, the reduce can largely be performed in parallel, due to the structure of the tree.
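To make the ordering requirement concrete, the following minimal Scala sketch (not taken from the paper's implementation) composes 2x2 matrices, an associative but non-commutative operation; an order-preserving reduce, which is the contract the work-stealing tree provides for its parallel reduce, is required for the result to be correct:

```scala
// A minimal illustration (not the paper's code): composing 2x2 matrices is
// associative but not commutative, so the reduce must preserve operand order.
case class Mat2(a: Double, b: Double, c: Double, d: Double) {
  def *(o: Mat2): Mat2 = Mat2(
    a * o.a + b * o.c, a * o.b + b * o.d,
    c * o.a + d * o.c, c * o.b + d * o.d)
}

object OrderedReduceDemo {
  def main(args: Array[String]): Unit = {
    // alternating shears and scalings do not commute with each other
    val transforms = Array.tabulate(1000) { i =>
      if (i % 2 == 0) Mat2(1.0, 0.001 * i, 0.0, 1.0)        // shear
      else            Mat2(1.0 + 0.001 * i, 0.0, 0.0, 1.0)   // scale along x
    }
    // the work-stealing tree guarantees that its parallel reduce combines
    // partial results in this same left-to-right order
    val composed = transforms.reduceLeft(_ * _)
    println(composed)
  }
}
```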

The work-stealing tree reduce works similarly to a software combining tree [9], but it can proceed in a lock-free manner after all the node owners have completed their work, as we describe next. The general idea is to save the aggregated result in each node and then push the result further up the tree. Note that we did not save the return value of the kernel method in line 50 in Fig. 5, making the scheduler applicable only to parallelizing for loops. Thus, we add a local variable sum and update it each time after calling kernel. Once the node ends up in a COMPLETED or EXPANDED state, we assign it the value of sum. Note that updating an invocation-specific shared variable instead would not only break the ordering guarantee needed for non-commutative operators, but also lead to the same bottleneck we saw before with fixed-size batching. We therefore add two new fields with atomic access to Node, namely lresult and result. We also add a new field parent to Ptr. We expand the set of abstract node states with two additional ones, namely PREPARED and PUSHED. The expanded state diagram is shown in Fig. 11.

Fig. 12. Reduction pseudocode

The parent field in Ptr is not shown in the diagram in Fig. 11. The first two boxes in Node denote the left and the right child, respectively, as before. We represent the iteration state (progress) with a single box in Node. The iterator may either be stolen (\(p_s\)) or completed (\(u\)), but this is not important for the new states – we denote all such entries with \(\times \). The fourth box represents the owner, and the fifth and the sixth the fields lresult and result. Once the work on the node is effectively completed, either due to a steal or a normal completion, the node owner \(\pi \) has to write the value of the sum variable to lresult. After that, the owner announces its completion by atomically writing a special value \(\mathbb {P}\) to result, thereby pushing the node into the PREPARED state – we say that the owner prepares the node. At this point the node contains all the information necessary to participate in the reduction. The sufficient condition for the reduction to start is that the node is a leaf, or that the node is an inner node and both its children are in the PUSHED state. The value lresult can then be combined with the result values of both children and written to the result field of the node. Upon writing to the result field, the node goes into the PUSHED state. This push step can be done by any worker \(\psi \) and, assuming all the owners have prepared their nodes, the reduction is lock-free. Importantly, the worker that succeeds in pushing the result must attempt to repeat the push step in the parent node. This way the reduction proceeds upwards in the tree until reaching the root. Once some worker pushes the result to the root of the tree, it signals that the operation is completed, so that the thread that invoked the operation can proceed, in case the parallel operation is synchronous. Otherwise, a future variable can be completed or a user callback invoked.

Before presenting the pseudocode, we formalize the notion of the states we described. In addition to the ones mentioned earlier, we identify the following new invariants.

INV6. Field n.lresult is set to \(\bot \) when created. If a worker \(\pi \) overwrites the value \(\bot \) of the field n.lresult then \(\mathtt{n.owner }= \pi \) and the node n is either in the EXPANDED state or the COMPLETED state. That is the last write to n.lresult.

INV7. Field n.result is set to \(\bot \) when created. If a worker \(\pi \) overwrites the value \(\bot \) of the field n.result with \(\mathbb {P}\) then \(\mathtt{n.owner }= \pi \), the node n was either in the EXPANDED state or the COMPLETED state, and the value of the field n.lresult is different from \(\bot \). We say that the node goes into the PREPARED state.

INV8. If a worker \(\psi \) overwrites the value \(\mathbb {P}\) of the field n.result then the node n was in the PREPARED state and was either a leaf or its children were in the PUSHED state. We say that the node goes into the PUSHED state.

We modify workOn so that instead of lines 53 through 57, it calls the method complete passing it the sum argument and the reference to the subtree. The pseudocodes for complete and an additional method pushUp are shown in Fig. 12.

Upon completing the work, the owner checks whether the subtree was stolen. If so, it helps expand the subtree (line 78), reads the new node and writes the sum into lresult. After that, the owner pushes the node into the PREPARED state in line 84, retrying in the case of spurious failures, and calls pushUp.

The method pushUp may be invoked by the owner of the node attempting to write to the result field, or by another worker attempting to push the result up after having completed the work on one of the child nodes. The lresult field may not yet be assigned (line 92) if the owner has not completed the work – in this case the worker ceases to participate in the reduction and relies on the owner or another worker to continue pushing the result up. The same applies if the node is already in the PUSHED state (line 94). Otherwise, the lresult field can only be combined with the result values from the children if both children are in the PUSHED state. If the worker invoking pushUp notices that the children have not yet been assigned their results, it ceases to participate in the reduction. Otherwise, it computes the tentative result (line 104) and attempts to write it to result atomically with the CAS in line 106. A failed CAS triggers a retry; otherwise, pushUp is called recursively on the parent node. If the current node is the root, the worker notifies any listeners that the final result is ready and the operation ends.
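Since Fig. 12 is not reproduced here, the following Scala sketch reconstructs the push step from the description above. The types, field names and the combination order (owner's prefix, then left subtree, then right subtree) are assumptions made for illustration, not the paper's actual code:

```scala
// A reconstruction of the push step, based only on the prose above; the node
// layout and the combination order are assumptions.
import java.util.concurrent.atomic.AtomicReference

sealed trait Result[+T]
case object Prepared extends Result[Nothing]      // the special value written by the owner
final case class Pushed[T](value: T) extends Result[T]

final class Node[T](val left: Ptr[T], val right: Ptr[T]) {
  @volatile var lresult: Option[T] = None         // owner's locally accumulated sum
  val result = new AtomicReference[Result[T]](null)
  def isLeaf: Boolean = left == null && right == null
}

final class Ptr[T](val parent: Ptr[T]) {
  @volatile var child: Node[T] = _
}

object Reduction {
  def pushUp[T](tree: Ptr[T], combine: (T, T) => T, onDone: T => Unit): Unit = {
    val node = tree.child
    node.result.get match {
      case null      => // not yet PREPARED - the owner will push later
      case Pushed(_) => // already PUSHED - nothing to do
      case Prepared  =>
        val childrenPushed = node.isLeaf ||
          (node.left.child.result.get.isInstanceOf[Pushed[_]] &&
           node.right.child.result.get.isInstanceOf[Pushed[_]])
        if (childrenPushed) {
          val own = node.lresult.get
          val tentative =
            if (node.isLeaf) own
            else (node.left.child.result.get, node.right.child.result.get) match {
              case (Pushed(l), Pushed(r)) => combine(combine(own, l), r)
              case _                      => own // unreachable: both children are PUSHED
            }
          if (node.result.compareAndSet(Prepared, Pushed(tentative))) {
            if (tree.parent == null) onDone(tentative)     // reached the root
            else pushUp(tree.parent, combine, onDone)
          } else pushUp(tree, combine, onDone)             // lost a race - retry
        } // otherwise another worker will push once the children are PUSHED
    }
  }
}
```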

A.2 Work-Stealing Tree Traversal Strategies

We showed experimentally that changing the traversal order when searching for work can have a considerable effect on the performance of the work-stealing tree scheduler. We described these strategies briefly, but did not present precise, detailed pseudocode. In this section we show different implementations of the findWork and descend methods that lead to different tree traversal orders when stealing.

Assign. In this strategy a worker with index \(i\) invoking findWork picks a left-to-right traversal order at some node at level \(l\) if and only if its bit at position \(l \;\mathrm{mod}\;\lceil \log _2P \rceil \) is \(1\), that is:

$$\begin{aligned} (i \gg (l \;\mathrm{mod}\;\lceil \log _2P \rceil )) \;\mathrm{mod}\;2 = 1 \end{aligned}$$
(1)

The consequence of this is that when the workers descend in the tree the first time, they will pick different paths, leading to fewer steals assuming that the workload distribution is relatively uniform. If it is not uniform, then the workload itself should amortize the creation of extra nodes. We give the pseudocode in Fig. 13.

Fig. 13. Assign strategy

AssignTop. This strategy is similar to the previous one, with the difference that the assignment only works as before if the level of the tree is less than or equal to \(\lceil \log _2 P \rceil \). Otherwise, a random choice is applied in deciding whether traversal should be left-to-right or right-to-left. We show it in Fig. 14, where we only redefine the method left, and reuse the same choose, descend and findWork.

Fig. 14. AssignTop and RandomAll strategies

RandomAll. This strategy randomizes all the choices that the stealer and the victim make. Both the tree traversal and the node chosen after the steal are thus changed in findWork. We show it in Fig. 14.

RandomWalk. Here we only change the tree traversal order used by the stealer when searching for work, and leave the rest of the choices fixed: the victim picks the left node after expansion and the stealer picks the right node. The code is shown in Fig. 15.

Fig. 15. RandomWalk strategy

FindMax. This strategy, unlike the previous ones, does not break the tree traversal early as soon as a viable node is found. Instead, it traverses the entire work-stealing tree in left-to-right order and returns a reference to a node with the most work. Only then does it attempt to own or steal that node. As noted before, this kind of search is not atomic, since some nodes may be stolen and expanded in the meantime and processors advance through the nodes they own. However, we expect steals to be rare events, so in most cases this search should give an exact or a nearly exact estimate. The decisions about which node the victim and the stealer take after expansion remain the same as in the basic algorithm from Fig. 5. We show the pseudocode for FindMax in Fig. 16.

Fig. 16. FindMax strategy

A.3 Speedup and Optimality Analysis

In Fig. 9-36 we identified a workload distribution for which the work-stealing reduction tree had a particularly bad performance. This coarse workload consisted of a major prefix of elements which required a very small amount of computation followed by a minority of elements which required a large amount of computation. We call it coarse because the number of elements was on the order of magnitude of a certain value we called MAXSTEP.

To recap, the speedup was suboptimal due to the following. First, to achieve an optimal speedup for at least the baseline, the batches cannot all contain fewer elements than a certain number. We have established this number for a particular architecture and environment, calling it STEP. Second, to achieve an optimal speedup for ranges whose size is below \(\mathtt{STEP }{\cdot }\mathtt{P }\), some of the batches have to be smaller than the others. The technique we apply starts with a batch consisting of a single element and increases the batch size exponentially up to MAXSTEP. Third, there is no hardware interrupt mechanism available to interrupt a worker which is processing a large batch, and software emulations which consist of checking a volatile variable within a loop are too slow when executing the baseline. Fourth, the worker does not know the workload distribution and cannot measure time. For a particular workload distribution, all this caused a single worker to obtain the largest batch before the other workers had a chance to steal some work. Justifying these claims requires a set of more formal definitions. We start by defining the context in which the scheduler executes.

Definition 1

(Oblivious conditions). If a data-parallel scheduler is unable to obtain information about the workload distribution or about the amount of work it has previously executed, we say that the data-parallel scheduler works in oblivious conditions.

Assume that a worker decides on some batching schedule \(c_1, c_2, \ldots , c_k\) where \(c_j\) is the size of the \(j\)-th batch and \(\sum \nolimits _{j=1}^k c_j = N\), where \(N\) is the size of the range. No batch is empty, i.e. \(c_j \ne 0\) for all \(j\). In oblivious conditions the worker does not know if the workload resembles the baseline mentioned earlier, so it must assume that it does and minimize the scheduling overhead. The baseline is important not only from a theoretical perspective, being one of the potential worst-case workload distributions, but also from a practical one – in many problems parallel loops have a uniform workload. We now define what this baseline means more formally.

Definition 2

(The baseline constraint). Let the workload distribution be a function \(w(i)\) which gives the amount of computation needed for range element \(i\). We say that a data-parallel scheduler respects the baseline constraint if and only if the speedup \(s_p\) with respect to a sequential loop is arbitrarily close to linear when executing the workload distribution \(w(i) = w_0\), where \(w_0\) is the minimum amount of work needed to execute a loop iteration.

Arbitrarily close here means that \(\epsilon \) in \(s_p = \frac{P}{1 + \epsilon }\) can be made arbitrarily small.

The baseline constraint tells us that it may be necessary to divide the elements of the loop into batches, depending on the scheduling (that is, communication) costs. As we have seen in the experiments, while we should be able to make the \(\epsilon \) value arbitrarily small, in practice it is small enough when the scheduling overhead is no longer observable in the measurement. Also, we have shown experimentally that the average batch size should be bigger than some value in oblivious conditions, but we have used particular scheduler instances. Does this hold in general, for every data-parallel scheduler? The answer is yes, as we show in the following lemma.

Lemma 1

If a data-parallel scheduler that works in oblivious conditions respects the baseline constraint then the batching schedule \(c_1, c_2, \ldots , c_k\) is such that:

$$\begin{aligned} \frac{\sum \nolimits _{j=1}^k c_j}{k} \ge S(\epsilon ) \end{aligned}$$
(2)

Proof

The lemma claims that in oblivious conditions the average batch size must be above some value which depends on the previously defined \(\epsilon \), otherwise the scheduler will not respect the baseline constraint.

The baseline constraint states that \(s_p = \frac{P}{1 + \epsilon }\), where the speedup \(s_p\) is defined as \(T_0 / T_p\), where \(T_0\) is the running time of a sequential loop and \(T_p\) is the running time of the scheduler using \(P\) processors. Furthermore, \(T_0 = T \cdot P\) where \(T\) is the optimal parallel running time for \(P\) processors, so it follows that \(\epsilon \cdot T = T_p - T\). We can also write this as \(\epsilon \cdot W = W_p - W\), since the running time is proportional to the total amount of executed work, whether scheduling or useful work. The difference \(W_p - W\) is exactly the scheduling work \(W_s\), so the baseline constraint translates into the following inequality:

$$\begin{aligned} W_s \le \epsilon \cdot W \end{aligned}$$
(3)

In other words, the scheduling work has to be some fraction of the useful work. Assuming that there is a constant amount of scheduling work \(W_c\) per batch, we have \(W_s = k \cdot W_c\). Let us denote the average work per element with \(\overline{w}\). We then have \(W = N \cdot \overline{w}\). Combining these relations we get \(N \ge k \cdot \frac{W_c}{\epsilon \cdot \overline{w}}\), or, more briefly, \(N \ge k \cdot S(\epsilon )\). Since \(N\) is equal to the sum of all batch sizes, we derive the following constraint:

$$\begin{aligned} \frac{\sum \nolimits _{j=1}^k c_j}{k} \ge \frac{W_c}{\epsilon \cdot \overline{w}} \end{aligned}$$
(4)

In other words, the average batch size must be greater than some value \(S(\epsilon )\) which depends on how close we want to get to the optimal speedup. Note that this value is inversely proportional to the average amount of work per element \(\overline{w}\) – the scheduler could decide more freely about the batch sizes if it knew something about the average workload – and grows with the scheduling cost per batch \(W_c\), which is why it is especially important to make the workOn method efficient. We already saw the inverse proportionality with \(\epsilon \) in Fig. 6. In part, this is why we had to make MAXSTEP larger than the chosen STEP (we also had to increase it due to increasing the scheduling work in workOn, namely, \(W_c\)). This is an additional constraint when choosing the batching schedule.
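As a concrete illustration with assumed (not measured) values – say a per-batch scheduling cost worth \(W_c = 100\) ns of work, an average per-element cost of \(\overline{w} = 1\) ns, and a target of \(\epsilon = 0.05\) – the bound becomes

$$\begin{aligned} S(\epsilon ) = \frac{W_c}{\epsilon \cdot \overline{w}} = \frac{100}{0.05 \cdot 1} = 2000, \end{aligned}$$

so the batches must contain 2000 elements on average, and halving \(\epsilon \) doubles that requirement.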

With this additional constraint there always exists a workload distribution for a given batching schedule such that the speedup is suboptimal, as we show next.

Lemma 2

Assume that \(S(\epsilon ) > 1\), for the desired \(\epsilon \). For any fixed batching schedule \(c_1, c_2, \ldots , c_k\) there exists a workload distribution such that a scheduler executing it in oblivious conditions yields a suboptimal speedup.

Proof

First, assume that the scheduler does not respect the baseline constraint. The baseline workload then yields a suboptimal speedup and the statement is trivially true because \(S(\epsilon ) > 1\).

Otherwise, assume without loss of generality that at some point in time a particular worker \(\omega \) is processing some batch \(c_m\) whose size is greater than or equal to the size of the other batches. From the assumption, this means the size of \(c_m\) is greater than \(1\). Then we can choose a workload distribution such that the work \(W_m = \sum \nolimits _{i=N_m}^{N_m + c_m} w(i)\) needed to complete batch \(c_m\) is arbitrarily large, where \(N_m = \sum \nolimits _{j=1}^{m-1} c_j\) is the number of elements in the batching schedule coming before the batch \(c_m\). For all the other elements we set \(w(i)\) to be some minimum value \(w_0\). We claim that the obtained speedup is suboptimal. There is at least one different batching schedule with a better speedup, namely the schedule in which, instead of batch \(c_m\), there are two batches \(c_{m_1}\) and \(c_{m_2}\) such that \(c_{m_1}\) consists of all the elements of \(c_m\) except the last one and \(c_{m_2}\) contains the last element. In this batching schedule some other worker can work on \(c_{m_2}\) while \(\omega \) works on \(c_{m_1}\). Hence, there exists a different batching schedule which leads to a better speedup, so the initial batching schedule is not optimal.

We can ask what the necessary condition is for the speedup to be suboptimal. We mentioned that the range size has to be on the same order of magnitude as \(S\) above, but can we make this more precise? We could simplify this question by asking what the necessary condition is for the worst-case speedup of \(1\) or less. Alas, we cannot find necessary conditions for all schedulers, because they do not exist – there are schedulers which do not need any preconditions in order to consistently produce such a speedup (think of a sequential loop or, worse, a “scheduler” that executes an infinite loop). Also, we already saw that a suboptimal speedup may be due to a particularly bad workload distribution, so maybe we should consider only particular distributions, or place some conditions on them. What we can express are the necessary conditions on the range size for the existence of a scheduler which achieves a speedup greater than \(1\) on any workload. Since the range size is the only information known to the scheduler in advance, it can be used to affect its decisions in a particular implementation.

The worst-case speedups we saw occurred in scenarios where one worker (usually the invoker) started to work before all the other workers. To be able to express the desired conditions, we model this delay with a value \(T_d\).

Lemma 3

Assume a data-parallel scheduler that respects the baseline constraint in oblivious conditions. There exists some minimum range size \(N_1\) for which the scheduler can yield a speedup greater than \(1\) for any workload distribution.

Proof

We first note that there is always a scheduler that can achieve the speedup \(1\), which is merely a sequential loop. We then consider the case when the scheduler is parallelizing the baseline workload. Assume now that there is no minimum range size \(N_1\) for which the claim is true. Then for any range size \(N\) we must be able to find a range size \(N + K\), for a chosen \(K\), such that the scheduler still cannot yield a speedup greater than \(1\). We choose \(N = \frac{f \cdot T_d}{w_0}\), where \(w_0\) is the amount of work associated with each element in the baseline distribution and \(f\) is an architecture-specific constant describing the computation speed. The chosen \(N\) is the number of elements that can be processed during the worker wakeup delay \(T_d\). The workers that wake up after the first worker \(\omega \) processes \(N\) elements have no more work to do, so the speedup is \(1\). However, for range size \(N + K\) there are \(K\) elements left that have not been processed. These \(K\) elements could have been in the last batch of \(\omega \). The last batch in the batching schedule chosen by the scheduler may include the \(N\)th element. Note that the only constraint on the batch size is the lower bound value \(S(\epsilon )\) from Lemma 1. So, if we choose \(K = 2 S(\epsilon )\), then the last batch is either smaller than \(K\) or at least as large as \(K\). If it is smaller, then a worker different from \(\omega \) will obtain and process the last batch, hence the speedup will be greater than \(1\). If it is at least as large, then the worker \(\omega \) will process the last batch – the other workers that wake up will not be able to obtain the elements from that batch. In that case there exists a better batching order which still respects the baseline constraint, and that is to divide the last batch into two equal parts, allowing the other workers to obtain some work and yielding a speedup greater than \(1\). This contradicts the assumption that there is no minimum range size \(N_1\) – we know that \(N_1\) is such that:

$$\begin{aligned} \frac{f \cdot T_d}{w_0} \le N_1 \le \frac{f \cdot T_d}{w_0} + 2 \cdot S(\epsilon ) \end{aligned}$$
(5)

Now, assume that the workload \(w(i)\) is not the baseline workload \(w_0\). For any workload we know that \(w(i) \ge w_0\) for every \(i\). The batching order for a single worker has to be exactly the same as before due to oblivious conditions. As a result the running time for the first worker \(\omega \) until it reaches the \(N\)th element can only be larger than that of the baseline. This means that the other workers will wake up by the time \(\omega \) reaches the \(N\)th element, and obtain work. Thus, the speedup can be greater than \(1\), as before.

We have so far shown that we can decide on the average batch size if we know something about the workload, namely, the average computational cost of an element. We have also shown when we can expect the worst-case speedup, potentially allowing us to take preventive measures. Finally, we have shown that any data-parallel scheduler deciding on a fixed schedule in oblivious conditions can yield a suboptimal speedup. Note the wording “fixed” here. It means that the scheduler must make a definite decision about the batching order without any knowledge about the workload, and must make the same decision every time – it must be deterministic. As hinted before, the way to overcome an adversary that repeatedly picks the worst-case workload is to use randomization when producing the batching schedule. This is the topic of the next section.

A.4 Overcoming the Worst-Case Speedup Using Randomization

Recall that the workload distribution that led to a bad speedup in our evaluation consisted of a sequence of very cheap elements followed by a minority of elements which were computationally very expensive. On the other hand, when we inverted the order of elements, the speedup became linear. The exponential backoff approach is designed to start with smaller batches first, in hopes of hitting the part of the workload which contains most of the work as early as possible. This allows other workers to steal larger pieces of the remaining work, hence allowing a more fine-grained batch subdivision. In this way the scheduling algorithm is workload-driven – it gives itself its own feedback. In the absence of other information about the workload, the knowledge that some worker has been processing some part of the workload long enough that it can be stolen from is the best sign that the workload is different from the baseline, and that the batch subdivision can circumvent the baseline constraint. This heuristic worked in the example from Fig. 9-36 when the expensive elements were reached first, but failed when they were reached in the last, largest batch – and we know that there has to be a largest batch by Lemma 1, since a single worker must divide the range into batches whose mean size has a lower bound. In fact, no other deterministic scheduler can yield an optimal speedup for all workload distributions, as shown by Lemma 2. For this reason we look into randomized schedulers.

In particular, in the example from the evaluation we would like the scheduler to put the smallest batches at the end of the range, but we have no way of knowing whether the most expensive elements are positioned somewhere else. With this in mind we randomize the batching order. The baseline constraint still applies in oblivious conditions, so we have to pick different batch sizes with respect to the constraints from Lemma 1. Let us pick exactly the same set of exponentially increasing batches, but place consecutive elements into different batches randomly. In other words, we permute the elements of the range and then apply the previous scheme. We expect some of the more expensive elements to be assigned to the smaller batches, giving other workers a better opportunity to steal a part of the work.
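A minimal sketch of this randomized schedule (not the paper's code): the element indices are shuffled once and the shuffled sequence is then cut into exponentially growing batches, capped at an assumed maxStep parameter, exactly as the plain exponential backoff would cut the original range.

```scala
// Randomized permutation with an exponential backoff: shuffle the indices,
// then cut the shuffled sequence into batches of sizes 1, 2, 4, ..., capped
// at maxStep. This illustrates the scheme analyzed below; the text later
// explains why permuting individual elements is impractical in a real runtime.
import scala.util.Random

object RandomizedBackoff {
  def schedule(n: Int, maxStep: Int, random: Random): Vector[Vector[Int]] = {
    val permuted = random.shuffle((0 until n).toVector)
    val batches  = Vector.newBuilder[Vector[Int]]
    var start = 0
    var size  = 1
    while (start < n) {
      val end = math.min(start + size, n)
      batches += permuted.slice(start, end)
      start = end
      size  = math.min(size * 2, maxStep)
    }
    batches.result()
  }
}
```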

In evaluating the effectiveness of this randomized approach we will assume a particular distribution we found troublesome. We define it more formally.

Definition 3

(Step workload distribution). A step workload distribution is a function which assigns a computational cost \(w(i)\) to each element \(i\) of the range of size \(N\) as follows:

$$\begin{aligned} w(i) = {\left\{ \begin{array}{ll} w_e, &{} i \in [i_1, i_2] \\ w_0, &{} i \not \in [i_1, i_2] \end{array}\right. } \end{aligned}$$
(6)

where \([i_1, i_2]\) is a subsequence of the range, \(w_0\) is the minimum cost of computation per element and \(w_e \gg w_0\). If \(w_e \ge f \cdot T_d\), where \(f\) is the computation speed and \(T_d\) is the worker delay, then we additionally call the workload highly irregular. We call \(D = 2^d = i_2 - i_1\) the span of the step distribution. If \((N - D) \cdot \frac{w_0}{f} \le T_d\) we also call the workload short.
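For reference, a step workload as defined above can be written down directly; the parameters mirror the symbols \(w_0\), \(w_e\), \(i_1\) and \(i_2\) from Definition 3.

```scala
object StepWorkload {
  // A direct transcription of Definition 3: cost we on [i1, i2], w0 elsewhere.
  def w(i: Int, i1: Int, i2: Int, w0: Long, we: Long): Long =
    if (i >= i1 && i <= i2) we else w0
}
```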

We can now state the following lemma. We will refer to the randomized batching schedule we have described before as the randomized permutation with an exponential backoff. Note that we implicitly assume that the worker delay \(T_d\) is significantly greater than the time \(T_c\) spent scheduling a single batch (this was certainly true in our experimental evaluation).

Lemma 4

When parallelizing a workload with a highly irregular short step workload distribution the expected speedup inverse of a scheduler using randomized permutations with an exponential backoff is:

$$\begin{aligned} \langle s_p^{-1} \rangle = \frac{1}{P} + (1 - \frac{1}{P}) \cdot \frac{(2^k - 2^d - 1)!}{(2^k - 1)!} \cdot \sum _{i=0}^{k-1} 2^i \frac{(2^k - 2^i - 1)!}{(2^k - 2^i - 2^d)!} \end{aligned}$$
(7)

where \(D = 2^d \gg P\) is the span of the step workload distribution.

Proof

The speedup \(s_p\) is defined as \(s_p = \frac{T_0}{T_p}\) where \(T_0\) is the running time of the optimal sequential execution and \(T_p\) is the running time of the parallelized execution. We implicitly assume that all processors have the same computation speed \(f\). Since \(w_e \gg w_0\), the total amount of work that a sequential loop executes is arbitrarily close to \(D \cdot w_e\), so \(T_0 = \frac{D \cdot w_e}{f}\). When we analyze the parallel execution, we will also ignore the work \(w_0\). We will call the elements with cost \(w_e\) expensive.

We assumed that the workload distribution is highly irregular. This means that if the first worker \(\omega \) starts the work on an element from \([i_1, i_2]\) at some time \(t_0\) then at the time \(t_1 = t_0 + \frac{w_e}{f}\) some other worker must have already started working as well, because \(t_1 - t_0 \ge T_d\). Also, we have assumed that the workload distribution is short. This means that the first worker \(\omega \) can complete work on all the elements outside the interval \([i_1, i_2]\) before another worker arrives. Combining these observations, as soon as the first worker arrives at an expensive element, it is possible for the other workers to parallelize the rest of the work.

We assume that after the other workers arrive there are enough elements left to efficiently parallelize the work on them. In fact, at this point the scheduler will typically change the initially decided batching schedule – additionally arriving workers will steal and induce a more fine-grained subdivision. Note, however, that the other workers cannot subdivide the batch on which the first worker is currently working – that one is no longer available to them. The only batches with elements of cost \(w_e\) that they can still subdivide are the ones coming after the first batch in which the first worker \(\omega \) found an expensive element. We denote this batch with \(c_{\omega }\). The batch \(c_{\omega }\) may, however, contain additional expensive elements, and the bigger the batch, the more probable this is. We will say that the total number of expensive elements in \(c_{\omega }\) is \(X\). Finally, note that we assumed that \(D \gg P\), so our expression will only be an approximation if \(D\) is very close to \(P\).

We thus arrive at the following expression for speedup:

$$\begin{aligned} s_p = \frac{D}{X + \frac{D - X}{P}} \end{aligned}$$
(8)

Speedup depends on the value \(X\). But since the initial batching schedule is random, the speedup depends on the random variable and is itself random. For this reason we will look for its expected value. We start by finding the expectation of the random variable \(X\).

We will now solve a more general problem of placing balls into an ordered set of bins and apply the solution to finding the expectation of \(X\). There are \(k\) bins, numbered from \(0\) to \(k - 1\). Let \(c_i\) denote the number of balls that fit into the \(i\)th bin. We randomly assign \(D\) balls to bins, so that the number of balls in each bin \(i\) is less than or equal to \(c_i\). In other words, we randomly select \(D\) slots from all the \(N = \sum \nolimits _{i=0}^{k-1} c_i\) slots in all the bins together. We then define the random variable \(X\) to be the number of balls in the non-empty bin with the smallest index \(i\). The formulated problem corresponds to the previous one – the balls are the expensive elements and the bins are the batches.

An alternative way to define \(X\) is as follows:

$$\begin{aligned} X = \sum _{i=0}^{k-1} {\left\{ \begin{array}{ll} \text {number of balls in bin } i &{} \text {if all the bins } j < i \text { are empty} \\ 0 &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(9)

Applying the linearity property, the expectation \(\langle X \rangle \) is then:

$$\begin{aligned} \langle X \rangle = \sum _{i=0}^{k-1} \langle \text {number of balls in bin } i \text { given that all the bins } j < i \text { are empty, and } 0 \text { otherwise} \rangle \end{aligned}$$
(10)

The expectation in the sum is conditional on the event that all the bins coming before \(i\) are empty. We call the probability of this event \(p_i\). We define \(b_i\) as the number of balls in bin \(i\). From the properties of conditional expectation we then have:

$$\begin{aligned} \langle X \rangle = \sum _{i=0}^{k-1} p_i \cdot \langle b_i \rangle \end{aligned}$$
(11)

The number of balls in a bin is the sum of the balls in all the slots of that bin; bin \(i\) spans slots \(n_{i-1}\) through \(n_{i-1} + c_i\), where \(n_{i-1} = \sum \nolimits _{j=0}^{i-1} c_j\). The expected number of balls in bin \(i\) is thus:

$$\begin{aligned} \langle b_i \rangle = \sum _{s=n_{i-1}}^{n_{i-1} + c_i} \langle \text {expected number of balls in slot } s \rangle \end{aligned}$$
(12)

We denote the total capacity of all the bins \(j \ge i\) as \(q_i\) (so that \(q_0 = N\) and \(q_{k-1} = 2^{k-1}\)). We assign balls to slots randomly with a uniform distribution – each slot has a probability \(\frac{D}{q_i}\) of being selected. Note that the denominator is not \(N\) – we are calculating a probability conditional on all the slots before the \(i\)th bin being empty. The expected number of balls in a single slot is thus \(\frac{D}{q_i}\). It follows that:

$$\begin{aligned} \langle b_i \rangle = c_i \cdot \frac{D}{q_i} \end{aligned}$$
(13)

Next, we compute the probability \(p_i\) that all the bins before the bin \(i\) are empty. We do this by counting the events in which this is true, namely, the number of ways to assign the balls to the bins \(j \ge i\). We pick combinations of \(D\) slots, one for each ball, from a set of \(q_i\) slots. We do the same to enumerate all the assignments of balls to bins, but with \(N = q_0\) slots, and obtain:

$$\begin{aligned} p_i = \frac{\left( {\begin{array}{c}q_i\\ D\end{array}}\right) }{\left( {\begin{array}{c}q_0\\ D\end{array}}\right) } \end{aligned}$$
(14)

We assumed here that \(q_i \ge D\), otherwise we cannot fit all \(D\) balls into the bins. We could add a constraint that the last batch is always larger than the number of balls. Instead, we simply define \(\left( {\begin{array}{c}q_i\\ D\end{array}}\right) = 0\) if \(q_i < D\) – there is no way to fit more than \(q_i\) balls into \(q_i\) slots. Combining these relations, we get the following expression for \(\langle X \rangle \):

$$\begin{aligned} \langle X \rangle = D \cdot \frac{(q_0 - D)!}{q_0!} \sum _{i=0}^{k-1} c_i \cdot \frac{(q_i - 1)!}{(q_i - D)!} \end{aligned}$$
(15)

We use this expression to compute the expected speedup inverse. Since \(s_p^{-1} = \frac{1}{P} + \left( 1 - \frac{1}{P} \right) \cdot \frac{X}{D}\), by the linearity of expectation:

$$\begin{aligned} \langle s_p^{-1} \rangle = \frac{1}{P} + \left( 1 - \frac{1}{P} \right) \cdot \frac{(q_0 - D)!}{q_0!} \sum _{i=0}^{k-1} c_i \cdot \frac{(q_i - 1)!}{(q_i - D)!} \end{aligned}$$
(16)

This is a more general expression than the one in the claim. When we plug in the exponential backoff batching schedule, i.e. \(c_i = 2^i\) and \(q_i = 2^k - 2^i\), the lemma follows.

The expression derived for the inverse speedup does not have a neat analytical form, but we can evaluate it for different values of \(d\) to obtain a diagram. As a sanity check, the worst expected speedup comes with \(d = 0\). If there is only a single expensive element in the range, then there is no way to parallelize the execution – the expression gives us the speedup \(1\). We expect a better speedup as \(d\) grows – when there are more expensive elements, it is easier for the scheduler to stumble upon some of them. In fact, for \(d = k\), with the conventions established in the proof, we get that the speedup inverse is \(\frac{1}{P} + \left( 1 - \frac{1}{P} \right) \cdot \frac{c_0}{D}\). This means that when all the elements are expensive, the proximity to the optimal speedup depends on the size \(c_0\) of the first batch – the fewer elements in it, the better. Together with the fact that many applications have uniform workloads, this is also the reason why we advocate an exponential backoff for which the size of the first batch is \(1\).

Fig. 17. Randomized scheduler executing step workload – speedup vs. span

We call the term \(\frac{(q_0 - D)!}{q_0!} \sum \nolimits _{i=0}^{k-1} c_i \cdot \frac{(q_i - 1)!}{(q_i - D)!}\) the slowdown and plot it with respect to span \(D\) on the diagram in Fig. 17. In this diagram we choose \(k = 10\), and the number of elements \(N = 2^{10} = 1024\). As the term nears \(1\), the speedup nears \(1\). As the term approaches \(0\), the speedup approaches the optimal speedup \(P\). The quicker the term approaches \(0\) as we increase \(d\), the better the scheduler. We can see that fixed-size batching should work better than the exponential backoff if the span \(D\) is below \(10\) elements, but is much worse than the exponential backoff otherwise. Linearly increasing the batch size from \(0\) in some step \(a = \frac{2 \cdot (2^k - 1)}{k \cdot (k - 1)}\) seems to work well even for span \(D < 10\). However, the mean batch size \(\overline{c_i} = \frac{S}{k}\) means that this approach may easily violate the baseline constraint, and for \(P \approx D\) the formula is an approximation anyway.
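The slowdown term is easy to evaluate numerically; the sketch below (our code, using log-factorials to avoid overflow) reproduces the kind of curve shown in Fig. 17 for the exponential backoff schedule, i.e. \(c_i = 2^i\) and \(q_i = 2^k - 2^i\).

```scala
// Evaluates the slowdown term ((q0 - D)!/q0!) * sum_i c_i (q_i - 1)!/(q_i - D)!
// for the exponential backoff schedule, using log-factorials to avoid overflow.
object Slowdown {
  private def logFactorial(n: Long): Double =
    (2L to n).iterator.map(j => math.log(j.toDouble)).sum  // log(n!), log(0!) = log(1!) = 0

  def slowdown(k: Int, d: Int): Double = {
    val q0 = (1L << k) - 1                 // N = sum of all batch sizes = 2^k - 1
    val D  = 1L << d                       // span of the step distribution
    var acc = 0.0
    for (i <- 0 until k) {
      val ci = 1L << i                     // c_i = 2^i
      val qi = (1L << k) - ci              // q_i = 2^k - 2^i
      if (qi >= D)                         // by convention the term is 0 when q_i < D
        acc += math.exp(math.log(ci.toDouble) +
          logFactorial(qi - 1) - logFactorial(qi - D) +
          logFactorial(q0 - D) - logFactorial(q0))
    }
    acc
  }

  def main(args: Array[String]): Unit =
    for (d <- 0 to 9) println(f"d = $d%2d  slowdown = ${slowdown(10, d)}%.6f")  // k = 10, as in Fig. 17
}
```

At \(d = 0\) the sketch yields exactly \(1\), matching the sanity check above, and the value should drop quickly as \(d\) grows.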

The conclusion is that selecting a random permutation of the elements should work very well in theory. For example, the average speedup becomes very close to optimal even if as few as \(D = 10\) elements out of \(N = 1024\) are expensive. However, randomly permuting elements would in practice either require a preparatory pass in which the elements are randomly copied, or require the workers to randomly jump through the array, leading to cache miss issues. In both cases the baseline performance would be violated. Even permuting only the order of the batches seems problematic, as it would require storing information about where each batch started and left off, as well as its intermediate result – for something like that we need a data structure like a work-stealing tree, and we saw that we have to minimize the number of nodes there as much as possible.

There are many approaches we could study, many of which could have viable implementations, but we focus on a particular one which seems easy to implement for ranges and other data structures. Recall that in the example in Fig. 9-36 the interval with expensive elements was positioned at the end of the range. What if the worker alternated the batches by tossing a coin in each step to decide whether the next batch should be taken from the left (start) or from the right (end)? Then the worker could, with a relatively high probability, arrive at the expensive interval at the end while the batch size is still small. The changes to the work-stealing tree algorithm are minimal – in addition to another field called rresult (the name of which should shed some light on the previous choice of name for lresult), we have to modify the workOn, complete and pushUp methods. While the latter two are straightforward, lines 47 through 51 of workOn are modified. The new workOn method is shown in Fig. 18.

Fig. 18. Randomized loop method

The main issue here is to encode and atomically update the iteration state, since it consists of two pieces of information – the left and the right position in the subrange. We can encode these two positions in a single long integer field and update it with a long CAS operation. The first 32 bits can hold the position on the left side of the subrange and the remaining 32 bits the position on the right side. With this in mind, the methods tryAdvanceLeft, tryAdvanceRight, notStolen, notCompleted and decodeStep should be straightforward.
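A sketch of that encoding with assumed helper names (the paper's tryAdvanceLeft and tryAdvanceRight are here folded into one method that also performs the coin toss):

```scala
// A sketch (names are ours) of the two-ended iteration state packed into one
// long: here the upper 32 bits hold the left position and the lower 32 bits
// the right position. A CAS on an AtomicLong claims a batch from either end.
import java.util.concurrent.atomic.AtomicLong
import scala.util.Random

final class TwoEndedRange(start: Int, end: Int) {
  private val state = new AtomicLong(encode(start, end))

  private def encode(left: Int, right: Int): Long =
    (left.toLong << 32) | (right.toLong & 0xffffffffL)
  private def left(s: Long): Int  = (s >>> 32).toInt
  private def right(s: Long): Int = (s & 0xffffffffL).toInt

  /** Atomically claims up to `step` elements from a randomly chosen end;
    * returns the claimed [from, until) subrange, or None if exhausted. */
  def tryAdvance(step: Int, random: Random): Option[(Int, Int)] = {
    while (true) {
      val s = state.get
      val (l, r) = (left(s), right(s))
      if (l >= r) return None
      val take = math.min(step, r - l)
      val (next, claimed) =
        if (random.nextBoolean()) (encode(l + take, r), (l, l + take))   // batch from the left end
        else                      (encode(l, r - take), (r - take, r))   // batch from the right end
      if (state.compareAndSet(s, next)) return Some(claimed)             // on CAS failure, retry
    }
    None // unreachable
  }
}
```

In the actual scheduler the same state would also have to reflect the stolen and completed conditions checked by notStolen and notCompleted; the sketch omits those for brevity.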

Fig. 19. The randomized work-stealing tree and the STEP3 workload

We evaluate the new scheduler on the distribution from Fig. 9-36 and show the results in Fig. 19. The first two diagrams (STEP2 and STEP3) show that with the expensive interval at the beginning and at the end of the range the work-stealing tree achieves a close-to-optimal speedup. However, there is still a worst-case scenario that we have to consider, and that is a step workload with the expensive interval exactly in the middle of the range. Intuition tells us that the probability of hitting this interval early on is smaller, since a worker has to progress through more batches to arrive at it. The workload STEP4 in the third diagram of Fig. 19 contains around \(25\,\%\) expensive elements positioned in the middle of the range. The speedup is decent, but not linear for STEP4, since the bigger batches, on average, hit the middle of the range more often.

Having shown that randomization does help scheduling both in theory and in practice, we conclude that the problem of overcoming particularly bad workload distributions is an algorithmic problem of finding a batching schedule which can be computed and maintained relatively quickly, leaving this task as future work.


Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Prokopec, A., Odersky, M. (2014). Near Optimal Work-Stealing Tree Scheduler for Highly Irregular Data-Parallel Workloads. In: Cașcaval, C., Montesinos, P. (eds) Languages and Compilers for Parallel Computing. LCPC 2013. Lecture Notes in Computer Science, vol 8664. Springer, Cham. https://doi.org/10.1007/978-3-319-09967-5_4

  • DOI: https://doi.org/10.1007/978-3-319-09967-5_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-09966-8

  • Online ISBN: 978-3-319-09967-5
