Keywords

1 Introduction

Stream programming paradigm has its origins in the Kahn’s processing model [5] and data-flow computing [3]. Stream programs are a natural fit for applications that process large unbounded regular sequences of data. There are many examples for established stream programming applications including digital signal processing, audio, video, graphics, networking and for big data.

Stream programs are expressed by a set of actors and a set of data channels between actors. Conceptually, actors are independent processing units with their own memory and program counters. An actor exchanges information with another actor via a data channel using tokens. The channels fully expose the dependencies between actors, and are directed: the producer is the actor at the source of a data channel, and the consumer is the actor at the destination of the data channel. The data channels are commonly implemented as FIFO buffers, and the size of the FIFO buffers depend on the point in time when actors are executed (also known as fired).

If the firing of actors is not coordinated, actors may starve or the memory of FIFO buffers may deplete. To overcome this problem, Synchronous Data Flow (SDF) Model was introduced [2] to bound the size of FIFO buffers and make computations of infinite streams of data deterministic and controllable. In the SDF model, the actor are constrained such that for each actor firing, only a fixed number of tokens are consumed and produced, respectively. For a well-formed SDF program, a finite periodic schedule can be constructed [2] that consists of a finite sequence of actor firings. The schedule can be computed a priori and invokes each actor of the stream graph at least once, and produces no net change in the system state after executing the schedule. I.e., the number of tokens in each data channel is the same before and after executing the schedule. Hence, a periodic schedule can be executed again and again for unbounded regular streams without starving actors and without exhausting memory. The state before and after the execution of a periodic schedule is known as a steady-state. Hence, the SDF model is a popular model for stream programming because the memory consumption of the data channels is known a prior at compile time. There are many different steady-state schedules for an SDF program, and the sizes of the FIFO buffers for channels depend on the chosen steady-state schedule. Finding a steady-state schedule that minimizes the sizes of FIFO buffers, is still an open research problem. The current state-of-the-art algorithms for finding steady-state schedules are ad-hoc heuristics only [2] that do not optimize for minimal memory. Hence, stream programs may not fully utilize caches and/or modern massively parallel architectures (e.g. GPGPUs) may need to utilize slower memory rather than fast memory. Hence, finding memory-optimal steady-state schedules is of importance for the SDF model.

Contributions: This work is of theoretic nature and explores the problem of finding memory optimal steady-state schedules in an algorithmic fashion. We anticipate large instances of SDF programs in near future that necessitates new algorithmic contributions for memory-optimal steady-state schedules. We provide three notions of memory optimality based on how FIFO buffers utilize memory. We show for each notion of optimality, algorithmic and complexity theoretic results. We also provide a synthetic set of experiments to show the effectiveness of our new algorithmic approach in comparison with the state-of-the-art algorithm for large instances.

2 Motivating Example

The data-flow model [3] represents a program as a stream graph \(G=(V,E)\) whose vertices V are called actors and whose edges \(E\subseteq V\times V\) are called channels. A channel \((u,v)\in E\) buffers data elements called tokens, which are passed from the output of actor u to the input of actor v. In Fig. 1(a) a stream graph is depicted whose actors are a, b and c. The directed edges of the example graph represent channels that transport streams of tokens from the producing actor to the consuming actor. In the following we denote by n and m the number of actors and channels, respectively.

Fig. 1.
figure 1

Example: stream graph consists of actors a, b, and c; channels are augmented with numbers of produced and consumed tokens for its adjacent actor when its fires. Fill-state, repetition for steady-state, and balance equation are given.

Synchronous dataflow [7] restricts the semantics of the dataflow model by fixing the number of consumed and produced tokens for a single firing of an actor. The number of consumed tokens for a single firing of actor v from an in-coming channel \((u,v)\in E\) is given by function \(c: E \rightarrow \mathbb N\). Function \(p: E \rightarrow \mathbb N\) denotes the number of produced tokens for an outgoing channel of an actor. We also refer to the functions p and c as data rates. The data rates of our motivating example are shown in Fig. 1(b) and are also depicted as edge annotation in the graph in Fig. 1(a). A schedule \(s=\langle u_{1},\ldots \rangle \in V^*\) is a sequence of actors, where a given actor may occur several times. Each occurrence \(u_{i}\in V\) in the schedule is called an firing of actor \(u_{i}\). A firing of an actor modifies the state of the system by producing and consuming tokens from the channels adjacent to the actor. The fill-state of the system is the numbers of tokens on the channels between actor invocations that have been queued but have not been consumed yet. We will specify the fill-state of the system at a given point in time with a function \(f:E \rightarrow \mathbb N\). The fill-state is an abstraction of the actual tokens that are stored on the channels. Let us assume that we have an initial fill-state as given in Fig. 1(c), which implies that there is a single token in channel (ab) and there are no tokens on channel (bc), and channel (ac).

A periodic schedule has finite length, includes every actor of the stream graph at least once, and its execution produces no net change in the fill-state after executing the schedule. A periodic schedule may be computed a-priori [7], and executed ad-infinitum without exhausting memoryFootnote 1. We refer to the fill-state before and after the execution of a periodic schedule as steady-state. A periodic schedule s has a repetition vector \(r: V \rightarrow \mathbb N\) that counts the occurrences of each actor in s. The length of s is given by \(\sum _{u \in V} r(u)\). We denote with \(\mathcal {S}\) the set of periodic schedules for a given stream graph instance. Periodic schedules are constrained by two factors. First, recall that every actor needs to be fired at least once. For our example, actor a, b, and c must occur in schedule s. Second, in order to conserve the fill-state of the FIFO-buffers after the execution of the schedule, for each buffer the number of tokens put into the buffer must equal the number of tokens consumed from the buffer. These constraints give rise to the so-called balance equations:

$$\begin{aligned}&p(u,v) \cdot r(u)= c(u,v) \cdot r(v)&\forall \, (u,v)\in E \end{aligned}$$
(1)
$$\begin{aligned}&r(u) > 0&\forall \, u\in V \end{aligned}$$
(2)

The balance equations of the example in Fig. 1 are given in the Fig. 1(e) with the additional constraint that \(r(a)>0\), \(r(b)>0\), and \(r(c)>0\) where r(u) are the repetitions for actor u, i.e., there must be r(u) occurrences of actor u in the schedule s. Finding the smallest integral repetitions for actors can be expressed as a problem of finding the smallest integral vector in the null-space of the topological matrix [7]. It is known that for connected stream graphs, an integral repetition vector satisfying Eqs. (1) and (2) exists if and only if the topological matrix of the stream graph has rank \(n-1\). The repetitions of the motivating example in Fig. 1(a) is shown in Fig. 1(d).

figure a

For the example, actor a is invoked twice since it produces only one token on the edge (ab), but to fire b at least once it needs to consume two tokens on (ab). Actor c is invoked twice because a is invoked twice, and c only consumes a single token for every token produced by a along the channel (ac). No smaller repetition vector can be found. To find a schedule s, a greedy heuristic was devised by Battacharyya et al. [2] (cf. Sect. 3.3.2). The heuristic is outlined in Algorithm 1. The goal of greedy is to minimize the sum of the maximum number of tokens required for each channel over a periodic schedule. Given a graph G, and the initial delay t as part of the input, it returns a schedule s. Note that we say an actor v is fireable, if for every incoming edge (uv), \(f(u,v) \ge c(u,v)\). An actor v is deferrable if it is fireable, and for at least one of its outgoing edges (vu) (that is not a transitive edgeFootnote 2) it holds that \(f(v,u) \ge c(v,u)\).

For our motivating example in Fig. 1(a), the greedy algorithm will fail to find the optimal periodic schedule. For the example we further assume, that we have an initial delay of one token on edge (ab). To minimize the memory consumption the optimal periodic schedule to use is \(s = \langle a, b, c, a, c \rangle .\)

During the firing of these actors, we can keep track of the maximum number of tokens needed for each channel. For the edge (ab), the initial fill-state f(ab) on the edge is 1. We first fire actor a, and thus the fill on the edge (ab) is increased to 2. Similarly, the edge (ac) is now storing a single token. Next, actor b is invoked, consuming two tokens from the channel (ab) and producing two tokens on the channel (bc). Actor c fires, which consumes one token from edge (bc) and (ac). At this point, the edge (ab) has a fill of zero. In order to ensure this schedule is periodic, we must fire a again, producing a single token on both (ab) and (ac). We return back to our initial fill-state by firing c once more. Over this execution, we can see that the maximum number of tokens on the channels (ab) and (bc) is two, while the maximum number of tokens on the channel (ac) is one. Summing this together, this schedule requires memory to store at most five tokens during this execution.

However, it is entirely possible that the greedy algorithm could choose the schedule \(s' = \langle a, b, a, c, c \rangle \). By executing each actor in \(s'\) one-by-one, we can observe that the maximum number of tokens used by the channels (ab), (bc) and (ac) is two. Thus enough memory will be needed to store six tokens, which is clearly suboptimal compared to the schedule s.

To see why this is the case, note that the difference between s and \(s'\) is the order of the 3rd and 4th actors a and c. After firing a and then b, both a and c are fireable but not deferrable. Notice that a is not deferrable, even though there is an outgoing edge that meets the consumption requirements (\(f(a,c) \ge c(a,c)\)), the edge (ac) is a transitive edge. Since \(f(a,b) \not \ge c(a,b)\) at this point in time, a does not meet the criteria to be marked as a deferrable actor. Thus, the greedy algorithm can choose to fire either actor a or c. The greedy algorithm could make the suboptimal choice to fire a again (Line 6 of Algorithm 1), instead of firing c, which would consume the single token on the channel (ac). By choosing to fire a, the greedy algorithm produces a second token on the channel (ac). This leads to the schedule \(s'\).

3 Problem Statement

Stream programs admit an exponential number of periodic schedules. In fact, given a repetition vector r, any sequence of length \(L = \sum _{u \in V} r(u)\) where actor u occurs r(u) times is a periodic schedule. Therefore, the number of periodic schedules is given by \( |\mathcal {S}| = \frac{L !}{\prod _{u \in V} r(u)!}. \) Among these schedules some consume less memory on their FIFO-buffers than others. How much memory a given schedule consumes will depend on the implementation details of the FIFO-buffers and on the evolution of the fill-state over the execution of the schedule.

Recall that the fill-state function keeps track of the number of tokens stored on each channel waiting to be consumed. Given the current fill-state of the system, it is possible to determine the fill-state after the execution of the particular actor. Therefore, given the initial fill-state, we can easily compute the fill-state after the i-th step of the schedule execution. The fill-state function \(f_s^i : E \rightarrow {\mathbb N}\) defines the fill-state of channel (uv) after the i-th execution step of schedule s and may be defined as \(f_s^0(u,v) = t(u,v)\) for the first step, and

$$\begin{aligned} f_s^{i+1}(u,v) = {\left\{ \begin{array}{ll} f_s^i(u,v)+p(u,v), &{}\text{ if } u = s(i+1), \\ f_s^i(u,v)-c(u,v), &{}\text{ if } v = s(i+1), \\ f_s^i(u,v), &{}\text{ otherwise. } \\ \end{array}\right. } \end{aligned}$$

where t(uv) is initial fill-state of (uv) at the beginning of the execution of s.

A periodic schedule and an initial fill-state are said to be admissible if the schedule can be executed without ever running out of tokens on any channel.

Definition 1

Finite periodic schedule s with initial fill-state t is admissible, if

$$\begin{aligned} f_s^i(u,v)&\ge 0,&\forall (u,v) \in E, \, i \in \{0,\ldots ,L\} \\ f_s^0(u,v)&= f_s^{L}(u,v),&\forall (u,v) \in E \end{aligned}$$

It is worth noting that for each periodic schedule \(s \in \mathcal {S}\) there exists an initial fill-state that makes it admissible. Therefore, all we need is a method for deciding which periodic schedule to use.

We study three objective functions that capture the memory utilization of the system under different implementations of the FIFO buffers. In each case the goal is to compute an admissible schedule (st), the only difference is the objective being optimized:

  1. (P1)

    The Min-Max-Max Problem:

    $$\begin{aligned} \min _{(s, t)} \ \max _{0 \le i \le L} \max _{(u,v) \in E} f_s^i(u,v) \end{aligned}$$
  2. (P2)

    The Min-Sum-Max Problem:

    $$\begin{aligned} \min _{(s, t)} \sum _{(u,v) \in E} \max _{0 \le i \le L} f_s^i(u,v) \end{aligned}$$
  3. (P3)

    The Min-Max-Sum Problem:

    $$\begin{aligned} \min _{(s, t)} \max _{0 \le i \le L} \sum _{(u,v) \in E} f_s^i(u,v) \end{aligned}$$

The objective (P1) minimizes the maximum buffer requirement across all buffers. This objective captures a simplistic implementation of FIFO buffers where space is allocated ahead of time and buffers have uniform length. The objective (P2) minimizes the sum of the maximum requirements. This objective captures a simple implementation of FIFO buffers where space is allocated ahead of time, but different buffers can differ in size. The objective (P3) minimizes the maximum combined size of all buffers at any point in time. This objective capture a more sophisticated implementation where buffer space can be acquired and released dynamically.

4 Scheduling to Minimize Memory Usage

In this section we consider the objectives defined in Sect. 3 under the assumption that the initial fill-state of each buffer can be set arbitrarily. In other words, given an instance (VEcp), the goal is to compute a schedule s and an initial fill-state \(t:E \rightarrow \mathbb {N}\) so that the schedule is admissible and one of the three objectives (P1–P3) is minimized.

Our algorithm for Min-Max-Max and Min-Sum-Max assumes the Balance Eqs. (1) and (2) for the instance are feasible and that we are given the smallest integral repetition vector r that the instance admits. In addition to the instance (VEpc) the algorithm take as a parameter a permutation of the actors, we use \(\pi : V\rightarrow [1, n]\) to the denote the position of each actor within this permutation.

First, the algorithm computes for each channel the appropriate initial fill-state that will ultimately make the schedule admissible. Second, each actor u is added to a priority queue with priority 0. The algorithm then enters an infinite loop where in each iteration we remove from the priority queue the actor u with the smallest key x (if there are several actors with the same key, we break ties using the permutation order), we invoke u, and re-insert u with priority \(x + \frac{1}{r(u)}\). The pseudo-code of the procedure is given in Algorithm 2.

Notice that for each actor u, its priority becomes 1 after r(u) invocations. Therefore, after \(L = \sum _{u \in V} r(u)\) executions of the while loop every actor has priority 1. At this point in time, the schedule executed thus far is periodic. We call this periodic schedule, the canonical schedule induced by \(\pi \) and denote it by \((s_\pi , t_\pi )\). Notice, however, that Algorithm 2 itself never ends. Indeed, after the L-th iteration the while loop goes on to repeat this periodic schedule ad-infinitum.

The proof of correctness hinges on the following observation on the minimum buffer size of a channel based on the data rates of its endpoints.

Lemma 1

([2, Theorem 3.3]). Let (uv) be a channel. In any admissible schedule, the buffer for channel (uv) has size at least \(p(u,v) + c(u,v) - \gcd (p(u,v), \) c(uv)) at some point in time during the execution of the schedule.

Proof

For sake of brevity, let us denote p(uv) with a, c(uv) with b, and \(a + b - \gcd (a,b)\) with . Let (st) be an admissible schedule. Since we are interested in deriving a lower bound on the buffer size for channel (uv), we assume without loss of generality that this is the only channel in the graph.

If \(a \mid b\) then the buffer size has to be at least , so the lemma follows. Let us then assume from now on that \(1< a < b\) and that b is not a multiple of a; a symmetric argument can be used to handle the cases \(b \mid a\) and \(1< b < a\).

Consider two executions of v in s. If we have two consecutive executions of v, or the schedule begins and ends with v, then the buffer size is at least . So let us assume this does not happen. We apply the following transformation to the schedule: If the schedule starts by executing v, then we fuse every execution of v with the execution of u that immediately follows it into a new actor \(v'\); if the schedule starts by scheduling u, then we fuse every execution of v with the execution of u that immediately precedes it into a new actor \(v'\). The result is a schedule for a new channel \((u, v')\) with production rate a and consumption rate \(b - a\). It is easy to check that if is a lower bound on the buffer size for the new channel, then is a lower bound on the buffer size for the original channel.

A simple proof by induction finishes the argument if \(a \mid b\) then we are at the base of the inductive proof. Otherwise, the size of the buffer for the channel (uv) must be at least

where the last equality follows by Euclid’s algorithm.

Now that we have a lower bound on the size of each buffer, we will prove that these bounds are attained simultaneously by our algorithm.

Lemma 2

For any permutation \(\pi \), the schedule \((s_\pi , t_\pi )\) is admissible and for each channel (uv) the maximum size of the buffer during the execution of the schedule is \(p(u,v) + c(u,v) - \gcd (p(u,v), c(u,v))\).

figure b

Proof

We prove the bounds on the size of the buffer for a fixed, but arbitrary, channel (uv). For sake of brevity, let us denote p(uv) with a and c(uv) with b. Recall that Balance Eq. (1) for channel (uv) implies \(\frac{r(u)}{r(v)} = \frac{b}{a}\).

First, consider the case \(\pi (u) < \pi (v)\). Notice that the 1st execution of v is preceded by an execution of u. In general, the \(k+1\)st execution of v is preceded by

$$\begin{aligned} \left\lfloor \frac{ k \frac{1}{r(v)} }{\frac{1}{r(u)}} \right\rfloor + 1 = \left\lfloor k \frac{r(u)}{r(v)} \right\rfloor = \left\lfloor \frac{kb}{a} \right\rfloor + 1 \end{aligned}$$

executions of u. Therefore, the fill-state of the channel after the \(k+1\)st execution of v is precisely

$$t_\pi (u,v) + a \left( \left\lfloor \frac{k b}{a} \right\rfloor + 1 \right) - (k+1) b.$$

Using the fact that \(t_\pi (u,v) = b - \gcd (a,b)\) when \(\pi (u) < \pi (v)\), we can show that the fill-state of the channel is always non-negative:

$$\begin{aligned} t_\pi (u,v)\, +&\, a \left( \left\lfloor \frac{k b}{a} \right\rfloor + 1 \right) - (k+1) b = \\&= a \left( \left\lfloor \frac{kb}{a} \right\rfloor + 1\right) - k b - \gcd (a,b) \ge \gcd (a,b) - \gcd (a,b) = 0, \end{aligned}$$

where the inequality follows from the fact that \(a \left( \left\lfloor \frac{kb}{a} \right\rfloor + 1\right) - k b > 0\) and Bézout’s Lemma [9].

On the other hand, just before before the \(k+1\)st execution of v, the fill-state of the buffer is

$$\begin{aligned} t_{\pi }(u,v) + a \left( \left\lfloor \frac{k b}{a} \right\rfloor + 1\right) - k b. \end{aligned}$$

Again, using the fact that \(t_\pi (u,v) = b - \gcd (a,b)\) we get

$$\begin{aligned} t_{\pi }(u,v) \, + \, a \left( \left\lfloor \frac{k b}{a} \right\rfloor + 1\right) - k b = a + b - \gcd (a,b) + a \left\lfloor \frac{kb}{a} \right\rfloor - kb \le a + b - \gcd (a,b), \end{aligned}$$

so the buffer size never exceeds \(a + b - \gcd (a,b)\).

Now consider the case \(\pi (v) < \pi (u)\). In this case the \(i+1\)st execution of v is preceded by \(\left\lceil \frac{ib}{a} \right\rceil \) executions of u. A similar argument (but using the fact that \(t_\pi (u,v) = b\) when \(\pi (u) > \pi (v)\)) shows that the schedule is admissible and that the maximum buffer size is \(a + b - \gcd (a,b)\).

Combining the lower bound from Lemma 1 and the upper bound from Lemma 2 we get that every canonical schedule is an optimal solution for (P1) and (P2). The following theorem summarizes the results in this section.

Theorem 1

There is a polynomial time algorithm for computing an optimal periodic schedule for the objectives (P1) and (P2) with flexible initialization. Furthermore, the schedule can be computed online using \(\varTheta (n)\) space and \(O(\log n)\) time per actor invocation.

Proof

The optimality of the objectives (P1) and (P2) follows immediately from Lemmas 1 and 2. The complexity claims follow from using a priority queue implementation that uses \(\varTheta (n)\) space and performs insert and delete-min operations in \(O(\log n)\) time.

We contrast our positive results from the previous section by showing that it is NP-hard to optimize the Min-Max-Sum (cf. (P3)).

Theorem 2

It is NP-hard to optimize (P3) with flexible initialization.

The theorem can be shown by reducing the Minimum Feedback Arc Set (MFAS) problem to our problem.

5 Experiments

With the advent of stream programming we anticipate large instances of stream programs. In the absence of large stream programs, we have generated complete, directed, acyclic graphs as a synthetic benchmark suite. We generated the graphs as follows: We start with a directed graph \(G = (V,E)\) of n nodes, and number the vertices \(v_1, v_2, \ldots v_n\). For each vertex \(v_i \in V\), we select a random repetition value \(r(v_i)\) uniformly at random from the range \(\{1, \ldots , n\}\). We then iterate through every pair of vertices \(v_i, v_j \in V\). If \(i \ne j\) and \(i < j\), we add the directed edge \((v_i, v_j)\) to E, with \( p(v_i, v_j) = \frac{r(v_j)}{\gcd (r(v_i), r(v_j))}\) and \( c(v_i, v_j) = \frac{r(v_i)}{\gcd (r(v_i), r(v_j))}. \) This generation template guarantees that a repetition vector exists, and the topological matrix of this directed, acyclic graph has rank \(n - 1\).

Table 1. Performance comparison on randomly generated instances.
Fig. 2.
figure 2

Visualizing the performance on randomly generated instances.

Using this approach, we generated graphs of size \(n = 10, 15, \ldots , 50\) and ran both algorithms. As before, we timed the execution of each algorithm, taking the average over twenty runs. The numerical results of these experiments are shown in Table 1 and are visualized in Fig. 2.

Two observations stand out from Fig. 2. First, canonical seems to be asymptotically faster greedy. We suspect that this is due to the fact that as the graph becomes denser, there will be many “fireable” actors in each iteration, leading greedy to spend nearly quadratic time per iteration, whereas canonical, is guaranteed to spend at most logarithmic time. Second, greedy was never able to find an optimal schedule and the quality of the solutions it produced deteriorated as n grew.

These experiments strongly support the hypothesis that greedy is slower and produces worse schedules with respect to memory consumption when more actors become “fireable” at each iteration.

6 Related Work

Our work is closely connected to the greedy algorithm proposed in [2]. Their approach is based on a heuristic which keeps track of the set of “fireable” actors. Our algorithm is based on optimality theorems which produces both the optimal memory schedule and the required initial delay to achieve optimality. With a given initial delay, the NP-hard proof of the problem is given in [2]. The approach in [4] uses a model-checking method to find optimal schedules which requires a machinery which is outside of the complexity class P. There is a stream of literature on scheduling of SDF programs with model checking. Other approaches use time automata to solve the scheduling problem for SDF [1], and variations of the problem definition taking other metrics such as throughput into account [8]. There is also related work concerning to eliminate buffers via unrolling the finite periodic schedules [6].

7 Conclusion

In this work, we have studied three mathematical definitions of memory optimality based on how FIFO buffers utilize memory. We started by showing that two of these objectives can be solved in logarithmic worst-case time per actor invoked, and that the last objective is NP-hard. Experiments showed that our new algorithm drastically outperformed existing heuristics in both speed and the memory usage of schedules produced on dense instances.