Monotonic Optimization of Dataflow Buffer Sizes
Abstract
Many high data-rate video-processing applications are subject to a trade-off between throughput and the sizes of buffers in the system (the storage distribution). These applications have strict requirements with respect to throughput, as this directly relates to the functional correctness. Furthermore, the size of the storage distribution relates to resource usage, which should be minimized in many practical cases. The computation kernels of high data-rate video-processing applications can often be specified by cyclo-static dataflow graphs. We therefore study the problem of minimization of the total (weighted) size of the storage distribution under a throughput constraint for cyclo-static dataflow graphs. By combining ideas from the area of monotonic optimization with the causal dependency analysis from a state-of-the-art storage optimization approach, we create an algorithm that scales better than the state-of-the-art approach. Our algorithm can provide a solution and a bound on the suboptimality of this solution at any time, and it iteratively improves this until the optimal solution is found. We evaluate our algorithm using several models from the literature, and on models of a high data-rate video-processing application from the healthcare domain. Our experiments show performance increases up to several orders of magnitude.
Keywords
Monotonic optimization · Cyclo-static dataflow · Throughput · Buffer size

1 Introduction
Many high data-rate video-processing applications have strict requirements on throughput as it affects the (visual) quality. It may even affect safety, as is the case for the medical video-processing application in the image-guided therapy domain that has motivated our work. Often, these types of applications are subject to a trade-off between throughput and the sizes of the buffers (the storage distribution from now on). Since buffer space uses expensive or scarce resources, one of the key design questions for these applications is how to minimize the storage distribution without violating the throughput constraint. We approach this problem using model-based design, and model and analyze the application using the cyclo-static dataflow (CSDF) formalism [5]. This formalism is a member of the dataflow family [6] and is suitable for modeling a broad class of streaming, parallel applications with cyclically changing behavior and finite buffers, such as our video-processing applications. This model-based approach has the advantage that the analysis of the model usually is much faster than experimentation on a prototype. For instance, our driver case has an FPGA as implementation target. The hardware implementation step in the process takes several hours. Using a storage-distribution minimization algorithm with an analysis step that takes several hours clearly is infeasible for even small search spaces.
Efficient methods exist to compute the throughput of a CSDF graph with a given storage distribution. The problem that we consider in this article is to minimize the size of the storage distribution under a throughput constraint. In general, this problem is NP-hard [8], and we therefore present an anytime algorithm. The algorithm first tries to quickly find an initial storage distribution that realizes the throughput constraint. Then it iteratively improves the storage distribution. During this process, the algorithm provides an upper bound on the difference between the size of the currently best storage distribution and the size of the (unknown) minimal storage distribution. This is a useful feature because a user who cannot wait for a true minimum storage distribution (finding one can take a long time due to the NP-hardness) can terminate the algorithm early and still obtain a feasible storage distribution together with an estimate of the quality of this solution.
Contribution
In this work we combine principles from monotonic optimization [12, 13] and the concept of knee points of [7] with the causal dependency analysis from [10, 11]. This results in an algorithm that minimizes the storage distribution in CSDF graphs under a throughput constraint. This algorithm scales better than the state-of-the-art approach of [10, 11]. Our experiments show that the performance may be improved by several orders of magnitude. Furthermore, it is an anytime algorithm which, at any moment after the initialization phase, can present a storage distribution that satisfies the throughput constraint (if one exists) together with a bound on the suboptimality of this best solution so far. A secondary contribution is an elaboration of the concept of knee points that has been introduced in [7]. Knee points play a crucial role in our algorithm.
Related work
Closely related work that addresses the problem of this article, optimization of the storage distribution under a throughput constraint for CSDF graphs, is the work of [3, 4, 10, 11, 15]. In [15], a fast approximation algorithm is proposed that overestimates the size of the required storage distribution by an unknown factor. The work of [10, 11], on the other hand, presents an exact solution to a slightly more general problem than the problem of this article: [10, 11] compute the whole trade-off space, which can then be used to solve our problem. The work of [3, 4] is closely related to [15] and provides an approximate solution based on a relaxation of an integer-linear program. Our work is complementary to [10, 11] as it can be regarded as a fast heuristic to significantly prune the search space, after which the exact method of [10, 11] is used to obtain the final solution. Our method can also be regarded as a domain-specific specialization of the domain-independent and generic monotonic optimization framework of [12, 13]. This framework to solve non-convex, but monotonic, optimization problems has successfully been applied in the area of wireless communications [9, 14, 16], and we now introduce it in the dataflow domain. The key difference with the generic outer polyblock approximation algorithm of [12, 13] is that we bound the optimal solutions from both the inside and the outside. This is similar to the approach of [7], which uses a constraint solver to build an approximation of the Pareto front of a multi-criteria optimization problem, using monotonicity implicitly. We use the concept of knee points of [7] to select a new point in the search space to explore, instead of using a binary search to compute the upper-boundary projection in the outer polyblock approximation algorithm of [12, 13].
Furthermore, the fact that we limit the scope of the approach to CSDF allows us to take advantage of domain-specific properties and analysis methods, i.e., the causal dependency analysis of [10, 11], to make the search more efficient.
2 Explanation of the Approach
Our approach is centered around monotonicity of throughput and buffer sizes in the CSDF formalism: increasing buffer sizes will not decrease throughput. This monotonicity allows us to efficiently represent and bound the search space. Suppose that we have analyzed four points in the storage distribution space u_{0} – u_{3} and that these resulted in a less than required throughput (they are infeasible). Figure 1 shows these points, and because of monotonicity we know that the area U^{−} does not contain any feasible points. In a similar way, we can build a view of feasible points. Suppose that s_{0} – s_{2} are points that we have analyzed and that have been shown to be feasible. The area S^{+} then only contains feasible points because of monotonicity. Because the total size of the storage distribution is also monotone, the best solution in S^{+} is one of the points s_{0} – s_{2}, namely s_{0} with size 400 + 700 = 1100. This representation of U^{−} and S^{+} and the exploitation of monotonicity is closely related to the area of monotonic optimization [12, 13].
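The dominance reasoning above can be sketched in a few lines of Python (a minimal illustration; the point coordinates are hypothetical and only loosely follow Fig. 1):

```python
def dominated_by(x, y):
    """True iff x is componentwise at most y, i.e., x lies in y's backward cone y^-."""
    return all(xi <= yi for xi, yi in zip(x, y))

def in_U_minus(x, u_points):
    """x lies in U^-: it is dominated by some analyzed infeasible point, hence infeasible."""
    return any(dominated_by(x, u) for u in u_points)

def in_S_plus(x, s_points):
    """x lies in S^+: it dominates some analyzed feasible point, hence feasible."""
    return any(dominated_by(s, x) for s in s_points)

# Hypothetical analyzed points (pairs of buffer sizes).
infeasible = [(100, 600), (250, 450), (400, 120)]
feasible = [(400, 700), (800, 500)]

print(in_U_minus((200, 400), infeasible))  # True: dominated by (250, 450)
print(in_S_plus((450, 750), feasible))     # True: dominates (400, 700)
```

Monotonicity is what makes these two constant-time-per-point membership tests sound: a single analyzed point certifies the (in)feasibility of its entire backward or forward cone.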
The second concept that we use is that of the special knee points k_{0} – k_{4} [7]. These are induced by u_{0} – u_{3} and are local minima in the sense that any feasible solution will have a size greater than the size of one of the knees (because of monotonicity). In particular, the knee points with the smallest size provide a lower bound on the size of any feasible solution. In this example, k_{2} has the smallest size of 400 + 120 = 520. This means that the optimal solution (at best the point (401,121)) has a size of at least 522 and at most 1100 (the size of s_{0}). We thus say that the maximal error Δ equals 1100 − 522 = 578. Given the state of knowledge determined by the set of feasible points s_{0} – s_{2} and the set of infeasible points u_{0} – u_{3}, we select a new point to check for feasibility. This selection process is based on the knees and on the hyperplane of points with size equal to the best solution so far (the dashed line through s_{0}): we select a point x = (545,265) halfway on the line segment between the hyperplane and a knee with the smallest size (k_{2} in this example). Intuitively, this is the area where most can be gained. By choosing the point x halfway between k_{2} and the hyperplane, we apply a multi-dimensional binary search. In the case that x is feasible, we extend S^{+} with the area to the right of and above x. Furthermore, x improves on s_{0} and the maximal error Δ then equals 810 − 522 = 288, which is approximately half of the previous maximal error.
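With unit weights, the knee-based lower bound and the maximal error Δ of this example can be computed as follows (a sketch; the first two knee coordinates are hypothetical, only k_2 = (400,120) and s_0 = (400,700) are taken from the example):

```python
def size(p):
    """Cost of a point under unit weights (a weighted sum in general)."""
    return sum(p)

def lower_bound(knees):
    """Any feasible point must exceed some knee in every dimension, so its
    size is at least the minimum over all knees of size(knee + 1 per dimension)."""
    return min(size(tuple(ki + 1 for ki in k)) for k in knees)

knees = [(0, 900), (150, 600), (400, 120)]  # only k_2 = (400, 120) is from the text
best = (400, 700)                           # s_0, the best feasible point so far

lb = lower_bound(knees)   # size of k_2 + (1, 1): 401 + 121 = 522
delta = size(best) - lb   # maximal error: 1100 - 522 = 578
print(lb, delta)
```

This is exactly the anytime quality bound: the algorithm can stop at any moment and report `best` together with `delta`.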
The third ingredient in our approach is the causal dependency analysis of [11], which we use to bound the search space even further. Throughput analysis of a CSDF graph can, in addition to the throughput, also provide the channels that have a so-called storage dependency. Intuitively, a channel creates a storage dependency if the progress of the data processing depends on freeing storage space in the buffer associated with that channel. From [11] it follows that throughput can only increase if the size of at least one channel with a storage dependency is increased. Now, suppose that the analysis of x shows that it is infeasible and that only buffer b_{2} has a storage dependency. This means that, starting from point x, increases that leave the size of buffer b_{2} unchanged can never result in a feasible point. We therefore can extend the infeasible point x = (545,265) to (∞,265), resulting in a significant reduction in the search space: the area filled with the pattern is added to U^{−}. The knee points k_{2}, k_{3} and k_{4} are removed, and a new knee point k_{5} = (400,265) is added. This makes k_{1} = (400,200) the knee point with the smallest size in the new situation, and this reduces the maximal error Δ from 578 to 1100 − (401 + 201) = 498.
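The dependency-based extension can be sketched as follows (a simplified illustration; `dep_dims` is assumed to be the set of coordinate indices whose buffers were reported to have a storage dependency):

```python
INF = float("inf")

def extend_infeasible(x, dep_dims):
    """Extend an infeasible point to infinity in every dimension whose buffer
    has no storage dependency: increasing only those buffers can never restore
    feasibility, so the entire extended backward cone is infeasible as well."""
    return tuple(x[i] if i in dep_dims else INF for i in range(len(x)))

# x = (545, 265) is infeasible and only the second buffer (index 1) has a
# storage dependency, so the first coordinate is extended to infinity.
print(extend_infeasible((545, 265), {1}))  # (inf, 265)
```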
Iteration of these steps reduces the gap between U^{−} and S^{+} and also the maximal error Δ, and will eventually find the best feasible storage distribution.
3 CycloStatic Dataflow Graphs
We briefly repeat existing definitions and results concerning CSDF graphs based on [11]. We let \(\mathbb {N}_{0,\infty }\) denote the set \(\mathbb {N} \cup \{0, \infty \}\). Let P be a set of ports, and let rate be a function that assigns a finite sequence (r_{1},r_{2},…,r_{n}) of rates in \(\mathbb {N}\) to each port (the lengths of these sequences may differ among the ports). An actor is a tuple (I,O,T) consisting of input ports I ⊆ P, output ports O ⊆ P with I ∩ O = ∅, and a sequence T = (t_{1},t_{2},…,t_{n}) of execution times.
Definition 1 (CSDF graph)
A CSDF graph is a tuple (A,C) of a set of actors A, and a set of channels C ⊆ P × P such that (i) (p,q) ∈ C implies that p is an output port and that q is an input port, and (ii) all ports are connected to exactly one channel.
The initial state of a CSDF graph is determined by the initial token distribution, which assigns a number (possibly 0) of initial tokens to each channel.
Channels have an unbounded storage space in the semantics. As is commonly done in the literature, we model the finite buffer space of channels C_{buf} ⊆ C by adding, for each channel (p,q) ∈ C_{buf} from actor a ∈ A to actor b ∈ A, a new channel (p_{δ},q_{δ}) from b to a, where p_{δ} and q_{δ} are new ports with rate(p_{δ}) = rate(q) and rate(q_{δ}) = rate(p). The number of initial tokens on (p_{δ},q_{δ}) equals the storage space of the channel (p,q) minus the number of initial tokens on (p,q).
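A minimal sketch of this back-edge construction, using a hypothetical dictionary representation of channels (the field names are illustrative, not part of the formalism):

```python
def add_buffer_channel(channel, capacity):
    """Return the reverse channel (p_delta, q_delta) that models a finite
    buffer of the given capacity for `channel`: the rates are swapped and
    the initial tokens encode the free space in the buffer."""
    p, q = channel["src_port"], channel["dst_port"]
    return {
        "src_port": {"actor": q["actor"], "rate": q["rate"]},  # p_delta on consumer b
        "dst_port": {"actor": p["actor"], "rate": p["rate"]},  # q_delta on producer a
        "tokens": capacity - channel["tokens"],                # free space as tokens
    }

# A channel from actor a to actor b with cyclic rates and 1 initial token.
ch = {
    "src_port": {"actor": "a", "rate": (2, 1)},
    "dst_port": {"actor": "b", "rate": (1, 2)},
    "tokens": 1,
}
back = add_buffer_channel(ch, 5)
print(back["tokens"])  # 4: capacity 5 minus 1 initial token
```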
Definition 2 (Storage distribution)
Let (A,C) be a CSDF graph and let C_{buf} ⊆ C be a set of buffered channels. A storage distribution for C_{buf} is a function \(\delta : C_{\mathit {buf}} \to \mathbb {N}_{0,\infty }\). We let (A_{δ},C_{δ}) denote the CSDF graph with the additional channels that realize the storage constraints,^{1} and assume that it is strongly connected.^{2}
Let δ and δ^{′} be two storage distributions. We say that δ ≼ δ^{′} if and only if δ(c) ≤ δ^{′}(c) for all c ∈ C_{buf}. Since the tokens in different channels may represent data of different sizes, we introduce a cost function \(w : C_{\mathit {buf}} \to \mathbb {N}\) that assigns a (non-zero) cost to each buffered channel. The cost of a storage distribution δ, denoted by ‖δ‖, then is \({\sum }_{c \in C_{\mathit {buf}}} w(c) \cdot \delta (c)\).
Throughput of a CSDF graph is a well-defined concept and algorithms exist to compute it (see [11]). For (A,C) we let \(\xi (A,C) \in \mathbb {R}\) denote its throughput. A throughput constraint \(\mathit {tc} \in \mathbb {R}\) on (A,C) gives a lower bound on the necessary throughput. We say that a storage distribution δ is feasible if and only if the throughput constraint is satisfied, i.e., ξ(A_{δ},C_{δ}) ≥ tc. A useful property of the CSDF formalism is that throughput, and thereby feasibility of storage distributions, is monotone with respect to buffer sizes.
Lemma 1
[11] Let (A,C) be a CSDF graph, and let δ,δ^{′} be storage distributions such that δ^{′}≼ δ. Then \(\xi (A_{\delta ^{\prime }},C_{\delta ^{\prime }}) \le \xi (A_{\delta },C_{\delta })\).
A key contribution of [11] is the concept of storage dependencies and we refer the reader to [11] for the precise definition. Analysis of the selftimed execution of (A_{δ},C_{δ}) is used to compute a set Dep_{δ} ⊆ C_{buf} of buffered channels that have a storage dependency. The throughput of a CSDF graph cannot be increased without increasing the capacity of at least one such a channel. In Section 2, we have sketched how this can be used to reduce the search space (cutting off the area filled with the pattern in Fig. 1). The following lemma and corollary formalize this.
Lemma 2
[11] Let (A,C) be a CSDF graph and let δ and δ^{′} be storage distributions such that δ ≼ δ^{′} and \(\xi (A_{\delta }, C_{\delta }) < \xi (A_{\delta ^{\prime }}, C_{\delta ^{\prime }})\). Then there is a channel c ∈Dep_{δ} such that δ(c) < δ^{′}(c).
The following corollaries follow from these lemmas. The first one states that increasing buffers that have no storage dependency does not increase the throughput.
Corollary 1
Let (A,C) be a CSDF graph and let δ be a storage distribution. For every storage distribution δ^{′} holds: if for all c ∈Dep_{δ} we have that δ^{′}(c) ≤ δ(c), then \(\xi (A_{\delta ^{\prime }}, C_{\delta ^{\prime }}) \le \xi (A_{\delta }, C_{\delta })\).
Proof
Define δ^{″} by δ^{″}(c) = max(δ(c),δ^{′}(c)) for all c ∈ C_{buf}. Then δ ≼ δ^{″}, and δ^{″}(c) = δ(c) for all c ∈Dep_{δ} by the assumption on δ^{′}. By Lemma 2, \(\xi (A_{\delta }, C_{\delta }) < \xi (A_{\delta ^{\prime \prime }}, C_{\delta ^{\prime \prime }})\) would require a channel c ∈Dep_{δ} with δ(c) < δ^{″}(c), which does not exist; hence \(\xi (A_{\delta ^{\prime \prime }}, C_{\delta ^{\prime \prime }}) \le \xi (A_{\delta }, C_{\delta })\). Since δ^{′}≼ δ^{″}, Lemma 1 gives \(\xi (A_{\delta ^{\prime }}, C_{\delta ^{\prime }}) \le \xi (A_{\delta ^{\prime \prime }}, C_{\delta ^{\prime \prime }}) \le \xi (A_{\delta }, C_{\delta })\). □
The second corollary informally states that if we have an infeasible storage distribution without buffered channels that have a storage dependency, then no feasible storage distribution exists.
Corollary 2
Let (A,C) be a CSDF graph and let δ be an infeasible storage distribution. If Dep_{δ} = ∅, then no feasible storage distribution exists.
Proof
Suppose that a feasible storage distribution δ^{′} exists, which necessarily has a greater throughput than δ. Define δ^{″} as δ^{″}(c) = max(δ(c),δ^{′}(c)) for all c ∈ C_{buf}. Then δ^{′}≼ δ^{″} and therefore \(\xi (A_{\delta ^{\prime }}, C_{\delta ^{\prime }}) \le \xi (A_{\delta ^{\prime \prime }}, C_{\delta ^{\prime \prime }})\) by Lemma 1. Thus, \(\xi (A_{\delta }, C_{\delta }) < \xi (A_{\delta ^{\prime \prime }}, C_{\delta ^{\prime \prime }})\). We also have that δ ≼ δ^{″} and therefore we can apply Lemma 2 to conclude that there must be a channel c ∈Dep_{δ} such that δ(c) < δ^{″}(c). This contradicts that Dep_{δ} = ∅. □
In the remainder of this article, we assume that we have access to a CSDF analysis function analyze that, given a CSDF graph (A,C), a set of buffered channels C_{buf} ⊆ C and a storage distribution δ for C_{buf} returns a tuple (ξ(A_{δ},C_{δ}),Dep_{δ}) where Dep_{δ} ⊆ C_{buf} is the set of channels with a storage dependency in (A_{δ},C_{δ}). The problem that we consider is the following:
Definition 3 (Optimization problem)
Given are a CSDF graph (A,C), buffered channels C_{buf} ⊆ C, a throughput constraint tc, and a cost function \(w : C_{\mathit {buf}} \to \mathbb {N}\). The buffer optimization problem is to find a feasible storage distribution δ such that for any other feasible storage distribution δ^{′} holds that ‖δ‖≤‖δ^{′}‖.
Consider the CSDF graph (A,C) from Fig. 2 again. It has throughput 1.04 ⋅ 10^{− 3}, and we use this as the throughput constraint for (A_{δ},C_{δ}) shown in Fig. 3. Then δ = {c_{1}↦1,c_{2}↦4,c_{3}↦8,c_{4}↦14,c_{5}↦5} has a throughput of 9.19 ⋅ 10^{− 4} and thus is not feasible. The causal dependency analysis gives us that Dep_{δ} = {c_{1},c_{2},c_{3},c_{5}}. From this we can conclude that every buffer valuation {c_{1}↦1,c_{2}↦4,c_{3}↦8,c_{4}↦x,c_{5}↦5} with \(x \in \mathbb {N}_{0,\infty }\) is infeasible. That is, increasing only buffer c_{4} in size, and none of the other buffers, will not lead to a better throughput. We can use this to reduce the search space as explained in Section 2.
In the next section, we formally explain the framework that we use to approach the optimization problem of Definition 3.
4 Monotonic Optimization
We assume the problem setting of Definition 3 and let d = |C_{buf}|. A storage distribution δ is represented by a point x = (x_{1},x_{2},…,x_{d}) in \(\mathbb {N}_{0,\infty }^{d}\) given a bijection index : C_{buf} →{1,2,…,d} as follows: x_{index(c)} = δ(c) for all c ∈ C_{buf}. The cost of x, denoted by ‖x‖, is defined as ‖δ‖. We say that x is feasible if and only if δ is feasible. We use the abbreviation x[i ← v] for the point (x_{1},x_{2},…,x_{i− 1},v,x_{i+ 1},…,x_{d}), i.e., the ith element of x is replaced by v. In the remainder of this article, we assume that the sets D, E, K, S, and U all are subsets of \(\mathbb {N}_{0,\infty }^{d}\), and that k, s, u, q, x, x^{′}, y, y^{′}, z, and z^{′} all are elements of \(\mathbb {N}_{0,\infty }^{d}\). The set complement operation is assumed to act with respect to the universe \(\mathbb {N}_{0,\infty }^{d}\), i.e., \(\overline {U} = \mathbb {N}_{0,\infty }^{d} \setminus U\).
Definition 4 (Knee points)
The set of knee points (knees) of a finite set \(U \subseteq \mathbb {N}_{0,\infty }^{d}\), denoted by knee(U), is the set \(\mathit {knee}(U) = \mathit {min}(\overline { U^=} )\).
The following corollary states an equivalent characterization of knee points, which we use further below.
Corollary 3
K = knee(U) if and only if K is minimal and \(K^{+} = \overline {U^=}\), i.e., \(K^{+} \cup U^= = \mathbb {N}_{0,\infty }^{d} \land K^{+} \cap U^= = \emptyset \).
Proof
 (⇒)
By definition, \(K = \mathit{min}(\overline{U^=})\) is minimal. Since U^{=} is downward closed, \(\overline{U^=}\) is upward closed, and because the order ≼ is well-founded on \(\mathbb {N}_{0,\infty }^{d}\), every point of \(\overline{U^=}\) has a minimal point of \(\overline{U^=}\) in its backward cone. Hence \(K^{+} = \overline{U^=}\).
 (⇐)
We have that \(K^{+} = \overline {U^=}\). Thus, \(\mathit {min}(K^{+}) = \mathit {min}(\overline {U^=})\). Because clearly min(K^{+}) = min(K), we have by the assumption that K is minimal that min(K^{+}) = K and thus \(K = \mathit {min}(\overline {U^=}) =\mathit {knee}(U)\).
The next corollary states that under a specific condition, the union of the strict forward cones of the knees K is equal to the complement of the union of the backward cones of the unsat points U. The extra condition on U is needed because the hyperplane lower boundaries (i.e., the points in which at least one dimension equals 0) by definition are not part of K^{++} and hence should be included in U^{−}. This extra condition is true for at least every set U for which it holds that {∞^{d}[k ← 0] ∣ 1 ≤ k ≤ d}⊆ U^{−}. In the context of buffer sizing this makes perfect sense, as it models the situation in which storage distributions that have buffers of size 0 are infeasible.
Corollary 4
Let K = knee(U) and let U be such that for every point x∉U^{−}, there is some point y ∈ U^{−} such that y ∈ x^{=}. Then \(K^{++} = \overline {U^{-}}\).
Proof
(\(K^{++} \subseteq \overline {U^{-}}\)) Let x ∈ K^{++}; we need to show that x∉U^{−}. By definition of K, \(x \in \overline {U^=}^{++}\). Hence, there is some \(y \in \overline {U^=}\) such that y ∈ x^{=}. We have that \(y \in \overline {U^=}\), so y∉U^{=} and therefore, for any u ∈ U, y∉u^{=} (1). Assume towards a contradiction that there is some z ∈ U such that x ∈ z^{−}. From y ∈ x^{=} and x ∈ z^{−} it follows that y ∈ z^{=}, which contradicts (1).
(\(\overline {U^{-}} \subseteq K^{++}\)) Let \(x \in \overline {U^{-}}\), i.e., x∉U^{−}. We need to show that x ∈ K^{++}, i.e., that \(x \in \overline {U^=}^{++}\). Let z_{0} be such that z_{0} ∈ U^{−} and z_{0} ∈ x^{=}; here we use the additional assumption to ensure it exists. If z_{0}∉U^{=}, we have some z∉U^{=} with z ∈ x^{=}, thus \(x \in \overline {U^=}^{++}\) and we are done. Otherwise, z_{0} ∈ U^{=}, x∉U^{−} and z_{0} ∈ x^{=}. Hence, there must be some z_{1} such that z_{1}≠z_{0}, \(z_{0} \in z_{1}^{-}\), z_{1} ∈ U^{−}, and z_{1} ∈ x^{=}. Again, if z_{1}∉U^{=} we are done. Otherwise we continue similarly with z_{2}, z_{3}, et cetera. Eventually, we must find a z_{k} such that z_{k}∉U^{=}, because x − z_{k} decreases in every step and cannot go negative. □
Finally, the following corollary formalizes that the knee points are part of the backward cone of the generating set (see, e.g., Fig. 1).
Corollary 5
Let K = knee(U). If U≠∅, then K ⊆ U^{−}.
Proof
Suppose that U≠∅ and K⫅̸U^{−}. Then we have that ¬∀_{k∈K}∃_{u∈U}k ∈ u^{−}, i.e., ∃_{k∈K}∀_{u∈U}k∉u^{−}. Consider such a k. We distinguish two cases. First, k = 0^{d}. Because we have assumed that U≠∅, there is at least one u ∈ U, and clearly k ∈ u^{−}, which is a contradiction. Second, k≠ 0^{d}. Define the point x such that x_{i} = max(k_{i} − 1,0) for 1 ≤ i ≤ d. In that case, x≠k and clearly x∉k^{+} and also x∉K^{+} because K consists of minimal points. Furthermore, x∉u^{−} since k∉u^{−} for all u ∈ U by assumption. Therefore also x∉U^{=}. Thus, we have that x∉K^{+} and x∉U^{=}, which contradicts Corollary 3. Hence, K ⊆ U^{−}. □
The incremental computation of the knee points when a new point is added to U is formalized by the following two lemmas and Theorem 1 on knee generation.
Lemma 3 (Extension completeness)
If k ∈ x^{−}, then k^{+} ∖ x^{=} = {k[i ← x_{i}] ∣ 1 ≤ i ≤ d}^{+}.
Proof
First, let z ∈ {k[i ← x_{i}] ∣ 1 ≤ i ≤ d}^{+}. Then z ∈ k[i ← x_{i}]^{+} for some dimension i. We let q = k[i ← x_{i}] and thus have z ∈ q^{+}. By definition we have z_{j} ≥ q_{j} for all 1 ≤ j ≤ d. We also have that q_{j} ≥ k_{j} for 1 ≤ j≠i ≤ d and q_{i} ≥ x_{i} ≥ k_{i}. Thus, z_{j} ≥ k_{j}, which is to say that z ∈ k^{+}, but also z_{i} ≥ x_{i}, which implies that z∉x^{=}.
Second, let z ∈ k^{+} ∖ x^{=}. This means that z_{j} ≥ k_{j} for all 1 ≤ j ≤ d and some i exists such that z_{i} ≥ x_{i}. Let q = k[i ← x_{i}]. Then we have q_{j} = k_{j} for all 1 ≤ j≠i ≤ d and q_{i} = x_{i} ≥ k_{i}. Thus, z_{j} ≥ q_{j} for 1 ≤ j ≤ d, and z ∈ q^{+}. Therefore, z ∈ {k[i ← x_{i}] ∣ 1 ≤ i ≤ d}^{+}. □
The second lemma is a generalization of Lemma 3 and is the basis for our knee computation method. It formalizes (and generalizes to an arbitrary number of dimensions) the situation sketched in Fig. 4.
Lemma 4
Consider a point x and a set K such that x ∈ K^{+}. Then K^{+} ∖ x^{=} = ((K ∖ D) ∪min(E))^{+} where D = K ∩ x^{−} and E = {y[i ← x_{i}] ∣ y ∈ D ∧ 1 ≤ i ≤ d}.
Proof
Note that this proof first treats the term min(E)^{+} as E^{+}, and as a last step argues why this is valid.
First suppose that z ∈ K^{+} ∖ x^{=}. Then z ∈ k^{+} for some k ∈ K. We distinguish two cases: (a) k∉x^{−}, and (b) k ∈ x^{−}. In case (a) k is not removed from K through D, thus k ∈ (K ∖ D) ∪ E. Therefore, z ∈ ((K ∖ D) ∪ E)^{+}. In case (b) k is removed as part of D. However, the extensions of D are added through set E, and by Lemma 3 we know that z ∈ ((K ∖ D) ∪ E)^{+}.
Second, suppose that z ∈ ((K ∖ D) ∪ E)^{+}. Then (a) some k ∈ K ∖ D exists such that z ∈ k^{+}, or (b) some k ∈ K and i exist such that z ∈ k[i ← x_{i}]^{+}. In case (a) we thus have that z ∈ k^{+}, k ∈ K and k∉x^{−} and thus z∉x^{−}. Therefore, z ∈ K^{+} ∖ x^{=}. In case (b) we have that z ∈ k[i ← x_{i}]^{+}, k ∈ K and k ∈ x^{−}. Clearly also z ∈ k^{+} because x_{i} ≥ k_{i} and thus z ∈ K^{+}. Furthermore, z∉x^{=} because z_{i} ≥ x_{i}.
Now we have proven that K^{+} ∖ x^{=} = ((K ∖ D) ∪ E)^{+}. It is clear that \((U_{1} \cup U_{2})^{+} = U_{1}^{+} \cup U_{2}^{+}\), and the combination with min(E)^{+} = E^{+} yields the desired result. □
The following theorem is fundamental to our method as it provides a means to compute knee points efficiently.
Theorem 1 (Knee generation)

Let K = knee(U) and let x be a point. Then knee(U ∪{x}) = K^{′}, where:

if x ∈ U^{−}, then K^{′} = K, and otherwise:

K^{′} = (K ∖ D) ∪min(E) where D = K ∩ x^{−} and E = {y[i ← x_{i}] ∣ y ∈ D ∧ 1 ≤ i ≤ d}
Proof
If x ∈ U^{−}, then x^{=} ⊆ U^{=} and hence (U ∪{x})^{=} = U^{=}, so K^{′} = K = knee(U ∪{x}). For the other case, by Corollary 3 it suffices to show that:
 1.
K^{′+} ∩ (U ∪{x})^{=} = ∅,
 2.
\(K^{\prime +} \cup (U \cup \{{x} \})^= = \mathbb {N}^{d}_{0, \infty }\), and
 3.
K^{′} = min(K^{′}).
For item 1, note that x∉U^{−} implies x∉U^{=} and thus, by Corollary 3, x ∈ K^{+}; Lemma 4 then gives K^{′+} = K^{+} ∖ x^{=}. Since K^{+} ∩ U^{=} = ∅ by Corollary 3, and x^{=} is removed explicitly, the intersection with (U ∪{x})^{=} = U^{=} ∪ x^{=} is empty.
For item 2, we have to show – using the same reduction as above – that \((K^{+} \setminus {x}^=) \cup (U \cup \{{x} \})^= = \mathbb {N}^{d}_{0, \infty }\). Using our assumption that K = knee(U), we can derive in a similar way as above that this indeed is the case.
For item 3, notice that K ∖ D is minimal because K is minimal. Furthermore, min(E) is minimal by definition. Note that for e ∈min(E) holds that e ∈ x^{−}. Therefore, such an extension e does not have a point in K ∖ D in its backward cone, which is to say that no point in K ∖ D has e in its forward cone. Furthermore, no extension e ∈min(E) has a point in K ∖ D in its forward cone, because then K would not have been minimal. Therefore, K^{′} is minimal: K^{′} = min(K^{′}). □
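Theorem 1 translates directly into a small knee-update routine. The sketch below is an unoptimized, set-based transcription; `in_U_minus` is a hypothetical membership test for U^{−} that defaults to assuming x is a new, non-dominated point:

```python
def minimal(points):
    """min(P): keep only the points not dominated by another point."""
    pts = set(points)
    return {p for p in pts
            if not any(q != p and all(qi <= pi for qi, pi in zip(q, p))
                       for q in pts)}

def update_knees(K, x, in_U_minus=lambda y: False):
    """One application of Theorem 1: the knees of U ∪ {x}, given K = knee(U)."""
    if in_U_minus(x):
        return set(K)
    d = len(x)
    D = {k for k in K if all(ki <= xi for ki, xi in zip(k, x))}    # D = K ∩ x^-
    E = {y[:i] + (x[i],) + y[i + 1:] for y in D for i in range(d)}  # extensions
    return (set(K) - D) | minimal(E)

# knee(∅) = {(0, 0)}; add two infeasible points in two dimensions.
K = {(0, 0)}
K = update_knees(K, (2, 5))  # -> {(2, 0), (0, 5)}
K = update_knees(K, (4, 3))  # -> {(0, 5), (2, 3), (4, 0)}
print(sorted(K))
```

By item 3 of the proof, the union of the kept knees and the minimized extensions is itself minimal, so no final minimization pass over the whole result is needed.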
The following theorem states that knees of a set of infeasible storage distributions give a (nontight) lower bound on the cost of a feasible storage distribution.
Theorem 2 (Lower bound feasible cost)
Consider a set U of infeasible points such that {∞^{d}[k ← 0] ∣ 1 ≤ k ≤ d}⊆ U^{−} and a feasible point x. Then ‖x‖ ≥ min{‖k + 1^{d}‖ ∣ k ∈knee(U)}.
Proof
Let K = knee(U). First, for every point x∉U^{−} it holds that every buffer has at least size 1, because we assume that {∞^{d}[k ← 0] ∣ 1 ≤ k ≤ d}⊆ U^{−}. We can apply Corollary 4 and obtain \(K^{++} = \overline {U^{-}}\).
Every point in U^{−} is infeasible by Lemma 1 and our representation of the storage distributions. Thus, a feasible point is part of \(\overline {U^{-}}\), which equals K^{++}. Hence, there is some knee k ∈ K such that k ∈ x^{=}, and thus ‖x‖ ≥ ‖k + 1^{d}‖. Therefore ‖x‖ ≥ min{‖k + 1^{d}‖ ∣ k ∈knee(U)}. □
In the next section, we apply the mechanism explained in this section to the optimization problem.
5 Optimization Algorithm
The algorithm consists of three phases. First, lines 1–11 form the initialization phase, in which a first feasible solution is created, starting from the initial storage distribution that gives each buffer a minimal necessary size for deadlock-free execution (see, e.g., [2]). The while-loop iteratively doubles the buffer sizes of the buffers with a storage dependency until it finds a feasible solution. The handleInfeasible function, which is defined in Algorithm 2, updates the set of infeasible points U and the knees K for every infeasible point that is encountered. Note that this function reports an error if an infeasible point is encountered with no storage dependencies, which implies that there is no feasible storage distribution. Also note that lines 5–7 of handleInfeasible apply the additional pruning of the search space using the causal dependency information. This is formalized in the following lemma.
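The initialization phase can be sketched as follows, assuming an `analyze` function with the interface of Section 3 that returns a (throughput, dependencies) pair; the `toy_analyze` model below is purely illustrative:

```python
def find_initial_feasible(delta, analyze, tc):
    """Initialization-phase sketch: double every buffer with a storage
    dependency until the throughput constraint tc is met."""
    throughput, dep = analyze(delta)
    while throughput < tc:
        if not dep:
            raise RuntimeError("no feasible storage distribution exists")
        delta = {c: (2 * s if c in dep else s) for c, s in delta.items()}
        throughput, dep = analyze(delta)
    return delta

def toy_analyze(delta):
    """Purely illustrative stand-in for the CSDF analysis: throughput
    saturates at 1.0 once every buffer has size 8, and the smallest
    buffers are reported as the storage dependencies."""
    smallest = min(delta.values())
    throughput = min(smallest, 8) / 8.0
    dep = {c for c, s in delta.items() if s == smallest} if throughput < 1.0 else set()
    return throughput, dep

print(find_initial_feasible({"c1": 1, "c2": 2}, toy_analyze, 1.0))
```

The exponential doubling keeps the number of analysis calls in this phase logarithmic in the size of the first feasible distribution that is found.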
Lemma 5

Let x represent an infeasible storage distribution δ with storage dependencies Dep = Dep_{δ}, and let y be defined by y_{index(c)} = x_{index(c)} for all c ∈Dep and y_{index(c)} = ∞ for all c ∈ C_{buf} ∖Dep. Then every point in y^{−} represents an infeasible storage distribution.
Proof
Let x represent the infeasible storage distribution δ and consider a storage distribution \(\delta ^{\prime }\) that is represented by a point in y^{−}. By definition, \(\delta ^{\prime }(c) \le \delta (c)\) for all c ∈Dep. Thus, by Corollary 1 we know that the throughput for \(\delta ^{\prime }\) is not greater than the throughput for δ. Therefore, \(\delta ^{\prime }\) is also infeasible. \(\ \Box \)
The second phase of Algorithm 1, lines 12–22, forms the optimization phase, which starts after a first feasible point is found by the initialization phase. This phase iteratively chooses a new point x to analyze and calls handleInfeasible if the point is infeasible, and otherwise adds it to S. The selection function is flexible (hence the … notation in the list of parameters). A requirement is that implementations either return a point that is neither in U^{−} nor in S^{+}, or ⊥ (in case it cannot find a good point to explore). Our current implementation, shown in Algorithm 3, selects points on the line through a knee with minimal cost and the closest point on the cost hyperplane of the best solution so far (see Fig. 1). It starts halfway along the line segment (lines 12–21), and doubles the distance to the knee as long as the point is infeasible and not part of S^{+}, to prune as much of the space as possible (lines 5–11). When the function cannot select a point in the unexplored space between U^{−} and S^{+}, it returns ⊥. This will eventually happen because we work in \(\mathbb {N}^{d}_{0, \infty }\).
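The halfway-point selection can be sketched as follows (a simplified version of Algorithm 3 that only computes the initial halfway point and ignores the doubling and the S^{+} checks; `None` stands in for ⊥):

```python
def select_point(knee, weights, best_cost):
    """Return the integer point halfway between a minimal-cost knee and the
    cost hyperplane {y : sum(w_i * y_i) = best_cost}, moving from the knee
    along the normal of the hyperplane; None stands in for ⊥."""
    knee_cost = sum(w * k for w, k in zip(weights, knee))
    if knee_cost >= best_cost:
        return None  # nothing can be gained beyond the best solution so far
    # The closest hyperplane point is knee + t * w with
    # t = (best_cost - knee_cost) / sum(w_i^2); take half that step.
    t = (best_cost - knee_cost) / sum(w * w for w in weights)
    return tuple(round(k + 0.5 * t * w) for k, w in zip(knee, weights))

# Running example of Section 2: knee k_2 = (400, 120), unit weights, best cost 1100.
print(select_point((400, 120), (1, 1), 1100))  # (545, 265)
```

With unit weights this reproduces the point x = (545,265) selected in the example of Section 2.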
The third phase of Algorithm 1, the final enumeration phase, starts in line 23. It calls the algorithm from [11] with the knee points that have the potential of leading to a feasible point with a cost smaller than the cost of the best point so far. This still is necessary because, in general, a select function may have left some points between U^{−} and S^{+} that may give a better solution than the best one we have found so far.
Invariant 1
At lines 10 and 21 of Algorithm 1 it holds that (i) U^{−} only contains infeasible points, (ii) S^{+} only contains feasible points, and (iii) K = knee(U).
Proof
Straightforward using Lemma 5 and Theorem 1.
\(\ \Box \)
Theorem 3
Algorithm 1 solves the optimization problem of Definition 3.
Proof
The initialization phase in fact is a greedy version of the algorithm in [11] that takes exponentially growing steps in the direction of the storage dependencies. Therefore, if a feasible point exists, then the initialization phase will find one. The conclusion that no feasible solution exists for an empty set of storage dependencies is valid according to Corollary 2. The optimization phase extends the sets U, K and S until the selection function returns ⊥. This happens eventually, because the extension part in lines 5 – 11 eventually will find that y ∈ S in which case a new point is selected in lines 12 – 21. If U and S are sufficiently close, then, due to the fact that we have a discrete search space, we cannot find a point in between. Furthermore, if we find a point in between and process it, then either S comes closer to U (in case of a feasible point), or U comes closer to S (in case of an infeasible point). Invariant 1 ensures that U, K and S are built in a proper way. Finally, the algorithm from [11] is invoked with the still promising knee points as a starting point. These are good starting points because any feasible point must be part of K^{++}, and by correctness of the algorithm in [11] we thus solve the problem of Definition 3. \(\ \Box \)
Note that the algorithm can also be interrupted; in that case the optimization and enumeration phases are stopped or skipped, and the best result so far x and the maximal cost error Δ = ‖x‖−min{‖k + 1^{d}‖ ∣ k ∈ K} are returned (see Theorem 2). This interruption logic is not shown in Algorithm 1 for readability.
6 Experimental Evaluation
We compare with the state-of-the-art approach of [11] that computes the full buffer-size–throughput trade-off space. Since the optimization problem of this article (see Definition 3) is a slightly more restricted problem, [11] can be used to solve it. In this section, we compare the approaches, because no other reference algorithm exists. We therefore set the throughput constraint to the throughput of the self-timed execution of the graph, which is the highest throughput possible. The approach of [11] terminates as soon as it has analyzed the storage distributions up to and including this self-timed throughput. Earlier results in the algorithm on trade-off points with lower throughput are needed for this, so [11] needs all earlier computations to reach the final trade-off point of the self-timed throughput. This makes the approaches comparable for the case in which we optimize the size of the storage distribution under the constraint that the throughput is maximal, i.e., equal to the self-timed throughput.
We use the following models from the Sdf3 website [1]: an MP3 playback application, an H.263 decoder, a sample-rate converter, and a satellite receiver. These are all SDF models (i.e., CSDF models with constant rates and execution times). The models MRF32, MRF64 and MRF128 model a real-life image-processing application from the healthcare domain, a multi-resolution filter, with different input sizes. The MRF models are rather complex CSDF models in which a number of actors have many different rates due to data dependencies. The cost function that we use gives each buffer a weight of one in each model.
Table 1 Experimental results. SGB08 refers to [11], MBS to Algorithm 1; δ is the step size, n the size of the resulting storage distribution, and Δ the upper bound on the cost error.

| Graph | SDF | CSDF | A | C_buf | δ | SGB08 n | SGB08 # calls | SGB08 Time (s) | MBS Δ | MBS n | MBS # calls | MBS Time (s) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MP3 playback | ✓ | | 3 | 2 | 1 | – | – | – | 0 | 2898 | 34 | 5 |
| | | | | | 10 | 2907 | 1944724 | 8702 | 18 | 2906 | 27 | 4 |
| H.263 decoder | ✓ | | 4 | 3 | 1 | – | – | – | 0 | 8006 | 46 | 13 |
| | | | | | 10 | 8023 | 1223707 | 10751 | 27 | 8029 | 40 | 12 |
| Sample rate | ✓ | | 6 | 5 | 1 | 34 | 16 | 0 | 0 | 34 | 14 | 1 |
| Satellite | ✓ | | 22 | 26 | 1 | 1544 | 38 | 2 | 0 | 1544 | 32 | 4 |
| MRF32 | | ✓ | 21 | 4 | 1 | – | – | – | 0 | 500 | 106 | 31 |
| | | | | | 5 | 503 | 211 | 18 | 16 | 505 | 46 | 10 |
| MRF64 | | ✓ | 21 | 4 | 1 | – | – | – | 0 | 985 | 115 | 272 |
| | | | | | 5 | 993 | 749 | 1002 | 16 | 993 | 57 | 106 |
| MRF128 | | ✓ | 21 | 4 | 1 | – | – | – | 0 | 1962 | 149 | 4974 |
| | | | | | 10 | 1968 | 506 | 8144 | 36 | 1965 | 51 | 1112 |
To compare the performance of the approaches we primarily use the number of throughput analysis calls, and not the running time. The reason is that our prototype implementation has been written in Java and invokes an external Sdf3 executable for each throughput analysis, which has a significant overhead. The approach of SGB08 has, on the other hand, been fully integrated in a single Sdf3 executable. In both approaches the throughput analysis dominates the overall running time, and therefore we use the number of throughput analysis calls as a measure of performance to abstract from implementation details. The running times are, nevertheless, also shown in Table 1, and we expect that the values for a fully integrated MBS implementation will be smaller.
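Counting throughput-analysis calls as the performance measure is easy to reproduce with a thin wrapper around the analysis routine; a minimal sketch (the `analyse` callback is an illustrative stand-in, not the actual Sdf3 interface):

```python
class CountingOracle:
    """Wrap a throughput-analysis function and count its invocations,
    the implementation-independent performance measure used here."""

    def __init__(self, analyse):
        self.analyse = analyse  # underlying throughput analysis
        self.calls = 0          # number of analysis invocations so far

    def __call__(self, distribution):
        self.calls += 1
        return self.analyse(distribution)
```

Any search strategy can then be run against the wrapped oracle and compared purely by the final value of `calls`, abstracting from Java/C++ and process-invocation overheads.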
The results show that both approaches obtain the same size of the storage distribution when an optimal solution is expected (i.e., for a step size of one and a Δ of zero). When a suboptimal solution with a bounded error is accepted, both approaches result in storage distributions of similar size, as expected. The MP3 playback and H.263 decoder models are difficult for SGB08, but easy for our approach. We believe that this is because SGB08 explicitly visits a large part of the search space to achieve the optimal throughput, whereas our approach takes large steps and skips the analysis of many intermediate storage distributions. Both methods show similar performance for the sample-rate converter and satellite receiver models. These models differ from the MP3 playback and H.263 decoder models in that the initial storage distribution, which can be calculated by a fast analysis, is close to the optimal storage distribution with the required throughput. The results also show that our approach scales better than SGB08 for the rather complex CSDF models of the image-processing application from the healthcare domain.
7 Conclusions
We have introduced an algorithm to optimize the storage distribution size given a throughput constraint for CSDF graphs. This algorithm is based on three ingredients: (i) the causal dependency analysis from [10, 11], (ii) principles from the area of monotonic optimization [12, 13], and (iii) the concept of knee points introduced in [7]. A useful property of our algorithm is that, at any time after the initialization phase, it can provide some feasible storage distribution together with an upper bound on the size difference with an optimal feasible storage distribution. The experimental results show that our approach is better suited for buffer minimization under a throughput constraint than the (more general) approach of [10, 11], in the sense that solutions can be obtained with fewer throughput analysis calls.
Our algorithm can in principle be applied to other models of computation and to optimization problems other than buffer sizing in CSDF by removing or adapting the parts concerned with causal dependency analysis (i.e., line 23 in Algorithm 1 and lines 5–7 in Algorithm 2). The only requirement is that the function that defines feasibility is monotone with respect to the optimization parameters (in our case, the throughput is monotone with respect to the buffer sizes; see Lemma 1). The resulting approach would then be closely related to the generic monotonic optimization frameworks presented in [12, 13].
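Before porting the approach to another model of computation, the monotonicity requirement can be sanity-checked by sampling the feasibility function along one parameter; a minimal sketch (the function name `is_monotone` and the predicates are illustrative, not part of the paper's algorithms):

```python
def is_monotone(feasible, points):
    """Check, on sample points of a one-dimensional parameter, the key
    requirement: once a value is feasible, all larger values must stay
    feasible (as throughput does with respect to buffer sizes)."""
    seen_feasible = False
    for p in sorted(points):
        f = feasible(p)
        if seen_feasible and not f:
            return False  # feasibility lost when growing p: not monotone
        seen_feasible = seen_feasible or f
    return True
```

A sampled check like this cannot prove monotonicity, but it quickly exposes predicates (e.g., oscillating ones) for which a monotonic optimization framework would give incorrect bounds.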
Acknowledgements
This research was supported by the ARTEMIS joint undertaking under grant agreement no 621439 (ALMARVI).
References
 1. Sdf3 website. http://www.es.ele.tue.nl/sdf3/.
 2. Adé, M., Lauwereins, R., Peperstraete, J.A. (1997). Data memory minimisation for synchronous data flow graphs emulated on DSP/FPGA targets. In Proceedings of the 34th Annual Design Automation Conference. New York: ACM.
 3. Benazouz, M., Marchetti, O., Munier-Kordon, A., Michel, T. (2010). A new method for minimizing buffer sizes for cyclo-static dataflow graphs. In 8th IEEE Workshop on Embedded Systems for Real-time Multimedia.
 4. Benazouz, M., & Munier-Kordon, A. (2013). Cyclo-static dataflow phases scheduling optimization for buffer sizes minimization. In Proceedings of the 16th International Workshop on Software and Compilers for Embedded Systems. New York: ACM.
 5. Bilsen, G., Engels, M., Lauwereins, R., Peperstraete, J. (1996). Cycle-static dataflow. IEEE Transactions on Signal Processing, 44(2), 397–408.
 6. Lee, E.A., & Parks, T.M. (1995). Dataflow process networks. Proceedings of the IEEE, 83(5), 773–801.
 7. Legriel, J., Le Guernic, C., Cotton, S., Maler, O. (2010). Approximating the Pareto front of multi-criteria optimization problems. In Proceedings of the 16th International Conference on Tools and Algorithms for the Construction and Analysis of Systems. Berlin: Springer.
 8. Moreira, O., Basten, T., Geilen, M., Stuijk, S. (2010). Buffer sizing for rate-optimal single-rate dataflow scheduling revisited. IEEE Transactions on Computers, 59(2), 188–201.
 9. Qian, L.P., Zhang, Y.J., Huang, J. (2009). MAPEL: achieving global optimality for a non-convex wireless power control problem. IEEE Transactions on Wireless Communications, 8(3), 1553–1563.
 10. Stuijk, S., Geilen, M., Basten, T. (2006). Exploring trade-offs in buffer requirements and throughput constraints for synchronous dataflow graphs. In Proceedings of the 43rd Annual Design Automation Conference. New York: ACM.
 11. Stuijk, S., Geilen, M., Basten, T. (2008). Throughput-buffering trade-off exploration for cyclo-static and synchronous dataflow graphs. IEEE Transactions on Computers, 57(10), 1331–1345.
 12. Tuy, H. (2000). Monotonic optimization: problems and solution approaches. SIAM Journal on Optimization, 11(2), 464–491.
 13. Tuy, H., Al-Khayyal, F., Thach, P. (2005). Monotonic optimization: branch and cut methods (pp. 39–78). US: Springer.
 14. Utschick, W., & Brehmer, J. (2012). Monotonic optimization framework for coordinated beamforming in multicell networks. IEEE Transactions on Signal Processing, 60(4), 1899–1909.
 15. Wiggers, M.H., Bekooij, M.J.G., Smit, G.J.M. (2007). Efficient computation of buffer capacities for cyclo-static dataflow graphs. In 44th ACM/IEEE Design Automation Conference.
 16. Xing, C., Ma, S., Zhou, Y. (2015). Matrix-monotonic optimization for MIMO systems. IEEE Transactions on Signal Processing, 63(2), 334–348.
Copyright information
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.