In the following, we provide a “simple, yet sufficient” evaluation model for the strategies shown in Fig. 2. The model is parameterized with values determined by measurements on the existing prototype of the RPU. Note that a purely selectivity-based predicate ordering is all this model requires.
From the table size \(s_{Q_{i},\textit{table}}\) of each query \(Q_{i}\) and the scan rate \(r_{\textit{scan}}\), we can calculate the scanning time as
$$t_{Q_{i},\textit{scan}}=\frac{s_{Q_{i},\textit{table}}}{r_{\textit{scan}}}.$$
(1)
The reconfiguration time \(t_{r}\) is based on the size of the PRs of our RPU. With a (constant) data rate \(r_{\textit{acc}}\) of the accelerators, we can calculate the processing time of the first accelerator in the standard plan S as
$$t_{S,Q_{0},\textit{acc}_{0}}=\frac{s_{Q_{0},\textit{table}}}{r_{\textit{acc}}}.$$
(2)
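To make these per-step quantities concrete, the following Python sketch expresses Eqns. (1) and (2) as plain functions; the function and parameter names mirror the symbols above and are otherwise our own.

```python
def scan_time(s_table, r_scan):
    """Eqn. (1): time to scan a table of size s_table at the scan rate r_scan."""
    return s_table / r_scan

def acc_time(s_input, r_acc):
    """Eqn. (2): processing time of an accelerator that consumes an input of
    size s_input at the constant accelerator data rate r_acc."""
    return s_input / r_acc
```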
According to our assumptions, all our accelerators are filters, so they reduce the size of the table by a selectivity factor of \(f_{Q_{i},\textit{acc}_{j}}\). This leads to an intermediate table of size
$$s_{S,Q_{0},\textit{intermediate}}=s_{Q_{0},\textit{table}}\times f_{Q_{0},\textit{acc}_{0}}$$
(3)
and a final result size of
$$s_{S,Q_{0},\textit{result}}=s_{S,Q_{0},\textit{intermediate}}\times f_{Q_{0},\textit{acc}_{1}}.$$
(4)
For query \(Q_{1}\) with only one accelerator in use, we get
$$s_{Q_{1},\textit{result}}=s_{Q_{1},\textit{table}}\times f_{Q_{1},\textit{acc}_{0}}.$$
(5)
This will not change in the other plans. The results of both queries are transferred to the host at a constant network data rate of \(r_{\textit{network}}\), resulting in the transfer times:
$$t_{S,Q_{0},\textit{trans}}=\frac{s_{S,Q_{0},\textit{result}}}{r_{\textit{network}}}\quad t_{Q_{1},\textit{trans}}=\frac{s_{Q_{1},\textit{result}}}{r_{\textit{network}}}.$$
(6)
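Analogously, the following sketch captures Eqns. (3)–(6): each filter accelerator scales its input by its selectivity factor, and results are shipped to the host at the network rate. Names are again illustrative.

```python
def filter_output_size(s_input, selectivity):
    """Eqns. (3)-(5): a filter accelerator reduces its input of size s_input
    by the selectivity factor, yielding an output of size s_input * selectivity."""
    return s_input * selectivity

def transfer_time(s_result, r_network):
    """Eqn. (6): time to transfer a result of size s_result to the host
    at the constant network data rate r_network."""
    return s_result / r_network
```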
Overall, we get an execution time for the standard plan S of:
$$\begin{aligned}t_{S,Q_{0}} & =\max(t_{r},t_{Q_{0},\textit{scan}})+t_{S,Q_{0},\textit{acc}_{0}}+t_{r}+t_{S,Q_{0},\textit{acc}_{1}}+t_{S,Q_{0},\textit{trans}}\end{aligned}$$
(7)
$$\begin{aligned}t_{S,Q_{1}} & =\max(t_{r},t_{Q_{1},\textit{scan}})+t_{Q_{1},\textit{acc}_{0}}+t_{Q_{1},\textit{trans}}\end{aligned}$$
(8)
$$\begin{aligned}t_{S} & =t_{S,Q_{0}}+t_{\textit{gap},Q_{0}}+t_{S,Q_{1}}.\end{aligned}$$
(9)
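The standard plan S can then be sketched as a single function that combines the per-step formulas into Eqns. (7)–(9); all parameter names are illustrative.

```python
def plan_S_time(s_q0, s_q1, f_q0_acc0, f_q0_acc1, f_q1_acc0,
                t_r, r_scan, r_acc, r_network, t_gap_q0):
    """Total execution time of the standard plan S (Eqns. (7)-(9))."""
    # Q0: the scan overlaps with the first reconfiguration, Eqn. (7)
    t_q0_scan = s_q0 / r_scan                          # Eqn. (1)
    t_q0_acc0 = s_q0 / r_acc                           # Eqn. (2)
    s_q0_intermediate = s_q0 * f_q0_acc0               # Eqn. (3)
    t_q0_acc1 = s_q0_intermediate / r_acc
    s_q0_result = s_q0_intermediate * f_q0_acc1        # Eqn. (4)
    t_q0_trans = s_q0_result / r_network               # Eqn. (6)
    t_s_q0 = max(t_r, t_q0_scan) + t_q0_acc0 + t_r + t_q0_acc1 + t_q0_trans

    # Q1: single accelerator, Eqn. (8)
    t_q1_scan = s_q1 / r_scan
    t_q1_acc0 = s_q1 / r_acc
    s_q1_result = s_q1 * f_q1_acc0                     # Eqn. (5)
    t_q1_trans = s_q1_result / r_network
    t_s_q1 = max(t_r, t_q1_scan) + t_q1_acc0 + t_q1_trans

    # Eqn. (9): both queries plus the gap between them
    return t_s_q0 + t_gap_q0 + t_s_q1
```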
In strategy I, we save the reconfiguration time needed for \(Q_{1}\) and execute the table scan for \(Q_{1}\) as soon as possible. So the execution time for \(Q_{1}\) is reduced to
$$t_{I,Q_{1}}=t_{Q_{1},\textit{acc}_{0}}+t_{Q_{1},\textit{trans}}.$$
(10)
However, \(Q_{0}\) now requires a larger transfer time \(t_{I,Q_{0},\textit{trans}}\) as well as the time \(t_{I,Q_{0},\textit{DBMS}}\) for DBMS post-processing. The calculation therefore changes to:
$$t_{I}=\max(\max(t_{r},t_{Q_{0},\textit{scan}})+t_{I,Q_{0},\textit{acc}_{0}}+t_{I,Q_{0},\textit{trans}}+t_{I,Q_{0},\textit{DBMS}}+t_{\textit{gap},Q_{0}},t_{Q_{0},\textit{scan}}+t_{Q_{1},\textit{scan}})+t_{I,Q_{1}}.$$
(11)
The time \(t_{I,Q_{0},\textit{DBMS}}\) for the DBMS operation execution can be derived from the processing rate \(r_{\textit{dbms}}\) of the DBMS, which has been measured on the host.
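A sketch of strategy I (Eqns. (10)–(11)) follows. For illustration, we assume that \(Q_{0}\) runs only its first accelerator on the RPU, so the transferred intermediate result has size \(s_{Q_{0},\textit{table}}\times f_{Q_{0},\textit{acc}_{0}}\) and the DBMS post-processing time is that size divided by \(r_{\textit{dbms}}\); the text only states that the transfer is larger and that \(t_{I,Q_{0},\textit{DBMS}}\) is derived from \(r_{\textit{dbms}}\), so these two expressions are assumptions.

```python
def plan_I_time(s_q0, s_q1, f_q0_acc0, f_q1_acc0,
                t_r, r_scan, r_acc, r_network, r_dbms, t_gap_q0):
    """Total execution time of strategy I (Eqns. (10)-(11))."""
    t_q0_scan = s_q0 / r_scan
    t_q1_scan = s_q1 / r_scan

    # Q0 on the RPU: only the first accelerator is applied (assumption),
    # so a larger intermediate result is shipped to the host ...
    t_q0_acc0 = s_q0 / r_acc
    s_q0_transferred = s_q0 * f_q0_acc0          # assumed transfer size
    t_q0_trans = s_q0_transferred / r_network
    # ... and the remaining predicate is evaluated by the DBMS
    t_q0_dbms = s_q0_transferred / r_dbms        # assumed derivation from r_dbms

    # Q1 needs no reconfiguration and its scan is started early, Eqn. (10)
    t_i_q1 = s_q1 / r_acc + (s_q1 * f_q1_acc0) / r_network

    # Eqn. (11): Q1's accelerated part starts once both its scan and
    # Q0's pipeline (including DBMS post-processing and gap) are done
    return max(max(t_r, t_q0_scan) + t_q0_acc0 + t_q0_trans + t_q0_dbms + t_gap_q0,
               t_q0_scan + t_q1_scan) + t_i_q1
```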
Strategies II and III try to hide the reconfiguration time while avoiding the transfer of unnecessary data. The execution time \(t_{\textit{II}}\) for strategy II is calculated as:
$$t_{\textit{II}}=\max(\max(t_{r},t_{Q_{0},\textit{scan}})+t_{\textit{II},Q_{0},\textit{acc}_{0}}+t_{r}+t_{\textit{II},Q_{0},\textit{acc}_{1}}+\max(t_{\textit{II},Q_{0},\textit{trans}}+t_{\textit{gap},Q_{0}},t_{r}),t_{Q_{0},\textit{scan}}+t_{Q_{1},\textit{scan}})+t_{\textit{II},Q_{1}}$$
(12)
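A corresponding sketch of strategy II (Eqn. (12)), assuming \(Q_{0}\)'s accelerator and transfer times equal those of the standard plan; parameter names are again illustrative.

```python
def plan_II_time(s_q0, s_q1, f_q0_acc0, f_q0_acc1, f_q1_acc0,
                 t_r, r_scan, r_acc, r_network, t_gap_q0):
    """Total execution time of strategy II (Eqn. (12))."""
    t_q0_scan = s_q0 / r_scan
    t_q1_scan = s_q1 / r_scan

    # Q0 uses both accelerators as in the standard plan (assumption)
    t_q0_acc0 = s_q0 / r_acc
    s_q0_intermediate = s_q0 * f_q0_acc0
    t_q0_acc1 = s_q0_intermediate / r_acc
    t_q0_trans = (s_q0_intermediate * f_q0_acc1) / r_network

    # Q1 as in strategy I, Eqn. (10)
    t_q1 = s_q1 / r_acc + (s_q1 * f_q1_acc0) / r_network

    # Eqn. (12): the trailing t_r (presumably the reconfiguration needed
    # for Q1) is overlapped with Q0's transfer and the gap
    return max(max(t_r, t_q0_scan) + t_q0_acc0 + t_r + t_q0_acc1
               + max(t_q0_trans + t_gap_q0, t_r),
               t_q0_scan + t_q1_scan) + t_q1
```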
with \(t_{\textit{II},Q_{1}}=t_{I,Q_{1}}\). In the case of strategy III, we obtain:
$$t_{\textit{III}}=\max(\max(t_{r},t_{Q_{0},\textit{scan}})+t_{\textit{III},Q_{0},\textit{acc}_{1}}+t_{r}+t_{\textit{III},Q_{0},\textit{acc}_{0}}+t_{\textit{III},Q_{0},\textit{trans}}+t_{\textit{gap},Q_{0}},t_{Q_{0},\textit{scan}}+t_{Q_{1},\textit{scan}})+t_{\textit{III},Q_{1}}$$
(13)
with \(t_{\textit{III},Q_{1}}=t_{I,Q_{1}}\).
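Finally, a sketch of strategy III (Eqn. (13)), assuming \(Q_{0}\)'s two accelerators run in reversed order so that \(\textit{acc}_{1}\) processes the full table and \(\textit{acc}_{0}\) the already-reduced intermediate; the final result size is unchanged because the selectivity factors commute.

```python
def plan_III_time(s_q0, s_q1, f_q0_acc0, f_q0_acc1, f_q1_acc0,
                  t_r, r_scan, r_acc, r_network, t_gap_q0):
    """Total execution time of strategy III (Eqn. (13))."""
    t_q0_scan = s_q0 / r_scan
    t_q1_scan = s_q1 / r_scan

    # Q0's accelerators in reversed order (assumption): acc1 sees the full
    # table, acc0 the intermediate reduced by f_q0_acc1
    t_q0_acc1 = s_q0 / r_acc
    s_q0_intermediate = s_q0 * f_q0_acc1
    t_q0_acc0 = s_q0_intermediate / r_acc
    t_q0_trans = (s_q0_intermediate * f_q0_acc0) / r_network

    # Q1 as in strategy I, Eqn. (10)
    t_q1 = s_q1 / r_acc + (s_q1 * f_q1_acc0) / r_network

    # Eqn. (13)
    return max(max(t_r, t_q0_scan) + t_q0_acc1 + t_r + t_q0_acc0
               + t_q0_trans + t_gap_q0,
               t_q0_scan + t_q1_scan) + t_q1
```

Given one parameter set, these functions and plan_S_time above can be evaluated side by side to compare the strategies for a workload, which is the kind of cost-based comparison the model is intended to support.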
To validate our evaluation model, we ran three queries (\(Q_{0},Q_{1},Q_{2}\)) on our FPGA-based prototype. \(Q_{0}\) has a WHERE clause using two columns and a selectivity factor of \(0.1\). \(Q_{1}\) has a WHERE clause using only one column and also a selectivity factor of \(0.1\). \(Q_{2}\) also has a WHERE clause using only one column, but a selectivity factor of \(0.2\). Our prototype has a measured accelerator rate \(r_{\textit{acc}}\) of \(\approx\)1,544 MB/s. The results are shown in Table 1.
As can be seen, the number of columns used in the WHERE clause and the selectivity factor have no impact on the execution time of the particular accelerator. Only the size of the table the accelerator is applied to makes a difference. The subsequent operation works on the reduced dataset of the tuples fulfilling the WHERE clause. Its size and thus the workload of the subsequent operation are determined by the selectivity of the previous operation.
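For illustration, plugging the measured rate into Eqn. 2 for a hypothetical 1 GB input table yields the same accelerator time regardless of which predicate is evaluated:
$$t_{\textit{acc}}=\frac{s_{\textit{table}}}{r_{\textit{acc}}}\approx\frac{1{,}024\ \text{MB}}{1{,}544\ \text{MB/s}}\approx 0.66\ \text{s}.$$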
This evaluation model closely reflects the actual behavior of our system. Our hardware accelerators are built to process incoming data at line rate, and each accelerator must process all incoming data, so processing a dataset of size \(s_{Q_{i},\textit{table}}\) takes the time given in Eqn. 2. Moreover, the cost of predicate evaluation in the accelerator is constant in terms of time and throughput, as explained in Section 2. All in all, our proposed evaluation model can therefore serve as the basis for cost-based plan enumeration, which is, however, not in the scope of this paper but part of our immediate future work.
Table 1 Comparison of measured and modeled accelerator times