
1 Introduction

In recent years, the emergence and proliferation of cloud computing has provided users with on-demand, redundant, inexpensive and scalable resources [1]. However, along with the convenience of on-demand cloud services, users must pay for the resources they use according to the pay-as-you-go model, and this cost can be substantial for complex and data-intensive applications [2] such as BPM systems [3], which aim to be a “holistic management” approach that satisfies the needs of users in an organization’s business processes and can generate a large variety of datasets. These generated data contain important intermediate or final results of computation, which may need to be stored for reuse and sharing [4]. The fast-growing cloud computing market, with an increasing number of cloud service providers, enables BPM systems to utilize multiple cloud services with different prices for computation, storage and bandwidth resources in flexible ways [5]. An efficient storage strategy that can cut the cost of multi-cloud-based data management in a pay-as-you-go fashion is therefore needed for deploying applications in a multi-cloud computing environment.

Furthermore, data usage is dynamic: some data in an application may be more popular with users at a certain time while other data are less popular, and the usage frequency of data can vary from time to time, as is the case for data in BPM systems [3]. Consequently, a storage strategy that was efficient at an earlier time can degrade. To this end, an efficient algorithm that can generate the minimum cost storage strategy at runtime to keep resource costs low is very important for online data-intensive applications in a multi-cloud environment.

Finding the trade-off among computation, storage and bandwidth costs to achieve the minimum total cost in multiple clouds is a complicated problem [6]. Different cloud service providers have different prices for their resources, and datasets have different resource usages and generation dependencies. Even worse, dynamic data usage frequencies demand that the storage strategy be updated in time to avoid performance degradation. For this problem, our previous work [6] proposed GT-CSB, which can find the optimal storage strategy with the minimum overall cost; however, this approach is impractical at runtime due to its high computation complexity. Therefore, it is necessary to design a highly efficient algorithm that can find the optimal storage strategy at runtime and adjust the data storage status in real time.

In this paper, by studying the intrinsic properties of the minimum cost storage problem, we propose a dynamic programming algorithm that reduces the search space and finds the optimal storage strategy in nearly linear time. We also propose optimizing strategies that allow us to calculate (1) the minimum regeneration cost in O(m²) and (2) the sum overall cost rate of a dataset in O(m), where m is the number of cloud service providers. Extensive experiments show that our algorithm performs very well and scales to large numbers of datasets and cloud service providers.

The remainder of this paper is organized as follows: Sect. 2 discusses the related work. Section 3 analyses the problem and presents some preliminaries. Section 4 introduces the details of the PCE algorithm. Section 5 evaluates the PCE algorithm. Section 6 concludes this paper.

2 Related Work

Resource management in clouds has become a very important research topic, and much work has been done on resource negotiation [7], replica placement [8] and multi-tenancy in clouds. Foster et al. [9] propose the concept of virtual data in the Chimera system, which enables the automatic regeneration of data when needed. Recently, research on data provenance in cloud computing systems has also appeared [10].

Plenty of research has been done on the trade-off between computation and storage. The Nectar system [11] is designed for automatic management of data and computation in data centers, where obsolete data are deleted and regenerated whenever reused in order to improve resource utilization. In [12], the authors first propose a cost-effective strategy based on the trade-off between computation and storage costs. In [13], the authors propose a dynamic on-the-fly minimum cost benchmarking approach that pre-stores calculated results in a specially designed data structure.

As the trade-off among different costs is an important issue in the cloud, some research has already addressed this issue to a certain extent. In [14], Joe-Wong et al. investigate the allocation of computation, storage and bandwidth resources in order to achieve a trade-off between fairness and efficiency. In our prior work [15], we propose the T-CSB algorithm, which can find a trade-off among Computation, Storage and Bandwidth costs (T-CSB). In another prior work [6], we propose the GT-CSB algorithm, which can find a Generic best Trade-off among Computation, Storage and Bandwidth in clouds.

In this paper, to address the above problem, we propose the PCE algorithm, which can efficiently find the generic best trade-off among computation, storage and bandwidth in multiple clouds with a computation complexity of O(n·|cand|·(m² + log|cand|)).

3 Preliminaries

In this Section, we first introduce some preliminaries and then the GT-CSB algorithm.

3.1 Preliminaries

In general, there are two types of data stored in clouds: original data and generated data. In this paper, we only consider generated data.

In this paper, we use the DDG (Data Dependency Graph) [16] to represent the generation relationships among datasets. A DDG [16] is a DAG based on data provenance in applications. Figure 1 depicts a simple DDG, where a node in the graph denotes a dataset and an edge denotes the generation relationship between datasets, e.g., d4 and d6 are needed for the generation of d7. If there exists a path from di to dj in the DDG, we say that di and dj have a generation relationship, and di (dj) is a predecessor (successor) of dj (di), denoted as di → dj, e.g., d1 → d4, d5 → d7.

Fig. 1. A simple Data Dependency Graph (DDG)
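
To make the DDG structure concrete, the following minimal Java sketch (our own illustration; the class and field names are assumptions, not from the paper) represents each dataset as a node that records its direct predecessors, so that the generation relationship di → dj is reachability along those links.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal, illustrative DDG representation: each dataset node records the
// datasets it is directly generated from (its direct predecessors).
class Dataset {
    final int id;
    final List<Dataset> directPredecessors = new ArrayList<>();

    Dataset(int id) { this.id = id; }

    // Generation relationship d_this -> d_other: is "this" a (transitive) predecessor of other?
    boolean isPredecessorOf(Dataset other) {
        for (Dataset p : other.directPredecessors) {
            if (p == this || this.isPredecessorOf(p)) return true;
        }
        return false;
    }
}

public class DdgExample {
    public static void main(String[] args) {
        Dataset d1 = new Dataset(1), d4 = new Dataset(4), d6 = new Dataset(6), d7 = new Dataset(7);
        d4.directPredecessors.add(d1);   // d1 -> d4
        d7.directPredecessors.add(d4);   // d4 and d6 are needed to generate d7
        d7.directPredecessors.add(d6);
        System.out.println(d1.isPredecessorOf(d7));  // true: d1 -> d7
    }
}
```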

In a commercial cloud computing environment, there are generally three basic types of resource cost in the cloud: computation cost, storage cost and bandwidth cost:

$$ \text{Total Resource Cost} = \text{Computation Cost} + \text{Storage Cost} + \text{Bandwidth Cost}. $$

Assumptions:

We assume that the application is deployed across m cloud service providers, denoted as CSP = {c1, c2, …, cm}. Furthermore, we assume there are n datasets in the DDG, denoted as DDG = {d1, d2, …, dn}. Every dataset di ∈ DDG can either be stored with one of the cloud service providers or be deleted.

Denotations:

We use X, Y, Z to denote the computation cost, storage cost and bandwidth cost of datasets respectively. Specifically, for a dataset \( d_{i} \, \in \,DDG \):

\( X_{{d_{i} }}^{{c_{j} }} \) denotes the cost of computing di from its direct predecessors with cloud cj;

\( Y_{{d_{i} }}^{{c_{j} }} \) denotes the storage cost per time unit for storing dataset di with cloud cj;

\( Z_{{d_{i} }}^{{c_{k} ,c_{j} }} \) denotes the cost of transferring dataset di from cloud service provider ck to cj.

\( v_{{d_{i} }} \) denotes the usage frequency of di, which means how often di is accessed.

Definition 1:

In a multi-cloud computing environment, in order to regenerate a deleted dataset, we first need to find its stored provenance dataset(s) and then choose a cloud service provider to regenerate it. We denote the minimum regeneration cost of dataset di as minGenCost(di).

Definition 2:

Cost Rate of a dataset is the average cost spent on this dataset per time unit in clouds. For di ∈ DDG, we denote its Cost Rate as CostR(di), which is:

$$ CostR(d_{i}) = \begin{cases} minGenCost(d_{i}) \times v_{d_{i}}, & d_{i} \text{ is deleted} \\ Y_{d_{i}}^{c_{j}}, & d_{i} \text{ is stored in } c_{j} \end{cases} $$

The Total Cost Rate of a DDG is the sum Cost Rate of all the datasets: \( TCR = \sum\nolimits_{{d_{i} \in DDG}} {CostR\left( {d_{i} } \right)} \).
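
As an illustration of Definition 2 and the Total Cost Rate, the hedged Java sketch below computes CostR for each dataset and sums it into TCR; the array layout, the storageCloud encoding and the pre-computed minGenCost values are our assumptions (minGenCost itself is defined by Eqs. (1)-(2) in Sect. 4).

```java
// Illustrative computation of Definition 2 (CostR) and the Total Cost Rate (TCR).
// minGenCost is assumed to be supplied already (its computation is given later by Eqs. (1)-(2)).
public class CostRateExample {

    // storageCloud[i] = index of the cloud storing d_i, or -1 if d_i is deleted
    static double totalCostRate(double[][] Y, double[] v, double[] minGenCost, int[] storageCloud) {
        double tcr = 0.0;
        for (int i = 0; i < v.length; i++) {
            double costR = (storageCloud[i] >= 0)
                    ? Y[i][storageCloud[i]]      // d_i stored in cloud c_j: pay storage per time unit
                    : minGenCost[i] * v[i];      // d_i deleted: pay regeneration cost per access
            tcr += costR;
        }
        return tcr;
    }

    public static void main(String[] args) {
        double[][] Y = {{0.5, 0.4}, {0.3, 0.35}, {0.8, 0.7}};   // storage cost rate per cloud
        double[] v = {0.2, 0.1, 0.05};                          // accesses per time unit
        double[] minGenCost = {3.0, 5.0, 12.0};                 // assumed pre-computed
        int[] strategy = {0, -1, 1};                            // d1 on c1, d2 deleted, d3 on c2
        System.out.println("TCR = " + totalCostRate(Y, v, minGenCost, strategy));
    }
}
```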

Definition 3:

Storage strategy of a DDG is the storage status of all datasets in the DDG, i.e., whether each dataset is stored, and in which cloud it is stored.

Definition 4:

Minimum cost of a DDG is the minimum Total Cost Rate for storing and regenerating datasets in the DDG, which is denoted as \( \text{TCR}_{\hbox{min} } = \hbox{min} \left( {\sum\nolimits_{{d_{i} \in DDG}} {CostR\left( {d_{i} } \right)} } \right) \).

3.2 GT-CSB Algorithm

The GT-CSB algorithm proposed in our prior work [6] can find the best trade-off among computation, storage and bandwidth costs in multiple clouds. The core idea of GT-CSB is to convert the minimum cost storage problem to a shortest path problem over a Cost Transitive Graph (CTG). In the CTG, for each dataset in the DDG there are m nodes, each representing that the dataset is stored in the corresponding cloud, and two virtual vertices, the start vertex and the end vertex, represent the start and end points of the shortest path problem. For any two vertices belonging to different datasets, there is an edge between them; an edge signifies that the datasets between its two end datasets are deleted, while the end datasets themselves are stored in the corresponding clouds. Each path from the start vertex to the end vertex in the CTG corresponds to a storage strategy of the datasets in the clouds. By carefully setting the edge weights, each representing the sum Cost Rate of the datasets between the end nodes of the edge, we can obtain the minimum cost storage strategy by solving the shortest path problem over this graph; the length of the shortest path corresponds to the minimum Total Cost Rate of the datasets in the DDG.
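
As a rough sketch of this reduction (ours, not the original GT-CSB code), the snippet below computes a shortest path over nodes (dataset i, cloud j) plus virtual start and end nodes; because every edge goes from an earlier to a later dataset, a dynamic program over that order suffices instead of a full Dijkstra run. The EdgeWeight interface and the exact weight semantics are assumptions standing in for the CTG edge weights described above.

```java
import java.util.Arrays;

// edgeWeight(i, j, i2, j2) is assumed to return the sum Cost Rate of datasets d_{i+1}..d_{i2-1}
// (regenerated from provenance (d_i, c_j)) plus the storage cost rate of d_{i2} on c_{j2}.
interface EdgeWeight { double weight(int i, int j, int i2, int j2); }

public class CtgShortestPath {
    // n datasets, m clouds; node (i, j) means "d_i is stored in cloud c_j".
    // i = 0 is the virtual start node, i = n + 1 the virtual end node (its cloud index is a dummy).
    static double minTotalCostRate(int n, int m, EdgeWeight w) {
        double[][] dist = new double[n + 2][m];
        for (double[] row : dist) Arrays.fill(row, Double.POSITIVE_INFINITY);
        dist[0][0] = 0.0;                                       // start node
        for (int i2 = 1; i2 <= n + 1; i2++) {                   // edges only go forward in the DDG,
            for (int j2 = 0; j2 < m; j2++) {                    // so one pass in this order is enough
                for (int i = 0; i < i2; i++) {
                    for (int j = 0; j < m; j++) {
                        if (dist[i][j] == Double.POSITIVE_INFINITY) continue;
                        double d = dist[i][j] + w.weight(i, j, i2, j2);
                        if (d < dist[i2][j2]) dist[i2][j2] = d;
                    }
                }
            }
        }
        return dist[n + 1][0];                                  // shortest path length = minimum TCR
    }

    public static void main(String[] args) {
        // Toy weight: every edge costs 1, so the cheapest path stores no intermediate dataset.
        System.out.println(minTotalCostRate(4, 2, (i, j, i2, j2) -> 1.0));
    }
}
```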

4 PCE Algorithm

In this section, we first introduce our PCE algorithm and some optimizing strategies in detail; then we analyze the complexity of the PCE algorithm.

4.1 Provenance Candidates Elimination (PCE) Algorithm

In this section, we first elaborate on minimum cost dataset regeneration in a multi-cloud environment and a baseline approach for the optimal data storage strategy, and then introduce the details of the PCE algorithm and its optimizations.

Dataset Regeneration with Multiple Clouds.

We use Prov(d) to denote the provenance of dataset d; the provenance of d is the nearest stored predecessor(s) of d and is used to generate d when d is reused. The minimum cost to regenerate a dataset is the minimum cost of generating the dataset from its provenance with multiple clouds, which includes the bandwidth cost for transferring datasets among clouds and the computation cost for regenerating datasets from their predecessors.

Definition 5:

We use \( ver_{{\left( {d_{j} ,c_{k} } \right)d_{i} }}^{{c_{s} }} \) to denote the minimum cost of generating di on cloud cs from its provenance dj which is stored in ck, or simplify it as \( ver_{{d_{i} }}^{{c_{s} }} \) in the context without ambiguity.

Based on this definition, if a provenance di is stored in cloud cs, the minimum generation cost of each deleted successor dataset on each cloud can be computed iteratively as:

$$ \begin{cases} ver_{d_{i+1}}^{c_{k}} = Z_{d_{i}}^{c_{s},c_{k}} + X_{d_{i+1}}^{c_{k}} \\ ver_{d_{j}}^{c_{k}} = \min\nolimits_{h=1}^{m} \left\{ ver_{d_{j-1}}^{c_{h}} + Z_{d_{j-1}}^{c_{h},c_{k}} \right\} + X_{d_{j}}^{c_{k}} \end{cases} $$
(1)

where dj ∈ DDG ∧ di+1 → dj ∧ Prov(dj) = di, and ck ∈ {c1, c2, …, cm}.

Based on Definition 5, the minimum regeneration cost of dj with provenance di is:

$$ minGenCost(d_{j}) = \min\nolimits_{h=1}^{m} \left\{ ver_{d_{j}}^{c_{h}} \right\} $$
(2)
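
The following Java sketch is a direct, illustrative transcription of Eqs. (1) and (2); the array shapes of X and Z and the index conventions are our assumptions.

```java
// Minimum cost of regenerating a deleted dataset d_target from its provenance d_prov
// stored in cloud provCloud, following Eqs. (1)-(2).
// X[i][k]  = computation cost of d_i on cloud c_k;
// Z[i][a][b] = cost of transferring d_i from cloud c_a to cloud c_b.
public class MinGenCost {

    static double minGenCost(int prov, int provCloud, int target, double[][] X, double[][][] Z) {
        int m = X[0].length;
        // ver[k] = minimum cost of having generated the "current" dataset on cloud c_k
        double[] ver = new double[m];
        for (int k = 0; k < m; k++) {                        // first deleted dataset d_{prov+1}
            ver[k] = Z[prov][provCloud][k] + X[prov + 1][k];
        }
        for (int j = prov + 2; j <= target; j++) {           // remaining deleted datasets
            double[] next = new double[m];
            for (int k = 0; k < m; k++) {
                double best = Double.POSITIVE_INFINITY;
                for (int h = 0; h < m; h++) {                // choose where d_{j-1} was generated
                    best = Math.min(best, ver[h] + Z[j - 1][h][k]);
                }
                next[k] = best + X[j][k];
            }
            ver = next;
        }
        double min = Double.POSITIVE_INFINITY;               // Eq. (2): best cloud for d_target
        for (int k = 0; k < m; k++) min = Math.min(min, ver[k]);
        return min;
    }

    public static void main(String[] args) {
        double[][] X = {{0, 0}, {4, 5}, {6, 3}};             // d0 is the stored provenance
        double[][][] Z = {{{0, 1}, {1, 0}}, {{0, 2}, {2, 0}}, {{0, 1}, {1, 0}}};
        System.out.println(minGenCost(0, 0, 2, X, Z));       // regenerate d2 from (d0, c0): 9.0
    }
}
```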

Baseline Algorithm

Lemma 1.

In a linear DDG, if dataset di ∈ DDG is stored in the cloud, then the sum Cost Rate of di’s successors (predecessors) is independent of the storage status of di’s predecessors (successors).

According to the definition and the iterative calculation of the minimum regeneration cost of a dataset in Eqs. (1) and (2), a deleted dataset is computed from its provenance. Since di is stored in the cloud, none of di’s predecessors can be the provenance of di’s successors, so the overall cost of di’s successors is independent of the storage status of di’s predecessors. Likewise, the regeneration or storage cost of di’s predecessors is independent of di’s successors. Hence, if a dataset, e.g. di, is stored in the cloud, we can compute the storage strategies of its predecessors and of its successors independently.

Assume a dataset di is stored in cloud ck. We use di.preCost to represent the minimum total cost of di’s predecessors, and a tuple (di, ck) to represent that dataset di is stored in cloud ck; the storage strategy S of a DDG in multiple clouds is then represented by a set of tuples S = {(di, ck) | di ∈ DDG ∧ ck ∈ CSP ∧ di is stored in ck}. The provenance of di, e.g. dj, together with the cloud where it is stored, e.g. ck, is represented by a tuple di.Prov = (dj, ck).

(Pseudocode listing: Baseline Algorithm)

The Baseline Algorithm starts by creating two virtual nodes d0 and dn+1 as the start dataset and end dataset, respectively (line 1); the two datasets have zero size and zero computation cost and are created only for ease of illustration. For each dataset in the DDG and for dn+1, e.g. di, the Baseline Algorithm computes its minimum preCost and Prov (lines 5–11). After iterating over all datasets, dn+1.preCost is the minimum total cost of all dn+1’s predecessors, which is also the minimum total cost of the DDG; the optimal storage strategy can then be collected by a reverse traversal from dn+1 following Prov (lines 13–17). When computing the preCost and Prov of a dataset, e.g. di, preCost is first initialized as infinity; the Baseline Algorithm then iterates over all di’s predecessors and all CSPs to determine the provenance and its storing cloud, e.g. di.Prov = (dj, ck), that make di.preCost minimum (lines 4–11).

In the Baseline Algorithm, let n be the number of datasets and m be the number of CSPs; minGenCost can be computed in O(m²n). When deciding the Prov of a dataset, the Baseline Algorithm has to iterate over all of its predecessors and all cloud service providers; this procedure can be done in O(m³n³), and since there are n datasets, the final time complexity of the Baseline Algorithm is O(m³n⁴).
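
Below is a hedged reconstruction of the Baseline Algorithm from the description above; it returns only the minimum Total Cost Rate and omits the Prov bookkeeping needed to recover the strategy itself. The array layout and the encoding of the virtual datasets are our assumptions.

```java
// Indices 0 and n+1 are the virtual start/end datasets with all-zero costs.
// X[i][k]: computation cost of d_i on c_k; Y[i][k]: storage cost rate of d_i on c_k;
// Z[i][a][b]: transfer cost of d_i from c_a to c_b; v[i]: usage frequency of d_i.
public class BaselineStrategy {

    static double minimumTotalCostRate(double[][] X, double[][] Y, double[][][] Z, double[] v) {
        int last = v.length - 1;                         // index of the virtual end dataset d_{n+1}
        int m = X[0].length;
        double[] preCost = new double[v.length];         // preCost[0] = 0 for the virtual start
        for (int i = 1; i <= last; i++) {
            preCost[i] = Double.POSITIVE_INFINITY;
            for (int j = 0; j < i; j++) {                // candidate provenance dataset d_j ...
                for (int k = 0; k < m; k++) {            // ... stored in cloud c_k
                    double cost = preCost[j] + Y[j][k];
                    double[] ver = null;
                    for (int p = j + 1; p < i; p++) {    // deleted datasets between d_j and d_i
                        ver = nextVer(ver, p, j, k, X, Z);
                        double best = Double.POSITIVE_INFINITY;
                        for (int h = 0; h < m; h++) best = Math.min(best, ver[h]);
                        cost += best * v[p];             // Eq. (2) weighted by usage frequency
                    }
                    preCost[i] = Math.min(preCost[i], cost);
                }
            }
        }
        return preCost[last];                            // minimum Total Cost Rate of the DDG
    }

    // One step of Eq. (1): costs of generating d_p on every cloud, given provenance (d_j, c_k).
    static double[] nextVer(double[] prev, int p, int j, int k, double[][] X, double[][][] Z) {
        int m = X[0].length;
        double[] ver = new double[m];
        for (int s = 0; s < m; s++) {
            if (prev == null) {                          // d_p is the provenance's direct successor
                ver[s] = Z[j][k][s] + X[p][s];
            } else {
                double best = Double.POSITIVE_INFINITY;
                for (int h = 0; h < m; h++) best = Math.min(best, prev[h] + Z[p - 1][h][s]);
                ver[s] = best + X[p][s];
            }
        }
        return ver;
    }

    public static void main(String[] args) {
        int n = 2, m = 2;                                // two real datasets, two clouds
        double[][] X = {{0, 0}, {4, 5}, {6, 3}, {0, 0}};
        double[][] Y = {{0, 0}, {1, 2}, {2, 1}, {0, 0}};
        double[][][] Z = new double[n + 2][m][m];        // zero transfer cost in this toy example
        double[] v = {0, 0.5, 0.5, 0};
        System.out.println(minimumTotalCostRate(X, Y, Z, v));   // 2.0
    }
}
```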

Provenance Candidates Elimination Strategy.

Based on the definition of minimum regeneration cost in multiple clouds, we find that the more distant the Prov(dj) is from dj, the higher the minimum regeneration cost of dj will be.

Theorem 1:

In a multi-cloud scenario, without loss of generality, suppose there exists an optimal storage strategy S1* for datasets {d1, d2, …, dj}, i.e., \( \sum\nolimits_{i=1}^{j} CostR(d_{i}) \) is minimum under S1*, and suppose the last stored dataset of S1* is dh, stored in cloud cr. Then the last stored dataset and its storing cloud under an optimal storage strategy S2* for datasets {d1, d2, …, dj, dj+1} cannot be (dk, ci) with \( ver_{(d_{h},c_{r})d_{j+1}}^{c_{s}} < ver_{(d_{k},c_{i})d_{j+1}}^{c_{s}} \) for all cs ∈ CSP.

Proof:

Assume the last stored dataset of S2* is (dk, ci) with \( ver_{(d_{h},c_{r})d_{j+1}}^{c_{s}} < ver_{(d_{k},c_{i})d_{j+1}}^{c_{s}} \) for all cs ∈ CSP. We can construct a strategy S3 for {d1, d2, …, dj+1} that uses the same storage strategy as S1* for {d1, d2, …, dj} and deletes dj+1; we show that S3 has a lower sum Cost Rate than S2*. Since \( \sum\nolimits_{i=1}^{j} CostR_{S1^{*}}(d_{i}) \le \sum\nolimits_{i=1}^{j} CostR_{S2^{*}}(d_{i}) \) and \( ver_{(d_{h},c_{r})d_{j+1}}^{c_{s}} < ver_{(d_{k},c_{i})d_{j+1}}^{c_{s}} \) for all cs ∈ CSP, we have \( CostR_{S3}(d_{j+1}) = v_{d_{j+1}} \times \min_{c_{s} \in CSP} ver_{(d_{h},c_{r})d_{j+1}}^{c_{s}} < v_{d_{j+1}} \times \min_{c_{s} \in CSP} ver_{(d_{k},c_{i})d_{j+1}}^{c_{s}} = CostR_{S2^{*}}(d_{j+1}) \). Hence \( \sum\nolimits_{i=1}^{j+1} CostR_{S3}(d_{i}) = \sum\nolimits_{i=1}^{j} CostR_{S1^{*}}(d_{i}) + CostR_{S3}(d_{j+1}) < \sum\nolimits_{i=1}^{j} CostR_{S2^{*}}(d_{i}) + CostR_{S2^{*}}(d_{j+1}) = \sum\nolimits_{i=1}^{j+1} CostR_{S2^{*}}(d_{i}) \), which contradicts the optimality of S2*. Theorem 1 holds.

According to Theorem 1, we propose the following Provenance Candidates Elimination Rules (PCERs).

Consider the Baseline Algorithm, and assume the provenance of a dataset di is di.Prov = (dj, ck). For a successor of di, say dk, the initial provenance candidate set of dk is dk.cand = {(dh, cl) | dh → dk ∧ dh ∈ DDG ∧ cl ∈ CSP}. We can use the following rules to prune the candidate set (a sketch of rule 1’s dominance check is given after the list):

  1. For (dh, cl) ∈ dk.cand, where dh → di, if \( ver_{(d_{h},c_{l})d_{i}}^{c_{s}} > ver_{(d_{j},c_{k})d_{i}}^{c_{s}} \) for all cs ∈ CSP, then (dh, cl) can be eliminated from dk.cand.

  2. For (dh, cl) ∈ dk.cand, where dh → di, if there exists (dh′, cl′) ∈ dk.cand with dh′ → di such that \( \left( d_{h}.preCost + \sum\nolimits_{p=h+1}^{i} \left( \min_{c_{s} \in CSP} \left( ver_{(d_{h},c_{l})d_{p}}^{c_{s}} \right) \times v_{d_{p}} \right) + Y_{d_{h}}^{c_{l}} \right) > \left( d_{h'}.preCost + \sum\nolimits_{p=h'+1}^{i} \left( \min_{c_{s} \in CSP} \left( ver_{(d_{h'},c_{l'})d_{p}}^{c_{s}} \right) \times v_{d_{p}} \right) + Y_{d_{h'}}^{c_{l'}} \right) \) and \( ver_{(d_{h},c_{l})d_{i}}^{c_{s}} > ver_{(d_{h'},c_{l'})d_{i}}^{c_{s}} \) for all cs ∈ CSP, then (dh, cl) can be eliminated from dk.cand.
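
As mentioned above, the following small sketch shows the dominance check behind rule 1; the per-cloud ver values are assumed to be available (in the PCE algorithm they are maintained in the MGC arrays introduced below).

```java
// Rule 1 dominance check: candidate (d_h, c_l) can be eliminated if it is strictly worse
// than the chosen provenance (d_j, c_k) on every cloud c_s.
// verCandidate[s] and verProv[s] are the Eq. (1) values ver^{c_s} for generating d_i
// from the candidate and from the chosen provenance, respectively.
public class EliminationRule1 {

    static boolean dominated(double[] verCandidate, double[] verProv) {
        for (int s = 0; s < verCandidate.length; s++) {
            if (verCandidate[s] <= verProv[s]) return false;  // not worse on cloud c_s: keep it
        }
        return true;                                          // worse on every cloud: eliminate
    }

    public static void main(String[] args) {
        System.out.println(dominated(new double[]{9, 8}, new double[]{7, 6}));  // true
        System.out.println(dominated(new double[]{9, 5}, new double[]{7, 6}));  // false
    }
}
```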

To better illustrate the PCE Algorithm, we first introduce some new data structures:

  • cand is the candidate set used to record the possible provenances of the datasets. In the algorithm, maintaining one cand is sufficient for all datasets because, for example, the reduction of a dataset’s provenance candidates di.cand also applies to di’s successors.

  • (dj, ck).MGC is an array where (dj, ck).MGC[cs] is the value of \( ver_{{\left( {d_{j} ,c_{k} } \right)d_{{i - \text{1}}} }}^{{c_{s} }} \) when dj is stored in cloud ck and di–1 is generated on cloud cs.

  • (dj, ck).sucCost is similar to dj.preCost; it is the sum CostR of the datasets from dj+1 to di−1: \( (d_{j},c_{k}).sucCost = \sum\nolimits_{h=j+1}^{i-1} minGenCost(d_{h}) \times v_{d_{h}} \).

(Pseudocode listing: PCE Algorithm)

In the PCE algorithm, cand is first initialized as {(d0, c0)} (line 4). For each di in the DDG, di.Prov and di.preCost are computed in lines 6–10; after the MGC and sucCost of all candidates are updated (lines 12–16), the PCERs are applied to cand (line 17). Finally, in lines 19–22, the new candidates, i.e., (di, ck) for all ck in CSP, are initialized and added to cand.

For example, in Fig. 2, the provenance of dj is (dh, c2), and cand is now {(dh–1, c1), (dh–1, c2), (dh, c1), (dh, c2), (dj–1, c1), (dj–1, c2), (dh–1, cm), …}, marked with grey and green circles. After the elimination rules are applied, (dh–1, c2) and (dh, c1), marked with grey circles, are deleted from cand. Then, before searching for the Prov of dj+1, (dj, c1), (dj, c2), …, (dj, cm), marked with blue circles, are added to cand.

Fig. 2. DDG with multiple clouds

Incremental Minimum Regeneration Cost and Sum Successors’ Cost.

For the computation of \( \sum\nolimits_{h = j + 1}^{i - 1} {minGenCost\left( {d_{h} } \right) \times v_{{d_{h} }} } \), we propose an incremental computation that contains two parts: the incremental computation of \( minGenCost\left( {d_{h} } \right) \) and of \( \sum\nolimits_{h = j + 1}^{i - 1} {minGenCost\left( {d_{h} } \right) \times v_{{d_{h} }} } \), as illustrated in lines 12–16 of the PCE algorithm.

First, for the computation of \( minGenCost\left( {d_{h} } \right) \), we use the data structure MGC, introduced above, to store the minimum regeneration cost of the successors of a dataset, e.g., (dj, ck).MGC stores the minimum regeneration cost of the successors of dj. In each round, MGC is updated accordingly (line 15).

Second, for the computation of \( \sum\nolimits_{h = j + 1}^{i - 1} {minGenCost\left( {d_{h} } \right) \times v_{{d_{h} }} } \), similar to the incremental computation of \( minGenCost\left( {d_{h} } \right) \), we use sucCost, introduced above, to store the sum cost rate of the successors of a dataset, e.g., dj. In each round, sucCost is updated accordingly (line 16).
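
The sketch below illustrates one round of this incremental update for a single candidate; the field names and index conventions are our assumptions.

```java
// One round of the incremental update (lines 12-16 of the PCE algorithm as described):
// when the algorithm advances from d_{i-1} to d_i, each candidate provenance carries its
// Eq. (1) values forward by one dataset and accumulates the new dataset's cost rate.
public class CandidateUpdate {

    static final class Candidate {
        double[] mgc;       // mgc[s] = ver^{c_s} for the most recently processed dataset
        double sucCost;     // sum of minGenCost(d_h) * v_h over datasets after the candidate
    }

    // X[s]: computation cost of d_i on cloud c_s; Zprev[h][s]: transfer cost of d_{i-1}
    // from c_h to c_s; vi: usage frequency of d_i.
    static void advance(Candidate c, double[] X, double[][] Zprev, double vi) {
        int m = X.length;
        double[] next = new double[m];
        double minGen = Double.POSITIVE_INFINITY;
        for (int s = 0; s < m; s++) {
            double best = Double.POSITIVE_INFINITY;
            for (int h = 0; h < m; h++) best = Math.min(best, c.mgc[h] + Zprev[h][s]);
            next[s] = best + X[s];                       // one step of Eq. (1): O(m^2) overall
            minGen = Math.min(minGen, next[s]);          // Eq. (2) for the new dataset
        }
        c.mgc = next;
        c.sucCost += minGen * vi;                        // O(m) update of the summed cost rate
    }

    public static void main(String[] args) {
        Candidate c = new Candidate();
        c.mgc = new double[]{4, 6};
        advance(c, new double[]{6, 3}, new double[][]{{0, 2}, {2, 0}}, 0.5);
        System.out.println(c.mgc[0] + " " + c.mgc[1] + " " + c.sucCost);  // 10.0 9.0 4.5
    }
}
```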

4.2 Analyses

In the PCE Algorithm, let n be the number of datasets, m the number of cloud service providers and |cand| the average size of cand. The search for Prov (lines 6–10) can be done in O(|cand|), the incremental update (lines 12–16) in O(|cand|·m²), the elimination rules (line 17) in O(|cand|·m + |cand|·log|cand|), and adding new candidates (lines 26–29) in O(m²), so the overall time complexity of the algorithm is O(n·|cand|·(m² + log|cand|)). The size of cand mainly depends on the computation cost rates and storage cost rates of the datasets and is independent of the number of datasets n. Our experimental results in Sect. 5.2 (Fig. 4(b)) also demonstrate that the size of cand is independent of the number of datasets n.

5 Experiments

Our experiments are conducted on a desktop PC with an Intel(R) Core(TM) i5-4200M CPU and 8 GB of RAM. The algorithm is implemented in Java and runs on Windows.

In real-world applications, generated datasets may vary dramatically in terms of size, generation time, usage frequency and DDG structure. Hence, we randomly generate DDGs with different numbers of datasets, each dataset with a random size from 1 GB to 100 GB. The computation time of each dataset is also random, from 10 h to 100 h, and the usage frequency is random as well, from once per month to once per year. This setting is based on the application scenarios of scientific workflows [16] and BPM systems [3].
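
For reference, a simple generator matching this setting might look as follows; the uniform distributions, the fixed seed and the unit conventions are our assumptions.

```java
import java.util.Random;

// Illustrative generation of random dataset properties for the experiment setting above.
public class RandomDdgGenerator {
    public static void main(String[] args) {
        Random rnd = new Random(42);
        int n = 500;
        double[] sizeGB = new double[n];
        double[] computeHours = new double[n];
        double[] usagePerMonth = new double[n];
        for (int i = 0; i < n; i++) {
            sizeGB[i] = 1 + rnd.nextDouble() * 99;            // 1 GB .. 100 GB
            computeHours[i] = 10 + rnd.nextDouble() * 90;     // 10 h .. 100 h
            usagePerMonth[i] = 1.0 / (1 + rnd.nextInt(12));   // once per month .. once per year
        }
        System.out.printf("d0: %.1f GB, %.1f h, %.3f uses/month%n",
                sizeGB[0], computeHours[0], usagePerMonth[0]);
    }
}
```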

In addition, we randomly generate 10 cloud service providers with different computation, storage and out-bandwidth prices (see Table 1).

Table 1. The pricing models of 10 cloud services providers

Our prior work [6] has thoroughly investigated the minimum cost strategy. Since the algorithm in this paper calculates the same minimum cost strategy as GT-CSB, the effectiveness of the PCE algorithm is not evaluated here.

5.1 Comparison with Existing Algorithms

We first compare the performance of our strategy with GT-CSB. In this experiment, we use 5 randomly generated DDGs with 100 to 500 datasets and 3 cloud service providers with the pricing models listed in Table 1.

The experiment results shown in Fig. 3(a) and (b) demonstrate that our strategy always finishes within 1 s, while the running time of GT-CSB increases rapidly with the number of datasets.

Fig. 3. Comparison of performance with varying settings

In the next experiment, based on the philosophy of our prior work [17], we devise a method that derives a localized minimum cost instead of a global one. The method divides the DDG into several blocks of the same size and uses the algorithm to find a locally optimal storage strategy for each block. We use a DDG with 500 datasets and divide it into blocks of different sizes. Figure 3(c) demonstrates the speed-up of the GT-CSB algorithm with small block sizes; however, it is still not as efficient as the PCE algorithm.

5.2 Evaluation of PCE with Varying Settings

We then evaluate the efficiency of our algorithm with varying numbers of cloud service providers.

We use the same datasets as in the above experiment, but gradually increase the number of cloud service providers; all cloud service providers are summarized in Table 1. As can be seen in Fig. 4(a), the running time of our algorithm increases slowly as the number of datasets or the number of cloud service providers increases. Compared with existing work, thanks to the pruning effect of provenance candidates elimination and the incremental computation, our algorithm completes in near-linear time in terms of the number of datasets; hence, even with 10 cloud service providers and 500 datasets, we obtain the result in approximately 50 ms.

Fig. 4. Evaluation with varying settings

We demonstrate the effect of the provenance elimination strategy by studying the average number of candidates with a varying number of datasets (100–500). The number of candidates indicates how many checks are needed before the optimal provenance of a dataset is found, which is a key factor in the algorithm’s efficiency. In this experiment, we summarize the average number of candidates with 3 and 10 cloud service providers separately, as shown in Fig. 4(b). With a varying number of datasets, the average number of candidates remains almost constant, which demonstrates that the number of candidates is independent of the number of datasets.

6 Conclusions and Future Work

In this paper, we proposed a provenance elimination strategy that identifies a small set of possible optimal provenances and thus reduces the search space. In addition, we proposed incremental computations that speed up the algorithm considerably. The experimental results show that the running time of our algorithm is significantly reduced compared to that of the GT-CSB algorithm and that our algorithm scales well even when the number of datasets is very large.

In our current work, we only consider datasets with a linear DDG. However, in the real world, dependencies between datasets can be very complex; they may contain blocks, sub-blocks and crossed blocks, for which the data storage strategy can be very hard to obtain. Furthermore, extra cost might be caused by the “vendor lock-in” issue among different cloud service providers, the large number of requests from input/output (I/O) intensive applications, etc. In the future, we will consider complex DDGs and incorporate more complex pricing models into our dataset storage and regeneration cost model.