1 Introduction

Nowadays, Business Intelligence (BI) is an essential asset for companies that face large volumes of data. The data accumulated over time is a rich source of information for decision makers: it gives them an overview of the company's various activities and helps them make decisions. However, building a BI system requires a significant initial investment, which can be an obstacle to adopting this technology.

On the other hand, cloud computing provides a new service that allows customers to create, maintain and query their data in the cloud using their internet connection. This new service delivery model is an important opportunity for companies because of its financial profitability, computing power and scalability. So, hosting a data warehouse in the cloud seems to be a good solution for BI.

But like any new technology, cloud computing brings its own security risks. Sensitive data is at risk because it is stored at an untrusted host. Consequently, before hosting data in the cloud, the owner must encrypt it with symmetric or asymmetric encryption to ensure confidentiality. Such a scenario is not suitable for a data warehouse because of the high volume of data stored in the warehouse and processed by OLAP queries. Nor can it guarantee the confidentiality of data with respect to the cloud provider, because the provider decrypts the data when processing queries. For this reason, data must be processed in ciphertext form, which homomorphic encryption makes possible. This paper introduces a new model for sharing a data warehouse in a multi-cloud. Our proposition is based on the privacy homomorphism presented in [1]. One serious deficiency of this privacy homomorphism is that it can be broken by known-plaintext attacks. Thus, our contribution is to make this privacy homomorphism more robust and secure using a multi-cloud and perturbation values. In addition, a new technique is proposed to verify the integrity of the data received from the cloud. It should be noted that our aim is not to propose a solution as secure as state-of-the-art encryption algorithms, but rather a technique that provides a considerable level of overall security at an acceptable performance overhead.

This article proceeds as follows: Section 2 surveys related work. Section 3 proposes a novel method for securely hosting and querying a data warehouse in the cloud. Section 4 discusses some theoretical results on security and performance. Finally, the paper ends with a conclusion.

2 Related Works

The standard scenario for outsourcing data to the cloud is to encrypt it with symmetric encryption [2,3,4] or asymmetric encryption [5]. After encrypting the data, the owner sends it to the cloud together with the keys, and it is stored in the provider's database. This scenario is not secure because intruders can break the provider's security system and steal the data along with the keys. Moreover, the owner cannot trust the provider.

Hence the need to process data in ciphertext form. Homomorphic encryption [6] allows computations on ciphertexts without decrypting them first. It is applied to attributes used in the computation of aggregation functions such as sum and average. This technique is computationally expensive for practical use. Order-preserving encryption [7, 8] and multivalued order-preserving encryption (MV-OPE) [9] are used for computations over attributes involved in max and min aggregation functions, or attributes that are compared using relational operators. These solutions are not practical in terms of time complexity and storage overhead. Besides, these encryption techniques and security protocols are not sufficient to protect and process data in the cloud because they rely on trust between the owner and the provider. However, this assumption does not always hold because the owner can disappear one day. That is why researchers have turned their attention to an alternative solution based on multi-cloud.

To maintain confidentiality, multi-cloud schemes such as DSky [10], inercloud [11], and NCloud [12] use symmetric encryption tools and distribute the encrypted data over multiple clouds. CloudStash [13] preserves confidentiality by applying a secret sharing scheme [14] directly to the file. It splits the file into shares and distributes them over multiple clouds in parallel. The problem with this scheme is that it increases the storage overhead.

Several other works have used secret sharing to ensure the confidentiality and availability of data in the multi-cloud. The authors of [15] propose dividing the secret into chunks of data to minimise the volume generated in the case of a data warehouse. This is a practical solution for data warehouses in terms of volume overhead. However, it incurs high time complexity when decrypting the data.

The information dispersal algorithm (IDA) [16] is another strategy that makes data available and secure. The idea of this algorithm is to distribute the data into shares that are individually meaningless, as in secret sharing. The advantage of this procedure is that the size of the final data does not exceed n/m times the size of the original data, where n is the total number of shares and m is the number of shares necessary to reconstruct the original data. Thus, with this algorithm the storage complexity is reduced, but security is weakened.

The authors of [17] highlight the advantages of the residue number system (RNS) and propose a schema, HORNS, based on it. In the same manner as the works based on secret sharing, HORNS divides the data into small chunks using modular arithmetic and stores these residues in a multi-cloud. This is an effective solution in terms of volume overhead and time complexity. Yet, it offers no protection if the cloud providers collude, because the data can then be decrypted immediately.

3 New Schema for Securing Data Warehouse in the Cloud

3.1 Motivation

Our schema is based on the simple privacy homomorphism described in [1]. The privacy homomorphism will be illustrated as it is given in [1]:

Let p and q be two large secret primes and m = pq their product, so that m is difficult to factor.

Consider the set of cleartext data \( T = Z_m \) and the set of cleartext operations \( F = \{ +_m , -_m , \times_m \} \), consisting respectively of addition, subtraction and multiplication modulo m, with m = pq.

Let the ciphertext data set be \( T' = Z_p \times Z_q \). The ciphertext operations \( F' \) are the componentwise versions of those in F.

Define the encryption function ɸ(x) = [x mod p, x mod q]. Given the two prime numbers p and q and the ciphertext xp = x mod p and the ciphertext xq = x mod q, the secret x is decrypted using the Chinese remainder theorem (CRT).

We are motivated to use this privacy homomorphism because it is based on modular arithmetic, as shown by the encryption function ɸ(x). This is attractive in terms of volume overhead, because each value is reduced to small residues, which limits the storage space required. In terms of confidentiality, the data is encrypted with the two prime numbers p and q and computed in the range of m = pq. The decryption function of this schema relies on the Chinese remainder theorem, which is practical because of its reasonable time complexity. Thanks to the homomorphic property of the encryption function, arithmetic operations can be performed on ciphertexts in the cloud.
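The homomorphic property can be checked on a small example. The following is a minimal sketch, assuming toy primes p = 5 and q = 7 (far too small for real use) and hypothetical helper names `phi`, `add_ct` and `mul_ct` that do not appear in [1]:

```python
# Toy parameters for illustration only; real primes would be large and secret.
p, q = 5, 7
m = p * q

def phi(x):
    """Encryption function from the text: phi(x) = [x mod p, x mod q]."""
    return (x % p, x % q)

def add_ct(a, b):
    """Componentwise addition of two ciphertext pairs."""
    return ((a[0] + b[0]) % p, (a[1] + b[1]) % q)

def mul_ct(a, b):
    """Componentwise multiplication of two ciphertext pairs."""
    return ((a[0] * b[0]) % p, (a[1] * b[1]) % q)

x, y = 17, 4
# Operating on ciphertexts gives the ciphertext of the result modulo m.
sum_matches = add_ct(phi(x), phi(y)) == phi((x + y) % m)
product_matches = mul_ct(phi(x), phi(y)) == phi((x * y) % m)
```

Both comparisons hold for any x and y in Zm, because reduction modulo p (or q) commutes with addition and multiplication.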

Hence, it is suggested that using this homomorphic privacy can be a promising solution for hosting data warehouse in the cloud.

3.2 Our Proposition for Hosting Data Warehouse in the Cloud

The following section presents two scenarios for hosting data warehouse in the cloud.

Scenario 1

The simple scenario is to encrypt the data stored in the data warehouse with the encryption function ɸ(x) = [x mod p, x mod q]. The ciphertexts xp = x mod p and xq = x mod q are then sent to the cloud provider together with the modulus m. The two prime numbers p and q are kept secret by the owner. Data stored in the cloud is processed modulo m. In this way, the cloud provider cannot decrypt the data with the modulus m alone, because m is hard to factor, so the data is securely stored in the cloud. Furthermore, thanks to the homomorphic property of modular arithmetic, queries using the arithmetic operations {+, −, ×} are evaluated in the cloud on ciphertexts, without decryption. After processing the query, the provider sends the result to the owner in ciphertext form. The owner decrypts the two chunks of encrypted data received from the cloud with the two secret prime numbers p and q, using the Chinese remainder theorem (CRT). Our proposition is illustrated by the two algorithms Alg-enc and Alg-dec:

figure a

Here Pp = m/p and Pq = m/q; bp is the multiplicative inverse of Pp modulo p and bq is the multiplicative inverse of Pq modulo q. These four constants can be precomputed, stored at the client and kept secret, which reduces the time complexity of Alg-dec.
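The algorithm listing itself is not reproduced in this text. As a rough sketch of what Alg-enc and Alg-dec could look like given the constants defined above, the following Python code may help; the function names `alg_enc`, `precompute` and `alg_dec` are assumptions, and the three-argument `pow` with exponent -1 (Python 3.8+) is used for the modular inverses:

```python
def alg_enc(x, p, q):
    """Encrypt x as the pair of residues (x mod p, x mod q)."""
    return (x % p, x % q)

def precompute(p, q):
    """Compute m, Pp, Pq, bp, bq once; they stay secret at the client."""
    m = p * q
    Pp, Pq = m // p, m // q
    bp = pow(Pp, -1, p)  # multiplicative inverse of Pp modulo p
    bq = pow(Pq, -1, q)  # multiplicative inverse of Pq modulo q
    return m, Pp, Pq, bp, bq

def alg_dec(xp, xq, consts):
    """Recombine the two residues with the Chinese remainder theorem."""
    m, Pp, Pq, bp, bq = consts
    return (xp * Pp * bp + xq * Pq * bq) % m

# Toy parameters for illustration only.
consts = precompute(5, 7)
xp, xq = alg_enc(17, 5, 7)
recovered = alg_dec(xp, xq, consts)
```

With the constants precomputed, each decryption reduces to two multiplications per chunk, one addition and one reduction modulo m.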

To ensure that queries sent to the cloud can be processed, the owner uses query rewriting.

Since the data is encrypted with Alg-enc, each query sent from the client is also encrypted with Alg-enc, and the order of the chunks is preserved.

To maintain the security of our model, a trust tier is created; it can be the administrator of the data warehouse or any trusted party in the company. It acts as a middle tier between the client and the cloud. The owner stores all secret parameters at the trust tier. The trust tier encrypts data with Alg-enc, sends it to the cloud, and keeps the parameters p and q secret. It is also responsible for decrypting data received from the cloud with Alg-dec, using the secret parameters p and q. Another important role of the trust tier is query rewriting: when the client issues a query, the trust tier rewrites it to match the way the data is encrypted in the cloud and sends the rewritten query to the cloud. After receiving the response from the cloud, the trust tier decrypts the result and sends it to the client. Thus, all the secret parameters of our method are securely stored at the trust tier, and the client using the data warehouse has no information about how the data is encrypted and stored in the cloud. In this way, our data warehouse is more secure.

The figure below illustrates our first scenario for hosting and querying data warehouse in the cloud (Fig. 1).

Fig. 1. First scenario for hosting and querying data warehouse in the cloud

Discussion

Our first proposed scenario for hosting a data warehouse in the cloud is to encrypt all the data and store it at a single cloud provider. This solution seems satisfactory in terms of storage overhead and time complexity. Besides, the ability to evaluate addition, subtraction and multiplication on ciphertexts is very important for a data warehouse, because OLAP queries by nature process massive volumes of data.

Unfortunately, this schema can be broken by the cloud provider because it holds both chunks of data and the modulus m. From the two chunks it can infer the two secret parameters p and q. Malicious intruders can also breach the cloud provider's security, obtain the encrypted data and the modulus m, and decrypt the data using the known-plaintext attack described in [18].

There are two factors that threaten the confidentiality of this schema: an internal factor being the cloud provider itself and an external factor being a malicious intruder. Consequently, a second scenario will be suggested which can reduce the risk of breaking the security parameters of our schema using a multi-cloud.

Scenario 2

The authors of [18] argue that this schema can be broken by a known-plaintext attack. To illustrate, they show how a cryptanalyst can infer the data and obtain the secret:

Suppose x is the integer to be encrypted, represented by the pair (xp, xq), where xp = x mod p and xq = x mod q. Assume that the cryptanalyst has plaintext-ciphertext pairs for some data. They let p′ be the gcd of xp − x over all known data and, in the same way, q′ be the gcd of xq − x over all known data. The cryptanalyst then tests whether p = p′ and q = q′; if this is the case, all ciphertexts can be decrypted. They show that, given a ciphertext (xp, xq), the cryptanalyst can find x′ such that x′ ≡ xp mod p′ and x′ ≡ xq mod q′.
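The gcd step of this attack can be illustrated with a short sketch. It assumes toy primes and a hypothetical list of known plaintexts; the helper name `recover_modulus` is not taken from [18]:

```python
from functools import reduce
from math import gcd

def recover_modulus(pairs):
    """Given (plaintext, residue) pairs, return the gcd of all differences.

    Each difference x - (x mod p) is a multiple of p, so their gcd is a
    multiple of p and, with enough pairs, very likely p itself.
    """
    return reduce(gcd, (abs(x - r) for x, r in pairs))

# Secret parameters, unknown to the cryptanalyst (toy values).
p, q = 5, 7

# Hypothetical plaintexts whose ciphertexts the cryptanalyst has observed.
known_plaintexts = [17, 23, 34, 41]

p_guess = recover_modulus((x, x % p) for x in known_plaintexts)
q_guess = recover_modulus((x, x % q) for x in known_plaintexts)
```

In this toy run a handful of known pairs is enough to recover both secret primes, which is why the scenario that follows keeps the two chunks apart.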

It follows that if both chunks (xp, xq), or at least one of them, are kept secret from the cryptanalyst, the probability of inferring the data and breaking the system with the known-plaintext attack is reduced.

To this end, we propose to separate the two chunks of secret data and store each chunk at a different cloud provider. This new model is therefore based on two cloud providers, each of which stores one chunk. In this way, we reduce the probability of inferring data and breaking the encryption function, because each provider holds only one chunk of each value. The internal risk is thus decreased. Likewise, the risk of the system being broken by an external intruder is diminished, because it is difficult for malicious users to breach the security of two cloud providers at the same time and obtain both chunks of secret data. With this new sharing method, our model is more secure in terms of confidentiality towards both the cloud providers and external attacks.

Also, when p, q and m = pq are very large integers, a small value x is very likely to have the same representation over Zm, Zp and Zq; that is, x mod m = x mod p = x mod q if x < min(p, q). This is an undesirable feature, because the homomorphic function ɸ(x) then leaves the cleartext unencrypted (trivial ciphertext). To overcome this drawback, we propose to multiply x by two secret values rp and rq such that rp < p and rq < q.
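A small numeric illustration of the trivial-ciphertext problem and of the proposed perturbation is given below. The primes and the values of rp and rq are arbitrary toy choices made for this sketch only:

```python
# Toy primes and hypothetical secret perturbation values (rp < p, rq < q).
p, q = 101, 103
rp, rq = 37, 58

x = 3  # a small value, x < min(p, q)

# Without perturbation, both residues equal the cleartext itself.
plain_pair = (x % p, x % q)

# With perturbation, the residues no longer expose x directly.
perturbed_pair = (x * rp % p, x * rq % q)
```

Here `plain_pair` is (3, 3), which reveals x, whereas `perturbed_pair` is (10, 71).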

Our new sharing model is based on two initial steps. The first step is data sharing process, and the second step is data reconstruction process.

Data Sharing Process:

  • The trust tier, as introduced in the first scenario, chooses the two secret prime numbers p and q and the two secret values rp and rq such that rp < p and rq < q. It also calculates the modulus m = pq.

  • It assigns each secret prime number to a specific cloud provider. This is very important for maintaining the coherence of secret data.

  • After that, it encrypts the secret data with the homomorphic function ɸ(x):

    $$ \Upphi \left( {\text{x}} \right) \, = \, [{\text{x}} \times {{\text{r}}_{\text{p}}}\,{\text{mod p}},{\text{ x}} \times {{\text{r}}_{\text{q}}}\,{\text{mod q}}] $$
    (1)

    and gets the pair of data (xp, xq) with \( {\text{x}}_{\text{p}} = {\text{ x}} \times {\text{r}}_{\text{p}}\,{\text{mod p}} \) and \( {\text{x}}_{\text{q}} = {\text{ x}} \times {\text{r}}_{\text{q}}\,{\text{mod q}} \).

  • The trust tier computes the signature of each chunk of data with the homomorphic function Hs(A) = A mod B as:

    $$ {\text{sign}}\_{\text{x}}_{\text{p}} = {\text{ H}}_{\text{s}} \left( {{\text{x }} + {\text{ x}}_{\text{p}} } \right) $$
    (2)

    and

    $$ {\text{sign}}\_{\text{x}}_{\text{q}} = {\text{ H}}_{\text{s}} \left( {{\text{ x }} + {\text{ x}}_{\text{q}} } \right) $$
    (3)
  • Finally, the trust tier sends each chunk of data with its signature to the corresponding cloud provider: (xp, sign_xp) to CSPp and (xq, sign_xq) to CSPq.

The scenario of data sharing process is presented in Fig. 2:

Fig. 2. Scenario of data sharing process
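The data sharing steps can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `share` is hypothetical, the parameters are toy values, and the signature modulus B, which the text leaves unspecified, is set to 8 as in the worked integrity example of Sect. 4.1:

```python
def share(x, p, q, rp, rq, B):
    """Split x into two signed chunks, one per cloud provider.

    Follows Eqs. (1)-(3): xp = x*rp mod p, xq = x*rq mod q,
    and sign = Hs(x + chunk) with Hs(A) = A mod B.
    """
    xp = (x * rp) % p
    xq = (x * rq) % q
    sign_xp = (x + xp) % B
    sign_xq = (x + xq) % B
    # The first pair goes to CSPp, the second to CSPq.
    return (xp, sign_xp), (xq, sign_xq)

# Toy parameters for illustration only.
to_csp_p, to_csp_q = share(17, p=5, q=7, rp=3, rq=2, B=8)
```

For x = 17 this produces the pairs (1, 2) and (6, 7), the same values used later in the integrity example.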

Data Reconstruction Process:

  • The trust tier asks each cloud provider for its pair of data, (xp, sign_xp) and (xq, sign_xq).

  • It computes the componentwise product of the pair (xp, xq) by (\( \text{r}_{\text{p}}^{ - 1} \) mod p, \( \text{r}_{\text{q}}^{ - 1} \) mod q) to retrieve (x mod p, x mod q).

  • After that, the trust tier decrypts the data using the Chinese remainder theorem with the two secret parameters p and q and the two chunks of data (x mod p, x mod q).

  • Finally, it verifies the correctness of the data with the two signatures. If sign_xp = Hs(x + xp) and sign_xq = Hs(x + xq), then the data x is correct. In case of errors, the trust tier can ask the CSPs for new pairs.

The scenario of data reconstruction process is presented in Fig. 3:

Fig. 3. Scenario of data reconstruction process
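The reconstruction steps can be sketched in the same spirit. The function name `reconstruct` is hypothetical, B = 8 is an assumed signature modulus, and `pow(a, -1, n)` (Python 3.8+) computes modular inverses:

```python
def reconstruct(pair_p, pair_q, p, q, rp, rq, B):
    """Rebuild x from the two signed chunks and verify both signatures."""
    xp, sign_xp = pair_p
    xq, sign_xq = pair_q

    # Step 1: remove the perturbation to get (x mod p, x mod q).
    up = (xp * pow(rp, -1, p)) % p
    uq = (xq * pow(rq, -1, q)) % q

    # Step 2: Chinese remainder theorem.
    m = p * q
    Pp, Pq = m // p, m // q
    x = (up * Pp * pow(Pp, -1, p) + uq * Pq * pow(Pq, -1, q)) % m

    # Step 3: check the two signatures, Hs(A) = A mod B.
    valid = sign_xp == (x + xp) % B and sign_xq == (x + xq) % B
    return x, valid

# Toy parameters for illustration only.
x, valid = reconstruct((1, 2), (6, 7), p=5, q=7, rp=3, rq=2, B=8)
```

When `valid` is false, the trust tier would discard x and request the pairs again, as described in the last step.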

Example of sharing the integer x = 17:

p = 5; q = 7; rp = 3; rq = 2; m = 35

ɸ(17) = (17 × 3 mod 5, 17 × 2 mod 7) = (1, 6)

So the integer x = 17 is encrypted as the pair of chunks (1, 6).

3.3 Querying Data Warehouse in the Cloud

Our schema can directly support some basic OLAP operations at the CSPs through SQL operations and aggregation functions. For example, simple select-from queries can be applied directly in the cloud. However, when a condition is expressed in a WHERE or HAVING clause, the trust tier must rewrite the query and post-process some operations within the company, because the MOD operator is non-injective. For X MOD Y = Z with Y constant, many different inputs X produce the same output Z (e.g. 17 MOD 5 = 2, 22 MOD 5 = 2, 27 MOD 5 = 2, etc.).

For example, the query “SELECT ProdName FROM Product WHERE UnitPrice = 17” would be transformed into two queries: “SELECT ProdName FROM Product WHERE UnitPrice = 1” at CSPp, where 1 is the share of 17 for p = 5, and “SELECT ProdName FROM Product WHERE UnitPrice = 6” at CSPq, where 6 is the share of 17 for q = 7.

However, 1 is also the result of \( 22 \times 3 \bmod 5 \) and of \( 27 \times 3 \bmod 5 \), etc.

So CSPp (p = 5) will return all the rows whose share is 1, which correspond to UnitPrice = 17, UnitPrice = 22 and UnitPrice = 27.

In the same manner, CSPq (q = 7) will return all the rows whose share is 6, which correspond to UnitPrice = 17, UnitPrice = 24 and UnitPrice = 31.

After that, when the results of the two queries are returned from the clouds, the trust tier must eliminate the spurious rows by a simple join between the two results before decrypting the final result.
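The rewriting and filtering of an equality predicate can be sketched as follows. The in-memory dictionaries stand in for the tables held by the two CSPs, and the table contents, row identifiers and helper names are all assumptions made for this illustration. The join is modelled as an intersection on row identifiers:

```python
# Toy parameters for illustration only.
p, q, rp, rq = 5, 7, 3, 2

# Hypothetical Product table: row id -> UnitPrice (cleartext, owner side).
product = {1: 17, 2: 22, 3: 27, 4: 24, 5: 31, 6: 11}

# What each provider stores: row id -> its share of UnitPrice.
csp_p = {rid: (price * rp) % p for rid, price in product.items()}
csp_q = {rid: (price * rq) % q for rid, price in product.items()}

def rewrite_equals(value):
    """Rewrite 'UnitPrice = value' into one constant per provider."""
    return (value * rp) % p, (value * rq) % q

def select_ids(table, share):
    """Stand-in for 'SELECT ... WHERE UnitPrice = share' at one provider."""
    return {rid for rid, s in table.items() if s == share}

share_p, share_q = rewrite_equals(17)
ids_p = select_ids(csp_p, share_p)   # rows with prices 17, 22, 27
ids_q = select_ids(csp_q, share_q)   # rows with prices 17, 24, 31

# Join on the row identifier: only rows matching at both providers remain.
matching_ids = ids_p & ids_q
```

The intersection removes rows that match at only one provider. Two prices that differ by a multiple of m = pq would still collide at both providers, so this sketch assumes all values are smaller than m.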

This routine works for many comparison operators (=, ≠, EXISTS, IN, LIKE…) and their conjunctions. Arithmetic operations and aggregation functions such as sum, avg and count can be computed in ciphertext by the trust tier after eliminating the spurious rows.
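As an illustration of aggregation over ciphertexts, the sketch below sums the shares of a few values and decrypts only the aggregate. The primes, perturbation values and prices are toy assumptions, and the result is only meaningful as long as the true sum stays below m:

```python
# Toy primes and hypothetical perturbation values.
p, q = 101, 103
rp, rq = 37, 58
m = p * q

prices = [17, 22, 27]  # hypothetical rows kept after filtering
shares_p = [x * rp % p for x in prices]
shares_q = [x * rq % q for x in prices]

# The sum is computed on the shares, never on the cleartext values.
sum_p = sum(shares_p) % p
sum_q = sum(shares_q) % q

def decrypt(sp, sq):
    """Remove the perturbation, then recombine with the CRT."""
    up = sp * pow(rp, -1, p) % p
    uq = sq * pow(rq, -1, q) % q
    Pp, Pq = m // p, m // q
    return (up * Pp * pow(Pp, -1, p) + uq * Pq * pow(Pq, -1, q)) % m

total = decrypt(sum_p, sum_q)
```

A single decryption of the aggregate replaces one decryption per row, which matters for the data volumes typical of OLAP queries.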

But when ordering is necessary, as in ORDER BY clauses and with many comparison operators (>, <, ≥, ≤, BETWEEN…), this routine no longer applies, since the original order is lost when sharing data. Thus, all fetched data must be decrypted and queried at the owner's site by the trust tier before being sent to the client.

4 Security Analysis and Performance Evaluation

4.1 Security Analysis

Confidentiality of Data

The confidentiality of data is our major focus in this paper. As described in Sect. 3.2, distributing the two chunks of data over two cloud providers is a good solution for protecting data from known-plaintext attacks and from malicious cloud providers. The idea is to keep the information about data and parameters held by each cloud provider to a minimum.

The security of our solution rests on the two parameters p and q and the two chunks of data xp and xq.

The trust tier, acting as a middle tier between the user and the cloud, is an effective means of guaranteeing the confidentiality of the two secrets p and q.

The distribution of the two chunks of data is anonymous: a cloud provider cannot tell whether the chunk it processes corresponds to the parameter p or q.

Additionally, storing each of the two chunks xp and xq at a separate cloud provider reduces the risk of inferring the data and breaking the system.

Proof:

If the cloud provider guesses or obtains p (worst case), it can try to predict x as

$$ x = y \times p + x_p \quad \text{such that}\; x < m $$

The probability that the cloud provider finds x correctly is p/m, since there are m/p = q candidate values below m. This probability is low, and the cloud provider also cannot identify whether its chunks correspond to the parameter p or q. This ambiguity further hampers any attempt to infer the data.
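The counting argument behind the probability p/m can be checked numerically. The sketch below uses toy values and a hypothetical helper `candidates`, and follows the proof above in treating the chunk as the plain residue x mod p:

```python
from fractions import Fraction

def candidates(xp, p, m):
    """All x < m of the form x = y*p + xp, i.e. consistent with the chunk."""
    return list(range(xp, m, p))

# Toy parameters for illustration only.
p, q = 5, 7
m = p * q

possible = candidates(2, p, m)            # chunk 2 = 17 mod 5
guess_probability = Fraction(1, len(possible))
```

With these values there are q = 7 candidates, so a uniform guess succeeds with probability 1/7 = p/m; with large primes the candidate set grows accordingly.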

We can therefore conclude that even if the two secret parameters are discovered by a cloud provider, it cannot decrypt the data directly because it does not hold the second chunk. It would have to work through the following system for every value:

$$ x = y \times p + x_p, \quad x = y \times q + x_p \quad \text{such that}\; x < m $$

This is impractical for a volume of data as large as that of a data warehouse.

Similarly, it can be argued that the confidentiality of our schema is improved by this new sharing strategy. Even if a malicious intruder obtains the two secret parameters p and q, it is hard for the intruder to breach the security of both cloud providers and obtain the two chunks of secret data at the same time.

Integrity of Data

In our schema, the verification phase is based on checking two signatures, so an error affecting a chunk can be detected at reconstruction time. Consider the reconstruction of the integer x = 17, with Hs(A) = A mod 8. The trust tier should receive the two pairs (xp, sign_xp) = (1, 2) and (xq, sign_xq) = (6, 7) from the CSPs. Suppose that an error occurs during the transfer from the cloud and the pair (1, 2) is transformed into (2, 2). The trust tier computes:

  • Pp = 7, Pq = 5, bp = 3, bq = 3, xp = 4, xq = 3, m = 35;

  • \( \text{r}_{\text{p}}^{ - 1} \) mod p = \( 3^{-1} \) mod 5 = 2;

  • \( \text{r}_{\text{q}}^{ - 1} \) mod q = \( 2^{-1} \) mod 7 = 4;

  • After that, it computes: \( (2 \times 2 \bmod 5,\; 6 \times 4 \bmod 7) = (4, 3) \);

Using the CRT he can compute:

  • X = 129 mod 35; so X = 24.

After that, the correctness of data is verified as:

  • Hs(24 + 2) = 26 mod 8 = 2, which matches the received sign_xp = 2, and Hs(24 + 6) = 30 mod 8 = 6, which differs from the received sign_xq = 7;

So, the data x = 24 is rejected as incorrect because the signature sign_xq does not match.
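The worked example can be replayed step by step. The sketch below simply mirrors the arithmetic above with the same toy parameters; the variable names are chosen for this illustration:

```python
# Parameters of the worked example.
p, q, rp, rq, B = 5, 7, 3, 2, 8
m = p * q

# Pairs received by the trust tier; xp was corrupted from 1 to 2.
xp, sign_xp = 2, 2
xq, sign_xq = 6, 7

# Remove the perturbation.
up = xp * pow(rp, -1, p) % p   # 2 * 2 mod 5
uq = xq * pow(rq, -1, q) % q   # 6 * 4 mod 7

# Chinese remainder theorem with the precomputed constants.
Pp, Pq = m // p, m // q
bp, bq = pow(Pp, -1, p), pow(Pq, -1, q)
x = (up * Pp * bp + uq * Pq * bq) % m

# Signature checks, Hs(A) = A mod B.
check_p = (x + xp) % B == sign_xp
check_q = (x + xq) % B == sign_xq
accepted = check_p and check_q
```

In this particular example the check on sign_xp happens to pass and only the check on sign_xq fails, which illustrates why both signatures are verified.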

Accordingly, since our solution is based on two verifying phases, this homomorphic function Hs is very secure and it does not reveal any information about the secret data x.

4.2 Performance Evaluation

Volume Overhead

In our solution, there is no significant volume overhead when encrypting the initial data, because our encryption function is based on the MOD operator, which reduces each value to a small residue.

However, this operation is performed twice. Consequently, the volume of data in ciphertext cannot exceed twice the volume of the original data.

Temporal Complexity

Encryption, decryption and homomorphic operations each need only one or two modular operations in our schema.

The encryption phase needs just O(n) operations. The decryption phase is based on the CRT and needs just O(lg lg n) operations.

Comparison of Our Schema to Existing Related Approaches

In this section, we compare our schema with approaches presented in our state of the art with respect to security and performance. Table 1 synthesizes the features of all approaches discussed above.

Table 1. Comparison of database sharing approaches

5 Conclusion

This paper presents an original approach to sharing a data warehouse (DW) in the cloud that simultaneously supports data privacy and OLAP queries with reasonable time complexity and minimal volume overhead. Our proposed solution is based on a homomorphic encryption algorithm that has a serious weakness: it can be broken by known-plaintext attacks. We therefore propose a new way of using this privacy homomorphism, based on multiple cloud providers and perturbation values. With this new sharing schema, we can reduce the risk of the system's security being broken by either cloud providers or malicious intruders. Also, a new technique is proposed to verify the integrity of the data received from the cloud. The weakness of our schema lies in the processing of range queries, which takes place at the owner's site after decrypting all the data. This operation can take a lot of time for a data warehouse because of the huge volume of data that must be decrypted before processing a range query. In future work, we will therefore seek a solution that reduces the time spent in the decryption phase. We also intend to evaluate our schema with a real cloud provider.