1 Introduction

Predictive modeling is an essential tool in decision-making processes in domains such as policy making, medicine, law enforcement, and finance. Consider a hospital that would like to use a cloud service providing predictive analytics to assess patients' conditions, so as to improve the quality of care and reduce costs. Due to ethical and legal requirements, the hospital might be restricted from using such a service [3, 4, 12]. Like the hospital, many organizations are collecting ever-increasing amounts of data to mine for better decision-making and productivity. However, they may lack the computational resources to handle such large-scale data. An attractive business model that addresses this problem is for a service provider with powerful platforms and advanced analytic skills to offer such services: organizations that need computational resources can outsource their tasks to it. However, because the data may contain sensitive information, outsourcing data to public clouds directly raises privacy concerns.

In current implementations, the learning algorithm must see all user data in the clear in order to build the predictive model. In this paper, we consider whether the learning algorithm can instead operate in the encrypted domain, thereby allowing users to retain control of their data. For medical data, this allows a model to be built without compromising user privacy. For data such as book and movie preferences, letting users keep control of their data reduces the risk of future embarrassment in case of a data breach at the service provider.

Roughly speaking, there are three existing approaches to ensuring privacy when a server mines user data. The first lets users split their data among multiple servers using secure multi-party computation [2, 5, 9]; these servers then run the learning algorithm via a distributed protocol, and privacy is assured as long as a majority of the servers do not collude. The second is based on differential privacy, where the learning algorithm is executed over data perturbed with noise [6, 7, 13]. The third is based on homomorphic encryption, where the learning algorithm is executed directly over encrypted data [15].

Distributed linear regression is not well suited to the outsourced model: every party must take part in the computation, so the secure multi-party computation may be inefficient. Differential privacy, in turn, may incur a substantial loss of accuracy and cannot fully guarantee the confidentiality of the data. In this work, we therefore choose homomorphic encryption for our privacy-preserving machine learning algorithm. Homomorphic encryption (HE) allows operations on encrypted data, which makes linear regression over ciphertexts possible. We propose an efficient and secure linear regression protocol over encrypted data for outsourced environments, named ESLR, in which the cloud performs the linear regression over encrypted data. The challenge is how to run the linear regression algorithm over ciphertexts while maintaining high accuracy and performance. To address this challenge, we exploit the vector HE (VHE) scheme recently presented by Zhou and Wornell [17]. Unlike fully homomorphic encryption (FHE), VHE only needs to be somewhat homomorphic; as a result, it is much more efficient than many existing FHE schemes — for example, orders of magnitude faster than HElib [10], a well-known FHE implementation. With careful design, VHE can also be used for privacy-preserving gradient descent. In contrast to existing works, our contributions are twofold:

  1. Firstly, ESLR reconstructs the linear regression process in the ciphertext domain by taking advantage of vector encryption, which keeps computation and communication costs low. Moreover, we propose a scheme that efficiently applies the privacy-preserving gradient descent method in the ciphertext domain; to the best of our knowledge, this yields a very efficient optimization algorithm over encrypted data. Experiments show that ESLR achieves almost the same accuracy as the plaintext algorithm.

  2. Secondly, a security analysis demonstrates that ESLR preserves the confidentiality of the data, ensuring the privacy of the data owner. In addition, we define the loss function over the ciphertext domain, which is needed for optimization over encrypted data.

This paper is organized as follows: the problem formulation is described in Sect. 2, and the construction of the linear regression protocol is proposed in Sect. 3, followed by further discussion in Sect. 4. We then give the security analysis and performance evaluation in Sects. 5 and 6, respectively. Finally, the conclusion is presented in Sect. 7.

2 Problem Statement

In this section, we give the problem statement, including the system model and threat model, design goals, notations, and preliminaries.

Fig. 1. System model

2.1 System Model and Threat Model

We give our system model, concentrating on how to achieve secure linear regression over encrypted data in outsourced environments. As shown in Fig. 1, we propose a classical outsourced system model consisting mainly of two parties: the data owner and the service provider. We model the service provider as an "honest-but-curious" server. We assume the public matrix H and the encrypted data \({{\varvec{{D}}}^{'}}\) (\({{\varvec{{D}}}}^{'}\) is the encryption of the data \({\varvec{{D}}}\)) have been outsourced to the cloud, and that the confidentiality of the data is protected by the underlying encryption primitive. The server then runs the regression algorithm on \({{\varvec{{D}}}}^{'}\). That is, the data owner outsources the encrypted data \({{\varvec{{D}}}}^{'}\), the service provider runs the proposed protocol over \({{\varvec{{D}}}}^{'}\), and finally the service provider returns the predicted results to the data owner.

2.2 Design Goals

The overarching goal is to enable the linear regression algorithm to be performed over encrypted data. Moreover, for an efficient and secure linear regression protocol, we consider the following requirements to be necessary.

  • Accuracy: Enable secure linear regression over encrypted data in outsourced environments while achieving high accuracy.

  • Security: Protect the privacy of the data throughout the linear regression process.

  • Efficiency: Process large amounts of data with practical performance.

2.3 Overview of Standard Linear Regression and Gradient Descent

In this section, we give a brief introduction to the standard linear regression algorithm [16]. In statistics, linear regression is a regression analysis that uses a least-squares function to model the relationship between one or more independent variables and a dependent variable; this function is a linear combination of model parameters called regression coefficients. Linear regression with only one independent variable is called simple regression; with more than one independent variable, it is called multiple regression. Like all forms of regression analysis, linear regression focuses on the conditional probability distribution of y given x. Given a random sample \(({x}_{i1},{x}_{i2},\ldots ,{x}_{id},{y}_{i})\) with regression output \({y}_{i}\) and regression inputs \({x}_{i1},{x}_{i2},\ldots ,{x}_{id}\), a multivariate linear regression model is expressed as \({y}_{i}={{{w}}_1}{{{x}}_{i1}}+{{{w}}_2}{{{x}}_{i2}}+\cdots +{{{w}}_d}{{{x}}_{id}}+{b}\).

For a data set \({\varvec{{D}}}={[({{{\varvec{{x}}}}_1},{y}_1),({{{\varvec{{x}}}}_2},{y}_2),\cdots ,({{{\varvec{{x}}}}_n},{y}_n)]}\), the goal of linear regression is to find the regression coefficients \(\varvec{\theta }=[{w}_1, {w}_2, \cdots , {w}_d, {b}]\) that minimize the loss function, where \({\varvec{{x}}}_i^{'}\) denotes \({\varvec{{x}}}_i\) extended with a constant 1 for the bias term. We define the loss function as

$$\begin{aligned} J(\varvec{\theta }) = (\frac{1}{2n} )\sum _{i=1}^n(\varvec{\theta }^{T}{\varvec{{x}}}_i^{'}-{y}_i)^2. \end{aligned}$$

Further, we formulate the problem as Algorithm 1.

Algorithm 1

The gradient descent method [8] is an iterative method that can be used to solve the least-squares problem. It is one of the most commonly used methods for fitting the model parameters of machine learning algorithms (unconstrained optimization problems); the other commonly used method is the closed-form least-squares solution. By iteratively moving toward the minimum of the loss function, gradient descent yields both the minimum loss value and the corresponding model parameters.
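For reference, the following is a minimal numpy sketch of this standard plaintext procedure; the variable names and toy data are our own illustration, not from the protocol itself.

```python
import numpy as np

def linear_regression_gd(X, y, alpha=0.1, t=1e-10, max_iter=100_000):
    """Minimize J(theta) = (1/2n) * sum_i (theta^T x_i' - y_i)^2 by gradient
    descent, where x_i' is x_i with a constant 1 appended for the bias b."""
    n = len(y)
    Xb = np.hstack([X, np.ones((n, 1))])      # x_i' = [x_i, 1]
    theta = np.zeros(Xb.shape[1])             # theta = [w_1, ..., w_d, b]
    prev = np.inf
    for _ in range(max_iter):
        r = Xb @ theta - y                    # residuals theta^T x_i' - y_i
        loss = r @ r / (2 * n)
        if abs(prev - loss) < t:              # stop when the loss stabilizes
            break
        prev = loss
        theta -= (alpha / n) * (Xb.T @ r)     # gradient step
    return theta

# Toy usage: recover w = [2, -1] and b = 0.5 from noiseless data.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = X @ np.array([2.0, -1.0]) + 0.5
print(np.round(linear_regression_gd(X, y), 3))   # -> [ 2. -1.  0.5]
```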

2.4 Notations and Preliminaries

In this section, we review the preliminaries that are necessary for our work. First, we give the notations used throughout the paper, as listed in Table 1.

Table 1. Notations

We outline the VHE scheme of Zhou and Wornell [17], which encrypts integer vectors and allows the computation of arbitrary polynomials in the encrypted domain. For the purposes of ESLR, we only consider the fundamental operations below; for more details, we refer the reader to [17].

  • \({\varvec{{VHE.KG}}}(\lambda )\): Input a security parameter \({\lambda }\); choose \(l, m, n, p, q, w \in \mathbb {Z}\) and a distribution \(\chi \), where \(l=\lceil \log _{2}{(q-1)}\rceil \), \(w(p-1)<q\), \(q\gg p\), and \(m<n\); construct \({\varvec{{S}}} = [{\varvec{{I}}}, {\varvec{{T}}}] \in \mathbb {Z}^{m\times n}\) with \({\varvec{{I}}}\in \mathbb {Z}^{m\times m}\) the identity matrix; and output the secret key S and the public parameters \({\varvec{{Param}}}=(l, m, n, p, q, w, \chi )\).

  • \({\varvec{{VHE.E}}}({\varvec{{x,S}}})\): Input a secret key \({\varvec{{S}}}\in \mathbb {Z}^{m \times n}\) and a plaintext vector \({\varvec{{x}}}\in \mathbb {Z}^m \), output a ciphertext \({\varvec{{c}}} \in \mathbb {Z}^n \) that satisfies

$$\begin{aligned} {\varvec{{Sc}}}= w {\varvec{{x}}}+{\varvec{{e}}} \end{aligned}$$

where w is a large integer, \(\mid {\varvec{{S}}}\mid \ll w \), and e is an error term with \(|{\varvec{{e}}} |<w /2\).

  • \({\varvec{{VHE.D}}}({\varvec{{c,S}}})\): Input a ciphertext vector c \(\in \mathbb {Z}^n \) and a secret key S \(\in \mathbb {Z}^{m \times n} \), output a plaintext \({{\varvec{x}}} \in \mathbb {Z}^m \) that satisfies \({\varvec{{x}}}=\lceil {\varvec{{Sc}}}/w \rfloor \).

For the VHE scheme, key switching is an important operation in the encrypted domain. Given two secret keys \({\varvec{{S}}} \in \mathbb {Z}^{m \times n}\) and \({{\varvec{{S}}}}^{'} \in {\mathbb {Z}}^{m \times {n^{'}}}\), and a ciphertext \({\varvec{{c}}} \in \mathbb {Z}^n\) that decrypts to the plaintext \({\varvec{{x}}} \in \mathbb {Z}^m\) under S, we calculate a matrix \({\varvec{{M}}} \in \mathbb {Z}^{{n^{'}} \times {nl}}\) producing a new ciphertext \({{\varvec{{c}}}}^{'} \in {\mathbb {Z}}^{n^{'}}\) that decrypts to the same x under \({{\varvec{{S}}}}^{'}\). Specifically, this key-switching task is divided into two steps: \({\varvec{{M}}} \leftarrow {\varvec{{VHE.KSM}}} ({\varvec{{S}}}, {{\varvec{{S}}}^{'}})\) and \({{\varvec{{c}}}}^{'} \leftarrow {\varvec{{VHE.KS}}} ({\varvec{{M,c}}})\).

Furthermore, as shown in [17], for a plaintext x, its ciphertext c, and the key-switching matrix M, the following equation holds, where \((\cdot )^{*}\) denotes the bit-representation operator of [17].

$$\begin{aligned} {\varvec{{c}}} = {\varvec{{M}}}(w {\varvec{{x}}})^{*} \end{aligned}$$

In addition, VHE clearly supports addition in the ciphertext domain, since

$$\begin{aligned} {\varvec{{S}}}({{\varvec{{c}}}}_{1}+{{\varvec{{c}}}}_{2}+\cdots +{{\varvec{{c}}}}_{n})={w}({{\varvec{{x}}}}_{1}+{{\varvec{{x}}}}_{2}+\cdots +{{\varvec{{x}}}}_{n})+{\varvec{{e}}}. \end{aligned}$$
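To make these operations concrete, here is a minimal numpy sketch of VHE.KG, VHE.E, and VHE.D together with the additive property, assuming the \([{\varvec{{I}}} \mid {\varvec{{T}}}]\) key structure of [17]; the parameter values are toy choices for readability, not secure ones.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, w = 4, 6, 2**16            # plaintext dim, ciphertext dim, scale (toy)

def vhe_kg():
    # Secret key S = [I | T] with a small random block T.
    T = rng.integers(-5, 6, size=(m, n - m))
    return np.hstack([np.eye(m, dtype=np.int64), T])

def vhe_e(x, S):
    # Choose the tail of c at random, then solve S c = w x + e for the head.
    T = S[:, m:]
    tail = rng.integers(0, 2**10, size=n - m)
    e = rng.integers(-10, 11, size=m)             # error term, |e| << w/2
    head = w * x + e - T @ tail
    return np.concatenate([head, tail])

def vhe_d(c, S):
    # x = round(S c / w); exact while the accumulated error stays below w/2.
    return np.rint(S @ c / w).astype(np.int64)

S = vhe_kg()
x1, x2 = rng.integers(0, 100, size=m), rng.integers(0, 100, size=m)
c1, c2 = vhe_e(x1, S), vhe_e(x2, S)
assert np.array_equal(vhe_d(c1, S), x1)            # decryption
assert np.array_equal(vhe_d(c1 + c2, S), x1 + x2)  # additive homomorphism
```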

2.5 Privacy-Preserving Inner Product

In this section, we present a technique for computing the inner product of two vectors in the encrypted domain. For simplicity, assume two vectors \({\varvec{{x}}}_{1}\) and \({\varvec{{x}}}_{2}\) are encrypted to \({\varvec{{c}}}_{1}\) and \({\varvec{{c}}}_{2}\) using VHE. The challenge is how to calculate their inner product in the ciphertext domain.

To tackle the problem, a matrix H must be calculated. By solving the equation \({\varvec{{A}}}{\varvec{{M}}}={\varvec{{I}}}^{*}\), we obtain a matrix A, and then set \({\varvec{{H}}}={\varvec{{A}}}^{T}{\varvec{{A}}}\). Since \({\varvec{{A}}}{\varvec{{c}}}={\varvec{{A}}}{\varvec{{M}}}({w}{\varvec{{x}}})^{*}={\varvec{{I}}}^{*}({w}{\varvec{{x}}})^{*}={w}{\varvec{{x}}}\), we can prove that

$$\begin{aligned} {\varvec{{c}}}^{T}{\varvec{{H}}}{\varvec{{c}}} ={w}^{2}{\varvec{{x}}}^{T}{\varvec{{x}}}. \end{aligned}$$

and, more generally, \({\varvec{{c}}}_{1}^{T}{\varvec{{H}}}{\varvec{{c}}}_{2} = {w}^{2}{\varvec{{x}}}_{1}^{T}{\varvec{{x}}}_{2}\). Hence, we can calculate inner products in the ciphertext domain; we discuss the security of this method later.
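The identity is easy to check numerically. The sketch below builds a synthetic instance: A is a random full-row-rank matrix and M is chosen as one exact solution of \({\varvec{{A}}}{\varvec{{M}}}={\varvec{{I}}}^{*}\) via the pseudo-inverse. The bit-decomposition helpers stand in for the star operator; they are our own simplification, not the scheme's actual key-switching matrices.

```python
import numpy as np

L_BITS = 16                   # bit-length for the star operator (assumption)
m, n, w = 4, 6, 2**10         # plaintext dim, ciphertext dim, scale (toy)
rng = np.random.default_rng(0)

def bits(v):
    # (w x)^*: signed binary expansion of an integer vector, length m * L_BITS.
    out = []
    for a in np.asarray(v, dtype=np.int64):
        s = -1 if a < 0 else 1
        out.extend(s * ((abs(int(a)) >> i) & 1) for i in range(L_BITS))
    return np.array(out, dtype=np.float64)

def gadget(k):
    # I^*: satisfies gadget(k) @ bits(y) == y for any integer vector y.
    g = 2.0 ** np.arange(L_BITS)
    return np.kron(np.eye(k), g)

A = rng.standard_normal((m, n))          # full row rank with probability 1
M = np.linalg.pinv(A) @ gadget(m)        # right inverse => A @ M == I^*
H = A.T @ A

x1, x2 = rng.integers(-5, 6, size=m), rng.integers(-5, 6, size=m)
c1, c2 = M @ bits(w * x1), M @ bits(w * x2)         # c = M (w x)^*

assert np.allclose(c1 @ H @ c1, w**2 * (x1 @ x1), atol=1e-3)  # c^T H c
assert np.allclose(c1 @ H @ c2, w**2 * (x1 @ x2), atol=1e-3)  # c1^T H c2
```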

3 Proposed Protocol

In this section, we propose our protocol for linear regression over encrypted records in outsourced environments using VHE.

3.1 Reformulating the Problem

In this section, we restate our problem briefly. We suppose the data owner owns a database D, which can be thought of as a big table of n records \({{\varvec{{x}}}}_{1}, {{\varvec{{x}}}}_{2},\cdots , {{\varvec{{x}}}}_{n}\), where each record \({{\varvec{{x}}}}_{i}=[x_{i_1} \cdots x_{i_m}]\) has m attributes. Because the data owner's resources are limited, the owner encrypts the database D record-wise and outsources the encrypted database \({{\varvec{{D}}}}^{'}\) to the cloud. The service provider then applies linear regression over the encrypted data set and returns the results to the data owner. In this protocol, the service provider learns nothing about the plaintext.

3.2 Linear Regression Over VHE

With the above preparation, we now discuss regression over encrypted data. To keep our protocol fast and simple, we only consider the confidentiality of the data attributes (the labels \({y}_{i}\) remain in the clear). Suppose the data set \({\varvec{{D}}} = \{({{\varvec{{x}}}}_{1},{y}_{1}), ({{\varvec{{x}}}}_{2},{y}_{2}),\cdots , ({{\varvec{{x}}}}_{n},{y}_{n})\}\), known only to the data owner, is encrypted to \({\varvec{{D}}}' = \{({{\varvec{{c}}}}_{1},{y}_{1}), ({{\varvec{{c}}}}_{2},{y}_{2}),\cdots , ({{\varvec{{c}}}}_{n},{y}_{n})\}\), where plaintext and ciphertext satisfy \({\varvec{{S}}}{{\varvec{{c}}}}_{i}=w{{{\varvec{{x}}}}}_{i}+{{{\varvec{{e}}}}}_{i}\) for \(i = 1, 2, \cdots , n\). When the service provider gets the encrypted data set \({\varvec{{D}}}'\) from the data owner, it applies the linear regression protocol over \({\varvec{{D}}}'\). The whole process is divided into three phases: Preparation, Regression, and BackResults.

  • \({\varvec{{Preparation}}} ({\varvec{{D}}},\lambda )\). Taking the security parameter \(\lambda \) and the data set D as input, the data owner generates a secret key S and a key-switching matrix M, used for every record, which satisfies the following equation:

$$\begin{aligned} {{\varvec{{c}}}}={\varvec{{M}}}({w}{\varvec{{x}}})^{*}, \end{aligned}$$

where c is the ciphertext of x. The data owner needs to calculate the key-switching matrix M only once and can then use M to encrypt every record x. Since computing key-switching matrices is the most costly part of VHE, reusing the same M to encrypt the data saves a great deal of encryption overhead. Next, the data owner calculates the matrix H, which is used to define the loss function over encrypted data. Recall that the following equation holds:

$$\begin{aligned} {w}{{\varvec{x}}}={\varvec{{I}}}^{*}({w}{\varvec{{x}}})^{*}. \end{aligned}$$

The data owner then solves the matrix equation

$$\begin{aligned} {\varvec{{A}}}{\varvec{{M}}}={\varvec{{I}}}^{*}. \end{aligned}$$

The data owner thus obtains the matrix A, and from it computes the matrix H as

$$\begin{aligned} {\varvec{{H}}}={\varvec{{A}}}^{T}{\varvec{{A}}} \end{aligned}$$

Finally, the data owner uploads the encrypted data set \({{\varvec{{D}}}}^{'}\) and the matrix H to the service provider.

  • \({\varvec{{Regression}}}({{\varvec{{D}}}}^{'}\), \({{\varvec{{H}}}}\)). The service provider receives the encrypted data set \({\varvec{{D}}}'\) = {\(({{\varvec{{c}}}}_{1},{y}_{1}), ({{\varvec{{c}}}}_{2},{y}_{2}),\cdots , ({{\varvec{{c}}}}_{n},{y}_{n})\)} and the matrix H from the data owner and applies the regression algorithm, which consists of the following steps:

    1. Generate a vector \(\varvec{\theta }'\) randomly and choose a threshold t.

    2. Define the loss function over encrypted data as

      $$\begin{aligned} J'(\varvec{\theta }') = (\frac{1}{2n} )\sum \nolimits _{i=1}^n(\frac{1}{w^2} \varvec{\theta }'^{T}{{\varvec{H}}}{\varvec{{c}}}_i-{y}_i)^2. \end{aligned}$$
    3. Update \(\varvec{\theta }'\) based on the gradient descent method as below:

      $$\begin{aligned} \varvec{\theta }'^{k}=\varvec{\theta }'^{k-1} - \alpha \frac{\partial J'(\varvec{\theta }')}{\partial \varvec{\theta }'}, \end{aligned}$$

      where \(\varvec{\theta }'^{k}\) is the value of the \(k^{th}\) iteration.

    4. Repeat step (3) until the value of the loss function satisfies the condition below:

      $$|J'(\varvec{\theta }'^{k})-J'(\varvec{\theta }'^{k-1})|<{t}.$$
  • \({\varvec{{BackResults}}}(\varvec{\theta }'\)). From the Regression phase, the cloud obtains the encrypted parameters \(\varvec{\theta }'\) and returns them to the data owner. A toy end-to-end sketch of these three phases is given below.
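The sketch below simulates the VHE artifacts with a random matrix A such that \({\varvec{{A}}}{\varvec{{c}}}_{i}={w}{\varvec{{x}}}_{i}\) and \({\varvec{{H}}}={\varvec{{A}}}^{T}{\varvec{{A}}}\) (so the identity \({\varvec{{c}}}_{i}^{T}{\varvec{{H}}}{\varvec{{c}}}_{j}={w}^{2}{\varvec{{x}}}_{i}^{T}{\varvec{{x}}}_{j}\) holds), rather than running the actual encryption; all names, the update rule in ciphertext form, and parameter values are our own illustration of the protocol.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, w = 3, 5, 2**12                  # plaintext dim, ciphertext dim, scale

# Preparation (simulated): any A with A c_i = w x_i and H = A^T A reproduces
# the inner-product identity that the service provider relies on.
A = rng.standard_normal((m, n))
H = A.T @ A

N = 200
X = rng.uniform(-1, 1, size=(N, m))
theta_true = np.array([2.0, -1.0, 3.0])
y = X @ theta_true
C = (w * X) @ np.linalg.pinv(A).T      # "ciphertexts": rows satisfy A c_i = w x_i

# Regression phase (the service provider sees only C, y, H):
alpha, t = 0.5, 1e-12
theta_c, prev = np.zeros(n), np.inf
for _ in range(100_000):
    r = (C @ (H @ theta_c)) / w**2 - y      # r_i = theta'^T H c_i / w^2 - y_i
    loss = r @ r / (2 * N)
    if abs(prev - loss) < t:                # stopping rule |J'_k - J'_{k-1}| < t
        break
    prev = loss
    theta_c -= (alpha / N) * (C.T @ r)      # theta' stays a combination of the c_i

# BackResults: the owner recovers theta = A theta' / w (A plays the role of S here).
print(np.round(A @ theta_c / w, 3))         # -> [ 2. -1.  3.]
```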

4 Discussion

We have shown how to achieve a basic protocol for linear regression over encrypted data in outsourced environments. In this section, we give a correctness analysis of our protocol and briefly explain how the encrypted results can be used.

4.1 Loss Function over Encrypted Data

In this section, we show the correctness of the loss function over encrypted data. When \(\varvec{\theta }'\) is a ciphertext of \(\varvec{\theta }\), we have \(\varvec{\theta }'^{T}{\varvec{{H}}}{\varvec{{c}}}_i=({\varvec{{A}}}\varvec{\theta }')^{T}({\varvec{{A}}}{\varvec{{c}}}_i)={w}^{2}\varvec{\theta }^{T}{\varvec{{x}}}_i\), and we verify that the following equation holds.

$$\begin{aligned} J'(\varvec{\theta }')&=(\frac{1}{2n} )\sum _{i=1}^n(\frac{1}{{w}^2} \varvec{\theta }'^{T}{\varvec{H}}{\varvec{{c}}}_i-{y}_i)^2 \\&= (\frac{1}{2n} )\sum _{i=1}^n(\frac{1}{{w}^2} {w}^2\varvec{\theta }^{T}{\varvec{{x}}}_i-{y}_i)^2 \\&= (\frac{1}{2n} )\sum _{i=1}^n(\varvec{\theta }^{T}{\varvec{{x}}}_i-{y}_i)^2 \\&= J(\varvec{\theta }) \end{aligned}$$

As we can see, the loss function on the encrypted data is equal to the loss function on the plaintext.

4.2 Encrypted Parameters

In this section, we discuss the relationship between the encrypted parameters \(\varvec{\theta }'\) and the encrypted data. First, we analyze the loss function over plaintext, which is given as follows:

$$\begin{aligned} J(\varvec{\theta }) = (\frac{1}{2n} )\sum \nolimits _{i=1}^n(\varvec{\theta }^{T}{\varvec{{x}}}_i-{y}_i)^2 \end{aligned}$$

Gradient descent is one of the most commonly used methods for solving for the model parameters \(\varvec{\theta }=[\theta _{1},\theta _{2},\dots ,\theta _{d}]\): by iterating toward the minimum of the loss function, we obtain both the minimum loss and the model parameters. The iterative equation is given below:

$$\begin{aligned} \varvec{\theta }:=\varvec{\theta } - \alpha \frac{\partial J(\varvec{\theta })}{\partial \varvec{\theta }} \end{aligned}$$
$$\begin{aligned} \left[ \begin{matrix} {\varvec{{\theta }}}_1&{}\\ {\varvec{{\theta }}}_2&{}\\ \vdots &{}\\ {\varvec{{\theta }}}_{d}&{} \end{matrix} \right] :=\left[ \begin{matrix} {\varvec{{\theta }}}_1&{}\\ {\varvec{{\theta }}}_2&{}\\ \vdots &{}\\ {\varvec{{\theta }}}_{d}&{} \end{matrix} \right] -\frac{\alpha }{n}\left[ \begin{matrix} \sum \nolimits _{i=1}^n(\varvec{\theta }^{T}{\varvec{{x}}}_i-{y}_i)*{x}_{i1}&{}\\ \sum \nolimits _{i=1}^n(\varvec{\theta }^{T}{\varvec{{x}}}_i-{y}_i)*{x}_{i2}&{}\\ \vdots &{}\\ \sum \nolimits _{i=1}^n(\varvec{\theta }^{T}{\varvec{{x}}}_i-{y}_i)*{x}_{id}&{} \end{matrix} \right] \end{aligned}$$
$$\begin{aligned} \left[ \begin{matrix} {\varvec{{\theta }}}_1&{}\\ {\varvec{{\theta }}}_2&{}\\ \vdots &{}\\ {\varvec{{\theta }}}_{d}&{} \end{matrix} \right] :=\left[ \begin{matrix} {\varvec{{\theta }}}_1&{}\\ {\varvec{{\theta }}}_2&{}\\ \vdots &{}\\ {\varvec{{\theta }}}_{d}&{} \end{matrix} \right] -\frac{\alpha }{n}\sum _{i=1}^n(\varvec{\theta }^{T}{\varvec{{x}}}_i-{y}_i)\left[ \begin{matrix} {x}_{i1}&{}\\ {x}_{i2}&{}\\ \vdots &{}\\ {x}_{id}&{} \end{matrix} \right] \end{aligned}$$
$$\begin{aligned} \varvec{\theta }:=\varvec{\theta }-\frac{\alpha }{n}\sum _{i=1}^n(\varvec{\theta }^{T}{\varvec{{x}}}_i-{y}_i){\varvec{{x}}}_{i}, \end{aligned}$$

where \(\alpha \) is the step size. Note that \(\varvec{\theta }\) is a linear combination of the \({\varvec{{x}}}_{i}\) when the initial value is set to the zero vector \(\varvec{0}\). Linear combinations are supported by vector homomorphic encryption, and thus we can carry out the same update in the encrypted domain.
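Making the encrypted counterpart explicit (our reading of the protocol, not spelled out in this form above): if \(\varvec{\theta }'\) is maintained as the same linear combination of the ciphertexts \({\varvec{{c}}}_{i}\), the update in the ciphertext domain becomes

$$\begin{aligned} \varvec{\theta }':=\varvec{\theta }'-\frac{\alpha }{n}\sum _{i=1}^n\Big (\frac{1}{{w}^2}\varvec{\theta }'^{T}{\varvec{{H}}}{\varvec{{c}}}_i-{y}_i\Big ){\varvec{{c}}}_{i}, \end{aligned}$$

which mirrors the plaintext update, because \({\varvec{{A}}}{\varvec{{c}}}_{i}={w}{\varvec{{x}}}_{i}\) implies \({\varvec{{A}}}\varvec{\theta }'={w}\varvec{\theta }\) throughout the iteration.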

5 Security Analysis

In this section, we give the security analysis of ESLR, focusing on the encrypted database \({{\varvec{{D}}}}^{'}=\{{{\varvec{{c}}}}_{1}, {{\varvec{{c}}}}_{2},\cdots , {{\varvec{{c}}}}_{n}\}\) and the matrix H. We show that the honest-but-curious cloud server cannot threaten the privacy of the data owner, i.e., the cloud cannot recover the plaintext database \({\varvec{{D}}}=\{{{\varvec{{x}}}}_{1}, {{\varvec{{x}}}}_{2},\cdots , {{\varvec{{x}}}}_{n}\}\).

First of all, \({\varvec{{c}}}_{i}\) is the VHE ciphertext of \({\varvec{{x}}}_{i}\), for \(i = 1, 2, \cdots , n\). For convenience, we omit the subscripts and write \({\varvec{{c}}}={\varvec{{VHE}}}.{\varvec{{E}}}({\varvec{{x}}},{\varvec{{S}}})\), where S is the secret key. The confidentiality of x is therefore ensured as long as the VHE scheme is secure and the secret key S is not known to the cloud. Since the secret key S is stored privately by the data owner, the cloud cannot obtain it; hence we focus on the security of VHE itself.

As shown in [17], the security of VHE reduces to the learning with errors (LWE) problem, which is well known to be as hard as several worst-case lattice problems [14]. As a result, the intractability of LWE assures the security of VHE.

However, in order to evaluate inner products of ciphertext vectors, we introduced the special matrix H, so it is natural to ask whether H brings an additional privacy risk. On one hand, to calculate H, we first solve the equation \({\varvec{{I}}}^{*}={\varvec{{A}}}{\varvec{{M}}}\) to obtain A, and then compute \({\varvec{{H}}}={\varvec{{A}}}^{T}{\varvec{{A}}}\). On the other hand, according to VHE, the ciphertext c and the plaintext x satisfy \({\varvec{{c}}}={\varvec{{M}}}(w{\varvec{{x}}})^{*}\). Since the cloud holds H and c, one might worry that, by combining the following equations, the cloud could recover the plaintext x.

$$\begin{aligned} \left\{ \begin{aligned} {\varvec{{H}}}&={\varvec{{A}}}^{T}{\varvec{{A}}} \\ {\varvec{{I}}}^{*}&={\varvec{{AM}}} \\ {\varvec{{c}}}&={\varvec{{M}}}(w{\varvec{{x}}})^{*} \end{aligned} \right. \end{aligned}$$

In the following, we address this concern. The analysis demonstrates that the cloud cannot recover the plaintext x from the ciphertext c by exploiting H.

As is known, for any orthogonal matrix Q, i.e., one satisfying \({\varvec{{Q}}}^{T}{\varvec{{Q}}}={\varvec{{I}}}\) with I the identity matrix, the matrix \({\varvec{{Q}}}{\varvec{{A}}}\) satisfies

$$\begin{aligned} ({\varvec{{Q}}}{\varvec{{A}}})^{T}({\varvec{{Q}}}{\varvec{{A}}})&= {\varvec{{A}}}^{T}{\varvec{{Q}}}^{T}{\varvec{{Q}}}{\varvec{{A}}} \\&= {\varvec{{A}}}^{T}{\varvec{{I}}}{\varvec{{A}}} \\&= {\varvec{{A}}}^{T}{\varvec{{A}}} \\&={\varvec{{H}}} \end{aligned}$$

It is clear, then, that the equation \({\varvec{{H}}}={\varvec{{A}}}^{T}{\varvec{{A}}}\) has infinitely many solutions for A, since Q can be chosen at random. Therefore, the cloud cannot extract the matrix A from H. Furthermore, without knowing A, the cloud cannot obtain M, and so it cannot recover the plaintext x from the ciphertext c. As a result, the privacy of the database D is preserved.
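This ambiguity is easy to check numerically; a short sketch with arbitrary toy dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 6))
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))   # random orthogonal Q
A2 = Q @ A                                         # another candidate solution
assert np.allclose(A2.T @ A2, A.T @ A)             # same H as A ...
assert not np.allclose(A2, A)                      # ... yet A2 != A, so A is not identifiable
```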

6 Performance Evaluation

In this section, we evaluate the proposed linear regression protocol. Our data sets come from the UCI repository [1], and the experimental environment includes a data owner and a service provider. The data owner runs Python on a Windows 10 machine with an i3-4130 CPU @1.40 GHz and 4 GB RAM, and the service provider is a Linux machine with an Intel Xeon E5-2430 v2 CPU @2.5 GHz and 16 GB RAM running Ubuntu 14.04 LTS. The user acts as the data owner and data user, and the server acts as the service provider. In the following, we evaluate the protocol in terms of time cost, accuracy, and communication overhead.

6.1 Time Cost and Accuracy

First, we evaluate the time cost by comparing running times on plaintext and on ciphertext. As illustrated in Fig. 2, we choose 4 data sets from the UCI repository to verify our protocol, and we can see that linear regression on ciphertext is a little slower than on plaintext. The overhead is acceptable, and the two settings yield almost the same results on these data sets.

Fig. 2. Comparison of running time between plaintext and ciphertext

Fig. 3. Comparison between real results and predicted results in the encrypted domain

Then, we compare the accuracy of the real results and the predicted results on the four data sets in the ciphertext domain. As illustrated in Fig. 3, the predicted results almost coincide with the actual results. Furthermore, we use the Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and R Squared (R-S) as regression indexes to evaluate our model. As seen in Table 2, compared to the results in the plaintext domain, our protocol achieves almost the same prediction performance, which shows that our model performs well in the ciphertext domain.

Table 2. Comparison of regression error metrics in the plaintext and ciphertext domains

6.2 Communication Cost

In this section, we discuss the communication cost of our protocol, which mainly comes from the ciphertexts and the matrix H used to define the loss function. First, for n records of m dimensions each, encrypting the data items produces \(\mathcal {O}(m(n+1))\) communication overhead. Second, the matrix H generates \(\mathcal {O}((n+1)^{2})\) communication overhead. In total, the encrypted domain therefore incurs \(\mathcal {O}((m+n+1)(n+1))\) communication overhead, whereas the plaintext counterpart is \(\mathcal {O}(mn)\) for the same data sets. In practice, m is often far greater than n because of the curse of dimensionality [11], in which case the communication overhead of the ciphertext domain is almost the same as that of the plaintext domain.
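For a rough sense of scale, with hypothetical sizes (not measurements from our experiments) of \(m = 10^4\) and \(n = 10^2\):

$$\begin{aligned} (m+n+1)(n+1)=(10^4+101)\times 101 \approx 1.02\times 10^{6}, \qquad mn=10^{6}, \end{aligned}$$

i.e., roughly a 2% increase over the plaintext communication cost.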

7 Conclusion

In this paper, we have proposed an efficient and secure linear regression protocol over encrypted data using vector homomorphic encryption. In particular, we have given a practical solution to the challenging problem of privacy-preserving gradient descent. Performance evaluation shows that the protocol achieves high accuracy with low computation and communication costs. Since many machine learning algorithms are based on gradient descent, in future work we will apply this method to other machine learning algorithms.