3.1 Privacy-Preserving Data Integration

3.1.1 Introduction

Medical organizations often store the data accumulated through medical analyses. However, detailed data analysis sometimes requires separate datasets to be integrated without violating patient or commercial privacy. Consider the scenario in which the occurrence of similar accidents can be attributed to a particular defective product. Such defective products should be identified as quickly as possible. However, the databases related to accidents are maintained separately by different organizations, so investigating the causes of accidents is often time-consuming. For example, assume child A has broken a leg at school, but it is not clear whether the accident was caused by defective equipment. In this case, information relating to A’s injury, such as the patient’s name and type of injury, is stored in hospital database \(S_{1}\). Information pertaining to A’s accident, such as the child’s name and the location of the swing at the school, is stored in database \(S_{2}\), which is held by the fire department. Finally, information relating to the insurance claim following A’s accident, such as the name and medical costs, is maintained in the insurance company’s database, \(S_{3}\). Computing the intersection of these databases, \(S_{1} \cap S_{2} \cap S_{3}\), without compromising privacy would enable us to combine the separate sets of information, which may allow the cause of the accident to be identified.

Let us consider another situation. Several clinics, denoted as \(\mathsf{P}_i\), maintain separate databases, represented as \(S_{i}\). The clinics wish to know the patients they have in common to enable them to share treatment details; however, \(\mathsf{P}_i\) should not be able to access any information about patients not stored in its own dataset. In this case, computing the intersection of the sets must not reveal private information.

These examples illustrate the need for a Multiparty Private Set Intersection (MPSI) protocol [1,2,3,4]. MPSI is executed by multiple parties who jointly compute the intersection of their private datasets. Ultimately, only designated parties can access this intersection. Previous protocols are impractical because the bulk of the computation depends on the number of players. Some previous studies [1, 2] required the datasets maintained by the different players to be of equal size. Another study [3] computed only the approximate number of intersections, whereas other researchers [4] required more than two trusted third-parties.

In this section, we propose a practical MPSI with the following features:

1. The size of the datasets maintained by each party is independent of those maintained by the other parties.

2. The computational complexity for each party is independent of the number of parties. This is accomplished by introducing an outsourcing provider, \(\mathcal{O}\): all computations that depend on the number of parties are carried out by \(\mathcal{O}\), so each party’s workload does not grow as more parties join.

3.1.2 Preliminaries

In this section, we summarize the DDH assumption, Bloom filter, and ElGamal encryption. We consider security according to the honest-but-curious model [5]: all players act according to their prescribed actions in the protocol. A protocol that is secure in the honest-but-curious model does not allow any player to gain information about other players’ private input sets beyond what can be deduced from the result of the protocol. Note that the term adversary here refers to insiders, i.e., protocol participants. Outsider adversaries are not considered, because their behavior can be mitigated via standard network security techniques.

Our protocol is based on the following security assumption.

Definition 3.1

(DDH Assumption) Let \(\kappa \) be a security parameter. A decisional Diffie–Hellman (DDH) parameter generator \(\mathcal {IG}\) is a probabilistic polynomial time (ppt) algorithm that takes \(1^{\kappa }\) as input and outputs a description of a finite field \({\mathbb F}_{p}\) and a basepoint \(g \in {\mathbb F}_{p}\) with prime order q. We say that \(\mathcal {IG}\) satisfies the DDH assumption if \(\left| p_1-p_2\right| \) is negligible (in \(\kappa \)) for all ppt algorithms A, where \(p_1={\small \Pr } [ ({\mathbb F}_{p}, g) \leftarrow \mathcal {IG}(1^{\kappa }); y_1=g^{x_1}, y_2= g^{x_2} \leftarrow {\mathbb F}_{p}: A({\mathbb F}_{p}, g, y_1, y_2, g^{x_1x_2}) = 0]\) and \(p_2={\small \Pr } [ ({\mathbb F}_{p}, g) \leftarrow \mathcal {IG}(1^{\kappa }); y_1=g^{x_1}, y_2= g^{x_2}, z \leftarrow {\mathbb F}_{p}: A({\mathbb F}_{p}, g, y_1, y_2, z) = 0]\).

A Bloom filter [6], denoted by \(\mathsf{BF}\), is a space-efficient probabilistic data structure consisting of an array of m bits. The \(\mathsf{BF}\) encodes a set S of at most w elements and can then check whether an element x is included in S. The encoded Bloom filter of S is denoted by \(\mathsf{BF}(S)\).

The \(\mathsf{BF}\) uses a set of k independent uniform hash functions \(\mathcal {H}= \left\{ H_0, \ldots , H_{k-1} \right\} \), where \(H_i:\{0, 1 \}^* \longrightarrow \{ 0,1, \ldots , m-1 \}\) for \(0 \le \forall i \le k-1\). The \(\mathsf{BF}\) consists of two functions: \(\mathsf{Const}\) embeds a given set S into \(\mathsf{BF}(S)\) and \(\mathsf{ElementCheck}\) checks whether an element x is included in S. \(\mathsf{SetCheck}\), an extension of \(\mathsf{ElementCheck}\), checks whether an element x in \(S'\) is in \(S' \cap S\) (see Algorithm 3.3). In \(\mathsf{Const}\) (see Algorithm 3.1), \(\mathsf{BF}(S)\) is constructed for a given set S by first setting all bits in the array to 0. To embed an element \(x \in S\) into the filter, the element is hashed using k hash functions to obtain k index numbers, and the bits at these indexes are set to 1, i.e., set \(\mathsf{BF}\) \([H_i(x)] = 1\) for \(0 \le i \le k-1\). In \(\mathsf{ElementCheck}\) (see Algorithm 3.2), we check all locations where x is hashed; x is considered to be not in S if any bit at these locations is 0; otherwise, x is probably in S.

Some false positive matches may occur, i.e., it is possible that all \(\mathsf{BF}\) \([H_i(y)]\) are set to 1, but y is not in S. The false positive rate \(\mathtt{FPR}\) is given by \(\mathtt{FPR}= \left\{ 1- \left( 1-\frac{1}{m} \right) ^{kw}\right\} ^k \approx \left\{ 1-e^{-kw/m}\right\} ^k\) [7]. However, false negatives are not possible, and so Bloom filters have a 100\(\%\) recall rate.
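The three Bloom filter functions and the false positive rate can be sketched in a few lines of Python. This is a minimal illustration only: the k hash functions are modeled by salted SHA-256 (an assumption, since the text only requires independent uniform hashes), and the function names `positions`, `const`, `element_check`, `set_check`, and `fpr` are hypothetical.

```python
import hashlib

def positions(x, m, k):
    """Indexes H_0(x), ..., H_{k-1}(x) in {0, ..., m-1}.
    Salted SHA-256 stands in for the k independent hashes (assumption)."""
    return [int.from_bytes(hashlib.sha256(f"{i}:{x}".encode()).digest(), "big") % m
            for i in range(k)]

def const(s, m, k):
    """Const: start from an all-zero array and set BF[H_i(x)] = 1 for every x in S."""
    bf = [0] * m
    for x in s:
        for j in positions(x, m, k):
            bf[j] = 1
    return bf

def element_check(bf, x, k):
    """ElementCheck: x is 'probably in S' only if every hashed position holds 1."""
    return all(bf[j] == 1 for j in positions(x, len(bf), k))

def set_check(bf, s_prime, k):
    """SetCheck: the elements of S' that pass ElementCheck against BF(S)."""
    return {x for x in s_prime if element_check(bf, x, k)}

def fpr(m, k, w):
    """Exact false positive rate {1 - (1 - 1/m)^(kw)}^k for w embedded elements."""
    return (1 - (1 - 1 / m) ** (k * w)) ** k

bf = const({"alice", "bob"}, m=256, k=4)
print(element_check(bf, "alice", 4))   # members always pass (no false negatives)
print(fpr(256, 4, 2))                  # small but nonzero
```

Members always pass the check, whereas a non-member passes only with probability close to the value returned by `fpr`.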


Homomorphic encryption under addition is useful for processing encrypted data. A typical additively homomorphic encryption scheme was proposed by Paillier [8]. However, because the order of the composite group used in Paillier encryption cannot be reduced, it is computationally expensive compared with the following ElGamal encryption. Our protocol requires matching without revealing the original messages, for which exponential ElGamal encryption (exElGamal) is sufficient [9]. In fact, the decrypted results of exElGamal encryption reveal whether two messages \(m_1\) and \(m_2\) are equal, although exElGamal decryption yields \(g^m\) rather than the message m itself. Furthermore, exElGamal can be used in (n, n)-threshold distributed decryption [10], where the decryption must be performed by all players acting together. An exElGamal encryption with (n, n)-threshold distributed decryption consists of three functions:

Key generation:

Let \({\mathbb F}_{p}\) be a finite field, \(g \in {\mathbb F}_{p}\), with prime order q. Each player \(\mathsf{P}_i\) chooses \(x_i \in {\mathbb Z}_{q}\) at random and computes \(y_i=g^{x_i} \pmod {p}\). Then, \(y=\prod _{i=1}^{n}y_i \pmod {p}\) is a public key and each \(x_i\) is a share for each player to decrypt a ciphertext.

Encryption \(\mathsf{thrEnc}[m] \rightarrow (u,v)\):

Let \(m \in {\mathbb Z}_{q}\) be a message and y a public key. Choose \(r \in {\mathbb Z}_{q}\) at random, and compute both \(u=g^r \pmod {p}\) and \(v=g^my^r \pmod {p}\). Output (u, v) as a ciphertext of m.

Decryption \(\mathsf{thrDec}[(u,v)] \rightarrow g^m\):

Each player \(\mathsf{P}_i\) computes \(z_i = u^{x_i} \pmod {p}\). All players then compute \(z = \prod _{i=1}^{n} z_i \pmod {p}\) jointly. Finally, each player can decrypt the ciphertext as \(g^m = v/z \pmod {p}\).

ExElGamal encryption with (n, n)-threshold decryption has the following features:

(1) homomorphic under addition: \(\mathsf{Enc}(m_1) \mathsf{Enc}(m_2)=\mathsf{Enc}(m_1 + m_2)\) for messages \(m_1, m_2 \in {\mathbb Z}_{q}\).

(2) homomorphic under scalar operations: \(\mathsf{Enc}(m)^k = \mathsf{Enc}(km)\) for a message m and \(k \in {\mathbb Z}_{q}\).
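The key generation, encryption, joint decryption, and the two homomorphic features can be checked with a small Python sketch. The parameters p = 1019, q = 509, g = 4 are toy values chosen here for illustration (g generates the subgroup of prime order q in \({\mathbb F}_{p}^{*}\)); they are far too small for real use. The function names are hypothetical, and all players are simulated in one process.

```python
import secrets

# Toy parameters (assumption): P = 2*Q + 1 with P, Q prime; G = 4 has order Q mod P.
P, Q, G = 1019, 509, 4

def keygen(n):
    """Each player picks a share x_i; the public key is y = prod g^{x_i} mod p."""
    shares = [secrets.randbelow(Q) for _ in range(n)]
    y = 1
    for x in shares:
        y = y * pow(G, x, P) % P
    return shares, y

def thr_enc(m, y):
    """thrEnc[m] -> (u, v) = (g^r, g^m * y^r) mod p."""
    r = secrets.randbelow(Q)
    return pow(G, r, P), pow(G, m % Q, P) * pow(y, r, P) % P

def thr_dec(ct, shares):
    """Joint decryption: z = prod u^{x_i}; returns g^m = v / z mod p (not m itself)."""
    u, v = ct
    z = 1
    for x in shares:          # in the protocol, each player contributes its own z_i
        z = z * pow(u, x, P) % P
    return v * pow(z, -1, P) % P

def ct_mul(c1, c2):
    """Componentwise product: Enc(m1) * Enc(m2) = Enc(m1 + m2)."""
    return c1[0] * c2[0] % P, c1[1] * c2[1] % P

def ct_pow(c, k):
    """Componentwise power: Enc(m)^k = Enc(k * m)."""
    return pow(c[0], k, P), pow(c[1], k, P)

shares, y = keygen(3)
print(thr_dec(ct_mul(thr_enc(2, y), thr_enc(5, y)), shares) == pow(G, 7, P))
print(thr_dec(ct_pow(thr_enc(3, y), 4), shares) == pow(G, 12, P))
```

Because decryption returns \(g^m\), equality of two plaintexts can be tested by comparing decrypted values, which is all the protocol in this section needs.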

3.1.3 Previous Work

This section summarizes prior works on PSI between a server and a client and MPSI among n players. In PSI, let \(S=\{s_1,\ldots ,s_v\}\) and \(C=\{c_1,\ldots ,c_w\}\) be server and client datasets, respectively, where \(|S|=v\) and \(|C|=w\). In MPSI [1], we assume that each player holds a dataset of the same size.

PSI protocol based on polynomial representation: The main idea is to represent the elements in C as the roots of a polynomial. The encrypted polynomial is sent to the server, where it is evaluated on the elements in S, as originally proposed by Freedman [11]. This is secure against honest-but-curious adversaries under secure public key encryption. The computational complexity is O(vw) exponentiations, and the communication overhead is \(O(v+w)\). The computational complexity can be reduced to \(O(v \log \log w)\) exponentiations using the balanced allocation technique [12]. Kissner and Song extended this protocol to MPSI [1], which requires \(O(nw^2)\) exponentiations and O(nw) communication overhead. The MPSI version is secure against honest-but-curious and malicious adversaries (in the random oracle model) using generic zero-knowledge proofs.

PSI protocol based on DH-key agreement: The main idea here is to apply the DH-key agreement protocol [13]: after representing the server and client datasets as hash values \(\{h(s_i)\}\) and \(\{h(c_i)\}\), respectively, the client encrypts the dataset as \(\{h(c_i)^{r_i}\}\) using random numbers \(r_i\) and sends the encrypted set to the server. The server encrypts the client set \(\{h(c_i)^{r_i}\}\) and the server set \(\{h(s_i)\}\) using a random number r, which gives \(\{h(c_i)^{rr_i}\}\) and \(\{h(s_i)^{r}\}\), respectively, and returns these sets to the client. Finally, the client evaluates \(S \cap C\) by decrypting to \(\{h(c_i)^{r}\}\). This is secure against honest-but-curious adversaries under the DDH assumption. The total computational complexity is \(O(v+w)\) exponentiations, and the total communication overhead is \(O(v+w)\). The security of this approach can be enhanced against malicious adversaries in the random oracle model [14] by using a blind signature. However, no extensions to MPSI based on the DH-key agreement protocol have been proposed.
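The message flow of the DH-based PSI can be traced in a short Python sketch. It is an illustration under loud assumptions: both roles run in one function, the hash h is SHA-256 reduced modulo the Mersenne prime \(2^{127}-1\), and exponentiation takes place in the full multiplicative group of that field, which is convenient for a demo but is not a DDH-secure prime-order group. The names `h`, `rand_exp`, and `dh_psi` are hypothetical.

```python
import hashlib
import math
import secrets

P = 2**127 - 1  # Mersenne prime used as a toy modulus (assumption, not DDH-secure)

def h(item: str) -> int:
    """Hash an item into {1, ..., P-1} (SHA-256 reduced mod P; illustrative only)."""
    d = hashlib.sha256(item.encode()).digest()
    return int.from_bytes(d, "big") % P or 1

def rand_exp() -> int:
    """Random exponent invertible modulo the group order P - 1."""
    while True:
        r = secrets.randbelow(P - 1)
        if r > 1 and math.gcd(r, P - 1) == 1:
            return r

def dh_psi(client_set, server_set):
    items = list(client_set)
    # Client: blind each hashed element with its own exponent r_i.
    rs = [rand_exp() for _ in items]
    blinded = [pow(h(c), ri, P) for c, ri in zip(items, rs)]
    # Server: raise the client's values and its own hashed elements to r.
    r = rand_exp()
    doubled = [pow(b, r, P) for b in blinded]            # h(c_i)^{r r_i}, same order
    server_tags = {pow(h(s), r, P) for s in server_set}  # h(s_j)^r
    # Client: remove r_i to obtain h(c_i)^r and compare with the server's tags.
    result = set()
    for c, ri, d in zip(items, rs, doubled):
        if pow(d, pow(ri, -1, P - 1), P) in server_tags:
            result.add(c)
    return result

print(dh_psi({"alice", "bob", "carol"}, {"bob", "carol", "dave"}))
```

Each side performs one exponentiation per element it handles, which matches the \(O(v+w)\) exponentiation count stated above.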

PSI protocol based on BF: This protocol was originally proposed in [4]. As the Bloom filter itself reveals information about the other player’s dataset, the set of players is separated into two groups: input players who have datasets and privacy players who perform private computations under shared secret information. In [15], the privacy of each player’s dataset is protected by encrypting each array of the Bloom filter using Goldwasser–Micali encryption [16]. In an honest-but-curious version, the computational complexity is O(kw) hash operations and O(m) public key operations, and the communication overhead is O(m), where m and k are the number of arrays and hash functions, respectively, used in the Bloom filter. A Bloom filter is also used together with the Oblivious transfer extension [17, 18] and the newly constructed garbled Bloom filter [19]. The main novelty in the garbled Bloom filter is that each array requires \(\lambda \) bits rather than the single bit needed for the conventional Bloom filter. To embed an element \(x \in S\) into a garbled Bloom filter, x is split into k shares of \(\lambda \) bits using XOR-based secret sharing \((x=x_1 \bigoplus \cdots \bigoplus x_k)\). Each share \(x_i\) is then stored at index \(H_i(x)\). An element y is queried by subjecting all bit strings at \(H_i(y)\) to an XOR operation. If the result is y, then y is in S; otherwise, y is not in S. The client uses a Bloom filter \(\mathsf{BF}(C)\), and the server uses a garbled Bloom filter \(\mathsf{GBF}(S)\). If x is in \(C \cap S\), then for every position i it hashes to, \(\mathsf{BF}(C)[i]\) must be 1 and \(\mathsf{GBF}(S)[i]\) must be \(x_i\). Thus, the client can compute \(C \cap S\). The computational complexity of this method is O(kw) hash operations and O(m) public key operations, and the communication overhead is O(m). The number of public key operations can be reduced to \(O(\lambda )\) using the Oblivious transfer extension. This is secure against honest-but-curious adversaries if the Oblivious transfer protocol is secure. Finally, some researchers have computed the approximate number of multiparty set unions [3].
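The embedding and query steps of the garbled Bloom filter can be sketched as follows. This is a simplified illustration, not the construction of [19] in full: elements are modeled as integers below \(2^{\lambda }\), the hashes are salted SHA-256, repeated hash positions of one element are merged, and the oblivious transfer step is omitted. The constants `LAMBDA`, `M`, `K` and the function names are hypothetical.

```python
import hashlib
import secrets

LAMBDA = 64        # bits per slot (assumption)
M, K = 1024, 3     # number of slots and hash functions (assumption)

def positions(x: int):
    """Distinct slot indexes H_i(x); duplicates are merged for simplicity."""
    idx = []
    for i in range(K):
        d = hashlib.sha256(f"{i}:{x}".encode()).digest()
        j = int.from_bytes(d, "big") % M
        if j not in idx:
            idx.append(j)
    return idx

def build_gbf(items):
    """Store XOR shares of each x so that the slots at H_i(x) XOR to x."""
    gbf = [None] * M
    for x in items:
        idx = positions(x)
        free = [j for j in idx if gbf[j] is None]
        if not free:
            raise ValueError("no free slot for this element; increase M")
        last, acc = free[0], x
        for j in idx:
            if j == last:
                continue
            if gbf[j] is None:               # fresh random share
                gbf[j] = secrets.randbits(LAMBDA)
            acc ^= gbf[j]                    # reuse shares already fixed by earlier items
        gbf[last] = acc                      # final share makes the XOR equal x
    # Unused slots are filled with random strings so they look like shares.
    return [secrets.randbits(LAMBDA) if s is None else s for s in gbf]

def query(gbf, y: int) -> bool:
    """y is accepted if the XOR of the slots at H_i(y) equals y."""
    acc = 0
    for j in positions(y):
        acc ^= gbf[j]
    return acc == y

gbf = build_gbf([11, 22, 33])
print([query(gbf, v) for v in (11, 22, 33, 44)])
```

A non-member is accepted only if random \(\lambda \)-bit strings happen to XOR to it, so the false positive probability drops from the Bloom filter rate to roughly \(2^{-\lambda }\).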

3.1.4 Practical MPSI

This section presents a practical MPSI that is secure under the honest-but-curious model.

3.1.4.1 Notation and Privacy Definition

In the remainder of this paper, the following notations are used.

  • \(\mathsf{P}_i\): ith player, \(i = 1, \ldots , n\)

  • \(\mathcal{O}\): outsourcing provider with no knowledge of the inputs or outputs

  • \(S_i = \{ s_{i,1}, s_{i, 2},\ldots , s_{i, \omega _i} \}\): dataset held by \(\mathsf{P}_i\), where \(|S_i| = \omega _i\)

  • \(\cap S_i\): intersection of the datasets of all n players

  • \(\mathsf{thrEnc}\) and \(\mathsf{thrDec}\): (n, n)-threshold exElGamal encryption and decryption, respectively

  • m and k: number of arrays and hashes used in \(\mathsf{BF}\)

  • \(\varvec{\ell }=[\ell , \ldots , \ell ]\) (\(1 \le \ell \le n\)): an m-dimensional array, where all entries are set to \(\ell \)

  • \(\mathsf{BF}(S_i)= [\mathsf{BF}_{i}[0], \ldots , \mathsf{BF}_{i}[m-1]]\): Bloom filter applied to a set \(S_i\)

  • \(\mathsf{IBF}(\cap S_i)=[ \sum _{i=1}^{n} \mathsf{BF}_i[0], \ldots , \sum _{i=1}^{n} \mathsf{BF}_i[m-1]]\): integrated Bloom filter of n sets \(\{S_i\}\), where \(\sum _{i=1}^{n} \mathsf{BF}_i[j]\) is the sum of all players’ arrays

We introduce an outsourcing provider \(\mathcal{O}\) to reduce the computational burden on all players. \(\mathcal{O}\) has no information regarding the elements of any player’s set. The privacy issues faced by MPSI with an outsourcing provider can be informally written as follows.

Definition 3.2

(MPSI privacy) An MPSI scheme with an outsourcing provider \(\mathcal{O}\) is player-private if the following two conditions hold:

  • \(\mathsf{P}_i\) does not learn anything about the elements of other players’ datasets except for the elements that \(\mathsf{P}_i\) originally possesses.

  • the outsourcing provider \(\mathcal{O}\) does not learn anything about the elements of any player’s set.

3.1.4.2 Proposed MPSI

Our MPSI comprises four phases: (i) initialization, (ii) Bloom filter construction and the encryption of \(\mathsf{P}_i\)’s data, (iii) \(\mathcal{O}\)’s randomization of \(\mathsf{thrEnc}(\mathsf{IBF}(\cap S_i) -\mathbf {n})\), and (iv) the computation of \(\cap S_i\). The computation of \(\cap S_i\) consists of three steps: (a) joint decryption of an (n, n)-threshold exElGamal ciphertext among n players, (b) Bloom filter check, and (c) output of the intersection.

Figure 3.1 shows an overview of our protocol after the initialization phase. The system parameters of a finite field \({\mathbb F}_{p}\) and a basepoint \(g \in {\mathbb F}_{p}\) with order q for an (n, n)-threshold exElGamal encryption (\(\mathsf{thrEnc}\), \(\mathsf{thrDec}\)) are provided to both \(\mathsf{P}_i\) and \(\mathcal{O}\). For the Bloom filter, \(\mathsf{Const}(S)\) and \(\mathsf{SetCheck}(\mathsf{BF}, S')\) are only provided to \(\mathsf{P}_i\), where the array size is m and k independent hash functions are used.

Fig. 3.1 Overview of our MPSI

To encrypt, randomize, or subtract a vector such as a Bloom filter \(\mathsf{BF}=[a_0, \ldots , a_{m-1}]\), each location is encrypted, randomized, or subtracted independently:

$$\begin{aligned} \mathsf{thrEnc}(\mathsf{BF})= & {} [\mathsf{thrEnc}(a_0), \ldots , \mathsf{thrEnc}(a_{m-1})], \\ \mathbf {r}\mathsf{BF}= & {} [r_0a_0, \ldots , r_{m-1}a_{m-1}], \text{ or } \\ \mathsf{BF}- \mathbf {r}= & {} [a_0-r_0, \ldots , a_{m-1}-r_{m-1}] \end{aligned}$$

for \(\mathbf {r}= [r_0, \ldots , r_{m-1}] \in {\mathbb Z}_{q}^m\).

Our protocol proceeds as follows.

Initialization:

  1. \(\mathsf{P}_i\) generates \(x_i \in {\mathbb Z}_{q}\), computes \(y_i=g^{x_i} \in {\mathbb F}_{p}\), and publishes \(y_i\) to the other players as a public key, where the corresponding secret key is \(x_i\).

  2. \(\mathsf{P}_i\) computes \(y=\prod _i y_i\), where y is the n-player public key. Note that no player knows the corresponding secret key \(x = \sum x_i\) before executing the joint decryption.

Construction and encryption of \(\mathsf{BF}(S_i) - \varvec{1}\):

  1. \(\mathsf{P}_i\) executes \(\mathsf{Const}(S_i) \longrightarrow \mathsf{BF}(S_i)=[\mathsf{BF}_i[0], \ldots , \mathsf{BF}_i[m-1]]\) (Algorithm 3.1).

  2. \(\mathsf{P}_i\) encrypts \(\mathsf{BF}(S_i) - \varvec{1}\) using \(\mathsf{thrEnc}_y\):

    $$ \mathsf{thrEnc}_y(\mathsf{BF}(S_i) - \varvec{1}) =[\mathsf{thrEnc}_y(\mathsf{BF}_i[0] -1), \ldots , \mathsf{thrEnc}_y(\mathsf{BF}_i[m-1]-1)], $$

    where y is an n-player public key.

  3. \(\mathsf{P}_i\) sends \(\mathsf{thrEnc}_y(\mathsf{BF}(S_i) - \varvec{1})\) to \(\mathcal{O}\).

Randomization of \(\mathsf{thrEnc}(\mathsf{IBF}(\cap S_i) -\mathbf {n})\):

  1. \(\mathcal{O}\) encrypts \(\mathsf{IBF}(\cap S_i)- \mathbf {n}\) without knowing \(\mathsf{IBF}(\cap S_i)\) by using the additive homomorphic feature, i.e., by multiplying the ciphertexts \(\mathsf{thrEnc}_y(\mathsf{BF}(S_i)- \varvec{1})\) received from all players:

    $$ \mathsf{thrEnc}_y(\mathsf{IBF}(\cap S_i)- \mathbf {n}) = \prod _{i=1}^{n} \mathsf{thrEnc}_y(\mathsf{BF}(S_i)- \varvec{1}). $$

  2. \(\mathcal{O}\) randomizes \(\mathsf{thrEnc}_y(\mathsf{IBF}(\cap S_i)- \mathbf {n})\) by \(\mathbf {r}= [r_0, \ldots , r_{m-1}] \in {\mathbb Z}_{q}^m\):

    $$ \mathsf{thrEnc}_y(\mathbf {r}(\mathsf{IBF}(\cap S_i) - \mathbf {n})) =(\mathsf{thrEnc}_y(\mathsf{IBF}(\cap S_i) - \mathbf {n}))^{\mathbf {r}}. $$

  3. \(\mathcal{O}\) broadcasts \(\mathsf{thrEnc}_y(\mathbf {r}(\mathsf{IBF}(\cap S_i) - \mathbf {n}))\) to \(\mathsf{P}_i\).

Computation of \(\cap S_i\):

  1. All players decrypt \(\mathsf{thrEnc}_y(\mathbf {r}(\mathsf{IBF}(\cap S_i) - \mathbf {n}))\) jointly.

  2. \(\mathsf{P}_i\) computes \(\mathsf{SetCheck}(\mathbf {r}(\mathsf{IBF}(\cap S_i) - \mathbf {n}), S_i)\) and obtains \(\cap S_i\).

The above protocol satisfies the correctness requirement. This is because every array position to which an element \(x \in \cap S_i\) is mapped by the hash functions holds an encryption of 0 and is therefore decrypted to \(g^0 = 1\), whereas an array position that is not set in every player’s Bloom filter is decrypted to a random value.
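The four phases can be traced end to end in a short Python simulation. This is a minimal sketch and not the implementation evaluated later in this section: all players and \(\mathcal{O}\) run in one process, the group (p = 1019, q = 509, g = 4) is a toy choice, the hash functions are salted SHA-256, and the randomizers \(r_j\) are drawn from the nonzero elements of \({\mathbb Z}_{q}\) so that a nonzero entry never decrypts to 1 by accident. All names are hypothetical.

```python
import hashlib
import secrets

P, Q, G = 1019, 509, 4   # toy group (assumption): G has prime order Q in F_P^*
M, K = 512, 4            # Bloom filter size and number of hashes (assumption)

def positions(x):
    return [int.from_bytes(hashlib.sha256(f"{i}:{x}".encode()).digest(), "big") % M
            for i in range(K)]

def const(s):
    bf = [0] * M
    for x in s:
        for j in positions(x):
            bf[j] = 1
    return bf

def enc(m, y):
    r = secrets.randbelow(Q)
    return pow(G, r, P), pow(G, m % Q, P) * pow(y, r, P) % P

def joint_dec(ct, shares):
    u, v = ct
    z = 1
    for x in shares:                       # each player contributes u^{x_i}
        z = z * pow(u, x, P) % P
    return v * pow(z, -1, P) % P           # g^m

def mpsi(sets):
    n = len(sets)
    # Initialization: key shares x_i and the n-player public key y.
    shares = [secrets.randbelow(Q) for _ in range(n)]
    y = 1
    for x in shares:
        y = y * pow(G, x, P) % P
    # Each P_i encrypts BF(S_i) - 1 position by position and sends it to O.
    uploads = [[enc(b - 1, y) for b in const(s)] for s in sets]
    # O: multiply the n ciphertexts per position (encrypts IBF - n), then randomize.
    randomized = []
    for j in range(M):
        u, v = 1, 1
        for up in uploads:
            u, v = u * up[j][0] % P, v * up[j][1] % P
        r = 1 + secrets.randbelow(Q - 1)   # nonzero randomizer
        randomized.append((pow(u, r, P), pow(v, r, P)))
    # Players: joint decryption; a position decrypts to 1 iff IBF[j] = n.
    dec = [joint_dec(ct, shares) for ct in randomized]
    # SetCheck: keep own elements whose k positions all decrypted to 1.
    return [{x for x in s if all(dec[j] == 1 for j in positions(x))} for s in sets]

print(mpsi([{"a", "b", "c"}, {"b", "c", "d"}, {"c", "b", "e"}]))
```

The only work that grows with n is the per-position product computed by \(\mathcal{O}\); each player encrypts m values and contributes one exponentiation per position to the joint decryption. As with any Bloom filter, an element outside the intersection can be reported with the small false positive probability discussed in Sect. 3.1.2.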

3.1.4.3 Security Proof

The security of our MPSI protocol is as follows.

Theorem 3.1

For any coalition of fewer than n players, the MPSI is player-private against an honest-but-curious adversary under the DDH assumption.

Proof

The views of \(\mathsf{P}_i\) and \(\mathcal{O}\), that is,

$$ \mathsf{thrEnc}_y(\mathsf{BF}_{m,k}(S_i)) =[\mathsf{thrEnc}_y(\mathsf{BF}_i[0]), \ldots , \mathsf{thrEnc}_y(\mathsf{BF}_i[m-1])], $$

are shown to be indistinguishable from a random vector \(\mathbf {r}= [r_0, \ldots , r_{m-1}] \in {\mathbb Z}_{q}^m\). Assume that a polynomial-time distinguisher \(\mathcal {D}\) outputs 0 when the views are presented as a random vector and outputs 1 when they are constructed in MPSI, \(\mathsf{thrEnc}(\mathsf{BF}_i[0]), \ldots , \mathsf{thrEnc}(\mathsf{BF}_i[m-1])\). We show that a simulator \(\overline{\text{ SIM }}\) that breaks the DDH assumption can then be constructed as follows.

Upon receiving a DDH challenge \((\overline{g}, \overline{g}^\alpha , \overline{g}^\beta , \overline{g}^\gamma )\), \(\overline{\text{ SIM }}\) executes the following:

  1. Set n-player public key \(y = \overline{g}^{\beta }\) and choose random numbers \(d_0,\ldots ,d_{m-1}\) and \(r_1,\ldots ,r_{m-1}\) from \({\mathbb Z}_{q}\).

  2. Send \([(\overline{g}^\alpha , \overline{g}^{d_0} \cdot \overline{g}^\gamma ), ((\overline{g}^\alpha )^{r_1},\overline{g}^{d_1} \cdot (\overline{g}^\gamma )^{r_1}), \ldots , ((\overline{g}^\alpha )^{r_{m-1}},\overline{g}^{d_{m-1}} \cdot (\overline{g}^\gamma )^{r_{m-1}}) ]\) as \(\overline{\mathsf{thrEnc}_y(\mathsf{BF}_{m,k}(S_i))}\) to \(\mathcal {D}\).

If \((\overline{g}, \overline{g}^\alpha , \overline{g}^\beta , \overline{g}^\gamma )\) is a DH tuple, i.e., \(\gamma =\alpha \beta \), then \(\overline{\mathsf{thrEnc}_y(\mathsf{BF}_{m,k}(S_i))}\) is distributed in the same way as when constructed by the MPSI scheme. Thus, \(\mathcal {D}\) must output 1. If \((\overline{g}, \overline{g}^\alpha , \overline{g}^\beta , \overline{g}^\gamma )\) is not a DH tuple, then \(\overline{\mathsf{thrEnc}_y(\mathsf{BF}_{m,k}(S_i))}\) is randomly distributed, and \(\mathcal {D}\) has to output 0. Therefore, \(\overline{\text{ SIM }}\) can use the output of \(\mathcal {D}\) to respond to the DDH challenge correctly. Under the DDH assumption, this is possible only with negligible advantage, so \(\mathcal {D}\) can distinguish the views with only negligible advantage over random guessing. Furthermore, as all inputs of each player are encrypted until the decryption is performed, and decryption cannot be performed by fewer than n players, nothing can be learned by any player prior to decryption.

As for the views of \(\mathsf{thrEnc}_y(\mathbf {r}(\mathsf{IBF}_{m,k}(\cap S_i) - \mathbf {n}))\), the same argument holds. Therefore, for any coalition of fewer than n players, MPSI is player-private under the honest-but-curious model.

Next, we present d-and-over MPSI. The procedures of d-and-over MPSI are the same as those of MPSI until \(\mathcal{O}\) computes \(\mathsf{thrEnc}_y(\mathsf{IBF}(\cap S_i))\). Thus, we describe the procedure after \(\mathcal{O}\) computes \(\mathsf{thrEnc}_y(\mathsf{IBF}(\cap S_i))\).

Encryption of \(\ell \)-subtraction of \(\mathsf{IBF}(\cap S_i)\): \(\mathcal{O}\) executes the following:

  1. For each \(\ell \) with \(d \le \ell \le n\), encrypt \(\mathsf{IBF}(\cap S_i)- \varvec{\ell }\) randomized by \(\mathbf {r}= [r_0, \ldots , r_{m-1}] \in {\mathbb Z}_{q}^m\): \( \mathsf{thrEnc}_y(\mathbf {r}(\mathsf{IBF}(\cap S_i) - \varvec{\ell })) =(\mathsf{thrEnc}_y(\mathsf{IBF}(\cap S_i)) \cdot \mathsf{thrEnc}_y (-\varvec{\ell }))^{\mathbf {r}}.\)

  2. Broadcast \(\{ \mathsf{thrEnc}_y(\mathbf {r}(\mathsf{IBF}(\cap S_i) - \varvec{\ell })) \}_{\ell }\) (\(d \le \ell \le n\)) to \(\mathsf{P}_i\).

d-and-over MPSI computation: \(\mathsf{P}_i\) executes the following:

  1. All \(\mathsf{P}_i\) jointly decrypt \(\{ \mathsf{thrEnc}_y(\mathbf {r}(\mathsf{IBF}(\cap S_i) - \varvec{\ell })) \}_{\ell }\).

  2. Let \(\mathsf{CBF}_{\ell }\) be an m-array for \(d \le \ell \le n\), where an array is set to 1 if and only if the corresponding array of the decrypted \(\mathbf {r}(\mathsf{IBF}(\cap S_i)- \varvec{\ell })\) is 1, and others are set to 0.

  3. Set \(\mathsf{CBF}= \mathsf{CBF}_{d} \vee \cdots \vee \mathsf{CBF}_n\).

  4. Execute \(\mathsf{SetCheck}_{m,k}(\mathsf{CBF}, S_i) \longrightarrow \cap ^{\ge d} S[i]\) and output \(\cap ^{\ge d} S[i]\).

The correctness of d-and-over MPSI follows from the fact that if an element \(x \in \cap ^{\ell } S\) for \(d \le \exists \ell \le n\), then each array location to which x is mapped by the k hashes holds, in \(\mathsf{IBF}(\cap S_i) - \mathbf {j}\) for \(\ell \le \exists j \le n\), an encryption of 0, which is decrypted to 1; otherwise, the location holds an encryption of a randomized value.
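The logic of the d-and-over check can be illustrated at the plaintext level. In the sketch below the encryption, randomization, and joint decryption are deliberately omitted (an assumption made to keep the example short): a position of \(\mathsf{CBF}_{\ell }\) is set exactly when the integrated filter equals \(\ell \) there, which is the condition under which the randomized ciphertext would decrypt to 1. The hashes are salted SHA-256 and all names are hypothetical.

```python
import hashlib

M, K = 512, 4   # Bloom filter size and number of hashes (assumption)

def positions(x):
    return [int.from_bytes(hashlib.sha256(f"{i}:{x}".encode()).digest(), "big") % M
            for i in range(K)]

def const(s):
    bf = [0] * M
    for x in s:
        for j in positions(x):
            bf[j] = 1
    return bf

def d_and_over_plain(sets, d):
    """Plaintext model of d-and-over MPSI: elements held by at least d players."""
    n = len(sets)
    # Integrated Bloom filter: per-position sum of all players' filters.
    ibf = [sum(col) for col in zip(*(const(s) for s in sets))]
    cbf = [0] * M
    for ell in range(d, n + 1):
        # CBF_ell[j] = 1 iff IBF[j] - ell = 0 (the case that decrypts to 1).
        cbf_ell = [1 if v == ell else 0 for v in ibf]
        cbf = [a | b for a, b in zip(cbf, cbf_ell)]   # CBF = CBF_d OR ... OR CBF_n
    # SetCheck against CBF for every player.
    return [{x for x in s if all(cbf[j] for j in positions(x))} for s in sets]

print(d_and_over_plain([{"a", "b", "c"}, {"b", "c", "d"}, {"c", "e"}], d=2))
```

With d = n the procedure reduces to the ordinary intersection, since only positions with sum n are set in \(\mathsf{CBF}\).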

3.1.5 Efficiency

Although many PSI protocols have been proposed, to the best of our knowledge, relatively few consider the multiparty scenario [1,2,3,4]. Our target is multiparty private set intersection, where the final result must be obtained by all players acting together, without a trusted third-party (TTP). Among previous MPSI protocols, the approach in [3] computes only the approximate number of intersections, and that in [4] requires more than two TTPs. The protocol in [2] follows almost the same method as [1] and thus has a similar complexity; the only difference lies in the security model. Hence, we only compare our scheme with that of [1].

The computational and communication efficiency of the proposed protocol and [1] are compared in Table 3.1. These approaches are secure against honest-but-curious adversaries without a TTP under exElGamal encryption (DDH security) and Paillier encryption (Decisional Composite Residuosity (DCR) security), respectively. The Bloom filter parameters (m, k) used in our protocol are set as follows: \(k = 80\) and \(m=80 \omega /\ln 2\), where \(\omega \) is the maximum of \(|S_i| = \omega _i\). Then, the probability of false positives is given by \(p=2^{-80}\).
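The parameter choice can be checked numerically with the approximate false positive formula from Sect. 3.1.2: with \(m = k\omega /\ln 2\), the exponent \(k\omega /m\) equals \(\ln 2\), so each factor is 1/2 and the rate is \(2^{-k}\). The sketch below uses \(\omega = 10{,}000\) as an example value and rounds m up to an integer; the function name is hypothetical.

```python
import math

def bloom_parameters(omega, k=80):
    """Return (m, approximate FPR) for m = k * omega / ln 2, rounded up."""
    m = math.ceil(k * omega / math.log(2))
    fpr = (1 - math.exp(-k * omega / m)) ** k
    return m, fpr

m, fpr = bloom_parameters(10_000)
print(m, math.log2(fpr))   # log2 of the rate is very close to -80
```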

Our MPSI uses the Bloom filter for the computations performed by \(\mathsf{P}_i\) and the integrations performed by \(\mathcal{O}\). The use of a Bloom filter eliminates the restriction on set size. Thus, in our MPSI, the set size of each player is flexible. Moreover, \(\mathsf{P}_i\)’s computations consist of Bloom filter construction, joint decryption, and Bloom filter check. Neither the computations related to the Bloom filter nor the joint decryption depends on the number of players, as shown in Sect. 3.1.2. In summary, the computational complexity of operations performed by \(\mathsf{P}_i\) is \(O(\omega _i)\). All player-dependent data are sent to \(\mathcal{O}\), who integrates them as \(\prod _{i=1}^{n} \mathsf{thrEnc}_y(\mathsf{BF}(S_i) - \varvec{1})\) without decryption. Therefore, the computational complexity of operations performed by \(\mathcal{O}\) is \(O(n \omega )\).

Table 3.1 Efficiency of [1] and the proposed protocol

3.1.6 System and Performance

PSI or MPSI implicitly assumes that every attendee can provide data, any attendee can retrieve data from the shared data, and all attendees can communicate with each other. If PSI or MPSI is implemented straightforwardly, the resulting system resembles a peer-to-peer (P2P) network system. Although a fully distributed system like a P2P network has attractive features, such as high availability and scalability, it also has some unfavorable features.

The network address and port translation (NAPT) is a major obstacle for P2P network systems. Modern P2P network systems take advantage of NAPT traversal technologies to overcome NAPT, but these make the architecture complex and costly. The absence of a trusted node is also an obstacle for attendee or group management. Reaching consensus on a P2P network system is difficult or highly costly. Additionally, unpredictable node joining and leaving make P2P network systems complex. To avoid the complexities of P2P networks, we designed a system based on the client server model.

Next, we discuss the design of the client server model for PSI or MPSI. PSI or MPSI has two main functionalities: (1) data sharing, which shares data among attendees, and (2) data retrieval, which retrieves data from the shared data. Any attendee can retrieve data from the shared data, but the retrieval avoids collecting privacy-sensitive data by using the privacy-preserving techniques described above.

However, we do not assume that every attendee both provides and retrieves data. Imagine an incident analysis situation in which data are provided by several organizations that employ labor and operate machines, and a research institute collects data from the organizations and analyzes them. In such a situation, data providers do not need the data retrieving functionality, and data analysts do not need the data sharing functionality.

Therefore, we define 3 roles for our MPSI application design as follows.

  • Parties: entities for data providing

  • Clients: entities for data retrieving

  • Dealer: an entity for forwarding requests between parties and clients

From the perspective of privilege separation, defining and separating roles are significant. Figure 3.2 shows a P2P network model and our client server model. As shown in this figure, every P2P network node is connected to every other node and can both provide and retrieve data, whereas in the client server model parties only provide data and clients only retrieve data. The dealer forwards requests from parties and clients and provides other functionalities that are not specified by PSI or MPSI. For example, attendee or group management, user authentication, and data logging should be performed by the dealer.

Fig. 3.2 P2P and client server model

Fig. 3.3 Sequence diagram of MPSI application

Figure 3.3 shows an example sequence diagram of our MPSI application. In this figure, there are 2 parties, 1 client, and 1 dealer. First, parties 1 and 2 join the dealer (join p1 and p2). A party must join before providing data, and joining is performed only once at initialization. After that, the client sends a data retrieval request to the dealer (cl req), and the parties send requests to check whether the dealer has received data retrieval requests from clients (new-req p1 and p2). Then, the parties and the dealer generate keys, share the keys, encrypt data, and decrypt data (gpk p1 and p2, enc p1 and p2, and dec p1 and p2). Finally, the client gets the result from the dealer.

Fig. 3.4 Performance

We measured the performance of our MPSI application, written in Python, on an Amazon EC2 server (2.4 GHz CPU, 1 GB memory). Figure 3.4 shows the results for 2 to 4 parties that provide data containing 10,000 entries. The results show that data retrieval takes approximately 280 s and that the computational cost does not depend on the number of parties.

3.2 Classification

In this section, we present a secure classification protocol, which is a type of secure computation protocol. The protocol has two participants, Alice and Bob. Alice has private data x, and Bob has a classification model C. The task is for Alice to learn C(x) at the end of the protocol while preserving the privacy of x and C. That is, Alice learns only C(x) and Bob learns nothing. Our construction is based on a code-based public-key encryption scheme called HQC [20], which is a candidate in NIST’s Post-Quantum Cryptography standardization [21].

3.2.1 Error-Correcting Code

We start with several fundamental notions for error-correcting codes.

Definition 3.3

(Linear code) A code \(\mathbb {C}\) such that \(c_1+c_2 \in \mathbb {C}\) holds for any codewords \(c_1, c_2 \in \mathbb {C}\) is called a linear code. A code \(\mathbb {C}\) of code length n and information bit number k is described as an [n, k] code.

Definition 3.4

(Generator matrix) A matrix \({\mathbb G}\in \mathbb {F}^{k \times n}\) that satisfies

$$\begin{aligned} \mathbb {C}=\{ \varvec{m}\cdot {\mathbb G}|\varvec{m}\in \mathbb {F}^k\} \end{aligned} \tag{3.1}$$

is called a generator matrix. The rows of the generator matrix form a basis of the linear code, so the matrix generates all codewords.

Definition 3.5

(Parity check matrix) A matrix \(\mathbf {H}\in \mathbb {F}^{(n-k)\times n}\) that satisfies

$$\begin{aligned} \mathbb {C}=\{\varvec{x}\in \mathbb {F}^n|\mathbf {H}\cdot \varvec{x}^\top =\varvec{0}\} \end{aligned} \tag{3.2}$$

is called a parity check matrix.
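Definitions 3.3 to 3.5 can be checked on a small example. The sketch below uses a [7, 4] binary Hamming code in systematic form, \({\mathbb G} = [I_4 \mid A]\) and \(\mathbf {H} = [A^\top \mid I_3]\); the code and the block A are choices made here for illustration and are not taken from the protocol in this section.

```python
from itertools import product

K, N = 4, 7
A = [[1, 1, 0], [1, 0, 1], [0, 1, 1], [1, 1, 1]]   # illustrative parity block

# Generator matrix G = [I_4 | A] and parity check matrix H = [A^T | I_3] over F_2.
G = [[int(i == j) for j in range(K)] + A[i] for i in range(K)]
H = [[A[i][r] for i in range(K)] + [int(r == j) for j in range(N - K)]
     for r in range(N - K)]

def encode(m):
    """Codeword m * G over F_2."""
    return tuple(sum(m[i] * G[i][j] for i in range(K)) % 2 for j in range(N))

def syndrome(x):
    """H * x^T over F_2; the zero vector exactly for codewords."""
    return tuple(sum(H[r][j] * x[j] for j in range(N)) % 2 for r in range(N - K))

code = {encode(m) for m in product((0, 1), repeat=K)}

# Every codeword has zero syndrome, and the sum of two codewords is a codeword.
print(all(syndrome(c) == (0, 0, 0) for c in code))
print(all(tuple((a + b) % 2 for a, b in zip(c1, c2)) in code
          for c1 in code for c2 in code))
```

The set of all \(\varvec{m}\cdot {\mathbb G}\) and the set of vectors with zero syndrome coincide, which is the sense in which (3.1) and (3.2) describe the same code.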

Definition 3.6

(Circulant matrix) For \(\varvec{x}=(x_1,\dots ,x_n)\in \mathbb {F}^n\), the circulant matrix for \(\varvec{x}\) is defined as

$$\begin{aligned} \mathbf {rot}(\varvec{x})={ \left( \begin{matrix} x_1 &{} x_n &{} \cdots &{} x_2 \\ x_2 &{} x_1 &{} \cdots &{} x_3 \\ \vdots &{} \vdots &{} \ddots &{} \vdots \\ x_n &{} x_{n-1} &{} \cdots &{} x_1 \end{matrix} \right) } \in \mathbb {F}^{n\times n} \end{aligned} \tag{3.3}$$

In addition, when \(\varvec{x}\) and \(\varvec{y}\) are regarded as polynomials, their product \(\varvec{x}\cdot \varvec{y}\) (taken modulo \(X^n-1\)) has the following properties:

$$\begin{aligned} \begin{aligned} \varvec{x} \varvec{\cdot } \varvec{y}&= \varvec{x} \varvec{\times } {\mathbf {rot}}\varvec{(y)}^\top \\&= \varvec{(}{\mathbf {rot}}\varvec{(x)} \varvec{\times } \varvec{y}^\top \varvec{)}^\top \\&= \varvec{y} \varvec{\times } {\mathbf {rot}}\varvec{(x)}^\top \\&= \varvec{y} \varvec{\cdot } {\varvec{x}}. \end{aligned} \end{aligned}$$
(3.4)
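The identities in Eq. (3.4) can be verified numerically. The sketch below assumes that vectors are coefficient lists of polynomials multiplied modulo \(X^n-1\), and that indices start at 0; the function names `rot`, `poly_mul`, `vec_mat`, and `transpose` are hypothetical helpers introduced for illustration.

```python
def rot(x):
    """Circulant matrix of Eq. (3.3): the first column is x, each next column is shifted down."""
    n = len(x)
    return [[x[(i - j) % n] for j in range(n)] for i in range(n)]

def poly_mul(x, y, q=2):
    """Product of x and y as polynomials modulo X^n - 1 with coefficients in F_q (q prime)."""
    n = len(x)
    out = [0] * n
    for i, xi in enumerate(x):
        for j, yj in enumerate(y):
            out[(i + j) % n] = (out[(i + j) % n] + xi * yj) % q
    return out

def vec_mat(v, M, q=2):
    """Row vector times matrix over F_q."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) % q
            for j in range(len(M[0]))]

def transpose(M):
    return [list(col) for col in zip(*M)]

x = [1, 0, 1, 1, 0]   # 1 + X^2 + X^3
y = [0, 1, 1, 0, 0]   # X + X^2

product_xy = poly_mul(x, y)
via_rot_y = vec_mat(x, transpose(rot(y)))   # x * rot(y)^T
via_rot_x = vec_mat(y, transpose(rot(x)))   # y * rot(x)^T
```

All three computations give the same vector, which also equals `poly_mul(y, x)`, in line with the commutativity stated in Eq. (3.4).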

Definition 3.7

(Cyclic shift) For an n-dimensional vector \((c_0,\dots ,c_{n-1})\), the operation of shifting each \(c_i~(i=0,\dots ,n-2)\) to the right by one position and moving \(c_{n-1}\) to the beginning of the vector is called a cyclic shift. That is, it is the mapping \(\sigma :(c_0,c_1,\dots ,c_{n-1})\mapsto (c_{n-1},c_0,\dots ,c_{n-2})\).

Definition 3.8

(Quasi-cyclic code) Let \(\varvec{c}=(\varvec{c}_0,\dots ,\varvec{c}_{s-1})\in (\mathbb {F}_2^n)^s\) be an arbitrary codeword of a code \(\mathbb {C}\), and let \(\sigma \) be the cyclic shift operation. If \((\sigma (\varvec{c}_0),\dots ,\sigma (\varvec{c}_{s-1}))\in \mathbb {C}\) for every such codeword, \(\mathbb {C}\) is called an s-quasi-cyclic code. In particular, when s = 1, \(\mathbb {C}\) is called a cyclic code.

Definition 3.9

(Systematic quasi-cyclic code) An s-quasi-cyclic \([sn, n]\) code is called a systematic quasi-cyclic code if it has a parity check matrix of the form

$$\begin{aligned} {\varvec{H}}= \left[ \begin{matrix} {\varvec{I}}_n &{} 0 &{} \cdots &{} 0 &{} {\varvec{A}}_1 \\ 0 &{} {\varvec{I}}_n &{}&{}&{} {\varvec{A}}_2 \\ &{}&{} \ddots &{}&{} \vdots \\ 0 &{}&{} \cdots &{} {\varvec{I}}_n &{} {\varvec{A}}_{s-1} \end{matrix} \right] \end{aligned}$$
(3.5)

Here, \(\varvec{A}_1,\ldots ,\varvec{A}_{s-1}\) are \(n\times n\) circulant matrices.

3.2.2 Security Assumptions

The security of the public-key cryptosystem HQC is based on the computational hardness of the quasi-cyclic syndrome decoding problem. More specifically, its security is proved under the following decisional quasi-cyclic syndrome decoding assumption.

Definition 3.10

(Quasi-cyclic syndrome decoding assumption) Let n and w be integers, and let \(s\ge 2\) be the number of blocks of an s-quasi-cyclic code. In the quasi-cyclic syndrome decoding decision problem, one is given \((\mathbf {H},\varvec{y}^\top )\), where \(\mathbf {H}{\mathop {\longleftarrow }\limits ^{\$}}\mathbb {F}^{(sn-n)\times sn}\) is the parity check matrix of a random systematic quasi-cyclic code and \(\varvec{y}{\mathop {\longleftarrow }\limits ^{\$}}\mathbb {F}^{sn-n}\) is a vector, and must decide whether the pair follows the quasi-cyclic syndrome decoding distribution or the uniform distribution over \(\mathbb {F}^{(sn-n)\times sn}\times \mathbb {F}^{(sn-n)}\). The assumption states that every efficient algorithm distinguishes the two distributions with only negligible advantage.

As will be described later, the security of the secure computation protocol proposed in this section reduces to the security of HQC. The protocol is therefore secure under the same assumption as HQC.

3.2.3 Security Requirements for 2PC

Secure two-party computation (2PC) is a special case of secure multi-party computation. It has been studied by many researchers because it is closely related to many cryptographic protocols. The purpose of 2PC is to construct a general-purpose protocol with which two parties can jointly compute an arbitrary function without revealing their input values to each other. One of the best-known examples of 2PC is Yao’s millionaires’ problem [22], in which Alice and Bob decide who is richer without revealing their wealth. Specifically, suppose that Alice has a yen and Bob has b yen. The problem is to decide whether \(a\ge b\) while keeping a and b secret from each other. Generally speaking, the security requirement of 2PC is that a function is computed by a protocol without leaking either party’s input to the other, so that only the computation result becomes known.

A two-party linear function evaluation is a kind of 2PC and must satisfy the 2PC security requirements. In other words, the participants perform the evaluation without revealing their inputs to each other. The functionality of the protocol is the evaluation of a linear function: the protocol computes \(f(m)=a\cdot m+b\). The participants are called Alice and Bob. Alice’s input is m, and Bob’s input is the pair of linear function parameters a and b. Alice obtains only the result \(f(m)=a\cdot m+b\) through the protocol, and Bob obtains nothing.

Below we define the security requirements for two-party linear function secure computation.

Definition 3.11

(Security against semi-honest adversaries) Let \(f=(f_A,f_B)\) be the function that maps the input x of Alice (A) and the input y of Bob (B) to the pair \((f_A(x,y),f_B(x,y))\). A aims to obtain \(f_A(x,y)\) and B aims to obtain \(f_B(x,y)\).

Let \(f=(f_A,f_B)\) be a probabilistic polynomial-time function, and let \(\pi \) be a two-party protocol for computing f. Let \(\mathrm{view}^\pi _A(x,y,n)\) denote the view of A in an execution of \(\pi \) on inputs (x, y) with security parameter n, and let \(\mathrm{view}^\pi _B (x,y,n)\) denote the view of B. The output of A is \(\mathrm{output}^\pi _A(x,y,n)\) and the output of B is \(\mathrm{output}^\pi _B(x,y,n)\). In addition, the joint output of the two parties is denoted as \(\mathrm{output}^\pi (x,y,n)=(\mathrm{output}^\pi _A(x,y,n),\mathrm{output}^\pi _B(x,y,n))\).

We say that the protocol \(\pi \) securely computes the function f in the presence of semi-honest adversaries if there exist probabilistic polynomial-time algorithms \(S_A\) and \(S_B\) such that, for any x, y with \(|x|=|y|=n\), \(n\in \mathbb {N}\), the following holds:

$$\begin{aligned}&\{(S_A(1^n,x,f_A(x,y)),f(x,y))\}_{x,y,n} \\ {\mathop {\equiv }\limits ^{c}}&\{(\mathrm{view}^\pi _A(x,y,n),\mathrm{output}^\pi (x,y,n))\}_{x,y,n}, \\&\{(S_B(1^n,y,f_B(x,y)),f(x,y))\}_{x,y,n} \\ {\mathop {\equiv }\limits ^{c}}&\{(\mathrm{view}^\pi _B(x,y,n),\mathrm{output}^\pi (x,y,n))\}_{x,y,n}. \end{aligned}$$

3.2.4 HQC Encryption Scheme

The protocols proposed in this section are based on the Hamming Quasi-Cyclic (HQC) cryptosystem of Gaborit et al. [20], a public-key cryptosystem based on the quasi-cyclic syndrome decoding problem, which we introduce first. This cryptosystem uses two kinds of codes: a quasi-cyclic code and an error-correcting code \(\mathbb {C}\). The error-correcting code \(\mathbb {C}\) is an arbitrary linear code (such as a BCH code) with sufficient error correction capability, and it is used for message encoding and decoding. The quasi-cyclic code provides the security of the cryptosystem by generating noise that an adversary cannot remove.

The participants of the HQC cryptosystem are Alice (A) and Bob (B), and B aims to send the input message \(\varvec{m}\) securely to A. The cryptosystem is performed as follows:

  1. Global parameter settings:

    The parameters param = \((n,k,\delta ,w_x,w_r,w_e)\) and the generator matrix \({\mathbb G}\in \mathbb {F}^{k \times n}\) of the code \(\mathbb {C}\) are fixed.

  2. Key generation:

    A generates a random \(\varvec{h} {\mathop {\longleftarrow }\limits ^{\$}} \mathbb {R}\).

    Furthermore, A generates \((\varvec{x}, \varvec{y}) {\mathop {\longleftarrow }\limits ^{\$}} \mathbb {R}^2\), where the Hamming weight of each of \(\varvec{x}\) and \(\varvec{y}\) is \(w_x\).

    The secret information is sk = \((\varvec{x}, \varvec{y})\), and the public information is pk = \((\varvec{h}, \varvec{s} = \varvec{x} + \varvec{h} \cdot \varvec{y})\). A sends the public information pk to B.

  3. Encryption:

    B generates random \(\varvec{e}{\mathop {\longleftarrow }\limits ^{\$}}\mathbb {R}\) and \((\varvec{r_1},\varvec{r_2}){\mathop {\longleftarrow }\limits ^{\$}}\mathbb {R}^2\).

    The Hamming weight of \(\varvec{e}\) is \(w_e\), and the Hamming weight of each of \(\varvec{r_1}\) and \(\varvec{r_2}\) is \(w_r\).

    Then, on input \(\varvec{m}\), B computes \(\varvec{u}=\varvec{r_1} + \varvec{h} \cdot \varvec{r_2}\) and \(\varvec{v}=\varvec{m}\cdot {\mathbb G}+ \varvec{s} \cdot \varvec{r_2} + \varvec{e}\). B sends the ciphertext \((\varvec{u},\varvec{v})\) back to A.

  4. Decryption:

    A uses the decoding function \(\mathbb {C}\).Decode\((\varvec{v}-\varvec{u} \cdot \varvec{y})\) of the error-correcting code \(\mathbb {C}\) to recover B’s message \(\varvec{m}\).

In the HQC cryptosystem, the message \(\varvec{m}\) is encoded by the error-correcting code, and the term \(\varvec{s}\cdot \varvec{r_2}+\varvec{e}\) derived from the public information \(\varvec{s}\) is added to it during encryption. Since this term acts as noise with a large Hamming weight generated by the quasi-cyclic code, security is guaranteed by the quasi-cyclic syndrome decoding decision assumption introduced above. In the decryption stage, A can use the secret key to remove most of the noise caused by \(\varvec{s}\). However, the noise \(\varvec{x\cdot r_2-r_1\cdot y+e}\) remains. If the weight of this noise does not exceed the maximum number of correctable errors \(\delta \) of the error-correcting code, correct decoding is possible. The analysis assumes Hamming weights \(w_x,w_r,w_e = \mathcal {O}(\sqrt{n})\). Moreover, Gaborit et al. show that the probability that \(\omega (\varvec{x\cdot r_2+e-y\cdot r_1})\le \delta \), where \(\omega (\cdot )\) denotes the Hamming weight, increases as the code length n becomes larger. In addition, the HQC cryptosystem is IND-CPA secure under the quasi-cyclic syndrome decoding decision assumption.
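The four stages above can be illustrated with a toy implementation. The sketch below makes several assumptions that are not part of the text: \(\mathbb {R}\) is taken to be \(\mathbb {F}_2[X]/(X^n-1)\), the error-correcting code is a repetition code with k = 1, and the parameters n = 101 and weight 3 are chosen only so that decoding is guaranteed. These parameters offer no security, and the function names are hypothetical.

```python
import random

N = 101   # toy code length (assumption; real parameters are far larger)
W = 3     # toy Hamming weight used for x, y, r1, r2 and e (assumption)

def mul(a, b):
    """Multiply two vectors as polynomials in F_2[X]/(X^N - 1)."""
    out = [0] * N
    for i in range(N):
        if a[i]:
            for j in range(N):
                if b[j]:
                    out[(i + j) % N] ^= 1
    return out

def add(*vectors):
    """Add vectors over F_2 (subtraction is the same operation)."""
    return [sum(bits) % 2 for bits in zip(*vectors)]

def sparse(rng):
    """Sample a vector of Hamming weight W."""
    support = set(rng.sample(range(N), W))
    return [int(i in support) for i in range(N)]

def keygen(rng):
    """Key generation: sk = (x, y), pk = (h, s = x + h*y)."""
    h = [rng.randrange(2) for _ in range(N)]
    x, y = sparse(rng), sparse(rng)
    return (x, y), (h, add(x, mul(h, y)))

def encrypt(pk, bit, rng):
    """Encryption: u = r1 + h*r2, v = m*G + s*r2 + e with G the all-ones row."""
    h, s = pk
    r1, r2, e = sparse(rng), sparse(rng), sparse(rng)
    u = add(r1, mul(h, r2))
    v = add([bit] * N, mul(s, r2), e)
    return u, v

def decrypt(sk, ciphertext):
    """Decryption: majority-decode v - u*y, whose noise is x*r2 - r1*y + e."""
    x, y = sk
    u, v = ciphertext
    noisy = add(v, mul(u, y))
    return int(sum(noisy) > N // 2)
```

With these toy weights the residual noise has weight at most 3·3 + 3·3 + 3 = 21, which is below N/2, so the majority vote always recovers the bit. For example, after `sk, pk = keygen(random.Random(0))`, the call `decrypt(sk, encrypt(pk, 1, random.Random(1)))` returns the encrypted bit.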

3.2.5 Proposed Protocol

3.2.5.1 Linear Function Evaluation

We introduce a two-party protocol for the secure evaluation of linear functions.

Following Gaborit’s HQC cryptosystem, we use two codes: a quasi-cyclic code and an arbitrary error-correcting code \(\mathbb {C}\). The participants in the protocol are Alice (A) and Bob (B). A’s input is \(m\in \mathbb {F}_2\), B’s input is \(a,b\in \mathbb {F}_2\), B receives no output, and A’s output is \(a\cdot m+b\). The protocol is given in Protocol 3.2.5.1.

Protocol

Linear function evaluation protocol

input

A: \(m\in \mathbb {F}_2\)

B: \(a,b\in \mathbb {F}_2\)

output

A: \(a\cdot m+b\)

B: \(\perp \)

  1. The global parameters param = \((n,k, \delta , w_x, w_r, w_e)\) and the generator matrix \({\mathbb G}\in \mathbb {F}^{k \times n}\) of the code \(\mathbb {C}\) are chosen.

  2. A generates a random \(\varvec{h} {\mathop {\longleftarrow }\limits ^{\$}} \mathbb {R}\). Furthermore, A generates \((\varvec{x},\varvec{y}){\mathop {\longleftarrow }\limits ^{\$}}\mathbb {R}^2\), where the Hamming weight of each of \(\varvec{x}\) and \(\varvec{y}\) is \(w_x\). The secret information is sk = \((\varvec{x}, \varvec{y})\), and the public information is pk = \((\varvec{h}, \varvec{s} = \varvec{x} + \varvec{h} \cdot \varvec{y})\).

  3. By padding the input m with zeros, A forms \(\varvec{m} = (m, 0, \dots , 0)\) of dimension k. A generates random \(\varvec{r_A, r_u, r_v} {\mathop {\longleftarrow }\limits ^{\$}} \mathbb {R}\), each of Hamming weight \(w_r\). Then, A computes \((\varvec{u = h \cdot r_A + r_u}, \varvec{v = m} \cdot {\mathbb G}+ \varvec{s \cdot r_A + r_v})\). A sends the public information \(\varvec{h, s}\) and the ciphertext pair \(\varvec{u, v}\) to B.

  4. B sets \(\varvec{b} = (b,0, \dots ,0)\) and generates \(\varvec{r_B}{\mathop {\longleftarrow }\limits ^{\$}}\mathbb {R}\) and \((\varvec{e_u},\varvec{e_v}){\mathop {\longleftarrow }\limits ^{\$}}\mathbb {R}^2\). Here, the Hamming weight of \(\varvec{r_B}\) is \(w_r\), and the Hamming weight of each of \(\varvec{e_u}\) and \(\varvec{e_v}\) is \(w_e\). B computes \(\varvec{u}'=a\cdot \varvec{u+h\cdot r_B + e_u}\) and \(\varvec{v}'=a\cdot \varvec{v+b\cdot {\mathbb G}+s\cdot r_B + e_v}\). B sends \(\varvec{u}', \varvec{v}'\) back to A.

  5. A applies \(\mathbb {C}\).Decode(\(\varvec{v' - u' \cdot y}\)), the decoding function of the error-correcting code \(\mathbb {C}\), and recovers \(a\cdot m+b\) by taking the first bit of the result.

First, the global parameters are set. Here, n is the code length, k is the number of information bits, \(\delta \) is the maximum number of correctable errors of the error-correcting code, and \(w_x, w_r, w_e\) are Hamming weights fixed in advance. For example, they can be set to half of the \(\mathcal {O}(\sqrt{n})\) weights assumed by Gaborit et al. The public parameter \({\mathbb G}\) is a generator matrix of the error-correcting code \(\mathbb {C}\), which maps messages to codewords as \(\mathbb {F}^k_2\rightarrow \mathbb {F}^n_2\).

A generates random \(\varvec{h}{\mathop {\longleftarrow }\limits ^{\$}}\mathbb {R}\) and \((\varvec{x},\varvec{y}){\mathop {\longleftarrow }\limits ^{\$}}\mathbb {R}^2\) and computes \(\mathbf {s=x + h \cdot y}\). Here,

$$\begin{aligned} \begin{aligned} {\varvec{s}}&= {\varvec{x}} + {\varvec{h}} {\varvec{\cdot }} {\varvec{y}} \\&= {\varvec{x}}+{\varvec{y}}{\varvec{\cdot }}{\mathbf {rot}}{\varvec{(h)}}^\top \\&= {\varvec{(}} \begin{matrix} {\varvec{x}}&{\varvec{y}} \end{matrix} {\varvec{)(}} \begin{matrix} {{\varvec{I}}_{\varvec{n}}}&{\mathbf {rot}}{\varvec{(h)}} \end{matrix} {\varvec{)}}^\top . \end{aligned} \end{aligned}$$
(3.6)

Thus, \(\varvec{s}\) has the form of a syndrome, and recovering \((\varvec{x},\varvec{y})\) from it corresponds to the quasi-cyclic syndrome decoding problem. Then, A sets the secret information sk to \((\varvec{x, y})\) and the public information pk to \((\varvec{h, s})\).

A pads the input m with zeros, obtaining \(\varvec{m} = (m, 0, \dots , 0)\) of dimension k. A generates \(\varvec{r_A,r_u,r_v}{\mathop {\longleftarrow }\limits ^{\$}}\mathbb {R}\), encodes \(\varvec{m}\) with the error-correcting code, and randomizes it. A generates the ciphertext pair \((\varvec{u=h \cdot r_A + r_u}, \varvec{v=m} \cdot {\mathbb G}+ \varvec{s \cdot r_A + r_v})\) and sends it to B. From B’s viewpoint, \(\varvec{v}\) contains noise derived from \(\varvec{s}\) that cannot be decoded, and B has no secret information with which to remove it, so B cannot learn \(\varvec{m}\).

B sets \(\varvec{b} = (b, 0, \dots , 0)\), generates \(\varvec{r_B}{\mathop {\longleftarrow }\limits ^{\$}}\mathbb {R}\) and \(\varvec{(e_u,e_v)}{\mathop {\longleftarrow }\limits ^{\$}}\mathbb {R}^2\), and computes \(\varvec{u}'=a\cdot \varvec{u+h\cdot r_B + e_u}\), \(\varvec{v}'=a\cdot \varvec{v+b\cdot {\mathbb G}+s\cdot r_B + e_v}\), which updates and re-randomizes \(\varvec{u}\) and \(\varvec{v}\). Since the error-correcting code is a linear code, the updated \(\varvec{u'}\) and \(\varvec{v'}\) are

$$\begin{aligned} \varvec{u}' = \left\{ \begin{array}{ll} \varvec{h \cdot r_B + e_u} &{} ~(\text {In the case of a = 0}) \\ \varvec{u + h \cdot r_B + e_u} &{} ~(\text {In the case of a = 1}). \end{array} \right. \end{aligned}$$
(3.7)
$$\begin{aligned} \varvec{v}' = \left\{ \begin{array}{ll} \varvec{b \cdot {\mathbb G}+ s \cdot r_B + e_v} &{} ~(\text {In the case of a = 0}) \\ \varvec{v + b \cdot {\mathbb G}+ s \cdot r_B + e_v} &{} ~(\text {In the case of a = 1}). \end{array} \right. \end{aligned}$$
(3.8)

Finally, A uses her secret information to compute \(\varvec{v}' - \varvec{u}' \cdot \varvec{y}\). The result is

$$\begin{aligned} \begin{aligned}&\varvec{v}' - \varvec{u}' \cdot \varvec{y} \\ =&(a\varvec{m+b}){\mathbb G}+ \varvec{x}(a\varvec{r_A+r_B}) - \varvec{y}(a\varvec{r_u+e_u}) + (a\varvec{r_v+e_v}) \\ =&\left\{ \begin{array}{l} \varvec{b{\mathbb G}+xr_B-ye_u+e_v}~~~~~~~~~~~~~~~~~(\text {in the case of a = 0}) \\ \varvec{(m+b){\mathbb G}+x(r_A+r_B)-y(r_u+e_u)+(r_v+e_v)} \\ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~(\text {in the case of a = 1}). \end{array} \right. \end{aligned} \end{aligned}$$
(3.9)

As Eq. (3.9) shows, computing \(\varvec{v}' - \varvec{u}' \cdot \varvec{y}\) cancels the terms that involve \(\varvec{h}\), so that only the codeword \((a\varvec{m}+\varvec{b}){\mathbb G}\) and low-weight noise remain. After decoding, taking the first bit makes \(a\cdot m+b\) available to A.
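The five protocol steps can be traced in a toy implementation. As assumptions that go beyond the text, the sketch takes \(\mathbb {R}=\mathbb {F}_2[X]/(X^n-1)\), uses a repetition code with k = 1 (so \({\mathbb G}\) is the all-ones row and no padding is needed), and fixes n = 101 with all weights equal to 3. The parameters are insecure and the function names are hypothetical.

```python
import random

N, W = 101, 3  # toy code length and Hamming weight (assumptions, insecure)

def mul(a, b):
    """Multiply two vectors as polynomials in F_2[X]/(X^N - 1)."""
    out = [0] * N
    for i in range(N):
        if a[i]:
            for j in range(N):
                if b[j]:
                    out[(i + j) % N] ^= 1
    return out

def add(*vectors):
    """Add vectors over F_2 (subtraction is the same operation)."""
    return [sum(bits) % 2 for bits in zip(*vectors)]

def sparse(rng):
    """Sample a vector of Hamming weight W."""
    support = set(rng.sample(range(N), W))
    return [int(i in support) for i in range(N)]

def keygen(rng):                          # step 2
    h = [rng.randrange(2) for _ in range(N)]
    x, y = sparse(rng), sparse(rng)
    return (x, y), (h, add(x, mul(h, y)))

def alice_encrypt(pk, m, rng):            # step 3
    h, s = pk
    r_a, r_u, r_v = sparse(rng), sparse(rng), sparse(rng)
    return add(mul(h, r_a), r_u), add([m] * N, mul(s, r_a), r_v)

def bob_evaluate(pk, ct, a, b, rng):      # step 4
    h, s = pk
    u, v = ct
    r_b, e_u, e_v = sparse(rng), sparse(rng), sparse(rng)
    u2 = add([a * bit for bit in u], mul(h, r_b), e_u)
    v2 = add([a * bit for bit in v], [b] * N, mul(s, r_b), e_v)
    return u2, v2

def alice_decode(sk, ct2):                # step 5
    x, y = sk
    u2, v2 = ct2
    noisy = add(v2, mul(u2, y))           # v' - u'*y as in Eq. (3.9)
    return int(sum(noisy) > N // 2)       # majority vote of the repetition code
```

In this toy setting the noise in Eq. (3.9) has weight at most 3·6 + 3·6 + 6 = 42 < N/2, so A always obtains \(a\cdot m+b\). A repetition code is used only because its decoder fits in one line; any linear code with sufficient correction capability could take its place.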

3.2.5.2 Correctness and Security of the Proposed Protocol

The correctness of the two-party linear function evaluation protocol proposed in this study depends on the decoding capability of the code \(\mathbb {C}\). Specifically, assuming that \(\mathbb {C}\).Decode decodes \(\varvec{v'-u'\cdot y}\) correctly, the following equation is satisfied:

$$\begin{aligned} \mathrm{Decrypt}(sk,\mathrm{Encrypt}(pk,a\cdot \varvec{m+b}))=a\cdot \varvec{m+b}. \end{aligned}$$
(3.10)

Also, let \(\epsilon \) be the error in \(\varvec{v'-u'\cdot y}\) that must be handled by the error correction capability of the code \(\mathbb {C}\). The error is

$$\begin{aligned} \epsilon = \left\{ \begin{array}{l} \varvec{xr_B-ye_u+e_v}~~~~~~~~~~~~~(\text {In the case of a = 0}) \\ \varvec{x(r_A+r_B)-y(r_u+e_u)+(r_v+e_v)} \\ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~(\text {In the case of a = 1}). \end{array} \right. \end{aligned}$$
(3.11)

In the paper of Gaborit et al., \(\mathbb {C}\).Decode works correctly when \(\omega (\varvec{x\cdot r_2+e-y\cdot r_1})\le \delta \) is satisfied, and \(w_r\) and \(w_e\) have the same value in the actual evaluation. If the Hamming weights of \(\varvec{r_A,r_u,r_v,r_B}\) in the protocol proposed in this section are set to 1/2 of \(w_r\) of Gaborit et al. and the Hamming weights of \(\varvec{e_u,e_v}\) are set to 1/2 of \(w_e\) of Gaborit et al., then the Hamming weight of the error in Eq. (3.11) is less than or equal to the Hamming weight of the error in the setting of Gaborit et al. Therefore, the conclusion of the paper of Gaborit et al. also holds for the proposed protocol. As the code length n increases, the decoding failure rate of the error-correcting code decreases. If the code length n and the noise Hamming weights \(w_r\) and \(w_e\) are set appropriately, the decoding failure rate approaches 0.
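The weight comparison can be made explicit with a short worst-case calculation. This is only a sketch: it assumes the elementary bounds \(\omega (\varvec{a}\cdot \varvec{b})\le \omega (\varvec{a})\,\omega (\varvec{b})\) and \(\omega (\varvec{a}+\varvec{b})\le \omega (\varvec{a})+\omega (\varvec{b})\), writes \(w_x, w_r, w_e\) for the weights of Gaborit et al., and bounds the worst case rather than reproducing their probabilistic analysis. For a = 1 in Eq. (3.11), with the halved weights,

$$\begin{aligned} \omega (\epsilon )&\le w_x\left( \tfrac{w_r}{2}+\tfrac{w_r}{2}\right) + w_x\left( \tfrac{w_r}{2}+\tfrac{w_e}{2}\right) + \left( \tfrac{w_r}{2}+\tfrac{w_e}{2}\right) \\&= 2\,w_x w_r + w_e \quad (\text {when } w_r = w_e), \end{aligned}$$

which coincides with the corresponding worst-case bound \(\omega (\varvec{x\cdot r_2-r_1\cdot y+e})\le w_x w_r + w_x w_r + w_e\) for the original scheme. The case a = 0 involves fewer terms and gives a smaller bound.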

The security requirements of the proposed protocol are described above. In this section, we prove the security against semi-honest adversaries.

Theorem 3.2

Under the quasi-cyclic syndrome decoding assumption, the 2PC protocol securely computes linear functions for semi-honest adversaries.

Proof

First, consider the semi-honest adversary A. With the global parameters omitted, the view of A is \(\mathrm{view}_A=(\varvec{m};\varvec{h,x,y,r_A,r_u,r_v};\varvec{u',v'})\). We construct a simulator \(S_A(\varvec{m,x,y})\) as follows:

  1. Generate \(\varvec{\widetilde{h},\widetilde{r_A},\widetilde{r_u},\widetilde{r_v},\widetilde{u'},\widetilde{v'}}{\mathop {\longleftarrow }\limits ^{\$}}\mathbb {R}\) randomly.

    Here, the Hamming weight of \(\varvec{\widetilde{r_A},\widetilde{r_u},\widetilde{r_v}}\) is \(w_r\).

  2. Output \((\varvec{m},\varvec{x},\varvec{y};\varvec{\widetilde{h},\widetilde{r_A},\widetilde{r_u},\widetilde{r_v};\widetilde{u'},\widetilde{v'}})\).

Since \(\varvec{h,r_A,r_u,r_v}\) and \(\varvec{\widetilde{h},\widetilde{r_A},\widetilde{r_u},\widetilde{r_v}}\) follow the same distribution, the following holds:

$$\begin{aligned} \begin{aligned}&(\varvec{m,x,y};\varvec{\widetilde{h},\widetilde{r_A},\widetilde{r_u},\widetilde{r_v};\widetilde{u'},\widetilde{v'}}) \\ \equiv _s~&(\varvec{m,x,y};\varvec{h,r_A,r_u,r_v;\widetilde{u'},\widetilde{v'}}). \end{aligned} \end{aligned}$$
(3.12)

In \(\mathrm{view}_A\), we have \(\varvec{u}'=a\cdot \varvec{u+h\cdot r_B + e_u}\) and \(\varvec{v}'=a\cdot \varvec{v+b\cdot {\mathbb G}+s\cdot r_B + e_v}\), and the following holds:

$$\begin{aligned} \left[ \begin{matrix} {\varvec{h\cdot r_B + e_u}} \\ {\varvec{s\cdot r_B + e_v}} \end{matrix} \right] = \left[ \begin{matrix} {{\varvec{I}}_n} &{} 0 &{} \mathrm{rot}(\varvec{h}) \\ 0 &{} {{\varvec{I}}_n} &{} \mathrm{rot}(\varvec{s}) \end{matrix} \right] \left[ \begin{matrix} {\varvec{e_u}} \\ {\varvec{e_v}} \\ {\varvec{r_B}} \end{matrix} \right] . \end{aligned}$$
(3.13)

Therefore, under the 3-quasi-cyclic syndrome decoding decision assumption, a probabilistic polynomial-time adversary cannot distinguish \((\varvec{h\cdot r_B + e_u}, \varvec{s\cdot r_B + e_v})\) from uniform random values. Since \(\varvec{u}\) and \(\varvec{v}\) are also covered by the 3-quasi-cyclic syndrome decoding decision assumption, the adversary cannot distinguish \(\varvec{u}\) and \(\varvec{v}\) from uniform random values either. Thus, the distribution of \(\varvec{u}'\) and \(\varvec{v}'\) is also computationally indistinguishable from the uniform distribution, and the following holds:

$$\begin{aligned} \begin{aligned}&(\varvec{m,x,y};\varvec{h,r_A,r_u,r_v,\widetilde{u'},\widetilde{v'}})\\ \equiv _c~&(\varvec{m,x,y};\varvec{h,r_A,r_u,r_v,u',v'}). \end{aligned} \end{aligned}$$
(3.14)

Thus, the distributions of the view \(\mathrm{view}_A\) of A and the simulator \(S_A\) are indistinguishable against polynomial-time adversaries:

$$\begin{aligned} \begin{aligned}&S_A(\varvec{m,x,y}) \\ \equiv _c~&\mathrm{view}_A(\varvec{m,x,y};\varvec{h,r_A,r_u,r_v};\varvec{u',v'}). \end{aligned} \end{aligned}$$
(3.15)

Next, consider the semi-honest adversary B. With the global parameters omitted, the view of B is \(\mathrm{view}_B=(a,b;\varvec{h,s,u,v,r_B,e_u,e_v})\). We construct a simulator \(S_B(a,b)\) as follows:

  1. Randomly generate \(\varvec{\widetilde{h},\widetilde{s},\widetilde{u},\widetilde{v},\widetilde{r_B},\widetilde{e_u},\widetilde{e_v}}{\mathop {\longleftarrow }\limits ^{\$}}\mathbb {R}\). Here, the Hamming weight of \(\varvec{\widetilde{r_B}}\) is \(w_r\), and the Hamming weight of \(\varvec{\widetilde{e_u}}\) and \(\varvec{\widetilde{e_v}}\) is \(w_e\).

  2. Output \((a,b;\varvec{\widetilde{h},\widetilde{s},\widetilde{u},\widetilde{v},\widetilde{r_B},\widetilde{e_u},\widetilde{e_v}})\).

Since \(\varvec{h,r_B,e_u,e_v}\) and \(\varvec{\widetilde{h},\widetilde{r_B},\widetilde{e_u},\widetilde{e_v}}\) follow the same distribution, the following holds:

$$\begin{aligned} \begin{aligned}&(a,b;\varvec{\widetilde{h},\widetilde{s},\widetilde{u},\widetilde{v},\widetilde{r_B},\widetilde{e_u},\widetilde{e_v}}) \\ \equiv _s~&(a,b;\varvec{h,\widetilde{s},\widetilde{u},\widetilde{v},r_B,e_u,e_v}). \end{aligned} \end{aligned}$$
(3.16)

Note that distinguishing \(\varvec{s}\) from uniform reduces to the 2-quasi-cyclic syndrome decoding decision problem, so a polynomial-time adversary cannot distinguish its distribution from uniform random values. Therefore, the following holds:

$$\begin{aligned} \begin{aligned}&(a,b;\varvec{h,\widetilde{s},\widetilde{u},\widetilde{v},r_B,e_u,e_v}) \\ \equiv _c~&(a,b;\varvec{h,s,\widetilde{u},\widetilde{v},r_B,e_u,e_v}). \end{aligned} \end{aligned}$$
(3.17)

Moreover, since \(\varvec{u}=\varvec{h\cdot r_A + r_u}\) and \(\varvec{v}=\varvec{m}\cdot {\mathbb G}+\varvec{s\cdot r_A + r_v}\), and a probabilistic polynomial-time adversary cannot distinguish \((\varvec{h\cdot r_A + r_u}, \varvec{s\cdot r_A + r_v})\) from uniform random values under the quasi-cyclic syndrome decoding assumption, the following holds:

$$\begin{aligned} \begin{aligned}&(a,b;\varvec{h,s,\widetilde{u},\widetilde{v},r_B,e_u,e_v}) \\ \equiv _c~&(a,b;\varvec{h,s,u,v,r_B,e_u,e_v}). \end{aligned} \end{aligned}$$
(3.18)

Therefore, a polynomial-time adversary cannot distinguish the view \(\mathrm{view}_B\) of B from the output of the simulator \(S_B\):

$$\begin{aligned} \begin{aligned}&S_B(a,b) \\ \equiv _c~&\mathrm{view}_B(a,b;\varvec{h,s,u,v,r_B,e_u,e_v}). \end{aligned} \end{aligned}$$
(3.19)

    \(\square \)

The above protocol works over \(\mathbb {F}_2\), but one can see that this can be easily extended to a larger field \(\mathbb {F}_q\) by using appropriate error-correcting linear codes over \(\mathbb {F}_q\).

3.2.5.3 Secure Comparison

The two-party secure comparison protocol proposed in this section is based on the comparison method used in the secure decision tree classification protocol of Wu et al. [23]. We use the criterion given in Proposition 3.1 for the comparison.

Proposition 3.1

For t-bit integers x and y with bits \(x_i\) and \(y_i\) (indexed from the most significant bit), if there is an \(i\in [t]\) such that the following expression holds, then \(x<y\).

$$ x_i-y_i+1+3\sum _{j<i}(x_j\oplus y_j)=0. $$
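The criterion of Proposition 3.1 can be checked exhaustively for small bit lengths. The sketch below assumes that bits are indexed from the most significant bit and that the expression is evaluated over the integers; the function names are hypothetical.

```python
def to_bits(value, t):
    """Return the t-bit binary expansion of value, most significant bit first."""
    return [(value >> (t - 1 - i)) & 1 for i in range(t)]

def criterion_values(x_bits, y_bits):
    """Evaluate x_i - y_i + 1 + 3 * sum_{j<i} (x_j XOR y_j) for every position i."""
    values = []
    differing_above = 0          # number of more significant positions where x and y differ
    for xi, yi in zip(x_bits, y_bits):
        values.append(xi - yi + 1 + 3 * differing_above)
        differing_above += xi ^ yi
    return values

def less_than(x_bits, y_bits):
    """Decide x < y by looking for a zero among the criterion values."""
    return 0 in criterion_values(x_bits, y_bits)

T = 4
criterion_matches = all(
    less_than(to_bits(x, T), to_bits(y, T)) == (x < y)
    for x in range(2 ** T) for y in range(2 ** T)
)
```

A zero appears exactly at the first position where the two numbers differ, and only if x has 0 and y has 1 there. For x = 5 and y = 6 with T = 4, the values are [1, 1, 0, 5], so the zero at the third position signals x < y. The exhaustive check also confirms the converse direction for this bit length: no zero appears when x ≥ y.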

The proposed two-party secret comparison protocol uses a quasi-cyclic code and an arbitrary error-correcting code (for example, a Reed-Solomon code) over \({\mathbb F}_{q}\). The participants in the protocol are Alice (A) and Bob (B). The input of A is \(c\in \mathbb {N}\), and the input of B is \(d\in \mathbb {N}\). The output of A is the result of the comparison between c and d, and B receives no output.

The flow of the two-party secret comparison protocol is as follows:

Protocol

Two-party secret comparison protocol

Input

A : \(c\in \mathbb {N}\)

B : \(d\in \mathbb {N}\)

Output

A : Comparison result of c and d

B : \(\perp \)

  1. A and B expand their inputs c and d in binary so that \(\varvec{c}=c_1c_2\dots c_l\) and \(\varvec{d}=d_1d_2\dots d_l\). Then, each bit \(c_i,d_i\) is padded to obtain \(\varvec{c_i}, \varvec{d_i}\), \(i\in [l]\), of length k. In addition, they set the global parameters param = \((n,k,\delta ,w_x,w_r)\) and the generator matrix \({\mathbb G}\in {\mathbb F}_{q}^{k \times n}\) of the code \(\mathbb {C}\).

  2. A generates a random \(\varvec{h}{\mathop {\longleftarrow }\limits ^{\$}}\mathbb {R}\). Furthermore, A generates \((\varvec{x},\varvec{y}){\mathop {\longleftarrow }\limits ^{\$}}\mathbb {R}^2\) with Hamming weight \(w_x\). The private key is \(sk = (\varvec{x},\varvec{y})\), and the public key is \(pk = (\varvec{h,s=x + h \cdot y})\).

  3. A generates random \(\varvec{r_{Ai}},\varvec{r_{ui}},\varvec{r_{vi}}{\mathop {\longleftarrow }\limits ^{\$}}\mathbb {R}, i\in [l]\), with Hamming weight \(w_r\). Then, A computes \(\varvec{u_i=h \cdot r_{Ai}+r_{ui}}\) and \(\varvec{v_i=c_i \cdot G + s \cdot r_{Ai}+r_{vi}}\) for each \(i\in [l]\) and sends the l ciphertext pairs \((\varvec{u_i},\varvec{v_i})\) to B.

  4. B generates \((\varvec{r_{Bi}},\varvec{e_{ui}},\varvec{e_{vi}}){\mathop {\longleftarrow }\limits ^{\$}}\mathbb {R}^3\) with Hamming weight \(w_r^*\) and evaluates the expression \(c_i-d_i+1+3\sum _{w<i}(c_w\oplus d_w)\) on the encrypted \(c_i\). Specifically, B substitutes the plaintext \(d_i\), \(i\in [l]\), into the above expression and sets appropriate coefficients \(a_{1i},a_{2i},\dots ,a_{li},\varvec{b_i}\). B computes \(\varvec{u_i}'=a_{1i}\cdot \varvec{u_1}+\dots +\varvec{h\cdot r_{Bi}+e_{ui}}\) and \(\varvec{v_i}'=a_{1i}\cdot \varvec{v_1}+\dots +\varvec{b_i}\cdot {\mathbb G}+\varvec{s\cdot r_{Bi}+e_{vi}}\) for all l pairs. Then, B randomly permutes the l pairs \((\varvec{u_i}',\varvec{v_i}')\) and sends them to A.

  5. A computes \(\varvec{v_i}'-\varvec{u_i}'\cdot \varvec{y}\) for each \(i\in [l]\) and decodes the result. If any decoded result has 0 in its first position, A outputs \(c<d\). Otherwise, A outputs \(c\ge d\).

Protocol Description

  1. In step 1, A and B expand their inputs c and d into l-bit binary representations \(\varvec{c}=c_1c_2\dots c_l\) and \(\varvec{d}=d_1d_2\dots d_l\), where \(c_i, d_i\), \(i\in [l]\), are the ith digits of \(\varvec{c,d}\) and l is the bit length. For encoding, each bit is padded to \(\varvec{c_i}, \varvec{d_i}\), \(i\in [l]\), of length k.

    In addition, the global parameters are set. Here, n is the code length, k is the number of information bits, \(\delta \) is the maximum number of errors that can be corrected by the error-correcting code, and \(w_x\) and \(w_r\) are Hamming weights fixed in advance. The public parameter \({\mathbb G}\) is the generator matrix (for example, a Reed-Solomon code generator matrix) of the error-correcting code \(\mathbb {C}\), which maps messages to codewords as \({\mathbb F}_{q}^k\rightarrow {\mathbb F}_{q}^n\).

  2. In step 2, A generates a private key and a public key for the HQC encryption scheme.

  3. In step 3, A uses the public key to encrypt each \(\varvec{c_i}\) and sends the resulting ciphertexts \((\varvec{u_i},\varvec{v_i}) , i\in [l]\), to B.

  4. Step 4 uses Proposition 3.1 to evaluate \(c_i-d_i+1+3\sum _{w<i}(c_w\oplus d_w)\). In other words, \(c<d\) if there exists \(i\in [l]\) such that

    $$\begin{aligned} c_i-d_i+1+3\sum _{w<i}(c_w\oplus d_w)=0. \end{aligned}$$
    (3.20)

    In particular, since B has the plaintext \(d_i\) and the encryption of \(c_i\), Eq. (3.20) can be regarded as an expression in the unknown \(c_i\) and can be evaluated on the ciphertexts. In addition, for XOR operations, B can transform \(x_i \oplus y_i\) into

    $$\begin{aligned} x_i \oplus y_i = \left\{ \begin{array}{ll} x_i &{} (y_i = 0) \\ 1-x_i &{} (y_i = 1). \end{array} \right. \end{aligned}$$
    (3.21)

    Therefore, the XOR operation requires only the additive homomorphism of HQC encryption scheme.

    That is, B substitutes plaintext \(d_i, i\in [l]\) into the above equation, sets the appropriate \(a_{1i},a_{2i},\dots ,a_{li},\varvec{b_i}\), and computes as follows:

    $$\begin{aligned}&\varvec{u_i}'=a_{1i} \!\cdot \!\varvec{u_1}\!+\!\cdots \!+\!a_{li}\!\cdot \!\varvec{u_l}\!+\!\varvec{h\!\cdot \!r_{Bi}\!+\!e_{ui}}. \end{aligned}$$
    (3.22)
    $$\begin{aligned}&\varvec{v_i}'=a_{1i} \!\cdot \!\varvec{v_1}\!+\!\cdots \!+\!a_{li}\!\cdot \!\varvec{v_l}\!+\!\varvec{b_i \!\cdot \!G\!+\!s\!\cdot \!r_{Bi}\!+\!e_{vi}}. \end{aligned}$$
    (3.23)

    Here, the Hamming weight of \(\varvec{r_{Bi}},\varvec{e_{ui}},\varvec{e_{vi}}, i\in [l]\) is \(w_r^*\).

    Furthermore, so that A does not learn which bits differ, B must randomly permute the computed pairs \((\varvec{u_i}',\varvec{v_i}')\) before sending them.

  5. In step 5, A computes \(\varvec{v_i}'-\varvec{u_i}'\cdot \varvec{y}, i\in [l]\). The result is

    $$\begin{aligned} \begin{aligned}&\varvec{v_i}' - \varvec{u_i}' \cdot \varvec{y} \\ =&~(a_{1i}\cdot \varvec{c_1}+\cdots +a_{li}\cdot \varvec{c_l}+\varvec{b_i})\cdot {\mathbb G}\\&+ \varvec{x}\cdot (a_{1i}\cdot \varvec{r_{A1}}+\cdots +a_{li}\cdot \varvec{r_{Al}}+\varvec{r_{Bi}}) \\&- \varvec{y}\cdot (a_{1i}\cdot \varvec{r_{u1}}+\cdots +a_{li}\cdot \varvec{r_{ul}}+\varvec{e_{ui}}) \\&+(a_{1i}\cdot \varvec{r_{v1}}+\cdots +a_{li}\cdot \varvec{r_{vl}}+\varvec{e_{vi}}). \end{aligned} \end{aligned}$$
    (3.24)

    Then, the evaluation result is decoded by the error-correcting code. A takes the first position of each of the l decoding results and outputs \(c<d\) if any of them is 0. If there is no 0, A outputs \(c\ge d\).
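The way B derives the coefficients \(a_{1i},\dots ,a_{li},\varvec{b_i}\) in step 4 can be sketched on plaintexts, leaving out the encryption layer. The sketch assumes a prime field size q = 31, which is an arbitrary toy value chosen larger than 3l − 1 so that no value wraps around to 0, and uses 0-based indices with the most significant bit first. The function names are hypothetical.

```python
import random

Q = 31  # toy prime field size (assumption); it must be larger than 3*l - 1

def to_bits(value, l):
    """Return the l-bit binary expansion of value, most significant bit first."""
    return [(value >> (l - 1 - i)) & 1 for i in range(l)]

def affine_coefficients(d_bits, i, q=Q):
    """Return (a, b) such that sum_j a[j]*c_j + b equals
    c_i - d_i + 1 + 3 * sum_{w<i} (c_w XOR d_w) modulo q for all bit vectors c."""
    a = [0] * len(d_bits)
    a[i] = 1
    b = (1 - d_bits[i]) % q
    for w in range(i):
        if d_bits[w] == 0:
            a[w] = 3              # c_w XOR 0 = c_w, as in Eq. (3.21)
        else:
            a[w] = (-3) % q       # c_w XOR 1 = 1 - c_w, as in Eq. (3.21)
            b = (b + 3) % q
    return a, b

def compare_in_clear(c_bits, d_bits, rng, q=Q):
    """Plaintext analogue of steps 4 and 5: evaluate, shuffle, look for a zero."""
    results = []
    for i in range(len(d_bits)):
        a, b = affine_coefficients(d_bits, i, q)
        results.append((sum(aj * cj for aj, cj in zip(a, c_bits)) + b) % q)
    rng.shuffle(results)          # hides which position produced the zero
    return 0 in results
```

Because each expression is affine in the bits \(c_j\), only additions and multiplications by constants known to B are needed, which is what the additive homomorphism provides. For d = (1, 0, 1) and the last position, `affine_coefficients` returns a = [28, 3, 1] and b = 3 modulo 31.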

3.2.5.4 Correctness and Security of the Proposed Protocol

Correctness

First, we explain the weight \(w_r^*\) used in step 4. The Hamming weight of the polynomial coefficient vectors \(\varvec{x,y}\) is \(w_x\), and the Hamming weight of \(\varvec{r_{Ai},r_{ui},r_{vi}},i\in [l]\), is \(w_r\). Since each is selected uniformly and independently, the probability of each bit value of these vectors is expressed as follows:

$$\begin{aligned} x_i = y_i = \left\{ \begin{array}{ll} 0 &{} ~\mathrm{w.p.}~~1-p \\ 1 &{} ~\mathrm{w.p.}~~p=\frac{w_x}{n}. \end{array} \right. \end{aligned}$$
(3.25)

Similarly,

$$\begin{aligned} r_{Ai,j} = r_{ui,j} = r_{vi,j} = \left\{ \begin{array}{ll} 0 &{} ~\mathrm{w.p.}~~1-p_r \\ 1 &{} ~\mathrm{w.p.}~~p_r=\frac{w_r}{n}. \end{array} \right. \end{aligned}$$
(3.26)

For each \(i\in [l]\), let L be the set of nonzero coefficients among \(a_{1i},a_{2i},\dots ,a_{li}\) in the expression \(a_{1i}\cdot \varvec{r_{A1}}+a_{2i}\cdot \varvec{r_{A2}}+\cdots +a_{li}\cdot \varvec{r_{Al}}\):

$$L = \{ a_{ki} | a_{ki}\ne 0 \}$$

Let |L| be the number of elements in the set L. The Hamming weight \(w_r^*\) of \(\varvec{r_{Bi},e_{ui},e_{vi}}\) is set as follows:

$$\begin{aligned} w_r^*= (n-|L|+1)w_r. \end{aligned}$$

Thus, the value of each \(w_r^*\) can be determined based on the nonzero numbers in \(a_i\) and \(i\in [l]\).

Next, we analyze the validity of the proposed protocol.

The correctness of the proposed two-party secure linear function evaluation protocol clearly depends on the decoding capability of \(\mathbb {C}\). Let \(\epsilon \) denote the error contained in \(\varvec{v_i}' - \varvec{u_i}' \cdot \varvec{y}\). The error that code \(\mathbb {C}\) has to correct is

$$\begin{aligned} \begin{aligned} \epsilon =&~~~\varvec{x}\cdot (a_{1i}\cdot \varvec{r_{A1}}+\cdots +a_{li}\cdot \varvec{r_{Al}}+\varvec{r_{Bi}}) \\&- \varvec{y}\cdot (a_{1i}\cdot \varvec{r_{u1}}+\cdots +a_{li}\cdot \varvec{r_{ul}}+\varvec{e_{ui}}) \\&+(a_{1i}\cdot \varvec{r_{v1}}+\cdots +a_{li}\cdot \varvec{r_{vl}}+\varvec{e_{vi}}). \end{aligned} \end{aligned}$$
(3.27)

In other words, decoding is successful if the Hamming weight of \(\epsilon \) is less than \(\delta \). Here, \(\delta \) is the maximum number of errors that can be corrected by the error-correcting code \(\mathbb {C}\). To analyze the correctness of the proposed protocol, we generalize the correctness analysis of the HQC encryption scheme given by Gaborit et al. [20].
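To make the decoding condition concrete, the sketch below computes the Hamming weight of an error of the form in Eq. (3.27) for toy parameters. It assumes that the product of two coefficient vectors is multiplication in \(\mathbb {F}_q[X]/(X^n-1)\) with q prime, and it lets `r_a`, `r_u`, `r_v` stand for the three bracketed combinations in Eq. (3.27). All function names and parameter values are hypothetical and chosen only for illustration; they are not the parameters of the proposed protocol.

```python
import random


def sparse_vector(n, weight, q, rng):
    """Length-n vector over F_q with exactly `weight` nonzero entries."""
    v = [0] * n
    for pos in rng.sample(range(n), weight):
        v[pos] = rng.randrange(1, q)
    return v


def cyclic_mul(a, b, q):
    """Product of a and b in F_q[X]/(X^n - 1), assuming q is prime."""
    n = len(a)
    out = [0] * n
    for i, ai in enumerate(a):
        if ai == 0:
            continue
        for j, bj in enumerate(b):
            if bj != 0:
                out[(i + j) % n] = (out[(i + j) % n] + ai * bj) % q
    return out


def hamming_weight(v):
    """Number of nonzero coordinates."""
    return sum(1 for s in v if s != 0)


def error_weight(x, y, r_a, r_u, r_v, q):
    """Hamming weight of x*r_a - y*r_u + r_v over F_q."""
    xa = cyclic_mul(x, r_a, q)
    yu = cyclic_mul(y, r_u, q)
    eps = [(s - t + e) % q for s, t, e in zip(xa, yu, r_v)]
    return hamming_weight(eps)


rng = random.Random(0)
n, q, w_x, w_r, delta = 101, 3, 3, 2, 30  # toy values, not real parameters
x = sparse_vector(n, w_x, q, rng)
y = sparse_vector(n, w_x, q, rng)
r_a, r_u, r_v = (sparse_vector(n, w_r, q, rng) for _ in range(3))
weight = error_weight(x, y, r_a, r_u, r_v, q)
print("error weight:", weight, "decodable:", weight < delta)
```

Because the product of a weight-\(w_x\) vector and a weight-\(w_r\) vector has weight at most \(w_x w_r\), the error weight in this toy setting is at most \(2w_xw_r+w_r\), which is why sparse noise keeps the error within the correction capability.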

The following proposition holds for the Hamming weight of the error.

Proposition 3.2

Let \(\varvec{x}=(X_1,\dots ,X_n)\) and \(\varvec{r}=(R_1,\dots ,R_n)\) be polynomial coefficient vectors, and let \(\varvec{y}=\varvec{x}\cdot \varvec{r}=(Y_1,\dots ,Y_n)\). The probability that the sum of the random variables \(Y_i, i\in [n]\) over \(\mathbb {F}_q\) is 0 is

$$\begin{aligned} \mathrm{Pr}[Y_1+\dots +Y_n=0]=\frac{1}{q}\{ 1+(1-\frac{q}{q-1}p)^n\cdot (q-1)\}. \end{aligned}$$
(3.28)

Here, the probability distribution of each random variable \(Y_i\) is

$$\begin{aligned} Y_i = \left\{ \begin{array}{ll} 0 &{} ~\mathrm{w.p.}~~p_0 = 1-p \\ 1 &{} ~\mathrm{w.p.}~~p_1 = \frac{p}{q-1} \\ 2 &{} ~\mathrm{w.p.}~~p_2 = \frac{p}{q-1} \\ \vdots &{} \\ q-1 &{} ~\mathrm{w.p.}~~p_{q-1} = \frac{p}{q-1}. \end{array} \right. \end{aligned}$$
(3.29)

Proof

For \(Y_i\), the following equation holds:

$$\begin{aligned} \begin{aligned}&\mathrm{Pr}[Y_1+\cdots +Y_n=0] \\ =&\sum _{\begin{array}{c} i_0+i_1+\cdots +i_{q-1}=n \\ i_0\cdot 0+i_1\cdot 1+\cdots +i_{q-1}\cdot (q-1)=0 \end{array}}\left( \frac{n!}{i_0!\cdots i_{q-1}!}\right) p_0^{i_0}\cdots p_{q-1}^{i_{q-1}}, \end{aligned} \end{aligned}$$
(3.30)

where \(i_0,\dots ,i_{q-1}\) are the numbers of times the corresponding values \(0,\dots ,q-1\) appear. From the multinomial theorem, the following equation holds:

$$\begin{aligned} \begin{aligned}&\{ p_0\!+\!p_1\!+\!\dots \!+\!p_{q-1}\}^n+ \{ p_0\!+\!(\omega _q)p_1\!+\!\cdots \!+\!(\omega _q^{q-1})p_{q-1}\}^n \\&+\cdots +\{ p_0\!+\!(\omega _q)^{q-1}p_1\!+\!\cdots \!+\!(\omega _q^{q-1})^{q-1}p_{q-1}\}^n \\ =&\sum _{i_0+\dots +i_{q-1}=n}\left( \frac{n!}{i_0!\cdots i_{q-1}!}\right) p_0^{i_0}\cdots p_{q-1}^{i_{q-1}} \\&\qquad \{ 1+(\omega _q)^{i_1}(\omega _q^2)^{i_2}\cdots (\omega _q^{q-1})^{i_{q-1}} + \cdots \\&\qquad + (\omega _q)^{(q-1)i_1}(\omega _q^2)^{(q-1)i_2}\cdots (\omega _q^{q-1})^{(q-1)i_{q-1}}\} \\ =&\sum _{i_0+\dots +i_{q-1}=n}\left( \frac{n!}{i_0!\cdots i_{q-1}!}\right) p_0^{i_0}\cdots p_{q-1}^{i_{q-1}} \\&\qquad \{ 1+ \omega _q^{i_1+2i_2+\cdots +(q-1)i_{q-1}} +\cdots \\&\qquad +\omega _q^{(q-1)\{ i_1+2i_2+\cdots +(q-1)i_{q-1}\} } \}. \end{aligned} \end{aligned}$$
(3.31)

Here, \(\omega _q\) is a primitive q-th root of unity, which has the following property:

$$\begin{aligned} 1+\omega _q+\omega _q^2+\cdots +\omega _q^{q-1}=0 \end{aligned}$$
(3.32)

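When the condition \(i_0\cdot 0+i_1\cdot 1+\cdots +i_{q-1}\cdot (q-1)=0\) holds, every term in the braces on the right-hand side of Eq. (3.31) equals 1, so the braces sum to q. Otherwise, the braces sum to 0 by Eq. (3.32). Hence, Eq. (3.31) can be transformed as follows: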

$$\begin{aligned} \begin{aligned}&\quad \{ p_0+p_1+\dots +p_{q-1}\}^n \\&+ \{ p_0+(\omega _q)p_1+\cdots +(\omega _q^{q-1})p_{q-1}\}^n+\cdots \\&+\{ p_0+(\omega _q)^{q-1}p_1+\cdots +(\omega _q^{q-1})^{q-1}p_{q-1}\}^n \\ =&\sum _{\begin{array}{c} i_0+\cdots +i_{q-1}=n \\ i_0\cdot 0+\cdots +i_{q-1}\cdot (q-1)=0 \end{array}}\!\left( \!\frac{n!}{i_0!\cdots i_{q-1}!}\!\right) \!p_0^{i_0}\cdots p_{q-1}^{i_{q-1}}\cdot q. \end{aligned} \end{aligned}$$
(3.33)

Substituting Eq. (3.33) into Eq. (3.30) yields the proposition:

$$\begin{aligned} \begin{aligned}&\mathrm{Pr}[Y_1+\cdots +Y_n=0] \\ =&\frac{1}{q}\{(p_0+p_1+\dots +p_{q-1})^n+\cdots \\&+(p_0+(\omega _q)^{q-1}p_1+\cdots +(\omega _q^{q-1})^{q-1}p_{q-1})^n\} \\ =&\frac{1}{q}\{ 1^n+(1-p+\frac{p}{q-1}(\omega _q+\omega _q^2+\cdots +\omega _q^{q-1}))^n\cdot (q-1)\} \\ =&\frac{1}{q}\left\{ 1+\left( 1-\frac{q}{q-1}p\right) ^n\cdot (q-1)\right\} . \end{aligned} \end{aligned}$$
(3.34)

    \(\square \)
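As a sanity check, the closed form of Eq. (3.28) can be compared with a direct computation for small parameters. The sketch below is illustrative only: it assumes that q is prime, so that addition in \(\mathbb {F}_q\) is addition modulo q, and that the \(Y_i\) are independent with the distribution in Eq. (3.29). The function names are hypothetical. For \(q=2\), the closed form reduces to \(\frac{1}{2}\{ 1+(1-2p)^n\}\).

```python
from fractions import Fraction


def prob_sum_zero_exact(n, q, p):
    """Pr[Y_1 + ... + Y_n = 0 mod q], computed by repeated convolution."""
    # Distribution of a single Y_i as in Eq. (3.29).
    step = [1 - p] + [p / (q - 1)] * (q - 1)
    # Distribution of the empty sum: 0 with probability 1.
    dist = [Fraction(1)] + [Fraction(0)] * (q - 1)
    for _ in range(n):
        new = [Fraction(0)] * q
        for s, ps in enumerate(dist):
            for y, py in enumerate(step):
                new[(s + y) % q] += ps * py
        dist = new
    return dist[0]


def prob_sum_zero_closed(n, q, p):
    """Closed form of Eq. (3.28)."""
    return (1 + (q - 1) * (1 - Fraction(q, q - 1) * p) ** n) / q


p = Fraction(1, 7)  # toy value
for q in (2, 3, 5):
    for n in (1, 4, 10):
        exact = prob_sum_zero_exact(n, q, p)
        closed = prob_sum_zero_closed(n, q, p)
        print(q, n, exact == closed)
```

Exact rational arithmetic is used so that the two values can be compared with `==` rather than with a floating-point tolerance.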

The remainder of the analysis is the same as the correctness analysis in Gaborit et al. [20]. According to the analysis in [20], in the case of \(\mathbb {F}_2\), the decoding failure rate can be controlled by setting an appropriate code length n and noise Hamming weights \(w_x\) and \(w_r\). Therefore, in the case of \({\mathbb F}_{q}\), the decoding failure rate can also be expected to be controllable by setting appropriate parameters.

Security

This section describes the security of the proposed secure comparison protocol.

First, consider a semi-honest adversary A in the case \(\mathrm {output}_A=(c<d)\). Omitting global parameters, A’s view is \(\mathrm{view}_A=(c,\varvec{x,y};\varvec{h},\{ \varvec{r_{Ai}}\}^l_{i=1},\{ \varvec{r_{ui}}\}^l_{i=1},\{ \varvec{r_{vi}}\}^l_{i=1},\{ \varvec{u_i}'\}^l_{i=1}, \{ \varvec{v_i}'\}^l_{i=1})\). Here, the first bit is 0 only for \(\varvec{v_{i*}}'-\varvec{u_{i*}}'\cdot \varvec{y}\) with index \(i*\). The simulator \(S_A(c,\varvec{x},\varvec{y})\) is configured as follows:

  1. Generate \(\varvec{\widetilde{h}},\{\widetilde{\varvec{r_{Ai}}}\}^l_{i=1},\{\widetilde{\varvec{r_{ui}}}\}^l_{i=1},\{\widetilde{\varvec{r_{vi}}}\}^l_{i=1},\{\widetilde{\varvec{u_i}'}\}^l_{i=1},\{\widetilde{\varvec{v_i}'}\}^l_{i=1}{\mathop {\longleftarrow }\limits ^{\$}}\mathbb {R}\) at random. Here, the Hamming weight of each of \(\{ \widetilde{\varvec{r_{Ai}}}\}^l_{i=1},\{ \widetilde{\varvec{r_{ui}}}\}^l_{i=1},\{ \widetilde{\varvec{r_{vi}}}\}^l_{i=1}\) is \(w_r\). In addition, a random index \(i*\in [l]\) is selected such that the first bit of \(\widetilde{\varvec{v_{i*}}'}-\widetilde{\varvec{u_{i*}}'}\cdot \varvec{y}\) is 0 and the first bit of every other \(\{ \widetilde{\varvec{v_i}'}-\widetilde{\varvec{u_i}'}\cdot \varvec{y}\}^l_{i=1,i\ne i*}\) is nonzero.

  2. Randomly permute \(\{ \widetilde{\varvec{u_i}'}\}^l_{i=1},\{ \widetilde{\varvec{v_i}'}\}^l_{i=1}\) to obtain \(\{ \widetilde{\varvec{u_j}'}\}^l_{j=1},\{ \widetilde{\varvec{v_j}'}\}^l_{j=1}\) in random order.

  3. Output \((c,\varvec{x,y};\varvec{\widetilde{h}},\{ \widetilde{\varvec{r_{Ai}}}\}^l_{i=1},\{ \widetilde{\varvec{r_{ui}}}\}^l_{i=1},\{ \widetilde{\varvec{r_{vi}}}\}^l_{i=1},\{ \widetilde{\varvec{u_j}'}\}^l_{j=1},\{ \widetilde{\varvec{v_j}'}\}^l_{j=1})\).

Since \(\varvec{h},\{ \varvec{r_{Ai}}\}^l_{i=1},\{ \varvec{r_{ui}}\}^l_{i=1},\{ \varvec{r_{vi}}\}^l_{i=1}\) and \(\varvec{\widetilde{h}},\{ \widetilde{\varvec{r_{Ai}}}\}^l_{i=1},\{ \widetilde{\varvec{r_{ui}}}\}^l_{i=1},\{ \widetilde{\varvec{r_{vi}}}\}^l_{i=1}\) follow the same distribution, the following equation holds:

$$\begin{aligned} \begin{aligned}&(\varvec{\widetilde{h}},\{ \widetilde{\varvec{r_{Ai}}}\}^l_{i=1},\{ \widetilde{\varvec{r_{ui}}}\}^l_{i=1},\{ \widetilde{\varvec{r_{vi}}}\}^l_{i=1}) \\ \equiv _s&(\varvec{h},\{ \varvec{r_{Ai}}\}^l_{i=1},\{ \varvec{r_{ui}}\}^l_{i=1},\{ \varvec{r_{vi}}\}^l_{i=1}). \end{aligned} \end{aligned}$$
(3.35)

Under the quasi-cyclic syndrome decoding assumption, a probabilistic polynomial-time adversary cannot distinguish \(\varvec{u_j}',\varvec{v_j}', j\in [l]\) from uniformly random values. Furthermore, since \(\{ \widetilde{\varvec{u_i}'}\}^l_{i=1}\) and \(\{ \widetilde{\varvec{v_i}'}\}^l_{i=1}\) are randomly permuted, the index \(i*\) for which the first bit of \(\widetilde{\varvec{v_{i*}}'}-\widetilde{\varvec{u_{i*}}'}\cdot \varvec{y}\) is 0 is uniformly random. Hence, the following relation holds:

$$\begin{aligned} \left( \{ \widetilde{\varvec{u_j}'}\}^l_{j=1},\{ \widetilde{\varvec{v_j}'}\}^l_{j=1}\right) \equiv _c \left( \{ \varvec{u_i}'\}^l_{i=1},\{ \varvec{v_i}'\}^l_{i=1}\right) . \end{aligned}$$
(3.36)

Therefore, when \(\mathrm{output}_A=(c<d)\), A’s view \(\mathrm{view}_A\) and the output of the simulator \(S_A\) are indistinguishable to polynomial-time adversaries.

The case of a semi-honest adversary A with \(\mathrm{output}_A=(c\ge d)\) is proved in the same way as the case \(\mathrm{output}_A=(c<d)\), so the details are omitted.

Next, we consider a semi-honest adversary B. Omitting the global parameters, B’s view is \(\mathrm{view}_B=(d;\varvec{h,s},\{ \varvec{u_i}\}^l_{i=1},\{ \varvec{v_i}\}^l_{i=1},\{ \varvec{r_{Bi}}\}^l_{i=1},\{ \varvec{e_{ui}}\}^l_{i=1},\{ \varvec{e_{vi}}\}^l_{i=1})\). The simulator \(S_B(d)\) is configured as follows:

  1. Generate \(\varvec{\widetilde{h},\widetilde{s}},\{ \widetilde{\varvec{u_i}}\}^l_{i=1},\{ \widetilde{\varvec{v_i}}\}^l_{i=1},\{ \widetilde{\varvec{r_{Bi}}}\}^l_{i=1},\{ \widetilde{\varvec{e_{ui}}}\}^l_{i=1},\{ \widetilde{\varvec{e_{vi}}}\}^l_{i=1}{\mathop {\longleftarrow }\limits ^{\$}}\mathbb {R}\) at random. Here, the Hamming weight of each of \(\{ \widetilde{\varvec{r_{Bi}}}\}^l_{i=1},\{ \widetilde{\varvec{e_{ui}}}\}^l_{i=1},\{ \widetilde{\varvec{e_{vi}}}\}^l_{i=1}\) is \(w_r^*\).

  2. Output \((d;\varvec{\widetilde{h},\widetilde{s}},\{ \widetilde{\varvec{u_i}}\}^l_{i=1}\!,\!\{ \widetilde{\varvec{v_i}}\}^l_{i=1}\!,\!\{ \widetilde{\varvec{r_{Bi}}}\}^l_{i=1}\!,\!\{ \widetilde{\varvec{e_{ui}}}\}^l_{i=1}\!,\!\{ \widetilde{\varvec{e_{vi}}}\}^l_{i=1})\).

Since \(\varvec{h},\{ \varvec{r_{Bi}}\}^l_{i=1},\{ \varvec{e_{ui}}\}^l_{i=1},\{ \varvec{e_{vi}}\}^l_{i=1}\) and \(\varvec{\widetilde{h}},\{ \widetilde{\varvec{r_{Bi}}}\}^l_{i=1},\{ \widetilde{\varvec{e_{ui}}}\}^l_{i=1},\{ \widetilde{\varvec{e_{vi}}}\}^l_{i=1}\) follow the same distribution, the following equation holds:

$$\begin{aligned} \begin{aligned}&(\varvec{h},\{ \varvec{r_{Bi}}\}^l_{i=1},\{ \varvec{e_{ui}}\}^l_{i=1},\{ \varvec{e_{vi}}\}^l_{i=1}) \\ \equiv _s&(\varvec{\widetilde{h}},\{ \widetilde{\varvec{r_{Bi}}}\}^l_{i=1},\{ \widetilde{\varvec{e_{ui}}}\}^l_{i=1},\{ \widetilde{\varvec{e_{vi}}}\}^l_{i=1}). \end{aligned} \end{aligned}$$
(3.37)

Under the decisional 2-quasi-cyclic syndrome decoding assumption, the distribution of \(\varvec{s}\) is indistinguishable from the uniform distribution for probabilistic polynomial-time adversaries. Thus, \( \varvec{\widetilde{s}} \equiv _c \varvec{s}\) holds.

In addition, under the quasi-cyclic syndrome decoding assumption, a probabilistic polynomial-time adversary cannot distinguish \(\varvec{u_i},\varvec{v_i}, i\in [l]\) from uniformly random values. Hence,

$$\begin{aligned} (\{ \widetilde{\varvec{u_i}}\}^l_{i=1},\{ \widetilde{\varvec{v_i}}\}^l_{i=1}) \equiv _c (\{ \varvec{u_i}\}^l_{i=1},\{ \varvec{v_i}\}^l_{i=1}). \end{aligned}$$
(3.38)

Therefore, B’s view \(\mathrm{view}_B\) and the output of the simulator \(S_B\) are indistinguishable to polynomial-time adversaries.

3.2.6 Support Vector Machine from Secure Linear Function Evaluation and Secure Comparison

We can construct a code-based protocol for a support vector machine from the protocols for secure linear function evaluation and secure comparison described above. Note that the result of secure linear function evaluation is in \(\mathbb {F}_q\), while secure comparison operates on bit strings. Therefore, we need a secure bit-decomposition protocol. Bit-decomposition protocols have been well studied in the area of secure computation; indeed, we can use the bit-decomposition protocol given in [24] with a secure computation protocol based on threshold homomorphic encryption [25]. (It is straightforward to construct a threshold version of the HQC scheme by setting \(sk_A=(\varvec{x}_1,\varvec{y}_1)\) and \(sk_B=(\varvec{x}_2,\varvec{y}_2)\) as distributed decryption keys for A and B. Then, the encryption key is \((\varvec{h}, (\varvec{x}_1+\varvec{x}_2)+\varvec{h}\cdot (\varvec{y}_1+\varvec{y}_2))\).)

We give an overview of the protocol below. For simplicity, we denote by [m] the ciphertext of m under the HQC encryption scheme over \(\mathbb {F}_q\).

Protocol

Input

A : \(m\in \mathbb {F}_q\)

B : \(a,b,t\in \mathbb {F}_q\)

Output

A : \(a\cdot m+b>t\) or not

B : \(\perp \)

  1. A and B perform the secure linear evaluation protocol over \(\mathbb {F}_q\). Then, B sends \([a\cdot m+b]\) to A at step 4 of the original protocol.

  2. A and B start the secure bit-decomposition protocol on \([a\cdot m + b]\).

  3. From the result of the bit-decomposition protocol, B obtains the binary representation \([(a\cdot m + b)_1],\ldots ,[(a\cdot m + b)_\ell ]\).

  4. A and B perform the secure comparison protocol, starting from step 4.
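The data flow of the four steps above can be mirrored by the following cleartext reference computation. It performs no encryption and offers no privacy; it only shows which value each step is expected to produce. It assumes that q is prime, that elements of \(\mathbb {F}_q\) are compared as integers in \(\{0,\dots ,q-1\}\), and that bits are ordered from the most significant one. All function names are hypothetical.

```python
def to_bits(value, length):
    """Most-significant-bit-first binary representation of `value`."""
    return [(value >> (length - 1 - k)) & 1 for k in range(length)]


def greater_than_bits(u_bits, t_bits):
    """Return True if the number encoded by u_bits exceeds that of t_bits."""
    for u, t in zip(u_bits, t_bits):
        if u != t:
            # The first differing bit decides the comparison.
            return u > t
    return False  # all bits equal, so u == t


def svm_decision_cleartext(m, a, b, t, q):
    """Cleartext analogue of the protocol: is a*m + b > t over F_q?"""
    length = (q - 1).bit_length()      # bits needed for an element of F_q
    score = (a * m + b) % q            # step 1: linear function evaluation
    score_bits = to_bits(score, length)  # steps 2-3: bit decomposition
    return greater_than_bits(score_bits, to_bits(t, length))  # step 4


# Toy inputs over F_31 (made-up values).
print(svm_decision_cleartext(m=3, a=4, b=5, t=10, q=31))
print(svm_decision_cleartext(m=2, a=4, b=5, t=20, q=31))
```

In the actual protocol, each of these intermediate values exists only in encrypted form, and A learns only the final comparison result.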