1 Introduction

Recommender systems [1] are widely used in e-commerce and online social networks to provide personalized suggestions for products, movies, music, etc. Collaborative filtering (CF) plays a central role in recommender systems: it makes recommendations by aggregating the preferences of similar users (user-based) or similar items (item-based).

However, the tendency towards personalization has raised privacy concerns [2]. To address this problem, privacy-preserving collaborative filtering (PPCF) methods have been proposed, most of which employ traditional approaches such as cryptography, obfuscation, and perturbation. These approaches either incur high computation costs or provide insufficient privacy protection, which impairs their performance in practical applications. In 2006, Dwork [3] proposed the notion of differential privacy (DP), which provides a quantifiable measure of privacy. McSherry and Mironov [4] first introduced DP into CF, constructing a private covariance matrix to randomize each user’s ratings. In the context of recommender systems, DP ensures that no user can tell whether certain items appear in other users’ rating histories.

Unfortunately, these methods fail to resist the k nearest neighbors (kNN) attack, first proposed by Calandrino et al. [5]. In the kNN attack, the adversary infers the rating history of a target user by creating fake users. The attack succeeds because CF reveals users’ preferences, explicitly or implicitly, and the attacker can exploit those preferences to infer further information. Existing PPCF schemes only obscure the values of ratings; they do not protect users’ preferences. To address this problem, Guerraoui et al. [6] proposed D2P, which strengthens the notion of DP in recommender systems by creating a perturbed rating history, named AlterEgo, for each user. However, the perturbation process of D2P depends heavily on subjectively selected parameters, which makes the privacy level non-intuitive and hard to control.

To improve both the privacy protection and the recommendation quality of former solutions, we introduce social relationships into PPCF schemes. Although social relationships have been widely used in recommender systems to boost recommendation accuracy, little work has adopted them to provide effective privacy protection. In practice, users tend to share similar tastes with their close friends, so substituting users’ ratings with their friends’ rating histories obscures the exact ratings while preserving high utility. Introducing social relationship information requires us to customize the perturbed rating histories for different users. Therefore, the modified rating histories in our scheme are generated dynamically and vary with the user who is receiving recommendations. We call the dynamically modified rating history a DyEgo, and correspondingly name our scheme DynaEgo.

The rest of the paper is organized as follows. Section 2 introduces the preliminaries. In Sect. 3, we present our construction along with the theoretical analysis on privacy preserving. Section 4 reports the experimental results on a real data set called Epinions. Section 5 concludes this paper.

2 Preliminaries

2.1 Notation

Let \(U=\{u_1, u_2, ..., u_n\}\) be a set of users and \(I=\{t_1, t_2, ..., t_m\}\) be a set of items. The rating history data set D can be represented as an \(n\times m\) matrix, which can be decomposed into row vectors \(P_i\) and column vectors \(R_i\). Specifically, the row vector \(P_i=[r_{i1}, r_{i2}, ..., r_{im}]\) corresponds to user \(u_i\)’s rating history, and \(r_{ij}\) denotes the rating that user \(u_i\) gave to item \(t_j\). The column vector \(R_i=[r_{1i}, r_{2i}, ..., r_{ni}]\) is composed of the ratings that \(t_i\) has received. For an item \(t_j\) that has not been rated by \(u_i\), \(r_{ij}=0\). \(sim(u_i,u_j)\) represents the similarity between users, and \(itemsim(t_i,t_j)\) is the similarity between items. Empirical analysis indicates that the Pearson correlation coefficient (Eq. 1) and cosine similarity (Eq. 2) outperform other measures for user and item similarity, respectively:

$$\begin{aligned} sim(u_i,u_j)=&\frac{\sum _{k=1}^{m}{(P_i^{(k)}-\bar{P_i})(P_j^{(k)}-\bar{P_j})}}{\sqrt{\sum _{k=1}^{m}{(P_i^{(k)}-\bar{P_i})^2}\sum _{k=1}^{m}{(P_j^{(k)}-\bar{P_j})^2}}},\end{aligned}$$
(1)
$$\begin{aligned} itemsim(t_i,t_j)=&\frac{\sum _{k=1}^n{R_i^{(k)}\cdot R_j^{(k)}}}{\sqrt{\sum _{k=1}^n{(R_i^{(k)})^2}}\sqrt{\sum _{k=1}^n{(R_j^{(k)})^2}}}. \end{aligned}$$
(2)
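As a concrete illustration, the two similarity measures can be sketched in Python. Note that Eqs. 1 and 2 as written sum over all m items (unrated entries are stored as 0), and this sketch follows that convention; the function names are ours.

```python
import math

def pearson_sim(p_i, p_j):
    """User similarity sim(u_i, u_j): Pearson correlation of two rating rows (Eq. 1)."""
    m = len(p_i)
    mi, mj = sum(p_i) / m, sum(p_j) / m
    num = sum((a - mi) * (b - mj) for a, b in zip(p_i, p_j))
    den = math.sqrt(sum((a - mi) ** 2 for a in p_i) * sum((b - mj) ** 2 for b in p_j))
    return num / den if den else 0.0

def cosine_itemsim(r_i, r_j):
    """Item similarity itemsim(t_i, t_j): cosine of two rating columns (Eq. 2)."""
    num = sum(a * b for a, b in zip(r_i, r_j))
    den = math.sqrt(sum(a * a for a in r_i)) * math.sqrt(sum(b * b for b in r_j))
    return num / den if den else 0.0
```

In practice one would restrict the Pearson sums to co-rated items; the sketch above mirrors the equations literally.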

2.2 Differential Privacy

Definition 1

( \(\epsilon \) -Differential Privacy). A randomized function M satisfies \(\epsilon \)-differential privacy if, for all neighboring data sets D and \(D'\) and for all outputs t of M, the following holds:

$$\begin{aligned} \frac{\Pr (M(D)=t)}{\Pr (M(D')=t)}\le \exp (\epsilon ),\ where\ \exp (\epsilon )\triangleq e^{\epsilon }. \end{aligned}$$

Two data sets D and \(D'\) are said to be neighbors if they are different in at most one item. \(\epsilon \) is the privacy protection parameter which controls the amount of distinction induced by two neighboring data sets. A smaller value of \(\epsilon \) ensures a stronger privacy guarantee.

Two main mechanisms are usually used to achieve DP: the Laplace mechanism [7] and the Exponential mechanism [8]. The Laplace mechanism is only suitable for numeric outputs, while the Exponential mechanism is designed to address the privacy issues in scenarios of entity outputs.

Definition 2

(Exponential Mechanism). Let q(r) be a score function over data set D that assigns a score to each candidate output \(r\in D\). The mechanism M satisfies \(\epsilon \)-differential privacy if M(D) outputs r with probability:

$$\begin{aligned} \Pr (M(D)=r)=\frac{\exp (\epsilon q(r)/(2\varDelta q))}{\sum _{r\in D}\exp ({\epsilon q(r)}/(2\varDelta q))}, \end{aligned}$$

where \(\varDelta q\) is the global sensitivity of q.
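A minimal sketch of the Exponential mechanism in Definition 2 follows; the max-shift of the scores is a standard numerical-stability trick that leaves the sampling probabilities unchanged, and the function signature is ours.

```python
import math
import random

def exponential_mechanism(candidates, score, epsilon, sensitivity=1.0, rng=random):
    """Sample one candidate r with Pr proportional to exp(eps * q(r) / (2 * Delta q))."""
    scores = [score(r) for r in candidates]
    top = max(scores)  # shift scores for numerical stability; ratios are unchanged
    weights = [math.exp(epsilon * (s - top) / (2.0 * sensitivity)) for s in scores]
    x = rng.random() * sum(weights)
    for r, w in zip(candidates, weights):
        x -= w
        if x <= 0:
            return r
    return candidates[-1]  # guard against floating-point round-off
```

With a large \(\epsilon \) the highest-scoring candidate is chosen almost deterministically; as \(\epsilon \to 0\) the choice approaches uniform, which is exactly the privacy/utility trade-off the parameter controls.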

Definition 3

(Global Sensitivity). The global sensitivity \(\varDelta q\) of a function q is the maximum absolute difference obtained on the output over all neighboring data sets:

$$\begin{aligned} \varDelta q=\mathop {\max }_{D\sim D'}|q(D)-q(D')|. \end{aligned}$$

Definition 4

(Composition Property [8]). The sequential application of mechanisms \({M_i}\), each giving \(\epsilon _i\)-differential privacy, gives \(\sum {\epsilon _i}\)-differential privacy.

2.3 kNN Attack

The kNN attack assumes that the attacker knows some of the elements in \(P_t\) (i.e., a partial rating history of \(u_t\)), which may be obtained by methods such as social engineering. The attacker then aims to infer the remaining items that \(u_t\) has rated.

The process of the kNN attack can be summarized as follows. The attacker creates k fake users (i.e., sybils) and assigns the known partial rating history of \(u_t\) to each sybil. Then, with high probability, the k most similar users of each sybil will be \(u_t\) and the other \(k-1\) sybils. The attacker inspects the list of items recommended to any of the sybils: since recommended items are ones the sybil has not rated itself, any item in the list must have been rated by \(u_t\).
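The steps above can be sketched as a toy simulation on hypothetical data; the profiles, item count, and neighborhood size below are invented purely for illustration.

```python
import math

def cos(u, v):
    """Cosine similarity between two rating vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

n_items, k = 12, 3
target = [5] * 8 + [0] * 4               # u_t rated items 0..7
probe  = [5] * 4 + [0] * 8               # a sybil holding only u_t's known items 0..3
others = [[0] * 4 + [3] * 8, [1] * 12]   # unrelated genuine users
sybils = [probe[:] for _ in range(k - 1)]  # the remaining k-1 sybils

users = [target] + others + sybils
# the k nearest neighbors of the attacking sybil
nbrs = sorted(range(len(users)), key=lambda i: cos(probe, users[i]), reverse=True)[:k]
# items recommended to the sybil: unrated by it, rated by its neighbors
inferred = {j for i in nbrs for j in range(n_items) if probe[j] == 0 and users[i][j] > 0}
```

In this toy run the neighborhood contains the target plus the other sybils, and `inferred` recovers exactly the target's hidden items, which is the leakage the attack exploits.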

2.4 Related Work

Previous DP-based PPCF solutions utilize DP mechanisms in various ways. McSherry and Mironov [4] proposed the first DP-based PPCF mechanism, which adds Laplace noise to the covariance matrix. PNCF [9] adopts the Exponential mechanism to select neighbors and adds Laplace noise to the computation results. Zhu and Sun [10] proposed a mechanism based on the Exponential mechanism and designed a low-sensitivity metric to measure similarity.

Another way to apply DP to PPCF is to create perturbed rating histories, e.g., the AlterEgo in D2P [6]. In D2P, an AlterEgo remains unchanged once created, and the perturbation mechanism depends heavily on subjectively chosen parameters. Unlike D2P, our proposal DynaEgo builds on the Exponential mechanism of DP and employs social relationships to dynamically modify the rating histories.

3 DynaEgo Recommender

3.1 System Model

In this paper, we consider a general user-based CF recommender system consisting of two parties: the service provider and the users. The service provider is trusted and stores users’ rating histories and social relationship data. There may be malicious users in the system who aim to infer other users’ rating histories by conducting the kNN attack. We denote by \(u_a\) the active user, i.e., the user currently receiving recommendations. The service provider leverages the rating histories of the k most similar users to suggest items to \(u_a\).

As Fig. 1 shows, DynaEgo relies on DyEgos, adaptively generated imitations of rating histories, to make recommendations. DynaEgo operates in three phases: the Grouping phase, the Modification phase, and the Recommendation phase. The Recommendation phase is the same as the traditional CF algorithm, except that it uses DyEgos rather than the exact rating histories. Thus we only describe the first two phases in detail below.

Fig. 1. The DynaEgo recommender scheme.

3.2 Grouping Phase

The Grouping phase consists of two steps. We select s users as the SelectedUsers at first, then create a RatingPool for each user in SelectedUsers.

SelectedUsers are the s users whose rating histories are most similar to \(u_a\)’s, where \(s>k\) (k is the neighborhood size). To reduce computation cost, only the rating histories of the SelectedUsers are substituted. We then create a RatingPool for each \(u_i\) in SelectedUsers: the Exponential mechanism of DP randomly selects h users for \(u_i\), and their ratings are added to the RatingPool of \(u_i\). Specifically, \(sim(\cdot )\) acts as the score function, with \(\varDelta sim=1\).

Algorithm 1 illustrates the process of creating the RatingPool for a user \(u_t\) in SelectedUsers.

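The Grouping step described above can be sketched as follows. Splitting \(\epsilon _1\) evenly across the h draws and drawing without replacement are our assumptions for this sketch, not details stated in the text; the function and variable names are ours.

```python
import math
import random

def build_rating_pool(target, users, sims, h, eps1, rng=random):
    """Sketch of the Grouping step: draw h users for `target` via the
    Exponential mechanism with sim(.) as the score (Delta sim = 1),
    then pool their ratings.  eps1 is split evenly over the h draws
    (our assumption)."""
    pool = []
    remaining = [u for u in users if u != target]
    for _ in range(h):
        weights = [math.exp((eps1 / h) * sims[target][u] / 2.0) for u in remaining]
        x = rng.random() * sum(weights)
        chosen = remaining[-1]  # fallback for floating-point round-off
        for u, w in zip(remaining, weights):
            x -= w
            if x <= 0:
                chosen = u
                break
        pool.append(chosen)
        remaining.remove(chosen)  # draw without replacement
    return pool
```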

3.3 Modification Phase

Our scheme uses the RatingPools obtained from the Grouping phase to create DyEgos via the Exponential mechanism. For users not in SelectedUsers, the DyEgo is simply the real rating history. Algorithm 2 shows the process of creating the DyEgo for \(u_t\). Social information is used to design the score function: users with more common friends are more likely to have similar tastes, so we use the fraction of common friends (Eq. 3) to measure the trust relationship.

$$\begin{aligned} \small trust(u_i,u_j)=\frac{|F(u_i)\cap F(u_j)|}{|F(u_i)\cup F(u_j)|}. \end{aligned}$$
(3)

Here, \(F(\cdot )\) returns the friend list of a user. \(|F(u_i)\cap F(u_j)|\) is the number of common friends of \(u_i\) and \(u_j\), and \(|F(u_i)\cup F(u_j)|\) is the total number of distinct friends of \(u_i\) and \(u_j\).


According to Algorithm 2, we set the score function as \(\alpha \cdot trust(u_i,u_y) + (1-\alpha )\cdot itemsim(t_j,t_x)\), where \(\alpha \in (0,1)\) is the parameter indicating the weight of social information in perturbations.
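A minimal sketch of the trust measure (Eq. 3) and the score function above; the function names are ours.

```python
def trust(f_i, f_j):
    """Eq. 3: fraction of common friends (Jaccard coefficient of friend lists)."""
    f_i, f_j = set(f_i), set(f_j)
    union = f_i | f_j
    return len(f_i & f_j) / len(union) if union else 0.0

def substitution_score(alpha, trust_val, item_sim):
    """Score function of the Modification phase: a convex combination of
    social trust and item similarity, weighted by alpha in (0, 1)."""
    return alpha * trust_val + (1 - alpha) * item_sim
```

Since both \(trust(\cdot )\) and \(itemsim(\cdot )\) lie in [0, 1], the combined score also lies in [0, 1], which keeps the sensitivity of the score function bounded by 1.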

3.4 Theoretical Analysis

Theorem 1

The recommendation mechanism M based on DynaEgo satisfies \((\epsilon _1+\epsilon _2)\)-differential privacy, where \(\epsilon _1\) and \(\epsilon _2\) are privacy parameters in Grouping phase and Modification phase, respectively.

First, we present lemmas and corollaries needed to prove Theorem 1.

Lemma 1

We denote \(\textit{SUB}(i,j)\) the event of substituting element i with j in a user’s rating history. For three arbitrary elements i, j, and k, where \(i \ne j\), the following inequality holds:

$$\begin{aligned} \frac{\Pr (\textit{SUB}(i,k))}{\Pr (\textit{SUB}(j,k))}\le \exp (\epsilon _2), \end{aligned}$$

where \(\epsilon _2\) is the privacy parameter in Modification phase.

The correctness of Lemma 1 can be easily derived from Definition 2 and Algorithm 2.

Lemma 2

Let \(P_i\) denote the original rating history of user \(u_i\), and let \(P'_i\) be a neighbor of \(P_i\). PS is the privacy-preserving mechanism that creates the DyEgo \(P_{iE}\) of \(u_i\). Then we have:

$$\begin{aligned} \frac{\Pr (PS(P_i,P_{iE}))}{\Pr (PS(P_i',P_{iE}))}\le \exp (\epsilon _2). \end{aligned}$$

Proof

Let \(P_i=[r_{i1}, r_{i2}, ..., r_{ik}, ..., r_{im}]\). Without loss of generality, assume that \(P_i\) and \(P_i'\) differ only in the first element, i.e., \(P_i'=[r'_{i1}, r_{i2}, ..., r_{ik}, ..., r_{im}]\). Let \(r_{ik}^\pi \) denote the substituted version of the rating \(r_{ik}\) in \(P_{iE}\).

Based on the fact that every element is replaced independently, we get:

$$\begin{aligned} \Pr (PS(P_i,P_{iE}))=\prod \nolimits _{k=1}^{m}{\Pr (\textit{SUB}(r_{ik},r_{ik}^\pi ))}, \end{aligned}$$
(4)
$$\begin{aligned} \Pr (PS(P_i',P_{iE}))=\Pr (\textit{SUB}(r_{i1}',r_{i1}^\pi ))\cdot \prod \nolimits _{k=2}^{m}{\Pr (\textit{SUB}(r_{ik},r^\pi _{ik}))}. \end{aligned}$$
(5)

Now, from Eqs. 4 and 5, we have:

$$\begin{aligned} \frac{\Pr (PS(P_i,P_{iE}))}{\Pr (PS(P_i',P_{iE}))}=\frac{\Pr (\textit{SUB}(r_{i1},r_{i1}^\pi ))\cdot \prod _{k=2}^{m}{\Pr (\textit{SUB}(r_{ik},r^\pi _{ik}))}}{\Pr (\textit{SUB}(r'_{i1},r_{i1}^\pi ))\cdot \prod _{k=2}^{m}{\Pr (\textit{SUB}(r_{ik},r^\pi _{ik}))}}. \end{aligned}$$

So, according to Lemma 1, we can get:

$$\begin{aligned} \frac{\Pr (PS(P_i,P_{iE}))}{\Pr (PS(P_i',P_{iE}))}=\frac{\Pr (\textit{SUB}(r_{i1},r_{i1}^\pi ))}{\Pr (\textit{SUB}(r'_{i1},r_{i1}^\pi ))}\le \exp (\epsilon _2). \end{aligned}$$

   \(\Box \)

Corollary 1

For any two neighboring original rating histories data sets D, \(D'\), and an arbitrary DyEgo data set \(D^r\), we denote \(DS(D,D^r)\) as the function of substituting \(D^r\) for D. Then, we have:

$$\begin{aligned} \frac{\Pr (DS(D,D^r))}{\Pr (DS(D',D^r))}\le \exp (\epsilon _1+\epsilon _2). \end{aligned}$$

Proof

The process of building the DyEgo set is composed of two phases: creating the RatingPool and the Modification phase. According to Lemma 2, the Modification phase satisfies \(\epsilon _2\)-differential privacy. Since the RatingPool is created with the Exponential mechanism under budget \(\epsilon _1\), that step satisfies \(\epsilon _1\)-differential privacy, and the construction of the DyEgo data set satisfies \((\epsilon _1+\epsilon _2)\)-differential privacy by the composition property of DP.   \(\Box \)

Proof of Theorem 1. Let M be the mechanism based on DynaEgo, and let \(M'\) be an arbitrary recommendation algorithm. We denote by Rec the recommendations given to \(u_a\). Theorem 1 can then be rewritten as:

$$\begin{aligned} \frac{\Pr (M(D,u_a)=Rec)}{\Pr (M(D',u_a)=Rec)}\le \exp (\epsilon _1+\epsilon _2). \end{aligned}$$

In addition, we have:

$$\begin{aligned} \Pr (M(D,u_a)=Rec)=\sum \nolimits _{D^r}{\Pr (DS(D,D^r))\Pr (M'(D^r,u_a)=Rec)}, \end{aligned}$$
(6)
$$\begin{aligned} \Pr (M(D',u_a)=Rec)=\sum \nolimits _{D^r}{\Pr (DS(D',D^r))\Pr (M'(D^r,u_a)=Rec)}. \end{aligned}$$
(7)

From Eqs. 6, 7, and Corollary 1, we can get:

$$\begin{aligned} \Pr (M(D,u_a)=Rec)&\le \sum \nolimits _{D^r}{\exp (\epsilon _1+\epsilon _2)\Pr (DS(D',D^r))\Pr (M'(D^r,u_a)=Rec)}\\&=\exp (\epsilon _1+\epsilon _2)\Pr (M(D',u_a)=Rec). \end{aligned}$$

Therefore, we can conclude that M satisfies \((\epsilon _1+\epsilon _2)\)-differential privacy.   \(\Box \)

4 Experimental Evaluation

4.1 Setup

Data Set. We evaluate DynaEgo on the Epinions data set. The Epinions data set consists of ratings given by 49,290 users to 139,738 items, with each rating ranging from 1 to 5; it also provides social relationship data. In this paper, we use a subset with ratings given by 100 users to 500 items, together with the corresponding social relationship data, which has the same size as the data set used in [6].

Parameter Selection. The performance of PPCF varies with the privacy parameter \(\epsilon \). In our experiments, we set \(\epsilon =2\). According to [6], in D2P, \(\epsilon \) is determined by three parameters: \(\lambda \), p, and \(p^*\). We set \(\lambda =0.5\), \(p=0.8\), and \(p^*=0.01\), which ensures \(\epsilon =2\). In PNCF, the parameter \(\epsilon \) can be selected directly. In our scheme, the privacy parameters \(\epsilon _1\) and \(\epsilon _2\) of the Grouping phase and the Modification phase are both set to 1, so that the overall mechanism satisfies 2-differential privacy. In addition, without loss of generality, we set the size of SelectedUsers to twice the neighborhood size, i.e., \(s=2k\), and set \(h=10\) and \(\alpha =0.2\).

4.2 Utility Evaluation

This section examines the utility of DynaEgo. We conduct experiments with different neighborhood sizes, apply the traditional user-based CF as the baseline, and compare DynaEgo with D2P and PNCF.

We measure recommendation quality in terms of classification accuracy metrics (CAM) [11]. CAM measures how often a recommender makes correct or incorrect decisions about whether an item should be recommended. In the CAM evaluation, part of \(u_a\)’s ratings are deleted, and we observe how well the recommendations match the real rating record. The recommender system selects the top-5 items as the recommendations. We compare the \(F_1\)-Score of the PPCF schemes and the non-private CF. The experiment for each mechanism is repeated 50 times to eliminate the impact of randomization.
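The \(F_1\)-Score used in this evaluation can be computed from the top-N list and the held-out items as follows (a standard definition; the helper name is ours):

```python
def f1_at_n(recommended, relevant):
    """F1 of a top-N recommendation list against the held-out items
    the user actually rated (classification accuracy metric)."""
    hits = len(set(recommended) & set(relevant))
    if hits == 0:
        return 0.0
    precision = hits / len(recommended)
    recall = hits / len(relevant)
    return 2 * precision * recall / (precision + recall)
```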

According to Fig. 2, DynaEgo keeps the \(F_1\)-Score above 0.6, which indicates that DynaEgo maintains a high recommendation quality. Specifically, when \(\#neighbor=15\), DynaEgo achieves an \(F_1\)-Score of 0.628, outperforming D2P (0.462) and PNCF (0.508) by 36% and 24%, respectively.

4.3 Privacy Evaluation

We use the SuccessRate of the kNN attack as the metric, i.e., the percentage of inferences that are correct (items that \(u_t\) has actually rated).

We consider a recommender system configured with 10 neighbors. The attacker creates 10 sybils, each of whom is assigned part of \(u_t\)’s rating history (i.e., the auxiliary knowledge). Ideally, each sybil’s neighborhood consists of \(u_t\) plus 9 other sybils. However, in PPCF schemes, a sybil cannot obtain the expected neighbors due to the introduced perturbation.

Fig. 2. Utility comparison between D2P, PNCF, and DynaEgo.

Fig. 3. Resistance of D2P, PNCF, and DynaEgo to the kNN attack.

In Fig. 3, each point represents the average SuccessRate over 50 repeated experiments. It may seem strange at first sight that the SuccessRate decreases as the auxiliary information increases. In fact, in PPCF schemes, more auxiliary knowledge increases the probability that other sybils, rather than the target user, appear among the neighbors. Since a sybil’s rating history carries no information about \(u_t\)’s remaining rating history, the increase of sybils among the neighbors has a negative effect on the inferences.

For the non-private scheme, the SuccessRate of the kNN attack is 1.0, which indicates that the attack always succeeds. In PPCF schemes, the SuccessRate falls significantly. Compared with the other PPCF schemes, DynaEgo has a lower SuccessRate regardless of the percentage of auxiliary knowledge held by the attacker.

4.4 Effect of Social Relationship Weight (\(\alpha \))

We vary the parameter \(\alpha \) from 0.1 to 0.9 to observe the change in \(F_1\)-Score and SuccessRate, with the configuration of 80% auxiliary items and 15 neighbors. The larger \(\alpha \) is, the greater the impact of the social relationship.

Table 1. Effect of \(\alpha \) on privacy and utility

Table 1 indicates that when \(\alpha =0.5\), our scheme offers the best resistance to the kNN attack, at the cost of a correspondingly lower \(F_1\)-Score. The privacy protection degrades as \(\alpha \) approaches 0.1 or 0.9.

5 Conclusions

In this paper, we have proposed a social-aware PPCF scheme called DynaEgo. The main idea of DynaEgo is to substitute users’ rating histories adaptively using the Exponential mechanism of DP, with social relationships used to design the score function. To evaluate our scheme, we have conducted experiments on the Epinions data set. The results show that DynaEgo outperforms existing solutions in both utility and privacy protection. Since DynaEgo is independent of any specific recommendation algorithm, it can be applied to any recommender system based on users’ rating histories.