Fast Deanonymization of Social Networks with Structural Information
Abstract
Ever since social networks became the focus of a great number of studies, the privacy risks of published network data have raised considerable concerns. To evaluate users’ privacy risks, researchers have developed methods to deanonymize networks and identify the same person across different networks. However, the existing solutions either require high-quality seed mappings for a cold start, or exhibit low accuracy without fully exploiting the structural information while entailing high computation expense. In this paper, we propose a fast and effective seedless network deanonymization approach relying solely on structural information, named RoleMatch. RoleMatch is equipped with a new pairwise node similarity measure and an efficient node matching algorithm. By testing RoleMatch on both real and synthesized social networks, anonymized by several popular anonymization algorithms, we demonstrate that RoleMatch achieves superior performance compared with existing deanonymization algorithms.
Keywords
Social network · Deanonymization · Privacy risk · Node similarity
1 Introduction
Online social networks have been a popular and important topic for many years, and many successful companies (e.g., Facebook, Twitter and Tencent) have emerged to provide social network services. Users in such social networks are represented as nodes with multiple attributes, including name, gender, interests, location, etc., and the interactions between users are abstracted as unidirectional or bidirectional edges between the user nodes. In other words, an online social network can be modeled as a network (or graph) carrying the necessary information about user relations.
Considering the fact that a social network truly represents relationships in human society, it has drawn much attention from researchers and advertisers. To satisfy the need for analysis, social network companies provide services for sharing network information. Nevertheless, user privacy can be breached in the process of sharing more and more information for analysis. A well-known problem is identity disclosure, where the real identities of nodes in the social networks are revealed [4, 14]. Therefore, a network has to go through anonymization before being published, and many anonymization approaches have been developed, such as edge sparsification and edge perturbation [3]. The sparsification approach deletes edges in the network randomly, and the perturbation approach randomly deletes edges and adds back the same number of edges.
The existing solutions to the deanonymization problem can be classified into two major types. The first deanonymizes the network based on seed mappings, which are propagated across the network to match other nodes [10, 17]. This approach depends heavily not only on the quality of the seed mappings, but also on their number. However, because the original networks are usually highly private, it is difficult to collect a set of high-quality seed mappings. As a result, some studies aim to find satisfactory seed mappings [16]. The second type directly processes the networks without seed mappings and can still obtain satisfactory matching results [4]. This approach makes use of node signatures (e.g., degrees, subgraphs) or structural features (e.g., node similarities, descriptive information) to deanonymize networks. Without involving seed mappings, this type of solution is more general and easy to set up for deanonymization. However, because of the expensive computation of node similarity (or other descriptive features), these solutions suffer from poor efficiency.
In this paper, we develop a fast seedless deanonymization approach called RoleMatch. RoleMatch consists of two phases: node similarity computation and node matching. For the node similarity computation phase, we propose a new similarity measure, named RoleSim++, which is extended from RoleSim [9]. To improve the precision of similarity estimation, RoleSim++ fully exploits the structural information by aggregating the similarities of both incoming and outgoing neighbors. Furthermore, based on the observation that correct node mappings tend to have high node similarity, we develop an efficient iterative algorithm, \( \alpha \)-RoleSim++, which prunes node pairs with low similarity. For the node matching phase, we introduce a new matching algorithm, named NeighborMatch, which takes advantage of both node similarities and the structural information of neighborhood matches to efficiently obtain high-quality deanonymization results.
In addition, previous works only study global deanonymization, in which the anonymized network and the crawled network are of similar size with some overlapping subnetwork. In this paper, we further study local deanonymization. In this new setting, the crawled network is much smaller than the anonymized network and basically corresponds to a subnetwork of it. Local deanonymization is much closer to real-world applications, because the network crawled from certain initial node sets is usually much smaller than the anonymized network. The detailed definitions of the problems are presented in Sect. 2.

We propose an efficient and seedless approach, RoleMatch, for deanonymization.

We propose a new node similarity measure, RoleSim++, which fully exploits the structural information and improves the deanonymization performance.

We develop an efficient iterative algorithm to compute RoleSim++, and introduce a fast node matching algorithm by utilizing both the node similarity and neighborhood structural information. The two algorithms effectively reduce the computation cost of RoleMatch.

We study both global and local deanonymizations and conduct comprehensive experiments to demonstrate the effectiveness and efficiency of the RoleMatch algorithm on real datasets.
The remainder of this paper is organized as follows. In Sect. 2 we present the definition of the general deanonymization problem. Then we give an overview of the RoleMatch algorithm in Sect. 3, followed by the elaboration of the node similarity and node matching algorithms used by RoleMatch in Sects. 4–6. The experimental results are presented in Sect. 7. Finally, in Sects. 8 and 9, we review related work and conclude the paper.
2 Preliminaries
We introduce the formal definition of the deanonymization problem and the related metrics for evaluating its complexity. Then, we describe two variants of the deanonymization problem: global deanonymization and local deanonymization. Finally, we briefly review the two types of deanonymization algorithms.
2.1 The Definition of Deanonymization Problem
Frequently used notations
Notation  Description 

\( G=(V, E)\)  A network G with node set V and edge set E. We use \( G_1\) to represent the crawled network with real topology and \( G_2\) to represent the anonymized network 
\( N_i^{{\mathrm{out}}}(u)\)  Outgoing neighbor set of node u in network \( G_i\) 
\( N_i^{{\mathrm{in}}}(u)\)  Incoming neighbor set of node u in network \( G_i\) 
\( {\mathrm{Sim}}(u, v)\)  Similarity score of the two nodes u and v 
\( {\mathrm{Sim}}^k(u, v)\)  Similarity score of the two nodes u and v after the kth iteration 
\( {\varDelta }^{{\mathrm{out}}}(u, v)\) (\( {\varDelta }^{{\mathrm{in}}}(u, v)\))  The larger one between the number of u’s outgoing (incoming) neighbors and the number of v’s outgoing (incoming) neighbors 
\( M^{{\mathrm{out}}}(u,v)\) (\( M^{{\mathrm{in}}}(u,v)\))  Node matching between u’s outgoing (incoming) neighbors and v’s outgoing (incoming) neighbors 
\( {\varGamma }^{{\mathrm{out}}}(u, v)\) (\( {\varGamma }^{{\mathrm{in}}}(u,v)\))  The maximum outgoing (incoming) similarity score over all possible matchings between \( N_1(u)\) and \( N_2(v)\) 
\( \alpha \)  Parameter to prune unnecessary computations in node similarity computation 
\( \beta \)  Decay factor for node similarity RoleSim++ 
\( \delta \)  Parameter for the degree of anonymization 
Definition 1
(Deanonymization problem) Given two directed networks \( G_1=(V_1, E_1)\) and \( G_2=(V_2, E_2)\), where \( G_1\) is a network crawled from the original network and \( G_2\) is an anonymized network, and assuming that there exist subnetworks \( G_c \subset G_1\) and \( G^{\prime }_c \subset G_2\) such that \( G_c.V=G^{\prime }_c.V\), the deanonymization \( {\mathcal {D}}(G_1, G_2)\) is the process of matching as many nodes as possible between \( G_c\) and \( G^{\prime }_c\).
For simplicity, we use \( G_1\) to represent the crawled network and \( G_2\) to represent the anonymized network in the rest of this paper. Furthermore, \( G_c\) represents the overlap between the crawled network and the anonymized network, and we call it the overlap network.
To measure the difficulty of deanonymization, we define the noise of an anonymized network.
Definition 2
(Noise) In a problem \( {\mathcal {D}}(G_1, G_2)\), noise is the set of nodes in the networks that do not belong to the overlap network \( G_c\), i.e., \( (V_1 \cup V_2) {\setminus } V_c.\) To quantify the noise, we introduce an overlap rate \( \lambda = \frac{\left| V_c\right| }{\left| V_1 \cup V_2\right| };\) then, the noise ratio is \( 1-\lambda. \)
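As a concrete illustration of Definition 2, the overlap rate and noise ratio can be computed directly from node sets. The following sketch uses made-up node IDs; all names are illustrative:

```python
def overlap_rate(v1, v2, vc):
    """lambda = |V_c| / |V_1 union V_2|; the noise ratio is 1 - lambda."""
    return len(vc) / len(v1 | v2)

V1 = {1, 2, 3, 4, 5}   # nodes of the crawled network G1
V2 = {3, 4, 5, 6, 7}   # nodes of the anonymized network G2
Vc = V1 & V2           # overlap network nodes: {3, 4, 5}

lam = overlap_rate(V1, V2, Vc)
print(lam, 1 - lam)    # overlap rate 3/7 and noise ratio 4/7
```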
2.2 Global and Local Deanonymization
Depending on the real-world scenario of the deanonymization activity, there are two variants of deanonymization: global deanonymization and local deanonymization.
To improve the precision of deanonymization, we would like to crawl a network from the original network that is as large as the anonymized one. In this case, the noise is relatively low and has almost no negative impact on deanonymization, so the deanonymization becomes easy. We define this kind of deanonymization as global deanonymization below.
Definition 3
(Global Deanonymization) Global deanonymization is the deanonymization situation where the crawled network and anonymized network are similar in size, i.e., \( G_1 \approx G_2\).
Next, in many applications (e.g., community leader detection, influential person identification), we are only interested in the information of a subset of the nodes in the network, so attackers only need to deanonymize the subnetwork containing the nodes of interest. In such cases, we only crawl nodes that are near our targets and build a subnetwork as the crawled network for deanonymization. We define this problem as local deanonymization below.
Definition 4
(Local Deanonymization) Local deanonymization is the deanonymization situation where the crawled network is far smaller than the anonymized one, i.e., \( G_1 \ll G_2\).
Local deanonymization brings the benefit of saving considerable time and space for attackers, but it also poses challenges for deanonymization algorithms because of the high noise ratio. To be more specific, since the overlapping nodes take up only a small percentage of the anonymized network in this case, all the remaining nodes in the anonymized network become noise, which makes it difficult to figure out which nodes are actually matched. Previous research rarely focused on subnetwork attacks, and therefore, the state-of-the-art deanonymization algorithms do not perform well on the local deanonymization problem. Our solution addresses this drawback.
2.3 Seed-Based and Seedless Deanonymization Algorithms
As briefly mentioned in the Introduction, existing deanonymization algorithms can be divided into two categories: seed-based and seedless. A seed-based algorithm requires high-quality seed mappings and extends them to find more mappings; its performance is sensitive to the number of accurate seed mappings [10]. A seedless algorithm uses only the properties of the network to deanonymize it. Considering the difficulty of obtaining a set of seed mappings, the seedless approach has the advantage of simplicity. However, as far as we know, no single algorithm can handle both situations, with and without seed mappings.
In this paper, we mainly discuss the seedless situation and propose a deanonymization algorithm relying only on the structural information of the networks. We will also show that our new solution is able to handle cases with seed mappings, with improved performance compared to classical seed-based algorithms.
3 Overview of RoleMatch
RoleMatch is a fast deanonymization algorithm, and it supports deanonymization both with and without initial seed mappings. RoleMatch deanonymizes the node mappings only based on the structural information of the crawled network and anonymized network.
RoleMatch takes two networks \( G_1\) and \( G_2\) as inputs, and it can accept initial seed mappings if provided. After initializing a similarity matrix score, it iteratively computes all pairwise node similarities according to the structural information. A higher similarity score indicates a higher probability of being a correct node mapping. To improve the effectiveness of the deanonymization algorithm, we propose a new similarity measure, called RoleSim++, which will be introduced in Sect. 4. The new measure captures the information of both outgoing and incoming neighbors, reflecting the structural similarity between a pair of nodes. RoleSim++ is computed iteratively. During each iteration, the score of a pair of nodes is aggregated from the similarities of the maximum matching between their neighbor pairs. To reduce the computation cost of RoleSim++, we also develop a threshold-based variant, called \( \alpha \)-RoleSim++.
Then, based on the similarity scores calculated in the previous stage, RoleMatch calls the function findNodeMatching to generate the final node mappings. In this function, we apply a matching algorithm called NeighborMatch, which combines the node similarity and neighborhood feedback. Refer to Sect. 6 for the details of the NeighborMatch approach.
Furthermore, it is easy for RoleMatch to accept initial seed mappings, because RoleMatch computes similarity purely based on structural information, and seed mappings just provide explicit structural information. The only differences between the seedless and seed-based versions of RoleMatch are that, if seed mappings are provided, the similarity scores of all seed pairs remain one throughout the iterations of node similarity computation, and during the node matching phase, the seed pairs are matched ahead of other nodes.
In summary, RoleMatch is a deanonymization approach based on network structure, and it works correctly whether or not high-quality seed mappings are provided. Due to the minor influence of the seeds, in the following discussions we mainly focus on the seedless version of RoleMatch.
4 RoleSim++: A New Node Similarity Measure
In this section, we introduce the details of the new node similarity metric, RoleSim++. First, we give the definition of RoleSim++ and its properties. Then we propose an efficient algorithm to compute the new measure.
4.1 Definition of RoleSim++
Many similarity measures for node pairs have been proposed; however, they cannot be applied to the deanonymization problem directly. For instance, SimRank [7] is a popular similarity measure based on network structure, but it is designed for a single network only. Another popular measure [4] is computed based on neighborhood similarity. It uses unnormalized values during iteration, so the similarity scores of small-degree and large-degree nodes are skewed: after several iterations, the scores of node pairs with small degrees drop to almost zero and contribute nothing to deanonymization. In addition, most previous similarity measures between two networks mainly address undirected networks, while directed ones are neglected.
To improve the effectiveness of the deanonymization algorithm, we pay close attention to the structural information, including edge directions. As a result, we propose a new similarity measure called RoleSim++. RoleSim++ extends RoleSim [9] in two aspects: (1) RoleSim++ can model the similarity between two networks, and (2) RoleSim++ utilizes the direction information of both incoming and outgoing edges.
Now the formal definition of RoleSim++ is given below.
Definition 5
(RoleSim++) Given a crawled network \( G_1\) and an anonymized network \( G_2\), the RoleSim++ similarity of nodes \( u \in V_1\) and \( v \in V_2\) is defined as \( {{\mathrm{Sim}}}(u,v) = (1-\beta )\frac{{\varGamma }^{{\mathrm{out}}}(u, v) + {\varGamma }^{{\mathrm{in}}}(u, v)}{{\varDelta }^{{\mathrm{out}}}(u, v) + {\varDelta }^{{\mathrm{in}}}(u, v)} + \beta \) (1), where \( \beta \) is the decay factor.
According to the definition of \( {\varGamma }^{{\mathrm{out}}}(u,v)\) and \( {\varGamma }^{{\mathrm{in}}}(u,v)\), we can easily infer that RoleSim++ can be calculated iteratively. In this paper, we initialize the score matrix as an all-one matrix, i.e., \( {{\mathrm{Sim}}}(u,v) = 1\) for all node pairs.
4.2 Properties of RoleSim++
To show that RoleSim++ is a valid similarity measure, we prove that RoleSim++ converges and that its tolerance to the noise of anonymization is bounded.
First, the following lemma shows that the similarity scores are non-increasing over iterations.
Lemma 1
(Non-Increasing) Let \( {\mathrm{Sim}}^k(u, v)\) be the similarity score of (u, v) after k iterations. Then \( {\mathrm{Sim}}^{k-1}(u, v) \ge {\mathrm{Sim}}^{k}(u, v)\) for all k and all node pairs (u, v).
Proof
See “The Proof of Lemma 1” of Appendix. \( \square \)
With the non-increasing property and \( {\mathrm{Sim}}^k(u, v) \ge \beta \), the following convergence property can be derived immediately. We have:
Proposition 1
(Convergence) The similarity measure in Definition 5 converges for every pair of nodes (u, v), i.e., \( \lim _{k\rightarrow \infty } {\mathrm{Sim}}^k(u, v) = {{\mathrm{Sim}}}(u, v).\)
Next we show the impact of \( \beta \) on the convergence rate of the RoleSim++ score; the result is that the difference between \( {\mathrm{Sim}}^k(u, v)\) and \( {{\mathrm{Sim}}}(u, v)\) decreases exponentially with \( (1 - \beta )\).
Proposition 2
\( {\mathrm{Sim}}^{k}(u, v) - {{\mathrm{Sim}}}(u, v) \le (1 - \beta )^{k} \) for every pair of nodes (u, v).
Proof
See “The Proof of Proposition 2” of Appendix. \( \square \)
When computing similarity scores on real-world networks, because of the small diameter of social networks, five rounds of iteration turn out to be enough for good deanonymization accuracy. We will discuss this further in Sect. 7.
The Tolerance of RoleSim++ to the Noise of Anonymization When applying RoleSim++ to deanonymize networks, there is a lower bound on the expected similarity scores of correct matches, which shows its effectiveness for network deanonymization. The lower bound is influenced by how heavily the network is anonymized; in other words, it is related to the noise in an anonymized network. We say that \( G_1\) and \( G_2\) are \( \delta \)-anonymized if the following three conditions hold:
 1.
for each node u in \( G_1\), there exists exactly one node v in \( G_2\) such that u and v are originally the same,
 2.
u and v have at least a proportion \( (1 - \delta )\) of common neighbors,^{1} i.e., \( \frac{\left| N^{{\mathrm{out}}}(u) \cap N^{{\mathrm{out}}}(v)\right| }{{\varDelta }^{{\mathrm{out}}}(u, v)} \ge 1 - \delta \) and \( \frac{\left| N^{{\mathrm{in}}}(u) \cap N^{{\mathrm{in}}}(v)\right| }{{\varDelta }^{{\mathrm{in}}}(u, v)} \ge 1 - \delta \),
 3.
the ratio of the numbers of incoming neighbors of u and v, and the ratio of the numbers of outgoing ones, are both between \( (1 - \delta )\) and \( 1 / (1 - \delta ).\)
Through studying previous anonymization algorithms, we found that most of the commonly used network anonymization algorithms have the above three properties. For example, Sparsify anonymizes a network by deleting \( p\%\) of the edges, and therefore \( \delta \) equals \( p\%\). Perturb and Switch also have these properties.
Then the following proposition gives the estimation of the lower bound of \( {\mathrm{Sim}}^{k}(u,v)\) with parameter \( \delta \).
Proposition 3
Let u and v be a correct match, where \( G_1\) and \( G_2\) are \( \delta \)-anonymized; then \( {\mathrm{Sim}}^{k}(u,v) \ge c_k\), where \( c_k = a^k + \beta \frac{1-a^k}{1-a} \) and \( a = (1-\beta )(1-\delta )\).
Proof
See “The Proof of Proposition 3” of Appendix. \( \square \)
For example, when \( \beta \) is set to 0.15 and the parameter \( \delta \) in the anonymization algorithms is set to 10%, after five iterations the similarity score of each correctly matched node pair satisfies \( {\mathrm{Sim}}^{k}(u,v) > 0.73\). In particular, when \( G_1\) and \( G_2\) are isomorphic, which means that \( \delta = 0\) and \( a = 1 - \beta \), we have \( 1 - (1 - \beta )^k = \beta \sum ^{k-1}_{i=0}(1-\beta )^i.\)
Consequently, for each k we have \( c_k = 1\) and \( {\mathrm{Sim}}^{k}(u,v) = 1\). This is consistent with the fact that the two networks are isomorphic.
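The bound in Proposition 3 is easy to check numerically. The sketch below (with the illustrative helper name `lower_bound`) reproduces both the 10%-anonymization example and the isomorphic case:

```python
def lower_bound(beta, delta, k):
    """c_k = a^k + beta * (1 - a^k) / (1 - a), with a = (1-beta)(1-delta)."""
    a = (1 - beta) * (1 - delta)
    return a ** k + beta * (1 - a ** k) / (1 - a)

print(lower_bound(0.15, 0.10, 5))  # ~0.733, above the 0.73 quoted in the text
print(lower_bound(0.15, 0.0, 5))   # isomorphic case: c_k = 1
```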
5 Solutions of Computing RoleSim++
5.1 Basic Solution for RoleSim++
To compute the RoleSim++ similarity scores, the basic solution calculates all pairwise scores iteratively in a brute-force way. Algorithm 1 describes the procedure. First, the similarity matrix score is initialized as an all-one matrix (Line 1). In each iteration, the similarity scores of node pairs are updated according to Eq. (1). The function \( \gamma (N(u), N(v))\) (Line 4) computes the maximum matching between the neighbor pairs of u and v (i.e., \( {\varGamma }^{{\mathrm{out}}}\) and \( {\varGamma }^{{\mathrm{in}}}\)). Since the exact maximum matching for a bipartite network is computationally expensive, we adopt a greedy approximation algorithm in our implementation, as Fu et al. [4] and Jin et al. [9] did. However, the basic solution is expensive: the computation complexity of each round is at least \( {\varOmega }(|V_1||V_2|d^2),\) where d is the average node degree. The algorithm can only handle networks with thousands of nodes.
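As a rough sketch of one round of this basic solution, the fragment below implements the RoleSim++ update with the greedy neighbor-matching approximation. The function names and data layout (adjacency dictionaries, a dict of pairwise scores) are our own illustration, not the paper's implementation:

```python
from itertools import product

BETA = 0.15  # decay factor

def greedy_match_score(nu, nv, sim):
    """Greedy approximation of the maximum-weight matching between the
    neighbor sets nu and nv: repeatedly take the highest-similarity
    unmatched pair (the Gamma term of the RoleSim++ update)."""
    pairs = sorted(product(nu, nv), key=lambda p: sim.get(p, 0.0), reverse=True)
    used_u, used_v, total = set(), set(), 0.0
    for a, b in pairs:
        if a not in used_u and b not in used_v:
            used_u.add(a)
            used_v.add(b)
            total += sim.get((a, b), 0.0)
    return total

def rolesim_pp_step(out1, in1, out2, in2, sim):
    """One brute-force iteration over all node pairs (u in G1, v in G2)."""
    new_sim = {}
    for u, v in product(out1, out2):
        d_out = max(len(out1[u]), len(out2[v]))
        d_in = max(len(in1[u]), len(in2[v]))
        if d_out + d_in == 0:
            new_sim[(u, v)] = 1.0  # two isolated nodes
            continue
        g = (greedy_match_score(out1[u], out2[v], sim)
             + greedy_match_score(in1[u], in2[v], sim))
        new_sim[(u, v)] = (1 - BETA) * g / (d_out + d_in) + BETA
    return new_sim
```

Starting from an all-one score matrix and repeating this step mimics the iterative computation; the greedy matching keeps each round polynomial but, as the text notes, the all-pairs loop still limits this basic variant to small networks.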
Change in similarity scores related to node \( u_1\) in Fig. 1 over five iterations
(\( u_1\), \( v_1\))  (\( u_1\), \( v_2\))  (\( u_1\), \( v_3\))  (\( u_1\), \( v_4\))  (\( u_1\), \( v_5\))  

First  0.72  0.32  0.32  0.15  0.43 
Second  0.56  0.32  0.27  0.15  0.35 
Third  0.47  0.23  0.24  0.15  0.30 
Fourth  0.41  0.22  0.23  0.15  0.28 
Fifth  0.38  0.21  0.22  0.15  0.27 
Similarity scores of each node pair in Fig. 1 after five iterations
\( v_1\)  \( v_2\)  \( v_3\)  \( v_4\)  \( v_5\)  

\( u_1\)  0.38  0.20  0.22  0.15  0.26 
\( u_2\)  0.15  0.38  0.36  0.31  0.15 
\( u_3\)  0.27  0.36  0.38  0.24  0.31 
\( u_4\)  0.15  0.26  0.26  0.34  0.15 
5.2 \( \alpha \)-RoleSim++: A Fast Solution
To scale our deanonymization approach to large networks, we design a fast solution for computing RoleSim++, called \( \alpha \)-RoleSim++. In deanonymization, each node in \( G_1\) has at most one correspondence (correct match) in \( G_2\). This means that our main concern is the node pairs with high similarity scores, which are more likely to be correct matches.
Based on this observation, we propose a heuristic rule to speed up the computation.
Heuristic 1
In each iteration, only the node pairs with high similarity are retained, and the others can be discarded.
Following the heuristic rule, we propose a new efficient computation method, \( \alpha \)-RoleSim++, which substantially reduces the computational cost while retaining accuracy. In \( \alpha \)-RoleSim++, the similarity formula is revised as follows. Let \( {\mathrm{Sim}}^{\theta }_{k}(u, v)\) denote the threshold-sieved similarity score of (u, v) in the kth iteration, where the threshold \( \theta =\theta (u, \alpha )\) depends on the parameter \( \alpha \) and the node u, and \( 0< \alpha , \theta < 1\).
Since the goal of deanonymization is to identify each node in \( G_1\), we need to keep a portion of candidates (nodes from \( G_2\)) for each node u in \( G_1\). Consequently, the threshold \( \theta \) should be related to node u in each iteration to dynamically maintain a proper candidate list. We define \( \theta \) as \( \theta (u, \alpha ) = \alpha \cdot top(u)\), where top(u) is the highest similarity score related to u in the last round; thus \( \theta \) is dynamically determined by the similarities with respect to node u.
Algorithm 2 describes the details of the \( \alpha \)-RoleSim++ computation. The main framework remains the same as Algorithm 1, with the following differences. At Line 6, the threshold \( \theta \) is decided by both the parameter \( \alpha \) and the similarity scores related to node u. Later, when visiting candidate pairs (Line 7), those with similarity scores below the threshold are filtered out, and the others are updated for the iteration (Lines 8–11).
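The dynamic threshold \( \theta (u, \alpha ) = \alpha \cdot top(u)\) amounts to keeping, for each node u, only the candidates whose score is within a factor \( \alpha \) of u's best score from the previous round. A minimal sketch (the helper name is ours; the sample row reuses the \( u_1\) scores from the similarity table above):

```python
def prune_candidates(sim_row, alpha):
    """Keep only candidates v with score >= theta(u, alpha) = alpha * top(u)."""
    if not sim_row:
        return {}
    theta = alpha * max(sim_row.values())
    return {v: s for v, s in sim_row.items() if s >= theta}

row = {'v1': 0.38, 'v2': 0.20, 'v3': 0.22, 'v4': 0.15, 'v5': 0.26}
print(prune_candidates(row, 0.6))  # theta = 0.228: only v1 and v5 survive
```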
Property 1
The threshold-sieved similarity score of each pair of nodes (u, v) is non-increasing, i.e., \( {\mathrm{Sim}}^{\theta }_{k}(u, v) \ge {\mathrm{Sim}}^{\theta }_{k+1}(u, v)\) for each k.
Property 2
The threshold-sieved similarity scores are no higher than the standard RoleSim++ scores, i.e., \( {\mathrm{Sim}}^{\theta }_{k}(u, v) \le {\mathrm{Sim}}_{k}(u, v)\) holds for all pairs of nodes.
From Property 1 we know that the iterative computation of \( \alpha \)RoleSim++ converges. The convergent similarity score of (u, v) is denoted as \( {\mathrm{Sim}}^{\theta }(u, v)\).
In addition, choosing the value of the parameter \( \alpha \) involves a trade-off between the accuracy of similarity scores and the computational cost. If \( \alpha \) is set to a relatively low value, fewer node pairs are filtered out in each iteration, resulting in higher computational cost, while the deanonymization accuracy gets closer to that of standard RoleSim++. In particular, if \( \alpha \) is set to zero, the method is exactly the same as standard RoleSim++. We study the influence of \( \alpha \) on accuracy in Sect. 7.
6 NeighborMatch: An Effective Node Matching Algorithm
In this part, we introduce our method to find a good mapping between the anonymized network and the crawled one, based on the precomputed similarity scores.
Intuitively, to find the mapping based on node similarity, the maximum weighted matching for a bipartite network is a good option. Using the Kuhn–Munkres (KM) algorithm [12], the maximum matching can be computed in \( O(n^3)\), where n is the number of nodes. Since the maximum matching is computationally expensive, it can hardly be applied to large networks. Another solution, proposed by Fu et al. [4], is a greedy algorithm, which offers an approximation of the globally optimal matching in \( O(n^2 \log n)\), with less accuracy than the KM algorithm.
However, both of the above approaches simply maximize the sum of similarity scores, and the structural information of the network is neglected during the matching phase. Actually, the links between a pair of nodes and their neighbors contain valuable information that can help us deanonymize a network with higher accuracy. We propose a new matching algorithm, NeighborMatch, based on two observations: first, correct mappings tend to have higher similarity scores, and second, a pair of nodes is more likely to be a correct mapping if their neighbors are correct mappings. More specifically, NeighborMatch assigns a priority to each pair of nodes and follows the priority order to generate matchings.
Algorithm 3 illustrates how NeighborMatch finds the node mappings. First, in Lines 4 to 9, it matches the node pairs with the highest similarity scores and increases the scores of their neighbor pairs, until some candidate pairs have at least r matched neighbors. Then it matches all the candidates in order of their scores from highest to lowest and spreads scores to their neighbors (Lines 10–13). These two steps are repeated until each node in \( G_1\) is matched with some node in \( G_2\). Taking the similarity scores in Table 3 as an example, in the first iteration, NeighborMatch first selects the matching pair \( (u_1, v_1)\), then increases the scores of \( (u_2,v_2)\), \( (u_2,v_3)\), \( (u_3,v_2)\), \( (u_3,v_3)\), \( (u_4,v_2)\) and \( (u_4,v_3)\) by 0.38, respectively, and adds them into the set A. After another four iterations, all node matchings are correctly generated.
Since NeighborMatch is a variant of percolation network matching with different seeds, the theoretical results in [10] still hold and guarantee the performance of NeighborMatch. For instance, assuming that at the beginning m pairs of nodes with the highest similarity scores are matched, and m reaches the critical value according to Theorem 1 of [10], then with high probability at least \( n-o(n)\) nodes can be deanonymized successfully, where \( n = \left| V_1 \cap V_2\right| \).
Moreover, NeighborMatch has several advantages over the original percolation network matching. Network percolation requires all candidate pairs to have at least r previously matched neighbors, so the matching process gets stuck when there are no valid candidates. Our algorithm avoids getting stuck because the similarity scores provide a natural and reasonable choice of candidate pairs, i.e., picking the pair with the highest score among all unmatched pairs. Thus, our matching algorithm is capable of matching more node pairs, even those whose degrees are less than the threshold r.
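To make the spreading idea concrete, here is a much-simplified sketch of a NeighborMatch-style procedure: a lazily updated max-heap keyed by (matched-neighbor count, similarity), where matching a pair raises the priority of its neighbor pairs. It ignores the threshold r and other details of Algorithm 3; all names are illustrative:

```python
import heapq

def neighbor_match(sim, out1, out2):
    """Greedily match the highest-priority pair; priority is the number of
    already-matched neighbor pairs, with similarity as the tie-breaker."""
    heap = [(0, -s, u, v) for (u, v), s in sim.items()]
    heapq.heapify(heap)
    bonus = {}                               # matched-neighbor counts per pair
    matched1, matched2, mapping = set(), set(), {}
    while heap:
        negp, negs, u, v = heapq.heappop(heap)
        if u in matched1 or v in matched2:
            continue
        if -negp < bonus.get((u, v), 0):     # stale entry: re-push with fresh priority
            heapq.heappush(heap, (-bonus[(u, v)], negs, u, v))
            continue
        mapping[u] = v
        matched1.add(u)
        matched2.add(v)
        for a in out1.get(u, ()):            # spread credit to neighbor pairs
            for b in out2.get(v, ()):
                bonus[(a, b)] = bonus.get((a, b), 0) + 1
    return mapping
```

In this sketch, matching the top-similarity pair first and then letting its neighbor pairs jump the queue mirrors the percolation-style spreading described above, while the similarity tie-breaker keeps the process from stalling when no pair has matched neighbors yet.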
7 Experimental Studies
In this section, we evaluate the performance of RoleMatch through extensive experiments. First, we conduct experiments to tune the parameters of RoleMatch. Then we compare the performance of the RoleSim++ measure and the NeighborMatch algorithm with the existing solutions [4, 10], respectively. Afterward, we describe the performance of RoleMatch as a whole for global and local deanonymization. Finally, we also compare RoleMatch with existing seed-based deanonymization algorithms.
7.1 Experiment Settings
All the algorithms are implemented in C++ and compiled with the -O3 option. The experiments were run on a Linux server equipped with an Intel Xeon E5620 CPU (16 cores, 2.4 GHz) and 64 GB memory. Furthermore, we used 16 threads to parallelize the computation of each iteration.
Dataset statistics
Dataset  #V  #E  Avg. degree  Diameter  Avg. clustering coefficient  Type 

LiveJournal  4,847,571  68,993,773  14.23  16  0.2742  Directed 
Twitter  81,306  1,768,149  21.75  7  0.5653  Directed 
Enron  36,692  367,662  10.02  11  0.4970  Undirected 
In addition, we follow the approach proposed in [4] to generate small networks, called synthesized datasets. The basic idea is that, given a large network G, we first randomly extract a subnetwork from G as a seed network, denoted as \( G_s = (V_s, E_s)\), and use the nodes in \( G_s\) to generate a crawled network \( G_{1} = (V_1, E_1)\) and an anonymized network \( G_{2} = (V_2, E_2)\) such that \( V_s = V_1 \cup V_2\). Recalling the definition of \( \lambda \), the overlap rate is \( \lambda = \frac{\left| V_1 \cap V_2\right| }{\left| V_s\right| }.\)
Then, we use the breadth-first search (BFS) algorithm to generate synthesized networks with an arbitrary \( \lambda \). More specifically, we use BFS to create an overlap network \( G_c = (V_c, E_c)\) from \( G_s\), where \( \left| V_c\right| = \lambda \times \left| V_s\right| \). The overlap network \( G_c\) is treated as a subnetwork of both \( G_1\) and \( G_2\). The remaining node set \( V_s{\setminus } V_c\) is then split into two parts of the same size, \( V^{\prime }_{1}\) and \( V^{\prime }_{2}\). Finally, \( V_1 = V^{\prime }_{1} \cup V_c\) and \( V_2 = V^{\prime }_{2} \cup V_c\). Furthermore, we apply a selected anonymization algorithm to the network \( G_2\) to build the anonymized network. In the following experiments, we use Syn(\( V_s\),\( \lambda \)) to represent a synthesized dataset generated from the LiveJournal dataset. For example, Syn(10,000, 50%) means a synthesized dataset created by setting \( \left| V_s\right| \) to 10,000 and the overlap \( \lambda \) to 50%.
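The construction can be sketched as follows, assuming an undirected adjacency map of the seed network for the BFS (the helper name is ours):

```python
import random
from collections import deque

def split_for_overlap(vs, adj, lam, seed=0):
    """BFS an overlap set V_c of size lam*|V_s| from a random start node,
    then split the remaining nodes evenly into V'_1 and V'_2 so that
    V_1 = V'_1 union V_c and V_2 = V'_2 union V_c."""
    rng = random.Random(seed)
    target = int(lam * len(vs))
    start = rng.choice(sorted(vs))
    vc, queue = {start}, deque([start])
    while queue and len(vc) < target:
        u = queue.popleft()
        for w in adj.get(u, ()):
            if w not in vc and len(vc) < target:
                vc.add(w)
                queue.append(w)
    rest = sorted(vs - vc)
    rng.shuffle(rest)
    half = len(rest) // 2
    return set(rest[:half]) | vc, set(rest[half:]) | vc, vc
```

An anonymization algorithm (e.g., Sparsify or Perturb, described next) would then be applied to the edges induced by \( V_2\) to produce the anonymized network.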
 1.
Naive Anonymization The naive approach simply shuffles the identifiers of nodes and leaves the structure as it is.
 2.
Sparsify(\( \delta \)) The Sparsify approach removes \( \delta \left| E\right| \) edges randomly, where the parameter \( \delta \) controls the number of deleted edges.
 3.
Perturb(\( \delta \)) The Perturb approach [3] first removes edges in exactly the same way as Sparsify does and then randomly adds false edges until the number of edges in the anonymized network is the same as in the original network. This approach can be viewed as a kind of simulation of social network evolution or “unintended” anonymization.
 4.
Switch(\( \delta \)) The Switch approach randomly selects two edges (\( i_1\), \( j_1\)) and (\( i_2\), \( j_2\)) such that (\( i_1\), \( j_2\)) and (\( i_2\), \( j_1\)) are not in the network. The selected edges are then “switched,” i.e., (\( i_1\), \( j_1\)) and (\( i_2\), \( j_2\)) are deleted, and (\( i_1\), \( j_2\)) and (\( i_2\), \( j_1\)) are added to the network. The procedure is repeated \( \delta \left| E\right| / 2\) times, so that \( \delta \left| E\right| \) edges are added and \( \delta \left| E\right| \) edges are deleted.
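The three structure-modifying schemes can be sketched as follows. This is a minimal Python sketch treating edges as ordered pairs for simplicity (function names are ours); Switch skips a draw when the candidate swap would create a duplicate edge or a self-loop:

```python
import random

def sparsify(edges, delta, rng):
    """Sparsify(delta): delete delta*|E| randomly chosen edges."""
    kept = sorted(edges)
    rng.shuffle(kept)
    return set(kept[int(delta * len(kept)):])

def perturb(nodes, edges, delta, rng):
    """Perturb(delta): sparsify, then add random false edges until the
    edge count matches the original network again."""
    result = sparsify(edges, delta, rng)
    while len(result) < len(edges):
        u, v = rng.sample(nodes, 2)  # distinct endpoints, so no self-loop
        result.add((u, v))
    return result

def switch(edges, delta, rng):
    """Switch(delta): repeat delta*|E|/2 times -- pick edges (i1, j1) and
    (i2, j2) and replace them with (i1, j2) and (i2, j1) when legal."""
    result = set(edges)
    for _ in range(int(delta * len(edges) / 2)):
        (i1, j1), (i2, j2) = rng.sample(sorted(result), 2)
        if (len({i1, j1, i2, j2}) == 4
                and (i1, j2) not in result and (i2, j1) not in result):
            result -= {(i1, j1), (i2, j2)}
            result |= {(i1, j2), (i2, j1)}
    return result
```

Note that Switch preserves every node's in- and out-degree, which is why it is the hardest of the three for degree-based attacks to exploit.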
We compare the following deanonymization algorithms:
 1.
Baseline (BaseSim and BaseMatch) We use the deanonymization algorithm of [4] as our baseline. The baseline algorithm consists of two phases: similarity computation and node matching. The similarity measure in the baseline algorithm is referred to as BaseSim, and the node matching algorithm is referred to as BaseMatch.
 2.
Seed Baseline We use the seed-based mapping algorithm proposed by Kazemi et al. [10] as our seed-based mapping baseline.
 3.
RoleMatch RoleMatch refers to the proposed algorithm, in which the two new similarity measures, RoleSim++ and \( \alpha \)-RoleSim++, are used. Moreover, NeighborMatch is used as the node matching algorithm, where the threshold r is set to 2 in the experiments.
Precision Score This is a metric for evaluating the effectiveness of deanonymization algorithms. Let M be the set of correctly matched node pairs; then, the precision score is \( \frac{\left| M\right| }{\left| V_1 \cap V_2\right| }.\) The higher the precision score is, the more correct mappings an algorithm generates.
Execution Time This is a metric for evaluating efficiency. It is the time cost of running algorithms in the experiments.
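The precision score is straightforward to compute once the true node correspondence is known (a sketch with our own argument names; `identity` maps each node of \( G_1\) to its true counterpart in \( G_2\), and `overlap` is \( V_1 \cap V_2\)):

```python
def precision_score(matching, identity, overlap):
    """Precision = |correct pairs| / |V1 ∩ V2|: the fraction of overlapping
    nodes that the deanonymizer mapped to their true counterparts."""
    correct = sum(1 for u, v in matching.items() if identity.get(u) == v)
    return correct / len(overlap)
```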
7.2 Parameter Tuning
Before evaluating the RoleMatch algorithm, two parameters need to be tuned: the number of iterations and the threshold \( \alpha \) in \( \alpha \)-RoleSim++. To showcase the parameter tuning process, we conducted the tuning experiments on Syn(10,000, 100%), where the network is anonymized by Naive Anonymization. The other anonymization algorithms can be tuned in the same way.
Threshold Parameter \( \alpha \) For computing \( \alpha \)-RoleSim++, we use the threshold parameter \( \alpha \) to limit the number of node pairs involved in the computation. The lower the parameter \( \alpha \) is, the more node pairs each iteration computes, resulting in higher time consumption. We decreased \( \alpha \) from 0.95 to 0.50 in steps of 0.05 during the tuning process. Figure 2b, c shows the precision ratio and the time cost ratio between the two algorithms, respectively. The precision ratio is defined as the ratio between the precision of \( \alpha \)-RoleSim++ and the precision of RoleSim++; similarly, the time ratio is the ratio between their execution times. From the figures, we clearly see that as \( \alpha \) increases, the time consumption drops almost linearly, while the precision is well retained when \( \alpha \le 0.85\). Therefore, we set \( \alpha \) to 0.85 in the following experiments.
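The pruning idea behind \( \alpha \)-RoleSim++ can be sketched generically. This is our sketch of the thresholding mechanism only, not the exact RoleSim++ recurrence: the hypothetical `update` callback stands in for one similarity refinement step, and pairs whose score falls below \( \alpha \) are frozen and skipped in later iterations:

```python
def alpha_iterate(pairs, update, alpha, iterations):
    """Iterative similarity with alpha-pruning: pairs start at an
    optimistic score of 1.0; once a pair's score drops below `alpha`,
    it is no longer refined, saving the work of recomputing it."""
    sim = {p: 1.0 for p in pairs}
    for _ in range(iterations):
        sim = {p: update(p, sim) if sim[p] >= alpha else sim[p]
               for p in sim}
    return sim
```

A larger \( \alpha \) freezes more pairs earlier, which is why the time cost in Fig. 2c falls almost linearly as \( \alpha \) grows.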
7.3 The Performance of RoleSim++
Effectiveness To demonstrate the effectiveness of RoleSim++ and \( \alpha \)-RoleSim++, we compare the new similarity measures with BaseSim and RoleSim. In the experiment, we use Syn(10,000, 50%) and Syn(10,000, 100%). The original networks are anonymized by the four previously mentioned anonymization algorithms, and we use top-1 precision and top-m% precision to evaluate the effectiveness.
In summary, because RoleSim++ fully exploits the structural information of a network, it improves the precision of estimating the node similarity.
Efficiency We compare the execution times of BaseSim, RoleSim++ and \( \alpha \)-RoleSim++ to verify the efficiency of \( \alpha \)-RoleSim++. Two aspects are taken into consideration: the average edge density \( \overline{d}\) of the network and the number of nodes \( \left| V\right| \) in the network. We generate synthesized datasets from LiveJournal. When varying \( \overline{d}\), \( \left| V\right| \) is fixed to 10,000 and edges are randomly sampled among the nodes in V. When varying \( \left| V\right| \), the full induced subgraph of V is used.
7.4 The Performance of NeighborMatch
NeighborMatch clearly achieves a better average precision score than BaseMatch on the RoleSim++ similarity when the Sparsify and Perturb anonymization methods are used. BaseMatch outperforms NeighborMatch under Switch anonymization, but in that case both algorithms can deanonymize more than 90% of the nodes, so the difference is subtle. When running on BaseSim, NeighborMatch performs much better than BaseMatch. NeighborMatch achieves at least 75% deanonymization precision under all the anonymization algorithms, while BaseMatch performs poorly when nontrivial anonymization is applied, with less than 50% precision under Sparsify, Switch and Perturb. The advantage of NeighborMatch comes from its use of neighborhood structural information.
7.5 Precision Comparison of Global Deanonymization
To demonstrate the effectiveness of RoleMatch for global deanonymization, we compare all three deanonymization algorithms (the baseline algorithm, RoleMatch with RoleSim++ and RoleMatch with \( \alpha \)-RoleSim++) on both real datasets (Enron and Twitter) and synthesized datasets, and use the Naive, Sparsify, Switch and Perturb anonymization algorithms to generate anonymized networks. Each experiment is run five times, and the average precision score is reported.
7.6 Precision Comparison of Local Deanonymization
In this subsection, we consider the local deanonymization case. To generate \( G_1\) and \( G_2\) with \( G_2\) much smaller than \( G_1\), we first extract a subnetwork \( G_1\) with 10,000 nodes from LiveJournal and then randomly crawl a subnetwork \( G_{0}\) from \( G_1\) with a given size. We anonymize the subnetwork \( G_{0}\) to generate an anonymized network \( G_{2}\). Here, Sparsify is used as the anonymization algorithm, with the probability of deleting an edge set to 0.1. Furthermore, we vary the overlap rate of \( G_{1}\) and \( G_{0}\) from 10 to 50%, where the overlap part is exactly \( G_0\). The results are presented in Fig. 8a.
From the figure, we can see that the precisions of all three algorithms generally increase as the overlap rate increases. Despite the common trend, there is a sharp difference between the precision of the baseline algorithm and that of the RoleMatch algorithms. The precision of the baseline algorithm is essentially always below 0.1 in this experiment; in contrast, when the overlap rate is above 20%, both RoleMatch with RoleSim++ and RoleMatch with \( \alpha \)-RoleSim++ can deanonymize 80% of the overlapping nodes. The difference in precision reveals that the RoleMatch algorithms are far more effective in local deanonymization situations, especially when the overlap rate is not too low, and they can easily deanonymize most of the overlapping nodes without crawling the whole network.
Furthermore, compared with the results of global deanonymization, the improvement achieved by RoleMatch is much larger in the local deanonymization case. This is because there is much more noise in the local case, and RoleMatch is robust to such noise, as analyzed in Sect. 4.2.
7.7 SeedBased Deanonymization
Finally, we conduct experiments to demonstrate that RoleMatch can be adapted to situations where seed mappings are provided for deanonymization.
The seed version of RoleMatch was run on the Enron dataset, with the correct seed ratio varying from 1 to 10%. The anonymization algorithm applied here is Sparsify, and we compared this seed version of RoleMatch with \( \alpha \)-RoleSim++ against the seed mapping deanonymization from [10].
8 Related Work
The work of deanonymizing networks is closely related to three topics: (1) anonymization algorithms, (2) deanonymization algorithms and (3) node similarity measures. In the following subsections, we describe the related work on each topic separately.
8.1 Anonymization Algorithms
According to a previous survey [18] on social network anonymization, anonymization algorithms fall into three categories: K-anonymity, edge randomization and clustering-based generalization.
K-anonymity [15, 21] modifies the network structure by edge deletions and additions so that each node in the modified network is indistinguishable from at least \( K-1\) other nodes in terms of some structural pattern, such as degree. This approach provides good anonymity but (1) is relatively complex to implement and (2) may modify the network structure to too large an extent. Edge randomization modifies the network via random deletions, additions or switches of edges; it protects user privacy in a probabilistic manner with simple yet effective operations. Clustering-based generalization [20] first clusters nodes into groups and then anonymizes each subnetwork into a supernode without any individual node's specific information. This approach can be effective against deanonymization; however, it loses individual information as well as scale information, which may dramatically change the results of social network analysis.
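For the degree-based instance of K-anonymity mentioned above, whether a network satisfies the property is easy to check (a sketch with our own function name, for dict-of-lists adjacency):

```python
from collections import Counter

def is_k_degree_anonymous(adj, k):
    """A network is k-degree-anonymous if every occurring degree value is
    shared by at least k nodes, so no node is singled out by its degree."""
    degree_counts = Counter(len(neighbors) for neighbors in adj.values())
    return all(count >= k for count in degree_counts.values())
```

For example, a star graph fails for k = 2 because the hub's degree is unique, while a cycle (all degrees equal) is anonymous for any k up to the node count.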
8.2 Deanonymization Algorithms
Deanonymization is the reverse process of anonymization, and researchers have studied it with a variety of methods [1, 5, 10, 11, 13, 16, 17, 19]. In practice, it often appears as part of an attack that leaks user privacy [8].
Backstrom et al. [1] proposed a family of deanonymization attacks with which an adversary can learn whether edges exist between specific targeted pairs of nodes. The weakness is that the algorithm is vulnerable when the network is modified before publishing, although it works well against naive anonymization, where node identifiers are merely shuffled.
Narayanan and Shmatikov [17] presented a framework for analyzing privacy and proposed a deanonymization algorithm. The algorithm is based on the network topology only and is relatively robust to noise and most defenses. It requires a few seed mappings and propagates them to the whole network; however, the quality of the seed mappings has a significant influence on whether the attack succeeds. Later, Narayanan et al. [16] introduced a simulated-annealing-based weighted network matching algorithm for finding good initial seed mappings for deanonymization. Yartseva and Grossglauser [19] used network percolation to propose a seed-based network matching algorithm. Kazemi et al. [10] proposed a scalable network matching algorithm that uses fewer seed mappings to match a pair of networks, at the cost of a small increase in matching errors. Korula and Lattanzi [11] applied network percolation to networks with power-law degree distributions and achieved improved performance for real-world social network matching.
Fu et al. [4] proposed a seedless algorithm for social network deanonymization. The algorithm first computes each node pair's similarity iteratively based on maximum matching and then matches nodes in decreasing order of similarity score. Our RoleMatch follows the same deanonymization framework, but applies a new similarity measure and a new node matching algorithm.
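The matching phase of this seedless framework, i.e., pairing nodes from the highest similarity score down while never reusing a node from either network, can be sketched as follows (our sketch; `similarity` maps cross-network (u, v) pairs to scores):

```python
def greedy_match(similarity):
    """Greedily match node pairs in decreasing order of similarity score;
    each node of either network is used at most once."""
    used1, used2, matching = set(), set(), {}
    for (u, v), _ in sorted(similarity.items(), key=lambda kv: -kv[1]):
        if u not in used1 and v not in used2:
            matching[u] = v
            used1.add(u)
            used2.add(v)
    return matching
```

This greedy step is where a sharper similarity measure pays off: with better-separated scores, the first pick for each node is far more likely to be the correct counterpart.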
8.3 Node Similarity Measure
Node similarity is a basic metric for network analysis. So far, many different similarity measures have been proposed.
Henderson et al. [6] proposed an algorithm that recursively combines local features with neighborhood features to produce regional features, and then used these regional features to compute node similarity for deanonymization. The regional features can effectively narrow the range of possible corresponding nodes, but there is no evidence that the most similar pairs reveal the real identities.
The well-known SimRank measure by Jeh and Widom [7] provides a node similarity measure within one network, which is inapplicable to deanonymization problems. Blondel et al. [2] proposed a cross-network node similarity measure that sums up the similarity scores of the neighbors of two nodes, much like a two-network version of SimRank. Fu et al. [4] proposed a two-network node similarity measure that iteratively matches the top-similarity neighbors of two nodes, an improvement over simply transplanting SimRank to cross-network computation. This approach works for nodes with large degrees, but since it lacks normalization, the similarity scores of small-degree nodes are usually too small to be meaningful. Jin et al. [9] proposed a normalized node similarity measure, RoleSim, for nodes within a single network. It gives a good depiction of a node's structural information, but its definition and computation method limit it to a single-network measure. Our new cross-network node similarity measure is designed on the basis of all these measures.
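The Blondel-style neighbor-sum iteration can be sketched compactly. This is a simplified sketch under our assumptions (undirected dict-of-lists adjacency, all-ones initialization, max-normalization instead of the original Frobenius-norm normalization), not the exact formulation of [2]:

```python
def cross_network_similarity(adj1, adj2, iterations=5):
    """Cross-network similarity: the score of (u, v) is refreshed as the
    sum of the previous scores of all pairs of their neighbors, then
    normalized, and the update is iterated to a (near) fixed point."""
    sim = {(u, v): 1.0 for u in adj1 for v in adj2}
    for _ in range(iterations):
        new = {(u, v): sum(sim[(x, y)] for x in adj1[u] for y in adj2[v])
               for u in adj1 for v in adj2}
        top = max(new.values()) or 1.0  # guard against an all-zero round
        sim = {pair: score / top for pair, score in new.items()}
    return sim
```

The lack of per-pair degree normalization visible here is exactly the weakness noted above: high-degree pairs accumulate large sums, while small-degree pairs stay near zero.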
9 Conclusions
Social network deanonymization is a popular approach for testing the strength of anonymization algorithms. With the help of a good deanonymization solution, we can guide companies toward designing better anonymization approaches to protect users' privacy. In this paper, we developed a fast seedless deanonymization algorithm named RoleMatch, which deanonymizes networks based only on structural information. Thanks to the new similarity measure, RoleSim++, it can compute node similarity with high precision. Moreover, during the node matching phase, RoleMatch uses neighborhood information in addition to node similarity to improve the mapping results. Comprehensive experimental results have demonstrated the advantages of RoleMatch over previous works. Beyond the algorithm itself, the performance of deanonymization is also related to the properties of the network; we will study this relationship in future work.
Acknowledgements
This research is funded by the National Natural Science Foundation of China (No. 61702015).
Compliance with Ethical Standards
Conflict of interest
The authors declare that they have no conflict of interest.
References
 1. Backstrom L, Dwork C, Kleinberg J (2007) Wherefore art thou r3579x? Anonymized social networks, hidden patterns, and structural steganography. In: WWW, pp 181–190
 2. Blondel VD, Gajardo A, Heymans M, Senellart P, Van Dooren P (2004) A measure of similarity between graph vertices: applications to synonym extraction and web searching. SIAM Rev 46(4):647–666
 3. Bonchi F, Gionis A, Tassa T (2014) Identity obfuscation in graphs through the information theoretic lens. Inf Sci 275:232–256
 4. Fu H, Zhang A, Xie X (2015) Effective social graph deanonymization based on graph structure and descriptive information. ACM Trans Intell Syst Technol 6(4):49
 5. Gulyás GG, Simon B, Imre S (2016) An efficient and robust social network deanonymization attack. In: Proceedings of the workshop on privacy in the electronic society, WPES '16, pp 1–11
 6. Henderson K, Gallagher B, Li L, Akoglu L, Eliassi-Rad T, Tong H, Faloutsos C (2011) It's who you know: graph mining using recursive structural features. In: KDD, pp 663–671
 7. Jeh G, Widom J (2002) SimRank: a measure of structural-context similarity. In: KDD, pp 538–543
 8. Ji S, Li W, Srivatsa M, He JS, Beyah R (2016) General graph data deanonymization: from mobility traces to social networks. ACM Trans Inf Syst Secur 18(4):12:1–12:29
 9. Jin R, Lee VE, Hong H (2011) Axiomatic ranking of network role similarity. In: KDD, pp 922–930
 10. Kazemi E, Hassani SH, Grossglauser M (2015) Growing a graph matching from a handful of seeds. Proc VLDB Endow 8(10):1010–1021
 11. Korula N, Lattanzi S (2014) An efficient reconciliation algorithm for social networks. Proc VLDB Endow 7(5):377–388
 12. Kuhn HW (1955) The Hungarian method for the assignment problem. Naval Res Logist Q 2(1–2):83–97
 13. Li H, Chen Q, Zhu H, Ma D (2017) Hybrid deanonymization across real-world heterogeneous social networks. In: Proceedings of the ACM Turing 50th celebration conference—China, ACM TURC '17, pp 33:1–33:7
 14. Li H, Zhu S, Du X, Liang X, Shen X (2018) Privacy leakage of location sharing in mobile social networks: attacks and defense. IEEE Trans Dependable Secur Comput 15(4):646–660
 15. Liu K, Terzi E (2008) Towards identity anonymization on graphs. In: SIGMOD, pp 93–106
 16. Narayanan A, Shi E, Rubinstein BI (2011) Link prediction by deanonymization: how we won the Kaggle social network challenge. In: IJCNN, pp 1825–1834
 17. Narayanan A, Shmatikov V (2009) De-anonymizing social networks. In: ISSP, pp 173–187
 18. Wu X, Ying X, Liu K, Chen L (2010) A survey of privacy-preservation of graphs and social networks. In: Managing and mining graph data, pp 421–453
 19. Yartseva L, Grossglauser M (2013) On the performance of percolation graph matching. In: Proceedings of the first ACM conference on online social networks, pp 119–130
 20. Zheleva E, Getoor L (2008) Preserving the privacy of sensitive relationships in graph data. In: KDD, pp 153–171
 21. Zhou B, Pei J (2008) Preserving privacy in social networks against neighborhood attacks. In: ICDE, pp 506–515
Copyright information
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.