The risk of node reidentification in labeled social graphs
Abstract
Real network datasets provide significant benefits for understanding phenomena such as information diffusion or network evolution. Yet the privacy risks raised by sharing real graph datasets, even when stripped of user identity information, are significant. When nodes have associated attributes, the privacy risks increase. In this paper we quantitatively study the impact of binary node attributes on node privacy by employing machine-learning-based reidentification attacks and exploring the interplay between graph topology and attribute placement. We also analyze the risk to anonymity in epidemic networks subject to different node reidentification attacks. Our experiments show that the population’s diversity on the binary attribute consistently degrades anonymity. More interestingly, we show that similarly diverse populations in the SI epidemic model maintain different levels of anonymity under different infection rates.
Keywords
Graph anonymization, Social networks
Introduction
Real graph datasets are fundamental to understanding a variety of phenomena, such as epidemics, adoption of behavior, crowd management and political uprisings. At the same time, many such datasets capturing computer-mediated social interactions are recorded nowadays by individual researchers or by organizations. However, while the need for real social graphs and the supply of such datasets are well established, the flow of data from data owners to researchers is significantly hampered by serious privacy risks: even when humans’ identities are removed, studies have repeatedly shown that deanonymization is feasible with a high success rate (Narayanan et al. 2011; Srivatsa and Hicks 2012; Ji et al. 2014; Korula and Lattanzi 2014). Such deanonymization techniques reconstruct user identities using third-party public data and the graph structure of the naively anonymized social network: specifically, the information about one’s social ties, even without the particularities of the individual nodes, is sufficient to reidentify individuals.
Many anonymization methods have been proposed to mitigate the privacy invasion of individuals from the public release of graph data (Ji et al. 2016). Naive anonymization schemes employ methods to scrub identities of nodes without modifying the graph structure. Structural anonymization methods change the topology of the original graph while attempting to preserve (at least some of) the original graph characteristics (Liu and Terzi 2008; Sala et al. 2011; Liu and Mittal 2016). Often the utility of an anonymized graph depends not only on preserving essential graph properties of the original graph, but also on preserving node attributes, such as labels that identify nodes as cheaters or non-cheaters in online gaming platforms (Blackburn and Iamnitchi 2014).
However, the effects of node attributes on the risk of reidentification are not yet well understood. While intuitively any extra piece of information can be a danger to privacy, a rigorous understanding of what topological and attribute properties affect the reidentification risk is needed. In cases such as information dissemination, node attributes may be informed by the local graph topology. How does the interplay between topology and node attributes affect node privacy?
Our work assesses the additional vulnerability to reidentification attacks posed by the attributes of a labeled graph. We consider exactly one binary attribute to understand the lower bound of the damage that node attributes inflict. We focus our empirical study on the interplay between topology and labeling as a leverage point for reidentification. While most efforts for reidentification attacks are meant to show the vulnerability or resilience of a particular anonymization technique, this work is different, as it focuses on understanding under which conditions node reidentification is feasible, given the network topology and node attributes. Consequently, whether the network topology is original or anonymized is irrelevant for our study. We apply machine learning techniques that use both topological and attribute information to reidentify nodes based on a common threat model. Our study involves real-world graphs and synthetic graphs in which we control how labels are placed relative to ties to mimic the ubiquitous phenomenon of homophily (the tendency to connect with similar people) found in social graphs (McPherson and Cook 2001).
Our empirical results show that the vulnerability to node reidentification depends on the population diversity with respect to the attribute considered (Horawalavithana et al. 2018). Using information about the distribution of labels in a node’s neighborhood provides additional leverage for the reidentification process, even when labels are rudimentary. In this study, we show more evidence of this phenomenon based on the well-studied Susceptible-Infectious (SI) epidemic model. Furthermore, we quantify the relative importance of attribute-related and topological features in graphs of different characteristics.
Related Work
The availability of auxiliary data (such as public records, product reviews, or comments posted online) helps reveal the true identities of anonymized individuals, as proven empirically in large privacy violation incidents (Lemos 2007; Griffith and Jakobsson 2005). Similarly, in the case of graph deanonymization attacks, information from an auxiliary graph is used to reidentify the nodes in an anonymized graph (Narayanan and Shmatikov 2009). The quality of such an attack is determined by the rate of correct reidentification of the original nodes in the network. In general, deanonymization attacks harness structural characteristics of nodes that uniquely distinguish them (Ji et al. 2016). Many such attacks can be categorized into seed-based and seed-free, based on the prior seed knowledge available to an attacker (Ji et al. 2016).
In seed-based attacks, known mappings of some nodes in an auxiliary graph aid the reidentification of anonymized nodes (Narayanan et al. 2011; Srivatsa and Hicks 2012; Ji et al. 2014; 2016; Korula and Lattanzi 2014). The effectiveness of such attacks is influenced by the quality of the seeds (Sharad 2016b). The quality of the seeds is defined by topological properties of the seeds’ neighborhoods: for example, seeds with high degree whose neighbors have also been mapped to real identities have been shown to be highly effective in helping the reidentification process of the other nodes.
In seed-free attacks, the problem of deanonymization is usually modeled as a graph matching problem. Several research efforts have proposed statistical models for the reidentification of nodes without relying on seeds, such as the Bayesian model (Pedarsani et al. 2013) or optimization models (Ji et al. 2014; 2016). Many heuristics are used in the propagation process of reidentification, exploiting graph characteristics such as degree (Gulyás et al. 2016), k-hop neighborhood (Yartseva and Grossglauser 2013), linkage-covariance (Aggarwal et al. 2011), eccentricity (Narayanan and Shmatikov 2009), or community (Nilizadeh et al. 2014).
Recently, there have been efforts to incorporate node attribute information into deanonymization attacks. Gong et al. (2014) evaluate the combination of structural and attribute information on link prediction models. Attributes not present may be inferred through prior knowledge and network homophily. Qian et al. (2016) apply link prediction and attribute inference to deanonymization by quantifying the prior background information of an attacker using knowledge graphs. In knowledge graphs, edges represent not only links between nodes but also node-attribute links and link relationships among attributes. The deanonymization attack in (Ji et al. 2017) maps node-attribute links between an anonymized graph and its auxiliary graph. In addition to structural similarity, nodes are matched by attribute difference, the union of the attributes of the node in the anonymized and auxiliary graphs divided by their intersection.
However, the success rate of a deanonymization process is often reported in the literature as dependent on the chosen heuristic of the attack, which is typically designed with knowledge of the anonymization technique (Sharad and Danezis 2014). Comparing the strengths of different anonymization techniques thus becomes challenging, if not impossible. Recently, Sharad (2016b) proposed a general threat model to measure the quality of a deanonymization attack independently of the anonymization scheme. He proposed a machine learning framework to benchmark perturbation-based graph anonymization schemes. This framework explores the hidden invariants and similarities to reidentify nodes in the anonymized graphs (Sharad and Danezis 2013; 2014). Importantly, this framework can be easily tuned to model various types of attacks.
Several researchers propose theoretical frameworks to examine how vulnerable or deanonymizable any (anonymized) graph dataset is, given its structure (Pedarsani and Grossglauser 2011; Ji et al. 2014; Ji et al. 2015; Ji et al. 2016). However, some techniques are based on Erdős-Rényi (ER) models (Pedarsani and Grossglauser 2011), while others make impractical assumptions about the seed knowledge (Ji et al. 2015). Ji et al. (2016) also introduced a configuration model to quantify the deanonymizability of graph datasets by considering the topological importance of nodes. The same set of authors analyzed the impact of attributes on graph data anonymity (Ji et al. 2017). They show a significant loss of anonymity when more node-attribute relations are shared between anonymized and auxiliary graph data. Specifically, they measure the entropy present in the node-attribute mappings available to an attacker. As the entropy decreases, the graph loses node anonymity.
The main aspects distinguishing this study from existing works are as follows: i) In our work, we study the inherent conditions in graphs that provide resistance/vulnerability to a general node reidentification attack based on machine learning techniques. ii) To the best of our knowledge, this is the first work that quantifies the privacy impact of node attributes under an attribute attachment model biased towards homophily. iii) We analyze the interplay between the intrinsic vulnerability of the graph structure and attribute information.
Methodology
Our main objective is to quantitatively estimate the vulnerability to reidentification attacks added by node attributes. In particular, we ask: Given a graph topology, how much better does a node reidentification attack perform when the node attributes are included in the attack compared to when there is no node attribute information available to the attacker?
We are interested in measuring the intrinsic vulnerability of a graph with attributes on nodes, in the absence of any particular anonymization technique on topology or node attributes. The intuition is that particular graphs are inherently more private: for example, in a regular graph, nodes are structurally indistinguishable. Adding attributes to nodes, however, may contribute extra information that could make the reidentification attack more successful. As another example, in a highly disassortative network (such as a sexual relationships network), knowing the attribute values (i.e., gender) of a few nodes will quickly lead to correctly inferring the attribute values of the majority of nodes, thus possibly contributing to the reidentification of more nodes. Thus, we also ask the following question in this study: How does the distribution of node attributes affect the intrinsic vulnerability of a labeled graph topology to a reidentification attack?
To answer these questions, we developed a machine-learning-based reidentification attack inspired by that presented in (Sharad 2016b). We use the same threat model (“The Threat Model” section), which aims at finding a bijective mapping between nodes in two different graphs. We mount a machine-learning-based attack (“Machine Learning Attack” section), in which the algorithm learns the correct mapping between some pairs of nodes from the two graphs and estimates the mapping of the rest of the dataset. As input data, we use both real and synthetic datasets (as presented in the “Datasets” section).
The Threat Model
The threat model we consider is the classical threat model in this context (Pedarsani and Grossglauser 2011): The attacker aims to match nodes from two networks whose edge sets are correlated. We assume each node is associated with a binary-valued attribute, and this attribute is publicly available. Common examples of such attributes are gender, professional level (i.e., junior or senior), or education level (i.e., higher education or not).
For clarity, consider the following example: an attacker has access to two networks of individuals in an organization that represent the communication patterns (e.g., email) and friendship information available from an online social network. Individuals in the communication network are described by professional seniority (e.g., junior or senior), while individuals in the friendship network are described by gender. These graphs are structurally overlapping, in that some individuals are present in both graphs, even if their identities have been removed. The attacker’s task is to find a bijective (i.e., one-to-one) mapping between the two subsets of nodes in the two graphs that correspond to the individuals present in both networks.
Machine Learning Attack
In order to model this scenario using real data, we split a real dataset graph G=(V,E) into two subgraphs G_{1}=(V_{1},E_{1}) and G_{2}=(V_{2},E_{2}), such that V_{1}⊂V, V_{2}⊂V and V_{1}∩V_{2}=V_{α}, where V_{α}≠∅. The fraction of the overlap α is measured by the Jaccard coefficient of the two subsets: \(\alpha =\frac {|V_{1} \cap V_{2}|}{|V_{1} \cup V_{2}|}\). In the shared subgraph induced by the nodes in V_{α}, nodes will preserve their edges with nodes from V_{α} but might have different edges to nodes that are part of V_{1}−V_{α} or part of V_{2}−V_{α}. Each node v∈V_{1}∪V_{2} maintains its original attribute value.
In an optimistic scenario, an attacker has access to a part of the original graph (e.g., G_{1}) as auxiliary data and to an unperturbed subgraph (e.g., G_{2}) as the sanitized data whose nodes the attacker wants to reidentify. We use G_{1} and G_{2} as baseline graphs to measure the impact of attributes on deanonymizability of network data. It is also possible to split G_{1} and G_{2} recursively into multiple overlapping graphs, maintaining the same values of overlap parameters as above. This allows us to assess the feasibility of the deanonymization process for large networks by significantly reducing the size of G_{1} and G_{2}.
The resulting graphs are now the equivalent of the email/friendship networks we used as an example above. The overlap is the knowledge repository that the attacker uses for deanonymization (Henderson et al. 2011). Part of this knowledge will be made available to the machine learning algorithms.
Previous work shows that the larger α, the more successful the attack. However, the relative success of attacks under different anonymization schemes is observed to be independent of α (Sharad 2016b). In order to experiment with a homogeneous attack, we set the value of α=0.2, and we build V_{α} by building a breadth-first-search tree starting from the highest-degree node (BFS-HD) in G. While other alternatives are certainly possible, we chose this approach for two reasons. First, it appears that the threat model we used is quite sensitive to the sampling process when generating G_{1} and G_{2} (Pedarsani and Grossglauser 2011). To avoid sampling bias, we chose a BFS-HD split to have a deterministic set of nodes in V_{α}. Second, we empirically found that BFS-HD provides the maximally informed seeds for an adversary to propagate the reidentification process, thus providing a best-case scenario for the attacker.
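As an illustration of the overlap construction described above, the following Python sketch orders nodes by a breadth-first search rooted at the highest-degree node and takes the first α|V| of them as the shared set. The way the remaining nodes are divided between the two subgraphs (alternating along the BFS order) is an assumption made only for this example, as are all function names; the text does not specify that step.

```python
from collections import deque

def bfs_order_from_highest_degree(adj):
    """Return nodes in BFS order, starting at the highest-degree node.

    adj maps each node to the set of its neighbors (undirected graph).
    Sorting makes tie-breaking deterministic. Only the connected
    component of the start node is visited.
    """
    start = max(sorted(adj), key=lambda v: len(adj[v]))
    order, seen, queue = [], {start}, deque([start])
    while queue:
        u = queue.popleft()
        order.append(u)
        for w in sorted(adj[u]):
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return order

def split_with_overlap(adj, alpha=0.2):
    """Hypothetical split: the first alpha*|V| BFS nodes are shared,
    the remaining nodes alternate between the two node sets."""
    order = bfs_order_from_highest_degree(adj)
    k = max(1, round(alpha * len(order)))
    shared, rest = set(order[:k]), order[k:]
    v1 = shared | set(rest[0::2])
    v2 = shared | set(rest[1::2])
    return v1, v2, shared

def jaccard(a, b):
    """Jaccard coefficient |a & b| / |a | b| of two node sets."""
    return len(a & b) / len(a | b)
```

Because the two node sets share only the BFS prefix, the Jaccard coefficient of the resulting sets equals the chosen prefix fraction whenever the graph is connected.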
Node Signatures
NDD (neighborhood degree distribution) is a vector of non-negative integers where \(NDD^{q}_{u}[k]\) represents the number of u’s neighbors at distance q with degree k. We concatenate the binned version of \(NDD^{1}_{u}\) with the binned version of \(NDD^{2}_{u}\) to define node u’s NDD signature. We use a bin size of 50, which was shown empirically (Sharad 2016b) to capture the high degree variations of large social graphs. For each q, we use 21 bins, which corresponds to a largest node degree of 1050. All larger values are placed in the last bin. This binning strategy is designed to capture the aggregate structure of ego networks, which is expected to be robust against edge perturbation (Sharad 2016a).
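The NDD signature can be sketched in a few lines of Python. This is a minimal illustration under two assumptions that the text does not settle: "neighbors at distance q" is read as nodes at shortest-path distance exactly q, and a neighbor of degree k falls in bin k // 50, with every overflow collected in the last bin. The function names are hypothetical.

```python
def neighbors_at_distance(adj, u, q):
    """Nodes whose shortest-path distance from u is exactly q."""
    seen, frontier = {u}, {u}
    for _ in range(q):
        frontier = {w for v in frontier for w in adj[v]} - seen
        seen |= frontier
    return frontier

def binned_ndd(adj, u, q, bin_size=50, n_bins=21):
    """Count u's distance-q neighbors per degree bin.

    Assumed binning: degree k goes to bin k // bin_size; anything
    beyond the last bin edge is counted in the last bin.
    """
    counts = [0] * n_bins
    for v in neighbors_at_distance(adj, u, q):
        counts[min(len(adj[v]) // bin_size, n_bins - 1)] += 1
    return counts

def ndd_signature(adj, u, bin_size=50, n_bins=21):
    """Concatenate the binned NDD vectors for q = 1 and q = 2."""
    return (binned_ndd(adj, u, 1, bin_size, n_bins)
            + binned_ndd(adj, u, 2, bin_size, n_bins))
```

With the default parameters each node is described by 42 counts, 21 per distance.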
NAD (neighborhood attribute distribution) is defined by \(NAD^{q}_{u}[i]\), which represents the number of u’s neighbors at distance q with attribute value i. It has been shown experimentally that the use of neighbor attributes as features often improves the accuracy of edge classification tasks (McDowell and Aha 2013).
We use the notation GS to represent the prediction results obtained from input features built from the topology alone (e.g., NDD), and GS(LBL) to represent the results obtained from features built from both the topology and the attribute information (e.g., the concatenation of the NDD and NAD vectors).
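A short Python sketch of the attribute features follows. It assumes that NAD is computed at distances 1 and 2, mirroring the NDD signature (the text leaves q open), and that attribute values are encoded as 0 and 1. The helper names are hypothetical, and the topology features are passed in as an already computed list.

```python
def neighbors_at_distance(adj, u, q):
    """Nodes whose shortest-path distance from u is exactly q."""
    seen, frontier = {u}, {u}
    for _ in range(q):
        frontier = {w for v in frontier for w in adj[v]} - seen
        seen |= frontier
    return frontier

def nad(adj, labels, u, q, values=(0, 1)):
    """NAD^q_u[i]: number of u's distance-q neighbors with attribute value i."""
    hood = neighbors_at_distance(adj, u, q)
    return [sum(1 for v in hood if labels[v] == i) for i in values]

def gs_lbl_features(topology_features, adj, labels, u):
    """Feature vector for the labeled setting: topology features
    followed by NAD at distance 1 and distance 2 (an assumption)."""
    return (list(topology_features)
            + nad(adj, labels, u, 1)
            + nad(adj, labels, u, 2))
```

Dropping the two NAD parts from the returned list gives the feature vector of the unlabeled setting, so both settings can be built from the same node.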
Random Forest Classification
Note that the nodes in G_{san}∩G_{aux}, common to both graphs, can be recognized as being the same node (identical) in the two graphs based on their node identifier. Nonidentical nodes are unique to either G_{san} or G_{aux} and do not exist in the overlap. In the classification task, we wish to output 1 for an identical node pair and 0 for a nonidentical node pair. This is the ground truth against which we measure the accuracy of the learning algorithms.
We generate examples for the training phase of the deanonymization attack by randomly picking node pairs from the sanitized (G_{san}) and the auxiliary (G_{aux}) graphs, respectively. In most cases, we have an unbalanced dataset in which nonidentical node pairs are the majority, with the degree of imbalance depending on the overlap parameter α. We use the reservoir sampling technique (Haas 2016) to take ℓ=1000 balanced subsamples from the population S of node pairs, and the SMOTE algorithm (Chawla et al. 2002) as an oversampling technique for each subsample. A forest of 100 random decision trees is trained on each subsample, which allows the algorithm to learn the features. The Gini index is used as the impurity measure for the random forest classification. Given the size α of the overlap, we measure the quality of the classifier on the task of differentiating two nodes as identical or not.
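The subsampling step can be illustrated with reservoir sampling (Algorithm R), which draws a uniform sample from a stream of node pairs in a single pass. The sketch below is only one possible reading: it keeps one reservoir per class so that the subsample contains equally many identical and nonidentical pairs, which is an assumption about how balance is obtained. The SMOTE and random forest steps are deliberately left out, and the function name and tuple layout are hypothetical.

```python
import random

def balanced_subsample(pairs, k_per_class, seed=0):
    """Single-pass reservoir sampling with one reservoir per class.

    pairs yields (san_node, aux_node, label) tuples, where label is 1
    for an identical pair and 0 for a nonidentical pair. Returns up to
    k_per_class uniformly chosen pairs of each class.
    """
    rng = random.Random(seed)
    reservoirs = {0: [], 1: []}
    seen = {0: 0, 1: 0}
    for pair in pairs:
        label = pair[2]
        i = seen[label]          # index of this pair within its class
        seen[label] += 1
        if i < k_per_class:
            reservoirs[label].append(pair)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k_per_class:
                reservoirs[label][j] = pair
    return reservoirs[1] + reservoirs[0]
```

Repeating the call with different seeds would yield the independent subsamples on which the classifiers are trained.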
Metrics
We measure the accuracy of the classifier in determining whether a randomly chosen pair of nodes (with one node in G_{san} and another in G_{aux}) are identical or not. We use the F1-score to evaluate the quality of the classifier. The F1-score is the harmonic mean of precision and recall, which are typical metrics for the prediction output of machine learning algorithms.
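For reference, the metric can be computed directly from the predicted and true pair labels. The sketch below is a plain Python illustration with a hypothetical function name; returning 0.0 when there are no true positives is a convention chosen here to avoid division by zero.

```python
def f1_score(y_true, y_pred):
    """F1 for the positive class (1 = identical node pair)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # convention used in this sketch
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

For example, with true labels [1, 1, 1, 0, 0] and predictions [1, 1, 0, 1, 0], precision and recall are both 2/3, so the F1-score is also 2/3.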
For each data sample, we perform 5×2 cross-validation to evaluate the classifier and record the mean F1-score. We thus build two vectors of mean F1-scores, each of size ℓ=1000 (as described above), one for the labeled (GS(LBL)) and one for the unlabeled network topology (GS). An important aspect of these vectors is that they are related in the sense that the i^{th} element in one vector represents the same sample as the i^{th} element of the other vector. This is important for the pairwise comparison of the two mean F1-score vectors.
We perform a standard T-test on these two vectors and report the T-statistic value. The T-statistic value is a measure of how close to the hypothesis an estimated value is. In our case, the hypothesis is the prediction accuracy of the node identities in the unlabeled graph (GS) and the estimated value is the prediction accuracy in the labeled graph (GS(LBL)). Thus, a large T-statistic value implies a significantly better prediction accuracy of node identities in GS(LBL) than in GS. In such cases, we can say that the network with node attributes is more vulnerable to node reidentification. This value serves as our statistical measurement to quantify the vulnerability cost of node attributes.
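Since the two score vectors are matched element by element, one natural way to compare them is the paired form of the test, sketched below in Python. Whether the paired or the independent-samples variant was used is not stated in the text, so the choice here is an assumption, as is the function name.

```python
import math
from statistics import mean, stdev

def paired_t_statistic(f1_labeled, f1_unlabeled):
    """Paired t-statistic for matched vectors of mean F1-scores.

    Positive values indicate higher scores with attributes (GS(LBL))
    than without (GS). Needs at least two pairs and non-constant
    differences, otherwise the standard error is undefined.
    """
    diffs = [a - b for a, b in zip(f1_labeled, f1_unlabeled)]
    standard_error = stdev(diffs) / math.sqrt(len(diffs))
    return mean(diffs) / standard_error
```

The statistic grows with both the average gain in F1-score and the number of subsamples, which is worth keeping in mind when comparing values across experiments of different sizes.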
Datasets
Because our work is empirically driven, a larger set of test datasets promises a better understanding of the relations between vulnerability to reidentification attacks and the particular characteristics of the node attributes (such as fractions of attributes of a particular value or the assignment of attributes to topologically related nodes). In this respect, real datasets are always preferable to synthetic ones, as they potentially encapsulate phenomena that are missing in the graph generative models. As an example, until very recently, the relation between the local degree assortativity coefficient and node degree was not captured in graph topology generators (Sendiña-Nadal et al. 2016).
However, relying only on real datasets has its limitations, due to the scarcity of relevant data (in this case, networks with binary node attributes) and the difficulty of covering the relevant space of graph metrics when relying only on available real datasets. Thus, in this work, we combine real networks (described in the “Real Network Datasets” section) with synthetic networks generated from the real datasets. For generating synthetic labeled networks, we employ ERGMs (Holland and Leinhardt 1981; Wasserman and Pattison 1996) and a controlled node-labeling algorithm as described in the “Synthetic Graphs” section.
Real Network Datasets

polblogs (Adamic and Glance 2005) is an interaction network between political blogs during the lead-up to the 2004 US presidential election. This dataset includes ground-truth labels identifying each blog as either conservative or liberal.

fbdartmouth, fbmichigan, and fbcaltech (Traud et al. 2012) are Facebook social networks extant at three US universities in 2005. A number of node attributes such as dorm, gender, graduation year, and academic major are available. We chose two such attributes that could be represented as binary attributes: gender and occupation, where for occupation we could identify the attribute values “student” and “faculty”. From each dataset, we obtained two networks with the same topology but different node attribute distributions.

pokec1 (Takac and Zabovsky 2012) is a sample of an online social network in Slovakia. While the Facebook samples are university networks, Pokec is a general social platform whose membership comprises 30% of the Slovakian population. pokec1 is a one-fortieth sample. This dataset has gender information available as a node attribute.

amazonproducts (Leskovec et al. 2007) is a bimodal projection of categories in an Amazon product co-purchase network. Nodes are labeled as “book” or “music”; edges signify that the two items were purchased together.
Table 1 Graph properties of the real network datasets
Network         Attribute   N       E        \(\bar{d}\)  C       r       κ      R(%)  B(%)  R−R(%)  B−B(%)  R−B(%)  p     τ
polblogs        party       1224    16718    0.02         0.22    −0.22   2.49   48    52    44      48      8       0.48  0.84
fbcaltech       gender      769     16656    0.05         0.29    −0.06   1.33   91.5  8.5   92.8    0.2     7       0.08  0.52
fbcaltech       occupation  769     16656    0.05         0.29    −0.06   1.33   72    28    69      8       23      0.28  0.42
fbdartmouth     gender      7694    304076   0.01         0.15    0.04    2.76   86.5  13.5  83.2    0.9     15.9    0.14  0.34
fbdartmouth     occupation  7694    304076   0.01         0.15    0.04    2.76   62    38    58      18      24      0.38  0.5
fbmichigan      gender      30147   1176516  0.0026       0.13    0.115   3.05   92.2  7.8   90.5    0.2     9.3     0.08  0.37
fbmichigan      occupation  30147   1176516  0.0026       0.13    0.115   3.05   77.5  22.5  72      9       19      0.22  0.46
pokec1          gender      265388  700352   2×10^{−5}    0.0068  −0.044  5.66   46    54    18.6    22.4    59      0.46  0
amazonproducts  category    303551  835326   1.8×10^{−5}  0.21    −0.06   17.42  82    18    83.4    16.4    0.2     0.18  0.99
The metrics p and τ shown in Table 1 are inspired by the synthetic node labeling algorithm used for generating synthetic graphs (and presented later), and they also show high variation across different networks. Intuitively, p captures the diversity of attribute values in the node population (with p=0.5 showing equal representation of the attributes) while τ captures the homophily phenomenon (that functions as an attraction force between nodes with identical attribute values). The homophilic attraction metric τ varies from 0 in pokec1 (thus, no higher than chance preference for social ties with people of the same gender in Slovakia) to 0.99 in amazonproducts (books are purchased together with other books much more strongly than given by chance). The diversity metric p varies from the overrepresentation of males in the US academic Facebook networks (8% female representation) to an almost perfectly balanced political representation in the polblogs dataset (where p=0.48). Note that we define p as the smaller of the two group proportions, given the symmetric nature of attributes in our experiments.
This wide variation in graph metric values is what motivated our choice of this set of real networks. We opted to include the three Facebook networks from similar contexts to also capture more subtle variations in network characteristics.
Synthetic Graphs
In order to be able to control graph characteristics and node attribute distributions, we also generated a number of synthetic graphs comparable with the real datasets just described. The graph generation included two aspects: topology generation, for which we opted for ERGMs, and node attribute assignments, for which we implemented the technique proposed in (Skvoretz 2013).
Varying Topology via ERGMs
Exponential-family random graph models (ERGMs), or p-star models (Holland and Leinhardt 1981; Wasserman and Pattison 1996), are used in social network analysis to specify, within a set of structural parameters, probability distributions over networks. Their primary use is to describe structural and local forces that shape the general topology of a network. This is achieved by using a selected set of parameters that encompass different structural forces (e.g., homophily, degree correlation/assortativity, clustering, and average path length). Once the model has converged, we can obtain maximum-likelihood estimates, model comparison and goodness-of-fit tests, and generate simulated networks tied to the relationship between the original network and the probability distribution provided by the ERGM.
Our interest in ERGMs lies in simulating graphs that retain certain structural information from the original graph, so as to generate a diverse set of graph structures. We used R (R Core Team 2014) and the statnet suite (Handcock et al. 2014), which contains several packages for network analysis, to produce ERGMs and simulate graphs from our real-world network datasets. In this case, we focused on three structural aspects of the graphs: clustering coefficient, average path length, and degree correlation/assortativity.
For the ERGM based on clustering coefficient, we used the edges and triangle parameters in the statnet package. The edges parameter measures the probability of linkage or no linkage between nodes, and the triangle term looks at the number of triangles or triad formations in the original graph. For the average path length model, the edges and twopath terms were used. The twopath term measures the number of 2-paths in the original network and produces a probability distribution of their formation for the converged ERGM. Lastly, for the assortativity measure, the terms edges and degcor were used to produce the models. The degcor term considers the degree correlation of all pairs of tied nodes (for more on ERGMs see (Hunter et al. 2008; Morris et al. 2008)). These terms proved to be our best choices for preserving, to a certain extent, the desired structural information. Although the creation of ERGMs is a trial-and-error process, the selected terms were successful in producing models for each of the original networks.
Table 2 Basic statistics of generated ERGM networks, and the population of node pairs
Network         ERGM  d        C     r     κ      S (millions)
polblogs        dc    0.02     0.03  0.08  2.52   5.5
polblogs        cc    0.02     0.33  0.02  2.69   13.1
polblogs        apl   0.02     0.10  0.06  2.49   11.5
fbcaltech       dc    0.06     0.08  0.11  2.13   1.2
fbcaltech       cc    0.06     0.42  0.06  2.73   4.1
fbcaltech       apl   0.06     0.07  0.11  1.97   1.2
fbdartmouth     dc    0.01     0.17  0.07  2.66   14.5
fbdartmouth     cc    0.01     0.24  0.04  2.77   13.2
fbdartmouth     apl   0.01     0.20  0.04  2.70   14.2
fbmichigan      dc    0.003    0.02  0.12  3.28   38.4
fbmichigan      cc    0.002    0.20  0.12  3.52   39.9
fbmichigan      apl   0.002    0.20  0.12  3.64   38.2
pokec1          dc    2.02E-5  0.06  0.04  5.60   29.5
pokec1          cc    2.05E-5  0.07  0.04  5.84   29.3
pokec1          apl   2.04E-5  0.06  0.04  5.63   27.3
amazonproducts  dc    1.82E-5  0.37  0.06  11.86  43.7
amazonproducts  cc    1.82E-5  0.40  0.06  13.52  72.5
amazonproducts  apl   1.82E-5  0.39  0.06  13.47  74.3
Synthetic Labeling
A simple model that parameterizes a labeled graph with a tendency towards homophily (ties disproportionately between those of similar attribute background) is an “attraction” model (Skvoretz 2013). In the basic case of a binary attribute variable and a constant tendency to inbreed, two parameters, p and τ, both in the (0,1) interval, characterize the distribution of ties within and between the two groups. The first is the proportion of the population that takes on one value of the attribute (with 1−p, the proportion taking on the other value). The second parameter, the inbreeding coefficient or probability, expresses the degree to which a tie whose source is in one group is “attracted” to a target in that group. When τ=0, there is no special attraction and ties within and between groups occur in chance proportions. When τ>0, ties occur disproportionately within groups, increasing as τ approaches 1. Given a total number of ties, values for p and τ determine the number of ties/edges that are between groups, namely, δ=E×2×(1−τ)p(1−p).
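The formula for δ is easy to evaluate and to invert. The Python sketch below (function names are hypothetical) computes the number of between-group ties implied by E, p and τ, and rearranges the same expression to recover τ from an observed fraction of between-group ties. Plugging in the values reported for polblogs (E=16718, p=0.48, τ=0.84) gives roughly 1335 ties, about 8% of all edges, which agrees with the R−B share listed for that network.

```python
def expected_cross_ties(num_edges, p, tau):
    """delta = E * 2 * (1 - tau) * p * (1 - p)."""
    return num_edges * 2 * (1 - tau) * p * (1 - p)

def tau_from_cross_fraction(cross_fraction, p):
    """Solve delta / E = 2 * (1 - tau) * p * (1 - p) for tau."""
    return 1 - cross_fraction / (2 * p * (1 - p))

# polblogs values as reported in the text
delta = expected_cross_ties(16718, 0.48, 0.84)   # about 1335.3
tau = tau_from_cross_fraction(0.08, 0.48)        # about 0.84
```

The inverse is plain algebra on the formula above; when the observed fraction exceeds the chance level 2p(1−p), it returns a negative value, which falls outside the range the model covers.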
In the process of generating synthetic node attributes, we first randomly assign two arbitrary values (i.e., R and B) as labels to all the nodes in the graph for a given p, 1−p split. Then, we repeatedly draw an R node and a B node at random and swap their labels if doing so decreases the number of R-B ties. This process converges when the total number of cross-group ties is reduced to δ for a particular value of τ.
It should be noted that convergence is not guaranteed for all possible combinations of p and τ. The swapping procedure holds constant all graph properties except the mapping of nodes to labels, and consequently, it may not be possible to find a mapping of nodes to labels that achieves a target number of ties between groups (when that number is low as it is for higher values of τ).
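The labeling procedure can be sketched as follows in Python. The swap budget max_swaps and the returned convergence flag are additions made for this illustration, motivated by the fact that convergence is not guaranteed; recounting all cross-group ties after every trial swap is also a simplification that would be too slow for the larger graphs. All names are hypothetical.

```python
import random

def cross_ties(edges, labels):
    """Number of edges whose endpoints carry different labels."""
    return sum(1 for u, v in edges if labels[u] != labels[v])

def assign_labels(nodes, edges, p, tau, max_swaps=100_000, seed=0):
    """Random p / (1 - p) labeling followed by improving label swaps.

    Returns (labels, cross_tie_count, converged). max_swaps is an
    assumed safeguard because the target may be unreachable.
    """
    rng = random.Random(seed)
    nodes = list(nodes)
    rng.shuffle(nodes)
    n_red = round(p * len(nodes))
    labels = {v: ("R" if i < n_red else "B") for i, v in enumerate(nodes)}
    reds, blues = nodes[:n_red], nodes[n_red:]
    target = len(edges) * 2 * (1 - tau) * p * (1 - p)  # delta
    current = cross_ties(edges, labels)
    if not reds or not blues:
        return labels, current, current <= target
    for _ in range(max_swaps):
        if current <= target:
            break
        i, j = rng.randrange(len(reds)), rng.randrange(len(blues))
        r, b = reds[i], blues[j]
        labels[r], labels[b] = "B", "R"          # trial swap
        new = cross_ties(edges, labels)
        if new < current:                        # keep improving swaps
            reds[i], blues[j] = b, r
            current = new
        else:                                    # undo otherwise
            labels[r], labels[b] = "R", "B"
    return labels, current, current <= target
```

The swap only exchanges labels between two nodes, so the group sizes and the graph topology stay fixed throughout, as in the procedure described above.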
Table 2 presents the graph characteristics of the synthetically generated labeled graphs.
Empirical Results
Our objective is not to measure the success of reidentification attacks on original datasets in which node identities have been removed: it has been demonstrated long ago (Backstrom et al. 2007) that naive anonymization of graph datasets does not provide privacy. Instead, our objective is to quantify the exposure provided by node attributes on top of the intrinsic vulnerability of the particular graph topology under attack.
In our experiments, we leverage the real and synthetic networks described above. We mount the machine learning attack described in “Machine Learning Attack” section to reidentify nodes using features based on both graph topology and node attributes. Our first guiding question is thus: How much risk of node reidentification is added to a network dataset by its binary node attributes?
The Vulnerability Cost of Node Attributes
Another observation from this figure is that different node attributes applied to the same topology have different outcomes: see, for example, the case of the fbmichigan topology, where the difference between the impacts of the gender and the occupation attributes is the largest. We thus formulate a new question: What placement of attributes onto nodes reveals more information?
Diversity Matters, Homophily Not
To understand how the placement of attribute values on nodes affects vulnerability, we generate synthetic node attributes in a controlled manner. By varying p (the diversity ratio) and τ (the bias of nodes with same-value attributes to be connected by an edge), we can study the effect of these parameters on node reidentification.
We observe three phenomena. First, p appears to be positively correlated with the T-statistic value that measures the reidentification impact of attributes. That is, the more diverse the population (the larger p), the more vulnerable to reidentification the labeled nodes become on average. Intuitively, in a highly skewed attribute population, the minority nodes are identified more quickly because of their node attributes, but the majority remains protected. In contrast, when p=0.5, the network has two equal-sized sets of nodes, each taking one of the two attribute values, so no majority group remains protected. This is explained by the fact that the NAD feature vector captures more diverse information in the attributes of neighbors when p is larger. It also explains why the node attributes contribute so much more to vulnerability in the polblogs dataset, which has high diversity (p=0.48, that is, almost equal numbers of conservative and liberal blogs). Note that the effect of p on the added vulnerability remains consistent across all topologies (real and synthetic) tested.
The second observation is that there is no visible pattern in how τ influences the vulnerability added by binary node attributes. While this is disappointing from the perspective of storytelling, it is potentially encouraging for data sharing, as it suggests that datasets that record homophily (or influence; the debate is irrelevant in this context) do not have to be anonymized by damaging this pattern. As a specific example, the privacy of a dataset that records an information dissemination phenomenon could be provided without perturbing the cascading-related ties.
The third class of observations is related to the relative effect of the topological characteristics on the added vulnerability. Both amazonproducts and pokec1 are orders of magnitude sparser than the other datasets considered. This means that the topological information available to the machine learning algorithm is limited. In this situation, the addition of the attribute information turns out to be very significant: the T-statistic values for these datasets are significantly larger than for the other datasets, with values over 400 in some cases.
Another topological effect is noticed when comparing the real pokec1 topology with the ERGM-generated ones in Fig. 5e: the node attribute contributes much more to the vulnerability of the original topology than to that of the synthetic topologies. The reason for this unusual behavior may lie in the different clustering coefficients of the networks, as seen in Tables 1 and 2: the ERGM-generated topologies have clustering coefficients one order of magnitude higher than the original topology (for the same graph density), which leads to more diverse NDD feature vectors for the networks with higher clustering and thus richer training information. This in turn leads to better accuracy in node reidentification in the unlabeled ERGM topologies (with higher clustering) than in the original topology. For example, in pokec1 the maximum F1-score is 0.92 for the ERGMdc topology but 0.76 for the original. Thus, the relative benefit of the node attribute is significantly higher when the topology features are poorer.
Topology Leaks
We make three observations from this figure. First, most of the NAD features (together with the node’s attribute value) that represent node attribute information prove to be important in all datasets.
Second, among the NDD features, only a small number contribute consistently to accurate prediction. As shown in Figs. 6c–i, the first bins of the 1-hop and 2-hop NDD vectors contribute the most. That is, the number of a node's neighbors with degrees between 1 and 50 has a high impact on its reidentification. This behavior is observed even in large networks such as pokec1 and amazonproducts, which have a larger range of node degrees.
Third, Fig. 6 suggests which features explain the effect of diversity p on node reidentification in labeled networks. On datasets with large diversity (such as polblogs or pokec1), the topological information contributes less than on datasets with low diversity (such as fbcaltech (gender)). This is because high diversity correlates with richer NAD feature vectors, and thus the relative importance of the NAD features increases.
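To make the notion of a "first bin" concrete, the following sketch counts the 1-hop and 2-hop neighbors of a node by degree bin. It is an illustration only: the bin width of 50 is inferred from the degree range 1 to 50 quoted above, while the number of bins, the clipping of large degrees into the last bin, and the function name are hypothetical choices, not the exact NDD definition used in the experiments.

```python
def binned_neighbor_degrees(adj, node, bin_width=50, n_bins=4):
    """Histogram the degrees of a node's 1-hop and 2-hop neighbors.

    adj maps each node to the set of its neighbors (undirected graph).
    Bin 0 holds neighbors with degree 1..bin_width, bin 1 the next
    bin_width degrees, and so on; larger degrees fall in the last bin.
    """
    degree = {n: len(nbrs) for n, nbrs in adj.items()}
    one_hop = set(adj[node])
    two_hop = set()
    for n in one_hop:
        two_hop |= adj[n]
    two_hop -= one_hop | {node}   # keep only nodes at distance exactly 2

    def histogram(group):
        bins = [0] * n_bins
        for n in group:
            index = min((degree[n] - 1) // bin_width, n_bins - 1)
            bins[index] += 1
        return bins

    return histogram(one_hop), histogram(two_hop)
```

Under these assumptions, the first entry of each returned list is the count that the feature-importance analysis singles out.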
Epidemic and the Risk of Node Reidentification
In this section we consider the scenario of node attribute placement under the constraint of an epidemic process. We use the Susceptible-Infectious (SI) model (Kermack and Mckendrick 2003) to generate an epidemic process on the original graph topology. In the SI model, individuals are initially susceptible, with the exception of a small fraction of the population that is infectious. On contact with an infectious individual, a susceptible individual becomes infectious with probability β. Once infected, individuals stay infected and infectious throughout their lifetime.
We use this model to assign binary attributes (i.e., susceptible and infectious) to the nodes in the graph. In each experiment, we select the 0.1% highest-degree nodes as infectious to initialize the epidemic. We vary the infection probability β between 0 and 1. We mount the machine-learning attack on each epidemic graph independently, once on the graph topology GS alone and once on the same topology augmented with binary node attributes GS(LBL) under the respective epidemic process. We make two assumptions in this task. First, we assume that the graph topology remains static during the epidemic process. Second, we assume that the adversary does not have any prior information about other epidemic graphs in the series.
We observe the same correlation between the population’s diversity (p) and the T-statistic values over the epidemic graphs. However, the T-statistic values show different patterns depending on the infection probability β. Note that the population’s diversity (p) increases to a local maximum in the initial timesteps, and then drops in later timesteps. This is an intuitive observation given the properties of the SI model (Kermack and Mckendrick 2003).
When the epidemic grows slowly (i.e., low infection probability), the T-statistic value also increases at a slower rate. On the other hand, when the epidemic spreads at a higher infection rate, the T-statistic value also increases at a higher rate and achieves a relatively larger peak value. For the fbcaltech network, the T-statistic value reaches a peak value of 10 in four infection steps for β=0.1, while it reaches a peak value of 50 in two infection steps for β=0.9. Interestingly, the most diverse population in the fbcaltech network is also observed after four infection steps for β=0.1, and after two infection steps for β=0.9 (as shown in Fig. 7d). In polblogs, T-statistic values reach peak values of 31 and 36 for the infection rates of 0.1 and 0.9, respectively (as shown in Fig. 7h). The polblogs population becomes more diverse in a similar number of infection steps given the respective infection rate.
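A minimal sketch of how such epidemic labelings could be generated is shown below. It is not the simulation code used in the experiments: the function name, the synchronous update rule, the independent infection trial for every infectious-susceptible contact, and the guarantee of at least one seed node are assumptions made for illustration.

```python
import random

def simulate_si(adj, beta, steps, seed_fraction=0.001, seed=0):
    """Run a discrete-time SI process on a static undirected graph.

    adj maps each node to the set of its neighbors. The highest-degree
    nodes (seed_fraction of all nodes, at least one) start infectious.
    Returns a list of infected-node sets, one per time step (step 0 first).
    """
    rng = random.Random(seed)
    n_seeds = max(1, round(seed_fraction * len(adj)))
    by_degree = sorted(adj, key=lambda n: len(adj[n]), reverse=True)
    infected = set(by_degree[:n_seeds])
    history = [set(infected)]
    for _ in range(steps):
        newly_infected = set()
        for u in infected:
            for v in adj[u]:
                # Each contact with an infectious neighbor is one trial.
                if v not in infected and rng.random() < beta:
                    newly_infected.add(v)
        infected |= newly_infected   # nodes never recover in the SI model
        history.append(set(infected))
    return history
```

Each set in the returned history corresponds to one labeled graph in the series: a node is labeled infectious if it is in the set and susceptible otherwise, while the topology stays fixed.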
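The rise and fall of diversity during an SI epidemic follows directly from the monotone growth of the infected group. The short example below illustrates this, assuming that p denotes the share of the smaller of the two attribute groups (so that p is at most 0.5); the function name and the infected counts are made up for illustration and are not experimental data.

```python
def diversity(n_infected, n_total):
    """Share of the smaller of the two groups (infectious vs. susceptible)."""
    return min(n_infected, n_total - n_infected) / n_total

# Hypothetical infected counts over time in a population of 100 nodes.
infected_counts = [1, 10, 50, 90, 100]
trajectory = [diversity(count, 100) for count in infected_counts]
# trajectory == [0.01, 0.1, 0.5, 0.1, 0.0]
```

Diversity peaks when half of the population is infected and then falls to zero once every node is infectious, which is why a higher β reaches the peak in fewer steps.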
Summary and Discussions
This paper shows that the addition of even a single binary attribute to nodes in a network increases the vulnerability to node reidentification. The increase in vulnerability derives from the fact that the machine learning attack makes use of the relationship between topology and the distribution of node labels. Using information about the distribution of labels in a node’s neighborhood provides additional leverage for the reidentification process, even when the labels are rudimentary.
Furthermore, we find that a population’s diversity with regard to the binary attribute consistently degrades anonymity and increases vulnerability. Diversity means a more even distribution of the binary attribute, which produces a more varied set of neighborhood distributions that nodes can exhibit. Consequently, nodes are more easily distinguished from one another by virtue of their differing neighborhood distributions of labels.
This observation is critical for network datasets for which the node attributes are the result of an epidemic process. If the epidemic process is monitored, an adversary could observe the node states and their changes repeatedly over multiple time steps. In such a scenario, the adversary could mount an even stronger node reidentification attack. The techniques presented in this paper can be applied to build strong anonymization techniques for such cases. Specifically, our techniques can be used to estimate the rate of anonymity loss over the lifespan of an epidemic process and more efficiently guide data owners in the process of network data anonymization.
Another outcome of this work is that there is no consistent discernible impact of homophily, as measured by the inbreeding coefficient, on vulnerability. Our procedure for investigating the impact of homophily simply involves swapping labels without disturbing ties. Therefore, both local and global (unlabeled) topologies remain constant as we decrease the number of cross-group ties to achieve a target value implied by a particular inbreeding coefficient for a given proportional split along the binary attribute. This procedure disturbs the local labeled topology, but because the machine learning attack uses information from that local topology, it apparently can adapt to the changes and make equally successful predictions regardless of the value of the inbreeding coefficient.
There are multiple directions in which this work could be extended. For example, we would like to assess the vulnerability risk of network data that is subject to different epidemic processes, especially processes in which nodes can recover and become infected multiple times. We suspect that such dynamic processes could lead to less vulnerable network datasets. Also, we would like to apply the techniques developed in this paper to guide efficient anonymization strategies for network datasets with dynamic node attributes, such as those assigned by an epidemic process.
Notes
Acknowledgements
We are grateful to Clayton Gandy for his support with the acquisition and processing of network data.
Authors’ contributions
SH implemented and executed the experiments. SH and AI designed the experiments. JF and JS generated synthetic graphs based on the ERGM package. SH and AI wrote the manuscript with important contributions from JF and JS. All authors read and approved the final manuscript.
Funding
This research is supported by the National Science Foundation (NSF) in the USA under grant IIS 1546453.
Competing interests
The authors declare that they have no competing interests.
References
Adamic, LA, Glance N (2005) The political blogosphere and the 2004 US election: divided they blog. In: Proceedings of the 3rd International Workshop on Link Discovery, 36–43. ACM, New York.
Aggarwal, CC, Li Y, Philip SY (2011) On the hardness of graph anonymization. In: Data Mining (ICDM), 2011 IEEE 11th International Conference On, 1002–1007. IEEE, Vancouver.
Backstrom, L, Dwork C, Kleinberg J (2007) Wherefore art thou r3579x?: anonymized social networks, hidden patterns, and structural steganography. In: Proceedings of the 16th International Conference on World Wide Web, 181–190. ACM, New York.
Blackburn, J, Kourtellis N, Skvoretz J, Ripeanu M, Iamnitchi A (2014) Cheating in online games: A social network perspective. ACM Transactions on Internet Technology 13(3):9:1–9:25.
Chawla, NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) Smote: Synthetic minority oversampling technique. J Artif Int Res 16(1):321–357.
Gong, NZ, Talwalkar A, Mackey L, Huang L, Shin ECR, Stefanov E, Shi ER, Song D (2014) Joint link prediction and attribute inference using a social-attribute network. ACM Transactions on Intelligent Systems and Technology (TIST) 5(2):27.
Griffith, V, Jakobsson M (2005) Messin’ with Texas deriving mother’s maiden names using public records. In: Applied Cryptography and Network Security, 91–103. Springer, New York.
Gulyás, GG, Simon B, Imre S (2016) An efficient and robust social network deanonymization attack. In: Proceedings of the 2016 ACM on Workshop on Privacy in the Electronic Society, 1–11. ACM, New York.
Haas, PJ (2016) Data-stream sampling: basic techniques and results. In: Data Stream Management, 13–44. Springer, Berlin, Heidelberg.
Handcock, M, Hunter DR, Butts CT, Goodreau S, Krivitsky P, Bender-deMoll S, Morris M (2014) statnet: Software tools for the statistical analysis of network data. The Statnet Project. (http://www.statnet.org). R package version. Accessed 1 Mar 2019.
Henderson, K, Gallagher B, Li L, Akoglu L, Eliassi-Rad T, Tong H, Faloutsos C (2011) It’s who you know: graph mining using recursive structural features. In: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 663–671. ACM, New York.
Holland, PW, Leinhardt S (1981) An exponential family of probability distributions for directed graphs. Journal of the American Statistical Association 76(373):33–50.
Horawalavithana, S, Gandy C, Flores JA, Skvoretz J, Iamnitchi A (2018) Diversity, homophily and the risk of node reidentification in labeled social graphs. In: International Conference on Complex Networks and Their Applications, 400–411. Springer, Switzerland.
Hunter, DR, Handcock MS, Butts CT, Goodreau SM, Morris M (2008) ergm: A package to fit, simulate and diagnose exponential-family models for networks. Journal of Statistical Software 24(3):54860.
Ji, S, Li W, Gong NZ, Mittal P, Beyah RA (2015) On your social network deanonymizablity: Quantification and large scale evaluation with seed knowledge. In: NDSS. NDSS, San Diego.
Ji, S, Li W, Srivatsa M, He JS, Beyah R (2014) Structure based data deanonymization of social networks and mobility traces. In: International Conference on Information Security, 237–254. Springer, Switzerland.
Ji, S, Li W, Srivatsa M, Beyah R (2014) Structural data deanonymization: Quantification, practice, and implications. In: Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security, 1040–1053. ACM, New York.
Ji, S, Li W, Srivatsa M, Beyah R (2016) Structural data deanonymization: Theory and practice. IEEE/ACM Transactions on Networking 24(6):3523–3536. New York.
Ji, S, Li W, Srivatsa M, He JS, Beyah R (2016) General graph data deanonymization: From mobility traces to social networks. ACM Transactions on Information and System Security (TISSEC) 18(4):12:1–12:29.
Ji, S, Li W, Yang S, Mittal P, Beyah R (2016) On the relative deanonymizability of graph data: Quantification and evaluation. In: Computer Communications, IEEE INFOCOM 2016 - The 35th Annual IEEE International Conference On, 1–9. IEEE.
Ji, S, Mittal P, Beyah R (2016) Graph data anonymization, deanonymization attacks, and deanonymizability quantification: A survey. IEEE Communications Surveys & Tutorials.
Ji, S, Wang T, Chen J, Li W, Mittal P, Beyah R (2017) Desag: On the deanonymization of structure-attribute graph data. IEEE Transactions on Dependable and Secure Computing PP(99):1–1. https://doi.org/10.1109/TDSC.2017.2712150.
Kermack, W, Mckendrick A (2003) A contribution to the mathematical theory of epidemics. Proc Roy Soc 5(772):700–721.
Korula, N, Lattanzi S (2014) An efficient reconciliation algorithm for social networks. Proceedings of the VLDB Endowment 7(5):377–388.
Lemos, R (2007) Researchers reverse Netflix anonymization. http://www.securityfocus.com/news/11497. Accessed 11 Aug 2017.
Leskovec, J, Adamic LA, Huberman BA (2007) The dynamics of viral marketing. ACM Transactions on the Web (TWEB) 1(1):5.
Liu, C, Mittal P (2016) Linkmirage: Enabling privacy-preserving analytics on social relationships. In: NDSS. NDSS, San Diego.
Liu, K, Terzi E (2008) Towards identity anonymization on graphs. In: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, 93–106. ACM, New York.
McDowell, LK, Aha DW (2013) Labels or attributes?: rethinking the neighbors for collective classification in sparsely-labeled networks. In: Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, 847–852. ACM, New York.
McPherson, M, Smith-Lovin L, Cook J (2001) Birds of a feather: Homophily in social networks. Annual Review of Sociology 27:415–444.
Morris, M, Handcock MS, Hunter DR (2008) Specification of exponential-family random graph models: terms and computational aspects. Journal of Statistical Software 24(4):1548.
Narayanan, A, Shi E, Rubinstein BI (2011) Link prediction by deanonymization: How we won the kaggle social network challenge. In: Neural Networks (IJCNN), The 2011 International Joint Conference On, 1825–1834. IEEE, San Jose.
Narayanan, A, Shmatikov V (2009) Deanonymizing social networks. In: Security and Privacy, 2009 30th IEEE Symposium On, 173–187. IEEE.
Nilizadeh, S, Kapadia A, Ahn YY (2014) Community-enhanced deanonymization of online social networks. In: Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security, 537–548. ACM, New York.
Pedarsani, P, Figueiredo DR, Grossglauser M (2013) A Bayesian method for matching two similar graphs without seeds. In: Communication, Control, and Computing (Allerton), 2013 51st Annual Allerton Conference On, 1598–1607. IEEE, Monticello.
Pedarsani, P, Grossglauser M (2011) On the privacy of anonymized networks. In: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1235–1243. ACM, New York.
Qian, J, Li XY, Zhang C, Chen L (2016) Deanonymizing social networks and inferring private attributes using knowledge graphs. In: Computer Communications, IEEE INFOCOM 2016 - The 35th Annual IEEE International Conference On, 1–9. IEEE, San Francisco.
Sala, A, Zhao X, Wilson C, Zheng H, Zhao BY (2011) Sharing graphs using differentially private graph models. In: Proceedings of the 2011 ACM SIGCOMM Conference on Internet Measurement Conference, 81–98. ACM, New York.
Sendiña-Nadal, I, Danziger MM, Wang Z, Havlin S, Boccaletti S (2016) Assortativity and leadership emerge from anti-preferential attachment in heterogeneous networks. Scientific Reports 6:21297.
Sharad, K, Danezis G (2013) Deanonymizing D4D datasets. In: Workshop on Hot Topics in Privacy Enhancing Technologies, 10. PETS, Bloomington, Indiana.
Sharad, K, Danezis G (2014) An automated social graph deanonymization technique. In: Proceedings of the 13th Workshop on Privacy in the Electronic Society, 47–58. ACM, New York.
Sharad, K (2016) Learning to deanonymize social networks. PhD thesis. University of Cambridge, Computer Laboratory.
Sharad, K (2016) True friends let you down: Benchmarking social graph anonymization schemes. In: Proceedings of the 2016 ACM Workshop on Artificial Intelligence and Security. AISec ’16, 93–104. ACM, New York. https://doi.org/10.1145/2996758.2996765. http://doi.acm.org/10.1145/2996758.2996765.
Skvoretz, J (2013) Diversity, integration, and social ties: Attraction versus repulsion as drivers of intra- and intergroup relations. American Journal of Sociology 119:486–517.
Srivatsa, M, Hicks M (2012) Deanonymizing mobility traces: Using social network as a side-channel. In: Proceedings of the 2012 ACM Conference on Computer and Communications Security, 628–637. ACM, New York.
Takac, L, Zabovsky M (2012) Data analysis in public social networks. In: International Scientific Conference and International Workshop Present Day Trends of Innovations, 1. Present Day Trends of Innovations Lamza, Poland.
R Core Team (2014) R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. http://www.R-project.org/.
Traud, AL, Mucha PJ, Porter MA (2012) Social structure of Facebook networks. Physica A: Statistical Mechanics and its Applications 391(16):4165–4180.
Wasserman, S, Pattison P (1996) Logit models and logistic regressions for social networks: I. An introduction to Markov graphs and p*. Psychometrika 61(3):401–425.
Yartseva, L, Grossglauser M (2013) On the performance of percolation graph matching. In: Proceedings of the First ACM Conference on Online Social Networks, 119–130. ACM, New York.
Copyright information
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.