4.1 Introduction

Secure data management is a key issue in personal data distribution and analysis. Anonymization techniques have been used to balance the utility of data against their privacy risks. These techniques transform personal data into anonymized data in order to reduce the probability that data principals can be reidentified from the data. If the data are well anonymized, they cannot be linked to a person; thus, the privacy of that person is protected.

Secure computation is not always a realistic solution for commercial services because of its cost when the data are very large. In contrast, some anonymization techniques remain a "practical" solution for commercial services even when the data are very large. Thus, anonymization techniques have been applied to personal data distribution and data analysis. For example, k-anonymization was first proposed as a practical way to reduce the reidentification risks of public data; since then, it has also been considered for the secure management of personal data.

Quantitative measures of anonymity are required to estimate privacy risks and to assess whether privacy requirements can be met. In several studies on anonymization, privacy notions providing such quantitative measures have been defined for individual anonymization techniques; however, no notion common to all anonymization techniques has been presented to date. Each privacy notion is therefore local rather than universal, and heuristic approaches are still used to balance the usability of data against privacy risks across whole processes or services. A common notion is required for consistent secure data management over the whole process.

In this chapter, we discuss a new common privacy notion based on an adversary model that is applicable to several anonymization techniques, and we introduce a novel anonymization technique together with an implementation of it. In Sect. 4.2, we revisit adversary models for several anonymization techniques and review the techniques themselves. We propose a common adversary model and present quantitative measures based on it in Sect. 4.3. An extension is discussed in Sect. 4.4. Our implementation of an anonymization tool is introduced in Sect. 4.5. We conclude this chapter in Sect. 4.6.

4.2 Anonymization Techniques and Adversary Models, Revisited

The related work presented below is grouped by anonymization method, beginning with k-anonymization and noise addition.

4.2.1 k-Anonymization

k-anonymity [4,5,6] is a well-known privacy model. The property of k-anonymity is that each published record is such that every combination of values of quasi-identifiers can be matched to at least k respondents.

4.2.1.1 Adversary Model

k-anonymized datasets are assumed to be in public domains. An adversary can obtain all the attribute values in a dataset and execute arbitrary operations on the attribute values.

There are few formal definitions or models for the adversary that aim to identify the attributes of a certain individual in a k-anonymized dataset. Kiyomoto and Martin modeled an adversary [7] for k-anonymized datasets based on two query functions as follows:

Let d be an index of the dth record, \(q_x\) be a set of m attribute values in \(T^{q*}\), and s be a value for the sensitive attribute. The two query functions are defined as:

  • read. For the input of an index value d, the function outputs the dth record. That is, \(f(T^*, query=\{\mathbf{read}, d\}) \rightarrow \{ d, q_x^d, s^d \}\), where \(q_x^d\) and \(s^d\) are the values of the quasi-identifier and the sensitive attribute in the dth record, respectively. If the dth record does not exist, then the function outputs failed.

  • search. For the input \(q_x\) and/or s, the function outputs the number u of records and the index values that have quasi-identifier \(q_x\) and/or sensitive attribute s. That is, \(f(T^*, query=\{\mathbf{search}, q_x, s\}) \rightarrow u, D\), where u and D are the number of records and a sequence of index values that have the same quasi-identifier and/or sensitive attribute, respectively. If s or \(q_x\) does not exist, then the function outputs failed. A minimal code sketch of these two queries is given below.
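The following is a minimal Python sketch of this query interface, assuming the anonymized table \(T^*\) is stored as a list of (quasi-identifier, sensitive value) pairs whose position is the record index; the data layout and example values are illustrative only, not the representation used in [7].

```python
FAILED = "failed"

def read(table, d):
    """Return the d-th record as (d, q_x^d, s^d), or 'failed' if it does not exist."""
    if 0 <= d < len(table):
        qi, s = table[d]
        return d, qi, s
    return FAILED

def search(table, qi=None, s=None):
    """Return (u, D): the number of matching records and their index values."""
    if qi is None and s is None:
        return FAILED
    D = [d for d, (q, sv) in enumerate(table)
         if (qi is None or q == qi) and (s is None or sv == s)]
    return (len(D), D) if D else FAILED

# Toy 2-anonymous table: two records share the same quasi-identifier group.
T_star = [(("30-39", "Tokyo"), "flu"),
          (("30-39", "Tokyo"), "cold"),
          (("40-49", "Osaka"), "flu")]
print(read(T_star, 1))                        # (1, ('30-39', 'Tokyo'), 'cold')
print(search(T_star, qi=("30-39", "Tokyo")))  # (2, [0, 1])
```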

4.2.1.2 k-Anonymization Algorithm

The idea of k-anonymity is easy to understand, and many types of k-anonymization algorithms have been proposed. The Incognito algorithm [8] generalizes the attributes using taxonomy trees, and the Mondrian algorithm [9] averages the original data or replaces them with representative values to achieve k-anonymization. In this chapter, we use a k-anonymization algorithm based on clustering and denote k-anonymization of a dataset D by \(A_{k}(D)\). The algorithm finds close records and creates clusters such that each partition contains at least k records. For details of the algorithm, see [10].
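As an illustration only (a greedy variant under our own assumptions, not the exact algorithm of [10]), a clustering-based \(A_{k}(D)\) over numeric quasi-identifiers can be sketched as follows:

```python
import numpy as np

def k_anonymize_by_clustering(D, k):
    """Greedy sketch of a clustering-based A_k(D): group every record with its
    k-1 nearest unassigned neighbours (Euclidean distance on numeric
    quasi-identifiers) and replace each cluster member by the cluster mean."""
    D = np.asarray(D, dtype=float)
    anonymized = np.empty_like(D)
    unassigned = list(range(len(D)))
    last_members = None
    while len(unassigned) >= k:
        seed = unassigned.pop(0)
        nearest = sorted(unassigned, key=lambda j: np.linalg.norm(D[j] - D[seed]))
        members = [seed] + nearest[:k - 1]
        unassigned = [j for j in unassigned if j not in members]
        anonymized[members] = D[members].mean(axis=0)  # representative value
        last_members = members
    if unassigned:  # fewer than k leftovers: merge them into the last cluster
        group = unassigned if last_members is None else unassigned + last_members
        anonymized[group] = D[group].mean(axis=0)
    return anonymized

print(k_anonymize_by_clustering([[25, 170], [27, 172], [60, 155], [62, 158]], k=2))
```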

4.2.2 Noise Addition

Noise addition works by adding or multiplying stochastic or randomized numbers to confidential data [11]. The idea is simple and is also well known to be an anonymization technique.

4.2.2.1 Adversary Model

One objective of an adversary against noise-added datasets is to remove the noise or to estimate the original values from the noise-added attribute values. One potential scenario is a probabilistic approach in which the adversary estimates the distribution of the noise and chooses an attribute value with high probability. There is no formal adversary model for static noise-added datasets; differential privacy settings, by contrast, assume that noise is added dynamically, and their adversaries are defined in terms of queries.

4.2.2.2 Anonymization Algorithm by Noise Addition

The first work on noise addition was proposed by Kim [12], and the idea was to add noise \(\epsilon \) with a distribution \(\epsilon \sim N(0, \sigma ^2)\) to the original data. Additive noise is uncorrelated noise and preserves the mean and covariance of the original data, but the correlation coefficients and variance are not retained. Another variation of additive noise is correlated additive noise, which keeps the mean and allows the correlation coefficients in the original data to be retained [13]. Differential privacy is a state-of-the-art privacy model that is based on the statistical distance between two database tables differing by at most one record. The basic idea is that, regardless of background knowledge, an adversary with access to the dataset draws the same conclusions, irrespective of whether a person’s data are included in the dataset. Differential privacy is mainly studied in relation to perturbation methods in an interactive setting, although it is applicable to certain generalization methods.

In this chapter, we use Laplace noise for noise addition and add noise \(\epsilon \sim Lap(0, 2\phi ^2)\) to each attribute. We denote noise addition for dataset D by \(A_{\phi }(D)\).
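A minimal NumPy sketch of \(A_{\phi }(D)\) is shown below. We read the second parameter of \(Lap(0, 2\phi ^2)\) as the variance, so the Laplace scale is \(b=\phi \) (since the variance of Laplace noise is \(2b^2\)); this parameterization is our assumption.

```python
import numpy as np

def add_laplace_noise(D, phi, rng=None):
    """Sketch of A_phi(D): add zero-mean Laplace noise with scale b = phi
    (variance 2*phi^2, an assumed reading of Lap(0, 2*phi^2)) to every value."""
    rng = np.random.default_rng() if rng is None else rng
    D = np.asarray(D, dtype=float)
    return D + rng.laplace(loc=0.0, scale=phi, size=D.shape)

noisy = add_laplace_noise([[38.0, 165.0], [29.0, 171.0]], phi=1.0,
                          rng=np.random.default_rng(0))
print(noisy)
```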

4.2.3 k-Anonymization for Combined Datasets

We introduce an adversary model for a dataset combined from the datasets produced by two service providers, together with anonymization methods for such a dataset [14].

4.2.3.1 Adversary Model

If we consider the existing adversary model and assume that the anonymization tables produced by the service providers satisfy k-anonymity, the combined table also satisfies k-anonymity. However, we have to consider another type of adversary in our new service model. In our service model, the combined table includes many sensitive attributes; thus, an adversary can distinguish a data owner using background knowledge of combinations of the data owner's sensitive attribute values. If the adversary finds a combination of known sensitive attribute values in only one record, the adversary learns that the record belongs to a data owner the adversary knows and thereby also learns the remaining sensitive attribute values of that data owner. We model this new type of adversary as follows:

\(\pi \)-knowledge Adversary Model. An adversary knows certain \(\pi \) sensitive attribute values \(\{s^i_1, ..., s^i_j, ..., s^i_{\pi }\}\) of a victim i. Thus, the adversary can distinguish the victim in an anonymization table in which only one record contains some combination (of at most \(\pi \) values) of the attributes \(\{s^i_1, ..., s^i_j, ..., s^i_{\pi }\}\).
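To make the condition concrete, the following Python sketch lists the sensitive-value combinations (up to \(\pi \)-tuples) that occur in exactly one record of a combined table, i.e., the background knowledge a \(\pi \)-knowledge adversary could use to single out a victim; the table representation and example rows are our own.

```python
from itertools import combinations
from collections import Counter

def singling_out_combinations(sensitive_rows, pi):
    """List the combinations of at most pi sensitive values that occur in
    exactly one record. Each row is a tuple of sensitive attribute values;
    '*' marks a suppressed value."""
    counts = Counter()
    for row in sensitive_rows:
        values = sorted(v for v in row if v != "*")
        seen = set()
        for t in range(1, pi + 1):
            seen.update(combinations(values, t))
        for combo in seen:
            counts[combo] += 1
    return [combo for combo, c in counts.items() if c == 1]

rows = [("diabetes", "smoker"), ("diabetes", "non-smoker"), ("asthma", "smoker")]
# Any combination printed here appears in only one record and would reveal it.
print(singling_out_combinations(rows, pi=2))
```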

4.2.3.2 Modification of Quasi-identifiers

The first strategy is to modify the quasi-identifiers of the combined table. The data user generates a merged table from two anonymization tables as follows: First, the data user simply merges the records in the two tables as \(| q^g_C | s^h_{AB} | s^i_A | s^j_B |\). Then, the data user modifies \(q^g_C\) to satisfy a condition stated in terms of \(\theta \), the total number of sensitive attributes in the merged table.

4.2.3.3 Modification of Sensitive Attributes

The second approach is to modify the sensitive attributes in the combined table so that the condition holds. If a subtable \(| s^h_{AB} | s^i_A | s^j_B |\) consisting of sensitive attributes is required to satisfy k-anonymity, some sensitive attribute values are removed from the table, that is, changed to \(*\), to satisfy k-anonymity. Note that we do not allow all the sensitive attributes of a record to be \(*\), because such a record carries no information.

4.2.3.4 Algorithm for Modification

One algorithm that finds a k-anonymized combined dataset is executed as follows:

  1. The algorithm generalizes quasi-identifiers so that each group of identical quasi-identifiers contains at least \(\pi \times k\) records.

  2. The algorithm generates all the tuples of \(\pi \) sensitive attributes in the table.

  3. For each tuple, the algorithm finds all the records that have the same sensitive attribute values as the tuple, or \(*\) for those attributes, and makes them a group. Let \(\theta \) be the number of sensitive attributes in the group. The algorithm generates a partial table that consists of the remaining \(\theta -\pi \) sensitive attributes and checks whether the partial table has at least k different combinations of sensitive attribute values.

  4. If the partial table does not satisfy the above condition, the algorithm chooses a record from other groups that have different tuples of \(\pi \) sensitive attributes and changes those \(\pi \) sensitive attributes to \(*\). The algorithm repeats this step until the partial table satisfies the condition.

  5. The algorithm executes steps 3 and 4 for all the tuples of \(\pi \) sensitive attributes in the table.

4.2.4 Matrix Factorization for Time-Sequence Data

Some studies have used matrices for time-sequence datasets. Zheng et al. [15, 16] proposed predicting a user's interest in an unvisited location. They modeled users' GPS trajectories as a user-location matrix in which each value indicates the number of visits of a user to a location. The matrix is very sparse because each user visits only a handful of locations, so a collaborative filtering model is applied for the prediction. Zheng et al. [17] built a location-activity matrix, M, which has missing values. M is decomposed into two low-rank matrices U and V. The missing values can be filled by \(X = UV^\mathsf{T} \simeq M\), and locations can be recommended when some activities are given. Chawla et al. [18] constructed a graph from the trajectories of taxis and transformed the graph into matrices. The authors of [19] proposed a method of identifying traffic flows that cause an anomaly between two regions.

4.2.5 Anonymization Techniques for User History Graphs

In this subsection, we introduce two anonymization techniques for user history graphs, which are proposed in [1].

4.2.5.1 Adversary Model

Privacy leakage from a merged history graph is the disclosure of the actions of a particular person from the graph. Attacks against user history graphs are intended to obtain the private information of a particular user from the graph. We assume that the merging process is executed in a trusted domain and that only the merged history graph is published; thus, the adversary can obtain only the merged graph. Furthermore, we assume that the adversary has the following knowledge about the user: the history of the user is included in the merged graph, and the user performed an action t. The adversary tries to discover other actions of the user by guessing which edges connected to node t can be assigned to the user.

We summarize the adversary model as follows:

Adversary against a Merged History Graph. It is assumed that an adversary knows that a victim A executed an action t. The objective of the adversary is to obtain the actions that A executed before or after the action t. Thus, the adversary searches the merged history graph, which includes actions of other people and finds the actions of A using the knowledge that action t was executed.

We define privacy notions to use with the above adversary model in a later subsection.

4.2.5.2 Notions for the Untraceability of a Graph

We consider two levels of privacy notions: partial k-untraceability and complete k-untraceability. Partial k-untraceability accepts the leakage of some partial actions of a user but prevents all the actions of the user from being revealed. The definition of complete k-untraceability involves meeting the requirement that no action of the user is leaked. The symbol \(Act^A_{\mathcal {N}_{x \rightarrow y}}\) for user A denotes the sequence of all the actions of user A from action x to action y. For example, the sequence of actions from the first action to action x and the sequence of actions from action x to the final action are denoted as \(Act^A_{\mathcal {N}_{start \rightarrow x}}\) and \(Act^A_{\mathcal {N}_{x \rightarrow end}}\), respectively.

Definition 4.1

(Partial k-untraceability) We assume that an adversary knows an action t of a user A, and we consider all the possible adversaries defined for any action t of the user in the merged graph. If at least k sequences of actions are potentially associated with user A and \(k-1\) other users exist as candidates for all actions \(Act^A_{\mathcal {N}_{start \rightarrow t}}\) and \(Act^A_{\mathcal {N}_{t \rightarrow end}}\), the digraph satisfies k-untraceability for A. If the digraph satisfies the above condition for all users, then the digraph is said to satisfy partial k-untraceability.

Definition 4.2

(Complete k-untraceability) We assume that an adversary knows an action t of a user A and we consider all the possible adversaries defined for any action t of the user in the merged graph. If at least k actions are potentially associated with user A and \(k-1\) other users exist as candidates for each action in \(Act^A_{\mathcal {N}_{start \rightarrow t}}\) and \(Act^A_{\mathcal {N}_{t \rightarrow end}}\), the digraph satisfies k-untraceability for A. If the digraph satisfies the above condition for all users, the digraph satisfies complete k-untraceability.

Generally, many trivial actions are performed by many users. It is not important for privacy purposes whether we keep the information about such actions. Thus, we relax the above definitions to produce an anonymized graph that retains much of the information needed to analyze a user's history. Let v be the threshold on the number of performing users that establishes that an action is trivial; that is, we judge the action \(x \rightarrow y\) to be trivial if the label \(L(x \rightarrow y) \ge v\). Both definitions are modified as follows:

Definition 4.3

(Partial (k, v)-untraceability) We assume that an adversary knows an action t of a user A, and we consider all the possible adversaries defined for any t in the merged graph. If at least k sequences of actions are potentially associated with user A and \(k-1\) other users exist as candidates for all actions \(Act^A_{\mathcal {N}_{start \rightarrow t}}\) and \(Act^A_{\mathcal {N}_{t \rightarrow end}}\) except trivial actions \(x \rightarrow y\) that have a label \(L(x \rightarrow y) \ge v\), then the digraph satisfies partial (k, v)-untraceability for A. If the digraph satisfies the above condition for all users, then the digraph satisfies partial (k, v)-untraceability.

Definition 4.4

(Complete (k, v)-untraceability) We assume that an adversary knows an action t of a user A, and we consider all the possible adversaries defined for any t in the merged graph. If at least k actions are potentially associated with user A and \(k-1\) other users exist as candidates for each action in \(Act^A_{\mathcal {N}_{start \rightarrow t}}\) and \(Act^A_{\mathcal {N}_{t \rightarrow end}}\) except trivial actions \(x \rightarrow y\) that have a label \(L(x \rightarrow y) \ge v\), then the digraph satisfies complete (k, v)-untraceability for A. If the digraph satisfies the above condition for all users, then the digraph satisfies complete (k, v)-untraceability.

In a complete (k, v)-untraceable graph, each action t except trivial actions has at least k outgoing edges and k incoming edges; thus, an action of user A that connects to action t cannot be identified from among k candidates. The graph therefore satisfies untraceability against an adversary who knows action t of the user. It is trivial that a complete (k, v)-untraceable graph satisfies partial (k, v)-untraceability, because all actions except trivial actions are connected to k potential actions in a complete (k, v)-untraceable graph. A graph that satisfies partial (k, v)-untraceability generally preserves much more information than a complete (k, v)-untraceable graph, where both graphs are generated from the same user history graph. However, a partial (k, v)-untraceable graph may reveal some actions of users due to the relaxed definition of the privacy notion; an attack is considered successful only when an adversary obtains all the actions of a user. To trace all the actions of the user, the adversary has to select the correct sequence of actions from k sequences of actions; thus, the full set of actions of the user remains untraceable, even though some individual actions may be traceable by the adversary. The parameter k means that an action (or a sequence of actions) is potentially associated with a user and \(k-1\) other users in the untraceable graph, and the parameter v means that at least v users perform the same (trivial) action in the graph. Generally, we should select the parameter \(v = k\) with regard to the privacy requirement for a merged graph. The actions of a user are hidden among the actions of a group that consists of k members including the user. A privacy notion for the graph should be selected from the above two notions according to the use case of the graph and its privacy requirements.
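The following Python sketch checks a necessary condition derived from the prose above, namely that every node either has at least k outgoing (resp. incoming) edges or only trivial ones; the edge-dictionary representation and the example graph are our own, and this is not the formal definition.

```python
from collections import defaultdict

def satisfies_complete_kv_condition(edges, k, v):
    """Check that every node has at least k outgoing (resp. incoming) edges
    or only trivial ones, i.e., edges whose label L(x -> y) >= v.
    'edges' maps (x, y) -> label."""
    out_labels, in_labels = defaultdict(list), defaultdict(list)
    for (x, y), label in edges.items():
        out_labels[x].append(label)
        in_labels[y].append(label)
    for node in set(out_labels) | set(in_labels):
        for labels in (out_labels[node], in_labels[node]):
            nontrivial = [l for l in labels if l < v]
            if nontrivial and len(labels) < k:
                return False
    return True

edges = {("login", "search"): 5, ("login", "buy"): 5,
         ("search", "buy"): 2, ("search", "logout"): 2}
print(satisfies_complete_kv_condition(edges, k=2, v=3))  # False: 'logout' fails
```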

4.2.5.3 Algorithm Generating a Partial (k, v)-Untraceable History Graph

The details of the algorithm are denoted as Algorithm 4.1, where \(oe_t\) and \(ie_t\) are defined as the number of outgoing edges and incoming edges of a node t, respectively. The algorithm for generating a partial (k, v)-untraceable history graph is as follows:

  1. This step corresponds to lines 1 to 3 of the detailed algorithm. For the input of a user history graph G, the algorithm adds a virtual incoming edge \((s_r \rightarrow r)\) to each node \(r \in start\) until the number of incoming edges equals the number of outgoing edges. Then, the algorithm adds a virtual outgoing edge \((q \rightarrow u_q)\) to each node \(q \in end\) until the number of outgoing edges equals the number of incoming edges. The label of a virtual incoming edge \(L(s_x \rightarrow x)\) denotes the number of users who perform the action first, and the label of a virtual outgoing edge \(L(y \rightarrow u_y)\) denotes the number of users who perform the action last.

  2. This step corresponds to lines 4 to 12 of the detailed algorithm. The algorithm searches for a node t that has fewer than k outgoing edges and for which all its lower nodes \(\mathcal {N}_{t \rightarrow end \setminus t}\) have fewer than k outgoing edges. Then, the algorithm removes all the outgoing edges \((t \rightarrow *)\) that satisfy \(L(t \rightarrow *) < v\). Next, the algorithm searches for a node \(t'\) that receives fewer than k incoming edges and for which all its upper nodes \(\mathcal {N}_{start \rightarrow t' \setminus t'}\) receive fewer than k incoming edges. Then, the algorithm removes all the incoming edges \((* \rightarrow t')\) that satisfy \(L(* \rightarrow t') <v\). The algorithm repeats this step until no node that meets the conditions is found.

  3. This step corresponds to lines 13, 14, and 15 of the detailed algorithm. The algorithm removes the virtual incoming and outgoing edges, removes nodes that have no edges, and outputs the modified graph.

Algorithm 4.1 Generating a partial (k, v)-untraceable history graph (pseudocode figure)

4.2.5.4 Algorithm Generating a Complete (k, v)-Untraceable History Graph

The details of the algorithm are denoted as Algorithm 4.2. The algorithm for generating a complete (k, v)-untraceable history graph is as follows:

  1. The algorithm first executes Algorithm 4.1 except line 13 and line 15.

  2. This step corresponds to lines 3 to 11 of the detailed algorithm. The algorithm searches for a node t that has fewer than k outgoing edges and removes all the outgoing edges \((t \rightarrow *)\) that satisfy \(L(t \rightarrow *) < v\), until no such node is found. Then, the algorithm searches for a node \(t'\) that receives fewer than k incoming edges and removes all the edges \((* \rightarrow t')\) that satisfy \(L(* \rightarrow t') < v\). The algorithm repeats this step until no node that meets the conditions is found. (A code sketch of this pruning loop is given after this list.)

  3. This step corresponds to lines 12, 13, and 14 of the detailed algorithm. The algorithm removes the virtual edges, removes nodes to which no edge is connected, and outputs the modified graph.
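The following Python sketch captures only the core pruning loop of step 2; the virtual start/end edges of Algorithm 4.1 are omitted, the outgoing and incoming passes are merged into a single removal condition, and the edge representation is our own.

```python
from collections import defaultdict

def prune_to_complete_kv(edges, k, v):
    """While some endpoint of a non-trivial edge (label < v) has fewer than k
    outgoing resp. incoming edges, remove that edge. 'edges' maps (x, y) -> label."""
    E = dict(edges)
    changed = True
    while changed:
        changed = False
        out_deg, in_deg = defaultdict(int), defaultdict(int)
        for (x, y) in E:
            out_deg[x] += 1
            in_deg[y] += 1
        for (x, y), label in list(E.items()):
            if label < v and (out_deg[x] < k or in_deg[y] < k):
                del E[(x, y)]
                changed = True
    return E  # nodes left without any edge simply disappear from the result

edges = {("login", "search"): 5, ("login", "buy"): 5,
         ("search", "buy"): 2, ("search", "logout"): 2}
print(prune_to_complete_kv(edges, k=2, v=3))
# {('login', 'search'): 5, ('login', 'buy'): 5}
```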

4.2.6 Other Notions

Differential privacy [20, 21] is a notion of privacy for perturbative methods based on the statistical distance between two database tables differing by at most one element. The basic idea is that, regardless of background knowledge, an adversary with access to the dataset draws the same conclusions whether or not a person's data are included in the dataset. That is, a person's data have an insignificant effect on the processing of a query. Differential privacy is mainly studied in relation to perturbation methods [22,23,24] in an interactive setting. Attempts to apply differential privacy to search queries have been discussed in [25]. Li et al. proposed a matrix mechanism [26] applicable to predicate counting queries under a differential privacy setting. Computational relaxations of differential privacy were discussed in [27,28,29]. Another approach for quantifying privacy leakage is an information-theoretic definition proposed by Clarkson and Schneider [30]. They modeled an anonymizer as a program that receives two inputs: a user's query and a database response to the query. The program acts as a noisy communication channel and produces an anonymized response as the output. Hsu et al. provide a generalized notion [31] in decision theory for modeling the value of personal information. An alternative model for the quantification of personal information is proposed in [32]; in that model, the value of personal information is estimated by the expected cost that the user has to pay to obtain perfect knowledge from given privacy information. Furthermore, the sensitivity of different attribute values is taken into account in the average benefit and cost models proposed by Chiang et al. [33]. Krause and Horvitz presented utility-privacy tradeoffs in online services [34, 35].

4.2.7 Combination of Anonymization Techniques

A combination of anonymization methods leads to the construction of datasets that are useful and that preserve privacy. Some countries publish census data and combine several anonymization methods, such as generalization, noise addition, and sampling [36, 37]. However, some problems remain. One problem is that it is difficult to evaluate the privacy risks of anonymized datasets when anonymization methods are combined. Some research is available on the relationships among anonymization methods. Chaudhuri et al. proposed \((c, \epsilon , \delta )\)-privacy [38] and studied the relationship between sampling and differential privacy [39]. Li et al. proposed \((\beta , \epsilon , \delta )\)-differential privacy and studied the relationship among sampling, differential privacy, and k-anonymity. Soria-Comas et al. proposed a k-anonymization algorithm for differential privacy using an insensitive algorithm [40].

4.3 (p, N)-Identifiability

4.3.1 Common Adversary Model

Existing privacy measures assume protection against idealized attackers, which makes it difficult to maintain data utility while assessing the reidentification risk. We designed adversary models that describe more realistic attackers by structuring the real setting in which the attackers operate. In the case of exchanging anonymized datasets between companies, for instance, a data-providing company first anonymizes and encrypts datasets for transmission to a receiver company via a secure channel. The receiver company places the dataset in a secure room and allows only authorized employees to access the anonymized dataset. This process reduces the reidentification risk for the anonymized dataset; it also specifies who the attacker can be and limits the attacker's ability to access datasets, so that the attacker knows at most the quasi-identifiers of neighbors or acquaintances. For example, it seems quite rare for an attacker to know all the quasi-identifiers of a target merely because the target is a neighbor of the attacker. Thus, a more stringent analysis of the reidentification risk can be achieved when we assume a more realistic situation, such as the attacker having only limited knowledge of the victim.

Access rights to an anonymized dataset may be given to attackers, and attackers may acquire some information about the original dataset or obtain the anonymization algorithm used to generate the anonymized dataset. Information about the original dataset can be categorized into three parts: information on a specific record, such as that of a neighbor; the original dataset itself; and any other information except the target information that the attacker is seeking. The case of William Weld, who was governor of Massachusetts [41], is a typical example of reidentification, and an attack on the Netflix Prize dataset was carried out by a strong attacker who gained access to the Internet Movie Database [42].

We can consider the abilities of an attacker in two areas: knowledge about the dataset and the ability to simulate anonymization algorithms. Many previous studies, such as [43, 44], assumed that an attacker has all the information required except knowledge of the target of the attack. In this chapter, we consider an attacker who has knowledge of only the target record and can simulate anonymization algorithms to obtain anonymized records that may correspond to the target record.

4.3.1.1 Definitions of Actual Attackers

Generally, when an anonymized dataset is published on the Web, anyone who can access the dataset is a potential attacker; thus, the adversary model must be the idealized one, because we cannot assume that only a limited-knowledge adversary exists and have to consider all possible adversaries. On the other hand, when the dataset is managed under strict controls, the adversary need not be modeled as an unlimited-knowledge adversary. We design two realistic adversary models under the assumption that the dataset is managed in a restricted (nonpublic) area and only a limited set of attackers can access it; we then propose a privacy metric for privacy risk analysis.

Definition 4.5

(Anonymization Simulator \(f_{sim}\)) Let \(D_0\) with \(n_0\) records, \(D_1\) with \(n_1\) records, \(r^x_i[QI]\), and \(r^x_i[SI]\) be an original dataset, an anonymized dataset generated from the original dataset, the quasi-identifiers of a record \(r^x_i \in D_x\), and sensitive information from the record \(r^x_i \in D_x\), respectively. An anonymization simulator \(f_{sim}\) simulates an anonymization algorithm used to generate an anonymized dataset as an oracle and outputs \(r^1_i[QI] \in D_1\) for the input \(r^0_i[QI] \in D_0\). That is, \(f_{sim}: r^0_j[QI] \rightarrow \left\{ \mathbf{r}^1[QI], \bot \right\} \), where \(\mathbf{r}^1[QI]\) is a set of \(r^1_i[QI]\) and no output is produced in the case of \(\bot \).

The simulator is a deterministic process for deterministic anonymization, such as top-coding and bottom-coding, and a probabilistic process for probabilistic anonymization, such as random sampling. The simulator may access \(D_0\) in order to simulate the anonymization algorithm, even though no adversary can access \(D_0\). Next, we define two adversary models.

Definition 4.6

(Deanonymizer for Anonymized Datasets, \(\mathcal {DA}\)) When \(\exists _1 r_j^0[QI] \in D_0\), \(\forall r^1_i[QI||SI] \in D_1\) and \(f_{sim}\) are given, a deanonymizer \(\mathcal {DA}\) lines up potential candidates \(r^1_i\) corresponding to \(r^0_j\) by executing the simulator \(f_{sim}\); then, the deanonymizer \(\mathcal {DA}\) outputs a list of candidates \(r^1_i[QI||SI]\) for \(r^0_j\), where the number of records in the list is \(n_q\), the number of sensitive information items in the list is \(n_s\) and \(0 \le n_s \le n_q \le n_0\).

If an attacker knows the actual anonymization function f, the attacker can use f as \(f_{sim}\), and the evaluation result should be more credible.

Definition 4.7

(Reidentifying Adversary versus Anonymized Datasets) When \(\exists _1 r_j^0[QI] \in D_0\), \(\forall r^1_i[QI||SI] \in D_1\) and \(f_{sim}\) are given, a reidentifying adversary executes the deanonymizer \(\mathcal {DA}\) and identifies \(r^1_i\), the record of the same person as the record \(r^0_j\), from among the records in the dataset \(D_1\), where \(r^0_j \in D_0\) is given. The success probability of the attack is calculated as \(1/n_q\) when \(r^1_j\) is included in the output of \(\mathcal {DA}\); otherwise, it is 0.

Assuming an attacker who has \(\exists _1 r_j^0[QI] \in D_0\) is the same as assuming \(|D_0|\) attackers who have \(r^0_j(j=1,...,|D_0|) \in D_0\).

Definition 4.8

(Revealing Adversary versus Anonymized Datasets) When \(\exists _1 r_j^0[QI] \in D_0\), \(\forall r^1_i[QI||SI] \in D_1\) and \(f_{sim}\) are given, a revealing adversary executes the deanonymizer \(\mathcal {DA}\) and finds a \(r^0_j[SI]\) from \(r^1_i[SI]\) such that \(r^1_i\) is a record of the same person as the record \(r^0_j\). The success probability of the attack is calculated as \(1/n_s\) when \(r^1_j\) is included in the output of \(\mathcal {DA}\); otherwise, it is zero.

A revealing adversary does not try to identify the record but tries to access sensitive information. In other words, the attacker seeks only to obtain sensitive information about the person behind the record in question. More precisely, the success probability of the revealing adversary can be calculated as \([n_s]/n_q\), where \([n_s]\) is the number of correct sensitive items in the list, but the probability itself may be uncertain. For example, when the probability is 0.99, some attackers will be convinced that the target takes the majority value. Furthermore, in the case that the deanonymizer \(\mathcal {DA}\) is leaked and the \(f_{sim}\) used in the deanonymizer is a deterministic process, an attacker can infer the sensitive information of \(r_j^0\). On the other hand, when the \(f_{sim}\) used in the deanonymizer is a probabilistic process, the output cannot be determined with certainty even if \(\mathcal {DA}\) is leaked.

4.3.1.2 (p, N)-Identifiability

Here, we assume that anonymized datasets are strictly controlled and that the attacker has knowledge of a specific record and the anonymization algorithms. We assume that the attacker is the strongest type of attacker and has knowledge of the most characteristic record. Nevertheless, it is difficult to quantify this characteristic, so we assume that each attacker has an original record. In other words, we assume there are as many attackers as there are original records.

Definition 4.9

((p, N)-identifiability) Let p be the success probability for an adversary who has \(\exists _1 r^0[QI] \in D_0\), \(\forall r^1_i[QI||SI] \in D_1\) and \(f_{sim}\), and N be the number of adversaries whose attack success probability is p.

The probability p is the conditional probability that the adversary can select the correct record from the list produced by the deanonymizer \(\mathcal {DA}\), given that the correct record is included in the list. The probability that the deanonymizer successfully produces a list including the correct record depends on the anonymization algorithms.

Our model can be extended to an adversary who has knowledge of two or more records. For simplicity, we use an adversary model that knows a single record and consider N single-knowledge adversaries in our risk analysis. The idea of (p, N)-identifiability is studied in [2].
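As a concrete illustration (our own formulation, not taken from the source), the metric can be computed from the candidate lists produced by the deanonymizer as follows; the data layout and example values are hypothetical.

```python
from fractions import Fraction

def p_n_identifiability(candidate_lists, true_ids):
    """One reidentifying adversary per original record. Adversary j succeeds
    with probability 1/n_q if the true anonymized record of j is among its n_q
    candidates, and 0 otherwise. p is the largest such probability and N the
    number of adversaries attaining it. 'candidate_lists[j]' is the candidate
    id list output by the deanonymizer DA for original record j; 'true_ids[j]'
    is the id of j's anonymized record (None if it was not released)."""
    probs = []
    for cands, true_id in zip(candidate_lists, true_ids):
        if true_id is not None and true_id in cands:
            probs.append(Fraction(1, len(cands)))
        else:
            probs.append(Fraction(0))
    p = max(probs)
    N = sum(1 for q in probs if q == p and p > 0)
    return p, N

# Toy example: record 0 is uniquely reidentified, record 2 was sampled out.
print(p_n_identifiability([[10], [10, 11], [12, 13]], [10, 11, None]))
# (Fraction(1, 1), 1)
```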

4.3.2 Success Probability Analysis Based on the Common Adversary Model

In this section, we assume the attackers described in the previous section and explain the calculation to obtain the success probability of attacks on representative anonymization methods: generalization, noise addition, and sampling. We consider that \(f_{sim}\) is constructed as a typical combined algorithm selected from three anonymization algorithms, \(f_{generalization}, f_{sampling}\) and \(f_{noise}\). We explain the above three anonymization algorithms and show combined anonymization using an example dataset.

4.3.2.1 Generalization

We include deletion of records or cells and top- or bottom-coding as steps in generalization. One step of \(f_{generalization}\) is similar to k-anonymity in that it checks the number of identical combinations of quasi-identifiers. When an anonymized dataset has k-anonymity, p equals 1/k. k-anonymity is an intuitive privacy metric, but the greater the number of attributes, the more difficult it is for a dataset to achieve k-anonymity. If an attacker has generalization trees for each attribute, the attacker adds the records that satisfy the requirements of the trees to the list of candidates. When there is a record whose address attribute is Tokyo, for instance, an attacker who has the generalization tree adds records whose addresses are in the Kanto region as well as records whose addresses are in Eastern Japan to the list of candidates. It is reasonable to assume that an attacker can infer the generalization tree; in our experiment, \(f_{sim}\) is therefore considered capable of accessing the generalization trees of each attribute.

4.3.2.2 Random Sampling

When an attacker who has one original record is assumed, the privacy risk differs greatly among original datasets. Consider an original dataset with many unique records, and assume that random sampling is implemented. Let M be the number of unique records and \(\alpha \) be the sampling rate. The probability that no unique record appears in the sample is \((1-\alpha )^M\). Even when \(\alpha =0.1\) and \(M=44\), this probability is less than \(1 \%\). When a large dataset is anonymized, it is quite possible that there will be more than 44 unique records, which shows that even if sampling is implemented, a characteristic record may be identified or suspected.
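As a quick numerical check (our own arithmetic, using the values \(\alpha =0.1\) and \(M=44\) above):

$$\begin{aligned} (1-\alpha )^M = 0.9^{44} \approx 0.0097 < 0.01 . \end{aligned}$$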

We evaluate sampling as follows: For simplicity, we consider the case where the anonymization method is only random sampling. When a unique record is sampled, an attacker who knows the person is certain that the record is for that person. Thus, the probability p does not change. On the other hand, sampling reduces the number of unique records, and N decreases accordingly. When unique records are very few and do not appear in an anonymized dataset, p decreases. We apply this approach to the case of combining different anonymization methods.

The approaches to sampling vary, and we can also consider \(f_{sampling}\) in various ways. For instance, the probability of disclosing the identity of any individual is evaluated by using the posterior probability of population uniqueness [45].

4.3.2.3 Noise Addition

There are two cases of noise addition: One is adding noise to the numerical data itself, and the other is adding noise to its quantity. In the former case, the data consist of original numerical data or data anonymized by a process, such as microaggregation, and in the latter case, the data are original quantity data or anonymized data, such as 11–20 in the age attribute.

In the former case, we can consider \(f_{noise}\) as follows. Noise is added based on a probability distribution, such as normal, Laplace, and exponential distributions. In particular, it has been mathematically proven that adding Laplace noise to the output of some queries achieves differential privacy [39], so this type of noise is widely used. Therefore, when an anonymized record is included in the 90 or \(95 \%\) confidence interval, the record is added to the list of candidates. More simply, when original data and anonymized data have small differences such as 10 or \(20 \%\) for each attribute, the attacker may consider the possibility that they are the same.

In the latter case, we cannot use the same method. When a record has the value 72 and it is anonymized to 95, for instance, an attacker whose target is a specific person may not regard that record as belonging to the target. However, the attacker can link them after top-coding is executed and the value is changed to 70-. On the other hand, when a record with the value 19 is anonymized to 20 and then generalized to 20–29, the attacker may not link them. One approach for \(f_{noise}\) is to assume that the group of each attribute can change to a neighboring group and to output such records as candidates. As in the generalization step, an attacker can infer the neighboring group of each group, and \(f_{noise}\) can be thought of as defining a distance between the classifications.

The description above shows that when the order of anonymization is changed, \(f_{sim}\) will also be changed.

4.3.2.4 Combination of Anonymization Methods

The principles of each anonymization method can be combined by evaluating each anonymization step by step. Stated differently, an attacker has \(f_{generalization}, f_{sampling}\), and \(f_{noise}\) as \(f_{sim}\). We show examples of combined cases using a sample dataset (Fig. 4.1). An attacker should change his or her approach when the order of anonymization is changed, if he or she knows this fact. We assume five attacker models, \(A_1\) to \(A_5\), in the following example, and the candidates of each attacker model are represented as \(C_1\) to \(C_5\). We denote by \(C_i\) of \(r_j\) in the following figures the candidates of an attacker \(A_i\) who has \(r_j\) as a target. The adversary model for \(A_1\) to \(A_4\) is the reidentifying adversary defined in Definition 4.7, and the adversary model in Fig. 4.4 is the revealing adversary defined in Definition 4.8.

Fig. 4.1 Sample dataset

Let the conditions of attackers be as follows: \(A_1\) and \(A_3\) do not consider noise-adding and generalization but simply compare \(r^1_i \in D_1\) with \(r^0_j \in D_0\). This is one approach to \(f_{noise}\) and \(f_{generalization}\). On the other hand, \(A_2\), \(A_4\), and \(A_5\) do consider the added noise and generalization. We define the noise addition shown in Fig. 4.2 as follows: the classifications of each attribute change to the next classification with a certain probability. We assume \(A_2\) knows the rule of noise addition and that \(f_{noise}\) of \(A_2\) outputs candidates that have a different classification in one attribute from an original record. On the other hand, let a small amount of noise be added in step (a) of Figs. 4.3 and 4.4. We assume the attackers \(A_4\) and \(A_5\) know the rule and that \(f_{noise}\) of \(A_4\) and \(A_5\) outputs candidates whose values of \(ATTR_1\) are different but within 2 from the original record and whose values of \(ATTR_2\) are different but within 4 from the original record. In the figures, the boldface sections show that the classifications are not correct but are within the permissible range for \(f_{noise}\) of \(A_2, A_4\), and \(A_5\): The red boldface sections show that there are substantial distances from the original values and that attackers who have the record cannot link them.

Fig. 4.2 Sample anonymization and the result of simulation attack 1

Fig. 4.3 Sample anonymization and the result of simulation attack 2

Fig. 4.4 Sample anonymization and the result of simulation attack 3

4.3.2.5 Examples of Analyses

The Case of \(A_1\)

Generalization, noise addition, and sampling are executed as anonymization methods in Fig. 4.2. In the generalization step (a), all records are generalized so that the value range of each attribute is divided into equal parts. As a result, only \(r_2\) is unique, and this dataset has (1, 1)-identifiability.

In step (b), \(r_1, r_4\), and \(r_6\) are changed by the addition of noise. As a result, \(r_1\) and \(r_2\) are indistinguishable. \(r_3, r_4\), and \(r_7\) are also indistinguishable, but \(r_5\) and \(r_6\) become unique. We define \(A_1\) as not considering the addition of noise, so that an attacker who has \(r_6\) cannot link the original record but an attacker who has \(r_5\) can. Therefore, identifiability becomes (1, 1)-identifiability.

After sampling, in step (c), \(r_2, r_4\), and \(r_5\) do not appear. Then, \(r_3\) and \(r_7\) become the focus, and identifiability becomes (1/2, 2)-identifiability. This attacker simply checks how many identical records there are in the dataset. Even if various anonymization methods are implemented, some records may not be affected. Therefore, it is important to assume such attackers. When we can say that a dataset has a certain level of privacy against such attackers, it means that an attacker cannot link the target with the original record by accident.

The Case of \(A_2\)

We omit the explanation of step (a) because noise is not added. In step (b), the attacker with \(r_1\), for example, chooses \(r_1, r_2, r_5\), and \(r_6\) as candidates because one or more of their attributes match \(r_1=\{\)-30, 175-}. On the other hand, an attacker with \(r_4\) cannot output candidates because both attributes of \(r_4\) are changed. Hence, identifiability is (1/4, 2)-identifiability. In step (c), \(r_5\) does not appear, and identifiability becomes (1/4, 1)-identifiability.

The Case of \(A_3\)

In Fig. 4.3, the dataset is anonymized by the addition of noise, generalization, and sampling.

In the case of \(A_3\), the dataset with added noise is safe enough from attackers who do not consider the added noise, and we omit this case; however, this does not mean that noise addition is safe, and when another attacker, such as \(A_4\), is considered, the result should be different. In step (b), we focus on the attacker with \(r_3\). This is the strongest attacker, and this attacker suspects that \(r_2\) and \(r_3\) are the candidates. More specifically, the scope is \(r_3 = \left\{ 38, 165\right\} = \{31\)-, -\(174\}\), and \(r_2, r_3\) meet the requirement. The attacker with \(r_2\) seems to have the same risk but cannot identify the actual target, because the noise added to \(ATTR_2\) is large enough that \(r_2\) is not a possible candidate. Hence, the identifiability becomes (1/2, 1)-identifiability. In step (c), \(r_3\) does not appear, and the privacy risk is (1/3, 1)-identifiability.

The Case of \(A_4\)

Next, we show the case of \(A_4\). In step (a), every record but \(r_1\) and \(r_7\) has enough added noise, and attackers cannot infer which is the correct record. The attacker with \(r_7\) regards the records within \(\left\{ 33 \pm 2, 173 \pm 4\right\} \) as candidates. Only \(r_7\) satisfies the condition, and the privacy risk is (1, 1)-identifiability. In step (b), the effect of noise addition becomes weak, and the number of attackers who should be considered increases. The attacker with \(r_6\), for instance, regards the records within \(\left\{ 29 \pm 2, 171 \pm 4\right\} = \{(\)-30, 31-), (-174, 175-\()\}\), namely, all records, as candidates. The privacy risk becomes (1/2, 1)-identifiability after generalization is finished. In step (c), similar to the previous steps, the privacy risk becomes (1/3, 1)-identifiability.

The Case of \(A_5\)

Finally, we show an example of a revealing adversary.

An attacker can claim success when the sensitive information \(ATTR_S\) of the target can be correctly identified. Step (a) is similar to that of the case of \(A_4\). In step (b), the attacker with \(r_3\) suspects that \(r_2\) and \(r_3\) are the candidates. Their \(ATTR_S\) values are, however, both "Office," so the attacker can claim to have identified the person's sensitive value. Thus, the privacy risk is \((2/2=1,1)\)-identifiability, which is similar to l-diversity. In step (c), the attacker with \(r_1\) suspects that \(r_1, r_4\) and \(r_6\) are the candidates; the \(ATTR_S\) of \(r_1\) is "Hospital," and that of the others is "Shop." Therefore, the probability of revealing the sensitive value is 1/2. More precisely, the probability is 1/3 because there are three candidates and one is correct, but the probability may be important information for the attacker with \(r_1\). The same can be said of the attacker with \(r_7\); therefore, the risk according to our definition is (1/2, 2)-identifiability.

As described above, when the adversary model is different, the resulting risk is also different. When we assume attackers who disregard noise, we capture the risk to records whose fluctuation due to anonymization is small. On the other hand, when we assume attackers who do consider the actual added noise, we capture the risk to the dataset as a whole. Moreover, strong attackers can be assumed to use the inverse function of the actual noise or anonymization method. In the case that noise based on a normal distribution is added, for instance, an optimal distance-based record linkage can be performed [46].

It is important to consider the various types of attackers in this way, because the most important factor of privacy is the inability to definitely link an anonymized record \(X'\) and original record X. Our metrics ensure that the attackers considered can neither identify a record nor make an identification by chance, by considering many attackers.

4.3.2.6 Implementation of the Analysis Algorithm

Processing time is a problem when our metric is applied to a large dataset. In this section, we discuss this problem.

First, we have to evaluate the risk from attackers with each record, and when sampling is implemented, the candidates in each record need to be preserved across the sampling. However, we do not need to store the candidates for every record or the records that have certain risks because the metric does not consider attackers who have knowledge of a record that does not have the highest risk. Moreover, when anonymization and evaluation are performed repeatedly, it takes a long time to evaluate the risk because the same number of attackers as the number of records are assumed. Thus, a threshold risk can be introduced to resolve the problem. When the risk of an attack does not exceed the threshold, attackers do not need to be evaluated. It is possible, however, that the risk may increase depending on the situation (see \(r_5, r_6\) in Fig. 4.2). Therefore, when a threshold is introduced, the accuracy of the privacy risk may worsen. We describe the pseudocode of risk analysis as follows:

Pseudocode of the risk analysis algorithm (figure)
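Since the pseudocode itself is given only as a figure, the following Python sketch reproduces the loop described above under our own assumptions: one attacker per original record, a simulator \(f_{sim}\) that returns candidate indices, and the convention that the anonymized record of original record j, if released, keeps index j.

```python
def analyze_risk(original_records, anonymized_records, f_sim, threshold=0.0):
    """One attacker per original record; each attacker runs f_sim to obtain
    the indices of candidate anonymized records and succeeds with probability
    1/n_q if its true record is among the n_q candidates. Attackers whose
    probability does not exceed 'threshold' are skipped, as discussed above."""
    best_p, n_best = 0.0, 0
    for j, r0 in enumerate(original_records):
        candidates = f_sim(r0, anonymized_records)
        if not candidates or j not in candidates:  # true record missed or sampled out
            continue
        p = 1.0 / len(candidates)
        if p <= threshold:
            continue
        if p > best_p:
            best_p, n_best = p, 1
        elif p == best_p:
            n_best += 1
    return best_p, n_best  # (p, N)-identifiability

# Example f_sim: candidates are released records within +/-5% on every attribute.
def f_sim(r0, released):
    return [i for i, r1 in enumerate(released)
            if all(abs(a - b) <= 0.05 * abs(a) for a, b in zip(r0, r1))]

original = [[38.0, 165.0], [29.0, 171.0], [45.0, 160.0]]
released = [[38.5, 166.0], [29.5, 170.0], [45.5, 161.0]]
print(analyze_risk(original, released, f_sim))  # (1.0, 3)
```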

Second, the attackers do not have to compare their records with every record, because the method of evaluation is similar to that of k-anonymity and the attackers only need to compare their records with a representative of each group. For instance, the attackers need to compare their records only with \(\{\)-30, 175-\(\},\{31\)-,-\(174\}\), and \(\{31\)-, 175-\(\}\) in (b) of Fig. 4.3. However, when the levels of generalization differ, such a method cannot be applied, and every record should be checked. To mitigate this problem, we first count the number of distinct values of each attribute and then compare each attribute of \(r^0_j\) with that of each record of \(D_1\), starting from the attributes with the largest numbers of distinct values.

Finally, when the procedure for anonymization is known in advance, it is possible to perform the evaluation more quickly by considering the effect of the initial part of the anonymization. For instance, in Fig. 4.3a, we only have to consider cells whose values do not exceed 30 in \(ATTR_1\) or fall short of 174 in \(ATTR_2\).

4.3.3 Experiment

4.3.3.1 Experimental Environments

We conducted experiments to evaluate the validity of the proposed metrics. We measured the time required to output the risk and confirmed that the privacy metric is appropriate. We used three parameters, \(k, \beta , \epsilon \), for comparison and verified the relationships among k-anonymity, sampling, and noise addition. We implemented our risk analysis method on a PC with an Intel Core i7-4790 3.6-GHz CPU and 16.0 GB of memory.

Fig. 4.5 Distribution of TC

Fig. 4.6 Distribution of HbA1c

4.3.3.2 Dataset and Adversary Model

We used a pseudomedical dataset based on an actual medical dataset. The dataset had 10,000 records and two attributes, total cholesterol (TC) and HbA1c; the distribution of each attribute is shown in Figs. 4.5 and 4.6. We first measured the computation time while changing the number of records and then evaluated the validity of our metrics while changing the parameters of each anonymization method. Noise addition, generalization, and sampling were used as representative anonymization methods, and we adopted the Mondrian algorithm [9] for k-anonymization, Laplace noise for noise addition, and random sampling for sampling. We assumed the reidentifying adversaries \(A_1\) to \(A_4\). The conditions of the attacker models are the same as those of Sect. 4.3.2.4 except for noise addition: we define \(f_{noise}\) of \(A_2\) and \(A_4\) to output, as candidates, records whose value for each attribute differs by at most \(5 \%\) from the original value.

4.3.4 Results

4.3.4.1 Computational Complexity

Our proposed privacy metrics are intended to be applicable to large datasets. We measured the execution time while changing the number of records (Table 4.1) and the parameters (Tables 4.2, 4.3 and 4.4).

Table 4.1 Execution time

It takes little time to evaluate the risk when simple attackers, such as \(A_1\) and \(A_3\), are considered. On the other hand, when reflective attackers are assumed, the number of calculations increases and more time is required for evaluation. However, some of the processing described above reduces the time. For instance, the number of combinations of attributes increases with increasing numbers of records, and once an attacker has checked the risk of a record, that attacker does not have to calculate the risk of other records that have the same values. Therefore, the analysis algorithm is appropriate for large datasets.

Table 4.2 The case of \(\epsilon =0.5, k=2\)

When the sampling rate is changed, the computation time differs depending on the attacker. This is because there are two loop processes, one for sampled records and one for nonsampled records, and the calculation methods of each process differ depending on the attacker.

Table 4.3 The case of \(\beta =0.05, k=2\)

Noise addition has little effect on computation time in this experiment, but when a very large amount of noise is added, the distribution of the records becomes more uniform and the number of distinct records increases; as a result, the computation time may increase.

Table 4.4 The case of \(\beta =0.05,\epsilon =0.5\)

The effect of k-anonymity also seems minimal, but when k is large the number of different types of records decreases and the computation time may decrease.

4.3.4.2 Validation

We observed p and N while changing the sampling rate \(\beta \) and the noise parameter \(\epsilon \) to verify the validity of our metrics. We evaluated the attacker model \(A_4\) while changing the parameters \(k, \beta \), and \(\epsilon \). The evaluation results are shown below (Tables 4.5 and 4.6).

Table 4.5 Relationship among parameters and our metrics (p, N)
Table 4.6 Relationship among parameters and our metrics (p, N)
Table 4.7 Case of \(\beta =0.05,\epsilon =1.0\)

The privacy risk decreases as k increases and as \(\beta \) and \(\epsilon \) decrease, which shows that the proposed measure behaves as a valid privacy metric. The sampling rate is the key factor that reduces the risk in this experiment. There are some outliers in the datasets, and they are the cause of the risk: if such records are not sampled, the privacy risk decreases. We conducted this experiment multiple times, and the result was different each time. Table 4.7 presents a sample of the evaluation results. Some outliers were included in the third run, and the risk was higher than in the other runs. Therefore, the key factor may change when outliers are removed in advance.

4.4 Extension to Time-Sequence Data

4.4.1 Privacy Definition

We define two types of attack models for time-sequence datasets. The first, a reidentification attack, is a general attack model where an attacker has information on the original dataset M and tries to reidentify it in an anonymized dataset A(M). This model assumes that an attacker has maximal information about the original dataset. This model is the same as that of k-anonymization, where even if an attacker has an original dataset, the probability of the reidentification of a k-anonymized dataset is 1/k.

Definition 4.10

(Reidentification attack) Let an attacker have a matrix \(M_{t_1} \in \mathbb {R}^{n \times m}\) and an anonymized matrix \(A(M_{t_1}) \in \mathbb {R}^{n \times m}\). A reidentification attack against a record \(r_i\) succeeds if record \(r_{i} \in M_{t_1}\) is linked to record \(r'_{j} \in A(M_{t_1})\), where \(r_i\) and \(r'_j\) are the same user.

A linkage attack, which is an attack by a valid user, is one in which an attacker tries to obtain information from the given datasets \(A(M_{t_1})\) and \(A(M_{t_2})\). \(A(M_{t_1})\) and \(A(M_{t_2})\) are assumed to include the same users, but the primary keys are different. An attacker in this model has only anonymized datasets, so a valid user is assumed to be an attacker in this model. There are few studies concerning this problem, and we evaluate the risk using actual datasets in this chapter.

Definition 4.11

(Linkage attack) Let an attacker have two anonymized matrices, \(A(M_{t_1}) \in \mathbb {R}^{n \times m}\) and \(A(M_{t_2}) \in \mathbb {R}^{n \times m}\). \(M_{t_1}\) and \(M_{t_2}\) include the same users and items, where each user and item of \(M_{t_2}\) are the same as those of \(M_{t_1}\). A linkage attack against a record \(r_i\) succeeds if record \(r'_{i} \in A(M_{t_1})\) is linked to record \(r''_{j} \in A(M_{t_2})\), where \(r'_i\) and \(r''_j\) are the same user.

We next define the privacy metric as follows:

Definition 4.12

(Privacy metric) Let n be the total number of users of a dataset M and \(n'\) be the number of users that are successfully attacked. The privacy risk of M is defined as \(\frac{n'}{n}\).

We treat both attacks as instances of an assignment problem. An assignment problem is to find an appropriate assignment of tasks when there are n users and n tasks, and the Hungarian algorithm [47] solves the assignment problem in such a way that the total cost is minimal.

We apply the same algorithm to both reidentification and linkage attacks and assume that when an attacker assigns a record to the correct user, the attack succeeds. When a dataset is k-anonymized, there are at least \(k-1\) identical records. Hence, when a record is assigned to the cluster to which the correct record belongs, we regard the record as being assigned correctly even if the assigned record is not actually correct. Furthermore, we multiply the resulting ratio by 1/k to define the privacy metric, because a correctly assigned cluster identifies the correct record only with probability 1/k (Fig. 4.7).
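A minimal sketch of this evaluation using SciPy's Hungarian-algorithm solver is given below; the Euclidean cost, the cluster mapping, and the toy data are our own illustrative choices.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment  # Hungarian algorithm

def assignment_risk(M, A, cluster_of=None, k=1):
    """Assign every original record in M to an anonymized record in A by
    solving the assignment problem on Euclidean distances. An attack on user i
    succeeds if the assigned record is i itself or, for a k-anonymized dataset,
    lies in the same cluster as record i ('cluster_of' maps a record index,
    identical for M and A, to its cluster id). Risk = (successes / n) * (1 / k)."""
    M, A = np.asarray(M, float), np.asarray(A, float)
    cost = np.linalg.norm(M[:, None, :] - A[None, :, :], axis=2)
    rows, cols = linear_sum_assignment(cost)
    if cluster_of is None:
        successes = int(np.sum(rows == cols))
    else:
        successes = sum(cluster_of[i] == cluster_of[j] for i, j in zip(rows, cols))
    return successes / len(M) * (1.0 / k)

# Toy reidentification attack against a noise-added dataset (k = 1).
M = [[38, 165], [29, 171], [45, 160]]
A = [[37, 166], [30, 170], [46, 161]]
print(assignment_risk(M, A))  # 1.0: every record is re-assigned to the right user
```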

Fig. 4.7 Example of a risk evaluation

Figure 4.7 shows an example of a risk evaluation. The dataset on the left is the original dataset, and that on the right is the anonymized dataset. The arrows indicate the assignment result. User 2 of the original dataset, for instance, is assigned to user 3 of the anonymized dataset, so the attack on user 2 fails. When noise addition is used as the anonymization method, users 2, 3, 4, and 5 are assigned to the wrong users, and the privacy risk is 3/7. On the other hand, when k-anonymization with \(k=2\) is used, users 4 and 5 are assigned to the wrong users (blue arrows) but are assigned to the same clusters as those of the correct users. Therefore, we consider the attacks on users 4 and 5 to be successful. The failed attacks are only those on users 2 and 3 (red arrows), and the privacy risk is \(5/7 \times 1/2 = 5/14\).

4.4.2 Utility Definition

We define the utility metric here. In previous research, most utility metrics are based on either the distance between the original dataset and the anonymized dataset or the amount of information loss [48, 49]. However, the utility depends on the situation (i.e., context and use case), and these metrics do not necessarily match the actual utility. Therefore, we consider a use case scenario and present a utility definition that matches the scenario. Specifically, we consider a use case in which an anonymized dataset is used as training data for a machine learning algorithm. In the case of a Web access log dataset, for example, a client who develops anti-virus software may train a machine learning model on an anonymized dataset and predict whether a user will access a phishing Web site.

Definition 4.13

(Utility metric) Let F(M, E) be the F-measure of a machine learning model, where the training data are M and the test data are E. The utility metric is defined as follows:

$$\begin{aligned} Uti(A(M)) = \frac{F(A(M), E)}{F(M, E)}. \end{aligned}$$
(4.1)

Figure 4.8 gives an overview of the utility evaluation. We first generate two machine learning models: one from the original dataset and the other from its anonymized dataset. An item is randomly chosen as the objective variable, and the remaining items are the explanatory variables. Then, we use these models to predict an attribute of each record of an evaluation dataset that has the same attributes as the original dataset. This operation is performed several times while changing the objective variable. The utility is defined as the average ratio of the F-measure of a model of the anonymized dataset to that of a model of the corresponding original dataset. In this chapter, we apply logistic regression as the machine learning algorithm and predict fifty attributes.
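The following minimal sketch illustrates the computation of Definition 4.13, assuming binary-valued matrices, scikit-learn's LogisticRegression with its default parameters, and the F-measure from sklearn.metrics; the function names and the thresholding of anonymized values are illustrative assumptions.

```python
# A minimal sketch of the utility metric Uti(A(M)) = F(A(M), E) / F(M, E).
# Assumptions: M, A_M and the evaluation matrix E share the same columns,
# one column at a time serves as the objective variable, and logistic
# regression with default parameters is the learning algorithm.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score


def f_measure(train, test, target_col):
    # Train on all columns except target_col and score on the test data.
    # Columns with a single class should be skipped in practice.
    X_tr = np.delete(train, target_col, axis=1)
    y_tr = (train[:, target_col] > 0.5).astype(int)  # anonymized values may not be exactly 0/1
    X_te = np.delete(test, target_col, axis=1)
    y_te = (test[:, target_col] > 0.5).astype(int)
    model = LogisticRegression().fit(X_tr, y_tr)
    return f1_score(y_te, model.predict(X_te))


def utility(M, A_M, E, target_cols):
    # Average ratio of F-measures over the chosen objective columns.
    ratios = [f_measure(A_M, E, c) / f_measure(M, E, c) for c in target_cols]
    return float(np.mean(ratios))
```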

Fig. 4.8 Overview of utility evaluation

4.4.3 Matrix Factorization

Matrix factorization is a fundamental task in data analysis, and the technique is used in various scenarios, such as text data mining, acoustic analysis, and product recommendation by collaborative filtering. We use matrix factorization as an anonymization technique, so we present an overview of matrix factorization in this section.

4.4.3.1 SGD Matrix Factorization

We consider an unknown rank-r matrix \(M \in \mathbb {R}^{n \times m}\) and assume that we observe only the entries indexed by a set \(\Omega \subset [n] \times [m]\). The projection \(P_{\Omega }(M) \in \mathbb {R}^{n \times m}\) is defined as:

$$\begin{aligned} P_{\Omega }(M)_{ij}= {\left\{ \begin{array}{ll} M_{ij} & \mathrm{if}\; (i,j) \in \Omega , \\ 0 & \mathrm{otherwise}. \end{array}\right. } \end{aligned}$$
(4.2)

The goal of matrix factorization is to find two matrices \(U \in \mathbb {R}^{r \times n}\) and \(V \in \mathbb {R}^{r \times m}\) such that \(X = U^\mathsf{T}V\) approximates the original matrix on the observed entries, i.e., \(M_{ij} \approx X_{ij}\) for all \((i,j) \in \Omega \), with lower dimensionality \(r \ll \min (n, m)\).

This problem is defined to solve the following optimization problem:

$$\begin{aligned} \min _{U, V} \sum _{(i,j)\in \Omega }\left( (M_{ij} - u^\mathsf{T}_i v_j)^2 + \lambda (||u_i||^2 + ||v_j||^2)\right) , \end{aligned}$$
(4.3)

where \(u_i\) is the vector of user factors and \(v_j\) is the vector of item factors (the ith column of U and the jth column of V, respectively). When both \(u_i\) and \(v_j\) are treated as variables, the objective function is not convex, so the problem cannot be solved in closed form. Several techniques have been proposed to solve it; gradient descent [50], for example, is a fundamental technique for finding a local minimum. However, full gradient descent updates all vectors in every iteration and is computationally expensive, so stochastic gradient descent (SGD) is widely used, for example, in the KDD Cup 2011 [51] and the Netflix Prize [52].

There has been some research to speed up SGD-based matrix factorization, such as [53,54,55,56], and each algorithm updates the matrices in parallel or in a distributed manner.

In this chapter, we apply a simple SGD technique to optimize formula (4.3) and denote by Update(A) the update of a matrix A using the SGD technique.
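A minimal sketch of such an SGD update is given below; the initialization and the hyperparameter values (\(\gamma \), \(\lambda \), and the iteration count) follow those used later in this chapter, but the concrete implementation details are assumptions.

```python
# A minimal sketch of SGD matrix factorization for formula (4.3).
# U has shape (r, n) and V has shape (r, m), so that X = U^T V approximates
# M on the observed entries Omega.
import numpy as np


def sgd_mf(M, omega, r, iterations=100, gamma=0.05, lambda_=0.01, seed=0):
    rng = np.random.default_rng(seed)
    n, m = M.shape
    U = rng.normal(scale=0.1, size=(r, n))
    V = rng.normal(scale=0.1, size=(r, m))
    omega = np.array(list(omega))               # observed (i, j) index pairs
    for _ in range(iterations):
        rng.shuffle(omega)                      # visit observed entries in random order
        for i, j in omega:
            err = M[i, j] - U[:, i] @ V[:, j]   # residual of the current approximation
            u_i = U[:, i].copy()
            U[:, i] += gamma * (err * V[:, j] - lambda_ * U[:, i])
            V[:, j] += gamma * (err * u_i - lambda_ * V[:, j])
    return U, V
```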

4.4.4 Anonymization Using Matrix Factorization

We consider matrix factorization itself to be an anonymization method, where the rank r controls the accuracy of the matrix approximation. Moreover, we propose combining matrix factorization with another anonymization method ano, such as k-anonymization or noise addition. We denote by p the parameter of the anonymization method; p is k or \(\phi \) in this chapter. The basis matrix U and the weighting matrix V can be regarded as capturing the characteristics of the rows and columns, respectively; in our dataset, U is a characteristic matrix of the users. Therefore, we propose to anonymize U and keep V unchanged so that the characteristics of the domains are preserved. In our algorithm, we first divide the dataset M into U and V and anonymize U. Then, we optimize V once and recombine it with the anonymized U. The algorithm is described below.

We indicate that \(A_{r}(D)\) applies matrix factorization to matrix D and that \(A_{(ano,r)}(D)\) combines matrix factorization and the anonymization method ano by:

$$\begin{aligned} A_{(ano, r)}(D) = (A_{(ano)}(U))^\mathsf{T}V, \quad \mathrm{where}\;\; U \in \mathbb {R}^{r \times n},\; V \in \mathbb {R}^{r \times m}. \end{aligned}$$
(4.4)
Algorithm 4.1 Anonymization combining matrix factorization and the anonymization method ano
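The following sketch illustrates formula (4.4), reusing the sgd_mf function sketched in Sect. 4.4.3.1. Laplace noise addition is shown as one possible choice of the anonymization method ano, and the step of Algorithm 4.1 that re-optimizes V once after anonymizing U is omitted here for brevity; the helper names are illustrative assumptions.

```python
# A sketch of formula (4.4): anonymize the user-factor matrix U and
# recombine it with V.
import numpy as np


def add_laplace_noise(U_T, phi, seed=0):
    # One hypothetical choice of `ano`: Laplace noise with scale phi added to
    # each element of U^T (one row per user).
    rng = np.random.default_rng(seed)
    return U_T + rng.laplace(loc=0.0, scale=phi, size=U_T.shape)


def anonymize_with_mf(M, omega, r, ano, iterations=100):
    # Factorize M (sgd_mf is the SGD sketch above), anonymize the user
    # characteristics, and recombine: A_{(ano, r)}(M) = ano(U^T) V.
    U, V = sgd_mf(M, omega, r, iterations=iterations)
    return ano(U.T) @ V


# Example: combine matrix factorization (r = 20) with noise addition (phi = 0.15).
# A_M = anonymize_with_mf(M, omega, r=20, ano=lambda U_T: add_laplace_noise(U_T, 0.15))
```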

4.4.5 Experiment

4.4.5.1 Dataset

We use an actual Web access log dataset as a time-sequence dataset. The dataset consists of an ID, a time stamp, and the access domain, as shown in Table 4.8. We convert the dataset into a matrix as follows:

$$\begin{aligned} M_T = \begin{bmatrix} r_{11} & r_{12} & \cdots & r_{1m}\\ r_{21} & r_{22} & \cdots & r_{2m}\\ \vdots & \vdots & \ddots & \vdots \\ r_{n1} & r_{n2} & \cdots & r_{nm} \end{bmatrix} \end{aligned}$$
(4.5)

Here, T is the observation time.

Table 4.8 Dataset format

We set \(r_{ij} = 1\) if the user whose ID is i accesses domain j during time T, and \(r_{ij} = 0\) otherwise. For example, the matrices constructed from the dataset in Table 4.8 are as follows:

$$\begin{aligned} M_{t_1} = \begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \end{aligned}$$
(4.6)
$$\begin{aligned} M_{t_2} = \begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \end{aligned}$$
(4.7)

Here, \(t_1\) is the 10-min span between 2016-12-01 16:10:00 and 2016-12-01 16:19:59, and \(t_2\) is the following 10-min span between 2016-12-01 16:20:00 and 2016-12-01 16:29:59. The IDs are different between \(t_1\) and \(t_2\), but \(x_{t_1}\) and \(x_{t_2}\), and \(z_{t_1}\) and \(z_{t_2}\), represent the same users.
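The conversion can be sketched as follows, assuming the log is held in a pandas DataFrame with columns id, timestamp, and domain; the column names and function name are assumptions for illustration.

```python
# A sketch of the conversion from an access log (Table 4.8 format) to a
# binary user-domain matrix M_T for one observation window [start, end).
import pandas as pd


def log_to_matrix(log, start, end, users, domains):
    # Keep only the accesses inside the observation window.
    window = log[(log["timestamp"] >= start) & (log["timestamp"] < end)]
    # r_ij = 1 if user i accessed domain j at least once during the window.
    M = pd.DataFrame(0, index=users, columns=domains)
    for _, row in window.iterrows():
        if row["id"] in M.index and row["domain"] in M.columns:
            M.at[row["id"], row["domain"]] = 1
    return M.to_numpy()
```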

In the following experiments, we randomly chose 200 users and 1,000 domains from an actual Web access log and changed the pseudonymous IDs at each designated time T.

4.4.5.2 The Privacy Risk Against a Linkage Attack

First, we evaluate whether a linkage attack is possible. We set the observation time \(t_1\) as 2, 4, and 8 h from 16:00 on a weekday and the observation time \(t_2\) as the same time on another weekday. The probability of a linkage attack between \(M_{t_1}\) and \(M_{t_2}\) is shown in Table 4.9.

Table 4.9 Linkage attack against a non-anonymized dataset

Although the matrix only records whether each domain has been accessed, the linkage attack probability, i.e., the risk, is very high (over 50\(\%\)) even when the observation time is only 2 h. Moreover, the risk increases as the observation time increases, because a longer observation time makes each user's access pattern more distinctive. The results show that people's Web access patterns have consistent characteristics. Hence, we need to consider not only reidentification attacks but also linkage attacks to avoid privacy leakages.

4.4.5.3 Effects of Matrix Factorization

Observation times \(t_1\) and \(t_2\) are fixed as 8 h from 16:00 on a weekday in the following experiments. The inputs of matrix factorization are the original dataset M, the number of iterations I, and the rank r; \(\lambda \) and \(\gamma \) are the hyperparameters. We fix \(I = 100\), which is sufficient for convergence, \(\gamma = 0.05\), and \(\lambda = 0.01\). The convergence result is shown in Fig. 4.9. The rank r can be treated as the anonymization parameter of matrix factorization because the accuracy of the approximation \(X = U^\mathsf{T}V\) depends on r; we set \(r = 10, 20, 30, 40\). We used larger values in the experiments in [3], but the results for \(r > 40\) are saturated. The probabilities of reidentification and linkage attacks are shown in Table 4.10.

Fig. 4.9 Convergence result

Table 4.10 Attacks against matrix factorization

The results show that matrix factorization by itself does not have much effect against reidentification attacks. This is because matrix factorization preserves the relative positional relationship among the records, so the privacy risk under a matching algorithm does not decrease much. When the rank is small enough, e.g., \(r = 10\), the positional relationship is broken and the privacy risk is lowered.

On the other hand, the linkage attack probability between \(A_{r}(M_{t_1})\) and \(A_{r}(M_{t_2})\) is lower than the corresponding reidentification attack probability. This is because the relationship between the records of \(M_{t_1}\) and \(M_{t_2}\) is weaker than that between \(M_{t_1}\) and \(A_{r}(M_{t_1})\). In our experiment, the anonymized dataset with an 8-h observation time and \(r = 30\) has almost the same linkage-attack privacy level as the non-anonymized dataset with a 2-h observation time in Table 4.9 (Fig. 4.10).

Fig. 4.10 Overview of the experiment

4.4.6 Results

4.4.6.1 Risk Evaluation

We evaluate our anonymization method, Algorithm 4.1, in the following experiments. We apply the method described in [10] as k-anonymization and Laplace noise as the noise addition. When noise addition is applied, noise \(\epsilon \sim Lap(0, 2\phi ^2)\), i.e., Laplace noise with mean 0 and variance \(2\phi ^2\), is added to each element, and the anonymization parameter is \(\phi \). The following four experiments are performed:

1. Evaluate the privacy risk of a reidentification attack between \(A_{k}(M_{t_1})\) and \(M_{t_1}\) and a linkage attack between \(A_{k}(M_{t_1})\) and \(A_{k}(M_{t_2})\).

2. Evaluate the privacy risk of a reidentification attack between \(A_{\phi }(M_{t_1})\) and \(M_{t_1}\) and a linkage attack between \(A_{\phi }(M_{t_1})\) and \(A_{\phi }(M_{t_2})\).

3. Evaluate the privacy risk of reidentification attacks between \(A_{k}(U_{t_1})^\mathsf{T}V\) and \(M_{t_1}\) and linkage attacks between \(A_{k}(U_{t_1})^\mathsf{T}V\) and \(A_{k}(U_{t_2})^\mathsf{T}V\).

4. Evaluate the privacy risk of reidentification attacks between \(A_{\phi }(U_{t_1})^\mathsf{T}V\) and \(M_{t_1}\) and linkage attacks between \(A_{\phi }(U_{t_1})^\mathsf{T}V\) and \(A_{\phi }(U_{t_2})^\mathsf{T}V\).

The evaluations of the reidentification attacks in experiments 1 and 2 are almost the same as those conducted in many previous studies. The difference is the privacy metric (see Sect. 4.4.1), and these results are used for comparison with experiments 3 and 4, which are evaluations of our algorithm. There are few studies on linkage attacks, and the evaluations of this type of attack are one of our contributions.

The evaluation of the reidentification attack in experiment 1 (Table 4.11) is straightforward, and the result is almost the same as that of plain k-anonymization. However, our privacy metric differs slightly from that of k-anonymity, so the result also differs slightly from 1/k. The result of the linkage attack shows that k-anonymization greatly improves privacy against linkage attacks: 2-anonymization reduces the privacy risk by \(77\%\) (\(0.8 \rightarrow 0.185\)).

Table 4.11 Experiment 1

The results of experiment 2 are shown in Table 4.12. Privacy against the reidentification attack improves from \(\phi \ge 0.9\), and when \(\phi \) is large, for example, \(\phi = 1.5\), the risk appears low. However, with such noise, almost half of the elements are changed by more than 1, while each original value is binary, \(M_{ij} \in \left\{ 0, 1\right\} \); the noise is therefore too large to preserve utility. We conclude that simple noise addition is not a good anonymization method in terms of utility preservation. On the other hand, we obtain an interesting result for linkage attacks: privacy against linkage attacks improves even when the noise is very small, so adding even a small amount of noise is an effective countermeasure against a linkage attack.

Table 4.12 Experiment 2

In experiment 3, we evaluate the effect of our proposed algorithm, which combines matrix factorization and k-anonymization. Table 4.13 presents the result of the reidentification attack. In this experiment, the effect of matrix factorization is hard to observe, although privacy slightly improves as r increases. This is because k-anonymization dominates the reidentification risk, so the effect of matrix factorization barely appears.

Table 4.13 Experiment 3: reidentification attack

The results of the linkage attack in experiment 3 are shown in Table 4.14. In this experiment, we cannot obtain new insight into the effect of matrix factorization: when datasets observed in different time periods are sufficiently anonymized by k-anonymization, there is little relationship left between the records of the same user, and only outliers can be linked.

Table 4.14 Experiment 3: linkage attack

In experiment 4, we evaluate the impact of our method combining matrix factorization and noise addition. The evaluation results of the reidentification attack are presented in Table 4.15. Noise is added to U, which represents the users' characteristics, and the anonymized \(U^\mathsf{T}\) is then multiplied by V. Therefore, we cannot directly compare the results with those of experiment 2, but the impact of matrix factorization is high. This result shows that matrix factorization helps construct anonymized datasets flexibly from the viewpoint of privacy. For example, the privacy risks of \(A_{(\phi =0.15, r=20)}(M_{t_1})\) and \(A_{(\phi =0.20, r=40)}(M_{t_1})\) are almost the same as those of \(A_{(k=2)}(M_{t_1})\) and \(A_{(\phi =1.5)}(M_{t_1})\).

Table 4.15 Experiment 4: reidentification attack

The results of the linkage attack in experiment 4 are presented in Table 4.16. The trend is the same as that of the reidentification attack, and the matrix factorization is compatible with noise addition. We present the details of the results of the reidentification attack and the linkage attack in Figs. 4.11 and 4.12.

Table 4.16 Experiment 4: linkage attack
Fig. 4.11 Reidentification risk of the combination of matrix factorization and noise addition

Fig. 4.12 Linkage risk of the combination of matrix factorization and noise addition

4.4.6.2 Utility Evaluation

We next evaluate the utility of the anonymized datasets by applying a machine learning algorithm. Logistic regression (https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) with its default parameters is applied in the following experiments. One application of an access log dataset is to predict malicious sites and warn the Web browser's users, so we predict whether each user will access a malicious site. We train models on the original (non-anonymized) dataset and on the anonymized datasets and input the test dataset to these models. The utility score is defined in Definition 4.13, and the F-measure of the model of the original dataset is 0.763. The results are shown in Tables 4.17, 4.18, 4.19, 4.20, and 4.21. We perform the following four evaluations:

1. Evaluate the utility of \(A_{(k)}(M_{t_1})\) for \(k=2, 4, 6, 8\), and 10.

2. Evaluate the utility of \(A_{(\phi )}(M_{t_1})\) for \(\phi = 0.3, 0.6, 0.9, 1.2\), and 1.5.

3. Evaluate the utility of \(A_{(k=2, r)}(M_{t_1})\) for \(r = 10, 20, 30\), and 40.

4. Evaluate the utility of \(A_{(\phi , r)}(M_{t_1})\) for \(\phi = 0.1\) and 0.15 and \(r = 10, 20, 30\), and 40.

In experiment 1, each element is \(M_{ij}\in \left\{ 0, 1\right\} \) and the matrix is sparse, so k-anonymization preserves utility well. However, when the dataset is more complex, the utility under k-anonymization will decrease; this is widely known as the curse of dimensionality.

Table 4.17 Utility evaluation 1
Table 4.18 Utility evaluation 2
Table 4.19 Utility evaluation 3

The results of experiment 2 show that the utility of the dataset decreases as noise increases. As stated in the risk evaluation section, each element of the original dataset is 0 or 1, and the utility drastically worsens when the noise parameter is large, such as \(\phi = 1.5\).

When k-anonymization and matrix factorization are combined (experiment 3), the effect of matrix factorization is small, as was the case for the privacy risk: the effect of k-anonymization dominates, and that of matrix factorization is relatively small.

Table 4.20 Utility evaluation 4
Table 4.21 Utility evaluation 5

The evaluation results of the combination of noise addition and matrix factorization show good performance (Tables 4.20 and 4.21). A dataset generated by combining matrix factorization and noise addition preserves more utility than a dataset generated by noise addition alone at the same privacy level.

4.5 Anonymization and Privacy Risk Evaluation Tool

In this section, we introduce an anonymization and privacy risk evaluation tool. So far, we have shown how to evaluate the privacy and utility of several datasets; the tool focuses on static datasets and applies the theory described above. First, we explain the outline of the tool. The tool takes as input a dataset that is the target of anonymization and privacy risk evaluation. At this point, a data type is defined for each attribute (see Fig. 4.13): numerical, qualitative, set, code, and sensitive types can be defined. Age, height, and weight are defined as numerical types, and the user can assign a range of values; for instance, a user may want to group ages into two-year or five-year bins depending on the situation. Qualitative-type attributes have nonnumerical values, such as gender and occupation. The set type is an extension of the numerical or qualitative type for attributes that contain multiple values. The code type is used when every value has the same number of digits, such as a postcode. The sensitive type corresponds to sensitive information. The privacy risk is evaluated using quasi-identifiers in our tool, so sensitive attributes do not affect the privacy risk; however, it is known that sensitive information may cause privacy leakages, and the tool can also cover such risks through notions such as l-diversity.
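One possible way to express these per-attribute type definitions is a simple mapping, sketched below; the format and attribute names are illustrative assumptions and do not reflect the tool's actual configuration syntax.

```python
# An illustrative mapping of attributes to the data types described above;
# the tool's concrete configuration format is not shown here.
ATTRIBUTE_TYPES = {
    "age":      {"type": "numerical", "range": 5},  # group ages into 5-year bins
    "gender":   {"type": "qualitative"},
    "hobbies":  {"type": "set"},                    # attribute holding multiple values
    "postcode": {"type": "code"},                   # fixed-digit code
    "disease":  {"type": "sensitive"},              # not treated as a quasi-identifier
}
```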

Fig. 4.13 Anonymization and privacy risk evaluation tool 1

Fig. 4.14 Anonymization and privacy risk evaluation tool 2

After the type of each attribute is decided, the user defines the noise and sampling parameters; our tool can thus evaluate datasets anonymized by a combination of methods. Then, the user creates a hierarchical generalization tree for each attribute, and the tool generalizes the values in accordance with the tree. The user can create and modify the hierarchical trees through a UI (see Fig. 4.14).
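A small sketch of such a generalization hierarchy for the age attribute is shown below; the dictionary-based representation is an illustrative assumption, not the tool's internal format.

```python
# A sketch of a generalization hierarchy for the "age" attribute and a helper
# that generalizes a value to a given level.
AGE_HIERARCHY = {
    0: lambda age: str(age),                                        # level 0: exact value
    1: lambda age: f"{(age // 5) * 5}-{(age // 5) * 5 + 4}",        # level 1: 5-year bins
    2: lambda age: f"{(age // 20) * 20}-{(age // 20) * 20 + 19}",   # level 2: 20-year bins
    3: lambda age: "*",                                             # level 3: fully suppressed
}


def generalize(age: int, level: int) -> str:
    return AGE_HIERARCHY[level](age)


# Example: generalize(42, 1) -> "40-44", generalize(42, 2) -> "40-59".
```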

Fig. 4.15 Anonymization and privacy risk evaluation tool 3

After these preparations are finished, the user can define conditions and generate a dataset flexibly. A sample operation screen is shown in Fig. 4.15. As an example, let us introduce a commonly used procedure. First, the user searches for records that do not achieve k-anonymity, i.e., records whose combination of quasi-identifier values appears fewer than k times, and then raises the generalization level of an attribute of those records. Records that are already secure enough are left untouched, so the utility of the dataset can be maintained. The conditions can be more complex; for example, the records whose “age” is over 80 and whose “occupation” is not “self-employed” can be identified and anonymized, and the levels of the records are “balanced” according to the hierarchical tree. The privacy risk can be seen in real time (Fig. 4.16), and the user can anonymize a dataset by trial and error. The operation procedure can be output as a setting file, and once the procedure is fixed, it can be performed automatically, e.g., in batch processing.
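The first search step of this procedure can be sketched as follows, assuming the dataset is held in a pandas DataFrame; the quasi-identifier column names are assumptions for illustration.

```python
# A sketch of the record search described above: identify records whose
# combination of quasi-identifier values appears fewer than k times.
import pandas as pd


def records_violating_k(df: pd.DataFrame, quasi_identifiers, k: int) -> pd.DataFrame:
    counts = df.groupby(quasi_identifiers)[quasi_identifiers[0]].transform("count")
    return df[counts < k]


# Example: rows to be generalized further before the dataset satisfies 2-anonymity.
# violators = records_violating_k(df, ["age", "gender", "occupation"], k=2)
```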

Fig. 4.16 Anonymization and privacy risk evaluation tool 4

4.6 Conclusion

In this chapter, we considered the importance of data and privacy. Several anonymization techniques, including k-anonymization, were introduced in Sect. 4.2, and the privacy notion and adversary model for static data were presented in Sect. 4.3. We focused on both static data and time-sequence data in this project and discussed time-sequence data in Sect. 4.4. Finally, in Sect. 4.5, we introduced an anonymization and privacy risk evaluation tool. The tool was developed partly within this project, and we are proactively pursuing its commercial use.