4.1 Introduction

Secure data management is a key issue in personal data distribution and analysis. Anonymization techniques have been used to balance the utility of data against their privacy risks. These techniques transform personal data into anonymized data in order to reduce the probability that data principals can be reidentified from the data. If the data are well anonymized, they cannot be linked to a person; thus, the privacy of that person is protected.

Secure computation is not always a realistic solution for commercial services because of its cost when the data are very large. In contrast, some anonymization techniques remain a "practical" solution for commercial services even when the data are very large. Thus, anonymization techniques have been applied to personal data distribution and data analysis. For example, k-anonymization was first proposed as a practical way to reduce the reidentification risks of public data; since then, it has also been considered for the secure management of personal data.

Quantitative measures of anonymity are required to estimate privacy risks and to assess whether privacy requirements can be met. In several studies on anonymization, privacy notions providing such quantitative measures have been defined for individual anonymization techniques; however, no notion common to all anonymization techniques has been presented to date. Each privacy notion is therefore local rather than universal, and heuristic approaches are still used to balance the usability of data against privacy risks across whole processes or services. A common notion is required for consistent secure data management over the whole process.

In this chapter, we discuss a new common privacy notion based on an adversary model that is applicable to several anonymization techniques, and we introduce a novel anonymization technique together with an implementation of it. In Sect. 4.2, we revisit adversary models for several anonymization techniques and review the techniques themselves. We propose a common adversary model and present quantitative measures based on it in Sect. 4.3. An extension is discussed in Sect. 4.4. Our implementation of an anonymization tool is introduced in Sect. 4.5. We conclude this chapter in Sect. 4.6.

4.2 Anonymization Techniques and Adversary Models, Revisited

The related work presented below is grouped by anonymization method, beginning with k-anonymization and noise addition.

4.2.1 k-Anonymization

k-anonymity [4,5,6] is a well-known privacy model. The property of k-anonymity is that each published record is such that every combination of values of quasi-identifiers can be matched to at least k respondents.

4.2.1.1 Adversary Model

k-anonymized datasets are assumed to be in public domains. An adversary can obtain all the attribute values in a dataset and execute arbitrary operations on the attribute values.

There are few formal definitions or models for the adversary that aim to identify the attributes of a certain individual in a k-anonymized dataset. Kiyomoto and Martin modeled an adversary [7] for k-anonymized datasets based on two query functions as follows:

Let d be an index of the dth record, \(q_x\) be a set of m attribute values in \(T^{q*}\), and s be a value for the sensitive attribute. The two query functions are defined as:

  • read. For the input of an index value d, the function outputs the dth record. That is, \(f(T^*, query=\{\mathbf{read}, d\}) \rightarrow \{ d, q_x^d, s^d \}\), where \(q_x^d\) and \(s^d\) are the values of the quasi-identifier and the sensitive attribute in the dth record, respectively. If the dth record does not exist, then the function outputs failed.

  • search. For the input \(q_x\) and/or s, the function outputs the number u of records and the index values that have quasi-identifier \(q_x\) and/or sensitive attribute s. That is, \(f(T^*, query=\{\mathbf{search}, q_x, s\}) \rightarrow u, D\), where u and D are the number of records and a sequence of index values that have the same quasi-identifier and/or sensitive attribute, respectively. If s or \(q_x\) does not exist, then the function outputs failed. A minimal code sketch of these two queries is given below.
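The following is a minimal Python sketch of this query interface, assuming the anonymized table \(T^*\) is stored as a list of (quasi-identifier, sensitive value) pairs whose position is the record index; the data layout and example values are illustrative only, not the representation used in [7].

```python
FAILED = "failed"

def read(table, d):
    """Return the d-th record as (d, q_x^d, s^d), or 'failed' if it does not exist."""
    if 0 <= d < len(table):
        qi, s = table[d]
        return d, qi, s
    return FAILED

def search(table, qi=None, s=None):
    """Return (u, D): the number of matching records and their index values."""
    if qi is None and s is None:
        return FAILED
    D = [d for d, (q, sv) in enumerate(table)
         if (qi is None or q == qi) and (s is None or sv == s)]
    return (len(D), D) if D else FAILED

# Toy 2-anonymous table: two records share the same quasi-identifier group.
T_star = [(("30-39", "Tokyo"), "flu"),
          (("30-39", "Tokyo"), "cold"),
          (("40-49", "Osaka"), "flu")]
print(read(T_star, 1))                        # (1, ('30-39', 'Tokyo'), 'cold')
print(search(T_star, qi=("30-39", "Tokyo")))  # (2, [0, 1])
```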

4.2.1.2 k-Anonymization Algorithm

The idea of k-anonymity is easy to understand, and many types of k-anonymization algorithms have been proposed. The Incognito algorithm [8] generalizes the attributes using taxonomy trees, and the Mondrian algorithm [9] averages the original data or replaces them with representative values to achieve k-anonymization. In this chapter, we use a k-anonymization algorithm based on clustering and denote k-anonymization of a dataset D by \(A_{k}(D)\). The algorithm finds close records and creates clusters such that each partition contains at least k records. For details of the algorithm, see [10].
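As an illustration only (a greedy variant under our own assumptions, not the exact algorithm of [10]), a clustering-based \(A_{k}(D)\) over numeric quasi-identifiers can be sketched as follows:

```python
import numpy as np

def k_anonymize_by_clustering(D, k):
    """Greedy sketch of a clustering-based A_k(D): group every record with its
    k-1 nearest unassigned neighbours (Euclidean distance on numeric
    quasi-identifiers) and replace each cluster member by the cluster mean."""
    D = np.asarray(D, dtype=float)
    anonymized = np.empty_like(D)
    unassigned = list(range(len(D)))
    last_members = None
    while len(unassigned) >= k:
        seed = unassigned.pop(0)
        nearest = sorted(unassigned, key=lambda j: np.linalg.norm(D[j] - D[seed]))
        members = [seed] + nearest[:k - 1]
        unassigned = [j for j in unassigned if j not in members]
        anonymized[members] = D[members].mean(axis=0)  # representative value
        last_members = members
    if unassigned:  # fewer than k leftovers: merge them into the last cluster
        group = unassigned if last_members is None else unassigned + last_members
        anonymized[group] = D[group].mean(axis=0)
    return anonymized

print(k_anonymize_by_clustering([[25, 170], [27, 172], [60, 155], [62, 158]], k=2))
```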

4.2.2 Noise Addition

Noise addition works by adding or multiplying stochastic or randomized numbers to confidential data [11]. The idea is simple and is also well known to be an anonymization technique.

4.2.2.1 Adversary Model

One objective of an adversary against noise-added datasets is to remove the noise or to estimate the original values from the noise-added attribute values. One potential scenario is a probabilistic approach in which the adversary estimates the distribution of the noise and chooses an attribute value with high probability. There is no formal adversary model for static noise-added datasets; differential privacy settings, by contrast, assume that noise is added dynamically, and their adversaries are defined in terms of queries.

4.2.2.2 Anonymization Algorithm by Noise Addition

The first work on noise addition was proposed by Kim [12], and the idea was to add noise \(\epsilon \) with a distribution \(\epsilon \sim N(0, \sigma ^2)\) to the original data. Additive noise is uncorrelated noise and preserves the mean and covariance of the original data, but the correlation coefficients and variance are not retained. Another variation of additive noise is correlated additive noise, which keeps the mean and allows the correlation coefficients in the original data to be retained [13]. Differential privacy is a state-of-the-art privacy model that is based on the statistical distance between two database tables differing by at most one record. The basic idea is that, regardless of background knowledge, an adversary with access to the dataset draws the same conclusions, irrespective of whether a person’s data are included in the dataset. Differential privacy is mainly studied in relation to perturbation methods in an interactive setting, although it is applicable to certain generalization methods.

In this chapter, we use Laplace noise for noise addition and add noise \(\epsilon \sim Lap(0, 2\phi ^2)\) to each attribute. We denote noise addition for dataset D by \(A_{\phi }(D)\).
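A minimal NumPy sketch of \(A_{\phi }(D)\) is shown below. We read the second parameter of \(Lap(0, 2\phi ^2)\) as the variance, so the Laplace scale is \(b=\phi \) (since the variance of Laplace noise is \(2b^2\)); this parameterization is our assumption.

```python
import numpy as np

def add_laplace_noise(D, phi, rng=None):
    """Sketch of A_phi(D): add zero-mean Laplace noise with scale b = phi
    (variance 2*phi^2, an assumed reading of Lap(0, 2*phi^2)) to every value."""
    rng = np.random.default_rng() if rng is None else rng
    D = np.asarray(D, dtype=float)
    return D + rng.laplace(loc=0.0, scale=phi, size=D.shape)

noisy = add_laplace_noise([[38.0, 165.0], [29.0, 171.0]], phi=1.0,
                          rng=np.random.default_rng(0))
print(noisy)
```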

4.2.3 k-Anonymization for Combined Datasets

We introduce an adversary model for a dataset combined from the datasets produced by two service providers, together with anonymization methods for such a dataset [14].

4.2.3.1 Adversary Model

If we consider the existing adversary model and assume that the anonymization tables produced by the service providers satisfy k-anonymity, the combined table also satisfies k-anonymity. However, we have to consider another type of adversary in our new service model. In our service model, the combined table includes many sensitive attributes; thus, an adversary can distinguish a data owner using background knowledge of combinations of the data owner's sensitive attribute values. If the adversary finds a combination of known sensitive attribute values in only one record, the adversary learns that the record belongs to a data owner the adversary knows and thereby also learns the remaining sensitive attribute values of that data owner. We model this new type of adversary as follows:

\(\pi \)-knowledge Adversary Model. An adversary knows certain \(\pi \) sensitive attribute values \(\{s^i_1, ..., s^i_j, ..., s^i_{\pi }\}\) of a victim i. Thus, the adversary can distinguish the victim in an anonymization table in which only one record contains some combination (of at most \(\pi \) values) of the attributes \(\{s^i_1, ..., s^i_j, ..., s^i_{\pi }\}\).
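To make the condition concrete, the following Python sketch lists the sensitive-value combinations (up to \(\pi \)-tuples) that occur in exactly one record of a combined table, i.e., the background knowledge a \(\pi \)-knowledge adversary could use to single out a victim; the table representation and example rows are our own.

```python
from itertools import combinations
from collections import Counter

def singling_out_combinations(sensitive_rows, pi):
    """List the combinations of at most pi sensitive values that occur in
    exactly one record. Each row is a tuple of sensitive attribute values;
    '*' marks a suppressed value."""
    counts = Counter()
    for row in sensitive_rows:
        values = sorted(v for v in row if v != "*")
        seen = set()
        for t in range(1, pi + 1):
            seen.update(combinations(values, t))
        for combo in seen:
            counts[combo] += 1
    return [combo for combo, c in counts.items() if c == 1]

rows = [("diabetes", "smoker"), ("diabetes", "non-smoker"), ("asthma", "smoker")]
# Any combination printed here appears in only one record and would reveal it.
print(singling_out_combinations(rows, pi=2))
```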

4.2.3.2 Modification of Quasi-identifiers

The first strategy is to modify the quasi-identifiers of the combined table. The data user generates a merged table from two anonymization tables as follows: First, the data user simply merges the records in the two tables as \(| q^g_C | s^h_{AB} | s^i_A | s^j_B |\). Then, the data user modifies \(q^g_C\) to satisfy a condition stated in terms of \(\theta \), the total number of sensitive attributes in the merged table.

4.2.3.3 Modification of Sensitive Attributes

The second approach is to modify the sensitive attributes in the combined table so that the condition holds. If a subtable \(| s^h_{AB} | s^i_A | s^j_B |\) consisting of sensitive attributes is required to satisfy k-anonymity, some sensitive attribute values are removed from the table, that is, changed to \(*\), to satisfy k-anonymity. Note that we do not allow all the sensitive attributes of a record to be \(*\), because such a record carries no information.

4.2.3.4 Algorithm for Modification

One algorithm that finds a k-anonymized combined dataset is executed as follows:

  1. The algorithm generalizes quasi-identifiers so that each group of identical quasi-identifiers contains at least \(\pi \times k\) records.

  2. The algorithm generates all the tuples of \(\pi \) sensitive attributes in the table.

  3. For each tuple, the algorithm finds all the records that have the same sensitive attribute values as the tuple, or \(*\) for those attributes, and makes them a group. Let \(\theta \) be the number of sensitive attributes in the group. The algorithm generates a partial table that consists of the remaining \(\theta -\pi \) sensitive attributes and checks whether the partial table has at least k different combinations of sensitive attribute values.

  4. If the partial table does not satisfy the above condition, the algorithm chooses a record from other groups that have different tuples of \(\pi \) sensitive attributes and changes those \(\pi \) sensitive attributes to \(*\). The algorithm repeats this step until the partial table satisfies the condition.

  5. The algorithm executes steps 3 and 4 for all the tuples of \(\pi \) sensitive attributes in the table.

4.2.4 Matrix Factorization for Time-Sequence Data

Some studies have used matrices for time-sequence datasets. Zheng et al. [15, 16] proposed predicting a user's interest in an unvisited location. They modeled users' GPS trajectories as a user-location matrix in which each value indicates the number of visits of a user to a location. The matrix is very sparse because each user visits only a handful of locations, so a collaborative filtering model is applied for the prediction. Zheng et al. [17] built a location-activity matrix, M, which has missing values. M is decomposed into two low-rank matrices U and V. The missing values can be filled by \(X = UV^\mathsf{T} \simeq M\), and locations can be recommended when some activities are given. Chawla et al. [18] constructed a graph from the trajectories of taxis and transformed the graph into matrices. The authors of [19] proposed a method of identifying traffic flows that cause an anomaly between two regions.

4.2.5 Anonymization Techniques for User History Graphs

In this subsection, we introduce two anonymization techniques for user history graphs, which are proposed in [1].

4.2.5.1 Adversary Model

Privacy leakage from a merged history graph is the disclosure of the actions of a particular person from the graph. Attacks against user history graphs are intended to obtain the private information of a particular user from the graph. We assume that the merging process is executed in a trusted domain and that only the merged history graph is published; thus, the adversary can obtain only the merged graph. Furthermore, we assume that the adversary has the following knowledge about the user: the history of the user is included in the merged graph, and the user performed an action t. The adversary tries to discover other actions of the user by guessing which edges connected to node t can be assigned to the user.

We summarize the adversary model as follows:

Adversary against a Merged History Graph. It is assumed that an adversary knows that a victim A executed an action t. The objective of the adversary is to obtain the actions that A executed before or after the action t. Thus, the adversary searches the merged history graph, which includes actions of other people and finds the actions of A using the knowledge that action t was executed.

We define privacy notions to use with the above adversary model in a later subsection.

4.2.5.2 Notions for the Untraceability of a Graph

We consider two levels of privacy notions: partial k-untraceability and complete k-untraceability. Partial k-untraceability accepts the leakage of some partial actions of a user but prevents all the actions of the user from being revealed. The definition of complete k-untraceability involves meeting the requirement that no action of the user is leaked. The symbol \(Act^A_{\mathcal {N}_{x \rightarrow y}}\) for user A denotes the sequence of all the actions of user A from action x to action y. For example, the sequence of actions from the first action to action x and the sequence of actions from action x to the final action are denoted as \(Act^A_{\mathcal {N}_{start \rightarrow x}}\) and \(Act^A_{\mathcal {N}_{x \rightarrow end}}\), respectively.

Definition 4.1

(Partial k-untraceability) We assume that an adversary knows an action t of a user A, and we consider all the possible adversaries defined for any action t of the user in the merged graph. If at least k sequences of actions are potentially associated with user A and \(k-1\) other users exist as candidates for all actions \(Act^A_{\mathcal {N}_{start \rightarrow t}}\) and \(Act^A_{\mathcal {N}_{t \rightarrow end}}\), the digraph satisfies k-untraceability for A. If the digraph satisfies the above condition for all users, then the digraph is said to satisfy partial k-untraceability.

Definition 4.2

(Complete k-untraceability) We assume that an adversary knows an action t of a user A and we consider all the possible adversaries defined for any action t of the user in the merged graph. If at least k actions are potentially associated with user A and \(k-1\) other users exist as candidates for each action in \(Act^A_{\mathcal {N}_{start \rightarrow t}}\) and \(Act^A_{\mathcal {N}_{t \rightarrow end}}\), the digraph satisfies k-untraceability for A. If the digraph satisfies the above condition for all users, the digraph satisfies complete k-untraceability.

Generally, many trivial actions are performed by many users. It is not important for privacy purposes whether we keep the information about such actions. Thus, we relax the above definitions to produce an anonymized graph that retains much of the information needed to analyze a user's history. Let v be the threshold on the number of performing users that establishes that an action is trivial; that is, we judge the action \(x \rightarrow y\) to be trivial if the label \(L(x \rightarrow y) \ge v\). Both definitions are modified as follows:

Definition 4.3

(Partial (k, v)-untraceability) We assume that an adversary knows an action t of a user A, and we consider all the possible adversaries defined for any t in the merged graph. If at least k sequences of actions are potentially associated with user A and \(k-1\) other users exist as candidates for all actions \(Act^A_{\mathcal {N}_{start \rightarrow t}}\) and \(Act^A_{\mathcal {N}_{t \rightarrow end}}\) except trivial actions \(x \rightarrow y\) that have a label \(L(x \rightarrow y) \ge v\), then the digraph satisfies partial (k, v)-untraceability for A. If the digraph satisfies the above condition for all users, then the digraph satisfies partial (k, v)-untraceability.

Definition 4.4

(Complete (k, v)-untraceability) We assume that an adversary knows an action t of a user A, and we consider all the possible adversaries defined for any t in the merged graph. If at least k actions are potentially associated with user A and \(k-1\) other users exist as candidates for each action in \(Act^A_{\mathcal {N}_{start \rightarrow t}}\) and \(Act^A_{\mathcal {N}_{t \rightarrow end}}\) except trivial actions \(x \rightarrow y\) that have a label \(L(x \rightarrow y) \ge v\), then the digraph satisfies complete (k, v)-untraceability for A. If the digraph satisfies the above condition for all users, then the digraph satisfies complete (k, v)-untraceability.

In a complete (k, v)-untraceable graph, each action t except trivial actions has at least k outgoing edges and k incoming edges; thus, an action of user A that connects to action t cannot be identified from among k candidates. The graph therefore satisfies untraceability against an adversary who knows action t of the user. It is trivial that a complete (k, v)-untraceable graph satisfies partial (k, v)-untraceability, because all actions except trivial actions are connected to k potential actions in a complete (k, v)-untraceable graph. A graph that satisfies partial (k, v)-untraceability generally preserves much more information than a complete (k, v)-untraceable graph, where both graphs are generated from the same user history graph. However, a partial (k, v)-untraceable graph may reveal some actions of users due to the relaxed definition of the privacy notion; an attack is considered successful only when an adversary obtains all the actions of a user. To trace all the actions of the user, the adversary has to select the correct sequence of actions from k sequences of actions; thus, the full set of actions of the user remains untraceable, even though some individual actions may be traceable by the adversary. The parameter k means that an action (or a sequence of actions) is potentially associated with a user and \(k-1\) other users in the untraceable graph, and the parameter v means that at least v users perform the same (trivial) action in the graph. Generally, we should select the parameter \(v = k\) with regard to the privacy requirement for a merged graph. The actions of a user are hidden among the actions of a group that consists of k members including the user. A privacy notion for the graph should be selected from the above two notions according to the use case of the graph and its privacy requirements.
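The following Python sketch checks a necessary condition derived from the prose above, namely that every node either has at least k outgoing (resp. incoming) edges or only trivial ones; the edge-dictionary representation and the example graph are our own, and this is not the formal definition.

```python
from collections import defaultdict

def satisfies_complete_kv_condition(edges, k, v):
    """Check that every node has at least k outgoing (resp. incoming) edges
    or only trivial ones, i.e., edges whose label L(x -> y) >= v.
    'edges' maps (x, y) -> label."""
    out_labels, in_labels = defaultdict(list), defaultdict(list)
    for (x, y), label in edges.items():
        out_labels[x].append(label)
        in_labels[y].append(label)
    for node in set(out_labels) | set(in_labels):
        for labels in (out_labels[node], in_labels[node]):
            nontrivial = [l for l in labels if l < v]
            if nontrivial and len(labels) < k:
                return False
    return True

edges = {("login", "search"): 5, ("login", "buy"): 5,
         ("search", "buy"): 2, ("search", "logout"): 2}
print(satisfies_complete_kv_condition(edges, k=2, v=3))  # False: 'logout' fails
```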

4.2.5.3 Algorithm Generating a Partial (k, v)-Untraceable History Graph

The details of the algorithm are denoted as Algorithm 4.1, where \(oe_t\) and \(ie_t\) are defined as the number of outgoing edges and incoming edges of a node t, respectively. The algorithm for generating a partial (k, v)-untraceable history graph is as follows:

  1. This step corresponds to lines 1 to 3 of the detailed algorithm. For the input of a user history graph G, the algorithm adds a virtual incoming edge \((s_r \rightarrow r)\) to each node \(r \in start\) until the number of incoming edges equals the number of outgoing edges. Then, the algorithm adds a virtual outgoing edge \((q \rightarrow u_q)\) to each node \(q \in end\) until the number of outgoing edges equals the number of incoming edges. The label of a virtual incoming edge \(L(s_x \rightarrow x)\) denotes the number of users who perform the action first, and the label of a virtual outgoing edge \(L(y \rightarrow u_y)\) denotes the number of users who perform the action last.

  2. This step corresponds to lines 4 to 12 of the detailed algorithm. The algorithm searches for a node t that has fewer than k outgoing edges and for which all its lower nodes \(\mathcal {N}_{t \rightarrow end \setminus t}\) have fewer than k outgoing edges. Then, the algorithm removes all the outgoing edges \((t \rightarrow *)\) that satisfy \(L(t \rightarrow *) < v\). Next, the algorithm searches for a node \(t'\) that receives fewer than k incoming edges and for which all its upper nodes \(\mathcal {N}_{start \rightarrow t' \setminus t'}\) receive fewer than k incoming edges. Then, the algorithm removes all the incoming edges \((* \rightarrow t')\) that satisfy \(L(* \rightarrow t') <v\). The algorithm repeats this step until no node that meets the conditions is found.

  3. This step corresponds to lines 13, 14, and 15 of the detailed algorithm. The algorithm removes the virtual incoming and outgoing edges, removes nodes that have no edges, and outputs the modified graph.

Algorithm 4.1 Generating a partial (k, v)-untraceable history graph (pseudocode figure)

4.2.5.4 Algorithm Generating a Complete (k, v)-Untraceable History Graph

The details of the algorithm are denoted as Algorithm 4.2. The algorithm for generating a complete (k, v)-untraceable history graph is as follows:

  1. The algorithm first executes Algorithm 4.1 except line 13 and line 15.

  2. This step corresponds to lines 3 to 11 of the detailed algorithm. The algorithm searches for a node t that has fewer than k outgoing edges and removes all the outgoing edges \((t \rightarrow *)\) that satisfy \(L(t \rightarrow *) < v\), until no such node is found. Then, the algorithm searches for a node \(t'\) that receives fewer than k incoming edges and removes all the edges \((* \rightarrow t')\) that satisfy \(L(* \rightarrow t') < v\). The algorithm repeats this step until no node that meets the conditions is found. (A code sketch of this pruning loop is given after this list.)

  3. This step corresponds to lines 12, 13, and 14 of the detailed algorithm. The algorithm removes the virtual edges, removes nodes to which no edge is connected, and outputs the modified graph.
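The following Python sketch captures only the core pruning loop of step 2; the virtual start/end edges of Algorithm 4.1 are omitted, the outgoing and incoming passes are merged into a single removal condition, and the edge representation is our own.

```python
from collections import defaultdict

def prune_to_complete_kv(edges, k, v):
    """While some endpoint of a non-trivial edge (label < v) has fewer than k
    outgoing resp. incoming edges, remove that edge. 'edges' maps (x, y) -> label."""
    E = dict(edges)
    changed = True
    while changed:
        changed = False
        out_deg, in_deg = defaultdict(int), defaultdict(int)
        for (x, y) in E:
            out_deg[x] += 1
            in_deg[y] += 1
        for (x, y), label in list(E.items()):
            if label < v and (out_deg[x] < k or in_deg[y] < k):
                del E[(x, y)]
                changed = True
    return E  # nodes left without any edge simply disappear from the result

edges = {("login", "search"): 5, ("login", "buy"): 5,
         ("search", "buy"): 2, ("search", "logout"): 2}
print(prune_to_complete_kv(edges, k=2, v=3))
# {('login', 'search'): 5, ('login', 'buy'): 5}
```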

4.2.6 Other Notions

Differential privacy [20, 21] is a notion of privacy for perturbative methods based on the statistical distance between two database tables differing by at most one element. The basic idea is that, regardless of background knowledge, an adversary with access to the dataset draws the same conclusions whether or not a person's data are included in the dataset. That is, a person's data have an insignificant effect on the processing of a query. Differential privacy is mainly studied in relation to perturbation methods [22,23,24] in an interactive setting. Attempts to apply differential privacy to search queries have been discussed in [25]. Li et al. proposed a matrix mechanism [26] applicable to predicate counting queries under a differential privacy setting. Computational relaxations of differential privacy were discussed in [27,28,29]. Another approach for quantifying privacy leakage is an information-theoretic definition proposed by Clarkson and Schneider [30]. They modeled an anonymizer as a program that receives two inputs: a user's query and a database response to the query. The program acts as a noisy communication channel and produces an anonymized response as the output. Hsu et al. provide a generalized notion [31] in decision theory for modeling the value of personal information. An alternative model for the quantification of personal information is proposed in [32]; in that model, the value of personal information is estimated by the expected cost that the user has to pay to obtain perfect knowledge from given privacy information. Furthermore, the sensitivity of different attribute values is taken into account in the average benefit and cost models proposed by Chiang et al. [33]. Krause and Horvitz presented utility-privacy tradeoffs in online services [34, 35].

4.2.7 Combination of Anonymization Techniques

A combination of anonymization methods leads to the construction of datasets that are useful and that preserve privacy. Some countries publish census data and combine several anonymization methods, such as generalization, noise addition, and sampling [36, 37]. However, some problems remain. One problem is that it is difficult to evaluate the privacy risks of anonymized datasets when anonymization methods are combined. Some research is available on the relationships among anonymization methods. Chaudhuri et al. proposed \((c, \epsilon , \delta )\)-privacy [38] and studied the relationship between sampling and differential privacy [39]. Li et al. proposed \((\beta , \epsilon , \delta )\)-differential privacy and studied the relationship among sampling, differential privacy, and k-anonymity. Soria-Comas et al. proposed a k-anonymization algorithm for differential privacy using an insensitive algorithm [40].

4.3 (p, N)-Identifiability

4.3.1 Common Adversary Model

Existing privacy measures assume protection against idealized attackers, which makes it difficult to maintain data utility while assessing the reidentification risk. We designed adversary models that describe more realistic attackers by structuring the real setting in which the attackers operate. In the case of exchanging anonymized datasets between companies, for instance, a data-providing company first anonymizes and encrypts datasets for transmission to a receiver company via a secure channel. The receiver company places the dataset in a secure room and allows only authorized employees to access the anonymized dataset. This process reduces the reidentification risk for the anonymized dataset; it also specifies who the attacker can be and limits the attacker's ability to access datasets, so that the attacker knows at most the quasi-identifiers of neighbors or acquaintances. For example, it seems quite rare for an attacker to know all the quasi-identifiers of a target merely because the target is a neighbor of the attacker. Thus, a more stringent analysis of the reidentification risk can be achieved when we assume a more realistic situation, such as the attacker having only limited knowledge of the victim.

Access rights to an anonymized dataset may be given to attackers, and attackers may acquire some information about the original dataset or obtain the anonymization algorithm used to generate the anonymized dataset. Information about the original dataset can be categorized into three parts: information on a specific record, such as that of a neighbor; the original dataset itself; and any other information except the target information that the attacker is seeking. The case of William Weld, who was governor of Massachusetts [41], is a typical example of reidentification, and an attack on the Netflix Prize dataset was carried out by a strong attacker who gained access to the Internet Movie Database [42].

We can consider the abilities of an attacker in two areas: knowledge about the dataset and the ability to simulate anonymization algorithms. Many previous studies, such as [43, 44], assumed that an attacker has all the information required except knowledge of the target of the attack. In this chapter, we consider an attacker who has knowledge of only the target record and can simulate anonymization algorithms to obtain anonymized records that may correspond to the target record.

4.3.1.1 Definitions of Actual Attackers

Generally, when an anonymized dataset is published on the Web, anyone who can access the dataset is a potential attacker; thus, the adversary model must be the idealized one, because we cannot assume that only a limited-knowledge adversary exists and have to consider all possible adversaries. On the other hand, when the dataset is managed under strict controls, the adversary need not be modeled as an unlimited-knowledge adversary. We design two realistic adversary models under the assumption that the dataset is managed in a restricted (nonpublic) area and only a limited set of attackers can access it; we then propose a privacy metric for privacy risk analysis.

Definition 4.5

(Anonymization Simulator \(f_{sim}\)) Let \(D_0\) with \(n_0\) records, \(D_1\) with \(n_1\) records, \(r^x_i[QI]\), and \(r^x_i[SI]\) be an original dataset, an anonymized dataset generated from the original dataset, the quasi-identifiers of a record \(r^x_i \in D_x\), and sensitive information from the record \(r^x_i \in D_x\), respectively. An anonymization simulator \(f_{sim}\) simulates an anonymization algorithm used to generate an anonymized dataset as an oracle and outputs \(r^1_i[QI] \in D_1\) for the input \(r^0_i[QI] \in D_0\). That is, \(f_{sim}: r^0_j[QI] \rightarrow \left\{ \mathbf{r}^1[QI], \bot \right\} \), where \(\mathbf{r}^1[QI]\) is a set of \(r^1_i[QI]\) and no output is produced in the case of \(\bot \).

The simulator is a deterministic process for deterministic anonymization, such as top-coding and bottom-coding, and a probabilistic process for probabilistic anonymization, such as random sampling. The simulator may access \(D_0\) in order to simulate the anonymization algorithm, even though no adversary can access \(D_0\). Next, we define two adversary models.

Definition 4.6

(Deanonymizer for Anonymized Datasets, \(\mathcal {DA}\)) When \(\exists _1 r_j^0[QI] \in D_0\), \(\forall r^1_i[QI||SI] \in D_1\) and \(f_{sim}\) are given, a deanonymizer \(\mathcal {DA}\) lines up potential candidates \(r^1_i\) corresponding to \(r^0_j\) by executing the simulator \(f_{sim}\); then, the deanonymizer \(\mathcal {DA}\) outputs a list of candidates \(r^1_i[QI||SI]\) for \(r^0_j\), where the number of records in the list is \(n_q\), the number of sensitive information items in the list is \(n_s\) and \(0 \le n_s \le n_q \le n_0\).

If an attacker knows the actual anonymization function f, the attacker can use f as \(f_{sim}\), and the evaluation result should be more credible.

Definition 4.7

(Reidentifying Adversary versus Anonymized Datasets) When \(\exists _1 r_j^0[QI] \in D_0\), \(\forall r^1_i[QI||SI] \in D_1\) and \(f_{sim}\) are given, a reidentifying adversary executes the deanonymizer \(\mathcal {DA}\) and identifies \(r^1_i\), the record of the same person as the record \(r^0_j\), from among the records in the dataset \(D_1\), where \(r^0_j \in D_0\) is given. The success probability of the attack is calculated as \(1/n_q\) when \(r^1_j\) is included in the output of \(\mathcal {DA}\); otherwise, it is 0.

Assuming an attacker who has \(\exists _1 r_j^0[QI] \in D_0\) is the same as assuming \(|D_0|\) attackers who have \(r^0_j(j=1,...,|D_0|) \in D_0\).

Definition 4.8

(Revealing Adversary versus Anonymized Datasets) When \(\exists _1 r_j^0[QI] \in D_0\), \(\forall r^1_i[QI||SI] \in D_1\) and \(f_{sim}\) are given, a revealing adversary executes the deanonymizer \(\mathcal {DA}\) and finds a \(r^0_j[SI]\) from \(r^1_i[SI]\) such that \(r^1_i\) is a record of the same person as the record \(r^0_j\). The success probability of the attack is calculated as \(1/n_s\) when \(r^1_j\) is included in the output of \(\mathcal {DA}\); otherwise, it is zero.

A revealing adversary does not try to identify the record but tries to access sensitive information. In other words, the attacker seeks only to obtain sensitive information about the person behind the record in question. More precisely, the success probability of the revealing adversary can be calculated as \([n_s]/n_q\), where \([n_s]\) is the number of correct sensitive items in the list, but the probability itself may be uncertain. For example, when the probability is 0.99, some attackers will be convinced that the target takes the majority value. Furthermore, in the case that the deanonymizer \(\mathcal {DA}\) is leaked and the \(f_{sim}\) used in the deanonymizer is a deterministic process, an attacker can infer the sensitive information of \(r_j^0\). On the other hand, when the \(f_{sim}\) used in the deanonymizer is a probabilistic process, the output cannot be determined with certainty even if \(\mathcal {DA}\) is leaked.

4.3.1.2 (p, N)-Identifiability

Here, we assume that anonymized datasets are strictly controlled and that the attacker has knowledge of a specific record and the anonymization algorithms. We assume that the attacker is the strongest type of attacker and has knowledge of the most characteristic record. Nevertheless, it is difficult to quantify this characteristic, so we assume that each attacker has an original record. In other words, we assume there are as many attackers as there are original records.

Definition 4.9

((p, N)-identifiability) Let p be the success probability for an adversary who has \(\exists _1 r^0[QI] \in D_0\), \(\forall r^1_i[QI||SI] \in D_1\) and \(f_{sim}\), and N be the number of adversaries whose attack success probability is p.

The probability p is the conditional probability that the adversary can select the correct record from the list produced by the deanonymizer \(\mathcal {DA}\), given that the correct record is included in the list. The probability that the deanonymizer successfully produces a list including the correct record depends on the anonymization algorithms.

Our model can be extended to an adversary who has knowledge of two or more records. For simplicity, we use an adversary model that knows a single record and consider N single-knowledge adversaries in our risk analysis. The idea of (p, N)-identifiability is studied in [2].
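As a concrete illustration (our own formulation, not taken from the source), the metric can be computed from the candidate lists produced by the deanonymizer as follows; the data layout and example values are hypothetical.

```python
from fractions import Fraction

def p_n_identifiability(candidate_lists, true_ids):
    """One reidentifying adversary per original record. Adversary j succeeds
    with probability 1/n_q if the true anonymized record of j is among its n_q
    candidates, and 0 otherwise. p is the largest such probability and N the
    number of adversaries attaining it. 'candidate_lists[j]' is the candidate
    id list output by the deanonymizer DA for original record j; 'true_ids[j]'
    is the id of j's anonymized record (None if it was not released)."""
    probs = []
    for cands, true_id in zip(candidate_lists, true_ids):
        if true_id is not None and true_id in cands:
            probs.append(Fraction(1, len(cands)))
        else:
            probs.append(Fraction(0))
    p = max(probs)
    N = sum(1 for q in probs if q == p and p > 0)
    return p, N

# Toy example: record 0 is uniquely reidentified, record 2 was sampled out.
print(p_n_identifiability([[10], [10, 11], [12, 13]], [10, 11, None]))
# (Fraction(1, 1), 1)
```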

4.3.2 Success Probability Analysis Based on the Common Adversary Model

In this section, we assume the attackers described in the previous section and explain the calculation to obtain the success probability of attacks on representative anonymization methods: generalization, noise addition, and sampling. We consider that \(f_{sim}\) is constructed as a typical combined algorithm selected from three anonymization algorithms, \(f_{generalization}, f_{sampling}\) and \(f_{noise}\). We explain the above three anonymization algorithms and show combined anonymization using an example dataset.

4.3.2.1 Generalization

We include deletion of records or cells and top- or bottom-coding as steps in generalization. One step of \(f_{generalization}\) is similar to k-anonymity in that it checks the number of identical combinations of quasi-identifiers. When an anonymized dataset has k-anonymity, p equals 1/k. k-anonymity is an intuitive privacy metric, but the greater the number of attributes, the more difficult it is for a dataset to achieve k-anonymity. If an attacker has generalization trees for each attribute, the attacker adds the records that satisfy the requirements of the trees to the list of candidates. When there is a record whose address attribute is Tokyo, for instance, an attacker who has the generalization tree adds records whose addresses are in the Kanto region as well as records whose addresses are in Eastern Japan to the list of candidates. It is reasonable to assume that an attacker can infer the generalization tree; in our experiment, \(f_{sim}\) is therefore considered capable of accessing the generalization trees of each attribute.

4.3.2.2 Random Sampling

When an attacker who has one original record is assumed, the privacy risk differs greatly among original datasets. Consider an original dataset with many unique records, and assume that random sampling is implemented. Let M be the number of unique records and \(\alpha \) be the sampling rate. The probability that no unique record appears in the sample is \((1-\alpha )^M\). Even when \(\alpha =0.1\) and \(M=44\), this probability is less than \(1 \%\). When a large dataset is anonymized, it is quite possible that there will be more than 44 unique records, which shows that even if sampling is implemented, a characteristic record may be identified or suspected.
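As a quick numerical check (our own arithmetic, using the values \(\alpha =0.1\) and \(M=44\) above):

$$\begin{aligned} (1-\alpha )^M = 0.9^{44} \approx 0.0097 < 0.01 . \end{aligned}$$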

We evaluate sampling as follows: For simplicity, we consider the case where the anonymization method is only random sampling. When a unique record is sampled, an attacker who knows the person is certain that the record is for that person. Thus, the probability p does not change. On the other hand, sampling reduces the number of unique records, and N decreases accordingly. When unique records are very few and do not appear in an anonymized dataset, p decreases. We apply this approach to the case of combining different anonymization methods.

The approaches to sampling vary, and we can also consider \(f_{sampling}\) in various ways. For instance, the probability of disclosing the identity of any individual is evaluated by using the posterior probability of population uniqueness [45].

4.3.2.3 Noise Addition

There are two cases of noise addition: One is adding noise to the numerical data itself, and the other is adding noise to its quantity. In the former case, the data consist of original numerical data or data anonymized by a process, such as microaggregation, and in the latter case, the data are original quantity data or anonymized data, such as 11–20 in the age attribute.

In the former case, we can consider \(f_{noise}\) as follows. Noise is added based on a probability distribution, such as normal, Laplace, and exponential distributions. In particular, it has been mathematically proven that adding Laplace noise to the output of some queries achieves differential privacy [39], so this type of noise is widely used. Therefore, when an anonymized record is included in the 90 or \(95 \%\) confidence interval, the record is added to the list of candidates. More simply, when original data and anonymized data have small differences such as 10 or \(20 \%\) for each attribute, the attacker may consider the possibility that they are the same.

In the latter case, we cannot use the same method. When a record has the value 72 and it is anonymized to 95, for instance, an attacker whose target is a specific person may not regard that record as belonging to the target. However, the attacker can link them after top-coding is executed and the value is changed to 70-. On the other hand, when a record with the value 19 is anonymized to 20 and then generalized to 20–29, the attacker may not link them. One approach for \(f_{noise}\) is to assume that the group of each attribute can change to a neighboring group and to output such records as candidates. As in the generalization step, an attacker can infer the neighboring group of each group, and \(f_{noise}\) can be thought of as defining a distance between the classifications.

The description above shows that when the order of anonymization is changed, \(f_{sim}\) will also be changed.

4.3.2.4 Combination of Anonymization Methods

The principles of each anonymization method can be combined by evaluating each anonymization step by step. Stated differently, an attacker has \(f_{generalization}, f_{sampling}\), and \(f_{noise}\) as \(f_{sim}\). We show examples of combined cases using a sample dataset (Fig. 4.1). An attacker should change his or her approach when the order of anonymization is changed, if he or she knows this fact. We assume five attacker models, \(A_1\) to \(A_5\), in the following example, and the candidates of each attacker model are represented as \(C_1\) to \(C_5\). We denote by \(C_i\) of \(r_j\) in the following figures the candidates of an attacker \(A_i\) who has \(r_j\) as a target. The adversary model for \(A_1\) to \(A_4\) is the reidentifying adversary defined in Definition 4.7, and the adversary model in Fig. 4.4 is the revealing adversary defined in Definition 4.8.

Fig. 4.1 Sample dataset

Let the conditions of attackers be as follows: \(A_1\) and \(A_3\) do not consider noise-adding and generalization but simply compare \(r^1_i \in D_1\) with \(r^0_j \in D_0\). This is one approach to \(f_{noise}\) and \(f_{generalization}\). On the other hand, \(A_2\), \(A_4\), and \(A_5\) do consider the added noise and generalization. We define the noise addition shown in Fig. 4.2 as follows: the classifications of each attribute change to the next classification with a certain probability. We assume \(A_2\) knows the rule of noise addition and that \(f_{noise}\) of \(A_2\) outputs candidates that have a different classification in one attribute from an original record. On the other hand, let a small amount of noise be added in step (a) of Figs. 4.3 and 4.4. We assume the attackers \(A_4\) and \(A_5\) know the rule and that \(f_{noise}\) of \(A_4\) and \(A_5\) outputs candidates whose values of \(ATTR_1\) are different but within 2 from the original record and whose values of \(ATTR_2\) are different but within 4 from the original record. In the figures, the boldface sections show that the classifications are not correct but are within the permissible range for \(f_{noise}\) of \(A_2, A_4\), and \(A_5\): The red boldface sections show that there are substantial distances from the original values and that attackers who have the record cannot link them.

Fig. 4.2 Sample anonymization and the result of simulation attack 1

Fig. 4.3 Sample anonymization and the result of simulation attack 2

Fig. 4.4 Sample anonymization and the result of simulation attack 3

4.3.2.5 Examples of Analyses

The Case of \(A_1\)

Generalization, noise addition, and sampling are executed as anonymization methods in Fig. 4.2. In the generalization step (a), all records are generalized so that the value range of each attribute is divided into equal parts. As a result, only \(r_2\) is unique, and this dataset has (1, 1)-identifiability.

In step (b), \(r_1, r_4\), and \(r_6\) are changed by the addition of noise. As a result, \(r_1\) and \(r_2\) are indistinguishable. \(r_3, r_4\), and \(r_7\) are also indistinguishable, but \(r_5\) and \(r_6\) become unique. We define \(A_1\) as not considering the addition of noise, so that an attacker who has \(r_6\) cannot link the original record but an attacker who has \(r_5\) can. Therefore, identifiability becomes (1, 1)-identifiability.

After sampling, in step (c), \(r_2, r_4\), and \(r_5\) do not appear. Then, \(r_3\) and \(r_7\) become the focus, and identifiability becomes (1/2, 2)-identifiability. This attacker simply checks how many identical records there are in the dataset. Even if various anonymization methods are implemented, some records may not be affected. Therefore, it is important to assume such attackers. When we can say that a dataset has a certain level of privacy against such attackers, it means that an attacker cannot link the target with the original record by accident.

The Case of \(A_2\)

We omit the explanation of step (a) because noise is not added. In step (b), the attacker with \(r_1\), for example, chooses \(r_1, r_2, r_5\), and \(r_6\) as candidates because one or more of their attributes match \(r_1=\{\)-30, 175-}. On the other hand, an attacker with \(r_4\) cannot output candidates because both attributes of \(r_4\) are changed. Hence, identifiability is (1/4, 2)-identifiability. In step (c), \(r_5\) does not appear, and identifiability becomes (1/4, 1)-identifiability.

The Case of \(A_3\)

In Fig. 4.3, the dataset is anonymized by the addition of noise, generalization, and sampling.

In the case of \(A_3\), the dataset with added noise is safe enough from attackers who do not consider the added noise, and we omit this case; however, this does not mean that noise addition is safe, and when another attacker, such as \(A_4\), is considered, the result should be different. In step (b), we focus on the attacker with \(r_3\). This is the strongest attacker, and this attacker suspects that \(r_2\) and \(r_3\) are the candidates. More specifically, the scope is \(r_3 = \left\{ 38, 165\right\} = \{31\)-, -\(174\}\), and \(r_2, r_3\) meet the requirement. The attacker with \(r_2\) seems to have the same risk but cannot identify the actual target, because the noise added to \(ATTR_2\) is large enough that \(r_2\) is not a possible candidate. Hence, the identifiability becomes (1/2, 1)-identifiability. In step (c), \(r_3\) does not appear, and the privacy risk is (1/3, 1)-identifiability.

The Case of \(A_4\)

Next, we show the case of \(A_4\). In step (a), every record but \(r_1\) and \(r_7\) has enough added noise, and attackers cannot infer which is the correct record. The attacker with \(r_7\) regards the records within \(\left\{ 33 \pm 2, 173 \pm 4\right\} \) as candidates. Only \(r_7\) satisfies the condition, and the privacy risk is (1, 1)-identifiability. In step (b), the effect of noise addition becomes weak, and the number of attackers who should be considered increases. The attacker with \(r_6\), for instance, regards the records within \(\left\{ 29 \pm 2, 171 \pm 4\right\} = \{(\)-30, 31-), (-174, 175-\()\}\), namely, all records, as candidates. The privacy risk becomes (1/2, 1)-identifiability after generalization is finished. In step (c), similar to the previous steps, the privacy risk becomes (1/3, 1)-identifiability.

The Case of \(A_5\)

Finally, we show an example of a revealing adversary.

An attacker can claim success when the sensitive information \(ATTR_S\) of the target can be correctly identified. Step (a) is similar to that of the case of \(A_4\). In step (b), the attacker with \(r_3\) suspects that \(r_2\) and \(r_3\) are the candidates. Their \(ATTR_S\) values are, however, both "Office," so the attacker can claim to have identified the person's sensitive value. Thus, the privacy risk is \((2/2=1,1)\)-identifiability, which is similar to l-diversity. In step (c), the attacker with \(r_1\) suspects that \(r_1, r_4\) and \(r_6\) are the candidates; the \(ATTR_S\) of \(r_1\) is "Hospital," and that of the others is "Shop." Therefore, the probability of revealing the sensitive value is 1/2. More precisely, the probability is 1/3 because there are three candidates and one is correct, but the probability may be important information for the attacker with \(r_1\). The same can be said of the attacker with \(r_7\); therefore, the risk according to our definition is (1/2, 2)-identifiability.

As described above, when the adversary model is different, the resulting risk is also different. When we assume attackers who disregard noise, we capture the risk to records whose fluctuation due to anonymization is small. On the other hand, when we assume attackers who do consider the actual added noise, we capture the risk to the dataset as a whole. Moreover, strong attackers can be assumed to use the inverse function of the actual noise or anonymization method. In the case that noise based on a normal distribution is added, for instance, an optimal distance-based record linkage can be performed [46].

It is important to consider the various types of attackers in this way, because the most important factor of privacy is the inability to definitely link an anonymized record \(X'\) and original record X. Our metrics ensure that the attackers considered can neither identify a record nor make an identification by chance, by considering many attackers.

4.3.2.6 Implementation of the Analysis Algorithm

Processing time is a problem when our metric is applied to a large dataset. In this section, we discuss this problem.

First, we have to evaluate the risk from attackers with each record, and when sampling is implemented, the candidates in each record need to be preserved across the sampling. However, we do not need to store the candidates for every record or the records that have certain risks because the metric does not consider attackers who have knowledge of a record that does not have the highest risk. Moreover, when anonymization and evaluation are performed repeatedly, it takes a long time to evaluate the risk because the same number of attackers as the number of records are assumed. Thus, a threshold risk can be introduced to resolve the problem. When the risk of an attack does not exceed the threshold, attackers do not need to be evaluated. It is possible, however, that the risk may increase depending on the situation (see \(r_5, r_6\) in Fig. 4.2). Therefore, when a threshold is introduced, the accuracy of the privacy risk may worsen. We describe the pseudocode of risk analysis as follows:

Pseudocode of the risk analysis algorithm (figure)
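Since the pseudocode itself is given only as a figure, the following Python sketch reproduces the loop described above under our own assumptions: one attacker per original record, a simulator \(f_{sim}\) that returns candidate indices, and the convention that the anonymized record of original record j, if released, keeps index j.

```python
def analyze_risk(original_records, anonymized_records, f_sim, threshold=0.0):
    """One attacker per original record; each attacker runs f_sim to obtain
    the indices of candidate anonymized records and succeeds with probability
    1/n_q if its true record is among the n_q candidates. Attackers whose
    probability does not exceed 'threshold' are skipped, as discussed above."""
    best_p, n_best = 0.0, 0
    for j, r0 in enumerate(original_records):
        candidates = f_sim(r0, anonymized_records)
        if not candidates or j not in candidates:  # true record missed or sampled out
            continue
        p = 1.0 / len(candidates)
        if p <= threshold:
            continue
        if p > best_p:
            best_p, n_best = p, 1
        elif p == best_p:
            n_best += 1
    return best_p, n_best  # (p, N)-identifiability

# Example f_sim: candidates are released records within +/-5% on every attribute.
def f_sim(r0, released):
    return [i for i, r1 in enumerate(released)
            if all(abs(a - b) <= 0.05 * abs(a) for a, b in zip(r0, r1))]

original = [[38.0, 165.0], [29.0, 171.0], [45.0, 160.0]]
released = [[38.5, 166.0], [29.5, 170.0], [45.5, 161.0]]
print(analyze_risk(original, released, f_sim))  # (1.0, 3)
```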

Second, the attackers do not have to compare their records with every record, because the method of evaluation is similar to that of k-anonymity and the attackers only need to compare their records with a representative of each group. For instance, the attackers need to compare their records only with \(\{\)-30, 175-\(\},\{31\)-,-\(174\}\), and \(\{31\)-, 175-\(\}\) in (b) of Fig. 4.3. However, when the levels of generalization differ, such a method cannot be applied, and every record should be checked. To mitigate this problem, we first count the number of distinct values of each attribute and then compare each attribute of \(r^0_j\) with that of each record of \(D_1\), starting from the attributes with the largest numbers of distinct values.

Finally, when the procedure for anonymization is known in advance, it is possible to perform the evaluation more quickly by considering the effect of the initial part of the anonymization. For instance, in Fig. 4.3a, we only have to consider cells whose values do not exceed 30 in \(ATTR_1\) or fall short of 174 in \(ATTR_2\).

4.3.3 Experiment

4.3.3.1 Experimental Environments

We conducted experiments to evaluate the validity of the proposed metrics. We measured the time required to output the risk and confirmed that the privacy metric is appropriate. We used three parameters, \(k, \beta , \epsilon \), for comparison and verified the relationships among k-anonymity, sampling, and noise addition. We implemented our risk analysis method on a PC with an Intel Core i7-4790 3.6-GHz CPU and 16.0 GB of memory.

Fig. 4.5 Distribution of TC

Fig. 4.6 Distribution of HbA1c

4.3.3.2 Dataset and Adversary Model

We used a pseudomedical dataset based on an actual medical dataset. The dataset had 10,000 records and two attributes, total cholesterol (TC) and HbA1c; the distribution of each attribute is shown in Figs. 4.5 and 4.6. We first measured the computation time while changing the number of records and then evaluated the validity of our metrics while changing the parameters of each anonymization method. Noise addition, generalization, and sampling were used as representative anonymization methods, and we adopted the Mondrian algorithm [9] for k-anonymization, Laplace noise for noise addition, and random sampling for sampling. We assumed the reidentifying adversaries \(A_1\) to \(A_4\). The conditions of the attacker models are the same as those of Sect. 4.3.2.4 except for noise addition: we define \(f_{noise}\) of \(A_2\) and \(A_4\) to output, as candidates, records whose value for each attribute differs by at most \(5 \%\) from the original value.

4.3.4 Results

4.3.4.1 Computational Complexity

Our proposed privacy metrics are intended to be applicable to large datasets. We measured the execution time while changing the number of records (Table 4.1) and the parameters (Tables 4.2, 4.3 and 4.4).

Table 4.1 Execution time

It takes little time to evaluate the risk when simple attackers, such as \(A_1\) and \(A_3\), are considered. On the other hand, when reflective attackers are assumed, the number of calculations increases and more time is required for evaluation. However, some of the processing described above reduces the time. For instance, the number of combinations of attributes increases with increasing numbers of records, and once an attacker has checked the risk of a record, that attacker does not have to calculate the risk of other records that have the same values. Therefore, the analysis algorithm is appropriate for large datasets.

Table 4.2 The case of \(\epsilon =0.5, k=2\)

When the sampling rate is changed, the computation time differs depending on the attacker. This is because there are two loop processes, one for sampled records and one for nonsampled records, and the calculation methods of each process differ depending on the attacker.

Table 4.3 The case of \(\beta =0.05, k=2\)

Noise addition has little effect on computation time in this experiment, but when a very large amount of noise is added, the distribution of the records becomes more uniform and the number of distinct records increases; as a result, the computation time may increase.

Table 4.4 The case of \(\beta =0.05,\epsilon =0.5\)

The effect of k-anonymity also seems minimal, but when k is large the number of different types of records decreases and the computation time may decrease.

4.3.4.2 Validation

We observed p and N while changing the sampling rate \(\beta \) and the noise parameter \(\epsilon \) to verify the validity of our metrics. We evaluated the attacker model \(A_4\) while changing the parameters \(k, \beta \), and \(\epsilon \). The evaluation results are shown below (Tables 4.5 and 4.6).

Table 4.5 Relationship among parameters and our metrics (p, N)
Table 4.6 Relationship among parameters and our metrics (p, N)
Table 4.7 Case of \(\beta =0.05,\epsilon =1.0\)

The privacy risk decreases as k increases and as \(\beta \) and \(\epsilon \) decrease, which shows that the proposed measure behaves as a valid privacy metric. The sampling rate is the key factor that reduces the risk in this experiment. There are some outliers in the datasets, and they are the cause of the risk: if such records are not sampled, the privacy risk decreases. We conducted this experiment multiple times, and the result was different each time. Table 4.7 presents a sample of the evaluation results. Some outliers were included in the third run, and the risk was higher than in the other runs. Therefore, the key factor may change when outliers are removed in advance.

4.4 Extension to Time-Sequence Data

4.4.1 Privacy Definition

We define two types of attack models for time-sequence datasets. The first, a reidentification attack, is a general attack model where an attacker has information on the original dataset M and tries to reidentify it in an anonymized dataset A(M). This model assumes that an attacker has maximal information about the original dataset. This model is the same as that of k-anonymization, where even if an attacker has an original dataset, the probability of the reidentification of a k-anonymized dataset is 1/k.

Definition 4.10

(Reidentification attack) Let an attacker have a matrix \(M_{t_1} \in \mathbb {R}^{n \times m}\) and an anonymized matrix \(A(M_{t_1}) \in \mathbb {R}^{n \times m}\). A reidentification attack against a record \(r_i\) succeeds if record \(r_{i} \in M_{t_1}\) is linked to record \(r'_{j} \in A(M_{t_1})\), where \(r_i\) and \(r'_j\) are the same user.

A linkage attack, which is an attack by a valid user, is one in which an attacker tries to obtain information from the given datasets \(A(M_{t_1})\) and \(A(M_{t_2})\). \(A(M_{t_1})\) and \(A(M_{t_2})\) are assumed to include the same users, but the primary keys are different. An attacker in this model has only anonymized datasets, so a valid user is assumed to be an attacker in this model. There are few studies concerning this problem, and we evaluate the risk using actual datasets in this chapter.

Definition 4.11

(Linkage attack) Let an attacker have two anonymized matrices, \(A(M_{t_1}) \in \mathbb {R}^{n \times m}\) and \(A(M_{t_2}) \in \mathbb {R}^{n \times m}\). \(M_{t_1}\) and \(M_{t_2}\) include the same users and items, where each user and item of \(M_{t_2}\) are the same as those of \(M_{t_1}\). A linkage attack against a record \(r_i\) succeeds if record \(r'_{i} \in A(M_{t_1})\) is linked to record \(r''_{j} \in A(M_{t_2})\), where \(r'_i\) and \(r''_j\) are the same user.

We next define the privacy metric as follows:

Definition 4.12

(Privacy metric) Let n be the total number of users of a dataset M and \(n'\) be the number of users that are successfully attacked. The privacy risk of M is defined as \(\frac{n'}{n}\).

We treat both attacks as instances of an assignment problem. An assignment problem is to find an appropriate assignment of tasks when there are n users and n tasks, and the Hungarian algorithm [47] solves the assignment problem in such a way that the total cost is minimal.

We apply the same algorithm to both reidentification and linkage attacks and assume that when an attacker assigns a record to the correct user, the attack succeeds. When a dataset is k-anonymized, there are at least \(k-1\) identical records. Hence, when a record is assigned to the cluster to which the correct record belongs, we regard the record as being assigned correctly even if the assigned record is not actually correct. Furthermore, we multiply the resulting ratio by 1/k to define the privacy metric, because a correctly assigned cluster identifies the correct record only with probability 1/k (Fig. 4.7).
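A minimal sketch of this evaluation using SciPy's Hungarian-algorithm solver is given below; the Euclidean cost, the cluster mapping, and the toy data are our own illustrative choices.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment  # Hungarian algorithm

def assignment_risk(M, A, cluster_of=None, k=1):
    """Assign every original record in M to an anonymized record in A by
    solving the assignment problem on Euclidean distances. An attack on user i
    succeeds if the assigned record is i itself or, for a k-anonymized dataset,
    lies in the same cluster as record i ('cluster_of' maps a record index,
    identical for M and A, to its cluster id). Risk = (successes / n) * (1 / k)."""
    M, A = np.asarray(M, float), np.asarray(A, float)
    cost = np.linalg.norm(M[:, None, :] - A[None, :, :], axis=2)
    rows, cols = linear_sum_assignment(cost)
    if cluster_of is None:
        successes = int(np.sum(rows == cols))
    else:
        successes = sum(cluster_of[i] == cluster_of[j] for i, j in zip(rows, cols))
    return successes / len(M) * (1.0 / k)

# Toy reidentification attack against a noise-added dataset (k = 1).
M = [[38, 165], [29, 171], [45, 160]]
A = [[37, 166], [30, 170], [46, 161]]
print(assignment_risk(M, A))  # 1.0: every record is re-assigned to the right user
```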

Fig. 4.7 Example of a risk evaluation

Figure 4.7 shows an example of a risk evaluation. The dataset on the left is the original dataset, and that on the right is the anonymized dataset. The arrows indicate the assignment result. User 2 of the original dataset, for instance, is assigned to user 3 of the anonymized dataset, so the attack on user 2 fails. When noise addition is used as the anonymization method, users 2, 3, 4, and 5 are assigned to the wrong users, and the privacy risk is 3/7. On the other hand, when k-anonymization with \(k=2\) is used, users 4 and 5 are assigned to the wrong users (blue arrows) but are assigned to the same clusters as those of the correct users. Therefore, we consider the attacks on users 4 and 5 to be successful. The failed attacks are only those on users 2 and 3 (red arrows), and the privacy risk is \(5/7 \times 1/2 = 5/14\).

4.4.2 Utility Definition

We define the utility metric here. In previous research, most utility metrics are based on either the distance between the original dataset and the anonymized dataset or the amount of information loss [48, 49]. However, the utility depends on the situation (i.e., context and use case), and these metrics do not necessarily match the actual utility. Therefore, we consider a use case scenario and present a utility definition that matches the scenario. Specifically, we consider a use case in which an anonymized dataset is used as training data for a machine learning algorithm. In the case of a Web access log dataset, for example, a client who develops anti-virus software may train a machine learning model on an anonymized dataset and predict whether a user will access a phishing Web site.

Definition 4.13

(Utility metric) Let F(M, E) be the F-measure of a machine learning model, where the training data are M and the test data are E. The utility metric is defined as follows:

$$\begin{aligned} Uti(A(M)) = \frac{F(A(M), E)}{F(M, E)}. \end{aligned}$$
(4.1)

Figure 4.8 gives an overview of the utility evaluation. We first generate two machine learning models: one from the original dataset and the other from its anonymized dataset. An item is randomly chosen as the objective variable, and the remaining items are the explanatory variables. Then, we use these models to predict an attribute of each record of an evaluation dataset that has the same attributes as the original dataset. This operation is performed several times while changing the objective variable. The utility is defined as the average ratio of the F-measure of a model of the anonymized dataset to that of a model of the corresponding original dataset. In this chapter, we apply logistic regression as the machine learning algorithm and predict fifty attributes.
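The following minimal sketch illustrates the computation of Definition 4.13, assuming binary-valued matrices, scikit-learn's LogisticRegression with its default parameters, and the F-measure from sklearn.metrics; the function names and the thresholding of anonymized values are illustrative assumptions.

```python
# A minimal sketch of the utility metric Uti(A(M)) = F(A(M), E) / F(M, E).
# Assumptions: M, A_M and the evaluation matrix E share the same columns,
# one column at a time serves as the objective variable, and logistic
# regression with default parameters is the learning algorithm.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score


def f_measure(train, test, target_col):
    # Train on all columns except target_col and score on the test data.
    # Columns with a single class should be skipped in practice.
    X_tr = np.delete(train, target_col, axis=1)
    y_tr = (train[:, target_col] > 0.5).astype(int)  # anonymized values may not be exactly 0/1
    X_te = np.delete(test, target_col, axis=1)
    y_te = (test[:, target_col] > 0.5).astype(int)
    model = LogisticRegression().fit(X_tr, y_tr)
    return f1_score(y_te, model.predict(X_te))


def utility(M, A_M, E, target_cols):
    # Average ratio of F-measures over the chosen objective columns.
    ratios = [f_measure(A_M, E, c) / f_measure(M, E, c) for c in target_cols]
    return float(np.mean(ratios))
```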

Fig. 4.8 Overview of utility evaluation

4.4.3 Matrix Factorization

Matrix factorization is a fundamental task in data analysis, and the technique is used in various scenarios, such as text data mining, acoustic analysis, and product recommendation by collaborative filtering. We use matrix factorization as an anonymization technique, so we present an overview of matrix factorization in this section.

4.4.3.1 SGD Matrix Factorization

We consider an unknown rank-r matrix \(M \in \mathbb {R}^{n \times m}\) and assume that we observe only the entries indexed by a set \(\Omega \subset [n] \times [m]\). The projection \(P_{\Omega }(M) \in \mathbb {R}^{n \times m}\) is defined as:

$$\begin{aligned} P_{\Omega }(M)_{ij}= {\left\{ \begin{array}{ll} M_{ij} & \mathrm{if}\; (i,j) \in \Omega , \\ 0 & \mathrm{otherwise}. \end{array}\right. } \end{aligned}$$
(4.2)

The goal of matrix factorization is to find two matrices \(U \in \mathbb {R}^{r \times n}\) and \(V \in \mathbb {R}^{r \times m}\) such that \(X = U^\mathsf{T}V\) approximates the original matrix on the observed entries, i.e., \(M_{ij} \approx X_{ij}\) for all \((i,j) \in \Omega \), with lower dimensionality \(r \ll \min (n, m)\).

This problem is defined to solve the following optimization problem:

$$\begin{aligned} \min _{U, V} \sum _{(i,j)\in \Omega }\left( (M_{ij} - u^\mathsf{T}_i v_j)^2 + \lambda (||u_i||^2 + ||v_j||^2)\right) , \end{aligned}$$
(4.3)

where \(u_i\) is the vector of user factors and \(v_j\) is the vector of item factors (the ith column of U and the jth column of V, respectively). When both \(u_i\) and \(v_j\) are treated as variables, the objective function is not convex, so the problem cannot be solved in closed form. Several techniques have been proposed to solve it; gradient descent [50], for example, is a fundamental technique for finding a local minimum. However, full gradient descent updates all vectors in every iteration and is computationally expensive, so stochastic gradient descent (SGD) is widely used, for example, in the KDD Cup 2011 [51] and the Netflix Prize [52].

There has been some research to speed up SGD-based matrix factorization, such as [53,54,55,56], and each algorithm updates the matrices in parallel or in a distributed manner.

In this chapter, we apply a simple SGD technique to optimize formula (4.3) and denote by Update(A) the update of a matrix A using the SGD technique.
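A minimal sketch of such an SGD update is given below; the initialization and the hyperparameter values (\(\gamma \), \(\lambda \), and the iteration count) follow those used later in this chapter, but the concrete implementation details are assumptions.

```python
# A minimal sketch of SGD matrix factorization for formula (4.3).
# U has shape (r, n) and V has shape (r, m), so that X = U^T V approximates
# M on the observed entries Omega.
import numpy as np


def sgd_mf(M, omega, r, iterations=100, gamma=0.05, lambda_=0.01, seed=0):
    rng = np.random.default_rng(seed)
    n, m = M.shape
    U = rng.normal(scale=0.1, size=(r, n))
    V = rng.normal(scale=0.1, size=(r, m))
    omega = np.array(list(omega))               # observed (i, j) index pairs
    for _ in range(iterations):
        rng.shuffle(omega)                      # visit observed entries in random order
        for i, j in omega:
            err = M[i, j] - U[:, i] @ V[:, j]   # residual of the current approximation
            u_i = U[:, i].copy()
            U[:, i] += gamma * (err * V[:, j] - lambda_ * U[:, i])
            V[:, j] += gamma * (err * u_i - lambda_ * V[:, j])
    return U, V
```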

4.4.4 Anonymization Using Matrix Factorization

We consider matrix factorization itself to be an anonymization method, where the rank r controls the accuracy of the matrix approximation. Moreover, we propose combining matrix factorization with another anonymization method ano, such as k-anonymization or noise addition. We denote by p the parameter of the anonymization method; p is k or \(\phi \) in this chapter. The basis matrix U and the weighting matrix V can be regarded as capturing the characteristics of the rows and columns, respectively; in our dataset, U is a characteristic matrix of the users. Therefore, we propose to anonymize U and keep V unchanged so that the characteristics of the domains are preserved. In our algorithm, we first divide the dataset M into U and V and anonymize U. Then, we optimize V once and recombine it with the anonymized U. The algorithm is described below.

We indicate that \(A_{r}(D)\) applies matrix factorization to matrix D and that \(A_{(ano,r)}(D)\) combines matrix factorization and the anonymization method ano by:

$$\begin{aligned} A_{(ano, r)}(D) = (A_{(ano)}(U))^\mathsf{T}V, \quad \mathrm{where}\;\; U \in \mathbb {R}^{r \times n},\; V \in \mathbb {R}^{r \times m}. \end{aligned}$$
(4.4)
Algorithm 4.1 Anonymization combining matrix factorization and the anonymization method ano
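The following sketch illustrates formula (4.4), reusing the sgd_mf function sketched in Sect. 4.4.3.1. Laplace noise addition is shown as one possible choice of the anonymization method ano, and the step of Algorithm 4.1 that re-optimizes V once after anonymizing U is omitted here for brevity; the helper names are illustrative assumptions.

```python
# A sketch of formula (4.4): anonymize the user-factor matrix U and
# recombine it with V.
import numpy as np


def add_laplace_noise(U_T, phi, seed=0):
    # One hypothetical choice of `ano`: Laplace noise with scale phi added to
    # each element of U^T (one row per user).
    rng = np.random.default_rng(seed)
    return U_T + rng.laplace(loc=0.0, scale=phi, size=U_T.shape)


def anonymize_with_mf(M, omega, r, ano, iterations=100):
    # Factorize M (sgd_mf is the SGD sketch above), anonymize the user
    # characteristics, and recombine: A_{(ano, r)}(M) = ano(U^T) V.
    U, V = sgd_mf(M, omega, r, iterations=iterations)
    return ano(U.T) @ V


# Example: combine matrix factorization (r = 20) with noise addition (phi = 0.15).
# A_M = anonymize_with_mf(M, omega, r=20, ano=lambda U_T: add_laplace_noise(U_T, 0.15))
```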

4.4.5 Experiment

4.4.5.1 Dataset

We use an actual Web access log dataset as a time-sequence dataset. The dataset consists of an ID, a time stamp, and the access domain, as shown in Table 4.8. We convert the dataset into a matrix as follows:

$$\begin{aligned} M_T = \begin{bmatrix} r_{11} & r_{12} & \cdots & r_{1m}\\ r_{21} & r_{22} & \cdots & r_{2m}\\ \vdots & \vdots & \ddots & \vdots \\ r_{n1} & r_{n2} & \cdots & r_{nm} \end{bmatrix} \end{aligned}$$
(4.5)

Here, T is the observation time.

Table 4.8 Dataset format

We set \(r_{ij} = 1\) if the user whose ID is i accesses domain j during time T, and \(r_{ij} = 0\) otherwise. For example, the matrices constructed from the dataset in Table 4.8 are as follows:

$$\begin{aligned} M_{t_1} = \begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \end{aligned}$$
(4.6)
$$\begin{aligned} M_{t_2} = \begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \end{aligned}$$
(4.7)

Here, \(t_1\) is the 10-min span between 2016-12-01 16:10:00 and 2016-12-01 16:19:59, and \(t_2\) is the following 10-min span between 2016-12-01 16:20:00 and 2016-12-01 16:29:59. The IDs are different between \(t_1\) and \(t_2\), but \(x_{t_1}\) and \(x_{t_2}\), and \(z_{t_1}\) and \(z_{t_2}\), represent the same users.
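The conversion can be sketched as follows, assuming the log is held in a pandas DataFrame with columns id, timestamp, and domain; the column names and function name are assumptions for illustration.

```python
# A sketch of the conversion from an access log (Table 4.8 format) to a
# binary user-domain matrix M_T for one observation window [start, end).
import pandas as pd


def log_to_matrix(log, start, end, users, domains):
    # Keep only the accesses inside the observation window.
    window = log[(log["timestamp"] >= start) & (log["timestamp"] < end)]
    # r_ij = 1 if user i accessed domain j at least once during the window.
    M = pd.DataFrame(0, index=users, columns=domains)
    for _, row in window.iterrows():
        if row["id"] in M.index and row["domain"] in M.columns:
            M.at[row["id"], row["domain"]] = 1
    return M.to_numpy()
```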

In the following experiments, we randomly chose 200 users and 1,000 domains from an actual Web access log and changed the pseudonymous IDs at each designated time T.

4.4.5.2 The Privacy Risk Against a Linkage Attack

First, we evaluate whether a linkage attack is possible. We set the observation time \(t_1\) as 2, 4, and 8 h from 16:00 on a weekday and the observation time \(t_2\) as the same time on another weekday. The probability of a linkage attack between \(M_{t_1}\) and \(M_{t_2}\) is shown in Table 4.9.

Table 4.9 Linkage attack against a non-anonymized dataset

Although the matrix only records whether each domain has been accessed, the linkage attack probability, i.e., the risk, is very high (over 50\(\%\)) even when the observation time is only 2 h. Moreover, the risk increases as the observation time increases, because a longer observation time makes each user's access pattern more distinctive. The results show that people's Web access patterns have consistent characteristics. Hence, we need to consider not only reidentification attacks but also linkage attacks to avoid privacy leakages.

4.4.5.3 Effects of Matrix Factorization

Observation times \(t_1\) and \(t_2\) are fixed as 8 h from 16:00 on a weekday in the following experiments. The inputs of matrix factorization are the original dataset M, the number of iterations I, and the rank r; \(\lambda \) and \(\gamma \) are the hyperparameters. We fix \(I = 100\), which is sufficient for convergence, \(\gamma = 0.05\), and \(\lambda = 0.01\). The convergence result is shown in Fig. 4.9. The rank r can be treated as the anonymization parameter of matrix factorization because the accuracy of the approximation \(X = U^\mathsf{T}V\) depends on r; we set \(r = 10, 20, 30, 40\). We used larger values in the experiments in [3], but the results for \(r > 40\) are saturated. The probabilities of reidentification and linkage attacks are shown in Table 4.10.

Fig. 4.9 Convergence result

Table 4.10 Attacks against matrix factorization

The results show that matrix factorization by itself does not have much effect against reidentification attacks. This is because matrix factorization preserves the relative positional relationship among the records, so the privacy risk under a matching algorithm does not decrease much. When the rank is small enough, e.g., \(r = 10\), the positional relationship is broken and the privacy risk is lowered.

On the other hand, the linkage attack probability between \(A_{r}(M_{t_1})\) and \(A_{r}(M_{t_2})\) is lower than the corresponding reidentification attack probability. This is because the relationship between the records of \(M_{t_1}\) and \(M_{t_2}\) is weaker than that between \(M_{t_1}\) and \(A_{r}(M_{t_1})\). In our experiment, the anonymized dataset with an 8-h observation time and \(r = 30\) has almost the same linkage-attack privacy level as the non-anonymized dataset with a 2-h observation time in Table 4.9 (Fig. 4.10).

Fig. 4.10 Overview of the experiment

4.4.6 Results

4.4.6.1 Risk Evaluation

We evaluate our anonymization method, Algorithm 4.1, in the following experiments. We apply the method described in [10] as k-anonymization and Laplace noise as the noise addition. When noise addition is applied, noise \(\epsilon \sim Lap(0, 2\phi ^2)\), i.e., Laplace noise with mean 0 and variance \(2\phi ^2\), is added to each element, and the anonymization parameter is \(\phi \). The following four experiments are performed:

1. Evaluate the privacy risk of a reidentification attack between \(A_{k}(M_{t_1})\) and \(M_{t_1}\) and a linkage attack between \(A_{k}(M_{t_1})\) and \(A_{k}(M_{t_2})\).

2. Evaluate the privacy risk of a reidentification attack between \(A_{\phi }(M_{t_1})\) and \(M_{t_1}\) and a linkage attack between \(A_{\phi }(M_{t_1})\) and \(A_{\phi }(M_{t_2})\).

3. Evaluate the privacy risk of reidentification attacks between \(A_{k}(U_{t_1})^\mathsf{T}V\) and \(M_{t_1}\) and linkage attacks between \(A_{k}(U_{t_1})^\mathsf{T}V\) and \(A_{k}(U_{t_2})^\mathsf{T}V\).

4. Evaluate the privacy risk of reidentification attacks between \(A_{\phi }(U_{t_1})^\mathsf{T}V\) and \(M_{t_1}\) and linkage attacks between \(A_{\phi }(U_{t_1})^\mathsf{T}V\) and \(A_{\phi }(U_{t_2})^\mathsf{T}V\).

The evaluations of the reidentification attacks in experiments 1 and 2 are almost the same as those conducted in many previous studies. The difference is the privacy metric (see Sect. 4.4.1), and these results are used for comparison with experiments 3 and 4, which are evaluations of our algorithm. There are few studies on linkage attacks, and the evaluations of this type of attack are one of our contributions.

The evaluation of the reidentification attack in experiment 1 (Table 4.11) is straightforward, and the result is almost the same as that of plain k-anonymization. However, our privacy metric differs slightly from that of k-anonymity, so the result also differs slightly from 1/k. The result of the linkage attack shows that k-anonymization greatly improves privacy against linkage attacks: 2-anonymization reduces the privacy risk by \(77\%\) (\(0.8 \rightarrow 0.185\)).

Table 4.11 Experiment 1

The results of experiment 2 are shown in Table 4.12. Privacy against the reidentification attack improves from \(\phi \ge 0.9\), and when \(\phi \) is large, for example, \(\phi = 1.5\), the risk appears low. However, with such noise, almost half of the elements are changed by more than 1, while each original value is binary, \(M_{ij} \in \left\{ 0, 1\right\} \); the noise is therefore too large to preserve utility. We conclude that simple noise addition is not a good anonymization method in terms of utility preservation. On the other hand, we obtain an interesting result for linkage attacks: privacy against linkage attacks improves even when the noise is very small, so adding even a small amount of noise is an effective countermeasure against a linkage attack.

Table 4.12 Experiment 2

In experiment 3, we evaluate the effect of our proposed algorithm, which combines matrix factorization and k-anonymization. Table 4.13 presents the result of the reidentification attack. In this experiment, the effect of matrix factorization is hard to observe, although privacy slightly improves as r increases. This is because k-anonymization dominates the reidentification risk, so the effect of matrix factorization barely appears.

Table 4.13 Experiment 3: reidentification attack

The results of the linkage attack in experiment 3 are shown in Table 4.14. In this experiment, we cannot obtain new insight into the effect of matrix factorization: when datasets observed in different time periods are sufficiently anonymized by k-anonymization, there is little relationship left between the records of the same user, and only outliers can be linked.

Table 4.14 Experiment 3: linkage attack

In experiment 4, we evaluate the impact of our method combining matrix factorization and noise addition. The evaluation results of the reidentification attack are presented in Table 4.15. Noise is added to U, which represents the users' characteristics, and the anonymized \(U^\mathsf{T}\) is then multiplied by V. Therefore, we cannot directly compare the results with those of experiment 2, but the impact of matrix factorization is high. This result shows that matrix factorization helps construct anonymized datasets flexibly from the viewpoint of privacy. For example, the privacy risks of \(A_{(\phi =0.15, r=20)}(M_{t_1})\) and \(A_{(\phi =0.20, r=40)}(M_{t_1})\) are almost the same as those of \(A_{(k=2)}(M_{t_1})\) and \(A_{(\phi =1.5)}(M_{t_1})\).

Table 4.15 Experiment 4: reidentification attack

The results of the linkage attack in experiment 4 are presented in Table 4.16. The trend is the same as that of the reidentification attack, and the matrix factorization is compatible with noise addition. We present the details of the results of the reidentification attack and the linkage attack in Figs. 4.11 and 4.12.

Table 4.16 Experiment 4: linkage attack
Fig. 4.11 Reidentification risk of the combination of matrix factorization and noise addition

Fig. 4.12 Linkage risk of the combination of matrix factorization and noise addition

4.4.6.2 Utility Evaluation

We next evaluate the utility of the anonymized datasets by applying a machine learning algorithm. Logistic regression (https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) with its default parameters is applied in the following experiments. One application of an access log dataset is to predict malicious sites and warn the Web browser's users, so we predict whether each user will access a malicious site. We train models on the original (non-anonymized) dataset and on the anonymized datasets and input the test dataset to these models. The utility score is defined in Definition 4.13, and the F-measure of the model of the original dataset is 0.763. The results are shown in Tables 4.17, 4.18, 4.19, 4.20, and 4.21. We perform the following four evaluations:

1. Evaluate the utility of \(A_{(k)}(M_{t_1})\) for \(k=2, 4, 6, 8\), and 10.

2. Evaluate the utility of \(A_{(\phi )}(M_{t_1})\) for \(\phi = 0.3, 0.6, 0.9, 1.2\), and 1.5.

3. Evaluate the utility of \(A_{(k=2, r)}(M_{t_1})\) for \(r = 10, 20, 30\), and 40.

4. Evaluate the utility of \(A_{(\phi , r)}(M_{t_1})\) for \(\phi = 0.1\) and 0.15 and \(r = 10, 20, 30\), and 40.

In experiment 1, each element is \(M_{ij}\in \left\{ 0, 1\right\} \) and the matrix is sparse, so k-anonymization preserves utility well. However, when the dataset is more complex, the utility under k-anonymization will decrease; this is widely known as the curse of dimensionality.

Table 4.17 Utility evaluation 1
Table 4.18 Utility evaluation 2
Table 4.19 Utility evaluation 3

The results of experiment 2 show that the utility of the dataset decreases as noise increases. As stated in the risk evaluation section, each element of the original dataset is 0 or 1, and the utility drastically worsens when the noise parameter is large, such as \(\phi = 1.5\).

When k-anonymization and matrix factorization are combined (experiment 3), the effect of matrix factorization is small, as was the case for the privacy risk: the effect of k-anonymization dominates, and that of matrix factorization is relatively small.

Table 4.20 Utility evaluation 4
Table 4.21 Utility evaluation 5

The evaluation results of the combination of noise addition and matrix factorization show good performance (Tables 4.20 and 4.21). A dataset generated by combining matrix factorization and noise addition preserves more utility than a dataset generated by noise addition alone at the same privacy level.

4.5 Anonymization and Privacy Risk Evaluation Tool

In this section, we introduce an anonymization and privacy risk evaluation tool. So far, we have shown how to evaluate the privacy and utility of several datasets; the tool focuses on static datasets and applies the theory described above. First, we explain the outline of the tool. The tool takes as input a dataset that is the target of anonymization and privacy risk evaluation. At this point, a data type is defined for each attribute (see Fig. 4.13): numerical, qualitative, set, code, and sensitive types can be defined. Age, height, and weight are defined as numerical types, and the user can assign a range of values; for instance, a user may want to group ages into two-year or five-year bins depending on the situation. Qualitative-type attributes have nonnumerical values, such as gender and occupation. The set type is an extension of the numerical or qualitative type for attributes that contain multiple values. The code type is used when every value has the same number of digits, such as a postcode. The sensitive type corresponds to sensitive information. The privacy risk is evaluated using quasi-identifiers in our tool, so sensitive attributes do not affect the privacy risk; however, it is known that sensitive information may cause privacy leakages, and the tool can also cover such risks through notions such as l-diversity.
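One possible way to express these per-attribute type definitions is a simple mapping, sketched below; the format and attribute names are illustrative assumptions and do not reflect the tool's actual configuration syntax.

```python
# An illustrative mapping of attributes to the data types described above;
# the tool's concrete configuration format is not shown here.
ATTRIBUTE_TYPES = {
    "age":      {"type": "numerical", "range": 5},  # group ages into 5-year bins
    "gender":   {"type": "qualitative"},
    "hobbies":  {"type": "set"},                    # attribute holding multiple values
    "postcode": {"type": "code"},                   # fixed-digit code
    "disease":  {"type": "sensitive"},              # not treated as a quasi-identifier
}
```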

Fig. 4.13 Anonymization and privacy risk evaluation tool 1

Fig. 4.14 Anonymization and privacy risk evaluation tool 2

After the type of each attribute is decided, the user defines the noise and sampling parameters; our tool can thus evaluate datasets anonymized by a combination of methods. Then, the user creates a hierarchical generalization tree for each attribute, and the tool generalizes the values in accordance with the tree. The user can create and modify the hierarchical trees through a UI (see Fig. 4.14).
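A small sketch of such a generalization hierarchy for the age attribute is shown below; the dictionary-based representation is an illustrative assumption, not the tool's internal format.

```python
# A sketch of a generalization hierarchy for the "age" attribute and a helper
# that generalizes a value to a given level.
AGE_HIERARCHY = {
    0: lambda age: str(age),                                        # level 0: exact value
    1: lambda age: f"{(age // 5) * 5}-{(age // 5) * 5 + 4}",        # level 1: 5-year bins
    2: lambda age: f"{(age // 20) * 20}-{(age // 20) * 20 + 19}",   # level 2: 20-year bins
    3: lambda age: "*",                                             # level 3: fully suppressed
}


def generalize(age: int, level: int) -> str:
    return AGE_HIERARCHY[level](age)


# Example: generalize(42, 1) -> "40-44", generalize(42, 2) -> "40-59".
```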

Fig. 4.15 Anonymization and privacy risk evaluation tool 3

After these preparations are finished, the user can define conditions and generate a dataset flexibly. A sample operation screen is shown in Fig. 4.15. As an example, let us introduce a commonly used procedure. First, the user searches for records that do not achieve k-anonymity, i.e., records whose combination of quasi-identifier values appears fewer than k times, and then raises the generalization level of an attribute of those records. Records that are already secure enough are left untouched, so the utility of the dataset can be maintained. The conditions can be more complex; for example, the records whose “age” is over 80 and whose “occupation” is not “self-employed” can be identified and anonymized, and the levels of the records are “balanced” according to the hierarchical tree. The privacy risk can be seen in real time (Fig. 4.16), and the user can anonymize a dataset by trial and error. The operation procedure can be output as a setting file, and once the procedure is fixed, it can be performed automatically, e.g., in batch processing.
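The first search step of this procedure can be sketched as follows, assuming the dataset is held in a pandas DataFrame; the quasi-identifier column names are assumptions for illustration.

```python
# A sketch of the record search described above: identify records whose
# combination of quasi-identifier values appears fewer than k times.
import pandas as pd


def records_violating_k(df: pd.DataFrame, quasi_identifiers, k: int) -> pd.DataFrame:
    counts = df.groupby(quasi_identifiers)[quasi_identifiers[0]].transform("count")
    return df[counts < k]


# Example: rows to be generalized further before the dataset satisfies 2-anonymity.
# violators = records_violating_k(df, ["age", "gender", "occupation"], k=2)
```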

Fig. 4.16 Anonymization and privacy risk evaluation tool 4

4.6 Conclusion

In this chapter, we considered the importance of data and privacy. Several anonymization techniques, including k-anonymization, were introduced in Sect. 4.2, and the privacy notion and adversary model for static data were presented in Sect. 4.3. We focused on both static data and time-sequence data in this project and discussed time-sequence data in Sect. 4.4. Finally, in Sect. 4.5, we introduced an anonymization and privacy risk evaluation tool. The tool was developed partly within this project, and we are proactively pursuing its commercial use.