Methods to Mitigate Risk of Composition Attack in Independent Data Publications

Chapter in: Medical Data Privacy Handbook

Abstract

Data publication is a simple and cost-effective approach to sharing data across organizations, and data anonymization is a central technique in privacy-preserving data publication. Many methods have been proposed to anonymize individual datasets and multiple datasets of the same data publisher. In real life, however, a dataset is rarely isolated: two datasets published by two organizations may contain records of the same individuals. For example, patients might have visited two hospitals for follow-up or specialized treatment of a disease, and their records are independently anonymized and published. Although each published dataset poses only a small privacy risk, the intersection of the two datasets may severely compromise the privacy of the individuals. An attack using the intersection of datasets published by different organizations is called a composition attack. Some research has studied how to anonymize data to prevent composition attacks in independent data releases, where one data publisher has no knowledge of the records of another data publisher. In this chapter, we discuss two exemplar methods, a randomization-based approach and a generalization-based approach, for mitigating the risk of composition attacks. In the randomization method, noise is added to the original values to make it difficult for an adversary to pinpoint an individual's record in a published dataset. In the generalization method, the potentially identifying attribute values of a group of records are generalized to the same values so that individuals within the group are indistinguishable. We discuss and experimentally demonstrate the strengths and weaknesses of both types of methods. We also present a mixed data publication framework in which a small proportion of the records are managed and published centrally while the remaining records are managed and published locally by different organizations, reducing the risk of composition attacks and improving the overall utility of the data.

Notes

  1. http://ipums.org.

References

  1. Baig, M.M., Li, J., Liu, J., Ding, X., Wang, H.: Data privacy against composition attack. In: Proceedings of the 17th International Conference on Database Systems for Advanced Applications, pp. 320–334, Busan (2012)

  2. Baig, M.M., Li, J., Liu, J., Wang, H.: Cloning for privacy protection in multiple independent data publications. In: Proceedings of the 20th ACM International Conference on Information and Knowledge Management, pp. 885–894, Glasgow (2011)

  3. Barak, B., Chaudhuri, K., Dwork, C., Kale, S., McSherry, F., Talwar, K.: Privacy, accuracy, and consistency too: a holistic solution to contingency table release. In: Proceedings of the 26th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pp. 273–282, Beijing (2007)

  4. Bu, Y., Fu, A.W., Wong, R.C.W., Chen, L., Li, J.: Privacy preserving serial data publishing by role composition. Proc. VLDB Endow. 1(1), 845–856 (2008)

  5. Cebul, R.D., Rebitzer, J.B., Taylor, L.J., Votruba, M.: Organizational fragmentation and care quality in the U.S. health care system. J. Econ. Perspect. (2008). doi:10.3386/w14212

  6. Prospective Studies Collaboration: Age-specific relevance of usual blood pressure to vascular mortality: a meta-analysis of individual data for one million adults in 61 prospective studies. Lancet 360(9349), 1903–1913 (2002)

  7. Cormode, G., Procopiuc, C.M., Shen, E., Srivastava, D., Yu, T.: Empirical privacy and empirical utility of anonymized data. In: Proceedings of the 29th IEEE International Conference on Data Engineering Workshops, pp. 77–82, Brisbane (2013)

  8. Dwork, C.: Differential privacy. In: Proceedings of the 33rd International Colloquium on Automata, Languages and Programming, pp. 1–12, Venice (2006)

  9. Dwork, C.: A firm foundation for private data analysis. Commun. ACM 54(1), 86–95 (2011)

  10. Dwork, C., Kenthapadi, K., McSherry, F., Mironov, I., Naor, M.: Our data, ourselves: privacy via distributed noise generation. In: Advances in Cryptology – EUROCRYPT 2006, pp. 486–503, St. Petersburg (2006)

  11. Dwork, C., McSherry, F., Nissim, K., Smith, A.: Calibrating noise to sensitivity in private data analysis. In: Proceedings of the 3rd Conference on Theory of Cryptography, pp. 265–284, Berlin (2006)

  12. Dwork, C., Smith, A.: Differential privacy for statistics: what we know and what we want to learn. J. Priv. Confidentiality 1(2), 135–154 (2009)

  13. Fung, B.C.M., Wang, K., Chen, R., Yu, P.S.: Privacy-preserving data publishing: a survey of recent developments. ACM Comput. Surv. 42(4), 1–53 (2010)

  14. Fung, B.C.M., Wang, K., Fu, A.W., Pei, J.: Anonymity for continuous data publishing. In: Proceedings of the 11th International Conference on Extending Database Technology, pp. 264–275, Nantes (2008)

  15. Ganta, S.R., Kasiviswanathan, S.P., Smith, A.: Composition attacks and auxiliary information in data privacy. In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 265–273, Las Vegas, Nevada (2008)

  16. Jiang, W., Clifton, C.: A secure distributed framework for achieving k-anonymity. VLDB J. 15(4), 316–333 (2006)

  17. Jurczyk, P., Xiong, L.: Towards privacy-preserving integration of distributed heterogeneous data. In: Proceedings of the 2nd Ph.D. Workshop on Information and Knowledge Management, pp. 65–72, Napa Valley, California (2008)

  18. Kasiviswanathan, S.P., Smith, A.: On the 'semantics' of differential privacy: a Bayesian formulation. J. Priv. Confidentiality 6(1), 1–16 (2014)

  19. Kullback, S., Leibler, R.: On information and sufficiency. Ann. Math. Stat. 22(1), 79–86 (1951)

  20. LaValle, S., Lesser, E., Shockley, R., Hopkins, M.S., Kruschwitz, N.: Big data, analytics, and the path from insights to value. MIT Sloan Manag. Rev. 52, 21–31 (2011)

  21. LeFevre, K., DeWitt, D.J., Ramakrishnan, R.: Mondrian multidimensional k-anonymity. In: Proceedings of the 22nd IEEE International Conference on Data Engineering, p. 25, Atlanta, Georgia (2006)

  22. Li, N., Li, T., Venkatasubramanian, S.: t-closeness: privacy beyond k-anonymity and l-diversity. In: Proceedings of the 23rd IEEE International Conference on Data Engineering, pp. 106–115, Istanbul (2007)

  23. Machanavajjhala, A., Kifer, D., Gehrke, J., Venkitasubramaniam, M.: l-diversity: privacy beyond k-anonymity. ACM Trans. Knowl. Discov. Data 1(1) (2007). doi:10.1145/1217299.1217302

  24. Malin, B., Sweeney, L.: How (not) to protect genomic data privacy in a distributed network: using trail re-identification to evaluate and design anonymity protection systems. J. Biomed. Inform. 37, 179–192 (2004)

  25. Mohammed, N., Fung, B.C.M., Wang, K., Hung, P.C.K.: Privacy-preserving data mashup. In: Proceedings of the 12th International Conference on Extending Database Technology, pp. 228–239, Saint Petersburg (2009)

  26. Mohammed, N., Chen, R., Fung, B.C.M., Yu, P.S.: Differentially private data release for data mining. In: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 493–501, San Diego, California (2011)

  27. Muralidhar, K., Sarathy, R.: Does differential privacy protect Terry Gross' privacy? In: Domingo-Ferrer, J., Magkos, E. (eds.) Privacy in Statistical Databases. Lecture Notes in Computer Science, vol. 6344, pp. 200–209. Springer, Berlin (2010)

  28. Newton, K.M., Peissig, P.L., Kho, A.N., Bielinski, S.J., Berg, R.L., Choudhary, V., Basford, M., Chute, C.G., Kullo, I.J., Li, R., Pacheco, J.A., Rasmussen, L.V., Spangler, L., Denny, J.C.: Validation of electronic medical record-based phenotyping algorithms: results and lessons learned from the eMERGE network. J. Am. Med. Inform. Assoc. 20(e1), e147–e154 (2013)

  29. Provost, F., Fawcett, T.: Data science and its relationship to big data and data-driven decision making. Big Data 1(1), 51–59 (2013)

  30. Sarathy, R., Muralidhar, K.: Evaluating Laplace noise addition to satisfy differential privacy for numeric data. Trans. Data Priv. 4(1), 1–17 (2011)

  31. Sattar, S.A., Li, J., Ding, X., Liu, J., Vincent, M.: A general framework for privacy preserving data publishing. Knowl.-Based Syst. 54, 276–287 (2013)

  32. Sattar, S.A., Li, J., Liu, J., Heatherly, R., Malin, B.: A probabilistic approach to mitigate composition attacks on privacy in non-coordinated environments. Knowl.-Based Syst. 67, 361–372 (2014)

  33. Soria-Comas, J., Domingo-Ferrer, J., Sánchez, D., Martínez, S.: Enhancing data utility in differential privacy via microaggregation-based k-anonymity. VLDB J. 23(5), 771–794 (2014)

  34. Sweeney, L.: k-anonymity: a model for protecting privacy. Int. J. Uncertainty Fuzziness Knowl.-Based Syst. 10(5), 557–570 (2002)

  35. Tene, O., Polonetsky, J.: Privacy in the age of big data: a time for big decisions. Stanford Law Rev. 64, 63–69 (2012)

  36. Wang, K., Fung, B.C.M.: Anonymizing sequential releases. In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 414–423, Philadelphia, Pennsylvania (2006)

  37. Wong, R.C., Li, J., Fu, A.W., Wang, K.: (α, k)-anonymity: an enhanced k-anonymity model for privacy preserving data publishing. In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 754–759, Philadelphia, Pennsylvania (2006)

  38. Wong, R.C., Fu, A.W., Liu, J., Wang, K., Xu, Y.: Global privacy guarantee in serial data publishing. In: Proceedings of the 26th IEEE International Conference on Data Engineering, pp. 956–959, Long Beach, California (2010)

  39. Xiao, X., Tao, Y.: m-invariance: towards privacy preserving re-publication of dynamic datasets. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 689–700, Beijing (2007)

  40. Xiao, X., Wang, G., Gehrke, J.: Differential privacy via wavelet transforms. IEEE Trans. Knowl. Data Eng. 23(8), 1200–1214 (2011)

  41. Xiong, L., Sunderam, V., Fan, L., Goryczka, S., Pournajaf, L.: PREDICT: privacy and security enhancing dynamic information collection and monitoring. Procedia Comput. Sci. 18, 1979–1988 (2013)

Acknowledgements

The work has been partially supported by Australian Research Council (ARC) Discovery Grant DP110103142 and a CORE (junior track) grant from the National Research Fund, Luxembourg.

Author information

Correspondence to Jiuyong Li.

Appendix

In this appendix, we discuss the three measures used in the experiments and review the definition of differential privacy.

A. Metrics

Definition 8.1 (Kullback-Leibler Divergence).

Kullback-Leibler (KL) divergence [19] is a non-symmetric measure of the difference between two probability distributions P and Q. Specifically, KL-divergence measures the information lost when Q is used to approximate P. It is denoted by \( D_{KL}(P\vert \vert Q) \). If P and Q are discrete probability distributions, the KL-divergence of Q from P is defined as:

$$ \displaystyle{ D_{KL}(P\vert \vert Q) =\sum _{i}P(i)\ln \frac{P(i)} {Q(i)} } $$

In other words, it is the expectation of the logarithmic difference between the probabilities P and Q, where the expectation is taken with respect to the probabilities P.
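As a concrete illustration, here is a minimal Python sketch (ours, not from the chapter) that computes the KL-divergence between two discrete distributions given as probability lists over the same support:

```python
import math

def kl_divergence(p, q):
    # D_KL(P||Q) for discrete distributions p and q; terms with
    # p_i = 0 contribute nothing by the usual 0*log(0) = 0 convention.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
print(kl_divergence(p, q))  # ~0.025 nats; 0 only when P and Q coincide
```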

Definition 8.2 (City Block Distance).

City block distance (also known as Manhattan distance) measures the dissimilarity between two objects. If a and b are two objects, each described by an m-dimensional vector, then the city block distance between a and b is calculated as follows.

$$ \displaystyle{ Distance(a,b) =\sum _{ j=1}^{m}\vert a_{ j} - b_{j}\vert } $$

The city block distance is greater than or equal to zero: it is zero for identical objects and large for objects that share little similarity.
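The computation is straightforward; a short Python sketch (illustrative only, not code from the chapter):

```python
def city_block_distance(a, b):
    # City block (Manhattan, L1) distance between two m-dimensional vectors.
    return sum(abs(aj - bj) for aj, bj in zip(a, b))

print(city_block_distance([1, 2, 3], [1, 2, 3]))  # 0 for identical objects
print(city_block_distance([1, 2, 3], [4, 0, 3]))  # 5 = |1-4| + |2-0| + |3-3|
```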

Definition 8.3 (Relative Error).

Relative error indicates how accurate an estimate is relative to the actual value of the quantity being measured. If \( R_{act} \) represents the actual value and \( R_{est} \) represents the estimated value, then the relative error is defined as follows:

$$ \displaystyle{ Error = \frac{\vert R_{act} - R_{est}\vert } {R_{act}} = \frac{\Delta R} {R_{act}} } $$

where \( \Delta R \) represents the absolute error.
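For instance (a hypothetical illustration in Python):

```python
def relative_error(r_act, r_est):
    # Relative error |R_act - R_est| / R_act of an estimate.
    return abs(r_act - r_est) / r_act

# A noisy count of 104 against a true count of 100 is off by 4%.
print(relative_error(100, 104))  # 0.04
```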

B. Differential Privacy

Differential privacy has received significant attention recently because it provides a semantic and cryptographically strong guarantee [18]. It ensures that an adversary learns little more about an individual when that individual's record is in the dataset than when it is not [9, 18, 26].

"Differential privacy will ensure that the ability of an adversary to inflict harm (or good, for that matter) of any sort, to any set of people, should be essentially the same, independent of whether any individual opts in to, or opts out of, the dataset." [9]

The intuition behind this is that the output from a differentially private mechanism is insensitive to any particular record.

Definition 8.4 (Differential Privacy [26]).

A randomized function K is differentially private if for all datasets D and D′ whose symmetric difference contains at most one record (i.e., \( \vert D\Delta D'\vert \leq 1 \)), and for all possible anonymized datasets \( \hat{D} \),

$$ \displaystyle{ Pr[K(D) =\hat{ D}] \leq e^{\epsilon } \times Pr[K(D') =\hat{ D}] } $$

where the probabilities are over the randomness of K.

The parameter ε > 0 is public and set by the data publishers [9]. The lower the value of ε, the stronger the privacy guarantee, whereas a higher value of ε provides better data utility [27]; for example, with ε = 0.1 the probability of any output on two neighboring datasets can differ by a factor of at most \( e^{0.1} \approx 1.105 \). Therefore, it is crucial to choose an appropriate value for ε to balance data privacy and utility [30, 33, 40].

A standard mechanism to achieve differential privacy is to add random noise to the original output of a function. The added random noise is calibrated according to the sensitivity of the function. The sensitivity of the function is the maximum difference between the values that the function may take on a pair of datasets that differ in only one record [9].

Definition 8.5 (Sensitivity [12]).

For any function \( f: D \rightarrow \mathbf{R}^{d} \), the sensitivity of f is measured as:

$$ \displaystyle{ \Delta f =\max _{D,D'}\vert \vert f(D) - f(D')\vert \vert _{1} } $$

for all D, D′ differing in at most one record.

Dwork et al. [12] suggest the Laplace mechanism to achieve differential privacy. The Laplace mechanism takes a dataset D, a function f and a parameter b, and generates noise according to the Laplace distribution with probability density function \( Pr(x\vert b) = \frac{1} {2b}\exp (-\vert x\vert /b) \), which has mean 0 and variance \( 2b^{2} \). Theorem 8.1 connects the sensitivity of f to the magnitude b of the noise, so that the noisy output \( f(\hat{D}) = f(D) + Lap(b) \) satisfies ε-differential privacy. Note that Lap(b) is a random variable sampled from the Laplace distribution.

Theorem 8.1 ([10]).

For any function \( f: D \rightarrow \mathbf{R}^{d} \) , the randomized function K that adds independently generated noise with distribution \( Lap(\Delta f/\epsilon ) \) to each of the outputs guarantees ε-differential privacy.

For example, if the function f returns a count, the mechanism first computes the original count f(D) and then outputs the noisy answer \( f(\hat{D}) = f(D) + Lap(1/\epsilon ) \), since the sensitivity of a count query is 1: adding or removing one record changes the count by at most 1.
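A minimal Python sketch of this count mechanism, assuming only the standard Laplace mechanism described above (the function names and sample data are our own, not from the chapter):

```python
import math
import random

def laplace_noise(b):
    # Draw one sample from Lap(b), the Laplace distribution with
    # density (1/2b) * exp(-|x|/b): mean 0, variance 2*b^2.
    u = random.random() - 0.5                  # uniform on [-0.5, 0.5)
    sign = 1.0 if u >= 0 else -1.0
    return -b * sign * math.log(1.0 - 2.0 * abs(u))

def private_count(records, predicate, epsilon):
    # An epsilon-differentially private count query. A count has
    # sensitivity 1 (one record changes it by at most 1), so by
    # Theorem 8.1 the noise scale is 1/epsilon.
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)

# Hypothetical data: a private count of patients older than 60.
patients = [{"age": 72}, {"age": 45}, {"age": 63}, {"age": 80}]
print(private_count(patients, lambda r: r["age"] > 60, epsilon=0.5))
```

A smaller ε inflates the noise scale 1∕ε, strengthening privacy at the cost of a less accurate count, which matches the privacy-utility trade-off discussed above.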

Copyright information

© 2015 Springer International Publishing Switzerland

Cite this chapter

Li, J. et al. (2015). Methods to Mitigate Risk of Composition Attack in Independent Data Publications. In: Gkoulalas-Divanis, A., Loukides, G. (eds) Medical Data Privacy Handbook. Springer, Cham. https://doi.org/10.1007/978-3-319-23633-9_8

  • DOI: https://doi.org/10.1007/978-3-319-23633-9_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-23632-2

  • Online ISBN: 978-3-319-23633-9

  • eBook Packages: Computer Science (R0)
