Sharing Data with Guaranteed Privacy

Siebes, Arno

doi:10.1007/978-3-319-41706-6_4

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9580)

Abstract

Big Data is both a curse and a blessing. A blessing because the unprecedented amount of detailed data allows for research in, e.g., social sciences and health on scales that were until recently unimaginable. A curse, e.g., because of the risk that such – often very private – data leaks out through hacks or by other means, causing almost unlimited harm to the individual.

To neutralize the risks while maintaining the benefits, we should be able to randomize the data in such a way that the data at the individual level is random, while statistical models induced from the randomized data are indistinguishable from the same models induced from the original data.

In this paper we first analyse the risks in sharing micro-data – as statisticians tend to call it – even if it is anonymized, discretized, grouped, and perturbed. Next we quasi-formalize the kind of randomization we are after and argue why it is safe to share such data. Unfortunately, it is not clear that such randomizations of data sets exist. We briefly discuss why, if they exist at all, they will be hard to find. Next I explain why I think they do exist and can be constructed, by showing that the code tables computed by, e.g., Krimp are already close to what we would like to achieve, thus making privacy-safe sharing of micro-data possible.

Notes

1. Surprisingly often, terminology arising in marketing and/or journalism enters the scientific vernacular. This is understandable, given that funding agencies want to fund research that society needs and researchers need funding, but one can only wonder what would have happened if the term hypology – from the classical Greek \(\upsilon\pi o\lambda o\gamma\iota\sigma\mu o\) (calculate) – once coined as an alternative for the term computer science [21], had caught on.
2. If only because of the large number of parameters we use.

References

Agrawal, R., Mannila, H., Srikant, R., Toivonen, H., Verkamo, A.I.: Fast discovery of association rules. In: Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy, R. (eds.) Advances in Knowledge Discovery and Data Mining, pp. 307–328. AAAI/MIT Press, Menlo Park (1996)
Bakshy, E., Messing, S., Adamic, L.A.: Exposure to ideologically diverse news and opinion on Facebook. Science 348(6239), 1130–1132 (2015)
Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J.: Classification and Regression Trees. Chapman & Hall, Wadsworth (1984)
Calders, T., Goethals, B.: Non-derivable itemset mining. Data Min. Knowl. Disc. 14(1), 171–206 (2007)
Cover, T.M., Thomas, J.A.: Elements of Information Theory. Wiley, New York (2006)
de Montjoye, Y.-A., Radaelli, L., Singh, V.K., Pentland, A.S.: Unique in the shopping mall: on the reidentifiability of credit card metadata. Science 347(6221), 536–539 (2015)
Dong, G., Li, J.: Efficient mining of emerging patterns: discovering trends and differences. In: Fayyad, U.M., Chaudhuri, S., Madigan, D. (eds.) Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA, USA, 15–18 August 1999, pp. 43–52 (1999)
Dwork, C., Roth, A.: The algorithmic foundations of differential privacy. Found. Trends Theor. Comput. Sci. 9(3–4), 211–407 (2014)
Grünwald, P.: The Minimum Description Length Principle. MIT Press, Cambridge (2007)
Klösgen, W.: Subgroup patterns. In: Klösgen, W., Zytkow, J.M. (eds.) Handbook of Data Mining and Knowledge Discovery, pp. 47–51. Oxford University Press, New York (2002)
Lane, J., Stodden, V., Bender, S., Nissenbaum, H. (eds.): Privacy, Big Data, the Public Good: Frameworks for Engagement. Cambridge University Press, New York (2015)
Li, M., Vitányi, P.: An Introduction to Kolmogorov Complexity and Its Applications. Springer, New York (1993)
Mannila, H., Toivonen, H., Verkamo, A.I.: Levelwise search and borders of theories in knowledge discovery. Data Min. Knowl. Disc. 1(3), 241–258 (1997)
Mayer-Schönberger, V., Cukier, K.: Big Data, A Revolution That Will Transform How We Live, Work and Think. John Murray, London (2013)
Narayanan, A., Shmatikov, V.: Robust de-anonymization of large sparse datasets. In: 2008 IEEE Symposium on Security and Privacy (S&P 2008), 18–21 May 2008, Oakland, California, USA, pp. 111–125 (2008)
Samarati, P., Sweeney, L.: Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression. In: Proceedings of the IEEE Symposium on Research in Security and Privacy (1998)
Schneier, B.: Data and Goliath: The Hidden Battles to Collect Your Data and Control Your World. W.W. Norton and Company, New York (2015)
Siebes, A., Puspitaningrum, D.: Mining databases to mine queries faster. In: Buntine, W., Grobelnik, M., Mladenić, D., Shawe-Taylor, J. (eds.) ECML PKDD 2009, Part II. LNCS, vol. 5782, pp. 382–397. Springer, Heidelberg (2009)
Siebes, A., Vreeken, J., van Leeuwen, M.: Item sets that compress. In: Ghosh, J., Lambert, D., Skillicorn, D.B., Srivastava, J. (eds.) SDM 2006 Proceedings, pp. 393–404. SIAM (2006)
Smets, K., Vreeken, J.: Slim: directly mining descriptive patterns. In: Ghosh, J., Liu, H., Davidson, I., Domeniconi, C., Kamath, C. (eds.) SDM 2012 Proceedings, pp. 236–247. SIAM (2012)
Tedre, M.: The Science of Computing: Shaping a Discipline. CRC Press/Taylor & Francis, Boca Raton (2014)
Vreeken, J., van Leeuwen, M., Siebes, A.: Preserving privacy through data generation. In: Ramakrishnan, N., Zaïane, O.R., Shi, Y., Clifton, C.W., Wu, X. (eds.) ICDM 2007 Proceedings, pp. 685–690. IEEE (2007)
Vreeken, J., van Leeuwen, M., Siebes, A.: Krimp: mining itemsets that compress. Data Min. Knowl. Disc. 23(1), 169–214 (2011)
Wikipedia: Netflix prize (2015). Accessed 31 July 2015

Acknowledgements

The author is supported by the Dutch national COMMIT project.

Author information

Authors and Affiliations

Algorithmic Data Analysis Group, Universiteit Utrecht, Utrecht, The Netherlands
Arno Siebes

Corresponding author

Correspondence to Arno Siebes.

Editor information

Editors and Affiliations

TU Dortmund, Dortmund, Germany
Stefan Michaelis
TU Dortmund, Dortmund, Germany
Nico Piatkowski
TU Dortmund, Dortmund, Germany
Marco Stolpe

About this chapter

Cite this chapter

Siebes, A. (2016). Sharing Data with Guaranteed Privacy. In: Michaelis, S., Piatkowski, N., Stolpe, M. (eds) Solving Large Scale Learning Tasks. Challenges and Algorithms. Lecture Notes in Computer Science, vol 9580. Springer, Cham. https://doi.org/10.1007/978-3-319-41706-6_4

DOI: https://doi.org/10.1007/978-3-319-41706-6_4
Published: 03 July 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-41705-9
Online ISBN: 978-3-319-41706-6
eBook Packages: Computer Science, Computer Science (R0)