Abstract
Big Data is both a curse and a blessing. A blessing because the unprecedented amount of detailed data allows for research in, e.g., social sciences and health on scales that were until recently unimaginable. A curse because, e.g., of the risk that such (often very private) data leaks out through hacks or by other means, causing almost unlimited harm to the individual.
To neutralize the risks while maintaining the benefits, we should be able to randomize the data in such a way that the data at the individual level is random, while statistical models induced from the randomized data are indistinguishable from the same models induced from the original data.
In this paper we first analyse the risks in sharing micro data, as statisticians tend to call it, even if it is anonymized, discretized, grouped, and perturbed. Next we quasi-formalize the kind of randomization we are after and argue why it is safe to share such data. Unfortunately, it is not clear that such randomizations of data sets exist. We briefly discuss why, if they exist at all, they will be hard to find. Finally, I explain why I think they do exist and can be constructed, by showing that the code tables computed by, e.g., Krimp are already close to what we would like to achieve, thus making privacy-safe sharing of micro data possible.
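To make the risk in sharing anonymized micro data concrete, the following is a minimal sketch, not taken from the chapter, that counts how many records in a toy table are pinned down uniquely by a few coarse attributes. The table `RECORDS`, its attribute names, and the helper `unique_fraction` are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical toy micro data: no names, only coarse attributes.
RECORDS = [
    {"zip": "3584", "birth_year": 1975, "sex": "F"},
    {"zip": "3584", "birth_year": 1975, "sex": "F"},
    {"zip": "3584", "birth_year": 1980, "sex": "M"},
    {"zip": "3512", "birth_year": 1975, "sex": "F"},
    {"zip": "3512", "birth_year": 1962, "sex": "M"},
]


def unique_fraction(records, attributes):
    """Fraction of records whose combination of `attributes` occurs exactly once."""
    keys = [tuple(r[a] for a in attributes) for r in records]
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] == 1) / len(keys)


# A single attribute singles out nobody in this toy table ...
one_attribute = unique_fraction(RECORDS, ["sex"])
# ... but combining three attributes makes most records unique.
three_attributes = unique_fraction(RECORDS, ["zip", "birth_year", "sex"])
```

In this toy table no record is unique on `sex` alone, while three of the five records are unique on the combination of all three attributes. Anyone who knows those three values for a person can then link that person to a single "anonymous" record.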
Notes
1. Surprisingly often, terminology arising in marketing and/or journalism enters the scientific vernacular. This is understandable, given that funding agencies want to fund research that society needs and researchers need funding, but one can only wonder what would have happened if the term hypology, from the classical Greek \(\upsilon \pi {o}\lambda {o}\gamma \iota \sigma \mu {o}\) (calculate), once coined as an alternative for the term computer science [21], had caught on.
2. If only because of the large number of parameters we use.
References
Agrawal, R., Mannila, H., Srikant, R., Toivonen, H., Verkamo, A.I.: Fast discovery of association rules. In: Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy, R. (eds.) Advances in Knowledge Discovery and Data Mining, pp. 307–328. AAAI/MIT Press, Menlo Park (1996)
Bakshy, E., Messing, S., Adamic, L.A.: Exposure to ideologically diverse news and opinion on Facebook. Science 348(6239), 1130–1132 (2015)
Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J.: Classification and Regression Trees. Chapman & Hall, Wadsworth (1984)
Calders, T., Goethals, B.: Non-derivable itemset mining. Data Min. Knowl. Disc. 14(1), 171–206 (2007)
Cover, T.M., Thomas, J.A.: Elements of Information Theory. Wiley, New York (2006)
de Montjoye, Y.-A., Radaelli, L., Singh, V.K., Pentland, A.S.: Unique in the shopping mall: on the reidentifiability of credit card metadata. Science 347(6221), 536–539 (2015)
Dong, G., Li, J.: Efficient mining of emerging patterns: discovering trends and differences. In: Fayyad, U.M., Chaudhuri, S., Madigan, D. (eds.) Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA, USA, 15–18 August 1999, pp. 43–52 (1999)
Dwork, C., Roth, A.: The algorithmic foundations of differential privacy. Found. Trends Theor. Comput. Sci. 9(3–4), 211–407 (2014)
Grünwald, P.: The Minimum Description Length Principle. MIT Press, Cambridge (2007)
Klösgen, W.: Subgroup patterns. In: Klösgen, W., Zytkow, J.M. (eds.) Handbook of Data Mining and Knowledge Discovery, pp. 47–51. Oxford University Press, New York (2002)
Lane, J., Stodden, V., Bender, S., Nissenbaum, H. (eds.): Privacy, Big Data, the Public Good: Frameworks for Engagement. Cambridge University Press, New York (2015)
Li, M., Vitányi, P.: An Introduction to Kolmogorov Complexity and Its Applications. Springer, New York (1993)
Mannila, H., Toivonen, H., Verkamo, A.I.: Levelwise search and borders of theories in knowledge discovery. Data Min. Knowl. Disc. 1(3), 241–258 (1997)
Mayer-Schönberger, V., Cukier, K.: Big Data: A Revolution That Will Transform How We Live, Work and Think. John Murray, London (2013)
Narayanan, A., Shmatikov, V.: Robust de-anonymization of large sparse datasets. In: 2008 IEEE Symposium on Security and Privacy (S&P 2008), 18–21 May 2008, Oakland, California, USA, pp. 111–125 (2008)
Samarati, P., Sweeney, L.: Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression. In: Proceedings of the IEEE Symposium on Research in Security and Privacy (1998)
Schneier, B.: Data and Goliath: The Hidden Battles to Collect Your Data and Control Your World. W.W. Norton and Company, New York (2015)
Siebes, A., Puspitaningrum, D.: Mining databases to mine queries faster. In: Buntine, W., Grobelnik, M., Mladenić, D., Shawe-Taylor, J. (eds.) ECML PKDD 2009, Part II. LNCS, vol. 5782, pp. 382–397. Springer, Heidelberg (2009)
Siebes, A., Vreeken, J., van Leeuwen, M.: Item sets that compress. In: Ghosh, J., Lambert, D., Skillicorn, D.B., Srivastava, J. (eds.) SDM 2006 Proceedings, pp. 393–404. SIAM (2006)
Smets, K., Vreeken, J.: Slim: directly mining descriptive patterns. In: Ghosh, J., Liu, H., Davidson, I., Domeniconi, C., Kamath, C. (eds.) SDM 2012 Proceedings, pp. 236–247. SIAM (2012)
Tedre, M.: The Science of Computing: Shaping a Discipline. CRC Press/Taylor & Francis, Boca Raton (2014)
Vreeken, J., van Leeuwen, M., Siebes, A.: Preserving privacy through data generation. In: Ramakrishnan, N., Zaïane, O.R., Shi, Y., Clifton, C.W., Wu, X. (eds.) ICDM 2007 Proceedings, pp. 685–690. IEEE (2007)
Vreeken, J., van Leeuwen, M., Siebes, A.: Krimp: mining itemsets that compress. Data Min. Knowl. Disc. 23(1), 169–214 (2011)
Wikipedia: Netflix prize (2015). Accessed 31 July 2015
Acknowledgements
The author is supported by the Dutch national COMMIT project.
Copyright information
© 2016 Springer International Publishing Switzerland
About this chapter
Cite this chapter
Siebes, A. (2016). Sharing Data with Guaranteed Privacy. In: Michaelis, S., Piatkowski, N., Stolpe, M. (eds) Solving Large Scale Learning Tasks. Challenges and Algorithms. Lecture Notes in Computer Science, vol 9580. Springer, Cham. https://doi.org/10.1007/978-3-319-41706-6_4
DOI: https://doi.org/10.1007/978-3-319-41706-6_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-41705-9
Online ISBN: 978-3-319-41706-6
eBook Packages: Computer Science, Computer Science (R0)