Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9580)

Abstract

Big Data is both a curse and a blessing. A blessing because the unprecedented amount of detailed data allows for research in, e.g., social sciences and health on scales that were until recently unimaginable. A curse, e.g., because of the risk that such – often very private – data leaks out through hacks or by other means, causing almost unlimited harm to the individual.

To neutralize the risks while maintaining the benefits, we should be able to randomize the data in such a way that the data at the individual level is random, while statistical models induced from the randomized data are indistinguishable from the same models induced from the original data.

In this paper we first analyse the risks in sharing micro-data – as statisticians tend to call it – even if it is anonymized, discretized, grouped, and perturbed. Next we quasi-formalize the kind of randomization we are after and argue why it is safe to share such data. Unfortunately, it is not clear that such randomizations of data sets exist. We briefly discuss why, if they exist at all, they will be hard to find. Next I explain why I think they do exist and can be constructed, by showing that the code tables computed by, e.g., Krimp are already close to what we would like to achieve, thus making privacy-safe sharing of micro-data possible.
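
As a rough illustration of generating data from a code table, here is a minimal sketch. It is not the construction from the chapter and carries no privacy guarantee. It assumes a toy code table given as a dictionary from itemsets to usage counts; the values, the names `CODE_TABLE`, `generate_transaction` and `generate_dataset`, and the stopping rule `n_draws` are all hypothetical.

```python
import random

# Hypothetical toy code table: itemset -> usage count (how often the itemset
# was used to cover the original data). The numbers are made up.
CODE_TABLE = {
    frozenset({"a", "b"}): 6,
    frozenset({"c"}): 3,
    frozenset({"a"}): 1,
    frozenset({"b", "d"}): 2,
}


def generate_transaction(code_table, n_draws, rng):
    """Build one synthetic transaction from up to n_draws disjoint itemsets."""
    transaction = set()
    candidates = dict(code_table)
    for _ in range(n_draws):
        # Only itemsets that do not overlap the transaction stay eligible.
        candidates = {s: u for s, u in candidates.items() if not (s & transaction)}
        if not candidates:
            break
        itemsets = list(candidates)
        weights = [candidates[s] for s in itemsets]
        # Sample one itemset with probability proportional to its usage.
        chosen = rng.choices(itemsets, weights=weights, k=1)[0]
        transaction |= chosen
    return frozenset(transaction)


def generate_dataset(code_table, n_rows, n_draws=2, seed=0):
    """Generate n_rows synthetic transactions with a fixed seed for repeatability."""
    rng = random.Random(seed)
    return [generate_transaction(code_table, n_draws, rng) for _ in range(n_rows)]


if __name__ == "__main__":
    for row in generate_dataset(CODE_TABLE, 5):
        print(sorted(row))
```

No generated row is copied from an original record, yet frequently used itemsets reappear roughly in proportion to their usage. Whether models induced from such data match those induced from the original is exactly the question the chapter raises, and this sketch does not settle it.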

Notes

  1. Surprisingly often, terminology arising in marketing and/or journalism enters the scientific vernacular. This is understandable, given that funding agencies want to fund research that society needs and researchers need funding. Still, one can but wonder what would have happened if the term hypology – from the classical Greek \(\upsilon \pi {o}\lambda {o}\gamma \iota \sigma \mu {o}\) (calculate) – once coined as an alternative for the term computer science [21], had caught on.

  2. If only because of the large number of parameters we use.

References

  1. Agrawal, R., Mannila, H., Srikant, R., Toivonen, H., Verkamo, A.I.: Fast discovery of association rules. In: Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy, R. (eds.) Advances in Knowledge Discovery and Data Mining, pp. 307–328. AAAI/MIT Press, Menlo Park (1996)

  2. Bakshy, E., Messing, S., Adamic, L.A.: Exposure to ideologically diverse news and opinion on Facebook. Science 348(6239), 1130–1132 (2015)

  3. Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J.: Classification and Regression Trees. Chapman & Hall, Wadsworth (1984)

  4. Calders, T., Goethals, B.: Non-derivable itemset mining. Data Min. Knowl. Disc. 14(1), 171–206 (2007)

  5. Cover, T.M., Thomas, J.A.: Elements of Information Theory. Wiley, New York (2006)

  6. de Montjoye, Y.-A., Radaelli, L., Singh, V.K., Pentland, A.S.: Unique in the shopping mall: on the reidentifiability of credit card metadata. Science 347(6221), 536–539 (2015)

  7. Dong, G., Li, J.: Efficient mining of emerging patterns: discovering trends and differences. In: Fayyad, U.M., Chaudhuri, S., Madigan, D. (eds.) Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA, USA, 15–18 August 1999, pp. 43–52 (1999)

  8. Dwork, C., Roth, A.: The algorithmic foundations of differential privacy. Found. Trends Theor. Comput. Sci. 9(3–4), 211–407 (2014)

  9. Grünwald, P.: The Minimum Description Length Principle. MIT Press, Cambridge (2007)

  10. Klösgen, W.: Subgroup patterns. In: Klösgen, W., Zytkow, J.M. (eds.) Handbook of Data Mining and Knowledge Discovery, pp. 47–51. Oxford University Press, New York (2002)

  11. Lane, J., Stodden, V., Bender, S., Nissenbaum, H. (eds.): Privacy, Big Data, and the Public Good: Frameworks for Engagement. Cambridge University Press, New York (2015)

  12. Li, M., Vitányi, P.: An Introduction to Kolmogorov Complexity and Its Applications. Springer, New York (1993)

  13. Mannila, H., Toivonen, H.: Levelwise search and borders of theories in knowledge discovery. Data Min. Knowl. Disc. 1(3), 241–258 (1997)

  14. Mayer-Schönberger, V., Cukier, K.: Big Data, A Revolution That Will Transform How We Live, Work and Think. John Murray, London (2013)

  15. Narayanan, A., Shmatikov, V.: Robust de-anonymization of large sparse datasets. In: 2008 IEEE Symposium on Security and Privacy (S&P 2008), 18–21 May 2008, Oakland, California, USA, pp. 111–125 (2008)

  16. Samarati, P., Sweeney, L.: Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression. In: Proceedings of the IEEE Symposium on Research in Security and Privacy (1998)

  17. Schneier, B.: Data and Goliath: The Hidden Battles to Collect Your Data and Control Your World. W.W. Norton and Company, New York (2015)

  18. Siebes, A., Puspitaningrum, D.: Mining databases to mine queries faster. In: Buntine, W., Grobelnik, M., Mladenić, D., Shawe-Taylor, J. (eds.) ECML PKDD 2009, Part II. LNCS, vol. 5782, pp. 382–397. Springer, Heidelberg (2009)

  19. Siebes, A., Vreeken, J., van Leeuwen, M.: Item sets that compress. In: Ghosh, J., Lambert, D., Skillicorn, D.B., Srivastava, J. (eds.) SDM 2006 Proceedings, pp. 393–404. SIAM (2006)

  20. Smets, K., Vreeken, J.: Slim: directly mining descriptive patterns. In: Ghosh, J., Liu, H., Davidson, I., Domeniconi, C., Kamath, C. (eds.) SDM 2012 Proceedings, pp. 236–247. SIAM (2012)

  21. Tedre, M.: The Science of Computing: Shaping a Discipline. CRC Press/Taylor & Francis, Boca Raton (2014)

  22. Vreeken, J., van Leeuwen, M., Siebes, A.: Preserving privacy through data generation. In: Ramakrishnan, N., Zaïane, O.R., Shi, Y., Clifton, C.W., Wu, X. (eds.) ICDM 2007 Proceedings, pp. 685–690. IEEE (2007)

  23. Vreeken, J., van Leeuwen, M., Siebes, A.: Krimp: mining itemsets that compress. Data Min. Knowl. Disc. 23(1), 169–214 (2011)

  24. Wikipedia: Netflix prize (2015). Accessed 31 July 2015

Acknowledgements

The author is supported by the Dutch national COMMIT project.

Author information

Corresponding author

Correspondence to Arno Siebes.

Copyright information

© 2016 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Siebes, A. (2016). Sharing Data with Guaranteed Privacy. In: Michaelis, S., Piatkowski, N., Stolpe, M. (eds) Solving Large Scale Learning Tasks. Challenges and Algorithms. Lecture Notes in Computer Science, vol 9580. Springer, Cham. https://doi.org/10.1007/978-3-319-41706-6_4

  • DOI: https://doi.org/10.1007/978-3-319-41706-6_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-41705-9

  • Online ISBN: 978-3-319-41706-6

  • eBook Packages: Computer Science, Computer Science (R0)