Minimising Information Loss on Anonymised High Dimensional Data with Greedy In-Memory Processing

  • Nikolai J. PodlesnyEmail author
  • Anne V. D. M. KayemEmail author
  • Stephan von SchorlemerEmail author
  • Matthias UflackerEmail author
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11029)


Minimising information loss on anonymised high dimensional data is important for data utility. Syntactic data anonymisation algorithms address this issue by generating datasets that are neither use-case specific nor dependent on runtime specifications. This results in anonymised datasets that can be re-used in different scenarios which is performance efficient. However, syntactic data anonymisation algorithms incur high information loss on high dimensional data, making the data unusable for analytics. In this paper, we propose an optimised exact quasi-identifier identification scheme, based on the notion of k-anonymity, to generate anonymised high dimensional datasets efficiently, and with low information loss. The optimised exact quasi-identifier identification scheme works by identifying and eliminating maximal partial unique column combination (mpUCC) attributes that endanger anonymity. By using in-memory processing to handle the attribute selection procedure, we significantly reduce the processing time required. We evaluated the effectiveness of our proposed approach with an enriched dataset drawn from multiple real-world data sources, and augmented with synthetic values generated in close alignment with the real-world data distributions. Our results indicate that in-memory processing drops attribute selection time for the mpUCC candidates from 400s to 100s, while significantly reducing information loss. In addition, we achieve a time complexity speed-up of \(O(3^{n/3})\approx O(1.4422^{n})\).


  1. 1.
    Aggarwal, C.C.: On k-anonymity and the curse of dimensionality. In: Proceedings of the 31st International Conference on Very Large Data Bases, VLDB 2005 (2005)Google Scholar
  2. 2.
    Barbaro, M., Zeller, T., Hansell, S.: A face is exposed for AOL searcher no. 4417749. New York Times 9(2008), 8 (2006).
  3. 3.
    Bayardo, R.J., Agrawal, R.: Data privacy through optimal k-anonymization. In: Proceedings of the 21st International Conference on Data Engineering, ICDE 2005, pp. 217–228. IEEE (2005)Google Scholar
  4. 4.
    Bhaskar, R., Laxman, S., Smith, A., Thakurta, A.: Discovering frequent patterns in sensitive data. In: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 503–512. ACM (2010)Google Scholar
  5. 5.
    Bläsius, T., Friedrich, T., Schirneck, M.: The parameterized complexity of dependency detection in relational databases. In: LIPIcs-Leibniz International Proceedings in Informatics. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik (2017)Google Scholar
  6. 6.
    Bonomi, L., Xiong, L.: Mining frequent patterns with differential privacy. Proc. VLDB Endow. 6(12), 1422–1427 (2013)CrossRefGoogle Scholar
  7. 7.
    De Montjoye, Y.A., Hidalgo, C.A., Verleysen, M., Blondel, V.D.: Unique in the crowd: the privacy bounds of human mobility. Sci. Rep. 3, 1376 (2013)CrossRefGoogle Scholar
  8. 8.
    Dondi, R., Mauri, G., Zoppis, I.: On the complexity of the l-diversity problem. In: Murlak, F., Sankowski, P. (eds.) MFCS 2011. LNCS, vol. 6907, pp. 266–277. Springer, Heidelberg (2011). Scholar
  9. 9.
    Dwork, C.: Differential privacy: a survey of results. In: Agrawal, M., Du, D., Duan, Z., Li, A. (eds.) TAMC 2008. LNCS, vol. 4978, pp. 1–19. Springer, Heidelberg (2008). Scholar
  10. 10.
    Dwork, C., McSherry, F., Nissim, K., Smith, A.: Calibrating noise to sensitivity in private data analysis. In: Halevi, S., Rabin, T. (eds.) TCC 2006. LNCS, vol. 3876, pp. 265–284. Springer, Heidelberg (2006). Scholar
  11. 11.
    Färber, F., et al.: The SAP HANA database-an architecture overview. IEEE Data Eng. Bull. 35(1), 28–33 (2012)Google Scholar
  12. 12.
    Fienberg, S.E., Jin, J.: Privacy-preserving data sharing in high dimensional regression and classification settings. J. Priv. Confid. 4(1), 221–243 (2012)Google Scholar
  13. 13.
    Fredj, F.B., Lammari, N., Comyn-Wattiau, I.: Abstracting anonymization techniques: a prerequisite for selecting a generalization algorithm. Procedia Comput. Sci. 60, 206–215 (2015)CrossRefGoogle Scholar
  14. 14.
    Ghosh, A., Roughgarden, T., Sundararajan, M.: Universally utility-maximizing privacy mechanisms. SIAM J. Comput. 41(6), 1673–1693 (2012)MathSciNetCrossRefGoogle Scholar
  15. 15.
    Ibarra, O.H.: Reversal-bounded multicounter machines and their decision problems. J. ACM (JACM) 25(1), 116–133 (1978)MathSciNetCrossRefGoogle Scholar
  16. 16.
    Islam, M.Z., Brankovic, L.: Privacy preserving data mining: a noise addition framework using a novel clustering technique. Knowl.-Based Syst. 24(8), 1214–1223 (2011)CrossRefGoogle Scholar
  17. 17.
    Karp, R.M.: Reducibility among combinatorial problems. In: Miller, R.E., Thatcher, J.W., Bohlinger, J.D. (eds.) Complexity of Computer Computations. IRSS, pp. 85–103. Springer, Boston (1972). Scholar
  18. 18.
    Kifer, D., Machanavajjhala, A.: No free lunch in data privacy. In: Proceedings of the 2011 ACM SIGMOD, SIGMOD 2011, pp. 193–204. ACM (2011)Google Scholar
  19. 19.
    Kohlmayer, F., Prasser, F., Eckert, C., Kuhn, K.A.: A flexible approach to distributed data anonymization. J. Biomed. Inform. 50, 62–76 (2014)CrossRefGoogle Scholar
  20. 20.
    Koufogiannis, F., Han, S., Pappas, G.J.: Optimality of the Laplace mechanism in differential privacy (2015)Google Scholar
  21. 21.
    Lee, J., et al.: High-performance transaction processing in SAP HANA. IEEE Data Eng. Bull. 36(2), 28–33 (2013)Google Scholar
  22. 22.
    Li, C., Miklau, G., Hay, M., McGregor, A., Rastogi, V.: The matrix mechanism: optimizing linear counting queries under differential privacy. VLDB J. 24(6), 757–781 (2015)CrossRefGoogle Scholar
  23. 23.
    Li, N., Li, T., Venkatasubramanian, S.: T-closeness: privacy beyond k-anonymity and l-diversity. In: 2007 IEEE 23rd ICDE, pp. 106–115, April 2007Google Scholar
  24. 24.
    Liang, H., Yuan, H.: On the complexity of t-closeness anonymization and related problems. In: Meng, W., Feng, L., Bressan, S., Winiwarter, W., Song, W. (eds.) DASFAA 2013. LNCS, vol. 7825, pp. 331–345. Springer, Heidelberg (2013). Scholar
  25. 25.
    Liu, F.: Generalized Gaussian mechanism for differential privacy (2016)Google Scholar
  26. 26.
    Machanavajjhala, A., Kifer, D., Gehrke, J., Venkitasubramaniam, M.: L-diversity: privacy beyond k-anonymity. ACM TKDD 1(1), 3 (2007)CrossRefGoogle Scholar
  27. 27.
    McSherry, F., Talwar, K.: Mechanism design via differential privacy. In: 48th IEEE Symposium Foundations of Computer Science, FOCS 2007 (2007)Google Scholar
  28. 28.
    Meyer, A.R., Stockmeyer, L.J.: The equivalence problem for regular expressions with squaring requires exponential space. In: SWAT (FOCS), pp. 125–129 (1972)Google Scholar
  29. 29.
    Meyerson, A., Williams, R.: On the complexity of optimal k-anonymity. In: Proceedings of the Twenty-Third ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pp. 223–228. ACM (2004)Google Scholar
  30. 30.
    Mohammed, N., Fung, B., Hung, P.C., Lee, C.K.: Centralized and distributed anonymization for high-dimensional healthcare data. ACM TKDD 4(4), 18 (2010)Google Scholar
  31. 31.
    Papenbrock, T., Naumann, F.: A hybrid approach for efficient unique column combination discovery. Proc. der Fachtagung Business, Technologie und Web (2017)Google Scholar
  32. 32.
    Plattner, H., et al.: A Course in In-Memory Data Management. Springer, Heidelberg (2013). Scholar
  33. 33.
    Polonetsky, J., Tene, O., Finch, K.: Shades of gray: seeing the full spectrum of practical data de-identification (2016)Google Scholar
  34. 34.
    Rubinstein, I., Hartzog, W.: Anonymization and risk (2015)Google Scholar
  35. 35.
    Rzhetsky, A., Wajngurt, D., Park, N., Zheng, T.: Probing genetic overlap among complex human phenotypes. Proc. Nat. Acad. Sci. 104(28), 11694–11699 (2007)CrossRefGoogle Scholar
  36. 36.
    Suthram, S., Dudley, J.T., Chiang, A.P., Chen, R., Hastie, T.J., Butte, A.J.: Network-based elucidation of human disease similarities reveals common functional modules enriched for pluripotent drug targets. PLoS Comput. Biol. 6(2), 1–10 (2010)CrossRefGoogle Scholar
  37. 37.
    Sweeney, L.: Achieving k-anonymity privacy protection using generalization and suppression. Int. J. Uncertain. Fuzziness Knowl.-Based Syst. 10(05), 571–588 (2002)MathSciNetCrossRefGoogle Scholar
  38. 38.
    Sweeney, L.: K-anonymity: a model for protecting privacy. Int. J. Uncertain. Fuzziness Knowl.-Based Syst. 10(05), 557–570 (2002)MathSciNetCrossRefGoogle Scholar
  39. 39.
    Terrovitis, M., Mamoulis, N., Kalnis, P.: Privacy-preserving anonymization of set-valued data. Proc. VLDB Endow. 1(1), 115–125 (2008)CrossRefGoogle Scholar
  40. 40.
    Vaidya, J., Clifton, C.: Privacy-preserving k-means clustering over vertically partitioned data. In: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 206–215. ACM (2003)Google Scholar
  41. 41.
    Vaidya, J., Kantarcıoğlu, M., Clifton, C.: Privacy-preserving Naive Bayes classification. VLDB J.—Int. J. Very Large Data Bases 17(4), 879–898 (2008)CrossRefGoogle Scholar
  42. 42.
    Vessenes, P., Seidensticker, R.: System and method for analyzing transactions in a distributed ledger. US Patent 9,298,806, 29 March 2016Google Scholar
  43. 43.
    Wernke, M., Skvortsov, P., Dürr, F., Rothermel, K.: A classification of location privacy attacks and approaches. Pers. Ubiquit. Comput. 18(1), 163–175 (2014)CrossRefGoogle Scholar
  44. 44.
    Wimmer, H., Powell, L.: A comparison of the effects of k-anonymity on machine learning algorithms. In: Proceedings of the Conference for Information Systems Applied Research ISSN, vol. 2167, p. 1508 (2014)Google Scholar
  45. 45.
    Zhang, B., Dave, V., Mohammed, N., Hasan, M.A.: Feature selection for classification under anonymity constraint. arXiv preprint arXiv:1512.07158 (2015)
  46. 46.
    Zhang, X., Yang, L.T., Liu, C., Chen, J.: A scalable two-phase top-down specialization approach for data anonymization using mapreduce on cloud. IEEE Trans. Parallel Distrib. Syst. 25(2), 363–373 (2014)CrossRefGoogle Scholar
  47. 47.
    Zhou, X., Menche, J., Barabási, A.L., Sharma, A.: Human symptoms-disease network. Nat. Commun. 5, 4212 (2014)CrossRefGoogle Scholar

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. 1.Hasso Plattner InstituteUniversity of PotsdamPotsdamGermany

Personalised recommendations