Anonymized noise addition in subspaces for privacy preserved data mining in high dimensional continuous data

Abstract

Data privacy is a major concern in data mining, and privacy-preserving data mining algorithms are used to address it. However, applying privacy-preserving data mining to high-dimensional continuous data leads to high data loss and information loss, and makes clusters very difficult to identify. In this paper, a novel technique, Anonymized Noise Addition in Subspaces (ANAS), is proposed, which reduces data loss and information loss and enhances both cluster identification and privacy. Anonymization using aggregation is performed in dense and non-dense subspaces, based on Euclidean distances, to reduce data loss and enhance privacy. Random noise within the subspace limits is then applied to the anonymized subspaces to enhance identification of clusters and reduce data loss. ANAS is run on benchmark datasets, and the results show that ANAS can identify 80% of the original dataset's clusters on sparse datasets, whereas the existing techniques do not identify any clusters. ANAS reduces data loss by 50% and information loss by 20%, and enhances privacy by 40%.
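
The two steps named in the abstract, aggregation by Euclidean distance within a subspace followed by bounded random noise, can be illustrated with a minimal Python sketch. This is not the authors' ANAS implementation: the greedy grouping rule, the reading of "subspace limits" as the per-dimension minimum and maximum of each group, and the names `microaggregate`, `anonymize_with_noise`, `k`, and `noise_scale` are all assumptions made for illustration, and the subspace (`dims`) is supplied by the caller rather than discovered by a subspace clustering algorithm.

```python
import math
import random
from typing import List, Sequence

Point = List[float]


def microaggregate(points: Sequence[Point], k: int) -> List[List[int]]:
    """Greedily partition point indices into groups of at least k members.

    Assumption: a simple heuristic that takes the first unassigned point as
    an anchor and groups it with its k-1 nearest unassigned neighbours by
    Euclidean distance. Leftover points form the final group.
    """
    remaining = list(range(len(points)))
    groups: List[List[int]] = []
    while len(remaining) >= 2 * k:
        anchor = remaining[0]
        remaining.sort(key=lambda i: math.dist(points[anchor], points[i]))
        groups.append(remaining[:k])
        remaining = remaining[k:]
    if remaining:
        groups.append(remaining)
    return groups


def anonymize_with_noise(
    points: Sequence[Point],
    dims: Sequence[int],
    k: int,
    noise_scale: float = 0.1,
    seed: int = 0,
) -> List[Point]:
    """Aggregate the chosen dimensions per group, then add bounded noise.

    Each value in `dims` is replaced by its group centroid plus uniform
    noise, clamped to the group's original [min, max] range (the assumed
    "subspace limits"). Dimensions outside `dims` are left unchanged.
    """
    rng = random.Random(seed)
    projected = [[p[d] for d in dims] for p in points]
    masked = [list(p) for p in points]
    for group in microaggregate(projected, k):
        for j, d in enumerate(dims):
            values = [projected[i][j] for i in group]
            lo, hi = min(values), max(values)
            centroid = sum(values) / len(values)
            spread = noise_scale * (hi - lo)
            for i in group:
                noisy = centroid + rng.uniform(-spread, spread)
                masked[i][d] = min(hi, max(lo, noisy))
    return masked


data = [
    [1.0, 2.0, 9.0], [1.2, 2.1, 8.0], [0.9, 1.8, 7.0],
    [5.0, 6.0, 1.0], [5.2, 6.1, 2.0], [4.9, 5.8, 3.0],
]
masked = anonymize_with_noise(data, dims=[0, 1], k=3, noise_scale=0.5, seed=1)
for row in masked:
    print([round(v, 3) for v in row])
```

With `noise_scale=0.0` the sketch reduces to plain aggregation, where every record in a group carries the same centroid values. A positive `noise_scale` spreads the records again while keeping them inside the group's original range, which is the intuition behind adding noise after anonymization.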

Author information

Corresponding author

Correspondence to Shashidhar Virupaksha.

Additional information

This article is part of the Topical Collection: Special Issue on Privacy-Preserving Computing

Guest Editors: Kaiping Xue, Zhe Liu, Haojin Zhu, Miao Pan and David S.L. Wei

About this article

Cite this article

Virupaksha, S., Dondeti, V. Anonymized noise addition in subspaces for privacy preserved data mining in high dimensional continuous data. Peer-to-Peer Netw. Appl. (2021). https://doi.org/10.1007/s12083-021-01080-y

Keywords

  • Privacy preserving data mining
  • Noise addition
  • Data privacy