Abstract
Data privacy is a major concern in data mining, and privacy-preserving data mining algorithms are widely used to address it. However, applying these algorithms to high-dimensional continuous data causes high data loss and information loss, and makes clusters very difficult to identify. This paper proposes a novel technique, Anonymized Noise Addition in Subspaces (ANAS), which reduces data loss and information loss while improving both cluster identification and privacy. ANAS first anonymizes the data by aggregation within dense and non-dense subspaces, grouping records by Euclidean distance, to reduce data loss and enhance privacy. It then adds random noise, bounded by the subspace limits, to the anonymized subspaces to improve cluster identification and further reduce data loss. ANAS is evaluated on benchmark datasets. The results show that it identifies 80% of the original clusters on sparse datasets, whereas the existing techniques do not identify any. ANAS reduces data loss by 50% and information loss by 20%, and enhances privacy by 40%.
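The two steps named in the abstract, aggregation by Euclidean distance followed by bounded random noise, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the greedy grouping rule, the group size `k`, the uniform noise with a relative `scale`, and the reading of "subspace limits" as the per-dimension minimum and maximum are all assumptions made for illustration. The subspace is passed in as a list of column indices; the detection of dense and non-dense subspaces is not shown.

```python
import numpy as np


def microaggregate(S, k):
    """Greedy aggregation (assumed rule): replace each record by the mean
    of a group of at least k records that are close in Euclidean distance."""
    S = np.asarray(S, dtype=float)
    out = np.empty_like(S)
    remaining = list(range(len(S)))
    while remaining:
        if len(remaining) < 2 * k:
            # The last group absorbs the leftovers so no group is smaller than k.
            group = remaining
        else:
            seed = remaining[0]
            dist = np.linalg.norm(S[remaining] - S[seed], axis=1)
            nearest = np.argsort(dist)[:k]
            group = [remaining[i] for i in nearest]
        out[group] = S[group].mean(axis=0)
        chosen = set(group)
        remaining = [i for i in remaining if i not in chosen]
    return out


def add_bounded_noise(A, lo, hi, scale, rng):
    """Add uniform noise proportional to each dimension's range, then clip
    to [lo, hi] so perturbed values stay inside the subspace limits."""
    noise = rng.uniform(-scale, scale, size=A.shape) * (hi - lo)
    return np.clip(A + noise, lo, hi)


def anas_sketch(X, dims, k=3, scale=0.05, seed=0):
    """Apply aggregation and then bounded noise to the columns in dims
    (a hypothetical, pre-identified subspace); other columns are untouched."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    S = X[:, dims]
    lo, hi = S.min(axis=0), S.max(axis=0)
    Y = X.copy()
    Y[:, dims] = add_bounded_noise(microaggregate(S, k), lo, hi, scale, rng)
    return Y


if __name__ == "__main__":
    data = np.random.default_rng(1).normal(size=(10, 4))
    print(anas_sketch(data, dims=[0, 1], k=3))
```

In this sketch the aggregation step provides the anonymity (every released value is derived from a group of at least `k` records), while the noise step keeps group members from collapsing onto a single point, which is one plausible reason bounded noise could help clustering algorithms recover structure.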
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This article is part of the Topical Collection: Special Issue on Privacy-Preserving Computing
Guest Editors: Kaiping Xue, Zhe Liu, Haojin Zhu, Miao Pan and David S.L. Wei
Cite this article
Virupaksha, S., Dondeti, V. Anonymized noise addition in subspaces for privacy preserved data mining in high dimensional continuous data. Peer-to-Peer Netw. Appl. 14, 1608–1628 (2021). https://doi.org/10.1007/s12083-021-01080-y
DOI: https://doi.org/10.1007/s12083-021-01080-y