Abstract
Despite its attractiveness, Big Data oftentimes is hard, slow and expensive to handle due to its size. Moreover, as the amount of collected data grows, individual privacy raises more and more concerns: “what do they know about me?” Different algorithms were suggested to enable privacy-preserving data release with the current de-facto standard differential privacy. However, the processing time of keeping the data private is inhibiting and currently not practical for every day use. Combined with the continuously growing data collection, the solution is not seen on a horizon.
In this research, we suggest replacing the Big Data with a much smaller similitude model. The model “resembles” the data with respect to a class of query. The user defines the maximum acceptable error and privacy requirements ahead of the query execution. Those requirements define the minimal size of the similitude model. The suggested method is demonstrated by using a wavelet transform and then by pruning the tree according to both the data reduction and the privacy requirements. We propose methods of combining the noise required for privacy preservation with noise of similitude model, that allow us to decrease the amount of added noise thus, improving the utilization of the method.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
References
Pandas - python data analysis library. http://pandas.pydata.org
Pywavelets - wavelet transforms in python. https://github.com/PyWavelets/pywt
Ács, G., Castelluccia, C., Chen, R.: Differentially private histogram publishing through lossy compression. In: 2012 IEEE 12th International Conference on Data Mining, pp. 1–10 (2012)
AT&T and Contributers. Graphviz - graph visualization software. http://graphviz.org
Barak, B., Chaudhuri, K., Dwork, C., Kale, S., McSherry, F., Talwar, K.: Privacy, accuracy, and consistency too: a holistic solution to contingency table release. In: Proceedings of the Twenty-Sixth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2007, pp. 273–282. ACM, New York (2007)
Blum, A., Dwork, C., Mcsherry, F., Nissim, K.: Practical privacy: the SulQ framework. In: PODS, pp. 128–138. ACM (2005)
Blum, A., Ligett, K., Roth, A.: A learning theory approach to non-interactive database privacy. In: Proceedings of the Fortieth Annual ACM Symposium on Theory of Computing, STOC 2008, pp. 609–618. ACM, New York (2008)
Chakrabarti, K., Garofalakis, M., Rastogi, R., Shim, K.: Approximate query processing using wavelets. VLDB J. 10(2–3), 199–223 (2001)
Dwork, C.: Differential privacy. In: Bugliesi, M., Preneel, B., Sassone, V., Wegener, I. (eds.) ICALP 2006. LNCS, vol. 4052, pp. 1–12. Springer, Heidelberg (2006). https://doi.org/10.1007/11787006_1
Dwork, C., McSherry, F., Nissim, K., Smith, A.: Calibrating noise to sensitivity in private data analysis. In: Halevi, S., Rabin, T. (eds.) TCC 2006. LNCS, vol. 3876, pp. 265–284. Springer, Heidelberg (2006). https://doi.org/10.1007/11681878_14
Aggarwal, C.C. (ed.): Data Streams: Models and Algorithms. Springer, New York (2007). https://doi.org/10.1007/978-0-387-47534-9
Gaboardi, M., Arias, E.J.G., Hsu, J., Roth, A., Wu, Z.S.: Dual query: practical private query release for high dimensional data. In: Xing, E.P., Jebara, T. (eds.) Proceedings of the 31st International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 32, pp. 1170–1178. PMLR, Bejing, 22–24 June 2014
Garofalakis, M., Kumar, A.: Deterministic wavelet thresholding for maximum-error metrics. In: Proceedings of the Twenty-Third ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2004, pp. 166–176. ACM, New York (2004)
Gilbert, A.C., Kotidis, Y., Muthukrishnan, S., Strauss, M.J.: Optimal and approximate computation of summary statistics for range aggregates. In: Proceedings of the Twentieth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2001, pp. 227–236. ACM, New York (2001)
Hardt, M., Ligett, K., Mcsherry, F.: A simple and practical algorithm for differentially private data release. In: Pereira, F., Burges, C.J.C., Bottou, L., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems 25, pp. 2339–2347. Curran Associates Inc. (2012)
Hardt, M., Rothblum, G.: A multiplicative weights mechanism for privacy-preserving data analysis, pp. 61–70, May 2010
Hay, M., Rastogi, V., Miklau, G., Suciu, D.: Boosting the accuracy of differentially private histograms through consistency. Proc. VLDB Endow. 3(1–2), 1021–1032 (2010)
Lichman, M.: UCI Machine Learning Repository (2013)
Matias, Y., Vitter, J.S., Wang, M.: Wavelet-based histograms for selectivity estimation. SIGMOD Rec. 27(2), 448–459 (1998)
Qardaji, W.H., Yang, W., Li, N.: Understanding hierarchical methods for differentially private histograms. PVLDB 6, 1954–1965 (2013)
Rastogi, V., Nath, S.: Differentially private aggregation of distributed time-series with transformation and encryption. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, SIGMOD 2010, pp. 735–746. ACM, New York (2010)
Stollnitz, E.J., Derose, T.D., Salesin, D.H.: Wavelets for Computer Graphics: Theory and Applications. Morgan Kaufmann Publishers Inc., San Francisco (1996)
Ullman, J.: Answering n2+O(1) counting queries with differential privacy is hard. In: Proceedings of the Forty-Fifth Annual ACM Symposium on Theory of Computing, STOC 2013, pp. 361–370. ACM, New York (2013)
Ullman, J., Vadhan, S.: PCPs and the hardness of generating private synthetic data. In: Ishai, Y. (ed.) TCC 2011. LNCS, vol. 6597, pp. 400–416. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-19571-6_24
Vitter, J.S., Wang, M.: Approximate computation of multidimensional aggregates of sparse data using wavelets. SIGMOD Rec. 28(2), 193–204 (1999)
Vitter, J.S., Wang, M., Iyer, B.: Data cube approximation and histograms via wavelets. In: Proceedings of the Seventh International Conference on Information and Knowledge Management, CIKM 1998, pp. 96–104. ACM, New York (1998)
Xiao, X., Wang, G., Gehrke, J.: Differential privacy via wavelet transforms. In: 2010 IEEE 26th International Conference on Data Engineering (ICDE 2010), pp. 225–236 (2010)
Zhang, J., Cormode, G., Procopiuc, C.M., Srivastava, D., Xiao, X.: Privbayes: private data release via Bayesian networks. ACM Trans. Database Syst. 42(4), 25:1–25:41 (2017)
Acknowledgement
The research was partially supported by the Rita Altura Trust Chair in Computer Sciences; the Lynne and William Frankel Center for Computer Science; the Ministry of Foreign Affairs, Italy; the grant from the Ministry of Science, Technology and Space, Israel, and the National Science Council (NSC) of Taiwan; the Ministry of Science, Technology and Space, Infrastructure Research in the Field of Advanced Computing and Cyber Security; and the Israel National Cyber Bureau.
Authors are grateful to John Ullman for the fruitful discussions of the paper ideas and differential privacy.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
About this paper
Cite this paper
Derbeko, P., Dolev, S., Gudes, E. (2018). Privacy via Maintaining Small Similitude Data for Big Data Statistical Representation. In: Dinur, I., Dolev, S., Lodha, S. (eds) Cyber Security Cryptography and Machine Learning. CSCML 2018. Lecture Notes in Computer Science(), vol 10879. Springer, Cham. https://doi.org/10.1007/978-3-319-94147-9_9
Download citation
DOI: https://doi.org/10.1007/978-3-319-94147-9_9
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-94146-2
Online ISBN: 978-3-319-94147-9
eBook Packages: Computer ScienceComputer Science (R0)