Abstract
An abundance of data generated from a multitude of sources, and the intelligence derived by analyzing it, has become an important asset across many walks of life. Simultaneously, it raises serious concerns about privacy. Differential privacy has become a popular way to reason about how much information about individual entries of a dataset is divulged when a perturbed result is released for a query on that dataset. However, current differentially-private algorithms are computationally inefficient and do not explicitly exploit the abundance of data, thus wearing out the privacy budget irrespective of the volume of data. In this paper, we propose BPMiner, a solution that is both private and accurate, while simultaneously addressing the computation and budget challenges of very big datasets. The main idea is a non-trivial combination of differential privacy, sample-and-aggregation, and a classical statistical methodology called sequential estimation. Rigorous proofs of the privacy and asymptotic accuracy of our solution are provided. Furthermore, experimental results over multiple datasets demonstrate that BPMiner outperforms current private algorithms in terms of computational and budget efficiency, while achieving comparable accuracy. Overall, BPMiner is a practical solution based on strong theoretical foundations for privacy-preserving analysis on big datasets.
Notes
- 1. In this paper, an ‘individual’ refers to an entry in a statistical database, which may correspond to information about a real-world entity, e.g., a patient’s record, a financial transaction, etc.
- 2.
- 3. This name reflects the fact that we use sequential estimation w.r.t. data blocks. It is not to be confused with the moving blocks bootstrap in the field of bootstrap/subsampling.
References
Agarwal, A., Chapelle, O., Dudík, M., Langford, J.: A reliable effective terascale linear learning system. J. Mach. Learn. Res. 15(1), 1111–1133 (2014)
Agarwal, S., Milner, H., Kleiner, A., Talwalkar, A., Jordan, M.I., Madden, S., Mozafari, B., Stoica, I.: Knowing when you’re wrong: building fast and reliable approximate query processing systems. In: International Conference on Management of Data, SIGMOD 2014, Snowbird, UT, USA, 22–27 June 2014, pp. 481–492 (2014)
Anscombe, F.J.: Large-sample theory of sequential estimation. Math. Proc. Cambridge Philos. Soc. 48(4), 600–607 (1952)
Aoshima, M., Yata, K.: Two-stage procedures for high-dimensional data. Sequential Analysis 30(4), 356–399 (2011)
Bache, K., Lichman, M.: UCI Machine Learning Repository. University of California, Irvine, School of Information and Computer Sciences (2013)
Bertin-Mahieux, T., Ellis, D.P.W., Whitman, B., Lamere, P.: The million song dataset. In: Proceedings of the 12th International Society for Music Information Retrieval Conference ISMIR 2011, Miami, Florida, USA, 24–28 October 2011, pp. 591–596 (2011)
Bickel, P.J., Levina, E.: Regularized estimation of large covariance matrices. Ann. Stat. 36, 199–227 (2008)
Blum, A., Dwork, C., McSherry, F., Nissim, K.: Practical privacy: the SuLQ framework. In: Proceedings of the Twenty-Fourth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, Baltimore, Maryland, USA, 13–15 June 2005, pp. 128–138 (2005)
Cai, Z., Gao, Z.J., Luo, S., Perez, L.L., Vagena, Z., Jermaine, C.M.: A comparison of platforms for implementing and running very large scale machine learning algorithms. In: International Conference on Management of Data, SIGMOD 2014, Snowbird, UT, USA, 22–27 June 2014, pp. 1371–1382 (2014)
Chen, J., Chen, X.: A new method for adaptive sequential sampling for learning and parameter estimation. In: Kryszkiewicz, M., Rybinski, H., Skowron, A., Raś, Z.W. (eds.) ISMIS 2011. LNCS, vol. 6804, pp. 220–229. Springer, Heidelberg (2011)
Chow, Y.S., Robbins, H.: On the asymptotic theory of fixed-width sequential confidence intervals for the mean. Ann. Math. Stat. 36(2), 457–462 (1965)
Condie, T., Mineiro, P., Polyzotis, N., Weimer, M.: Machine learning for big data. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2013, New York, NY, USA, 22–27 June 2013, pp. 939–942 (2013)
Condie, T., Mineiro, P., Polyzotis, N., Weimer, M.: Machine learning on big data. In: 29th IEEE International Conference on Data Engineering, ICDE 2013, Brisbane, Australia 8–12 April 2013, pp. 1242–1244 (2013)
Dwork, C.: Differential privacy. In: Bugliesi, M., Preneel, B., Sassone, V., Wegener, I. (eds.) ICALP 2006. LNCS, vol. 4052, pp. 1–12. Springer, Heidelberg (2006)
Dwork, C.: Differential privacy: a survey of results. In: Agrawal, M., Du, D.-Z., Duan, Z., Li, A. (eds.) TAMC 2008. LNCS, vol. 4978, pp. 1–19. Springer, Heidelberg (2008)
Dwork, C.: A firm foundation for private data analysis. Commun. ACM 54(1), 86–95 (2011)
Dwork, C., Rothblum, G.N., Vadhan, S.P.: Boosting and differential privacy. In: 51st Annual IEEE Symposium on Foundations of Computer Science, FOCS 2010, Las Vegas, Nevada, USA, 23–26 October 2010, pp. 51–60 (2010)
Dwork, C., Smith, A.: Differential privacy for statistics: What we know and what we want to learn. J. Priv. Confidentiality 1(2), 135–154 (2009)
Haeberlen, A., Pierce, B.C., Narayan, A.: Differential privacy under fire. In: 20th USENIX Security Symposium, San Francisco, CA, USA, 8–12 August 2011, Proceedings (2011)
Ho, C.-H., Lin, C.-J.: Large-scale linear support vector regression. J. Mach. Learn. Res. 13, 3323–3348 (2012)
Jordan, M.I.: Divide-and-conquer and statistical inference for big data. In: The 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2012, Beijing, China, 12–16 August 2012, p. 4 (2012)
Kleiner, A., Talwalkar, A., Sarkar, P., Jordan, M.I.: The big data bootstrap. In: Proceedings of the 29th International Conference on Machine Learning, ICML 2012, Edinburgh, Scotland, UK, 26 June – 1 July 2012, pp. 1759–1766 (2012)
Kraska, T., Talwalkar, A., Duchi, J.C., Griffith, R., Franklin, M.J., Jordan, M.I.: MLbase: a distributed machine-learning system. In: CIDR 2013, Sixth Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, 6–9 January 2013, Online Proceedings (2013)
Laptev, N., Zeng, K., Zaniolo, C.: Early accurate results for advanced analytics on mapreduce. Proc. VLDB Endow. 5(10), 1028–1039 (2012)
Laptev, N., Zeng, K., Zaniolo, C.: Very fast estimation for result and accuracy of big data analytics: the EARL system. In: 29th IEEE International Conference on Data Engineering, ICDE 2013, Brisbane, Australia, 8–12 April 2013, pp. 1296–1299 (2013)
Lin, J., Kolcz, A.: Large-scale machine learning at twitter. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2012, Scottsdale, AZ, USA, 20–24 May 2012, pp. 793–804 (2012)
Low, Y., Gonzalez, J., Kyrola, A., Bickson, D., Guestrin, C., Hellerstein, J.M.: Distributed graphlab: a framework for machine learning in the cloud. Proc. VLDB Endow. 5(8), 716–727 (2012)
McSherry, F.: Privacy integrated queries: an extensible platform for privacy-preserving data analysis. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2009, Providence, Rhode Island, USA, June 29 - July 2, 2009, pp. 19–30 (2009)
Mohan, P., Thakurta, A., Shi, E., Song, D., Culler, D.E.: Gupt: privacy preserving data analysis made easy. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2012, Scottsdale, AZ, USA, 20–24 May 2012, pp. 349–360 (2012)
Mukhopadhyay, N.: A consistent and asymptotically efficient two-stage procedure to construct fixed width confidence intervals for the mean. Metrika 27(1), 281–284 (1980)
Mukhopadhyay, N., de Silva, B.M.: Sequential Methods and Their Applications. Chapman and Hall/CRC, Boca Raton (2008)
Nadas, A.: An extension of a theorem of Chow and Robbins on sequential confidence intervals for the mean. Ann. Math. Stat. 40(2), 667–671 (1969)
Nissim, K., Raskhodnikova, S., Smith, A.: Smooth sensitivity and sampling in private data analysis. In: Proceedings of the 39th Annual ACM Symposium on Theory of Computing, San Diego, California, USA, 11–13 June 2007, pp. 75–84 (2007)
National Research Council, Committee on the Analysis of Massive Data, Committee on Applied and Theoretical Statistics, Board on Mathematical Sciences and Their Applications, Division on Engineering and Physical Sciences: Frontiers in Massive Data Analysis. The National Academies Press (2013)
Sandmann, W.: Sequential estimation for prescribed statistical accuracy in stochastic simulation of biological systems. Math. Biosci. 221(1), 43–53 (2009)
Seelbinder, B.M.: On Stein’s two-stage sampling scheme. Ann. Math. Stat. 24(4), 640–649 (1953)
Smith, A.: Asymptotically optimal and private statistical estimation. In: Garay, J.A., Miyaji, A., Otsuka, A. (eds.) CANS 2009. LNCS, vol. 5888, pp. 53–57. Springer, Heidelberg (2009)
Smith, A.: Privacy-preserving statistical estimation with optimal convergence rates. In: Proceedings of the 43rd ACM Symposium on Theory of Computing, STOC 2011, San Jose, CA, USA, 6–8 June 2011, pp. 813–822 (2011)
Wasserman, L.: Minimaxity, statistical thinking and differential privacy. J. Priv. Confidentiality 4(1), 51–63 (2012)
Xin, R.S., Rosen, J., Zaharia, M., Franklin, M.J., Shenker, S., Stoica, I.: Shark: SQL and rich analytics at scale. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2013, New York, NY, USA, 22–27 June 2013, pp. 13–24 (2013)
Yui, M., Kojima, I.: A database-hadoop hybrid approach to scalable machine learning. In: IEEE International Congress on Big Data, BigData Congress 2013, 27 June – 2 July 2013, pp. 1–8 (2013)
Zeng, K., Gao, S., Gu, J., Mozafari, B., Zaniolo, C.: ABS: a system for scalable approximate queries with accuracy guarantees. In: International Conference on Management of Data, SIGMOD 2014, Snowbird, UT, USA, 22–27 June 2014, pp. 1067–1070 (2014)
Zeng, K., Gao, S., Mozafari, B., Zaniolo, C.: The analytical bootstrap: a new method for fast error estimation in approximate query processing. In: International Conference on Management of Data, SIGMOD 2014, Snowbird, UT, USA, 22–27 June 2014, pp. 277–288 (2014)
Acknowledgement
This work was funded by A*Star Science and Engineering Research Council (SERC)’s Thematic Strategic Research Programme (TSRP) grant number 102 158 0038.
A Proof of Theorem 1
For estimators from [18, 29, 37], the Laplace noise Y is added directly to \(\bar{Z}\) as in Algorithm 1. We have the following useful lemma.
Lemma 5
Suppose the Laplace noise is given by \(Y = Y_k = \mathrm {Laplace}\left( \mathsf {Range}/k\epsilon \right) \). Then \(Y_k \xrightarrow {P} 0\) as \(k \rightarrow \infty \).
Proof
(of Lemma 5). Fix \(\eta > 0\) (we write \(\eta \) rather than \(\epsilon \) to avoid a clash with the privacy parameter) and consider the probability \(\Pr (|Y_k - 0| < \eta )\). This probability equals
\[ \Pr (|Y_k| < \eta ) = F_{Y_k}(\eta ) - F_{Y_k}(-\eta ). \]
The CDF of a Laplace random variable X is given by \(F_X(x) = \tfrac{1}{2} \exp \left( \tfrac{x-\mu }{\sigma } \right) \) if \(x < \mu \), and \(F_X(x) = 1 - \tfrac{1}{2} \exp \left( - \tfrac{x-\mu }{\sigma } \right) \) otherwise. Since \(Y_k\) has a Laplace distribution with parameters \(\mu = 0\) and \(\sigma = \mathsf {Range}/(k\epsilon )\), computing \(F_{Y_k}(\eta )\) and \(F_{Y_k}(-\eta )\) (with \(\eta > 0 = \mu \)) gives us
\[ \Pr (|Y_k| < \eta ) = 1 - \exp \left( - \frac{\eta k \epsilon }{\mathsf {Range}} \right) \rightarrow 1 \]
as \(k \rightarrow \infty \). Therefore, it follows from the definition of convergence in probability that \(Y_k \xrightarrow {P} 0\) as \(k \rightarrow \infty \).
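Lemma 5 can be sanity-checked numerically. The sketch below draws Laplace noise via inverse-CDF sampling and estimates \(\Pr(|Y_k| < \eta)\) for a small and a large block count k; the values of Range, \(\epsilon \), and \(\eta \) are illustrative choices, not taken from the paper.

```python
import math
import random

random.seed(0)

def laplace(scale):
    """Inverse-CDF sample from Laplace(0, scale)."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

RANGE, EPS, ETA = 1.0, 0.1, 0.05  # illustrative parameters (assumptions)
N = 20000                          # Monte-Carlo trials per estimate

def prob_within(k):
    """Estimate Pr(|Y_k| < ETA) with Y_k ~ Laplace(RANGE / (k * EPS))."""
    scale = RANGE / (k * EPS)
    hits = sum(abs(laplace(scale)) < ETA for _ in range(N))
    return hits / N

p_small, p_large = prob_within(5), prob_within(500)
# Closed form from the proof: Pr(|Y_k| < eta) = 1 - exp(-eta * k * EPS / RANGE),
# i.e. about 0.025 for k = 5 and about 0.918 for k = 500.
```

As k grows, the estimated probability approaches 1, matching the closed form derived in the proof.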
Using Lemma 5, the proof is straightforward for the statistics proposed in [18, 29, 37]: \(\Pr ( |\hat{\theta }_T(X) - \theta | \le \delta ) = \Pr ( | \bar{Z} + Y_k - \theta | \le \delta ) \rightarrow \Pr ( |\bar{Z} - \theta | \le \delta )\) as \(k \rightarrow \infty \), by Slutsky’s theorem.
For the estimator from [38], the noise Y is added to the winsorized mean rather than to \(\bar{Z}\), so the above argument does not apply. Nevertheless, the results extracted from [38] are particularly useful in deriving the proof. First, we observe that
\[ \Pr ( |\hat{\theta }_T(X) - \theta | \le \delta ) = F_{\hat{\theta }_T}(\theta + \delta ) - F_{\hat{\theta }_T}(\theta - \delta ), \qquad (22) \]
where \(F_X\) denotes the CDF of X. By Corollary 10 in [38], the \(\mathsf {KS}\) distance between \(\hat{\theta }_T\) and \(\bar{Z}\) goes to 0 as k goes to infinity, so \(\hat{\theta }_T\) converges in distribution to \(\bar{Z}\). In other words, when \(k \rightarrow \infty \), \(F_{\hat{\theta }_T}(t) \rightarrow F_{\bar{Z}}(t)\), and the expression (22) converges to
\[ F_{\bar{Z}}(\theta + \delta ) - F_{\bar{Z}}(\theta - \delta ) = \Pr ( |\bar{Z} - \theta | \le \delta ), \]
which completes the proof.
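The convergence used in the second half of the proof can also be illustrated empirically. The sketch below uses a plain (non-private) winsorized mean that clips observations at fixed empirical quantiles; this is only a stand-in for the private winsorized-mean estimator of [38], chosen to show that the winsorized mean and \(\bar{Z}\) agree ever more closely as the number of blocks grows. Quantile levels and the data distribution are illustrative assumptions.

```python
import random

random.seed(1)

def winsorized_mean(xs, lo_q=0.05, hi_q=0.95):
    """Clip each observation to the empirical lo_q/hi_q quantiles, then average.
    (A non-private stand-in for the estimator of [38], for illustration only.)"""
    s = sorted(xs)
    n = len(s)
    lo, hi = s[int(lo_q * n)], s[min(int(hi_q * n), n - 1)]
    clipped = [min(max(x, lo), hi) for x in xs]
    return sum(clipped) / len(clipped)

def gap(k):
    """|winsorized mean - plain mean| over k simulated block estimates."""
    zs = [random.gauss(0.0, 1.0) for _ in range(k)]
    return abs(winsorized_mean(zs) - sum(zs) / k)

g_large = gap(20000)  # shrinks toward 0 as k grows, by symmetry of the tails
```

For a symmetric block-estimate distribution, the clipped tails cancel in expectation, so the discrepancy between the two estimators vanishes as k grows, consistent with the KS-distance convergence invoked above.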
Copyright information
© 2015 Springer-Verlag Berlin Heidelberg
Cite this chapter
Thanh, Q.V., Datta, A. (2015). BPMiner: Algorithms for Large-Scale Private Analysis. In: Hameurlain, A., Küng, J., Wagner, R. (eds) Transactions on Large-Scale Data- and Knowledge-Centered Systems XXII. Lecture Notes in Computer Science(), vol 9430. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-48567-5_1
DOI: https://doi.org/10.1007/978-3-662-48567-5_1
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-48566-8
Online ISBN: 978-3-662-48567-5