Abstract
An abundance of data generated from a multitude of sources, and the intelligence derived by analyzing it, has become an important asset across many walks of life. Simultaneously, it raises serious concerns about privacy. Differential privacy has become a popular way to reason about how much information about individual entries of a dataset is divulged when a perturbed result is released for a query on that dataset. However, current differentially-private algorithms are computationally inefficient and do not explicitly exploit the abundance of data, thus wearing out the privacy budget irrespective of the volume of data. In this paper, we propose BPMiner, a solution that is both private and accurate, while simultaneously addressing the computation and budget challenges of very big datasets. The main idea is a non-trivial combination of differential privacy, sample-and-aggregation, and a classical statistical methodology called sequential estimation. Rigorous proofs of the privacy and asymptotic accuracy of our solution are provided. Furthermore, experimental results over multiple datasets demonstrate that BPMiner outperforms current private algorithms in terms of computational and budget efficiency, while achieving comparable accuracy. Overall, BPMiner is a practical solution based on strong theoretical foundations for privacy-preserving analysis on big datasets.
Notes
- 1. In this paper, an ‘individual’ refers to an entry in a statistical database, which may correspond to information about a real-world entity, e.g., a patient’s record, a financial transaction, etc.
- 2.
- 3. This name reflects the fact that we use sequential estimation w.r.t. data blocks. It is not to be confused with the moving blocks bootstrap in the field of bootstrap/subsampling.
References
Agarwal, A., Chapelle, O., Dudík, M., Langford, J.: A reliable effective terascale linear learning system. J. Mach. Learn. Res. 15(1), 1111–1133 (2014)
Agarwal, S., Milner, H., Kleiner, A., Talwalkar, A., Jordan, M.I., Madden, S., Mozafari, B., Stoica, I.: Knowing when you’re wrong: building fast and reliable approximate query processing systems. In: International Conference on Management of Data, SIGMOD 2014, Snowbird, UT, USA, 22–27 June 2014, pp. 481–492 (2014)
Anscombe, F.J.: Large-sample theory of sequential estimation. Math. Proc. Cambridge Philos. Soc. 48(4), 600–607 (1952)
Aoshima, M., Yata, K.: Two-stage procedures for high-dimensional data. Sequential Analysis 30(4), 356–399 (2011)
Bache, K., Lichman, M.: UCI Machine Learning Repository. University of California, Irvine, School of Information and Computer Sciences (2013)
Bertin-Mahieux, T., Ellis, D.P.W., Whitman, B., Lamere, P.: The million song dataset. In: Proceedings of the 12th International Society for Music Information Retrieval Conference ISMIR 2011, Miami, Florida, USA, 24–28 October 2011, pp. 591–596 (2011)
Bickel, P.J., Levina, E.: Regularized estimation of large covariance matrices. Ann. Stat. 36, 199–227 (2008)
Blum, A., Dwork, C., McSherry, F., Nissim, K.: Practical privacy: the SuLQ framework. In: Proceedings of the Twenty-Fourth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, Baltimore, Maryland, USA, 13–15 June 2005, pp. 128–138 (2005)
Cai, Z., Gao, Z.J., Luo, S., Perez, L.L., Vagena, Z., Jermaine, C.M.: A comparison of platforms for implementing and running very large scale machine learning algorithms. In: International Conference on Management of Data, SIGMOD 2014, Snowbird, UT, USA, 22–27 June 2014, pp. 1371–1382 (2014)
Chen, J., Chen, X.: A new method for adaptive sequential sampling for learning and parameter estimation. In: Kryszkiewicz, M., Rybinski, H., Skowron, A., Raś, Z.W. (eds.) ISMIS 2011. LNCS, vol. 6804, pp. 220–229. Springer, Heidelberg (2011)
Chow, Y.S., Robbins, H.: On the asymptotic theory of fixed-width sequential confidence intervals for the mean. Ann. Math. Stat. 36(2), 457–462 (1965)
Condie, T., Mineiro, P., Polyzotis, N., Weimer, M.: Machine learning for big data. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2013, New York, NY, USA, 22–27 June 2013, pp. 939–942 (2013)
Condie, T., Mineiro, P., Polyzotis, N., Weimer, M.: Machine learning on big data. In: 29th IEEE International Conference on Data Engineering, ICDE 2013, Brisbane, Australia 8–12 April 2013, pp. 1242–1244 (2013)
Dwork, C.: Differential privacy. In: Bugliesi, M., Preneel, B., Sassone, V., Wegener, I. (eds.) ICALP 2006. LNCS, vol. 4052, pp. 1–12. Springer, Heidelberg (2006)
Dwork, C.: Differential privacy: a survey of results. In: Agrawal, M., Du, D.-Z., Duan, Z., Li, A. (eds.) TAMC 2008. LNCS, vol. 4978, pp. 1–19. Springer, Heidelberg (2008)
Dwork, C.: A firm foundation for private data analysis. Commun. ACM 54(1), 86–95 (2011)
Dwork, C., Rothblum, G.N., Vadhan, S.P.: Boosting and differential privacy. In: 51st Annual IEEE Symposium on Foundations of Computer Science, FOCS 2010, Las Vegas, Nevada, USA, 23–26 October 2010, pp. 51–60 (2010)
Dwork, C., Smith, A.: Differential privacy for statistics: What we know and what we want to learn. J. Priv. Confidentiality 1(2), 135–154 (2009)
Haeberlen, A., Pierce, B.C., Narayan, A.: Differential privacy under fire. In: 20th USENIX Security Symposium, San Francisco, CA, USA, 8–12 August 2011, Proceedings (2011)
Ho, C.-H., Lin, C.-J.: Large-scale linear support vector regression. J. Mach. Learn. Res. 13, 3323–3348 (2012)
Jordan, M.I.: Divide-and-conquer and statistical inference for big data. In: The 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2012, Beijing, China, 12–16 August 2012, p. 4 (2012)
Kleiner, A., Talwalkar, A., Sarkar, P., Jordan, M.I.: The big data bootstrap. In: Proceedings of the 29th International Conference on Machine Learning, ICML 2012, Edinburgh, Scotland, UK, 26 June – 1 July 2012, pp. 1759–1766 (2012)
Kraska, T., Talwalkar, A., Duchi, J.C., Griffith, R., Franklin, M.J., Jordan, M.I.: MLbase: a distributed machine-learning system. In: CIDR 2013, Sixth Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, 6–9 January 2013, Online Proceedings (2013)
Laptev, N., Zeng, K., Zaniolo, C.: Early accurate results for advanced analytics on mapreduce. Proc. VLDB Endow. 5(10), 1028–1039 (2012)
Laptev, N., Zeng, K., Zaniolo, C.: Very fast estimation for result and accuracy of big data analytics: the EARL system. In: 29th IEEE International Conference on Data Engineering, ICDE 2013, Brisbane, Australia, 8–12 April 2013, pp. 1296–1299 (2013)
Lin, J., Kolcz, A.: Large-scale machine learning at twitter. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2012, Scottsdale, AZ, USA, 20–24 May 2012, pp. 793–804 (2012)
Low, Y., Gonzalez, J., Kyrola, A., Bickson, D., Guestrin, C., Hellerstein, J.M.: Distributed graphlab: a framework for machine learning in the cloud. Proc. VLDB Endow. 5(8), 716–727 (2012)
McSherry, F.: Privacy integrated queries: an extensible platform for privacy-preserving data analysis. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2009, Providence, Rhode Island, USA, June 29 - July 2, 2009, pp. 19–30 (2009)
Mohan, P., Thakurta, A., Shi, E., Song, D., Culler, D.E.: Gupt: privacy preserving data analysis made easy. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2012, Scottsdale, AZ, USA, 20–24 May 2012, pp. 349–360 (2012)
Mukhopadhyay, N.: A consistent and asymptotically efficient two-stage procedure to construct fixed width confidence intervals for the mean. Metrika 27(1), 281–284 (1980)
Mukhopadhyay, N., de Silva, B.M.: Sequential Methods and Their Applications. Chapman and Hall/CRC, Boca Raton (2008)
Nadas, A.: An extension of a theorem of Chow and Robbins on sequential confidence intervals for the mean. Ann. Math. Stat. 40(2), 667–671 (1969)
Nissim, K., Raskhodnikova, S., Smith, A.: Smooth sensitivity and sampling in private data analysis. In: Proceedings of the 39th Annual ACM Symposium on Theory of Computing, San Diego, California, USA, 11–13 June 2007, pp. 75–84 (2007)
National Research Council, Committee on the Analysis of Massive Data, Committee on Applied and Theoretical Statistics, Board on Mathematical Sciences and Their Applications, Division on Engineering and Physical Sciences: Frontiers in Massive Data Analysis. The National Academies Press (2013)
Sandmann, W.: Sequential estimation for prescribed statistical accuracy in stochastic simulation of biological systems. Math. Biosci. 221(1), 43–53 (2009)
Seelbinder, B.M.: On Stein’s two-stage sampling scheme. Ann. Math. Stat. 24(4), 640–649 (1953)
Smith, A.: Asymptotically optimal and private statistical estimation. In: Garay, J.A., Miyaji, A., Otsuka, A. (eds.) CANS 2009. LNCS, vol. 5888, pp. 53–57. Springer, Heidelberg (2009)
Smith, A.: Privacy-preserving statistical estimation with optimal convergence rates. In: Proceedings of the 43rd ACM Symposium on Theory of Computing, STOC 2011, San Jose, CA, USA, 6–8 June 2011, pp. 813–822 (2011)
Wasserman, L.: Minimaxity, statistical thinking and differential privacy. J. Priv. Confidentiality 4(1), 51–63 (2012)
Xin, R.S., Rosen, J., Zaharia, M., Franklin, M.J., Shenker, S., Stoica, I.: Shark: SQL and rich analytics at scale. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2013, New York, NY, USA, 22–27 June 2013, pp. 13–24 (2013)
Yui, M., Kojima, I.: A database-hadoop hybrid approach to scalable machine learning. In: IEEE International Congress on Big Data, BigData Congress 2013, 27 June – 2 July 2013, pp. 1–8 (2013)
Zeng, K., Gao, S., Gu, J., Mozafari, B., Zaniolo, C.: ABS: a system for scalable approximate queries with accuracy guarantees. In: International Conference on Management of Data, SIGMOD 2014, Snowbird, UT, USA, 22–27 June 2014, pp. 1067–1070 (2014)
Zeng, K., Gao, S., Mozafari, B., Zaniolo, C.: The analytical bootstrap: a new method for fast error estimation in approximate query processing. In: International Conference on Management of Data, SIGMOD 2014, Snowbird, UT, USA, 22–27 June 2014, pp. 277–288 (2014)
Acknowledgement
This work was funded by A*Star Science and Engineering Research Council (SERC)’s Thematic Strategic Research Programme (TSRP) grant number 102 158 0038.
A Proof of Theorem 1
For estimators from [18, 29, 37], the Laplace noise Y is added directly to \(\bar{Z}\) as in Algorithm 1. We have the following useful lemma.
Lemma 5
Suppose the Laplace noise is given by \(Y = Y_k = \mathrm {Laplace}\left( \mathsf {Range}/k\epsilon \right) \). Then \(Y_k \xrightarrow {P} 0\) as \(k \rightarrow \infty \).
Proof
(of Lemma 5). Fix \(\eta > 0\) (we write \(\eta \) rather than \(\epsilon \) to avoid a clash with the privacy parameter) and consider the probability \(\Pr (|Y_k - 0| < \eta )\). This probability equals
\[ \Pr (|Y_k| < \eta ) = F_{Y_k}(\eta ) - F_{Y_k}(-\eta ). \]
The CDF of a Laplace random variable X is given by \(F_X(x) = \tfrac{1}{2} \exp \left( \tfrac{x-\mu }{\sigma } \right) \) if \(x < \mu \), and \(F_X(x) = 1 - \tfrac{1}{2} \exp \left( - \tfrac{x-\mu }{\sigma } \right) \) otherwise. Since \(Y_k\) has a Laplace distribution with parameters \(\mu = 0\) and \(\sigma = \mathsf {Range}/(k\epsilon )\), computing \(F_{Y_k}(\eta )\) and \(F_{Y_k}(-\eta )\) (with \(\eta > 0 = \mu \)) gives us
\[ \Pr (|Y_k| < \eta ) = 1 - \exp \left( - \frac{\eta k \epsilon }{\mathsf {Range}} \right) \rightarrow 1 \]
as \(k \rightarrow \infty \). Therefore, it follows from the definition of convergence in probability that \(Y_k \xrightarrow {P} 0\) as \(k \rightarrow \infty \).
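Lemma 5 can be sanity-checked numerically. The sketch below draws Laplace noise via inverse-CDF sampling and estimates \(\Pr(|Y_k| < \eta)\) for a small and a large block count k; the values of Range, \(\epsilon \), and \(\eta \) are illustrative choices, not taken from the paper.

```python
import math
import random

random.seed(0)

def laplace(scale):
    """Inverse-CDF sample from Laplace(0, scale)."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

RANGE, EPS, ETA = 1.0, 0.1, 0.05  # illustrative parameters (assumptions)
N = 20000                          # Monte-Carlo trials per estimate

def prob_within(k):
    """Estimate Pr(|Y_k| < ETA) with Y_k ~ Laplace(RANGE / (k * EPS))."""
    scale = RANGE / (k * EPS)
    hits = sum(abs(laplace(scale)) < ETA for _ in range(N))
    return hits / N

p_small, p_large = prob_within(5), prob_within(500)
# Closed form from the proof: Pr(|Y_k| < eta) = 1 - exp(-eta * k * EPS / RANGE),
# i.e. about 0.025 for k = 5 and about 0.918 for k = 500.
```

As k grows, the estimated probability approaches 1, matching the closed form derived in the proof.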
Using Lemma 5, the proof is straightforward for the statistics proposed in [18, 29, 37]: \(\Pr ( |\hat{\theta }_T(X) - \theta | \le \delta ) = \Pr ( | \bar{Z} + Y_k - \theta | \le \delta ) \rightarrow \Pr ( |\bar{Z} - \theta | \le \delta )\) as \(k \rightarrow \infty \), by Slutsky’s theorem.
For the estimator from [38], the noise Y is added to the winsorized mean rather than to \(\bar{Z}\), so the above argument does not apply. Nevertheless, the results extracted from [38] are particularly useful in deriving the proof. First, we observe that
\[ \Pr ( |\hat{\theta }_T(X) - \theta | \le \delta ) = F_{\hat{\theta }_T}(\theta + \delta ) - F_{\hat{\theta }_T}(\theta - \delta ), \qquad (22) \]
where \(F_X\) denotes the CDF of X. By Corollary 10 in [38], the \(\mathsf {KS}\) distance between \(\hat{\theta }_T\) and \(\bar{Z}\) goes to 0 as k goes to infinity, so \(\hat{\theta }_T\) converges in distribution to \(\bar{Z}\). In other words, when \(k \rightarrow \infty \), \(F_{\hat{\theta }_T}(t) \rightarrow F_{\bar{Z}}(t)\), and the expression (22) converges to
\[ F_{\bar{Z}}(\theta + \delta ) - F_{\bar{Z}}(\theta - \delta ) = \Pr ( |\bar{Z} - \theta | \le \delta ), \]
which completes the proof.
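The convergence used in the second half of the proof can also be illustrated empirically. The sketch below uses a plain (non-private) winsorized mean that clips observations at fixed empirical quantiles; this is only a stand-in for the private winsorized-mean estimator of [38], chosen to show that the winsorized mean and \(\bar{Z}\) agree ever more closely as the number of blocks grows. Quantile levels and the data distribution are illustrative assumptions.

```python
import random

random.seed(1)

def winsorized_mean(xs, lo_q=0.05, hi_q=0.95):
    """Clip each observation to the empirical lo_q/hi_q quantiles, then average.
    (A non-private stand-in for the estimator of [38], for illustration only.)"""
    s = sorted(xs)
    n = len(s)
    lo, hi = s[int(lo_q * n)], s[min(int(hi_q * n), n - 1)]
    clipped = [min(max(x, lo), hi) for x in xs]
    return sum(clipped) / len(clipped)

def gap(k):
    """|winsorized mean - plain mean| over k simulated block estimates."""
    zs = [random.gauss(0.0, 1.0) for _ in range(k)]
    return abs(winsorized_mean(zs) - sum(zs) / k)

g_large = gap(20000)  # shrinks toward 0 as k grows, by symmetry of the tails
```

For a symmetric block-estimate distribution, the clipped tails cancel in expectation, so the discrepancy between the two estimators vanishes as k grows, consistent with the KS-distance convergence invoked above.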
Copyright information
© 2015 Springer-Verlag Berlin Heidelberg
Cite this chapter
Thanh, Q.V., Datta, A. (2015). BPMiner: Algorithms for Large-Scale Private Analysis. In: Hameurlain, A., Küng, J., Wagner, R. (eds) Transactions on Large-Scale Data- and Knowledge-Centered Systems XXII. Lecture Notes in Computer Science(), vol 9430. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-48567-5_1
DOI: https://doi.org/10.1007/978-3-662-48567-5_1
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-48566-8
Online ISBN: 978-3-662-48567-5