
BPMiner: Algorithms for Large-Scale Private Analysis

Transactions on Large-Scale Data- and Knowledge-Centered Systems XXII

Part of the book series: Lecture Notes in Computer Science ((TLDKS,volume 9430))


Abstract

An abundance of data generated from a multitude of sources, and the intelligence derived by analyzing it, has become an important asset across many walks of life. Simultaneously, it raises serious concerns about privacy. Differential privacy has become a popular way to reason about the amount of information about individual entries of a dataset that is divulged upon giving out a perturbed result for a query on that dataset. However, current differentially-private algorithms are computationally inefficient and do not explicitly exploit the abundance of data, wearing out the privacy budget irrespective of the data volume. In this paper, we propose BPMiner, a solution that is both private and accurate, while simultaneously addressing the computation and budget challenges of very big datasets. The main idea is a non-trivial combination of differential privacy, sample-and-aggregate, and a classical statistical methodology called sequential estimation. Rigorous proofs of the privacy and asymptotic accuracy of our solution are provided. Furthermore, experimental results over multiple datasets demonstrate that BPMiner outperforms current private algorithms in terms of computational and budget efficiency, while achieving comparable accuracy. Overall, BPMiner is a practical solution based on strong theoretical foundations for privacy-preserving analysis on big datasets.
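To make the idea concrete, the following is a minimal sketch of the sample-and-aggregate pattern the chapter builds on (not the authors' implementation; all names are hypothetical): partition the data into k blocks, compute the statistic on each block, average the block results, and add Laplace noise calibrated to the aggregation's sensitivity, Range/(kε).

```python
import math
import random
import statistics

def laplace_noise(scale):
    """Draw one sample from Laplace(0, scale) by inverse-CDF sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_block_mean(data, epsilon, k, value_range):
    """Sample-and-aggregate sketch: average per-block means, then add
    Laplace noise. Changing one record moves at most one block mean,
    hence the aggregate by at most value_range / k, so the noise scale
    is value_range / (k * epsilon)."""
    block_size = len(data) // k
    block_means = [
        statistics.fmean(data[i * block_size:(i + 1) * block_size])
        for i in range(k)
    ]
    aggregate = statistics.fmean(block_means)
    return aggregate + laplace_noise(value_range / (k * epsilon))
```

Note how, with ε fixed, the noise scale shrinks as 1/k: more data (hence more blocks) yields a more accurate answer at the same privacy cost, which is the "abundance of data" effect the abstract refers to.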


Notes

  1.

    In this paper, an ‘individual’ refers to an entry in a statistical database, which may correspond to information about a real-world entity, e.g., a patient’s record, a financial transaction, etc.

  2.

    While some works regard ‘privacy budget’ as \(\epsilon \) (the privacy parameter), we consider it a fixed budget that is reduced per analysis. Such an interpretation is seen in [19, 28, 29].

  3.

    This name reflects the fact that we use sequential estimation w.r.t. data blocks. It is not to be confused with the moving blocks bootstrap in the field of bootstrap/subsampling.
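The fixed-budget interpretation in note 2 can be sketched as a simple accountant (an illustrative sketch with hypothetical names; systems such as PINQ [28] and GUPT [29] implement far more careful accounting):

```python
class PrivacyBudget:
    """A fixed privacy budget that each analysis depletes; once it is
    exhausted, no further queries are answered (the interpretation in
    note 2, as in [19, 28, 29])."""

    def __init__(self, total):
        self.remaining = total

    def spend(self, epsilon):
        # Sequential composition: total privacy loss is the sum of the
        # epsilons of all answered queries, so each query draws down
        # the shared budget.
        if epsilon > self.remaining:
            raise RuntimeError("privacy budget exhausted")
        self.remaining -= epsilon

budget = PrivacyBudget(1.0)
budget.spend(0.3)   # first analysis
budget.spend(0.3)   # second analysis; about 0.4 of the budget remains
```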

References

  1. Agarwal, A., Chapelle, O., Dudík, M., Langford, J.: A reliable effective terascale linear learning system. J. Mach. Learn. Res. 15(1), 1111–1133 (2014)


  2. Agarwal, S., Milner, H., Kleiner, A., Talwalkar, A., Jordan, M.I., Madden, S., Mozafari, B., Stoica, I.: Knowing when you’re wrong: building fast and reliable approximate query processing systems. In: International Conference on Management of Data, SIGMOD 2014, Snowbird, UT, USA, 22–27 June 2014, pp. 481–492 (2014)


  3. Anscombe, F.J.: Large-sample theory of sequential estimation. Math. Proc. Cambridge Philos. Soc. 48(4), 600–607 (1952)


  4. Aoshima, M., Yata, K.: Two-stage procedures for high-dimensional data. Sequential Analysis 30(4), 356–399 (2011)


  5. Bache, K., Lichman, M.: UCI Machine Learning Repository. University of California, Irvine, School of Information and Computer Sciences (2013)


  6. Bertin-Mahieux, T., Ellis, D.P.W., Whitman, B., Lamere, P.: The million song dataset. In: Proceedings of the 12th International Society for Music Information Retrieval Conference ISMIR 2011, Miami, Florida, USA, 24–28 October 2011, pp. 591–596 (2011)


  7. Bickel, P.J., Levina, E.: Regularized estimation of large covariance matrices. Ann. Stat. 36, 199–227 (2008)


  8. Blum, A., Dwork, C., McSherry, F., Nissim, K.: Practical privacy: the SuLQ framework. In: Proceedings of the Twenty-Fourth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, Baltimore, Maryland, USA, 13–15 June 2005, pp. 128–138 (2005)


  9. Cai, Z., Gao, Z.J., Luo, S., Perez, L.L., Vagena, Z., Jermaine, C.M.: A comparison of platforms for implementing and running very large scale machine learning algorithms. In: International Conference on Management of Data, SIGMOD 2014, Snowbird, UT, USA, 22–27 June 2014, pp. 1371–1382 (2014)


  10. Chen, J., Chen, X.: A new method for adaptive sequential sampling for learning and parameter estimation. In: Kryszkiewicz, M., Rybinski, H., Skowron, A., Raś, Z.W. (eds.) ISMIS 2011. LNCS, vol. 6804, pp. 220–229. Springer, Heidelberg (2011)


  11. Chow, Y.S., Robbins, H.: On the asymptotic theory of fixed-width sequential confidence intervals for the mean. Ann. Math. Stat. 36(2), 457–462 (1965)


  12. Condie, T., Mineiro, P., Polyzotis, N., Weimer, M.: Machine learning for big data. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2013, New York, NY, USA, 22–27 June 2013, pp. 939–942 (2013)


  13. Condie, T., Mineiro, P., Polyzotis, N., Weimer, M.: Machine learning on big data. In: 29th IEEE International Conference on Data Engineering, ICDE 2013, Brisbane, Australia 8–12 April 2013, pp. 1242–1244 (2013)


  14. Dwork, C.: Differential privacy. In: Bugliesi, M., Preneel, B., Sassone, V., Wegener, I. (eds.) ICALP 2006. LNCS, vol. 4052, pp. 1–12. Springer, Heidelberg (2006)


  15. Dwork, C.: Differential privacy: a survey of results. In: Agrawal, M., Du, D.-Z., Duan, Z., Li, A. (eds.) TAMC 2008. LNCS, vol. 4978, pp. 1–19. Springer, Heidelberg (2008)


  16. Dwork, C.: A firm foundation for private data analysis. Commun. ACM 54(1), 86–95 (2011)


  17. Dwork, C., Rothblum, G.N., Vadhan, S.P.: Boosting and differential privacy. In: 51st Annual IEEE Symposium on Foundations of Computer Science, FOCS 2010, Las Vegas, Nevada, USA, 23–26 October 2010, pp. 51–60 (2010)


  18. Dwork, C., Smith, A.: Differential privacy for statistics: What we know and what we want to learn. J. Priv. Confidentiality 1(2), 135–154 (2009)


  19. Haeberlen, A., Pierce, B.C., Narayan, A.: Differential privacy under fire. In: 20th USENIX Security Symposium, San Francisco, CA, USA, 8–12 August 2011, Proceedings, pp. 33–33 (2011)


  20. Ho, C.-H., Lin, C.-J.: Large-scale linear support vector regression. J. Mach. Learn. Res. 13, 3323–3348 (2012)


  21. Jordan, M.I.: Divide-and-conquer and statistical inference for big data. In: The 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2012, Beijing, China, 12–16 August 2012, p. 4 (2012)


  22. Kleiner, A., Talwalkar, A., Sarkar, P., Jordan, M.I.: The big data bootstrap. In: Proceedings of the 29th International Conference on Machine Learning, ICML 2012, Edinburgh, Scotland, UK, 26 June - 1 July 2012, pp. 1759–1766 (2012)


  23. Kraska, T., Talwalkar, A., Duchi, J.C., Griffith, R., Franklin, M.J., Jordan, M.I.: MLbase: a distributed machine-learning system. In: CIDR 2013, Sixth Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, 6–9 January 2013, Online Proceedings (2013)


  24. Laptev, N., Zeng, K., Zaniolo, C.: Early accurate results for advanced analytics on MapReduce. Proc. VLDB Endow. 5(10), 1028–1039 (2012)


  25. Laptev, N., Zeng, K., Zaniolo, C.: Very fast estimation for result and accuracy of big data analytics: the EARL system. In: 29th IEEE International Conference on Data Engineering, ICDE 2013, Brisbane, Australia, 8–12 April 2013, pp. 1296–1299 (2013)


  26. Lin, J., Kolcz, A.: Large-scale machine learning at Twitter. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2012, Scottsdale, AZ, USA, 20–24 May 2012, pp. 793–804 (2012)


  27. Low, Y., Gonzalez, J., Kyrola, A., Bickson, D., Guestrin, C., Hellerstein, J.M.: Distributed GraphLab: a framework for machine learning in the cloud. Proc. VLDB Endow. 5(8), 716–727 (2012)


  28. McSherry, F.: Privacy integrated queries: an extensible platform for privacy-preserving data analysis. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2009, Providence, Rhode Island, USA, June 29 - July 2, 2009, pp. 19–30 (2009)


  29. Mohan, P., Thakurta, A., Shi, E., Song, D., Culler, D.E.: Gupt: privacy preserving data analysis made easy. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2012, Scottsdale, AZ, USA, 20–24 May 2012, pp. 349–360 (2012)


  30. Mukhopadhyay, N.: A consistent and asymptotically efficient two-stage procedure to construct fixed width confidence intervals for the mean. Metrika 27(1), 281–284 (1980)


  31. Mukhopadhyay, N., de Silva, B.M.: Sequential Methods and Their Applications. Chapman and Hall/CRC, Boca Raton (2008)


  32. Nadas, A.: An extension of a theorem of Chow and Robbins on sequential confidence intervals for the mean. Ann. Math. Stat. 40(2), 667–671 (1969)


  33. Nissim, K., Raskhodnikova, S., Smith, A.: Smooth sensitivity and sampling in private data analysis. In: Proceedings of the 39th Annual ACM Symposium on Theory of Computing, San Diego, California, USA, 11–13 June 2007, pp. 75–84 (2007)


  34. National Research Council: Frontiers in Massive Data Analysis. Committee on the Analysis of Massive Data; Committee on Applied and Theoretical Statistics; Board on Mathematical Sciences and Their Applications; Division on Engineering and Physical Sciences. The National Academies Press, Washington, DC (2013)


  35. Sandmann, W.: Sequential estimation for prescribed statistical accuracy in stochastic simulation of biological systems. Math. Biosci. 221(1), 43–53 (2009)


  36. Seelbinder, B.M.: On Stein's two-stage sampling scheme. Ann. Math. Stat. 24(4), 640–649 (1953)


  37. Smith, A.: Asymptotically optimal and private statistical estimation. In: Garay, J.A., Miyaji, A., Otsuka, A. (eds.) CANS 2009. LNCS, vol. 5888, pp. 53–57. Springer, Heidelberg (2009)


  38. Smith, A.: Privacy-preserving statistical estimation with optimal convergence rates. In: Proceedings of the 43rd ACM Symposium on Theory of Computing, STOC 2011, San Jose, CA, USA, June 6–8 2011, pp. 813–822 (2011)


  39. Wasserman, L.: Minimaxity, statistical thinking and differential privacy. J. Priv. Confidentiality 4(1), 51–63 (2012)


  40. Xin, R.S., Rosen, J., Zaharia, M., Franklin, M.J., Shenker, S., Stoica, I.: Shark: SQL and rich analytics at scale. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2013, New York, NY, USA, 22–27 June 2013, pp. 13–24 (2013)


  41. Yui, M., Kojima, I.: A database-Hadoop hybrid approach to scalable machine learning. In: IEEE International Congress on Big Data, BigData Congress 2013, 27 June - 2 July 2013, pp. 1–8 (2013)


  42. Zeng, K., Gao, S., Gu, J., Mozafari, B., Zaniolo, C.: ABS: a system for scalable approximate queries with accuracy guarantees. In: International Conference on Management of Data, SIGMOD 2014, Snowbird, UT, USA, 22–27 June 2014, pp. 1067–1070 (2014)


  43. Zeng, K., Gao, S., Mozafari, B., Zaniolo, C.: The analytical bootstrap: a new method for fast error estimation in approximate query processing. In: International Conference on Management of Data, SIGMOD 2014, Snowbird, UT, USA, 22–27 June 2014, pp. 277–288 (2014)



Acknowledgement

This work was funded by A*Star Science and Engineering Research Council (SERC)’s Thematic Strategic Research Programme (TSRP) grant number 102 158 0038.

Author information


Corresponding author

Correspondence to Anwitaman Datta.


A Proof of Theorem 1

For estimators from [18, 29, 37], the Laplace noise Y is added directly to \(\bar{Z}\) as in Algorithm 1. We have the following useful lemma.

Lemma 5

Suppose the Laplace noise is given by \(Y = Y_k = \mathrm {Laplace}\left( \mathsf {Range}/k\epsilon \right) \). Then \(Y_k \xrightarrow {P} 0\) as \(k \rightarrow \infty \).

Proof

(of Lemma 5). Given \(\epsilon > 0\), consider the probability \(\Pr (|Y_k - 0| < \epsilon )\). This probability equals

$$\begin{aligned} \Pr (|Y_k| < \epsilon ) = \Pr (-\epsilon < Y_k < \epsilon ) = F_{Y_k}(\epsilon ) - F_{Y_k}(-\epsilon ). \end{aligned}$$

The CDF of a Laplace random variable X is given by \(F_X(x) = \tfrac{1}{2} \exp \left( \tfrac{x-\mu }{\sigma } \right) \) if \(x < \mu \), and \(F_X(x) = 1 - \tfrac{1}{2} \exp \left( - \tfrac{x-\mu }{\sigma } \right) \) otherwise. Since \(Y_k\) has a Laplace distribution with parameters \(\mu = 0\) and \(\sigma = \mathsf {Range}/(k\epsilon )\), computing \(F_{Y_k}(\epsilon )\) and \(F_{Y_k}(-\epsilon )\) (with \(\epsilon > 0 = \mu \)) gives us

$$\begin{aligned} \Pr (|Y_k - \mu | < \epsilon ) = F_{Y_k}(\epsilon ) - F_{Y_k}(-\epsilon ) = 1 - \exp \left( - \tfrac{k\epsilon ^2}{\mathsf {Range}} \right) \rightarrow 1 \end{aligned}$$

as \(k \rightarrow \infty \). Therefore, it follows from the definition of convergence in probability that \(Y_k \xrightarrow {P} 0\) as \(k \rightarrow \infty \).
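Lemma 5 is easy to check numerically (an illustrative sketch, not from the chapter): the empirical probability \(\Pr (|Y_k| < \epsilon )\) should match the closed form \(1 - \exp (-k\epsilon ^2/\mathsf {Range})\) derived above and approach 1 as k grows.

```python
import math
import random

def laplace_sample(scale):
    """Inverse-CDF sample from Laplace(0, scale)."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def prob_within(eps, k, value_range, trials=20000):
    """Empirical Pr(|Y_k| < eps) for Y_k ~ Laplace(value_range / (k * eps))."""
    scale = value_range / (k * eps)
    hits = sum(abs(laplace_sample(scale)) < eps for _ in range(trials))
    return hits / trials

def closed_form(eps, k, value_range):
    """The probability computed in the proof of Lemma 5."""
    return 1.0 - math.exp(-k * eps ** 2 / value_range)
```

For example, with \(\epsilon = 0.5\) and \(\mathsf {Range} = 1\), the closed form gives \(1 - e^{-1} \approx 0.632\) at \(k = 4\) and essentially 1 at \(k = 100\), so the noise becomes negligible as the number of blocks grows.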

Using Lemma 5, the proof is straightforward for the statistics proposed in [18, 29, 37]: \(\Pr ( |\hat{\theta }_T(X) - \theta | \le \delta ) = \Pr ( | \bar{Z} + Y_k - \theta | \le \delta ) \rightarrow \Pr ( |\bar{Z} - \theta | \le \delta )\) as \(k \rightarrow \infty \).

For the estimator from [38], the noise Y is added to the winsorized mean rather than to \(\bar{Z}\), so the above argument does not apply. Nevertheless, results from [38] are particularly useful in deriving the proof. First, we observe that

$$\begin{aligned} \Pr ( |\hat{\theta }_T(X) - \theta | \le \delta )&= \Pr ( \theta - \delta \le \hat{\theta }_T(X) \le \theta + \delta ) \\&= F_{\hat{\theta }_T}(\theta + \delta ) - F_{\hat{\theta }_T}(\theta - \delta ), \end{aligned}$$
(22)

where \(F_X\) denotes the CDF of X. By Corollary 10 in [38], the \(\mathsf {KS}\) distance between \(\hat{\theta }_T\) and \(\bar{Z}\) goes to 0 as k goes to infinity. This leads to the fact that \(\hat{\theta }_T\) converges in distribution to \(\bar{Z}\). In other words, when \(k \rightarrow \infty \), \(F_{\hat{\theta }_T}(t) \rightarrow F_{\bar{Z}}(t)\), and the expression (22) converges to

$$\begin{aligned} F_{\bar{Z}}(\theta + \delta ) - F_{\bar{Z}}(\theta - \delta )&= \Pr ( \theta - \delta \le \bar{Z} \le \theta + \delta ) \\&= \Pr ( |\bar{Z} - \theta | \le \delta ), \end{aligned}$$

which completes the proof.


Copyright information

© 2015 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Thanh, Q.V., Datta, A. (2015). BPMiner: Algorithms for Large-Scale Private Analysis. In: Hameurlain, A., Küng, J., Wagner, R. (eds) Transactions on Large-Scale Data- and Knowledge-Centered Systems XXII. Lecture Notes in Computer Science, vol 9430. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-48567-5_1

  • DOI: https://doi.org/10.1007/978-3-662-48567-5_1

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-662-48566-8

  • Online ISBN: 978-3-662-48567-5

  • eBook Packages: Computer Science (R0)
