Practical Distributed Privacy-Preserving Data Analysis at Large Scale



In this chapter we investigate practical technologies for security and privacy in large-scale data analysis. We motivate our approach by discussing the challenges and opportunities that arise from current and emerging paradigms for analyzing large data sets. In particular, we present a framework for privacy-preserving distributed data analysis that is practical for many real-world applications. The framework, called Peers for Privacy (P4P), features a novel heterogeneous architecture and a number of efficient tools for performing private computation and offering security at large scale. It maintains three key properties that are essential for real-world applications: (i) provably strong privacy; (ii) adequate efficiency at reasonably large scale; and (iii) robustness against realistic adversaries. The framework gains its practicality by decomposing data mining algorithms into a sequence of vector addition steps, which can be privately evaluated using efficient cryptographic tools, namely verifiable secret sharing over a small field (e.g., 32 or 64 bits), whose operations have the same cost as regular, non-private arithmetic. This paradigm supports a large number of statistical learning algorithms, including SVD, PCA, k-means, ID3, and machine learning algorithms based on Expectation-Maximization, as well as all algorithms in the statistical query model (Kearns, Efficient noise-tolerant learning from statistical queries. In: STOC’93, San Diego, pp. 392–401, 1993). As a concrete example, we show how singular value decomposition, an extremely useful algorithm at the core of many data mining tasks, can be performed efficiently and privately in P4P. Using real data, we demonstrate that P4P is orders of magnitude faster than other solutions.
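The private vector addition step described above can be illustrated with a minimal sketch. It assumes plain two-party additive secret sharing over a 31-bit prime field; the modulus, the function names, and the two-server split are illustrative assumptions, and the sketch omits the verification (zero-knowledge) layer that a verifiable scheme such as the one in P4P would add.

```python
import secrets

# Illustrative small prime field (2^31 - 1); an assumption, not taken from the chapter.
PRIME = 2**31 - 1


def share_vector(vec, prime=PRIME):
    """Split an integer vector into two additive shares modulo `prime`."""
    share_a = [secrets.randbelow(prime) for _ in vec]
    share_b = [(x - r) % prime for x, r in zip(vec, share_a)]
    return share_a, share_b


def add_vectors(vectors, prime=PRIME):
    """Element-wise sum of equal-length vectors modulo `prime`."""
    total = [0] * len(vectors[0])
    for v in vectors:
        total = [(t + x) % prime for t, x in zip(total, v)]
    return total


def private_vector_sum(user_vectors, prime=PRIME):
    """Each user shares its vector between two hypothetical aggregators.

    Each aggregator sums only the shares it receives; an individual share
    is uniformly random, so neither aggregator alone learns a user's
    vector. Combining the two partial sums reveals only the total.
    """
    shares_a, shares_b = zip(*(share_vector(v, prime) for v in user_vectors))
    sum_a = add_vectors(shares_a, prime)  # computed by aggregator A
    sum_b = add_vectors(shares_b, prime)  # computed by aggregator B
    return [(a + b) % prime for a, b in zip(sum_a, sum_b)]


def decode_signed(value, prime=PRIME):
    """Map a field element back to a signed integer in (-prime/2, prime/2]."""
    return value - prime if value > prime // 2 else value
```

Each aggregator performs only modular additions on machine-word-sized integers, which is why the private step costs about the same as ordinary arithmetic. The result is correct as long as the true sum stays within the range representable in the chosen field.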


Keywords: Random Projection, Elliptic Curve Cryptography, Homomorphic Encryption, Differential Privacy, Secure Multiparty Computation



Copyright information

© Springer Science+Business Media New York 2014

Authors and Affiliations

  1. Beijing, China
  2. Computer Science Division, University of California, Berkeley, USA
