Abstract
Real-time processing of user data streams in online services creates a tension between users and analysts: users want stronger privacy, while analysts want higher-utility data analytics in real time. To resolve this tension, this paper describes the design, implementation, and evaluation of PrivApprox, a data analytics system for privacy-preserving stream processing. PrivApprox provides three important properties: (i) privacy: a zero-knowledge privacy guarantee for users, a privacy bound tighter than that of state-of-the-art differential privacy; (ii) utility: an interface for data analysts to systematically explore the trade-offs between output accuracy (with error estimation) and the query execution budget; and (iii) latency: near real-time stream processing based on a scalable "synchronization-free" distributed architecture. The key idea behind PrivApprox is to combine two techniques: sampling (used for approximate computation) and randomized response (used for privacy-preserving analytics). The two techniques are complementary: their combination achieves stronger privacy guarantees and also improves the performance of stream analytics.
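The combination of sampling and randomized response can be illustrated with a small simulation. The following is a minimal sketch, not the PrivApprox implementation: it assumes a single yes/no query, and the parameter names `s` (probability that a client takes part in the sample), `p` (probability of answering truthfully), and `q` (probability of answering "yes" when not answering truthfully) are chosen here for illustration, as are all function names. The analyst removes the expected noise from the observed "yes" rate to estimate the true fraction.

```python
import random


def client_response(truth, s, p, q, rng):
    """Return None if the client is not sampled, else a randomized yes/no answer."""
    # Sampling step: the client participates only with probability s (assumed name).
    if rng.random() >= s:
        return None
    # Randomized response step: answer truthfully with probability p ...
    if rng.random() < p:
        return truth
    # ... otherwise answer "yes" with probability q, independent of the truth.
    return rng.random() < q


def estimate_yes_fraction(yes_count, num_responses, p, q):
    """Invert the randomization: E[observed] = p * true + (1 - p) * q."""
    observed = yes_count / num_responses
    return (observed - (1 - p) * q) / p


def simulate(num_clients, true_fraction, s, p, q, seed=0):
    """Simulate all clients and return the analyst's estimate of the true fraction."""
    rng = random.Random(seed)
    responses = []
    for _ in range(num_clients):
        truth = rng.random() < true_fraction
        answer = client_response(truth, s, p, q, rng)
        if answer is not None:
            responses.append(answer)
    return estimate_yes_fraction(sum(responses), len(responses), p, q)


if __name__ == "__main__":
    estimate = simulate(num_clients=50000, true_fraction=0.3, s=0.6, p=0.6, q=0.5)
    print(f"estimated fraction of 'yes' answers: {estimate:.3f}")
```

In this sketch, lowering `s` reduces the amount of data processed and lowering `p` adds more noise per answer, and both widen the error of the estimate. This mirrors the trade-off between accuracy and query budget that the abstract describes.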
© 2018 Springer International Publishing AG
Cite this entry
Quoc, D.L., Beck, M., Bhatotia, P., Chen, R., Fetzer, C., Strufe, T. (2018). Privacy-Preserving Data Analytics. In: Sakr, S., Zomaya, A. (eds) Encyclopedia of Big Data Technologies. Springer, Cham. https://doi.org/10.1007/978-3-319-63962-8_152-1
DOI: https://doi.org/10.1007/978-3-319-63962-8_152-1
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-63962-8
Online ISBN: 978-3-319-63962-8
eBook Packages: Springer Reference Mathematics; Reference Module Computer Science and Engineering