Encyclopedia of Big Data Technologies

Living Edition
Editors: Sherif Sakr, Albert Zomaya

Incremental Approximate Computing

  • Do Le Quoc
  • Dhanya R Krishnan
  • Pramod Bhatotia
  • Christof Fetzer
  • Rodrigo Rodrigues
Living reference work entry
DOI: https://doi.org/10.1007/978-3-319-63962-8_151-1

Abstract

Approximate computing is increasingly used to speed up computations and to use computing resources efficiently. The idea is to return an approximate answer to a user query instead of the exact one, by computing over a representative sample of the data rather than the entire dataset. This lets users trade off query accuracy for response time, enabling interactive queries over massive data: queries run on data samples, and results are presented with meaningful error bars. Incremental computing pursues the same goals, faster job execution and efficient resource use, by a different route: it memoizes the intermediate results of sub-computations and reuses them across jobs. This work observes that the two paradigms are complementary and can be combined. The idea is simple: design a sampling algorithm that biases sample selection toward data items memoized in previous runs. To realize this idea, an online stratified sampling algorithm is designed; it uses self-adjusting computation to produce an incrementally updated approximate output with bounded error. The algorithm is implemented in a data analytics system called IncApprox.
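
The combination described above can be illustrated with a minimal Python sketch. It is not the IncApprox implementation: the function names (`biased_stratified_sample`, `estimate_sum`), the dictionary-based memo table, and the fixed per-stratum sample sizes are all assumptions made for illustration. The sketch prefers items that were sampled in the previous run, fills the remainder of each stratum uniformly at random, reuses memoized per-item results, and reports a textbook stratified estimate of a sum with its standard error. The standard error formula assumes simple random sampling within each stratum, so with biased selection it should be read as indicative only.

```python
import random
import statistics


def biased_stratified_sample(window, prev_sample, sizes, rng):
    """Pick up to sizes[s] items per stratum, preferring memoized ones.

    window:      dict stratum -> list of (item_id, value) in the current window
    prev_sample: dict stratum -> set of item_ids sampled in the previous run
    sizes:       dict stratum -> desired sample size (hypothetical fixed budget)
    """
    sample = {}
    for stratum, items in window.items():
        k = min(sizes[stratum], len(items))
        memoized = prev_sample.get(stratum, set())
        # Bias: first take items whose results were memoized earlier.
        reused = [it for it in items if it[0] in memoized][:k]
        # Fill the remaining slots uniformly at random from the other items.
        fresh_pool = [it for it in items if it[0] not in memoized]
        fresh = rng.sample(fresh_pool, k - len(reused))
        sample[stratum] = reused + fresh
    return sample


def estimate_sum(window, sample, memo, fn):
    """Estimate sum(fn(value)) over the window from the stratified sample.

    Returns (estimate, standard_error, number_of_memoized_results_reused).
    memo maps item_id -> fn(value) and is updated in place.
    """
    total, variance, reused = 0.0, 0.0, 0
    for stratum, picked in sample.items():
        results = []
        for item_id, value in picked:
            if item_id in memo:
                reused += 1          # result reused from a previous run
            else:
                memo[item_id] = fn(value)
            results.append(memo[item_id])
        N, n = len(window[stratum]), len(results)
        if n == 0:
            continue
        total += N / n * sum(results)  # scale the sample sum up to the stratum
        if n > 1:
            s2 = statistics.variance(results)
            variance += N * N * (1 - n / N) * s2 / n
    return total, variance ** 0.5, reused


if __name__ == "__main__":
    rng = random.Random(42)
    sizes = {"a": 4, "b": 2}
    memo, prev = {}, {}
    # Two overlapping windows: the second slides stratum "a" forward by two items.
    for start in (0, 2):
        window = {
            "a": [(i, float(i)) for i in range(start, start + 10)],
            "b": [(100 + i, 10.0) for i in range(5)],
        }
        sample = biased_stratified_sample(window, prev, sizes, rng)
        est, err, reused = estimate_sum(window, sample, memo, lambda v: v)
        print(f"estimate={est:.1f} +/- {err:.1f}, reused results={reused}")
        prev = {s: {i for i, _ in picked} for s, picked in sample.items()}
```

In the second window, every previously sampled item that is still present is selected again, so its memoized result is reused and only the newly drawn items are computed from scratch.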

Copyright information

© Springer International Publishing AG 2018

Authors and Affiliations

  • Do Le Quoc (1)
  • Dhanya R Krishnan (1)
  • Pramod Bhatotia (2)
  • Christof Fetzer (1)
  • Rodrigo Rodrigues (3)

  1. TU Dresden, Dresden, Germany
  2. Alan Turing Institute, University of Edinburgh, Edinburgh, UK
  3. INESC-ID, IST (University of Lisbon), Lisbon, Portugal

Section editors and affiliations

  • Asterios Katsifodimos (1)
  • Pramod Bhatotia (2)

  1. Delft University of Technology, Delft, Netherlands
  2. School of Informatics, University of Edinburgh, Edinburgh, United Kingdom