Encyclopedia of Big Data Technologies

Living Edition
Editors: Sherif Sakr, Albert Zomaya

Optimizing Geo-Distributed Streaming Analytics

  • Abhishek Chandra
  • Benjamin Heintz
  • Ramesh Sitaraman
Living reference work entry
DOI: https://doi.org/10.1007/978-3-319-63962-8_155-1

Abstract

Data streams are generated continuously and at high rates by diverse sources, including users, devices, and sensors located around the globe. Modern analytics services must analyze large quantities of such streams from disparate geo-distributed sources. Further, the analytics requirements can be complex, resulting in intricate trade-offs among cost, performance, and accuracy. A typical geo-distributed analytics service uses a hub-and-spoke model, comprising multiple edges connected by a wide area network (WAN) to a central data warehouse, which raises the question of how much computation should be performed at the edges versus the center. The traditional approach to analytics processing is to send all the data to a dedicated centralized location; the alternative is to push all computing to the edge for in situ processing. Neither approach is optimal for modern analytics requirements. Instead, the best solution often entails carefully orchestrating the analytics processing at both the center and the edges, driven by factors such as application, data, and resource characteristics.
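
As a rough illustration of the edge-versus-center question in the hub-and-spoke model, the following minimal sketch compares WAN message counts for a count-by-key query under the two extremes described above. It is a toy under stated assumptions, not the method of this entry: the edge names, the sample keys, and the functions `centralized` and `edge_preaggregated` are all hypothetical, and a "WAN message" is simply counted as one record or one (key, count) pair.

```python
from collections import Counter


def centralized(edge_streams):
    """Ship every raw record over the WAN; the center does all the counting."""
    wan_messages = 0
    totals = Counter()
    for records in edge_streams.values():
        for key in records:
            wan_messages += 1      # one WAN message per raw record
            totals[key] += 1       # aggregation happens at the center
    return totals, wan_messages


def edge_preaggregated(edge_streams):
    """Each edge counts locally, then ships one (key, count) pair per distinct key."""
    wan_messages = 0
    totals = Counter()
    for records in edge_streams.values():
        partial = Counter(records)     # partial aggregate computed at the edge
        wan_messages += len(partial)   # one WAN message per distinct key
        totals.update(partial)         # partial aggregates merged at the center
    return totals, wan_messages


# Hypothetical window of keys observed at two edges (assumed data).
edge_streams = {
    "edge-eu": ["a", "a", "b", "a", "c"],
    "edge-us": ["b", "b", "b", "a"],
}

central_totals, central_msgs = centralized(edge_streams)
edge_totals, edge_msgs = edge_preaggregated(edge_streams)

# Both strategies produce the same final counts; only the WAN traffic differs.
assert central_totals == edge_totals
print("centralized WAN messages:", central_msgs)
print("edge pre-aggregated WAN messages:", edge_msgs)
```

In this toy the edge variant sends fewer messages, but it can only report once the window closes, and it saves nothing when keys rarely repeat within a window. That is one simple way to see why neither extreme is always the better choice.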

Copyright information

© Springer International Publishing AG 2018

Authors and Affiliations

  • Abhishek Chandra (1)
  • Benjamin Heintz (1)
  • Ramesh Sitaraman (2)

  1. University of Minnesota, Minneapolis, USA
  2. University of Massachusetts, Amherst, USA

Section editors and affiliations

  • Asterios Katsifodimos (1)
  • Pramod Bhatotia (2)

  1. Delft University of Technology, Delft, Netherlands
  2. School of Informatics, University of Edinburgh, Edinburgh, UK