Advertisement

Building Engines and Platforms for the Big Data Age

  • Badrish ChandramouliEmail author
Conference paper
Part of the Lecture Notes in Business Information Processing book series (LNBIP, volume 206)

Abstract

Big data analytics involves the collection of real-time operational data into large clusters, followed by the execution of analytics queries to derive insights from the data. The results of these insights are periodically deployed into the real-time pipeline, in order to perform business actions or raise alerts. We are currently witnessing a move towards fast data analytics, where some of the offline activities may be performed in memory, directly over the real-time input streams, in order to reduce the time taken to derive and exploit insights from the data. Further, there is an increasing emphasis on enabling data scientists to derive quick approximate insights from large volumes of offline data interactively and at low cost, i.e., without having to process the entire dataset each time. Such hybrid and interconnected workflows across offline and real-time data, stored and processed across multiple machines, and with varying latency needs and complex application logic, requires a rethinking of both data and query processing models and software artifacts that realize such models. This paper surveys the challenges and requirements created by such workflows, and summarizes our research efforts on addressing these problems.

Keywords

Big data Streaming Fast data Analytics Performance Latency Programming languages Query processing Iterative Incremental 

References

  1. 1.
    Abadi, D.J., et al.: The design of the borealis stream processing system. In: CIDR (2005)Google Scholar
  2. 2.
    Adomavicius, G., Tuzhilin, A.: Toward the next generation of recommender systems: a survey of the state-of-the-art and possible extensions. TKDE 17(6), 734–749 (2005)Google Scholar
  3. 3.
    Agarwal, S., et al.: BlinkDB: queries with bounded errors and bounded response times on very large data. In: EuroSys (2013)Google Scholar
  4. 4.
    Akidau, T. et al.: MillWheel: fault-tolerant stream processing at internet scale. In: VLDB (2013)Google Scholar
  5. 5.
    Ali, M. et al.: Microsoft CEP server and online behavioral targeting. In: VLDB (2009)Google Scholar
  6. 6.
  7. 7.
  8. 8.
    Babcock, B., et al.: Models and issues in data stream systems. In: PODS (2002)Google Scholar
  9. 9.
    Barga, R.S., et al.: Consistent streaming through time: a vision for event stream processing. In: CIDR, pp. 363–374 (2007)Google Scholar
  10. 10.
    Barnett, M., et al.: Stat! - an interactive analytics environment for big data. In: SIGMOD (2013)Google Scholar
  11. 11.
    Berkeley data analytics stack (BDAS). https://amplab.cs.berkeley.edu/software/
  12. 12.
    Bernstein, P., et al.: Orleans: distributed virtual actors for programmability and scalability. MSR Technical report (MSR-TR-2014-41, 24). http://aka.ms/Ykyqft
  13. 13.
  14. 14.
    Building real-time services for halo. http://research.microsoft.com/apps/video/?id=198324
  15. 15.
    Cetintemel, U., et al.: S-Store: a streaming new SQL system for big velocity applications. In: VLDB (2014)Google Scholar
  16. 16.
    Chaiken, R., et al.: SCOPE: easy and efficient parallel processing of massive data sets. PVLDB 1(2), 1265–1276 (2008)Google Scholar
  17. 17.
    Chandramouli, B., Levandoski, J.J., Eldawy, A., Mokbel, M.: StreamRec: a real-time recommender system. In: SIGMOD (2011)Google Scholar
  18. 18.
    Chandramouli, B., Nath, S., Zhou, W.: Supporting distributed feed-following apps over edge devices. PVLDB 6(13), 1570–1581 (2013)Google Scholar
  19. 19.
    Chandramouli, B., Goldstein, J., Barnett, M., DeLine, R., Fisher, D., Platt, J.C., Terwilliger, J.F., Wernsing, J.: Trill: a high-performance incremental query processor for diverse analytics. In: VLDB (2015, to appear)Google Scholar
  20. 20.
    Chandramouli, B., Goldstein, J., Duan, S.: Temporal analytics on big data for web advertising. In: ICDE (2012)Google Scholar
  21. 21.
    Chandramouli, B., Goldstein, J., Maier, D.: On-the-fly progress detection in iterative stream queries. In: VLDB (2009)Google Scholar
  22. 22.
    Chandramouli, B., Goldstein, J., Maier, D.: High-Performance dynamic pattern matching over disordered streams. In: VLDB (2010)Google Scholar
  23. 23.
    Chandramouli, B., Goldstein, J., Quamar, A.: Scalable progressive analytics on big data in the cloud. PVLDB 6(14), 1726–1737 (2013)Google Scholar
  24. 24.
    Chen, Y., et al.: Large-scale behavioral targeting. In: KDD (2009)Google Scholar
  25. 25.
    Chun, B., et al.: REEF: retainable evaluator execution framework. PVLDB 6(12), 1370–1373 (2013)Google Scholar
  26. 26.
  27. 27.
    Fisher, D., Chandramouli, B., DeLine, R., Goldstein, J., Aron, A., Barnett, M., Platt, J.C., Terwilliger, J.F., Wernsing, J.: Tempe: an interactive data science environment for exploration of temporal and streaming data. MSR Technical report MSR-TR-2014–148 (2014). http://research.microsoft.com/apps/pubs/?id=232385. Accessed Nov 2014
  28. 28.
    Hammad, M., et al.: NILE: a query processing engine for data streams. In: ICDE (2004)Google Scholar
  29. 29.
    Hellerstein, J.M., Avnur, R.: Informix under control: online query processing. J. Data Min. Knowl. Discov. 12, 281–314 (2000)CrossRefGoogle Scholar
  30. 30.
    Hellerstein, J.M., Haas, P.J., Wang, H.J.: Online aggregation. In: SIGMOD (1997)Google Scholar
  31. 31.
    Internet live stats. http://www.internetlivestats.com/
  32. 32.
    Jain, N., et al.: Towards a streaming SQL standard. In: VLDB (2008)Google Scholar
  33. 33.
    Jensen, C., Snodgrass, R.: Temporal specialization. In: ICDE (1992)Google Scholar
  34. 34.
    Li, B., et al.: A platform for scalable one-pass analytics using MapReduce. In: SIGMOD, pp. 985–996 (2011)Google Scholar
  35. 35.
    Liarou, E., et al.: Enhanced stream processing in a DBMS kernel. In: EDBT (2013)Google Scholar
  36. 36.
    Lim, H., et al.: How to fit when no one size fits. In: CIDR (2013)Google Scholar
  37. 37.
    Low, Y., Gonzalez, J., Kyrola, A., Bickson, D., Guestrin, C., Hellerstein, J.M.: Distributed GraphLab: a framework for machine learning in the cloud. In: VLDB (2012)Google Scholar
  38. 38.
    Maier, D., Li, J., Tucker, P., Tufte, K., Papadimos, V.: Semantics of data streams and operators. In: Eiter, T., Libkin, L. (eds.) ICDT 2005. LNCS, vol. 3363, pp. 37–52. Springer, Heidelberg (2005)CrossRefGoogle Scholar
  39. 39.
  40. 40.
    Murray, D., et al.: Naiad: a timely dataflow system. In: SOSP (2013)Google Scholar
  41. 41.
    Reactive extensions for .NET. http://aka.ms/rx
  42. 42.
    Santos, I., Tilly, M., Chandramouli, B., Goldstein, J.: DiAl: distributed streaming analytics anywhere anytime. In: VLDB (2013)Google Scholar
  43. 43.
    The LINQ project. http://aka.ms/rjhi00
  44. 44.
  45. 45.
    Wu, E., Diao, Y., Rizvi, S.: High-performance complex event processing over streams. In: SIGMOD (2006)Google Scholar
  46. 46.
    Zaharia, M., et al.: Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: NSDI (2012)Google Scholar
  47. 47.
    Zaharia, M., et al.: Discretized streams: fault-tolerant streaming computation at scale. In: SOSP (2013)Google Scholar

Copyright information

© Springer-Verlag Berlin Heidelberg 2015

Authors and Affiliations

  1. 1.Microsoft ResearchRedmondUSA

Personalised recommendations