Online Aggregation: A Review

  • Yun Li
  • Yanlong Wen
  • Xiaojie Yuan
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11242)

Abstract

Recent demands for querying big data have revealed various shortcomings of traditional database systems. This, in turn, has led to the emergence of a new query mode, approximate querying. Online aggregation is a sample-based technique for approximate querying, and it has become indispensable in today's era of information explosion. Online aggregation continuously returns an approximate result together with an error estimate (usually a confidence interval) until all data are processed. This survey mainly aims to elucidate the two most critical steps of online aggregation: the sampling mechanism and the error estimation method. With the development of MapReduce, researchers have tried to implement online aggregation in the MapReduce framework. We also briefly introduce some implementations of online aggregation in MapReduce and evaluate their features, strengths, and drawbacks. Finally, we discuss some open challenges in online aggregation that need the attention of the research community and application designers.
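
As a rough illustration of the loop described above (sample, estimate, report an error bound, repeat until the data are exhausted), the following is a minimal sketch for an AVG query over an in-memory list. It is not taken from the paper: the function name online_avg, its parameters, the random shuffle used as the sampling mechanism, and the normal-approximation confidence interval with a finite population correction are all assumptions chosen for brevity.

```python
import math
import random
from statistics import NormalDist


def online_avg(values, confidence=0.95, report_every=1000, seed=0):
    """Yield (rows_seen, estimate, half_width) while scanning values in random order.

    Hypothetical helper for illustration only; not an API from the paper.
    """
    rng = random.Random(seed)
    order = list(values)
    rng.shuffle(order)  # a random scan order stands in for the sampling mechanism
    total = len(order)
    # z-score for a two-sided interval at the requested confidence level
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    n, mean, m2 = 0, 0.0, 0.0  # Welford running mean and sum of squared deviations
    for x in order:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if n % report_every == 0 or n == total:
            var = m2 / (n - 1) if n > 1 else 0.0
            # finite population correction: the interval shrinks to zero
            # once every row has been processed
            fpc = 1.0 - n / total
            half_width = z * math.sqrt(var / n * fpc)
            yield n, mean, half_width


if __name__ == "__main__":
    data = [float(i % 100) for i in range(5000)]
    for rows, est, hw in online_avg(data):
        print(f"{rows:5d} rows: AVG ~ {est:.3f} +/- {hw:.3f}")
```

Each reported triple is an approximate answer with its confidence interval half-width; a user could stop the scan as soon as the interval is narrow enough, and the final report coincides with the exact answer.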

Keywords

Online aggregation; Big data; Approximate query; MapReduce

Notes

Acknowledgements

This work is supported by the National Natural Science Foundation of China under Grant No. 61772289 and the Fundamental Research Funds for the Central Universities.

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. College of Cyberspace Security, Nankai University, Tianjin, China
  2. College of Computer Science, Nankai University, Tianjin, China