
Abstract cost models for distributed data-intensive computations

Published in: Distributed and Parallel Databases

Abstract

We consider data analytics workloads on distributed architectures, in particular clusters of commodity machines. Finding a job partitioning that minimizes running time requires a cost model, which we more accurately refer to as a makespan model. In search of the simplest model that is still sufficiently accurate, we explore piecewise linear functions of input size, output size, and computational complexity. These models are abstract in the sense that they capture fundamental algorithm properties but do not require explicit modeling of system and implementation details such as the number of disk accesses. We show how the simplified functional structure can be exploited to reduce optimization cost. For the general case, we identify a lower bound that can be used for search-space pruning. For applications with homogeneous tasks, we further demonstrate how to integrate the model directly into the makespan optimization process, reducing search-space dimensionality, and thus complexity, by orders of magnitude. Experimental results provide evidence of good prediction quality and successful makespan optimization across a variety of operators and cluster architectures.


(Figures 1–11 are not included in this excerpt.)




Author information

Correspondence to Rundong Li.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This work was supported by a Northeastern University (NU) Tier 1 award, by the National Institutes of Health (NIH) under Award Number R01 NS091421, by the Air Force Office of Scientific Research (AFOSR) under Grant Number FA9550-14-1-0160, and by the NSF Career Award under Award Number 1741634. The content is solely the responsibility of the authors and does not necessarily represent the official views of NU, NIH, AFOSR or NSF.


About this article


Cite this article

Li, R., Mi, N., Riedewald, M. et al. Abstract cost models for distributed data-intensive computations. Distrib Parallel Databases 37, 411–439 (2019). https://doi.org/10.1007/s10619-018-7244-2

