Abstract
We consider data analytics workloads on distributed architectures, in particular clusters of commodity machines. Finding a job partitioning that minimizes running time requires a cost model, which we more accurately refer to as a makespan model. In search of the simplest model that is still sufficiently accurate, we explore piecewise linear functions of input size, output size, and computational complexity. These models are abstract in the sense that they capture fundamental algorithm properties but do not require explicit modeling of system and implementation details such as the number of disk accesses. We show how the simplified functional structure can be exploited to reduce optimization cost. For the general case, we identify a lower bound that can be used for search-space pruning. For applications with homogeneous tasks, we further demonstrate how to integrate the model directly into the makespan optimization process, reducing search-space dimensionality, and hence optimization complexity, by orders of magnitude. Experimental results provide evidence of good prediction quality and successful makespan optimization across a variety of operators and cluster architectures.
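The idea of an abstract makespan model that combines input, output, and computational-complexity terms, and is then minimized over candidate partitionings, can be illustrated with a minimal sketch. All coefficients, the n log n complexity term, and the per-task scheduling overhead below are hypothetical placeholders chosen for illustration, not values or formulas from the paper:

```python
import math

def makespan(p, n, a=1e-6, b=1e-6, c=1e-8, overhead=0.05):
    """Predicted makespan for splitting n input records into p equal tasks.

    Per-task cost is a linear combination of input size, output size
    (assumed proportional to input here), and an n log n complexity term;
    a fixed per-task overhead models scheduling cost that grows with p.
    All coefficients are hypothetical.
    """
    per_task = n / p
    io_cost = a * per_task + b * per_task          # input + output terms
    comp_cost = c * per_task * math.log2(max(per_task, 2.0))
    return io_cost + comp_cost + overhead * p      # startup/scheduling overhead

# Exhaustively search partition counts for the minimum predicted makespan.
n = 10_000_000
best_p = min(range(1, 257), key=lambda p: makespan(p, n))
```

Because per-task work shrinks with p while overhead grows with p, the predicted makespan has an interior minimum; with a piecewise linear model of this shape, the paper's point is that such minima can be found analytically or with heavy pruning rather than by brute-force search.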
Additional information
This work was supported by a Northeastern University (NU) Tier 1 award, by the National Institutes of Health (NIH) under Award Number R01 NS091421, by the Air Force Office of Scientific Research (AFOSR) under Grant Number FA9550-14-1-0160, and by the National Science Foundation (NSF) under CAREER Award Number 1741634. The content is solely the responsibility of the authors and does not necessarily represent the official views of NU, NIH, AFOSR, or NSF.
Cite this article
Li, R., Mi, N., Riedewald, M. et al. Abstract cost models for distributed data-intensive computations. Distrib Parallel Databases 37, 411–439 (2019). https://doi.org/10.1007/s10619-018-7244-2