Optimization of data flow execution in a parallel environment

Distributed and Parallel Databases

Abstract

Although modern data flows are executed in parallel and distributed environments, e.g., on a multi-core machine or in the cloud, current cost models, e.g., those considered by state-of-the-art data flow optimization techniques, do not accurately reflect the response time of real data flow execution in these environments. This is mainly because the impact of parallelism, and more specifically the impact of concurrent task execution on the running time, is not adequately modeled in current cost models. The contribution of this work is twofold. Firstly, we propose an advanced cost model that aims to reflect more accurately the response time of a data flow that is executed in parallel. Secondly, we show that existing optimization solutions are inadequate and develop new optimization techniques targeting the proposed cost model. We focus on the single multi-core machine environment provided by modern business intelligence tools, such as Pentaho Kettle, but our approach can be extended to massively parallel and distributed settings. The distinctive feature of our proposal is that we model both time overlaps and the impact of concurrency on task running times in a combined manner; the latter is appropriately quantified and its significance is exemplified. Furthermore, we propose extensions to current optimizers that decide on the exact ordering of flow tasks, taking into account the new optimization metric. Finally, we evaluate the new optimization algorithms and show up to 59% response time improvement over state-of-the-art task ordering techniques.

Notes

  1. http://community.pentaho.com/projects/data-integration.

  2. http://spark.apache.org/.

  3. http://flink.apache/.

  4. An early version of this work containing the modeling material but no optimization solutions has appeared in [19]. The novel material in this work compared to the preliminary version is in Sects. 5, 6 and 7, while Sect. 2 has been also extended accordingly.

  5. We were able to emulate flows in Kettle, where tasks had arbitrary cost and selectivity values, through JavaScript user-defined functions. However, for this specific type of transformation step, PDI does not execute in parallel as usual, and the response time becomes more commensurate with the SCM-F. To avoid confusing readers, we omitted all these runs of synthetic flows in a real system.

References

  1. Agrawal, K., Benoit, A., Dufossé, F., Robert, Y.: Mapping filtering streaming applications with communication costs. In: SPAA, pp. 19–28 (2009)

  2. Agrawal, K., Benoit, A., Dufossé, F., Robert, Y.: Mapping filtering streaming applications. Algorithmica 62(1–2), 258–308 (2012)

  3. Boehm, M., Tatikonda, S., Reinwald, B., Sen, P., Tian, Y., Burdick, D.R., Vaithyanathan, S.: Hybrid parallelization strategies for large-scale machine learning in SystemML. Proc. VLDB Endow. 7(7), 553–564 (2014)

  4. Burge, J., Munagala, K., Srivastava, U.: Ordering pipelined query operators with precedence constraints. Technical Report 2005-40, Stanford InfoLab (2005)

  5. Chaudhuri, S., Shim, K.: Optimization of queries with user-defined predicates. ACM Trans. Database Syst. 24(2), 177–228 (1999)

  6. Chirkin, A.M., Belloum, A.S.Z., Kovalchuk, S.V., Makkes, M.X.: Execution time estimation for workflow scheduling. In: WORKS ’14, pp. 1–10 (2014)

  7. Chirkin, A.M., Belloum, A.S.Z., Kovalchuk, S.V., Makkes, M.X.: Execution time estimation for workflow scheduling. In: Proceedings of the 9th Workshop on Workflows in Support of Large-Scale Science, pp. 1–10. IEEE Press (2014)

  8. Deshpande, A., Hellerstein, L.: Parallel pipelined filter ordering with precedence constraints. ACM Trans. Algorithms 8(4), 41:1–41:38 (2012)

  9. DeWitt, D.J., Gray, J.: Parallel database systems: the future of high performance database systems. Commun. ACM 35(6) (1992)

  10. Florescu, D., Levy, A., Manolescu, I., Suciu, D.: Query optimization in the presence of limited access patterns. In: Proceedings of the 1999 ACM SIGMOD international conference on Management of data, SIGMOD ’99, pp. 311–322. ACM (1999)

  11. Halasipuram, R., Deshpande, P.M., Padmanabhan, S.: Determining essential statistics for cost based optimization of an ETL workflow. In: EDBT, pp. 307–318 (2014)

  12. Hellerstein, J.M.: Optimization techniques for queries with expensive methods. ACM Trans. Database Syst. 23(2), 113–157 (1998)

  13. Hueske, F., Peters, M., Sax, M., Rheinländer, A., Bergmann, R., Krettek, A., Tzoumas, K.: Opening the black boxes in data flow optimization. PVLDB 5(11), 1256–1267 (2012)

  14. Ibaraki, T., Kameda, T.: On the optimal nesting order for computing n-relational joins. ACM Trans. Database Syst. 9(3), 482–502 (1984)

  15. Kougka, G., Gounaris, A.: On optimizing workflows using query processing techniques. In: SSDBM, pp. 601–606 (2012)

  16. Kougka, G., Gounaris, A.: Optimization of data-intensive flows: is it needed? is it solved? In: Proceedings of the DOLAP, pp. 95–98 (2014)

  17. Kougka, G., Gounaris, A.: Cost optimization of data flows based on task re-ordering. Trans. Large-Scale Data- and Knowledge-Centered Systems 33, 113–145 (2017)

  18. Kougka, G., Gounaris, A.: Optimal task ordering in chain data flows: exploring the practicality of non-scalable solutions. In: Big Data Analytics and Knowledge Discovery - 19th International Conference, DaWaK 2017, Lyon, August 28–31, 2017, Proceedings, pp. 19–32 (2017)

  19. Kougka, G., Gounaris, A., Leser, U.: Modeling data flow execution in a parallel environment. In: Big Data Analytics and Knowledge Discovery - 19th International Conference, DaWaK 2017, pp. 183–196 (2017)

  20. Kougka, G., Gounaris, A., Simitsis, A.: The many faces of data-centric workflow optimization: a survey. International Journal of Data Science and Analytics (2018)

  21. Krishnamurthy, R., Boral, H., Zaniolo, C.: Optimization of nonrecursive queries. In: VLDB, pp. 128–137 (1986)

  22. Kumar, N., Kumar, P.S.: An efficient heuristic for logical optimization of ETL workflows. In: BIRTE, pp. 68–83 (2010)

  23. Pietri, I., Juve, G., Deelman, E., Sakellariou, R.: A performance model to estimate execution time of scientific workflows on the cloud. In: WORKS ’14, pp. 11–19. IEEE Press (2014)

  24. Pietri, I., Juve, G., Deelman, E., Sakellariou, R.: A performance model to estimate execution time of scientific workflows on the cloud. In: Proceedings of the 9th Workshop on Workflows in Support of Large-Scale Science, pp. 11–19. IEEE Press (2014)

  25. Poess, M., Rabl, T., Caufield, B.: TPC-DI: the first industry benchmark for data integration. PVLDB 7(13), 1367–1378 (2014)

  26. Rheinländer, A., Heise, A., Hueske, F., Leser, U., Naumann, F.: SOFA: an extensible logical optimizer for UDF-heavy data flows. Inf. Syst. 52, 96–125 (2015)

  27. Rheinländer, A., Leser, U., Graefe, G.: Optimization of complex dataflows with user-defined functions. ACM Comput. Surv. 50(3), 38:1–38:39 (2017)

  28. Rheinländer, A., Heise, A., Hueske, F., Leser, U., Naumann, F.: SOFA: an extensible logical optimizer for UDF-heavy data flows. Inf. Syst. 52, 96–125 (2015)

  29. Shi, J., Zou, J., Lu, J., Cao, Z., Li, S., Wang, C.: MRTuner: a toolkit to enable holistic optimization for MapReduce jobs. Proc. VLDB Endow. 7(13), 1319–1330 (2014)

  30. Simitsis, A., Vassiliadis, P., Sellis, T.K.: State-space optimization of ETL workflows. IEEE Trans. Knowl. Data Eng. 17(10), 1404–1419 (2005)

  31. Simitsis, A., Wilkinson, K., Dayal, U., Castellanos, M.: Optimizing ETL workflows for fault-tolerance. In: ICDE, pp. 385–396 (2010)

  32. Singhal, R., Verma, A.: Predicting job completion time in heterogeneous MapReduce environments. In: IEEE IPDPSW, pp. 17–27 (2016)

  33. Srivastava, U., Munagala, K., Widom, J., Motwani, R.: Query optimization over web services. In: VLDB, pp. 355–366 (2006)

  34. Tziovara, V., Vassiliadis, P., Simitsis, A.: Deciding the physical implementation of ETL workflows. In: DOLAP, pp. 49–56 (2007)

  35. Verma, A., Cherkasova, L., Campbell, R.H.: ARIA: automatic resource inference and allocation for MapReduce environments. In: ICAC ’11, pp. 235–244. ACM (2011)

  36. Yerneni, R., Li, C., Ullman, J.D., Garcia-Molina, H.: Optimizing large join queries in mediation systems. In: ICDT, pp. 348–364 (1999)

  37. Zhang, Z., Cherkasova, L., Loo, B.T.: Performance modeling of MapReduce jobs in heterogeneous cloud environments. In: CLOUD ’13, pp. 839–846 (2013)

Author information

Corresponding author

Correspondence to Anastasios Gounaris.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Kougka, G., Gounaris, A. Optimization of data flow execution in a parallel environment. Distrib Parallel Databases 37, 385–410 (2019). https://doi.org/10.1007/s10619-018-7243-3

  • DOI: https://doi.org/10.1007/s10619-018-7243-3
