Advertisement

Distributed and Parallel Databases

, Volume 37, Issue 3, pp 385–410 | Cite as

Optimization of data flow execution in a parallel environment

  • Georgia Kougka
  • Anastasios GounarisEmail author
Article
  • 102 Downloads
Part of the following topical collections:
  1. Special Issue on Extending Data Warehouses to Big Data Analytics

Abstract

Although the modern data flows are executed in parallel and distributed environments, e.g. on a multi-core machine or on the cloud, current cost models, e.g., those considered by state-of-the-art data flow optimization techniques, do not accurately reflect the response time of real data flow execution in these execution environments. This is mainly due to the fact that the impact of parallelism, and more specifically, the impact of concurrent task execution on the running time is not adequately modeled in current cost models. The contribution of this work is twofold. Firstly, we propose an advanced cost model that aims to reflect the response time of a data flow that is executed in parallel more accurately. Secondly, we show that existing optimization solutions are inadequate and develop new optimization techniques targeting the proposed cost model. We focus on the single multi-core machine environment provided by modern business intelligence tools, such as Pentaho Kettle, but our approach can be extended to massively parallel and distributed settings. The distinctive features of our proposal is that we model both time overlaps and the impact of concurrency on task running times in a combined manner; the latter is appropriately quantified and its significance is exemplified. Furthermore, we propose extensions to current optimizers that decide on the exact ordering of flow tasks taking into account the new optimization metric. Finally, we evaluate the new optimization algorithms and show up to 59% response time improvement over state-of-the-art task ordering techniques.

Keywords

Data flow optimization Cost modeling Task ordering 

Notes

References

  1. 1.
    Agrawal, K., Benoit, A., Dufossé, F., Robert, Y.: Mapping filtering streaming applications with communication costs. In: SPAA, pp. 19–28 (2009)Google Scholar
  2. 2.
    Agrawal, K., Benoit, A., Dufossé, F., Robert, Y.: Mapping filtering streaming applications. Algorithmica 62(1–2), 258–308 (2012)MathSciNetCrossRefzbMATHGoogle Scholar
  3. 3.
    Boehm, M., Tatikonda, S., Reinwald, B., Sen, P., Tian, Yuanyuan, Burdick, D.R., Vaithyanathan, S.: Hybrid parallelization strategies for large-scale machine learning in systemml. Proc. VLDB Endow. 7(7), 553–564 (2014)CrossRefGoogle Scholar
  4. 4.
    Burge, J., Munagala, K., Srivastava, U.: Ordering pipelined query operators with precedence constraints. Technical Report 2005-40, Stanford InfoLab (2005)Google Scholar
  5. 5.
    Chaudhuri, S., Shim, Kyuseok: Optimization of queries with user-defined predicates. ACM Trans. Database Syst. 24(2), 177–228 (1999)CrossRefGoogle Scholar
  6. 6.
    Chirkin, A.M, Belloum, A.S.Z., Kovalchuk, S.V., Makkes, M.X.: Execution time estimation for workflow scheduling. WORKS ’14, pp. 1–10 (2014)Google Scholar
  7. 7.
    Chirkin A.M., Belloum, A.S.Z., Kovalchuk, S.V., Makkes, M.X.: Execution time estimation for workflow scheduling. In: proceeding of the 9th Workshop on Workflows in Support of Large-Scale Science, pp. 1–10. IEEE Press (2014)Google Scholar
  8. 8.
    Deshpande, A., Hellerstein, L.: Parallel pipelined filter ordering with precedence constraints. ACM Transac. Algorithms 8(4), 41:1–41:38 (2012)MathSciNetzbMATHGoogle Scholar
  9. 9.
    DeWitt, D.J., Gray, J.: Parallel database systems: The future of high performance database systems. Commun. ACM, 35(6) (1992)Google Scholar
  10. 10.
    Florescu, D., Levy, A., Manolescu, I., Suciu, D.: Query optimization in the presence of limited access patterns. In: Proceedings of the 1999 ACM SIGMOD international conference on Management of data, SIGMOD ’99, pp. 311–322. ACM (1999)Google Scholar
  11. 11.
    Halasipuram, R., Deshpande, P.M., Padmanabhan, S.: Determining essential statistics for cost based optimization of an ETL workflow. In EDBT, pp. 307–318 (2014)Google Scholar
  12. 12.
    Hellerstein, J.M.: Optimization techniques for queries with expensive methods. ACM Trans. Database Syst. 23(2), 113–157 (1998)MathSciNetCrossRefGoogle Scholar
  13. 13.
    Hueske, F., Peters, M., Sax, M., Rheinländer, A., Bergmann, R., Krettek, A., Tzoumas, K.: Opening the black boxes in data flow optimization. PVLDB 5(11), 1256–1267 (2012)Google Scholar
  14. 14.
    Ibaraki, T., Kameda, T.: On the optimal nesting order for computing n-relational joins. ACM Trans. Database Syst. 9(3), 482–502 (1984)MathSciNetCrossRefGoogle Scholar
  15. 15.
    Kougka, G., Gounaris, A.: On optimizing workflows using query processing techniques. In: SSDBM, pp. 601–606 (2012)Google Scholar
  16. 16.
    Kougka, G., Gounaris, A.: Optimization of data-intensive flows: is it needed? is it solved? In: proceeding of the DOLAP, pp.95–98 (2014)Google Scholar
  17. 17.
    Kougka, G., Gounaris, A.: Cost optimization of data flows based on task re-ordering. T. Large-Scale Data- and Knowledge-Centered Systems 33, pp. 113–145 (2017)Google Scholar
  18. 18.
    Kougka, G., Gounaris, A.: Optimal task ordering in chain data flows: exploring the practicality of non-scalable solutions. In Big Data Analytics and Knowledge Discovery-19th International Conference, DaWaK 2017, Lyon, August 28-31, 2017, Proceedings, pp. 19–32 (2017)Google Scholar
  19. 19.
    Kougka, G., Gounaris, A., Leser: Ulf Modeling data flow execution in a parallel environment. In: Big Data Analytics and Knowledge Discovery-19th International Conference, DaWaK 2017, pp. 183–196 (2017)Google Scholar
  20. 20.
    Kougka, G., Gounaris, A., Simitsis, A.: The many faces of data-centric workflow optimization: a survey. International Journal of Data Science and Analytics (2018)Google Scholar
  21. 21.
    Krishnamurthy, R., Boral, H., Zaniolo, C.: Optimization of nonrecursive queries. In: VLDB, pp. 128–137 (1986)Google Scholar
  22. 22.
    Kumar, N., Kumar, P.S.: An efficient heuristic for logical optimization of ETL workflows. In: BIRTE, pp.68–83 (2010)Google Scholar
  23. 23.
    Pietri, I., Juve, G., Deelman, E., Sakellariou, R.: A performance model to estimate execution time of scientific workflows on the cloud. WORKS ’14, pp. 11–19. IEEE Press (2014)Google Scholar
  24. 24.
    Pietri, I., Juve, G., Deelman, E., Sakellariou, R.: A performance model to estimate execution time of scientific workflows on the cloud. In: Proceeding of the 9th Workshop on Workflows in Support of Large-Scale Science, pp. 11–19. IEEE Press (2014)Google Scholar
  25. 25.
    Poess, M., Rabl, T., Caufield, B.: TPC-DI: the first industry benchmark for data integration. PVLDB 7(13), 1367–1378 (2014)Google Scholar
  26. 26.
    Rheinländer, A., Heise, A., Hueske, F., Leser, U., Naumann, Felix: SOFA: an extensible logical optimizer for UDF-heavy data flows. Inf. Syst. 52, 96–125 (2015)CrossRefGoogle Scholar
  27. 27.
    Rheinländer, A., Leser, U., Graefe, G.: Optimization of complex dataflows with user-defined functions. ACM Comput. Surv. 50(3), 38:1–38:39 (2017)CrossRefGoogle Scholar
  28. 28.
    Rheinländer, A., Heise, A., Hueske, F., Leser, U., Naumann, F.: Sofa: An extensible logical optimizer for udf-heavy data flows. Inform. Syst. 52, 96–125 (2015)CrossRefGoogle Scholar
  29. 29.
    Shi, J., Zou, J., Jiaheng, L., Cao, Z., Li, S., Wang, C.: Mrtuner: a toolkit to enable holistic optimization for mapreduce jobs. Proc. VLDB Endow. 7(13), 1319–1330 (2014)CrossRefGoogle Scholar
  30. 30.
    Simitsis, A., Vassiliadis, P., Sellis, T.K.: State-space optimization of ETL workflows. IEEE Trans. Knowl. Data Eng. 17(10), 1404–1419 (2005)CrossRefGoogle Scholar
  31. 31.
    Simitsis, A., Wilkinson, K., Dayal, U., Castellanos, M.: Optimizing ETL workflows for fault-tolerance. In: ICDE, pp. 385–396 (2010)Google Scholar
  32. 32.
    Singhal, R., Verma, A.: Predicting job completion time in heterogeneous mapreduce environments. In: IEEE IPDPSW, pp. 17–27 (2016)Google Scholar
  33. 33.
    Srivastava, U., Munagala, K., Widom, J., Motwani, R.: Query optimization over web services. In: Proceeding of the PVLDB, pp. 355–366 (2006)Google Scholar
  34. 34.
    Tziovara V., Vassiliadis, P., Simitsis, A.: Deciding the physical implementation of ETL workflows. In: DOLAP, pp. 49–56 (2007)Google Scholar
  35. 35.
    Verma, A., Cherkasova, L., Roy, H.: Campbell. Aria: Automatic resource inference and allocation for mapreduce environments. ICAC ’11, pp. 235–244. ACM (2011)Google Scholar
  36. 36.
    Yerneni, R. , Li, C., Ullman, J.D., Garcia-Molina, Hector: Optimizing large join queries in mediation systems. In: ICDT, pp. 348–364 (1999)Google Scholar
  37. 37.
    Zhang, Z., Cherkasova, L., Loo, BT.: Performance modeling of mapreduce jobs in heterogeneous cloud environments. CLOUD ’13, pp. 839–846 (2013)Google Scholar

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. 1.Department of InformaticsAristotle University of ThessalonikiThessaloníkiGreece

Personalised recommendations