Ensemble Learning of Run-Time Prediction Models for Data-Intensive Scientific Workflows

Monge, David A.; Holec, Matěj; Železný, Filip; García Garino, Carlos

doi:10.1007/978-3-662-45483-1_7

Conference paper

Part of the book series: Communications in Computer and Information Science (CCIS, volume 485)

Abstract

Workflow applications for in-silico experimentation involve processing large amounts of data. A core issue in managing such applications efficiently is predicting task performance. This paper proposes a novel approach for constructing models that predict the running times of tasks in data-intensive scientific workflows. Ensemble machine learning techniques are used to produce robust combined models with high predictive accuracy. Information derived from workflow systems, together with the characteristics and provenance of the data, is exploited to guarantee the accuracy of the models. The approach was tested on bioinformatics workflows for gene expression analysis in homogeneous and heterogeneous computing environments. The results highlight the advantage of ensemble models over single (standalone) prediction models: ensemble learning techniques reduced the prediction error by up to 24.9% compared with single-model strategies.
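
The abstract does not name the specific ensemble method, features, or error metric, so the following is only an illustrative sketch of the general idea, not the authors' implementation. It assumes bagging (bootstrap aggregation) of simple least-squares lines that map a single hypothetical feature (input size) to run time, and it shows how a relative error reduction such as the reported 24.9% would be computed. All function names, data values, and error figures are assumptions made for the example.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        # Degenerate sample (all x equal): fall back to a constant model.
        return my, 0.0
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b


def fit_bagged(xs, ys, n_models=25, seed=0):
    """Fit one line per bootstrap resample (sampling with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_line([xs[i] for i in idx], [ys[i] for i in idx]))
    return models


def predict_bagged(models, x):
    """Ensemble prediction: the average of the base models' predictions."""
    return mean(a + b * x for a, b in models)


def relative_error_reduction(single_err, ensemble_err):
    """Percentage by which the ensemble lowers the single-model error."""
    return 100.0 * (single_err - ensemble_err) / single_err


if __name__ == "__main__":
    # Hypothetical training data: input size (MB) and observed run time (s).
    sizes = [10, 20, 30, 40, 50, 60]
    times = [33, 61, 95, 118, 155, 179]
    models = fit_bagged(sizes, times)
    print("predicted run time for 70 MB:", predict_bagged(models, 70))
    # Hypothetical error values chosen only to illustrate the arithmetic.
    print("error reduction (%):", relative_error_reduction(10.0, 7.51))
```

Averaging over models trained on different resamples tends to reduce the variance of the prediction, which is the usual motivation for preferring an ensemble to a single model.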

Author information

Authors and Affiliations

ITIC Research Institute, National University of Cuyo (UNCuyo), Argentina
David A. Monge & Carlos García Garino
Faculty of Exact and Natural Sciences, UNCuyo, Argentina
David A. Monge
IDA Research Group, Czech Technical University, Czech Republic
Matěj Holec & Filip Železný
Faculty of Engineering, UNCuyo, Argentina
Carlos García Garino

Editor information

Editors and Affiliations

Universidad Santa Maria, Valparaiso, Chile
Gonzalo Hernández
Ciudad Universitaria, Bucaramanga, Colombia
Carlos Jaime Barrios Hernández
Universidad Industrial de Santander, Bucaramanga, Colombia
Gilberto Díaz
Universidad Nacional de Cuyo, Mendoza, Argentina
Carlos García Garino
Facultad de Ingeniería, Universidad de la República, Montevideo, Uruguay
Sergio Nesmachnow
Universidad de Valparaíso, Chile
Tomás Pérez-Acle
CIMEC, Santa Fe, Argentina
Mario Storti
Barcelona Supercomputing Center, Spain
Mariano Vázquez

About this paper

Cite this paper

Monge, D.A., Holec, M., Železný, F., García Garino, C. (2014). Ensemble Learning of Run-Time Prediction Models for Data-Intensive Scientific Workflows. In: Hernández, G., et al. High Performance Computing. CARLA 2014. Communications in Computer and Information Science, vol 485. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-45483-1_7

DOI: https://doi.org/10.1007/978-3-662-45483-1_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-45482-4
Online ISBN: 978-3-662-45483-1
eBook Packages: Computer Science (R0)
