Highly Interactive, Steered Scientific Workflows on HPC Systems: Optimizing Design Solutions
Scientific workflows are becoming increasingly important in high performance computing (HPC) settings, as growing hardware capabilities make it both more feasible and more appealing to run many heterogeneous tasks simultaneously. Currently, no HPC-based workflow platform supports a dynamically adaptable workflow with interactive steering and analysis at run-time. Furthermore, most workflow programs fix the compute resources for a given instance, which can waste expensive allocation resources as tasks are spawned and killed. Here we describe the design and testing of a run-time-interactive, adaptable, steered workflow tool capable of executing thousands of parallel tasks without an MPI programming model, using a database management system to facilitate task management through multiple live connections. We find that on the Oak Ridge Leadership Computing Facility pre-exascale Summit supercomputer it is possible to launch and interactively steer workflows with thousands of simultaneous tasks with negligible latency. For particle simulation and analysis tasks that run for minutes to hours, this paradigm offers the prospect of a robust and efficient means to perform simulation-space exploration with on-the-fly analysis and adaptation.
Keywords: High performance computing, Scientific workflows, External steering, Adaptable workflows
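The abstract describes task management through a database with multiple live connections, with steering at run-time. The following is a minimal sketch of that general idea, not the authors' implementation. The abstract does not name the database system, so SQLite from the Python standard library is swapped in here, and the `tasks` table, its columns, and every function name are hypothetical. Each worker and the steering client hold their own connection; workers claim pending tasks atomically, while the steering connection cancels or adds tasks as the workflow runs.

```python
import os
import sqlite3
import tempfile


def connect(path):
    # Each participant (worker or steering client) opens its own live connection.
    # isolation_level=None gives autocommit, so transactions are issued explicitly.
    return sqlite3.connect(path, timeout=5.0, isolation_level=None)


def init_db(conn):
    # Hypothetical schema: one row per task, with a simple state machine.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tasks ("
        "id INTEGER PRIMARY KEY, command TEXT NOT NULL, "
        "state TEXT NOT NULL DEFAULT 'pending', worker TEXT)"
    )


def add_task(conn, command):
    return conn.execute(
        "INSERT INTO tasks (command) VALUES (?)", (command,)
    ).lastrowid


def cancel_task(conn, task_id):
    # Steering action: in this sketch only unclaimed tasks can be cancelled.
    cur = conn.execute(
        "UPDATE tasks SET state = 'cancelled' WHERE id = ? AND state = 'pending'",
        (task_id,),
    )
    return cur.rowcount == 1


def claim_task(conn, worker):
    # BEGIN IMMEDIATE takes the write lock up front, so two workers
    # cannot select and claim the same pending row.
    conn.execute("BEGIN IMMEDIATE")
    try:
        row = conn.execute(
            "SELECT id, command FROM tasks WHERE state = 'pending' "
            "ORDER BY id LIMIT 1"
        ).fetchone()
        if row is not None:
            conn.execute(
                "UPDATE tasks SET state = 'running', worker = ? WHERE id = ?",
                (worker, row[0]),
            )
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return row


def finish_task(conn, task_id):
    conn.execute("UPDATE tasks SET state = 'done' WHERE id = ?", (task_id,))


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "tasks.db")
        steer = connect(path)
        init_db(steer)
        w1, w2 = connect(path), connect(path)

        ids = [add_task(steer, f"simulate --seed {i}") for i in range(3)]
        first = claim_task(w1, "w1")
        second = claim_task(w2, "w2")

        # Steer while tasks are running: drop one task, add an analysis task.
        cancelled = cancel_task(steer, ids[2])
        new_id = add_task(steer, "analyze --input run0")

        finish_task(w1, first[0])
        third = claim_task(w2, "w2")  # skips the cancelled task

        states = dict(steer.execute("SELECT id, state FROM tasks").fetchall())
        for conn in (steer, w1, w2):
            conn.close()
    return first, second, cancelled, new_id, third, states


if __name__ == "__main__":
    print(demo())
```

Because all coordination goes through the task table, workers need no knowledge of each other and no MPI communicator; a real deployment on an HPC system would presumably use a networked database server rather than a local file.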
An award of computer time was provided by the Innovative and Novel Computational Impact on Theory and Experiment (INCITE) program. This research used resources of the Oak Ridge Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC05-00OR22725. JCS acknowledges ORNL LDRD funds. The authors would like to thank Oscar Hernandez, Frank Noé and group, Cecilia Clementi and group, and Shantenu Jha and group for valuable insight and discussions.