
Scientific Workflows in the Cloud

  • Gideon Juve
  • Ewa Deelman
Part of the Computer Communications and Networks book series (CCN)

Abstract

The development of cloud computing has generated significant interest in the scientific computing community. In this chapter we consider the impact of cloud computing on scientific workflow applications. We examine the benefits and drawbacks of cloud computing for workflows, and argue that the primary benefit of cloud computing is not the economic model it promotes, but rather the technologies it employs and how they enable new features for workflow applications. We describe how clouds can be configured to execute workflow tasks and present a case study that examines the performance and cost of three typical workflow applications on Amazon EC2. Finally, we identify several areas in which existing clouds can be improved and discuss the future of workflows in the cloud.

Keywords

Cloud Computing, Virtual Machine, Resource Type, Virtual Cluster, Cloud Storage Service
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Notes

Acknowledgements

We acknowledge the contributions of Karan Vahi, Gaurang Mehta, Phil Maechling, Benjamin P. Berman, and Bruce Berriman. This work was supported by the National Science Foundation under the SciFlow (CCF-0725332) grant. This research made use of Montage, funded by the National Aeronautics and Space Administration’s Earth Science Technology Office, Computation Technologies Project, under Cooperative Agreement Number NCC5-626 between NASA and the California Institute of Technology.

Copyright information

© Springer-Verlag London Limited 2011

Authors and Affiliations

  1. University of Southern California, Marina del Rey, USA
