Migrating Scientific Workflow Management Systems from the Grid to the Cloud

Cloud Computing for Data-Intensive Applications

Abstract

Cloud computing is an emerging computing paradigm that offers unprecedented scalability and on-demand resources, and it is gaining significant adoption in the science community. At the same time, scientific workflow management systems provide essential support for scientific computing, such as management of data and task dependencies, job scheduling and execution, provenance tracking, and fault tolerance. Migrating scientific workflow management systems from traditional Grid computing environments to the Cloud would enable a much broader user base to conduct scientific research with ever-increasing data scale and analysis complexity. This paper presents our experience in integrating the Swift scientific workflow management system with the OpenNebula Cloud platform. The integration supports workflow specification and submission, on-demand virtual cluster provisioning, high-throughput task scheduling and execution, and efficient and scalable resource management in the Cloud. We set up a series of experiments to demonstrate the capability of our integration and use a MODIS image processing workflow as a showcase of the implementation.

Acknowledgements

This paper is supported by the key project of the National Science Foundation of China, No. 61034005 and No. 61272528.

Author information

Corresponding author

Correspondence to Ioan Raicu.

Copyright information

© 2014 Springer Science+Business Media New York

About this chapter

Cite this chapter

Zhao, Y., Li, Y., Raicu, I., Lin, C., Tian, W., Xue, R. (2014). Migrating Scientific Workflow Management Systems from the Grid to the Cloud. In: Li, X., Qiu, J. (eds) Cloud Computing for Data-Intensive Applications. Springer, New York, NY. https://doi.org/10.1007/978-1-4939-1905-5_10

  • DOI: https://doi.org/10.1007/978-1-4939-1905-5_10

  • Publisher Name: Springer, New York, NY

  • Print ISBN: 978-1-4939-1904-8

  • Online ISBN: 978-1-4939-1905-5

  • eBook Packages: Computer Science, Computer Science (R0)
