A New Algorithm for Intermediate Dataset Storage in a Cloud-Based Dataflow

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9130)

Abstract

Running a dataflow in a cloud environment usually generates many useful intermediate datasets. A strategy for running a dataflow decides which of these datasets are stored; the rest are regenerated when they are needed. The intermediate dataset storage (IDS) problem asks for a strategy that minimizes the total cost of running the dataflow. The best previously known algorithm for linear-structure IDS, where “linear-structure” means that the datasets in the dataflow form a pipeline, takes \(O(n^4)\) time. In this paper, we present a new algorithm for this problem that improves the time complexity to \(O(n^3)\), where \(n\) is the number of datasets in the pipeline.
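To make the store-or-regenerate trade-off concrete, here is a minimal Python sketch under a simplified cost model that is an assumption and is not taken from the paper: dataset i costs `x[i]` to compute from its predecessor, costs `y[i]` per unit time to keep stored, and is used `v[i]` times per unit time; a dataset that is not stored is regenerated from its nearest stored predecessor (or from the pipeline input) on every use. The names `min_cost_rate` and `brute_force` are hypothetical. The dynamic program below is not the \(O(n^3)\) algorithm announced in the abstract, whose cost model may well differ; it only illustrates what a strategy is and how its total cost can be evaluated.

```python
from itertools import product


def min_cost_rate(x, y, v):
    """Cheapest total cost rate for a pipeline under the ASSUMED model.

    x[i]: cost to compute dataset i from dataset i-1 (assumed)
    y[i]: storage cost rate of dataset i (assumed)
    v[i]: usage frequency of dataset i (assumed)
    """
    n = len(x)
    # Position 0 is the always-available input, positions 1..n are the
    # datasets, and position n+1 is a virtual end marker.
    best = [float("inf")] * (n + 2)
    best[0] = 0
    for i in range(n + 1):          # position i is stored (or is the input)
        chain = 0                   # cost to regenerate the latest dataset from i
        gap = 0                     # cost rate of all unstored datasets after i
        for j in range(i + 1, n + 2):   # j is the next stored position
            store = y[j - 1] if j <= n else 0
            best[j] = min(best[j], best[i] + gap + store)
            if j <= n:
                # Extend the run of unstored datasets to include dataset j.
                chain += x[j - 1]
                gap += v[j - 1] * chain
    return best[n + 1]


def brute_force(x, y, v):
    """Try every store/regenerate choice; exponential, for checking only."""
    n = len(x)
    best = float("inf")
    for stored in product([False, True], repeat=n):
        cost = 0
        chain = 0
        for k in range(n):
            if stored[k]:
                cost += y[k]
                chain = 0           # later datasets restart from this one
            else:
                chain += x[k]
                cost += v[k] * chain
        best = min(best, cost)
    return best
```

Under this assumed model, calling `min_cost_rate([4, 2, 6], [3, 10, 1], [1, 1, 1])` selects the strategy that stores the first and third datasets and regenerates the second, which is cheaper than either storing everything or storing nothing.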

Acknowledgements

This paper is supported by the National Natural Science Foundation of China (grant 61472222) and the Natural Science Foundation of Shandong Province (grant ZR2012Z002).

Author information

Corresponding author

Correspondence to Daming Zhu.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Cheng, J., Zhu, D., Zhu, B. (2015). A New Algorithm for Intermediate Dataset Storage in a Cloud-Based Dataflow. In: Wang, J., Yap, C. (eds) Frontiers in Algorithmics. FAW 2015. Lecture Notes in Computer Science, vol 9130. Springer, Cham. https://doi.org/10.1007/978-3-319-19647-3_4

  • DOI: https://doi.org/10.1007/978-3-319-19647-3_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-19646-6

  • Online ISBN: 978-3-319-19647-3

  • eBook Packages: Computer Science, Computer Science (R0)
