Managing Data-Intensive Workloads in a Cloud

Grid and Cloud Database Management

Abstract

The amount of data available in many areas is increasing faster than our ability to process it. The promise of “infinite” resources offered by the cloud computing paradigm has led to recent interest in exploiting clouds for large-scale data-intensive computing. Data-intensive computing presents new challenges for systems management in the cloud, including new processing frameworks, such as MapReduce, and the costs inherent in large data sets in distributed environments. Workload management, an important component of systems management, is the discipline of effectively managing, controlling and monitoring “workflow” across computing systems. This chapter examines the state of the art of workload management for data-intensive computing in clouds. We present a taxonomy for workload management of data-intensive computing in the cloud and use it to classify and evaluate current workload management mechanisms.

Corresponding author

Correspondence to R. Mian.

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Mian, R., Martin, P., Brown, A., Zhang, M. (2011). Managing Data-Intensive Workloads in a Cloud. In: Fiore, S., Aloisio, G. (eds) Grid and Cloud Database Management. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20045-8_12

  • DOI: https://doi.org/10.1007/978-3-642-20045-8_12

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-20044-1

  • Online ISBN: 978-3-642-20045-8

  • eBook Packages: Computer Science, Computer Science (R0)
