Abstract
MapReduce (MR) has become a de facto standard for large-scale data analysis. It has also attracted the attention of the HPC community due to its simplicity, efficiency, and highly scalable parallel model. However, MR implementations present some issues that may complicate their execution in existing HPC clusters, especially concerning job submission. While MR imposes no strict parameters for submitting a job, users of a typical HPC cluster must specify the number of nodes and the amount of time required to complete the job. This paper presents the MR Job Adaptor, a component that optimizes the scheduling of MR jobs alongside HPC jobs in an HPC cluster. Experiments performed using real-world HPC and MapReduce workloads have shown that the MR Job Adaptor can properly transform MR jobs to be scheduled in an HPC cluster, minimizing job turnaround time and exploiting unused resources in the cluster.
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Neves, M.V., Ferreto, T., De Rose, C. (2012). Scheduling MapReduce Jobs in HPC Clusters. In: Kaklamanis, C., Papatheodorou, T., Spirakis, P.G. (eds) Euro-Par 2012 Parallel Processing. Euro-Par 2012. Lecture Notes in Computer Science, vol 7484. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-32820-6_19
DOI: https://doi.org/10.1007/978-3-642-32820-6_19
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-32819-0
Online ISBN: 978-3-642-32820-6
eBook Packages: Computer Science, Computer Science (R0)