Peer-to-peer Cooperative Scheduling Architecture for National Grid Infrastructure

Matyska, Ludek; Ruda, Miroslav; Toth, Simon

doi:10.1007/978-1-4419-8014-4_8

Conference paper
First Online: 01 January 2011

Abstract

For some ten years, the Czech National Grid Infrastructure MetaCentrum has used a single central PBSPro installation to schedule jobs across the country. This centralized approach keeps full track of all the clusters, providing support for jobs spanning several sites, an implementation of the fair-share policy, and better overall control of the grid environment. Despite steady progress in stability and in resilience to intermittent, very short network failures, the growing number of sites and processors makes this architecture, with its single point of failure and scalability limits, obsolete. As a result, a new scheduling architecture is proposed that relies on greater autonomy of the clusters. It is based on a peer-to-peer network of semi-independent schedulers for each site or even each cluster. Each scheduler accepts jobs for the whole infrastructure, cooperating with the other schedulers on the implementation of global policies such as central job accounting, fair-share, or submission of jobs across several sites. The scheduling system is integrated with the Magrathea system to support scheduling of virtual clusters, including the setup of their internal network, again possibly spanning several sites. On the other hand, each scheduler is local to one or several clusters and is able to directly control and submit jobs to them even if the connection to other scheduling peers is lost. In parallel with the change of the overall architecture, the scheduling system itself is being replaced. Instead of PBSPro, originally chosen for its declared support of large-scale distributed environments, the new scheduling architecture is based on the open-source Torque system. The implementation of, and support for, the most desired properties in PBSPro and Torque are discussed, and the necessary modifications to Torque to support the MetaCentrum scheduling architecture are presented.
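The abstract describes the peer-to-peer architecture only at a high level. As a rough illustration, the following Python sketch models peer schedulers that each accept jobs for the whole infrastructure, place them on a local cluster when possible, forward them to reachable peers otherwise, and keep working locally when peers are unreachable. Everything in it is a hypothetical assumption for illustration and not the MetaCentrum or Torque implementation: the `Job` and `PeerScheduler` classes, first-fit placement, charging usage at dispatch time, and shared accounting modeled as per-scheduler counters merged by taking the maximum (a grow-only counter in the style of a CRDT).

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Job:
    user: str
    cpus: int
    walltime: float  # hypothetical: requested run time in seconds


@dataclass
class PeerScheduler:
    """Hypothetical model of one semi-independent site scheduler (not MetaCentrum code)."""

    name: str
    free_cpus: dict[str, int]  # local cluster name -> currently free CPUs
    peers: list[PeerScheduler] = field(default_factory=list)
    reachable: bool = True  # False simulates a network partition seen by other peers
    # Assumed accounting scheme: (origin scheduler, user) -> CPU-seconds.
    # Each scheduler only increments entries it owns, so merging by max is safe.
    usage: dict[tuple[str, str], float] = field(default_factory=dict)
    queue: list[Job] = field(default_factory=list)

    def run_locally(self, job: Job) -> str | None:
        """First-fit placement on a local cluster; charges usage at dispatch (a simplification)."""
        for cluster, free in self.free_cpus.items():
            if free >= job.cpus:
                self.free_cpus[cluster] = free - job.cpus
                key = (self.name, job.user)
                self.usage[key] = self.usage.get(key, 0.0) + job.cpus * job.walltime
                return f"{self.name}/{cluster}"
        return None

    def submit(self, job: Job) -> str:
        """Accept any job for the whole infrastructure: local clusters first, then reachable peers."""
        placed = self.run_locally(job)
        if placed is not None:
            return placed
        for peer in self.peers:
            if peer.reachable:
                placed = peer.run_locally(job)
                if placed is not None:
                    return placed
        # No free capacity or no reachable peer: keep the job until resources appear.
        self.queue.append(job)
        return "queued"

    def merge_usage(self, other: PeerScheduler) -> None:
        """Idempotent, order-independent merge of another peer's accounting records."""
        for key, value in other.usage.items():
            self.usage[key] = max(self.usage.get(key, 0.0), value)

    def global_usage(self, user: str) -> float:
        """Total CPU-seconds charged to a user across all schedulers known so far."""
        return sum(v for (_, u), v in self.usage.items() if u == user)

    def fair_share_order(self) -> list[Job]:
        """Pending jobs, least globally consuming user first (a deliberately simple policy)."""
        return sorted(self.queue, key=lambda j: self.global_usage(j.user))
```

For example, with two peers linked to each other, a job that does not fit on the submitting site is placed on the other site. When that peer is marked unreachable, the submitting site queues new jobs instead of failing, while the partitioned peer still runs its own local submissions. Because each peer only increments its own counters, exchanging usage records in any order, or more than once, yields the same global totals for fair-share ordering. A real system would also need retries of queued jobs, job migration, and authenticated peer communication, which this sketch omits.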

Author information

Authors and Affiliations

CESNET, Zikova 4, Praha, Czech Republic
Ludek Matyska, Miroslav Ruda & Simon Toth

Editor information

Editors and Affiliations

Academia Sinica Grid Computing Centre, Sec. 3, Academia Road 128, Nankang, Taipei, 115, Taiwan R.O.C.
Simon C. Lin
Academia Sinica Grid Computing Centre, Sec. 3, Academia Road 128, Nankang, Taipei, 115, Taiwan R.O.C.
Eric Yen

About this paper

Cite this paper

Matyska, L., Ruda, M., Toth, S. (2011). Peer-to-peer Cooperative Scheduling Architecture for National Grid Infrastructure. In: Lin, S., Yen, E. (eds) Data Driven e-Science. Springer, New York, NY. https://doi.org/10.1007/978-1-4419-8014-4_8

DOI: https://doi.org/10.1007/978-1-4419-8014-4_8
Published: 14 January 2011
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4419-8013-7
Online ISBN: 978-1-4419-8014-4
eBook Packages: Computer Science, Computer Science (R0)