A resource management architecture for metacomputing systems

Czajkowski, Karl; Foster, Ian; Karonis, Nick; Kesselman, Carl; Martin, Stuart; Smith, Warren; Tuecke, Steven

doi:10.1007/BFb0053981

Conference paper
First Online: 01 January 2006

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 1459)

Abstract

Metacomputing systems are intended to support remote and/or concurrent use of geographically distributed computational resources. Resource management in such systems is complicated by five concerns that do not typically arise in other situations: site autonomy and heterogeneous substrates at the resources, and application requirements for policy extensibility, co-allocation, and online control. We describe a resource management architecture that addresses these concerns. This architecture distributes the resource management problem among distinct local manager, resource broker, and resource co-allocator components and defines an extensible resource specification language to exchange information about requirements. We describe how these techniques have been implemented in the context of the Globus metacomputing toolkit and used to implement a variety of different resource management strategies. We report on our experiences applying our techniques in a large testbed, GUSTO, incorporating 15 sites, 330 computers, and 3600 processors.
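The division of labor named in the abstract (local manager, resource broker, resource co-allocator) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and none of it is the Globus toolkit's API: the class `LocalManager`, the functions `broker` and `co_allocate`, the dictionary-shaped request, and the greedy placement policy are all hypothetical. It only shows how a broker might refine an abstract request into per-site parts, and how a co-allocator might start all parts or none.

```python
from dataclasses import dataclass, field


@dataclass
class LocalManager:
    """Hypothetical stand-in for one site's local resource manager."""
    site: str
    free_processors: int
    running: dict = field(default_factory=dict)
    _next_id: int = 0

    def submit(self, count: int) -> str:
        # Site autonomy: the site alone decides whether to accept a request.
        if count > self.free_processors:
            raise RuntimeError(f"{self.site}: not enough processors")
        self.free_processors -= count
        self._next_id += 1
        job_id = f"{self.site}-{self._next_id}"
        self.running[job_id] = count
        return job_id

    def cancel(self, job_id: str) -> None:
        # Return the job's processors to the free pool.
        self.free_processors += self.running.pop(job_id)


def broker(request: dict, managers: dict) -> list:
    """Refine an abstract request into concrete (site, count) parts.

    Illustrative policy only: fill sites greedily, largest free pool first.
    """
    needed = request["count"]
    parts = []
    for m in sorted(managers.values(), key=lambda m: -m.free_processors):
        if needed == 0:
            break
        take = min(needed, m.free_processors)
        if take > 0:
            parts.append((m.site, take))
            needed -= take
    if needed > 0:
        raise RuntimeError("no combination of sites satisfies the request")
    return parts


def co_allocate(parts: list, managers: dict) -> list:
    """Submit every part; cancel all started parts if any site refuses."""
    started = []
    try:
        for site, count in parts:
            started.append((site, managers[site].submit(count)))
    except RuntimeError:
        for site, job_id in started:
            managers[site].cancel(job_id)
        raise
    return started


# Usage: an 80-processor request spread over two hypothetical sites.
managers = {s: LocalManager(s, n) for s, n in [("site_a", 64), ("site_b", 32)]}
parts = broker({"executable": "sim", "count": 80}, managers)
jobs = co_allocate(parts, managers)
```

Keeping the placement policy inside `broker` and the all-or-nothing start inside `co_allocate` mirrors the separation the abstract describes: a different policy can be swapped in without touching the per-site managers.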

Author information

Authors and Affiliations

Information Sciences Institute, University of Southern California, Marina del Rey, CA 90292-6695
Karl Czajkowski & Carl Kesselman
Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL 60439
Ian Foster, Nick Karonis, Stuart Martin, Warren Smith & Steven Tuecke

Editor information

Dror G. Feitelson, Larry Rudolph

Cite this paper

Czajkowski, K. et al. (1998). A resource management architecture for metacomputing systems. In: Feitelson, D.G., Rudolph, L. (eds) Job Scheduling Strategies for Parallel Processing. JSSPP 1998. Lecture Notes in Computer Science, vol 1459. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0053981

DOI: https://doi.org/10.1007/BFb0053981
Published: 25 May 2006
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-64825-3
Online ISBN: 978-3-540-68536-4
eBook Packages: Springer Book Archive