A resource management architecture for metacomputing systems

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 1459)

Abstract

Metacomputing systems are intended to support remote and/or concurrent use of geographically distributed computational resources. Resource management in such systems is complicated by five concerns that do not typically arise in other situations: site autonomy and heterogeneous substrates at the resources, and application requirements for policy extensibility, co-allocation, and online control. We describe a resource management architecture that addresses these concerns. This architecture distributes the resource management problem among distinct local manager, resource broker, and resource co-allocator components and defines an extensible resource specification language to exchange information about requirements. We describe how these techniques have been implemented in the context of the Globus metacomputing toolkit and used to implement a variety of different resource management strategies. We report on our experiences applying our techniques in a large testbed, GUSTO, incorporating 15 sites, 330 computers, and 3600 processors.
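As a rough illustration of the division of labor described above, the sketch below models a local manager, a resource broker, and a co-allocator in a few lines of Python. It is not the Globus implementation or its API: the class and function names (`LocalManager`, `broker`, `co_allocate`), the dictionary standing in for a resource specification, the greedy brokering policy, and the all-or-nothing co-allocation rule are all assumptions chosen for brevity.

```python
from dataclasses import dataclass


@dataclass
class LocalManager:
    """Hypothetical stand-in for one site's local resource manager."""
    site: str
    free_processors: int

    def allocate(self, count: int) -> bool:
        # The site alone decides whether to grant a request (site autonomy).
        if count <= self.free_processors:
            self.free_processors -= count
            return True
        return False

    def release(self, count: int) -> None:
        self.free_processors += count


def broker(request: dict, managers: dict) -> list:
    """Refine an abstract request into concrete (site, count) sub-requests.

    Assumed toy policy: take processors greedily from sites in name order.
    """
    remaining = request["processors"]
    plan = []
    for site in sorted(managers):
        if remaining == 0:
            break
        take = min(remaining, managers[site].free_processors)
        if take > 0:
            plan.append((site, take))
            remaining -= take
    if remaining > 0:
        raise RuntimeError("not enough processors across known sites")
    return plan


def co_allocate(plan: list, managers: dict) -> bool:
    """Assumed all-or-nothing policy: roll back if any site refuses."""
    granted = []
    for site, count in plan:
        if managers[site].allocate(count):
            granted.append((site, count))
        else:
            for g_site, g_count in granted:
                managers[g_site].release(g_count)
            return False
    return True


# Usage: an 80-processor request split across two hypothetical sites.
managers = {
    "site-a": LocalManager("site-a", 64),
    "site-b": LocalManager("site-b", 32),
}
plan = broker({"processors": 80}, managers)
ok = co_allocate(plan, managers)
```

Keeping the broker separate from the co-allocator means a different brokering policy can be swapped in without touching the allocation step, which is one way to read the abstract's requirement for policy extensibility.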

Editor information

Dror G. Feitelson, Larry Rudolph

Copyright information

© 1998 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Czajkowski, K. et al. (1998). A resource management architecture for metacomputing systems. In: Feitelson, D.G., Rudolph, L. (eds) Job Scheduling Strategies for Parallel Processing. JSSPP 1998. Lecture Notes in Computer Science, vol 1459. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0053981

  • DOI: https://doi.org/10.1007/BFb0053981

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-64825-3

  • Online ISBN: 978-3-540-68536-4

  • eBook Packages: Springer Book Archive
