A Job Self-scheduling Policy for HPC Infrastructures

  • Francesc Guim
  • Julita Corbalan
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4942)


The number of distributed high performance computing architectures has increased exponentially in recent years. As a result, systems composed of several computational resources provided by different research centers and universities have become very popular. Job scheduling policies have been adapted to these new scenarios, in which several independent resources have to be managed. New policies have been designed to take into account issues such as multi-cluster environments, heterogeneous systems and the geographical distribution of the resources.

Several centralized scheduling solutions have been proposed in the literature for these environments, such as centralized schedulers, centralized queues and global controllers. These approaches use a single scheduling entity that is responsible for scheduling all the jobs submitted by the users.

In this paper we propose the use of self-scheduling techniques for dispatching jobs submitted to a set of distributed computational hosts that are managed by independent schedulers (such as MOAB or LoadLeveler). It is a non-centralized, job-guided scheduling policy whose main goal is to optimize the job wait time. Thus, scheduling decisions are made independently for each job instead of using a global policy in which all the jobs are considered. In addition, as part of the proposed solution, we demonstrate how job wait time prediction techniques can substantially improve the performance obtained in the described architecture.
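The job-guided dispatching idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Job` and `Host` classes, the `self_schedule` function and the backlog-based wait time predictor are all hypothetical names and assumptions introduced here for illustration. Each job asks every host for a predicted wait time and submits itself to the host with the lowest prediction, without any global scheduler looking at the other jobs.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Job:
    name: str
    cpus: int
    runtime_estimate: float  # user-supplied estimate, in seconds


@dataclass
class Host:
    """Hypothetical stand-in for one site managed by its own local scheduler."""
    name: str
    total_cpus: int
    queued_work: float = 0.0  # CPU-seconds of work already accepted
    accepted: List[str] = field(default_factory=list)

    def predicted_wait(self, job: Job) -> float:
        # Toy predictor (an assumption, not the paper's technique):
        # current backlog divided by the capacity of the host.
        if job.cpus > self.total_cpus:
            return float("inf")  # the job can never run on this host
        return self.queued_work / self.total_cpus

    def submit(self, job: Job) -> None:
        # Record the job and add its estimated work to the backlog.
        self.queued_work += job.cpus * job.runtime_estimate
        self.accepted.append(job.name)


def self_schedule(job: Job, hosts: List[Host]) -> Host:
    """Choose a host for this job alone, using only predicted wait times."""
    best = min(hosts, key=lambda h: h.predicted_wait(job))
    if best.predicted_wait(job) == float("inf"):
        raise ValueError(f"no host can run job {job.name}")
    best.submit(job)
    return best


if __name__ == "__main__":
    hosts = [Host("site_a", total_cpus=64), Host("site_b", total_cpus=128)]
    jobs = [Job("j1", 32, 3600.0), Job("j2", 16, 600.0), Job("j3", 100, 1800.0)]
    for job in jobs:
        print(job.name, "->", self_schedule(job, hosts).name)
```

In this sketch the decision for each job depends only on the predictions available when that job is dispatched, which is what distinguishes it from a centralized policy that considers all queued jobs together. A real deployment would replace the toy predictor with estimates obtained from the local schedulers.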


Schedule Policy, Local Scheduler, Global Scheduler, Centralize Scheduler, Reservation Table
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Francesc Guim (1)
  • Julita Corbalan (1)
  1. Barcelona Supercomputing Center
