QBETS: Queue Bounds Estimation from Time Series

  • Daniel Nurmi
  • John Brevik
  • Rich Wolski
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4942)


Most space-sharing parallel computers presently operated by high-performance computing centers use batch-queuing systems to manage processor allocation. Because these machines are typically “space-shared,” each job must wait in a queue until sufficient processor resources become available to service it. In production computing settings, the queuing delay (experienced by users as the time between when the job is submitted and when it begins execution) is highly variable. Users often find this variability a drag on productivity as it makes planning difficult and intellectual continuity hard to maintain.

In this work, we introduce an on-line system for predicting batch-queue delay and show that it generates correct and accurate bounds for queuing delay for batch jobs from 11 machines over a 9-year period. Our system comprises 4 novel and interacting components: a predictor based on nonparametric inference; an automated change-point detector; machine-learned, model-based clustering of jobs having similar characteristics; and an automatic downtime detector to identify systemic failures that affect job queuing delay. We compare the correctness and accuracy of our system against various previously used prediction techniques and show that our new method outperforms them for all machines we have available for study.
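As an illustration of what a nonparametric bound on queuing delay can look like, the following is a minimal sketch of the standard order-statistic (binomial) argument for an upper confidence bound on a quantile. It is not taken from the paper: the function names, the default quantile and confidence level, and the assumption that past wait times behave as independent draws from one distribution are all introduced here for illustration only.

```python
from math import comb


def binomial_cdf(k, n, q):
    """Return P(X <= k) for X ~ Binomial(n, q)."""
    return sum(comb(n, j) * q**j * (1 - q) ** (n - j) for j in range(k + 1))


def upper_bound_on_quantile(waits, q=0.95, confidence=0.95):
    """Upper confidence bound on the q-quantile of the wait-time distribution.

    Hypothetical helper (not the authors' code). Assumes the observed waits
    are independent samples from a single, unchanging distribution.
    Returns None when there are too few samples to reach the confidence level.
    """
    xs = sorted(waits)
    n = len(xs)
    for k in range(1, n + 1):
        # The k-th smallest observation is at or above the true q-quantile
        # exactly when at most k - 1 observations fall below that quantile.
        # That count is Binomial(n, q), so the coverage is P(X <= k - 1).
        if binomial_cdf(k - 1, n, q) >= confidence:
            return xs[k - 1]
    return None


if __name__ == "__main__":
    history = list(range(1, 60))  # 59 made-up wait times, in seconds
    print(upper_bound_on_quantile(history))
```

Under these assumptions, at least 59 observations are needed before any sample value can serve as a 95%-confidence bound on the 0.95 quantile, because 1 - 0.95^n first reaches 0.95 at n = 59. With fewer samples the sketch returns None rather than guessing.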


Keywords (added by machine, not by the authors): Wait Time; Queue Delay; Binomial Method; Site Administrator; Queue Wait Time





Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Daniel Nurmi (1)
  • John Brevik (2)
  • Rich Wolski (1)
  1. Computer Science Department, University of California, Santa Barbara, Santa Barbara, California
  2. Mathematics and Statistics Department, California State University, Long Beach, Long Beach, California
