Job Status Prediction – Catch Them Before They Fail
Jobs in a computer cluster can end with several exit statuses, which are caused by application properties and by user and scheduler behavior. In this paper we analyze the importance of job statuses and the potential use of predicting them prior to job execution. A method for predicting failed jobs based on a Bayesian classifier is proposed, and its accuracy is analyzed on several workloads. The method is integrated into the EASY algorithm, which is adapted to prioritize jobs that are likely to fail. System performance is analyzed for both the failed jobs and the entire workload.
Keywords: Computer cluster, Job Status Prediction, Bayesian Classifier
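The abstract does not give the details of the classifier, so the following is only a minimal sketch of how a Bayesian (here: naive Bayes) classifier could predict a job's exit status before it runs. The attribute names (`user`, `queue`), the status labels, the Laplace smoothing constant `alpha`, and the class name `NaiveBayesJobStatus` are all assumptions made for illustration; they are not taken from the paper.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesJobStatus:
    """Hypothetical naive Bayes classifier over categorical job attributes."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                          # Laplace smoothing (assumed)
        self.class_counts = Counter()               # status -> number of jobs
        self.feature_counts = defaultdict(Counter)  # (attribute, status) -> value counts
        self.feature_values = defaultdict(set)      # attribute -> values seen in training

    def fit(self, jobs, statuses):
        """Count how often each attribute value occurs with each status."""
        for job, status in zip(jobs, statuses):
            self.class_counts[status] += 1
            for name, value in job.items():
                self.feature_counts[(name, status)][value] += 1
                self.feature_values[name].add(value)
        return self

    def log_scores(self, job):
        """Return the unnormalized log posterior of every known status."""
        total = sum(self.class_counts.values())
        scores = {}
        for status, n in self.class_counts.items():
            score = math.log(n / total)  # log prior P(status)
            for name, value in job.items():
                count = self.feature_counts[(name, status)][value]
                # One extra slot so unseen values keep a non-zero probability.
                vocab = len(self.feature_values[name]) + 1
                score += math.log((count + self.alpha) / (n + self.alpha * vocab))
            scores[status] = score
        return scores

    def predict(self, job):
        """Return the most probable status for a job that has not run yet."""
        scores = self.log_scores(job)
        return max(scores, key=scores.get)


# Toy history of finished jobs (fabricated values, for illustration only).
history = [
    ({"user": "u1", "queue": "short"}, "failed"),
    ({"user": "u1", "queue": "short"}, "failed"),
    ({"user": "u1", "queue": "long"}, "completed"),
    ({"user": "u2", "queue": "long"}, "completed"),
    ({"user": "u2", "queue": "short"}, "completed"),
]
model = NaiveBayesJobStatus().fit(
    [job for job, _ in history], [status for _, status in history]
)
prediction = model.predict({"user": "u1", "queue": "short"})
```

In a scheduler, a prediction like this could be used to move jobs predicted as `failed` toward the front of the waiting queue, which is the general idea the abstract describes for the adapted EASY algorithm; how the paper actually combines the two is not specified here.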