Abstract
The increasing demand for computing power in scientific and engineering applications has motivated the deployment of high-performance computing (HPC) systems that deliver tera-scale performance. Current and future HPC systems capable of running large-scale parallel applications may span hundreds of thousands of nodes.
In 2006, according to top500.org [282], the highest processor count was 131K. For parallel programs, the failure probability of nodes, and of the computing tasks assigned to them, has been shown to increase significantly as the number of nodes grows. Large-scale computing environments, such as the current grids CERN LCG, NorduGrid, TeraGrid and Grid’5000, gather (tens of) thousands of resources for use by an ever-growing scientific community. Many of these Grids offer computing resources grouped in clusters whose owners may share them only for limited periods of time. Grids also have the problems of any large-scale computing environment, compounded by middleware that is still relatively immature; together, these factors make Grids relatively unreliable computing platforms. Long et al. [237] collected a dataset on node failures over 11 months from 1139 workstations on the Internet to determine their uptime intervals. Plank and Elwasif [277] collected failure data for 16 DEC Alpha workstations at Princeton University; this network is smaller and is a typical local cluster of homogeneous processors. Its failure data, collected over 7 months, shows characteristics similar to those of the larger clusters.
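To make the claim that failure probability grows with the number of nodes concrete, here is a minimal sketch under a simplifying assumption that is illustrative only and not the chapter's model: each node fails independently with the same probability p during a job's run, so the probability that at least one of N nodes fails is 1 - (1 - p)^N, which approaches 1 as N grows. The function name `prob_any_failure` and the per-node probability used below are hypothetical; real failures are often correlated, so this shows only the trend.

```python
def prob_any_failure(p_node: float, n_nodes: int) -> float:
    """Probability that at least one of n_nodes fails.

    Assumes (hypothetically) that every node fails independently with the
    same probability p_node over the same time window.
    """
    if not 0.0 <= p_node <= 1.0:
        raise ValueError("p_node must be a probability in [0, 1]")
    if n_nodes < 0:
        raise ValueError("n_nodes must be non-negative")
    # Complement of the event "every node survives": 1 - (1 - p)^N
    return 1.0 - (1.0 - p_node) ** n_nodes


if __name__ == "__main__":
    p = 1e-5  # hypothetical per-node failure probability, not from the chapter
    # Node counts echo the cluster sizes mentioned in the text
    for n in (16, 1_139, 131_072):
        print(f"{n:>7} nodes: P(at least one failure) = {prob_any_failure(p, n):.4f}")
```

Even with a tiny per-node probability, the job-level failure probability climbs steeply at the scale of the largest 2006 systems, which is why a parallel job spanning many nodes is far more exposed than one running on a small homogeneous cluster.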
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Carlsson, C., Fullér, R. (2011). Risk Assessment in Grid Computing. In: Possibility for Decision. Studies in Fuzziness and Soft Computing, vol 270. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-22642-7_7
DOI: https://doi.org/10.1007/978-3-642-22642-7_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-22641-0
Online ISBN: 978-3-642-22642-7
eBook Packages: Engineering, Engineering (R0)