Parallel Computing, Failure Recovery, and Extreme Values
A task of random size $T$ is split into $M$ subtasks of lengths $T_1, \dots, T_M$, each of which is sent to one of $M$ parallel processors. Each processor may fail at a random time before completing its allocated task, and then has to restart it from the beginning. If $X_1, \dots, X_M$ are the total task times at the $M$ processors, the overall total task time is then $Z_M = \max_{i=1,\dots,M} X_i$. Limit theorems as $M \to \infty$ are given for $Z_M$, allowing the distribution of $T$ to depend on $M$. In some cases the limits are classical extreme value distributions, in others they are of a different type.
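The model in the abstract can be illustrated with a small Monte Carlo sketch. It is not the authors' method, and it rests on assumptions the abstract does not state: the task is split into equal subtasks of length T/M, T is held fixed rather than random, times to failure are exponential with a common rate, and a failed processor restarts immediately with no repair delay. The names `restart_time`, `overall_time`, and `fail_rate` are hypothetical.

```python
import random


def restart_time(task_len, fail_rate, rng):
    """Total time X_i to finish one subtask under restart-on-failure.

    Assumption (not from the source): times to failure are i.i.d.
    exponential with rate fail_rate, and a restart is instantaneous.
    """
    total = 0.0
    while True:
        time_to_failure = rng.expovariate(fail_rate)
        if time_to_failure >= task_len:
            # No failure before completion: this attempt succeeds.
            return total + task_len
        # Failure before completion: the elapsed work is lost.
        total += time_to_failure


def overall_time(T, M, fail_rate, rng):
    """Z_M = max of the M per-processor total times.

    Assumption (not from the source): equal split, T_i = T / M.
    """
    subtask = T / M
    return max(restart_time(subtask, fail_rate, rng) for _ in range(M))


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [overall_time(10.0, 50, 0.5, rng) for _ in range(1000)]
    print("empirical mean of Z_M:", sum(samples) / len(samples))
```

Rerunning with larger `M`, or with `T` growing with `M`, gives a rough feel for how the distribution of the maximum shifts, which is the regime the limit theorems address.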
Key words: Cramér-Lundberg approximation, failure recovery, Fréchet distribution, geometric sums, Gumbel distribution, heavy tails, logarithmic asymptotics, mixture distribution, power tail, RESTART, triangular array
AMS Subject Classification: Primary 60G70; Secondary 60F05, 68M20