Foundations of Dependable Computing, pp. 125–153

# Fault-Tolerance and Efficiency in Massively Parallel Algorithms

## Abstract

We present an overview of massively parallel deterministic algorithms that combine high fault-tolerance and efficiency. This desirable combination (called *robustness* here) is nontrivial, since increasing efficiency implies removing redundancy, whereas increasing fault-tolerance requires adding redundancy to computations. We study a spectrum of algorithmic models for which significant robustness is achievable, ranging from static-fault synchronous computation to dynamic-fault asynchronous computation. In addition to fail-stop processor models, we examine and deal with arbitrarily initialized memory and restricted memory access concurrency. We survey the deterministic upper bounds for the basic *Write-All* primitive, the lower bounds on its efficiency, and we identify some of the key open questions. We also generalize the robust computing of functions to relations; this new approach can model approximate computations. We show how to compute approximate *Write-All* optimally. Finally, we synthesize the state-of-the-art in a complexity classification, which extends the traditional classification of efficient parallel algorithms with fault-tolerance.
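To make the *Write-All* primitive concrete, the following is a minimal toy simulation, not the chapter's algorithm: `p` processors must set every cell of an `n`-cell shared array to 1 even though some processors fail-stop mid-computation. The round structure, the balanced cell allocation, the failure schedule, and all names are illustrative assumptions.

```python
def write_all(n, p, failures):
    """Simulate synchronous Write-All with fail-stop processors.

    failures: dict mapping round number -> set of processor ids
              that stop at the end of that round (an assumed schedule).
    Returns the shared array and the total number of writes ("work").
    """
    x = [0] * n                 # shared memory; Write-All must set all cells to 1
    live = set(range(p))        # processors still running
    work = 0                    # total writes performed, the work measure
    rnd = 0
    while 0 in x and live:
        # Each live processor scans for unset cells and claims one;
        # concurrent writes of the same value are harmless (CRCW-style).
        unset = [i for i, v in enumerate(x) if v == 0]
        for k, pid in enumerate(sorted(live)):
            x[unset[k % len(unset)]] = 1
            work += 1
        live -= failures.get(rnd, set())   # fail-stop: these never return
        rnd += 1
    return x, work

cells, work = write_all(n=8, p=4, failures={0: {1, 3}})
assert all(v == 1 for v in cells)   # completed despite two fail-stop faults
```

The tension the abstract describes is visible even in this sketch: surviving processors re-scan the array to discover work left unfinished by failed ones, and that redundant scanning is exactly the overhead that robust algorithms try to keep within a small factor of the fault-free work.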

## Keywords

Parallel Algorithm · Shared Memory · Random Access Machine · Overhead Ratio · Faulty Processor
