Self-Diagnosis, Reconfiguration and Recovery in the Dynamical Reconfigurable Multiprocessor System DAMP

  • Andreas Bauch
  • Erik Maehle
Conference paper
Part of the Informatik-Fachberichte book series (INFORMATIK, volume 283)

Abstract

This paper introduces the fault tolerance concept for the dynamically reconfigurable multiprocessor system DAMP, currently under development at the University of Paderborn. Its architecture is based on a single type of building block (the DAMP module) consisting of a transputer, memory and a local switching network. These building blocks are interconnected according to a fixed physical topology with restricted neighborhood (octagonal torus). Communication paths between nodes can be built up and released dynamically at runtime in a fully distributed way (circuit switching). An 8-processor prototype is currently operational, and a redesign for a 64-processor system is under way. Fault tolerance will be realized by dynamic redundancy in the form of standby sparing. The distributed self-diagnosis, reconfiguration and recovery techniques are described in some detail.
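
To make the topology and the sparing idea concrete, the following is a minimal sketch and not the DAMP implementation. It assumes that "octagonal torus" means a two-dimensional grid with wraparound in which each node is linked to its eight nearest neighbors, diagonals included, and it assumes an 8 x 8 layout for the 64-processor system. The function names `octagonal_torus_neighbors` and `replace_with_spare` are hypothetical, and the sparing step is centralized here for brevity, whereas the paper describes a distributed scheme.

```python
from itertools import product


def octagonal_torus_neighbors(row, col, rows, cols):
    """Return the 8 wraparound neighbors of (row, col) in a rows x cols grid.

    Assumption: each node is linked to its horizontal, vertical and
    diagonal neighbors, with indices wrapping around at the edges.
    """
    offsets = [
        (dr, dc)
        for dr, dc in product((-1, 0, 1), repeat=2)
        if (dr, dc) != (0, 0)
    ]
    return [((row + dr) % rows, (col + dc) % cols) for dr, dc in offsets]


def replace_with_spare(assignment, faulty, spares):
    """Standby sparing sketch: move the task of a faulty node to a spare.

    assignment maps node coordinates to task names, spares is a list of
    idle nodes. Returns a new assignment and consumes one spare from the list.
    """
    if not spares:
        raise RuntimeError("no spare node left")
    spare = spares.pop(0)
    new_assignment = dict(assignment)
    new_assignment[spare] = new_assignment.pop(faulty)
    return new_assignment


if __name__ == "__main__":
    # Corner node of an assumed 8 x 8 torus: neighbors wrap to the far edges.
    print(octagonal_torus_neighbors(0, 0, 8, 8))

    tasks = {(0, 0): "t0", (0, 1): "t1"}
    idle = [(7, 7)]
    print(replace_with_spare(tasks, (0, 1), idle))
```

In this sketch the spare takes over the task of the faulty node. In a circuit-switched system the communication paths to the new node would then have to be rebuilt, which is outside the scope of the example.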

Keywords

Pyramid iPSC 

Copyright information

© Springer-Verlag Berlin Heidelberg 1991

Authors and Affiliations

  • Andreas Bauch (1)
  • Erik Maehle (1)
  1. Fachgebiet Datentechnik, Universität-GH-Paderborn, Paderborn, Fed. Rep. of Germany
