Abstract
Massively parallel computing systems are being built with thousands of nodes. Because of the large number of components, it is critical to keep these systems running even in the presence of failures. Interconnection networks play a key role in these systems, and this paper proposes a fault-tolerant routing methodology for use in such networks. The methodology supports any minimal routing function (including fully adaptive routing), does not degrade performance in the absence of faults, does not disable any healthy node, and is easy to implement in both meshes and tori. To avoid network failures, the methodology uses a simple mechanism: for some source-destination pairs, packets are forwarded to the destination node through a set of intermediate nodes (without being ejected from the network). The methodology is shown to tolerate a large number of faults (e.g., five faults when using two intermediate nodes, or nine faults when using three, in a 3D torus). Furthermore, the methodology offers graceful performance degradation: in an 8 × 8 × 8 torus network with 14 faults, throughput decreases by only 6.49%.
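The intermediate-node mechanism can be illustrated with a small sketch. This is not the paper's algorithm: it assumes a mesh (no wraparound links), node faults only, a single intermediate node, and that a fully adaptive minimal routing function may use any node inside the bounding box spanned by the two endpoints of a segment. The names `plan_route`, `in_box`, and `box_is_fault_free` are hypothetical and chosen for illustration only.

```python
from itertools import product


def in_box(a, b, p):
    """True if node p lies on some minimal path from a to b in a mesh."""
    return all(min(x, y) <= z <= max(x, y) for x, y, z in zip(a, b, p))


def box_is_fault_free(a, b, faults):
    """True if no faulty node can be reached by minimal routing from a to b."""
    return not any(in_box(a, b, f) for f in faults)


def hops(a, b):
    """Minimal hop count between a and b in a mesh (Manhattan distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def plan_route(src, dst, faults, dims):
    """Return the waypoints [src, dst] or [src, intermediate, dst].

    Assumption (not from the paper): at most one intermediate node is tried,
    and a segment is safe only if its whole bounding box is fault-free.
    """
    if box_is_fault_free(src, dst, faults):
        return [src, dst]  # no fault on any minimal path: route directly

    best = None
    for node in product(*(range(k) for k in dims)):
        if node in faults or node == src or node == dst:
            continue
        if box_is_fault_free(src, node, faults) and box_is_fault_free(node, dst, faults):
            length = hops(src, node) + hops(node, dst)
            if best is None or length < best[0]:
                best = (length, node)

    if best is None:
        raise ValueError("no single intermediate node avoids the faults")
    return [src, best[1], dst]


if __name__ == "__main__":
    # 4 x 4 mesh with one faulty node at (1, 1)
    print(plan_route((0, 0), (3, 3), {(1, 1)}, (4, 4)))
```

Under these assumptions a single intermediate node does not always suffice. For example, with a fault at (1, 0) lying directly between (0, 0) and (3, 0) in the same row, the sketch raises `ValueError`, which is consistent with the abstract's use of a set of intermediate nodes rather than one.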
This work was supported by the Spanish MCYT under Grant TIC2003-08154-C06-01.
Copyright information
© 2004 IFIP International Federation for Information Processing
About this paper
Cite this paper
Nordbotten, N.A. et al. (2004). A Fully Adaptive Fault-Tolerant Routing Methodology Based on Intermediate Nodes. In: Jin, H., Gao, G.R., Xu, Z., Chen, H. (eds) Network and Parallel Computing. NPC 2004. Lecture Notes in Computer Science, vol 3222. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30141-7_49
DOI: https://doi.org/10.1007/978-3-540-30141-7_49
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23388-6
Online ISBN: 978-3-540-30141-7
eBook Packages: Springer Book Archive