Maintaining Quality of Service with Dynamic Fault Tolerance in Fat-Trees

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 5374)

Abstract

Utility Computing Data Centres (UCDCs) are an important part of the computing landscape: large-scale computing systems that offer computational services to concurrently running jobs through virtual servers. As UCDC systems grow in size and the mean time between failures decreases, it becomes increasingly important to tolerate failures quickly and dynamically, while distributing the effects of each failure among the virtual servers according to their service level agreements. We propose and evaluate a strategy for offering predictable service in fat-trees experiencing faults by reprioritising packets. The strategy can distribute the effect of network faults so as to satisfy a number of quality-of-service demands. Which demands to favour depends on the computer system and the characteristics of the jobs it is running; in the presence of a moderate number of faults, the demands can be met to some degree.
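
To make the idea of distributing the effect of a fault according to service level agreements concrete, the sketch below divides the capacity left on a degraded link among virtual servers using weighted max-min fairness. This is only an illustration under stated assumptions, not the packet reprioritisation strategy evaluated in the paper: the `VirtualServer` type, the `sla_weight` field, and the `share_capacity` function are hypothetical names introduced here, and weights are assumed to be positive.

```python
from dataclasses import dataclass


@dataclass
class VirtualServer:
    name: str
    demand: float      # bandwidth the server wants (arbitrary units)
    sla_weight: float  # hypothetical SLA weight; must be > 0


def share_capacity(servers, capacity):
    """Weighted max-min fair split of `capacity` (illustrative only).

    Servers whose demand fits inside their weighted share receive their
    full demand; what they leave unused is re-split among the rest.
    """
    allocation = {}
    active = list(servers)
    remaining = capacity
    while active:
        total_weight = sum(s.sla_weight for s in active)
        fair = {s.name: remaining * s.sla_weight / total_weight
                for s in active}
        satisfied = [s for s in active if s.demand <= fair[s.name]]
        if not satisfied:
            # Nobody can be fully served: everyone gets the weighted share.
            allocation.update(fair)
            break
        for s in satisfied:
            allocation[s.name] = s.demand
            remaining -= s.demand
        active = [s for s in active if s.demand > fair[s.name]]
    return allocation


if __name__ == "__main__":
    demo = [
        VirtualServer("gold", 40.0, 3.0),
        VirtualServer("silver", 40.0, 2.0),
        VirtualServer("bronze", 40.0, 1.0),
    ]
    print(share_capacity(demo, 120.0))  # fault-free link
    print(share_capacity(demo, 60.0))   # capacity halved by a fault
```

In this toy model a fault that halves the link capacity costs the lowest-weight server the most, which is one possible reading of "distributing the effects of each failure according to service level agreements"; other policies would weight the loss differently.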

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Sem-Jacobsen, F.O., Skeie, T. (2008). Maintaining Quality of Service with Dynamic Fault Tolerance in Fat-Trees. In: Sadayappan, P., Parashar, M., Badrinath, R., Prasanna, V.K. (eds) High Performance Computing - HiPC 2008. HiPC 2008. Lecture Notes in Computer Science, vol 5374. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-89894-8_40

  • DOI: https://doi.org/10.1007/978-3-540-89894-8_40

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-89893-1

  • Online ISBN: 978-3-540-89894-8

  • eBook Packages: Computer Science, Computer Science (R0)
