The Implementation and Evaluation of High-Speed Link Monitoring Tool for Supercomputer

  • Jiaqing Xu
  • Jie He
  • Xiaotao Hu
  • Jijun Cao
  • Lei Zhang
  • Chongfeng Wang
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 994)

Abstract

As system scale and link speed increase, link failure has become the most important type of interconnect fault in supercomputers, posing great challenges to the maintenance of high-performance interconnect networks. To let operation and maintenance personnel monitor the status and performance of all high-speed links of a supercomputer in real time, this paper designs a high-speed link monitoring tool based on the in-band network. The tool offers good scalability and robustness for real-time monitoring of high-speed link status and performance information. It has been used in practice in the operation and maintenance of domestic supercomputers, where it speeds up the locating and troubleshooting of link failures and effectively reduces supercomputer downtime.

Keywords

Supercomputer, Interconnection networks, High-speed link, Monitoring tool

Notes

Acknowledgements

This work is mainly supported by the National Key Research and Development Program of China (2016YFB0200203) and the National Natural Science Foundation of China (61572509).

Copyright information

© Springer Nature Singapore Pte Ltd. 2019

Authors and Affiliations

  • Jiaqing Xu (1)
  • Jie He (1)
  • Xiaotao Hu (1)
  • Jijun Cao (1)
  • Lei Zhang (1)
  • Chongfeng Wang (1)
  1. School of Computer, National University of Defense Technology, Changsha, China
