A Transfer Entropy Based Visual Analytics System for Identifying Causality of Critical Hardware Failures Case Study: CPU Failures in the K Computer

  • Conference paper
Methods and Applications for Modeling and Simulation of Complex Systems (AsiaSim 2018)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 946)

Abstract

Large-scale scientific computing facilities usually operate expensive HPC (High Performance Computing) systems, whose computational and storage resources are shared among authorized users. On such shared-resource systems, continuous and stable operation is fundamental for providing the hardware resources that different users need, including for large-scale numerical simulations, which are the main targets of such facilities. For instance, the K computer installed at the R-CCS (RIKEN Center for Computational Science) in Kobe, Japan, enables users to continuously run large jobs on tens of thousands of nodes (a maximum of 36,864 computational nodes) for up to 24 h, and a huge job using the entire K computer system (82,944 computational nodes) for up to 8 h. Critical hardware failures can directly impact the affected job, and may also indirectly impact subsequent scheduled jobs. To monitor the health of the K computer and its supporting facility, a large number of sensors have been providing a vast amount of measured data. Since it is almost impossible to analyze all of the data in real time, this information has been stored as log files for post-hoc analysis. In this work, we propose a visual analytics system that uses these large log files to identify the possible causes of critical hardware failures. We focus on the transfer entropy technique for quantifying the “causality” between a possible cause and a critical hardware failure. As a case study, we focus on critical CPU failures that required subsequent replacement, and utilize the log files containing the measured temperatures of the cooling system, such as air and water temperatures. We evaluated the usability of the proposed system through practical evaluations by a group of experts who work directly on the K computer system operation. The positive and negative feedback obtained from this evaluation will be considered for future enhancements.
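
To make the transfer entropy idea concrete, the following is a minimal sketch and not the system described in the paper. It assumes Schreiber's definition with a history length of one step, a plug-in (frequency-count) probability estimate, and equal-width binning of the measured values; the abstract specifies none of these choices. The function names `discretize` and `transfer_entropy` and the synthetic series in the demo are hypothetical and serve only as illustration.

```python
import math
import random
from collections import Counter


def discretize(values, bins=4):
    """Map continuous readings to integer bin indices (equal-width bins).

    Assumption: equal-width binning; the paper's binning is not stated.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / bins
    return [min(int((v - lo) / width), bins - 1) for v in values]


def transfer_entropy(source, target):
    """Plug-in estimate of transfer entropy from source to target, in bits.

    Uses history length 1 (an assumption):
    TE = sum p(y1, y0, x0) * log2( p(y1 | y0, x0) / p(y1 | y0) )
    where y1 is the next target value and y0, x0 are the current values.
    Both inputs must be equally long sequences of discrete symbols.
    """
    if len(source) != len(target):
        raise ValueError("source and target must have the same length")
    n = len(target) - 1  # number of observed transitions
    if n < 1:
        raise ValueError("need at least two samples")

    # Frequency counts of the joint and marginal events.
    triples = Counter(zip(target[1:], target[:-1], source[:-1]))
    pairs_yx = Counter(zip(target[:-1], source[:-1]))
    pairs_yy = Counter(zip(target[1:], target[:-1]))
    singles_y = Counter(target[:-1])

    te = 0.0
    for (y1, y0, x0), count in triples.items():
        p_joint = count / n
        p_given_both = count / pairs_yx[(y0, x0)]
        p_given_self = pairs_yy[(y1, y0)] / singles_y[y0]
        te += p_joint * math.log2(p_given_both / p_given_self)
    return te


if __name__ == "__main__":
    # Synthetic demo: y copies x with a one-step delay, so information
    # flows from x to y but not from y to x.
    random.seed(0)
    x = [random.randint(0, 1) for _ in range(5000)]
    y = [0] + x[:-1]
    print("TE x -> y:", transfer_entropy(x, y))
    print("TE y -> x:", transfer_entropy(y, x))
```

In the demo, the estimate in the direction x to y is close to 1 bit, while the reverse direction stays near zero. This asymmetry is what makes transfer entropy usable as a directional indicator, for example from a cooling temperature series toward a failure indicator series. Real sensor logs would first need to be aligned in time and discretized, for instance with `discretize`.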

Acknowledgements

Some of the results were obtained by using the K computer operational environment at the RIKEN CCS (Center for Computational Science) in Kobe, Japan.

Author information

Corresponding author

Correspondence to Kazuki Koiso.

Copyright information

© 2018 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Koiso, K., Sakamoto, N., Nonaka, J., Shoji, F. (2018). A Transfer Entropy Based Visual Analytics System for Identifying Causality of Critical Hardware Failures Case Study: CPU Failures in the K Computer. In: Li, L., Hasegawa, K., Tanaka, S. (eds) Methods and Applications for Modeling and Simulation of Complex Systems. AsiaSim 2018. Communications in Computer and Information Science, vol 946. Springer, Singapore. https://doi.org/10.1007/978-981-13-2853-4_44

  • DOI: https://doi.org/10.1007/978-981-13-2853-4_44

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-13-2852-7

  • Online ISBN: 978-981-13-2853-4

  • eBook Packages: Computer Science, Computer Science (R0)
