A Transfer Entropy Based Visual Analytics System for Identifying Causality of Critical Hardware Failures Case Study: CPU Failures in the K Computer

Koiso, Kazuki; Sakamoto, Naohisa; Nonaka, Jorji; Shoji, Fumiyoshi

doi:10.1007/978-981-13-2853-4_44

Part of the book series: Communications in Computer and Information Science (CCIS, volume 946)

Included in the following conference series:

Asian Simulation Conference

Abstract

Large-scale scientific computing facilities usually operate expensive HPC (High Performance Computing) systems whose computational and storage resources are shared among authorized users. On such shared systems, continuous and stable operation is fundamental for providing the hardware resources that different users need, including for large-scale numerical simulations, which are the main targets of such facilities. For instance, the K computer installed at the R-CCS (RIKEN Center for Computational Science) in Kobe, Japan, enables users to continuously run large jobs with tens of thousands of nodes (a maximum of 36,864 computational nodes) for up to 24 h, and a huge job using the entire K computer system (82,944 computational nodes) for up to 8 h. Critical hardware failures can directly impact the affected job, and may also indirectly impact the subsequent scheduled jobs. To monitor the health condition of the K computer and its supporting facility, a large number of sensors have been providing a vast amount of measured data. Since it is almost impossible to analyze all of this data in real time, the information has been stored as log data files for post-hoc analysis. In this work, we propose a visual analytics system which uses these large log data files to identify the possible causes of critical hardware failures. We focused on the transfer entropy technique for quantifying the “causality” between a possible cause and a critical hardware failure. As a case study, we focused on critical CPU failures, which required subsequent replacement of the CPU, and utilized the log files corresponding to the measured temperatures of the cooling system, such as air and water. We evaluated the usability of the proposed system by conducting practical evaluations with a group of experts who work directly on the operation of the K computer system. The positive and negative feedback obtained from this evaluation will be considered for future enhancements.
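
To make the idea of quantifying "causality" with transfer entropy more concrete, the following is a minimal sketch of a histogram-based estimate in the sense of Schreiber (2000), with a history length of one time step. It is not the authors' implementation: the function names `transfer_entropy` and `make_demo_series`, the equal-width binning, the number of bins, and the synthetic series are all assumptions made for illustration. In the setting of the abstract, the source series would play the role of a cooling temperature log and the target series the role of a failure-related signal.

```python
import math
import random
from collections import Counter


def transfer_entropy(source, target, bins=4):
    """Plug-in estimate of transfer entropy (bits) from source to target.

    Assumptions for this sketch: history length 1 and equal-width binning.
    """
    if len(source) != len(target) or len(source) < 2:
        raise ValueError("series must have equal length of at least 2")

    def discretize(series):
        lo, hi = min(series), max(series)
        width = (hi - lo) / bins
        if width == 0.0:
            width = 1.0  # constant series: everything falls into bin 0
        return [min(int((v - lo) / width), bins - 1) for v in series]

    x = discretize(source)
    y = discretize(target)
    n = len(y) - 1  # number of (next, now) transitions

    # Empirical counts of the joint and marginal symbol combinations.
    triples = Counter(zip(y[1:], y[:-1], x[:-1]))  # (y_next, y_now, x_now)
    pairs_yx = Counter(zip(y[:-1], x[:-1]))        # (y_now, x_now)
    pairs_yy = Counter(zip(y[1:], y[:-1]))         # (y_next, y_now)
    singles_y = Counter(y[:-1])                    # y_now

    te = 0.0
    for (y_next, y_now, x_now), count in triples.items():
        p_joint = count / n
        p_given_both = count / pairs_yx[(y_now, x_now)]
        p_given_self = pairs_yy[(y_next, y_now)] / singles_y[y_now]
        te += p_joint * math.log2(p_given_both / p_given_self)
    return te


def make_demo_series(n=2000, seed=0):
    """Synthetic data: the response follows the driver with a one-step lag."""
    rng = random.Random(seed)
    driver = [rng.random() for _ in range(n)]
    response = [0.0] + [driver[t] + rng.gauss(0.0, 0.1) for t in range(n - 1)]
    return driver, response


if __name__ == "__main__":
    driver, response = make_demo_series()
    print("TE driver -> response:", transfer_entropy(driver, response))
    print("TE response -> driver:", transfer_entropy(response, driver))
```

Because the synthetic response is built from the lagged driver, the estimate from driver to response should come out clearly larger than the estimate in the reverse direction. This asymmetry is what makes transfer entropy usable as a directional indicator, although a large value suggests an information flow rather than proving a physical cause.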

Acknowledgements

Some of the results were obtained by using the K computer operational environment at the RIKEN CCS (Center for Computational Science) in Kobe, Japan.

Author information

Authors and Affiliations

Graduate School of System Informatics, Kobe University, Kobe, Japan
Kazuki Koiso & Naohisa Sakamoto
Center for Computational Science, RIKEN, Kobe, Japan
Jorji Nonaka & Fumiyoshi Shoji

Corresponding author

Correspondence to Kazuki Koiso.

Editor information

Editors and Affiliations

Ritsumeikan University, Kusatsu, Shiga, Japan
Liang Li
Ritsumeikan University, Kusatsu, Shiga, Japan
Kyoko Hasegawa
Ritsumeikan University, Kusatsu, Shiga, Japan
Satoshi Tanaka

About this paper

Cite this paper

Koiso, K., Sakamoto, N., Nonaka, J., Shoji, F. (2018). A Transfer Entropy Based Visual Analytics System for Identifying Causality of Critical Hardware Failures Case Study: CPU Failures in the K Computer. In: Li, L., Hasegawa, K., Tanaka, S. (eds) Methods and Applications for Modeling and Simulation of Complex Systems. AsiaSim 2018. Communications in Computer and Information Science, vol 946. Springer, Singapore. https://doi.org/10.1007/978-981-13-2853-4_44

DOI: https://doi.org/10.1007/978-981-13-2853-4_44
Published: 18 October 2018
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-2852-7
Online ISBN: 978-981-13-2853-4
eBook Packages: Computer Science, Computer Science (R0)
