An Approach to Failure Prediction in Cluster by Self-updating Cause-and-Effect Graph

  • Yan Yu
  • Haopeng Chen
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11513)


Cluster systems are widely used in cloud computing, high-performance computing, and other fields, and both their usage and their scale are growing sharply. Larger cluster systems, however, are more prone to failures, and repairing those failures is difficult and costly, which makes failure prediction important. To address this challenge, we propose an approach to failure prediction in cluster systems based on a Self-Updating Cause-and-Effect Graph. Unlike previous approaches, ours automatically mines the causality among log events from cluster systems, and builds and updates a Cause-and-Effect Graph for failure prediction throughout the systems' life cycle. We verify the effectiveness of our approach on real logs from the Blue Gene/L system and compare it with other approaches on the same logs. The results show that our approach outperforms the others, with its best precision and recall reaching 89% and 85%, respectively.
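To make the idea of a graph that is updated as log events arrive more concrete, the following is a minimal sketch and not the method of the paper. It assumes a simple stand-in for causality mining: an edge from event type A to event type B is strengthened whenever B follows A within a fixed time window. The class name `CauseEffectGraph`, the `window` and `threshold` parameters, and the event names `ECC_WARN` and `NODE_FAIL` are all hypothetical, chosen only for illustration. Temporal co-occurrence of this kind is a much weaker signal than true causality.

```python
from collections import defaultdict


class CauseEffectGraph:
    """Toy self-updating graph over log event types (illustrative assumption).

    An edge (cause, effect) counts how many occurrences of `cause` were
    followed by at least one `effect` within `window` seconds.
    """

    def __init__(self, window=60.0):
        self.window = window
        self.pair_counts = defaultdict(int)   # (cause, effect) -> count
        self.event_counts = defaultdict(int)  # event type -> occurrences
        self.recent = []  # (timestamp, event_type, effect types already seen)

    def update(self, timestamp, event_type):
        """Fold one new log event into the graph."""
        # Forget events that are older than the time window.
        self.recent = [r for r in self.recent
                       if timestamp - r[0] <= self.window]
        # Each recent occurrence is credited at most once per effect type,
        # so the confidence below can never exceed 1.
        for _, cause, seen in self.recent:
            if cause != event_type and event_type not in seen:
                seen.add(event_type)
                self.pair_counts[(cause, event_type)] += 1
        self.event_counts[event_type] += 1
        self.recent.append((timestamp, event_type, set()))

    def confidence(self, cause, effect):
        """Fraction of `cause` occurrences followed by `effect` in the window."""
        n = self.event_counts.get(cause, 0)
        return self.pair_counts.get((cause, effect), 0) / n if n else 0.0

    def predict(self, event_type, threshold=0.5):
        """Event types likely to follow `event_type`, by edge confidence."""
        return sorted(
            effect
            for (cause, effect) in list(self.pair_counts)
            if cause == event_type
            and self.confidence(cause, effect) >= threshold
        )


# Hypothetical event stream: (timestamp in seconds, event type).
graph = CauseEffectGraph(window=60.0)
for t, e in [(0, "ECC_WARN"), (10, "NODE_FAIL"), (100, "ECC_WARN"),
             (120, "NODE_FAIL"), (300, "ECC_WARN")]:
    graph.update(t, e)

print(graph.predict("ECC_WARN"))
```

Because `update` touches only the events inside the current window, the graph can keep evolving for as long as the log stream runs, which is the property the abstract emphasizes. A realistic system would also need log parsing to obtain event types and a stronger causality test than windowed co-occurrence.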



This paper is supported by Project 213.



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Shanghai Jiaotong University, Shanghai, China
