Abstract
It has long been recognized that failure events are correlated, not independent. Previous research has shown that correlation analysis of system logs helps resource allocation, job scheduling, and proactive management. However, previous log analysis methods analyze historical logs offline and therefore fail to capture dynamic changes in system errors and failures. In this paper, we propose an online log analysis approach to mine event correlations in the system logs of large-scale cluster systems. Our contributions are threefold. First, we analyze the event correlations in the system logs of a 260-node production Hadoop cluster system, and the results show that the correlation rules of the logs change dramatically across periods. Second, we present an online log analysis algorithm, Apriori-SO. Third, based on online event correlation mining, we present an online event prediction method that can predict diverse failure events in great detail. Experimental results on a 260-node production Hadoop cluster system show that our online log analysis algorithm can analyze log streams to obtain event correlation rules in soft real time, and that our online event prediction method achieves higher precision and recall than the offline log analysis approach.
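To make the idea of mining correlation rules from a log stream concrete, the following is a minimal sketch. It is not the Apriori-SO algorithm, whose details the abstract does not give. It assumes a simplified setting in which each log entry is a (timestamp, event type) pair and a rule "a is followed by b" is scored by support and confidence over a sliding time window. The class name `SlidingWindowRuleMiner`, its parameters, and the event names are hypothetical and chosen only for illustration.

```python
from collections import Counter, deque


class SlidingWindowRuleMiner:
    """Illustrative only: scores 'a is followed by b' rules over recent events.

    This is a simplified stand-in, not the paper's Apriori-SO algorithm.
    """

    def __init__(self, window_seconds, min_support, min_confidence):
        self.window_seconds = window_seconds
        self.min_support = min_support
        self.min_confidence = min_confidence
        self.events = deque()  # (timestamp, event_type), oldest first

    def add(self, timestamp, event_type):
        """Append a new event and drop events that fell out of the window."""
        self.events.append((timestamp, event_type))
        while timestamp - self.events[0][0] > self.window_seconds:
            self.events.popleft()

    def rules(self):
        """Return {(a, b): (support, confidence)} for the current window."""
        occurrences = Counter()  # how often each event type appears
        followed = Counter()     # (a, b): occurrences of a with a later b
        later_types = set()      # event types seen after the current position
        # Walk from newest to oldest so later_types holds what follows each event.
        for _, a in reversed(self.events):
            occurrences[a] += 1
            for b in later_types:
                if b != a:
                    followed[(a, b)] += 1
            later_types.add(a)
        result = {}
        for (a, b), support in followed.items():
            confidence = support / occurrences[a]
            if support >= self.min_support and confidence >= self.min_confidence:
                result[(a, b)] = (support, confidence)
        return result


# Hypothetical event stream: (seconds since start, event type).
stream = [
    (0, "disk_warn"), (5, "io_error"),
    (20, "disk_warn"), (24, "io_error"),
    (100, "net_timeout"), (130, "disk_warn"),
]

miner = SlidingWindowRuleMiner(window_seconds=60, min_support=2, min_confidence=0.6)
early_rules = None
for ts, event_type in stream:
    miner.add(ts, event_type)
    if ts == 24:
        early_rules = miner.rules()  # rules visible in the first period
late_rules = miner.rules()           # rules visible after the window has moved on
```

In this toy stream, the rule "disk_warn is followed by io_error" holds in the early window but is no longer present once older events expire. That is the kind of period-to-period change in rules that motivates analyzing logs online rather than once offline.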
Copyright information
© 2010 IFIP International Federation for Information Processing
About this paper
Cite this paper
Zhou, W., Zhan, J., Meng, D., Zhang, Z. (2010). Online Event Correlations Analysis in System Logs of Large-Scale Cluster Systems. In: Ding, C., Shao, Z., Zheng, R. (eds) Network and Parallel Computing. NPC 2010. Lecture Notes in Computer Science, vol 6289. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15672-4_23
DOI: https://doi.org/10.1007/978-3-642-15672-4_23
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-15671-7
Online ISBN: 978-3-642-15672-4
eBook Packages: Computer Science, Computer Science (R0)