Abstract
Large multi-site infrastructures such as Grids and Clouds must implement fault-tolerance mechanisms and smart schedulers to enable continuous operation even when resources fail. Evaluating the efficiency of such mechanisms and schedulers requires representative failure models that capture realistic properties of real-world failure data. This paper shows that failures in multi-site infrastructures are far from being randomly distributed. We propose a failure model that captures features observed in real failure traces.
Copyright information
© 2013 IFIP International Federation for Information Processing
About this paper
Cite this paper
Minh, T.N., Pierre, G. (2013). Failure Analysis and Modeling in Large Multi-site Infrastructures. In: Dowling, J., Taïani, F. (eds) Distributed Applications and Interoperable Systems. DAIS 2013. Lecture Notes in Computer Science, vol 7891. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-38541-4_10
DOI: https://doi.org/10.1007/978-3-642-38541-4_10
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-38540-7
Online ISBN: 978-3-642-38541-4
eBook Packages: Computer Science, Computer Science (R0)