Failure Analysis and Modeling in Large Multi-site Infrastructures

Minh, Tran Ngoc; Pierre, Guillaume

doi:10.1007/978-3-642-38541-4_10

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCCN, volume 7891)

Abstract

Large multi-site infrastructures such as Grids and Clouds must implement fault-tolerance mechanisms and smart schedulers to keep operating when resource failures occur. Evaluating the efficiency of such mechanisms and schedulers requires representative failure models that capture the properties of real-world failure data. This paper shows that failures in multi-site infrastructures are far from being randomly distributed. We propose a failure model that captures features observed in real failure traces.

Author information

Authors and Affiliations

IRISA / University of Rennes 1, France
Tran Ngoc Minh & Guillaume Pierre

Editor information

Editors and Affiliations

Royal Institute of Technology (KTH), Isafjordsgatan 39, 16440, Kista, Sweden
Jim Dowling
IRISA, Université de Rennes 1, 263 Avenue du Général Leclerc, Bât. 12, 35042, Rennes, France
François Taïani

Cite this paper

Minh, T.N., Pierre, G. (2013). Failure Analysis and Modeling in Large Multi-site Infrastructures. In: Dowling, J., Taïani, F. (eds) Distributed Applications and Interoperable Systems. DAIS 2013. Lecture Notes in Computer Science, vol 7891. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-38541-4_10

DOI: https://doi.org/10.1007/978-3-642-38541-4_10
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-38540-7
Online ISBN: 978-3-642-38541-4
eBook Packages: Computer Science, Computer Science (R0)
