Using Simulation to Evaluate the Performance of Resilience Strategies at Scale

Levy, Scott; Topp, Bryan; Ferreira, Kurt B.; Arnold, Dorian; Hoefler, Torsten; Widener, Patrick

doi:10.1007/978-3-319-10214-6_5

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8551)

Included in the following conference series:

International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems

Abstract

Fault tolerance has been identified as a major challenge for future extreme-scale systems. Current predictions suggest that, as systems grow in size, failures will occur more frequently. Because increases in failure frequency reduce the performance and scalability of these systems, significant effort has been devoted to developing and refining resilience mechanisms to mitigate the impact of failures. However, effective evaluation of these mechanisms has been challenging. Current systems are smaller and have significantly different architectural features (e.g., interconnect, persistent storage) than we expect to see in next-generation systems. To overcome these challenges, we propose the use of simulation. Simulation has been shown to be an effective tool for investigating performance characteristics of applications on future systems. In this work, we: identify the set of system characteristics that are necessary for accurate performance prediction of resilience mechanisms for HPC systems and applications; demonstrate how these system characteristics can be incorporated into an existing large-scale simulator; and evaluate the predictive performance of our modified simulator. We also describe how we were able to optimize the simulator for large temporal and spatial scales, allowing the simulator to run 4x faster and use over 100x less memory.

Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.

Author information

Authors and Affiliations

Department of Computer Science, University of New Mexico, Albuquerque, USA
Scott Levy, Bryan Topp & Dorian Arnold
Scalable System Software, Sandia National Laboratories, Albuquerque, USA
Kurt B. Ferreira & Patrick Widener
Computer Science Department, ETH Zürich, Zürich, Switzerland
Torsten Hoefler

Corresponding author

Correspondence to Scott Levy.

Editor information

Editors and Affiliations

University of Warwick, Coventry, West Midlands, United Kingdom
Stephen A. Jarvis
University of Warwick, Coventry, West Midlands, United Kingdom
Steven A. Wright
Sandia National Laboratories CSRI, Albuquerque, New Mexico, USA
Simon D. Hammond

About this paper

Cite this paper

Levy, S., Topp, B., Ferreira, K.B., Arnold, D., Hoefler, T., Widener, P. (2014). Using Simulation to Evaluate the Performance of Resilience Strategies at Scale. In: Jarvis, S., Wright, S., Hammond, S. (eds) High Performance Computing Systems. Performance Modeling, Benchmarking and Simulation. PMBS 2013. Lecture Notes in Computer Science, vol 8551. Springer, Cham. https://doi.org/10.1007/978-3-319-10214-6_5

DOI: https://doi.org/10.1007/978-3-319-10214-6_5
Published: 01 October 2014
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-10213-9
Online ISBN: 978-3-319-10214-6
eBook Packages: Computer Science, Computer Science (R0)
