Abstract
Fault tolerance has been identified as a major challenge for future extreme-scale systems. Current predictions suggest that, as systems grow in size, failures will occur more frequently. Because increases in failure frequency reduce the performance and scalability of these systems, significant effort has been devoted to developing and refining resilience mechanisms to mitigate the impact of failures. However, effective evaluation of these mechanisms has been challenging. Current systems are smaller and have significantly different architectural features (e.g., interconnect, persistent storage) than we expect to see in next-generation systems. To overcome these challenges, we propose the use of simulation. Simulation has been shown to be an effective tool for investigating performance characteristics of applications on future systems. In this work, we: identify the set of system characteristics that are necessary for accurate performance prediction of resilience mechanisms for HPC systems and applications; demonstrate how these system characteristics can be incorporated into an existing large-scale simulator; and evaluate the predictive performance of our modified simulator. We also describe how we were able to optimize the simulator for large temporal and spatial scales, allowing the simulator to run 4x faster and use over 100x less memory.
Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Levy, S., Topp, B., Ferreira, K.B., Arnold, D., Hoefler, T., Widener, P. (2014). Using Simulation to Evaluate the Performance of Resilience Strategies at Scale. In: Jarvis, S., Wright, S., Hammond, S. (eds) High Performance Computing Systems. Performance Modeling, Benchmarking and Simulation. PMBS 2013. Lecture Notes in Computer Science, vol 8551. Springer, Cham. https://doi.org/10.1007/978-3-319-10214-6_5
DOI: https://doi.org/10.1007/978-3-319-10214-6_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-10213-9
Online ISBN: 978-3-319-10214-6
eBook Packages: Computer Science, Computer Science (R0)