System-wide trade-off modeling of performance, power, and resilience on petascale systems

The Journal of Supercomputing

Abstract

While performance remains a major objective in high-performance computing (HPC), future systems will have to deliver the desired performance under both reliability and energy constraints. Although a number of resilience methods and power management techniques have been proposed to address reliability and energy concerns, the trade-offs among performance, power, and resilience are not well understood, especially in HPC systems of unprecedented scale and complexity. In this work, we present a co-modeling mechanism named TOPPER (system-wide Trade-Off modeling for Performance, PowEr, and Resilience). TOPPER is built with colored Petri nets, which allow us to capture the dynamic, complicated interactions and dependencies among factors such as workload characteristics, hardware reliability, and runtime system operation on a petascale machine. Using system traces collected from a production supercomputer, we conducted a series of experiments to analyze various resilience methods, power capping techniques, and job characteristics in terms of system-wide performance and energy consumption. Our results provide insights into performance–power–resilience trade-offs on HPC systems.
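
To make the modeling idea concrete, the following is a minimal sketch of a colored Petri net in Python. It is not the TOPPER model, whose details are not given in this abstract. The class `ColoredPetriNet`, the place names, the job tokens, and the 100 kW cap are all hypothetical, chosen only to show how colored tokens and guarded transitions can express a power cap and a failure-triggered restart.

```python
from collections import Counter

POWER_CAP_KW = 100  # hypothetical system-wide power cap, not from the paper


class ColoredPetriNet:
    """Toy colored Petri net: each place holds a multiset of colored tokens."""

    def __init__(self, places):
        self.marking = {p: Counter() for p in places}

    def put(self, place, token):
        self.marking[place][token] += 1

    def tokens(self, place):
        # Sorted for deterministic transition firing in this sketch.
        return sorted(self.marking[place].elements())

    def fire(self, src, dst, guard=lambda t: True, action=lambda t: t):
        """Move the first token in src that satisfies guard to dst.

        Returns True if the transition fired, False if it was not enabled.
        """
        for tok in self.tokens(src):
            if guard(tok):
                self.marking[src][tok] -= 1
                if self.marking[src][tok] == 0:
                    del self.marking[src][tok]
                self.marking[dst][action(tok)] += 1
                return True
        return False


def running_power(net):
    """Total power (kW) drawn by the jobs currently in the 'running' place."""
    return sum(kw for _, kw in net.tokens("running"))


# Tokens are (job_name, power_kw) pairs; the pair is the token's "color".
net = ColoredPetriNet(["queued", "running"])
for job in [("jobA", 60), ("jobB", 50), ("jobC", 30)]:
    net.put("queued", job)


def fits_cap(tok):
    # Guard: a job may start only if the cap is still respected.
    return running_power(net) + tok[1] <= POWER_CAP_KW


# Fire the "start" transition until it is no longer enabled.
while net.fire("queued", "running", guard=fits_cap):
    pass
after_start = net.tokens("running")

# "Failure" transition: jobA's node fails, so the job returns to the queue.
net.fire("running", "queued", guard=lambda t: t[0] == "jobA")
after_failure = net.tokens("running")
```

In this sketch the guard couples the power state to scheduling decisions, and the failure transition couples reliability to the queue. This is the kind of cross-factor dependency that a colored Petri net is suited to express.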

Acknowledgements

This work is supported in part by US National Science Foundation Grants CCF-1618776 and CCF-1422009. It used data from the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357.

Author information

Corresponding author

Correspondence to Li Yu.

About this article

Cite this article

Yu, L., Zhou, Z., Fan, Y. et al. System-wide trade-off modeling of performance, power, and resilience on petascale systems. J Supercomput 74, 3168–3192 (2018). https://doi.org/10.1007/s11227-018-2368-8

  • DOI: https://doi.org/10.1007/s11227-018-2368-8
