
Exploring Application-Level Message-Logging in Scalable HPC Programs

  • Esteban Meneses
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 796)

Abstract

The next generation of supercomputers will require HPC applications to handle failures. This paper presents, through an example application, the benefits of logging messages at the application level. The proposed method both provides resilience to failures and improves performance.

Keywords

Resilience, Fault tolerance, Message logging

Notes

Acknowledgments

This work was partially supported by a machine allocation at the Argonne Leadership Computing Facility, awarded by the U.S. Department of Energy under contract DE-AC02-06CH11357.

Copyright information

© Springer International Publishing AG 2018

Authors and Affiliations

  1. National Advanced Computing Collaboratory, National High Technology Center and School of Computing, Costa Rica Institute of Technology, Cartago, Costa Rica
