Resilient N-Body Tree Computations with Algorithm-Based Focused Recovery: Model and Performance Analysis

  • Aurélien Cavelan
  • Aiman Fang
  • Andrew A. Chien
  • Yves Robert
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10724)

Abstract

This paper presents a model and performance study for Algorithm-Based Focused Recovery (ABFR) applied to N-body computations, subject to latent errors. We make a detailed comparison with the classical Checkpoint/Restart (CR) approach. While the model applies to general frameworks, the performance study is limited to perfect binary trees, due to the inherent difficulty of the analysis. With ABFR, the crucial parameter is the detection interval, which bounds the error latency. We show that the detection interval has a dramatic impact on the overhead, and that optimally choosing its value leads to significant gains over the CR approach.

Copyright information

© Springer International Publishing AG 2018

Authors and Affiliations

  • Aurélien Cavelan (1, 2), corresponding author
  • Aiman Fang (3)
  • Andrew A. Chien (3, 4)
  • Yves Robert (2, 5)

  1. University of Basel, Basel, Switzerland
  2. Laboratoire LIP, ENS Lyon and Inria, Lyon, France
  3. University of Chicago, Chicago, USA
  4. Argonne National Laboratory, Lemont, USA
  5. University of Tennessee Knoxville, Knoxville, USA