Scalable Algorithmic Detection of Silent Data Corruption for High-Dimensional PDEs

  • Alfredo Parra Hinojosa
  • Hans-Joachim Bungartz
  • Dirk Pflüger
Conference paper
Part of the Lecture Notes in Computational Science and Engineering book series (LNCSE, volume 123)

Abstract

In this paper we show how to exploit the numerical properties of a well-established extrapolation method, the combination technique, to make it tolerant to silent data corruption (SDC), that is, errors in data that the system does not detect. We use the hierarchical structure of the combination technique to detect whether parts of the floating-point data are corrupted. The method is based on robust regression and other well-known outlier detection techniques. It is a lossy approach: we sacrifice some accuracy in exchange for a small computational overhead. We test our algorithms on a d-dimensional advection-diffusion equation and inject SDC of different orders of magnitude. The method achieves a very good detection rate: large errors are always detected, and the small errors that go undetected do not noticeably damage the solution. We also carry out scalability tests for a 5D scenario. Finally, we discuss how to deal with false positives and how to extend these ideas to more general quantities of interest.
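To illustrate the general idea of outlier-based detection, the following is a minimal sketch of a generic robust outlier test (modified z-score based on the median absolute deviation). It is not the algorithm of the paper, which relies on robust regression over the combination technique's hierarchy. The function name `flag_outliers`, the threshold of 3.5, and the sample values are assumptions chosen for illustration, and the sketch assumes that the same scalar quantity has been evaluated on several component grids and that discretization differences between grids are small compared to the corruption.

```python
import statistics


def flag_outliers(values, threshold=3.5):
    """Return the indices of values whose modified z-score exceeds `threshold`.

    Generic MAD-based test (illustrative only, not the paper's method).
    """
    med = statistics.median(values)
    abs_dev = [abs(v - med) for v in values]
    mad = statistics.median(abs_dev)  # median absolute deviation
    if mad == 0.0:
        # At least half of the values equal the median:
        # flag every value that differs from it.
        return [i for i, dev in enumerate(abs_dev) if dev > 0.0]
    # 0.6745 rescales the MAD so the score is comparable to a standard z-score.
    return [i for i, dev in enumerate(abs_dev) if 0.6745 * dev / mad > threshold]


# Hypothetical values of one quantity, one per component grid;
# the last entry mimics an injected corruption.
samples = [1.002, 0.998, 1.001, 0.999, 1.000, 7.5]
print(flag_outliers(samples))
```

Because the median and the MAD are barely affected by a single corrupted value, the test stays reliable even when the corruption is large, which is the property that motivates robust estimators for this kind of detection.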

Keywords

Silent Data Corruption (SDC), High-Dimensional PDEs, Robust Regression, Full Grid Solution, Sparse Grid

Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  • Alfredo Parra Hinojosa (1)
  • Hans-Joachim Bungartz (1)
  • Dirk Pflüger (2, corresponding author)
  1. Chair of Scientific Computing, Technische Universität München, München, Germany
  2. Institute for Parallel and Distributed Systems, University of Stuttgart, Stuttgart, Germany
