
Understanding and Improving the Trust in Results of Numerical Simulations and Scientific Data Analytics

  • Franck Cappello
  • Rinku Gupta
  • Sheng Di
  • Emil Constantinescu
  • Thomas Peterka
  • Stefan M. Wild
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10659)

Abstract

As parallel scientific simulations execute at ever-larger scale, the possibility of unnoticed corruption of scientific data during a simulation makes users more suspicious than ever about the correctness of floating-point calculations. In this paper, we analyze the issue of trust in the results of numerical simulations and scientific data analytics. We first classify corruptions into two categories, nonsystematic and systematic, and discuss their origins. We then provide a formal definition of trust in simulation and analytical results across multiple areas. We also discuss what result accuracy users would expect and how trust can be built with existing techniques. Finally, we identify the current gap and discuss two potential research directions based on existing techniques. We believe this paper will interest researchers working on the detection of unnoticed corruption in scientific simulation and data analytics: it provides a clear definition and classification of corruption and an in-depth survey of corruption sources, and it discusses potential research directions based on existing detection techniques.

Keywords

Trust, Numerical simulation, Data analytics

Notes

Acknowledgments.

This material was based upon work supported by the U.S. Department of Energy, Office of Science, Advanced Scientific Computing Research Program, under Contract DE-AC02-06CH11357.

Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  • Franck Cappello (1)
  • Rinku Gupta (1)
  • Sheng Di (1)
  • Emil Constantinescu (1)
  • Thomas Peterka (1)
  • Stefan M. Wild (1)

  1. Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, USA
