
Data Pallets: Containerizing Storage for Reproducibility and Traceability

  • Jay Lofstead
  • Joshua Baker
  • Andrew Younge
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11887)

Abstract

Trusting simulation output is crucial for Sandia’s mission objectives. We rely on these simulations to perform our high-consequence mission tasks given national treaty obligations. Other science and modeling applications, while they may not have high-consequence results, still require the strongest levels of trust to enable using the result as the foundation for both practical applications and future research. To this end, the computing community has developed workflow and provenance systems to aid both in automating simulation and modeling execution and in determining exactly how some output was created so that conclusions can be drawn from the data.

Current approaches to workflows and provenance systems all operate at the user level with little to no system-level support, making them fragile, difficult to use, and incomplete solutions. The introduction of container technology is a first step toward encapsulating and tracking the artifacts used in creating data and the resulting insights, but current implementations focus solely on making it easy to deploy an application in an isolated “sandbox” and maintain a strictly read-only mode to avoid any potential changes to the application. All storage activities still use the system-level shared storage.

This project explores extending the container concept to include storage as a new container type we call data pallets. Data pallets are potentially writeable, auto-generated by the system based on IO activities, and usable as a way to link the contained data back to the application and input deck used to create it.
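
The paper does not publish an implementation or API, so the following Python sketch is an illustrative assumption only: it imagines what an auto-generated pallet manifest might record so that output data can be traced back to the exact application container and input deck that produced it. The names (build_pallet_manifest, pallet_id) and the JSON layout are hypothetical.

    # Hypothetical sketch: imagines the metadata an auto-generated data
    # pallet might carry; none of these names come from the paper.
    import hashlib
    import json
    import pathlib
    import uuid

    def sha256(path: pathlib.Path) -> str:
        """Content hash used to identify an artifact immutably."""
        return hashlib.sha256(path.read_bytes()).hexdigest()

    def build_pallet_manifest(app_image: str, input_deck: pathlib.Path,
                              outputs: list[pathlib.Path]) -> dict:
        """Link outputs to the application image and input deck that made them."""
        return {
            "pallet_id": str(uuid.uuid4()),
            "app_image": app_image,  # e.g. a container image digest
            "input_deck": {"path": str(input_deck), "sha256": sha256(input_deck)},
            "outputs": [{"path": str(p), "sha256": sha256(p)} for p in outputs],
        }

    if __name__ == "__main__":
        deck = pathlib.Path("run.deck")
        deck.write_text("timesteps = 100\n")
        out = pathlib.Path("result.dat")
        out.write_text("42\n")
        print(json.dumps(build_pallet_manifest(
            "registry/sim@sha256:abc123", deck, [out]), indent=2))

The design point the sketch tries to capture is that outputs are bound to the producing application and input deck by content hashes and image digests rather than by naming conventions, which is what would let a system generate the linkage automatically from observed IO activity.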


Acknowledgements

Sandia National Laboratories is a multimission laboratory managed and operated by National Technology and Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-NA0003525. This work is funded through the LDRD program and ASC CSSE.


Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Sandia National Laboratories, Albuquerque, USA
  2. Georgia Institute of Technology, Atlanta, USA
