Like a rainbow in the dark: metadata annotation for HPC applications in the age of dark data


The deluge of dark data is imminent. A lack of data management capabilities, especially in supercomputing, and missing data documentation (i.e., missing metadata annotation) are major sources of dark data. The present work addresses this challenge by presenting ExtractIng, a generic automated metadata extraction toolkit. With it, existing metadata in simulation output files scattered across the file system can be aggregated, parsed, and converted to the EngMeta metadata model. Use cases from computational engineering demonstrate the viability of ExtractIng. The evaluation results show that the metadata extraction is independent of the simulation code, in the sense that it can handle data outputs from various fields of science, is easy to integrate into simulation workflows, and is compatible with a multitude of computational environments.
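The aggregate, parse, and convert steps named in the abstract can be illustrated with a minimal sketch. It is not ExtractIng itself and does not use the EngMeta schema: the key names in `KEY_MAP`, the target field names, the file suffixes, and the flat `<metadata>` XML layout are all hypothetical placeholders chosen for illustration.

```python
import re
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

# Hypothetical mapping from keys found in simulation logs to target
# metadata fields. Neither the keys nor the field names are taken from
# ExtractIng or the EngMeta schema; they are placeholders.
KEY_MAP = {
    "code": "software_name",
    "version": "software_version",
    "nodes": "compute_nodes",
    "timestep": "time_step",
}

# Matches lines of the form "key = value" or "key: value".
LINE_PATTERN = re.compile(r"^\s*(\w+)\s*[=:]\s*(.+?)\s*$")


def harvest(root_dir, suffixes=(".log", ".out")):
    """Aggregate known key/value pairs from all matching files below root_dir."""
    found = {}
    for path in sorted(Path(root_dir).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        for line in path.read_text(errors="replace").splitlines():
            match = LINE_PATTERN.match(line)
            if match and match.group(1).lower() in KEY_MAP:
                field = KEY_MAP[match.group(1).lower()]
                # Keep the first value seen for each field.
                found.setdefault(field, match.group(2))
    return found


def to_xml(record):
    """Convert the harvested record to a flat XML document (assumed layout)."""
    root = ET.Element("metadata")
    for field, value in sorted(record.items()):
        ET.SubElement(root, field).text = value
    return ET.tostring(root, encoding="unicode")


def demo():
    """Build a tiny scattered output tree and harvest it."""
    with tempfile.TemporaryDirectory() as tmp:
        run = Path(tmp) / "run01"
        run.mkdir()
        (run / "job.log").write_text("code = demo-md\nversion: 1.2\n")
        (run / "sim.out").write_text("nodes = 4\nignored line\ntimestep = 0.002\n")
        return harvest(tmp)


if __name__ == "__main__":
    print(to_xml(demo()))
```

Keeping the key mapping in a plain dictionary, separate from the traversal and parsing logic, is one way to keep such a harvester independent of any particular simulation code: supporting another code would only require a different mapping.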




  11. Interestingly, the authors speak of a “data swamp” in terms of dark data, contrasting this with a “data lake” of well-annotated data.




The author would like to thank the Federal Ministry of Education and Research for funding the Dipling project under Grant No. FDM-008. The author also thanks Dr. Martin Thomas Horsch for comments on the manuscript and for proofreading.

Author information



Corresponding author

Correspondence to Björn Schembera.


Cite this article

Schembera, B. Like a rainbow in the dark: metadata annotation for HPC applications in the age of dark data. J Supercomput (2021).



  • Dark data
  • Metadata
  • Research data management
  • Metadata extraction
  • High performance computing
  • Computational engineering