A data ecosystem to support machine learning in materials science


Facilitating the application of machine learning (ML) to materials science problems requires enhancing the data ecosystem to enable discovery and collection of data from many sources, automated dissemination of new data across the ecosystem, and connection of data to materials-specific ML models. Here, we present two projects, the Materials Data Facility (MDF) and the Data and Learning Hub for Science (DLHub), that address these needs. We use examples to show how MDF and DLHub capabilities can be leveraged to link data with ML models and how users can access those capabilities through web and programmatic interfaces.




  1. A. White: The materials genome initiative: one year on. MRS Bull. 37, 715–716 (2012).
  2. B. Blaiszik, K. Chard, J. Pruyne, R. Ananthakrishnan, S. Tuecke, and I. Foster: The materials data facility: data services to advance materials science research. JOM 68, 2045–2052 (2016).
  3. R. Chard, Z. Li, K. Chard, L. Ward, Y. Babuji, A. Woodard, S. Tuecke, B. Blaiszik, M.J. Franklin, and I. Foster: DLHub: Model and Data Serving for Science, 2018. http://arxiv.org/abs/1811.11213 (accessed March 8, 2019).
  4. P. Nguyen, S. Konstanty, T. Nicholson, T. OBrien, A. Schwartz-Duval, T. Spila, K. Nahrstedt, R.H. Campbell, I. Gupta, M. Chan, K. Mchenry, and N. Paquin: 4CeeD: real-time data acquisition and analysis framework for material-related cyber-physical environments. In 2017 17th IEEE/ACM Int. Symp. Clust. Cloud Grid Comput., IEEE, 2017; pp. 11–20. doi:10.1109/CCGRID.2017.51.
  5. J. O’Mara, B. Meredig, and K. Michel: Materials data infrastructure: a case study of the citrination platform to examine data import, storage, and access. JOM 68, 2031–2034 (2016).
  6. A. Dima, S. Bhaskarla, C. Becker, M. Brady, C. Campbell, P. Dessauw, R. Hanisch, U. Kattner, K. Kroenlein, M. Newrock, A. Peskin, R. Plante, S.-Y. Li, P.-F. Rigodiat, G.S. Amaral, Z. Trautt, X. Schmitt, J. Warren, and S. Youssef: Informatics infrastructure for the materials genome initiative. JOM 68, 2053–2064 (2016).
  7. S. Kirklin, J.E. Saal, B. Meredig, A. Thompson, J.W. Doak, M. Aykol, S. Rühl, and C. Wolverton: The open quantum materials database (OQMD): assessing the accuracy of DFT formation energies. npj Comput. Mater. 1, 15010 (2015).
  8. A. Jain, S.P. Ong, G. Hautier, W. Chen, W.D. Richards, S. Dacek, S. Cholia, D. Gunter, D. Skinner, G. Ceder, and K.A. Persson: Commentary: the materials project: a materials genome approach to accelerating materials innovation. APL Mater. 1, 011002 (2013).
  9. C. Draxl and M. Scheffler: NOMAD: the FAIR concept for big data-driven materials science. MRS Bull. 43, 676–682 (2018).
  10. J. Carrete, W. Li, N. Mingo, S. Wang, and S. Curtarolo: Finding unprecedentedly low-thermal-conductivity half-Heusler semiconductors via high-throughput materials modeling. Phys. Rev. X 4, 011019 (2014).
  11. S. Curtarolo, W. Setyawan, S. Wang, J. Xue, K. Yang, R.H. Taylor, L.J. Nelson, G.L.W. Hart, S. Sanvito, M. Buongiorno-Nardelli, N. Mingo, and O. Levy: AFLOWLIB.ORG: a distributed materials properties repository from high-throughput ab initio calculations. Comput. Mater. Sci. 58, 227–235 (2012).
  12. A. Mannodi-Kanakkithodi, A. Chandrasekaran, C. Kim, T.D. Huan, G. Pilania, V. Botu, and R. Ramprasad: Scoping the polymer genome: a roadmap for rational polymer dielectrics design and beyond. Mater. Today (2017). doi:10.1016/j.mattod.2017.11.021.
  13. R.B. Tchoua, K. Chard, D.J. Audus, L.T. Ward, J. Lequieu, J.J. De Pablo, and I.T. Foster: Towards a hybrid human-computer scientific information extraction pipeline. In 2017 IEEE 13th Int. Conf. e-Science, IEEE, 2017; pp. 109–118. doi:10.1109/eScience.2017.23.
  14. B. Puchala, G. Tarcea, E.A. Marquis, M. Hedstrom, H.V. Jagadish, and J.E. Allison: The materials commons: a collaboration platform and information repository for the global materials community. JOM 68, 2035–2044 (2016).
  15. Materials Simulation Toolkit for Machine Learning (MAST-ML), (n.d.): https://github.com/uw-cmg/MAST-ML (accessed June 27, 2019).
  16. D. Wheeler, D. Brough, T. Fast, S. Kalidindi, and A. Reid: PyMKS: materials knowledge system in python (2014).
  17. L. Ward, A. Dunn, A. Faghaninia, N.E.R. Zimmermann, S. Bajaj, Q. Wang, J. Montoya, J. Chen, K. Bystrom, M. Dylla, K. Chard, M. Asta, K.A. Persson, G.J. Snyder, I. Foster, and A. Jain: Matminer: an open source toolkit for materials data mining. Comput. Mater. Sci. 152, 60–69 (2018).
  18. S.P. Ong, W.D. Richards, A. Jain, G. Hautier, M. Kocher, S. Cholia, D. Gunter, V.L. Chevrier, K.A. Persson, and G. Ceder: Python materials genomics (pymatgen): a robust, open-source python library for materials analysis. Comput. Mater. Sci. 68, 314–319 (2013).
  19. J. Schneider and J. Hamaekers: The atomic simulation environment - a Python library for working with atoms: related content ATK-forceField: a new generation molecular dynamics software package. J. Phys. Condens. Matter Top. Rev (2017). doi:10.1088/1361-648X/aa680e.
  20. Materials Data Facility Schema Repository, (n.d.): https://github.com/materials-data-facility/data-schemas (accessed June 27, 2019).
  21. I. Foster, K. Chard, and S. Tuecke: The discovery cloud: accelerating and democratizing research on a global scale. In 2016 IEEE Int. Conf. Cloud Eng., IEEE, 2016; pp. 68–77. doi:10.1109/IC2E.2016.46.
  22. R. Ananthakrishnan, B. Blaiszik, K. Chard, R. Chard, B. McCollam, J. Pruyne, S. Rosen, S. Tuecke, and I. Foster: Globus platform services for data publication. In Proc. Pract. Exp. Adv. Res. Comput. - PEARC ’18; ACM Press, New York, NY, USA, 2018; pp. 1–7. doi:10.1145/3219104.3219127.
  23. Z. Avsec, R. Kreuzhuber, J. Israeli, N. Xu, J. Cheng, A. Shrikumar, A. Banerjee, D.S. Kim, L. Urban, A. Kundaje, O. Stegle, and J. Gagneur: Kipoi: accelerating the community exchange and reuse of predictive models for genomics. BioRxiv, 375345 (2018). doi:10.1101/375345.
  24. DataCite Schema, (n.d.): https://schema.datacite.org/ (accessed March 8, 2019).
  25. Y. Babuji, A. Brizius, K. Chard, I. Foster, D.S. Katz, M. Wilde, and J. Wozniak: Introducing parsl: a python parallel scripting library (2017). doi:10.5281/ZENODO.891533.
  26. H.S. Stein, D. Guevarra, P.F. Newhouse, E. Soedarmadji, and J.M. Gregoire: Machine learning of optical properties of materials–predicting spectra from images and images from spectra. Chem. Sci. 10, 47–55 (2019).
  27. S. Mitrovic, E. Soedarmadji, P.F. Newhouse, S.K. Suram, J.A. Haber, J. Jin, and J.M. Gregoire: Colorimetric screening for high-throughput discovery of light absorbers. ACS Comb. Sci. 17, 176–181 (2015).
  28. M. Schwarting, S. Siol, K. Talley, A. Zakutayev, and C. Phillips: Automated algorithms for band gap analysis from optical absorption spectra. Mater. Discov. 10, 43–52 (2017).
  29. L. van der Maaten and G. Hinton: Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008).
  30. M.J. Cherukara, Y.S.G. Nashed, and R.J. Harder: Real-time coherent diffraction inversion using deep generative networks. Sci. Rep. 8, 16520 (2018).
  31. L.A. Curtiss, P.C. Redfern, and K. Raghavachari: Gaussian-4 theory using reduced order perturbation theory. J. Chem. Phys. 127, 124105 (2007).
  32. L. Ward, B. Blaiszik, I. Foster, R.S. Assary, B. Narayanan, and L. Curtiss: Machine learning prediction of accurate atomization energies of organic molecules from low-fidelity quantum chemical calculations. MRS Commun. 9(3), 891–899 (2019). doi:10.1557/mrc.2019.107.
  33. K.T. Schütt, H.E. Sauceda, P.-J. Kindermans, A. Tkatchenko, and K.-R. Müller: SchNet–a deep learning architecture for molecules and materials. J. Chem. Phys. 148, 241722 (2018).
  34. R. Ramakrishnan, P.O. Dral, M. Rupp, and O.A. von Lilienfeld: Big data meets quantum chemistry approximations: the Δ-machine learning approach. J. Chem. Theory Comput. 11, 2087–2096 (2015).


MDF: This work was performed under financial assistance awards 70NANB14H012 and 70NANB19H005 from the U.S. Department of Commerce, National Institute of Standards and Technology, as part of the Center for Hierarchical Materials Design (CHiMaD). This work was also supported by the National Science Foundation as part of the Midwest Big Data Hub under NSF Award Number 1636950, “BD Spokes: SPOKE: MIDWEST: Collaborative: Integrative Materials Design (IMaD): Leverage, Innovate, and Disseminate.”

DLHub: This work was supported in part by Laboratory Directed Research and Development funding from Argonne National Laboratory under U.S. Department of Energy Contract DE-AC02-06CH11357. We also thank the Argonne Leadership Computing Facility for access to the PetrelKube Kubernetes cluster and Amazon Web Services for providing research credits to enable rapid service prototyping. This research used resources of the Argonne Leadership Computing Facility, a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357.

The authors would also like to acknowledge and thank the researchers who made their datasets and/or models and codes openly available.[26,30,32]

Author information



Corresponding author

Correspondence to Ben Blaiszik.

Cite this article

Blaiszik, B., Ward, L., Schwarting, M. et al. A data ecosystem to support machine learning in materials science. MRS Communications 9, 1125–1133 (2019). https://doi.org/10.1557/mrc.2019.118
