Abstract
The continuous increase in the data produced by simulations, experiments and edge components in recent years has forced a shift in the scientific research process, leading to the definition of a fourth paradigm in science: data-intensive computing. This data deluge introduces challenges related to data volume, format heterogeneity and the speed of data production and gathering, all of which must be handled to effectively support scientific discovery. To this end, High Performance Computing (HPC) and data analytics are both considered fundamental and complementary aspects of the scientific process, and together they contribute to a new paradigm that combines efforts from the two fields, called High Performance Data Analytics (HPDA). In this context, the Ophidia project provides an HPDA framework that joins the HPC paradigm with scientific data analytics. This contribution presents some aspects of the Ophidia HPDA framework, namely its multidimensional storage model and its distributed and hierarchical implementation, along with a benchmark of a parallel in-memory time series reduction operator.
Notes
1. OPH_REDUCE2 documentation: http://ophidia.cmcc.it/documentation/users/operators/OPH_REDUCE2.html
2. GlusterFS documentation: https://docs.gluster.org/en/latest/
3. ICCLIM (Indice Calculation CLIMate): https://icclim.readthedocs.io/en/latest/intro.html
4. NCAR Command Language: https://www.ncl.ucar.edu/
5. PyOphidia on Conda Forge: https://anaconda.org/conda-forge/pyophidia
6. Dask, a library for dynamic task scheduling: https://dask.org
7. Pangeo, a community platform for big data geoscience: https://pangeo.io/
8. The ESiWACE Center of Excellence on Weather and Climate Simulations in Europe project: https://www.esiwace.eu/
9. ESiWACE Earth System Data Middleware: https://github.com/ESiWACE/esdm
Acknowledgments
This work was supported in part by the EU H2020 Excellence in SImulation of Weather and Climate in Europe (ESiWACE) project (Grant Agreement 675191). Moreover, the authors would like to acknowledge Antonio Aloisio for his editing and proofreading work on this paper.
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Fiore, S. et al. (2019). Towards High Performance Data Analytics for Climate Change. In: Weiland, M., Juckeland, G., Alam, S., Jagode, H. (eds) High Performance Computing. ISC High Performance 2019. Lecture Notes in Computer Science, vol 11887. Springer, Cham. https://doi.org/10.1007/978-3-030-34356-9_20
DOI: https://doi.org/10.1007/978-3-030-34356-9_20
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-34355-2
Online ISBN: 978-3-030-34356-9
eBook Packages: Computer Science, Computer Science (R0)