On the Processing of Extreme Scale Datasets in the Geosciences

Abstract

Observational measurements and model output data acquired or generated by the various research areas within the geosciences (also known as Earth science) encompass spatial scales of tens of thousands of kilometers and temporal scales of seconds to millions of years. Here, geosciences refers to the study of the atmosphere, hydrosphere, oceans, and biosphere, as well as the earth’s core. Rapid advances in sensor deployments, computational capacity, and data storage density have resulted in dramatic increases in the volume and complexity of data in the geosciences. Geoscientists now see the data-intensive computing approach as part of their knowledge discovery process, alongside the traditional theoretical, experimental, and computational archetypes [1]. Data-intensive computing poses unique challenges to the geoscience community, and these challenges are exacerbated by the sheer size of the datasets involved.

References

  1. T. Hey, et al., The Fourth Paradigm: Data-Intensive Scientific Discovery. Redmond, Washington: Microsoft Corporation, 2009.

  2. F. M. Hoffman, et al., “Multivariate Spatio-Temporal Clustering (MSTC) as a data mining tool for environmental applications,” in the iEMSs Fourth Biennial Meeting: International Congress on Environmental Modelling and Software Society (iEMSs 2008), 2008, pp. 1774–1781.

  3. F. M. Hoffman, et al., “Data Mining in Earth System,” in the International Conference on Computational Science (ICCS), 2011, pp. 1450–1455.

  4. O. J. Reichman, et al., “Challenges and opportunities of open data in ecology,” Science, pp. 703–705, 2011.

  5. M. Keller, et al., “A continental strategy for the National Ecological Observatory Network,” Front. Ecol. Environ Special Issue on Continental-Scale Ecology, vol. 5, pp. 282–284, 2008.

  6. D. Schimel, et al., “NEON: A hierarchically designed national ecological network,” Front. Ecol. Environ, vol. 2, 2007.

  7. The Open Geospatial Consortium (OGC), June 17, 2011. Available: http://www.opengeospatial.org

  8. G. Percivall and C. Reed, “OGC Sensor Web Enablement Standards,” Sensors and Transducers Journal, vol. 71, pp. 698–706, 2006.

  9. MTPE EOS Reference Handbook. EOS Project Science Office, Code 900, NASA Goddard Space Flight Center, 1995.

  10. The Global Telecommunication System. Available: http://www.wmo.int/pages/prog/www/TEM/GTS/index_en.html

  11. National Center for Environmental Prediction (NCEP). Available: http://www.ncep.noaa.gov/

  12. Panasas: Parallel File System for HPC Storage. Available: http://www.panasas.com/

  13. M. M. Kuhn, et al., “Dynamic file system semantics to enable metadata optimizations in PVFS,” Concurrency and Computation: Practice and Experience, vol. 21, 2009.

  14. P. J. Braam, “Lustre: a scalable high-performance file system,” 2002.

  15. F. B. Schmuck and R. L. Haskin, “GPFS: A Shared-Disk File System for Large Computing Clusters,” in the Conference on File and Storage Technologies, 2002, pp. 231–244.

  16. J. Lofstead, et al., “Managing Variability in the IO Performance of Petascale Storage Systems,” presented at the ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, 2010.

  17. MPI Forum, “MPI-2: Extensions to the Message-Passing Interface,” 1997.

  18. S. Ghemawat, et al., “The Google File System,” ACM SIGOPS Operating Systems Review, vol. 37, 2003.

  19. HDF-5. Available: http://hdf.ncsa.uiuc.edu/products/hdf5/

  20. NetCDF4. Available: http://www.hdfgroup.org/projects/netcdf-4/

  21. J. Li, et al., “Parallel netCDF: A high-performance scientific I/O interface,” in ACM Supercomputing (SC03), 2003.

  22. H. Abbasi, et al., “DataStager: scalable data staging services for petascale applications,” in ACM international Symposium on High Performance Distributed Computing, 2009.

  23. C. Upson, et al., “The Application Visualization System: A computational environment for scientific visualization,” IEEE Computer Graphics and Applications, pp. 30–42, 1989.

  24. VisIt Visualization Tool. Available: https://wci.llnl.gov/codes/visit/home.html

  25. R. Daley, Atmospheric Data Analysis: Cambridge Atmospheric and Space Science Series, 1993.

  26. O. Wildi, Data Analysis in Vegetation Ecology: Wiley, 2010.

  27. P. Rigaux, et al., Spatial Databases with Application to GIS: Morgan Kaufmann, 2002.

  28. S. Shekhar and S. Chawla, Spatial Databases: A Tour: Prentice Hall, 2002.

  29. P. Longley, et al., Geographic Information Systems and Science, 3 ed.: John Wiley & Sons, 2011.

  30. R. Rew and G. Davis, “NetCDF: an interface for scientific data access,” IEEE Computer Graphics and Applications, vol. 10, pp. 76–82, 1990.

  31. Common Data Model. Available: http://www.unidata.ucar.edu/software/netcdf-java/CDM/

  32. P. Cudre-Mauroux, et al., “A Demonstration of SciDB: A Science-Oriented DBMS,” in Proceedings of the VLDB Endowment, 2009.

  33. J. Buck, et al., “SciHadoop: Array-based Query Processing in Hadoop,” UCSC, 2011.

  34. The HDF Group, Hierarchical Data Format Version 5, 2010. Available: http://www.hdfgroup.org/HDF5

  35. FITS Support Office, 2011. Available: http://fits.gsfc.nasa.gov/

  36. D. C. Wells, et al., “FITS: A Flexible Image Transport System,” Astronomy & Astrophysics, vol. 44, pp. 363–370, 1981.

  37. P. Cornillon, et al., “OPeNDAP: Accessing data in a distributed, heterogeneous environment,” Data Science Journal, vol. 2, pp. 164–174, 2003.

  38. D. M. Karl, et al., “Building the long-term picture: U.S. JGOFS Time-series Programs,” Oceanography, pp. 6–17, 2001.

  39. P. Ramsey, “PostGIS Manual,” Refractions Research.

  40. A. Guttman, “R-trees: a dynamic index structure for spatial searching,” in Proceedings of the 1984 ACM SIGMOD International Conference on Management of Data, Boston, Massachusetts: ACM, 1984, pp. 47–57.

  41. S. Tilak, et al., “The Ring Buffer Network Bus (RBNB) DataTurbine Streaming Data Middleware for Environmental Observing Systems,” in IEEE e-Science, 2007, pp. 125–133.

  42. D. N. Williams, et al., “The Earth System Grid: Enabling Access to Multi-Model Climate Simulation Data,” Bulletin of the American Meteorological Society, vol. 90, pp. 195–205, 2009.

  43. B. Domenico, et al., “Thematic Real-time Environmental Distributed Data Services (THREDDS): Incorporating Interactive Analysis Tools into NSDL,” Journal of Interactivity in Digital Libraries, vol. 2, 2002.

  44. A. Shoshani, et al., “Storage Resource Managers (SRM) in the Earth System Grid,” Earth System Grid, 2009.

  45. G. Khanna, et al., “A Dynamic Scheduling Approach for Coordinated Wide-Area Data Transfers using GridFTP,” in the 22nd IEEE International Parallel and Distributed Processing Symposium (IPDPS 2008), 2008.

  46. Globus Online | Reliable File Transfer. No IT Required. Available: https://www.globusonline.org/

  47. P. G. Brown, “Overview of SciDB: large scale array storage, processing and analysis,” in Proceedings of the 2010 International Conference on Management of Data, Indianapolis, Indiana, USA: ACM, 2010, pp. 963–968.

  48. M. Stonebraker, et al., “Requirements for Science Data Bases and SciDB,” 2009.

  49. J. Dean and S. Ghemawat, “MapReduce: Simplified data processing on large clusters,” Communications of the ACM, vol. 51, pp. 107–113, 2008.

  50. A. Akdogan, et al., “Voronoi-Based Geospatial Query Processing with MapReduce,” in 2010 IEEE Second International Conference on Cloud Computing Technology and Science (CloudCom), 2010, pp. 9–16.

  51. Y. Wang and S. Wang, “Research and implementation on spatial data storage and operation based on Hadoop platform,” in 2010 Second IITA International Conference on Geoscience and Remote Sensing (IITA-GRS), vol. 2, 2010, pp. 275–278.

  52. Apache Hadoop. Available: http://hadoop.apache.org/

  53. Hadoop Distributed File System. Available: http://hadoop.apache.org/hdfs/

  54. J. Wang, et al., “Kepler + Hadoop: a general architecture facilitating data-intensive applications in scientific workflow systems,” in Proceedings of the 4th Workshop on Workflows in Support of Large-Scale Science, Portland, Oregon: ACM, 2009, pp. 12:1–12:8.

Author information

Corresponding author

Correspondence to Sangmi Lee Pallickara.

Copyright information

© 2011 Springer Science+Business Media, LLC

About this chapter

Cite this chapter

Lee Pallickara, S., Malensek, M., Pallickara, S. (2011). On the Processing of Extreme Scale Datasets in the Geosciences. In: Furht, B., Escalante, A. (eds) Handbook of Data Intensive Computing. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-1415-5_20

  • DOI: https://doi.org/10.1007/978-1-4614-1415-5_20

  • Publisher Name: Springer, New York, NY

  • Print ISBN: 978-1-4614-1414-8

  • Online ISBN: 978-1-4614-1415-5

  • eBook Packages: Computer Science, Computer Science (R0)
