1 Introduction

The Information System and Data Center (ISDC) portal of the Helmholtz Centre Potsdam GFZ German Research Centre for Geosciences (http://isdc.gfz-potsdam.de) is the online access point for geoscientific data, the corresponding metadata, scientific documentation and software tools. The majority of the data and information the portal currently offers to the public are global geomonitoring products, such as satellite orbits and Earth gravity field data as well as geomagnetic and atmospheric data. These products for the exploration of the Earth's changing system are provided via state-of-the-art retrieval techniques. The portal's design and operation are a project of the ISDC team within the GFZ's Data Center.

Before the launch of the first release of the portal in March 2006, there were several project-driven, independently operating ISDCs, such as the GGP ISDC for the handling of local gravity and associated meteorological data of the international Global Geodynamics Project (GGP), or the CHAMP ISDC, the GRACE ISDC and the GNSS ISDC for the management of geodetic, geophysical, atmospheric and ionospheric geomonitoring data and information derived from the CHAMP, GRACE and GPS satellites and from GPS ground stations. Because of these separate ISDCs, users interested in, e.g., orbit products from different satellite missions had to enter each appropriate ISDC, such as the CHAMP, GRACE or GNSS ISDC, in order to find the required orbit data and information. To overcome this fragmented situation, to improve the Graphical User Interface (GUI) of the ISDC, and to reduce the duplicated work and costs related to the operation and maintenance of different ISDCs, the idea was born to integrate the ISDC systems under one portal roof. In summary, the requirements and constraints for the development of an ISDC portal were:

  • the integration of new product types related to new collaboration projects, such as the GNSS monitoring project, which deals with Global Navigation Satellite System data, and the Galileo Geodetic Service Provider (GGSP) project,

  • the management of a constantly increasing number of users and user groups,

  • the improvement of system usability and the request for single sign-on,

  • the realization of multi-domain geoscience information and data retrieval,

  • the optimization of system and service operation and maintenance.

Figure 1 illustrates that after the launch of the first release of the GFZ ISDC portal in March 2006, the number of users increased from around 800 to almost 2,000 by February 2009. Especially within the first year after the start there was a nearly exponential increase in users, which demonstrates the great user acceptance and the successful development of the new portal system.

Fig. 1

User development graph (2009-02-11)

The growing international importance of the geoscience data and information provided by the portal (Klump et al., 2008) is shown in Fig. 2. By now, more than four fifths of the registered portal users are from foreign countries, led by China and the USA, each with almost 300 users, followed by India, Japan, Canada, the UK, France, Italy and others. The daily data input/output rate has reached about 5,000 data files. The registered and authorized users have access to more than 20 million geoscience data products, each consisting of data and metadata files, structured into almost 300 different product types related to the main geoscience domains, such as:

  • geodesy, e.g. GPS data, satellite orbits, local gravity data, Earth gravity models, and Earth rotation parameters,

  • geophysics, e.g. Earth magnetic field data, both vector and scalar,

  • atmosphere and ionosphere, e.g. tropospheric temperature profiles and ionospheric electron density profiles.

Fig. 2

User country statistics (2009-02-11)

The objectives of data lifecycle management, the ISDC metadata classification model and the metadata standards used, the portal design, the data retrieval and data access interfaces, as well as the backend functionality are the subjects of the following chapters.

2 Data Lifecycle Management

The challenge of the exponentially growing number and volume of data, and the increasing danger of data waste and data loss (Gantz et al., 2008), can only be met by introducing a framework that guides the data management process from the creation of data to its transformation into knowledge, or to its disposal.

In the framework of a complete data life cycle (Lyon, 2007), as shown in Fig. 3, the portal system is responsible for geoscience data and information handling from the ingestion of geoscience data products provided by scientists up to the provision of geoscience knowledge in the form of, e.g., publications or model visualizations based on the ISDC data. Already in the project elaboration phase, ISDC expertise supports the definition, description and classification of data products. In the data product generation phase the portal is both data sink and data source: partially processed or low-level data products are ingested and stored at the ISDC and later provided to users or user groups for further processing, while finalized data products are stored in a sustainable way in long-term and online archive systems.

Data sets are imported only if the corresponding metadata documents are complete, consistent and valid; thus at least a minimal formal validation of the data files is realized. With almost 300 different product types, a real content check of the data files during the ISDC ingestion process is not feasible. The standardized data product and product type metadata documents, however, are used to create a complete and consistent ISDC data product catalog. Complete data sustainability is achieved if the disclosure, discovery and reuse of data are guaranteed for everybody over a long time. Publication and citation of data are important activities that support this idea of sustainable data management. The disclosure of new data products is realized by special portal features, such as the publication of newsletters or the provision of RSS feeds. The ISDC data product catalog system enables a detailed search for data, which are then accessible, downloadable and finally reusable.

Knowledge generation starts with adding value to the data, for instance through data integration, annotation, visualization or simulation; for both the integration and the annotation of data products the portal provides appropriate features. Knowledge extraction processes include data mining, modeling, analysis and synthesis. Another important process, which is not part of the “Research Life Cycle view of Data Curation” in Fig. 3, is the science-driven data review process, which should be carried out on a cyclical basis. This review process comprises activities such as the harmonization and aggregation of data, the tailoring of data and the removal of data. Reviewing is necessary to enhance data interoperability, to enable the reuse of data in other scientific domains and, finally, to maintain the operational status of the ISDC portal.

Fig. 3

e-Research life cycle, data curation and related processes,* Lyon (2007) (*edited by Ritschel, B.)

3 Metadata Model

The ISDC portal backend software manages almost 300 geoscience product types from different projects. In order to handle such a large variety of product types, a special ISDC product philosophy and metadata handling mechanism has been developed and introduced (Ritschel et al., 2006). The key to this challenge is the compulsory use of a standardized metadata format for the description of product types and the corresponding data products.

The relation of project-related product types at the ISDC is shown in Fig. 4. Each product type consists of a set of products, and each product is composed of one or more data files and metadata created using DIF XML.

Fig. 4

Project – product type – data product schema, which especially illustrates the relations between product types and data products and appropriate XML schemata

As explained in detail in Mende et al. (2008) and Ritschel et al. (2007b), each product type that results from a geoscience project consists of a set of data products. A data product is composed of a data file or a data set and a standardized metadata document. In order to describe and manage the data products, the ISDC system uses NASA's Directory Interchange Format (DIF) metadata standard (http://gcmd.gsfc.nasa.gov/User/difguide/difman.html). Currently, the ISDC backend accepts both ASCII DIF version 6, e.g. for CHAMP satellite data products, and an enhanced XML DIF version 9.x, e.g. for TerraSAR-X satellite data products.

The DIF standard was originally developed for the Global Change Master Directory (http://gcmd.nasa.gov/Aboutus) and is used for the semantic description of all kinds of Earth science data sets, which are categorized in domain-specific product types. The standard defines general metadata attributes that are required, such as Entry_ID (unique identifier), Entry_Title (title of the product type), Parameters (science keywords representative of the product type being described) and Summary (brief description of the product type that allows users to determine whether the data set is useful for their requirements). In addition to the required elements there is a set of metadata attributes that describe the product type in much more detail, e.g. Start_Date and Stop_Date, describing the temporal coverage of the data collection, or Latitude, Longitude and Altitude or Depth, which determine its spatial coverage. The DIF metadata standard has the potential to provide the right structure for the description of all kinds of geoscience data sets: counting all GCMD DIF files, almost 40,000 different data sets or product types, from A for agriculture to T for terrestrial hydrosphere, are semantically described by DIF-compliant metadata documents. Moreover, DIF metadata is transferable to the Federal Geographic Data Committee (FGDC) standard (http://www.fgdc.gov), and there are XSL transformation specifications, as shown in Fig. 5, for the creation of ISO 19115 (http://www.iso.org/iso/search.htm?qt=ISO+19115&published=on) compliant metadata documents. These features confirm that DIF was the right choice for the management of ISDC product types (Ritschel et al., 2006; Ritschel et al., 2007a).

The ISDC base schema of the product type DIF XML documents is defined in the “base-dif.xsd” file (Mende et al., 2008). This ISDC XML Schema Definition (XSD) was derived from the GCMD XSD and is available at http://isdc.gfz-potsdam.de/xsd/base_dif.xsd. Because the ISDC portal manages both product types and data products, it was necessary to extend the DIF standard: the ISDC deals with a combination of product type and data product DIF documents. The metadata of product types is stored in product type DIF files according to the “base-dif.xsd” schema, whereas the data file specific metadata is documented in data product DIF XML files. The combination of a data file or a set of data files (currently a maximum of 3 data files) and the corresponding metadata file defines the ISDC data product, as seen in Fig. 4. Each product type has its own schema for the data product DIF XML files, and data product DIF documents are necessary for the description of the data file specific properties.
The complex XML type <Data_Parameters> in the data product DIF XML document provides the specific extension of the product type DIF XML structures; it carries semantic information about the data file, such as temporal and spatial coverage, and technical information, such as the data file name and size. The connection between a data product DIF XML file and the product type DIF XML document is given by the equality of the main parts of the <Entry_ID> element in both documents. Additionally, the content of the <Parent_DIF> element in the data product DIF XML document refers to the appropriate product type DIF document. Figure 4 illustrates the relation between the XML schemata for the definition of product types and the definition of data products. The addition of mandatory elements to the data product schemata keeps data product metadata DIF documents useful even without the corresponding product type DIF documents.
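As an illustration, the fragment below sketches a hypothetical data product DIF XML document. The elements <Entry_ID>, <Parent_DIF> and <Data_Parameters> are those described above; the concrete identifiers and the child elements of <Data_Parameters> are illustrative assumptions, not the exact ISDC schema.

```xml
<!-- Hypothetical data product DIF document (all values are illustrative) -->
<DIF>
  <!-- Main part matches the <Entry_ID> of the product type DIF document -->
  <Entry_ID>CH-ME-3-MAG+2008-05-17</Entry_ID>
  <!-- Explicit reference to the corresponding product type DIF document -->
  <Parent_DIF>CH-ME-3-MAG</Parent_DIF>
  <!-- Data file specific extension of the product type DIF structures -->
  <Data_Parameters>
    <File_Name>CH-ME-3-MAG_2008-05-17.dat</File_Name>  <!-- assumed element -->
    <File_Size>1048576</File_Size>                     <!-- assumed element -->
    <Start_Date>2008-05-17</Start_Date>
    <Stop_Date>2008-05-17</Stop_Date>
  </Data_Parameters>
</DIF>
```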

Fig. 5

Mapping of metadata standards

Dreftymac (2007), diagram of the basic elements and processing flow of XSL Transformations, retrieved February 2009 from http://en.wikipedia.org/ (edited by Ritschel, B.).

The ISDC ontology class model, based on the semantic Web approach (Daconta et al., 2003), contains the metadata classes project, platform, instrument, product type and institution. Keywords from controlled and free vocabularies are used for the description of the different metadata classes. The new ISDC metadata concept is an extension of the ISDC product type and metadata philosophy (Ritschel et al., 2008) and is based on the extended metadata classification model of the GCMD. Figure 6 illustrates the new metadata classes and their relations as well as the use of controlled and free vocabularies. The ISDC metadata class model defines the appropriate classes, their relations and the input of the different vocabularies. The relation between project and instrument (dashed line) is only an implicit one, realized via the project – platform – instrument relation. The science domain used for the semantic description of the product type is defined by the project objectives and extended by the physical features of the instrument.

Fig. 6

The ISDC metadata class model

The introduction of the new metadata classes project, platform and instrument is a result of the necessity to describe the semantics of these classes in a deeper, more detailed and standardized manner (Pfeiffer, 2008). The concept model of the ISDC metadata classes also contains the independent class institution, because data and information about institutions, organisations and persons are always part of the other classes in the model.

The relations between ISDC metadata classes and attributes are shown in Table 1. For historical reasons, the class product type contains only real attributes, whereas the other classes have attributes describing their own properties and use references to crosslink to the appropriate classes. Detailed information about the ISDC metadata concept model is available in Ritschel et al. (2008a).

Table 1 ISDC metadata classes and semantic relations

As for the semantic description of product type metadata, keywords from controlled and free vocabularies are used for the metadata content of documents related to projects, platforms and instruments. The implementation of the new ISDC concept model will provide advanced and new ISDC portal retrieval features: a classified keyword search over the complete stock of metadata documents offers a totally new view of the relations between and within projects, platforms, instruments and product types.

4 Portal Architecture

The current solution uses PostNuke as a portal framework. The decision for this platform was based on a detailed analysis of different portal systems in 2004; the main selection criteria were costs and simplicity. Because of PostNuke’s open architecture and the large community around this open-source software, many free components could become part of the current portal implementation. One main component for backend services is a Sybase database (http://www.sybase.com), where data flow information, rights management and user statistics are stored and periodically updated. The big challenge that had to be solved was managing metadata and fine-granular access rights for tens of millions of data files: as already mentioned, there are currently around 20 million products, and each product has its own set of access rights.

4.1 Application Framework

The application framework PostNuke is a PHP-based portal and Content Management System (CMS). The implementation details are not specific to this platform; independence from a specific framework was a central point of the software development, because PostNuke was meant to consolidate the existing systems and prepare the way towards the planned GFZ-wide portal system solution. The PostNuke framework runs in a typical LAMP (Linux/Unix, Apache, MySQL, PHP) environment: we use Solaris (http://www.sun.com/software/solaris/) as operating system, an Apache web server with a MySQL database, and PHP as web scripting language. This configuration provides flexibility and basic functionality for standard functions, such as user management. Another standard component is the PostNuke CMS, which manages news postings, article editing, and the storage and handling of accessible documents, such as descriptions of product types, project-related information, or documentation and publications of the portal. The PostNuke Application Program Interface (API) makes it possible to encapsulate ISDC-specific functionality in plug-ins (a sketch of such a plug-in follows the list below). There are four big areas:

  • Data product retrieval

  • User account management

  • User collaboration

  • System monitoring

The use of an application-wide theme for a standardized layout for all components makes it easy to separate the data model from the portal GUI.
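As a minimal sketch, the following fragment shows how an ISDC-specific plug-in function in the data product retrieval area might be structured. It follows the general module conventions of PostNuke 0.7x; the module name ISDC_Retrieval, the template file and the API function 'search' are hypothetical, not the actual ISDC code.

```php
<?php
// Hypothetical user-facing function of an ISDC retrieval plug-in module.
// Module, template and parameter names are illustrative assumptions.
function ISDC_Retrieval_user_main($args)
{
    // Only registered, logged-in users may search the product catalog
    if (!pnUserLoggedIn()) {
        return 'Please log in to access ISDC data products.';
    }

    // Read and clean the requested product type from the HTTP request
    $productType = pnVarCleanFromInput('producttype');

    // Delegate the catalog query to the module's API layer
    $products = pnModAPIFunc('ISDC_Retrieval', 'user', 'search',
                             array('producttype' => $productType));

    // Render the result list with the application-wide theme
    $render = new pnRender('ISDC_Retrieval');
    $render->assign('products', $products);
    return $render->fetch('isdc_retrieval_list.htm');
}
?>
```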

4.2 Data Flow

As already mentioned, the main interface between the user portal and the backend services is the database. This design decision resulted from the evolutionary development of the system and has both positive and negative aspects. The main system limitation is the time delay between the ingestion and the provision of data: every backend process runs as an operating system cronjob at intervals between 1 and 15 min, and some jobs (e.g. the aggregation of statistics) run only on a daily or weekly basis. The advantage of this asynchronous approach is the independence, or loose coupling, of the system modules. If a new file is imported, delivered or archived, not all systems have to work synchronously or on request, which avoids mutual process blocking within and between system modules and components. For example, data providers often upload their files only from time to time into the appropriate FTP directories, but the import services run independently and periodically and process these files sequentially. Therefore performance issues can be controlled efficiently and transactions are safe.

The FTP-based services are a central aspect of data ingestion, data storage and data delivery. Users do not have direct access to the data; instead, the system sends a qualified user request to the archive system in order to transfer the requested files to the appropriate user FTP directories, where the files are cached for download. This approach is used for security reasons and because the system must operate with different archive systems at the same time. Background processes collect the data files from these different services and transfer them to the user directories. The user is notified when the transfer process is complete and the files are ready for download.
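The following fragment sketches the principle of such an asynchronous import job. The directory layout, the table name, the database connection and the minimal metadata check are simplified assumptions for illustration, not the actual ISDC implementation.

```php
<?php
// Minimal sketch of a periodic import job, run from cron (e.g. every 15 min).
// Directory, table and connection details are assumptions for illustration.
$incoming = '/ftp/incoming';
$db = new PDO('mysql:host=localhost;dbname=isdc', 'isdc_import', 'secret');

foreach (glob($incoming . '/*.dat') as $dataFile) {
    $metaFile = $dataFile . '.dif.xml'; // expected DIF metadata document

    // Import only if the DIF metadata document exists and is well-formed XML;
    // otherwise leave the file for a later run or for operator inspection.
    if (!file_exists($metaFile) || @simplexml_load_file($metaFile) === false) {
        continue;
    }

    // Register the product in the catalog; a separate, loosely coupled job
    // later moves the file into the Online Product Archive.
    $stmt = $db->prepare(
        'INSERT INTO product_catalog (file_name, file_size, imported_at)
         VALUES (?, ?, NOW())');
    $stmt->execute(array(basename($dataFile), filesize($dataFile)));
}
?>
```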

4.3 Interfaces

Internally, all inter-system communication is controlled by database transactions, and all functional needs are covered by this approach. As an example, here is the workflow of the user registration process (a sketch of the corresponding provisioning job follows the list):

  • new user registers at the portal website

  • user data is saved to the database

  • a cronjob checks newly registered users

  • new users are added to the FTP user accounts and appropriate user home directories are created

  • the system opens a specific portal area for the input of the data required for system usage

  • users now choose projects and interests and define favorite product types

  • an administrator checks the user’s data in order to grant access to certain internal product types (grants for public product types are assigned automatically).
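A minimal sketch of the cronjob side of this workflow, under the assumption of a users table with a registration status flag; all table, column and path names are hypothetical.

```php
<?php
// Sketch of the periodic provisioning job for newly registered portal users.
// Table, column and directory names are illustrative assumptions.
$db = new PDO('mysql:host=localhost;dbname=isdc', 'isdc_admin', 'secret');

// Fetch users registered via the website but not yet provisioned
$new = $db->query(
    "SELECT user_id, user_name FROM portal_users WHERE status = 'registered'");

foreach ($new as $user) {
    // Create the FTP home directory used later for product delivery
    $home = '/ftp/users/' . $user['user_name'];
    if (!is_dir($home)) {
        mkdir($home, 0750, true);
    }

    // Add an FTP account and mark the user as provisioned; grants for
    // public product types can be assigned automatically at this point.
    $db->prepare('INSERT INTO ftp_accounts (user_id, home_dir) VALUES (?, ?)')
       ->execute(array($user['user_id'], $home));
    $db->prepare("UPDATE portal_users SET status = 'provisioned'
                  WHERE user_id = ?")
       ->execute(array($user['user_id']));
}
?>
```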

Only a part of the presented workflow is human-centric; most activities are performed automatically. Only the triggering of some processes and the approval tasks are realized by user interaction. The Graphical User Interface (GUI) of the ISDC homepage is presented in Fig. 7.

Fig. 7

Graphical user interface of the GFZ ISDC portal (ISDC portal homepage)

The portal frame of the GUI contains navigation and monitoring elements on the left and right sides of the portal. The central area is used for news and, after clicking a navigation link, for the display of the appropriate content, such as information about projects and product types or tools for the retrieval of and access to data products. In order to provide project-centric access to data and information, every project has its own homepage within the portal framework, e.g. the CHAMP ISDC homepage (http://isdc.gfz-potsdam.de/champ) or the GRACE ISDC homepage (http://isdc.gfz-potsdam.de/grace).

The Data Product Browser, shown in Fig. 8, can be seen as a virtual FTP directory browser and is a good example of a user-request-driven interface for easy access to the different data product files. The files are categorized by project, such as CHAMP or GRACE, processing level (1–4), product type and temporal unit (year, month, day). The visibility of product types and data products is granted according to user access rights only, which was an essential data policy constraint. The calendar function of the Data Product Browser provides data products of a specific product type on a daily basis; these are put into MY PRODUCT CART on user request. The user can repeat this action for different product types until the user-dependent limit (by default 1,000 files or 1 gigabyte per day) is reached. After clicking the “Request product cart” button, the required files are transferred to the user’s FTP directory. A sketch of the underlying path mapping follows.
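Conceptually, the virtual directory hierarchy is a simple mapping from catalog attributes to a path. The following helper is a hypothetical sketch of that mapping, not the actual browser code; the product type name in the example is illustrative.

```php
<?php
// Hypothetical mapping of catalog attributes to a virtual FTP path:
// project / processing level / product type / year / month.
function virtualProductPath($project, $level, $productType, $timestamp)
{
    return sprintf('/%s/Level%d/%s/%s',
                   $project, $level, $productType, date('Y/m', $timestamp));
}

// Example: a CHAMP level-3 magnetic field product from May 2008
echo virtualProductPath('CHAMP', 3, 'CH-ME-3-MAG',
                        mktime(0, 0, 0, 5, 17, 2008));
// -> /CHAMP/Level3/CH-ME-3-MAG/2008/05
?>
```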

Fig. 8

Data product browser

5 Backend for Operational Services

As already explained in Chap. 4, there are several cronjobs that run on the different backend servers and zones (Fig. 9): data transfer processes, such as data import or export, metadata extraction processes, and actions necessary for data mining and aggregation purposes. Most of these jobs are driven by database transactions. An example of a background aggregation process is the computation of user-grant-dependent timeline information for the creation of an occurrence chart, which displays the availability of data products in a certain period of time; this computation is too complex to be performed on the fly. A simplified sketch of such an aggregation follows.
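The occurrence chart pre-computation could, in principle, look like the following nightly job; the tables and the simplified grant model are assumptions for illustration only.

```php
<?php
// Sketch of a nightly job that pre-computes, per user group and product
// type, how many data products exist per day. Table names are assumptions.
$db = new PDO('mysql:host=localhost;dbname=isdc', 'isdc_stats', 'secret');

// Rebuild the pre-aggregated table that the occurrence chart GUI reads
$db->exec('TRUNCATE TABLE occurrence_chart');
$db->exec(
    "INSERT INTO occurrence_chart (group_id, product_type, day, product_count)
     SELECT g.group_id, p.product_type, DATE(p.start_time), COUNT(*)
       FROM product_catalog p
       JOIN product_grants g ON g.product_type = p.product_type
      GROUP BY g.group_id, p.product_type, DATE(p.start_time)");
?>
```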

Fig. 9

Deployment diagram

5.1 Component Deployment

Whereas the former ISDC system components were installed in 2003 on SUN servers with Fibre Channel RAID systems and the Solaris 9 operating system, as described and illustrated in Ritschel et al. (2006), the new ISDC portal is based on a new system architecture using innovative features of the Solaris 10 operating system, such as the virtualization technology (http://www.sun.com/software/solaris/virtualization.jsp) called Solaris Containers and the Zettabyte File System (ZFS) (http://www.sun.com/software/solaris/data_management.jsp). This made it possible to decrease the number of required workstation-based servers, to reduce system administration tasks and to fulfill increasing requirements for higher performance and system availability.

A Solaris Container is the combination of resource control and the partition technology provided by zones. Zones are lightweight virtual machines with an isolated process tree, which means that active processes in one zone cannot affect processes in another zone. Each zone is an isolated virtual server with its own node name, virtual network interface and assigned storage. Zones do not require a dedicated CPU, storage or physical network interface, but any of these resources can be assigned specifically to one zone (http://www.softpanorama.org/Solaris/Virtualization/zones.shtml). Solaris ZFS is a 128-bit file system, so capacity limits should not be a problem in the foreseeable future. ZFS protects data from corruption with integrated error detection and correction and provides virtually unlimited scalability by using virtual storage pools. Another advantage is the snapshot feature for the preservation of the current state of the file system. File systems can be expanded by simply adding more drives, and building up virtual storage pools with integrated mirroring or the Solaris ZFS RAIDZ or RAIDZ2 mechanisms increases redundancy and availability.

Solaris Container technology provides the advantage of storage sharing between different zones. Because of security constraints, network connections between hosts in the demilitarized zone (DMZ) are not allowed at the Helmholtz Centre Potsdam GFZ German Research Centre for Geosciences. Using zones, the formerly four separately operating workstations could be integrated on one new machine (PRODUCTION), each deployed into its own zone, as shown in Fig. 9. One of these zones, the external FTP server, has its own network interface into the DMZ; the other three zones use network interfaces into the LAN. The disc storage, based on ZFS, can now be shared between different zones: a part of the storage is used for the internal and external FTP servers, and the same storage is used simultaneously by the processing zone and the Online Product Archive (OPA). Another part of the storage is mapped only into the OPA and is invisible to all other zones, which prevents FTP users from compromising the data archive. A second machine, called BACKUP, is set up with the same configuration as the PRODUCTION machine and is used as production host in case of hardware failures. The commands below sketch this kind of setup.
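The following Solaris 10 command sketch illustrates how such a redundant ZFS storage pool can be built and mapped into a zone; the pool, zone, path and device names are placeholders, not the actual ISDC configuration.

```
# Illustrative only: pool, zone, path and device names are placeholders.

# Build a redundant ZFS storage pool (RAIDZ) and a shared file system
zpool create tank raidz c1t0d0 c1t1d0 c1t2d0 c1t3d0
zfs create tank/products

# Preserve the current state of the file system with a snapshot
zfs snapshot tank/products@daily-backup

# Map the shared product storage into the OPA zone via a loopback mount
zonecfg -z opa
zonecfg:opa> add fs
zonecfg:opa:fs> set dir=/data/products
zonecfg:opa:fs> set special=/tank/products
zonecfg:opa:fs> set type=lofs
zonecfg:opa:fs> end
zonecfg:opa> commit
```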

6 Outlook

The step-by-step implementation of the ISDC ontology metadata concept is important for the realization of a semantics-driven information system (Daconta et al., 2003) for the multi-domain retrieval of geoscience data, information and knowledge. The planned integration of Web 2.0 technologies and the implementation of appropriate user interfaces are necessary for user feedback and communication processes, in order to capture the often untapped knowledge of the user community. A new release of the ISDC portal will be based on Java technology and will provide new user interfaces as well as standardized interfaces and services to other information systems, such as GCMD and GEOSS.