1 Motivation and goals

An increasing number of open data catalogs have recently appeared on the Web [1]. These catalogs contain data that represent real-world entities and their attributes. Because data can be imported from several catalogs to build web services, there is a need to trace the source of each entity and attribute value in a way that also handles possible conflicts between attribute values coming from overlapping sources [2]. For open data, source tracing requires capturing both the provenance [3] of the attribute values and the identity links [4] between entities. Moreover, resolving conflicts manually becomes harder as the size of the data increases.

We propose a source tracing module that extends any existing import process by making it tracing-aware. The source tracing module contains three tools: authority, provenance and evidence. Authority provides rules for overriding attribute values, provenance specifies the source of an attribute value and evidence provides identity links between entities.

2 Problem

The problem of tracing sources is studied with respect to an import process that takes an open data catalog and extracts entities and their attribute values from its contents. The extracted entities and attribute values are imported into a database called the entity base.

A common category of open data repository is the DCAT catalog. DCAT (Data Catalog Vocabulary) is an RDF vocabulary for describing datasets in a data catalog. A DCAT catalog can have one or more datasets, and a dataset can have one or more distributions. DCAT catalogs exist within a Web-based system called CKAN. CKAN (Comprehensive Knowledge Archive Network) is a dataset distribution system. Datasets are distributed as packages. Each package has one or more resource groups, and each resource group has one or more resources.

Open data catalogs contain data that represent objects from the real world. We refer to real-world objects that are important enough to be given a name as entities. An example of an entity is Italy. There are different entity types, such as Location; Italy is an entity of type Location. The entity type determines the list of attribute definitions that can be assigned to an entity of that type. For example, Location entities may have the attribute definition Area, which holds the total area of the location. The values of the attribute definitions for a specific entity are called attribute values.
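The terminology above can be made concrete with a small data model. The following is a minimal sketch, assuming Python dataclasses; the class names, the `allows` and `set_value` helpers, and the numeric value used for Area are illustrative assumptions and are not part of the described system.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class AttributeDefinition:
    """An attribute that entities of some type may carry, e.g. Area."""
    name: str


@dataclass
class EntityType:
    """An entity type, e.g. Location, with its attribute definitions."""
    name: str
    attribute_definitions: list[AttributeDefinition] = field(default_factory=list)

    def allows(self, attribute_name: str) -> bool:
        # True if this type defines an attribute with the given name.
        return any(d.name == attribute_name for d in self.attribute_definitions)


@dataclass
class Entity:
    """A named real-world object, e.g. Italy, with its attribute values."""
    name: str
    entity_type: EntityType
    attribute_values: dict[str, object] = field(default_factory=dict)

    def set_value(self, attribute_name: str, value: object) -> None:
        # Only attributes defined by the entity type may receive a value.
        if not self.entity_type.allows(attribute_name):
            raise ValueError(
                f"{self.entity_type.name} has no attribute {attribute_name!r}"
            )
        self.attribute_values[attribute_name] = value


location = EntityType("Location", [AttributeDefinition("Area")])
italy = Entity("Italy", location)
italy.set_value("Area", 301_340)  # illustrative placeholder value
```

In this sketch the entity type acts as a schema: assigning a value for an attribute that the type does not define is rejected.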

The entity base is populated with entities through an import process, which can be, for instance, a generic workflow for importing any dataset or a custom procedure for importing a specific dataset. We consider any import process that has the following three aspects:

  1. Partiality: The import process may take a partial input.

  2. Overlap: Imported data may be disjoint from, or overlap with, existing entities and attribute values in the entity base.

  3. Multiple Imports: The import process may run multiple times on the same catalog.
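To illustrate how the three aspects interact, the following is a minimal sketch of an import step over an in-memory entity base. The function `import_records`, the dictionary layout of the entity base and the sample values are all hypothetical; the sketch only detects conflicts and does not decide which value wins.

```python
def import_records(entity_base, records):
    """Merge extracted records into the entity base and return conflicts.

    entity_base: dict mapping entity name -> dict of attribute values.
    records: iterable of (entity name, dict of attribute values).
    """
    conflicts = []
    for entity_name, attributes in records:
        stored = entity_base.setdefault(entity_name, {})
        # Partiality: only the attributes present in the input are touched.
        for attribute, value in attributes.items():
            if attribute not in stored:
                # Disjoint data: the value is simply added.
                stored[attribute] = value
            elif stored[attribute] != value:
                # Overlap with a different value: report it, keep the old one.
                conflicts.append((entity_name, attribute, stored[attribute], value))
            # Overlap with an equal value (e.g. a repeated import): no change.
    return conflicts


entity_base = {}
first_run = [("Italy", {"Area": 301340})]
no_conflicts = import_records(entity_base, first_run)
repeated = import_records(entity_base, first_run)  # same catalog, second run
conflicts = import_records(
    entity_base, [("Italy", {"Area": 301338, "Capital": "Rome"})]
)
```

In this sketch a repeated run on the same input leaves the entity base unchanged, while the last call adds the new attribute and reports the disagreement on Area as a conflict to be resolved.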

3 Our Approach

Fig. 1. Extending an import process with the source tracing module

We propose a source tracing module that extends any existing import process by making it tracing-aware (see Fig. 1). The source tracing module contains three tools: authority, provenance and evidence.

3.1 Authority

Authority is a meta-attribute of an element (an entity type, an attribute definition, an entity or an attribute value) that connects the element to the resource that has the authority to create or update it. Authority is specified through a set of authority rules. An authority rule is a relation between a resource and one or more elements, called the scope, together with a ranking value, called the priority.

The scope specifies the set of elements that are affected by an authority rule. We support four ordered levels of authority scope: (1) entity type, (2) a set of entities, (3) attribute definition and (4) attribute value. The three aspects of the import process (partiality, overlap and multiple imports) can occur at any scope. The priority is a ranking value that orders the sources when multiple sources are given authority for the same scope. This ranking is a total order. Authority should be defined for each element. Its purpose is to help find a winning resource when two resources conflict on an attribute value.
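A possible encoding of authority rules is sketched below. It rests on two assumptions that the text does not state: that a rule at a narrower scope level overrides a rule at a broader one, and that a smaller priority number means a higher rank. The names `Scope`, `AuthorityRule`, `element_keys` and `winning_resource`, the key format, and the resource names are hypothetical.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class Scope(IntEnum):
    """The four ordered scope levels, from broadest to narrowest."""
    ENTITY_TYPE = 1
    ENTITY_SET = 2
    ATTRIBUTE_DEFINITION = 3
    ATTRIBUTE_VALUE = 4


@dataclass(frozen=True)
class AuthorityRule:
    resource: str            # resource that is granted authority
    scope: Scope             # level at which the rule applies
    targets: frozenset       # keys of the elements in scope (hypothetical format)
    priority: int            # assumption: smaller number = higher rank


def element_keys(entity_type: str, entity: str, attribute: str) -> dict:
    """Keys under which one attribute value is matched at each scope level."""
    return {
        Scope.ENTITY_TYPE: entity_type,
        Scope.ENTITY_SET: entity,
        Scope.ATTRIBUTE_DEFINITION: f"{entity_type}.{attribute}",
        Scope.ATTRIBUTE_VALUE: f"{entity}.{attribute}",
    }


def winning_resource(rules, keys, candidates) -> Optional[str]:
    """Pick the resource whose value wins among the conflicting candidates."""
    applicable = [
        r for r in rules
        if r.resource in candidates and keys[r.scope] in r.targets
    ]
    if not applicable:
        return None
    # Assumption: narrowest scope first, then highest rank within that scope.
    best = min(applicable, key=lambda r: (-r.scope, r.priority))
    return best.resource


rules = [
    AuthorityRule("catalog_a", Scope.ENTITY_TYPE, frozenset({"Location"}), 1),
    AuthorityRule("catalog_b", Scope.ENTITY_TYPE, frozenset({"Location"}), 2),
    AuthorityRule("catalog_b", Scope.ATTRIBUTE_DEFINITION,
                  frozenset({"Location.Area"}), 1),
]
candidates = {"catalog_a", "catalog_b"}
area_winner = winning_resource(
    rules, element_keys("Location", "Italy", "Area"), candidates)
population_winner = winning_resource(
    rules, element_keys("Location", "Italy", "Population"), candidates)
```

Under these assumptions, `catalog_b` wins a conflict on Area because it holds a rule at the narrower attribute-definition scope, while `catalog_a` wins on any other Location attribute because it ranks first at the entity-type scope.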

3.2 Provenance and Evidence

An import process runs on an external resource and extracts entities and their attribute values from it. Before creating or updating the entities and their attribute values in the entity base, a tracing-aware import process creates a graph of elements between the external source and the entity base. This graph is shown in Fig. 2. The ultimate goal of this graph is to trace the sources of each element in the entity base. The graph is connected to the entity base through provenance and evidence. Provenance is a meta-attribute that specifies the source of an attribute value, while evidence is an attribute that links an entity with an external entity that represents the same real-world object.

Fig. 2. Provenance graph for the entity base
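The roles of provenance and evidence can be sketched as follows. This is a simplified illustration and not the graph structure of Fig. 2: provenance is attached to each attribute value, and evidence is kept as a set of identity links on the entity. The class names, the `traced_import` and `sources_of` helpers, and the `example.org` identifiers are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Provenance:
    """Meta-attribute: where an attribute value came from."""
    resource: str         # external resource the value was extracted from
    external_entity: str  # identifier of the entity inside that resource


@dataclass
class TracedValue:
    value: object
    provenance: Provenance


@dataclass
class TracedEntity:
    name: str
    values: dict[str, TracedValue] = field(default_factory=dict)
    # Evidence: identity links to external entities for the same real-world object.
    evidence: set[str] = field(default_factory=set)


def traced_import(entity, resource, external_entity, attributes):
    """Record an identity link and the provenance of every imported value."""
    entity.evidence.add(external_entity)
    for attribute, value in attributes.items():
        entity.values[attribute] = TracedValue(
            value, Provenance(resource, external_entity)
        )


def sources_of(entity):
    """Map each attribute to the resource that supplied its current value."""
    return {a: v.provenance.resource for a, v in entity.values.items()}


italy = TracedEntity("Italy")
traced_import(italy, "catalog_a", "http://example.org/catalog-a/italy",
              {"Area": 301340})
traced_import(italy, "catalog_b", "http://example.org/catalog-b/IT",
              {"Capital": "Rome"})
sources = sources_of(italy)
```

After the two imports in this sketch, each attribute value can be traced back to its resource, and the entity carries one identity link per external entity it was matched with.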