Resource type:

Set of ontologies

Permanent URL:

https://github.com/aligned-h2020/ALIGNED_Ontologies

1 Introduction

Recent years have seen a significant increase in the demand for data-intensive applications based on large-scale sources of data. However, our engineering techniques for building data-intensive systems are immature and are often partitioned into separate software engineering and data engineering processes, tasks or teams. There is a need for integrated engineering approaches. The data itself must also be of high quality, which entails a curatorial process to improve and manage data over time. The expressivity of semantic models makes them useful both for addressing data quality [5] and for applying model-driven approaches [3] to software engineering. Semantic data, in the form of enterprise linked data, is also useful for describing, fusing and managing the combined data and software engineering lifecycles to increase productivity, agility and system quality.

In this paper, we present a suite of ontologies, developed within the ALIGNED project (Footnote 1), that aims to align the divergent processes encapsulating data and software engineering. The key aim of the ALIGNED ontology suite is to support the generation of combined software and data engineering processes and tools for improved productivity, agility and quality. The suite contains linked data ontologies/vocabularies designed to: (1) support semantics-based model-driven software engineering, by documenting additional system context and constraints for RDF-based data or knowledge models in the form of design intents, software lifecycle specifications and data lifecycle specifications; (2) support data quality engineering techniques, by documenting data curation tasks, roles, datasets, workflows and data quality reports at each data lifecycle stage in a data-intensive system; and (3) support the development of tools for unified views of software and data engineering processes and software/data test case interlinking, by providing the basis for enterprise linked data describing software and data engineering activities (tasks), agents (actors) and entities (artefacts) based on the W3C provenance ontology (Footnote 2).

This ontology suite has been deployed for validation and incremental improvement in the ALIGNED project on four large-scale, data-intensive systems engineering use cases: the Seshat Global History Databank [10], which is compiling linked data time series relating to all human societies over the past 12,000 years; JURION (Footnote 3), a legal information platform developed by Wolters Kluwer Germany; PoolParty (Footnote 4), a semantic technology middleware developed by the Semantic Web Company; and the DBpedia+ (Footnote 5) data quality and release processes.

The paper is structured as follows: Sect. 2 presents an overview of the ALIGNED suite. It provides a brief description of the core ontologies in the suite. Section 3 shows how the vocabularies have been applied to a complex collaborative software and data engineering scenario from the legal information system domain. Section 4 presents an evaluation of the ontologies in the suite. Section 5 briefly discusses related work. Finally, Sect. 6 presents conclusions.

2 Overview of the ALIGNED Suite

Figure 1 illustrates the ALIGNED suite of ontologies, split into the provenance, generic, and domain-specific layers. As can be seen from the figure, strong emphasis has been placed on reusing existing, well-known and standardised specifications where available. At the top layer, the W3C provenance standard forms the baseline for all our specifications, and all our models extend it in some way. Splitting the ALIGNED ontology suite into a generic layer and a domain-specific extensions layer allows rapid evolution of domain-specific extensions for the ALIGNED use cases/trial environments (JURION, Seshat, DBpedia, PoolParty) based on a stable set of core concepts modelled in the generic layer. As the project progresses, these extensions will be evaluated and incorporated into the generic layer if they prove valuable or more widely applicable than a single domain. Within the project, the suite of ontologies is known as the “ALIGNED metamodel” due to its links with software engineering practices.
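The layering can be illustrated with a minimal Python sketch. It assumes a hypothetical generic-layer namespace and a hypothetical domain extension (the `example.org` IRIs and class names below are placeholders, not terms from the ALIGNED ontologies); only the `prov:` IRIs are taken from the W3C provenance ontology. The sketch walks subclass links to find the provenance class that a domain-specific class ultimately specialises.

```python
# Only the PROV namespace is real; GEN and DOM are hypothetical stand-ins
# for a generic-layer ontology and a domain-specific extension.
PROV = "http://www.w3.org/ns/prov#"
GEN = "http://example.org/generic#"
DOM = "http://example.org/domain#"

# Direct subclass links (child -> parent), as rdfs:subClassOf would state them.
SUBCLASS_OF = {
    GEN + "LifecycleStage": PROV + "Activity",
    GEN + "EngineeringAgent": PROV + "Agent",
    GEN + "DesignArtefact": PROV + "Entity",
    DOM + "MetadataExtraction": GEN + "LifecycleStage",
    DOM + "LegalDataset": GEN + "DesignArtefact",
}


def ancestors(cls, subclass_of):
    """Return the superclasses of cls, nearest first."""
    chain = []
    seen = {cls}
    while cls in subclass_of:
        cls = subclass_of[cls]
        if cls in seen:  # guard against accidental cycles
            break
        seen.add(cls)
        chain.append(cls)
    return chain


def prov_root(cls, subclass_of):
    """Return the first PROV class that cls is, or specialises, else None."""
    for candidate in [cls] + ancestors(cls, subclass_of):
        if candidate.startswith(PROV):
            return candidate
    return None
```

Under these assumptions, `prov_root(DOM + "MetadataExtraction", SUBCLASS_OF)` resolves through the generic layer to `prov:Activity`, which mirrors how a domain extension can evolve while the generic layer and the provenance baseline stay stable.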

Fig. 1. The ALIGNED suite of ontologies

We briefly present here the core ontologies from the suite. Further details of the ontologies including the axiomatisations, graphical representation, serialisations in multiple formats via content negotiation, examples illustrating the usage of the ontologies, typical SPARQL queries that can be formulated using the ontologies as the data model and HTML documentation are available from the individual deployments at their persistent URIs. Due to space constraints we deliberately do not include these in this paper. The ontologies are grouped as follows:

  • Design intent: This model is used to document the design decisions about data-intensive system artefacts such as requirements, designs or datasets. It is based on the design intent ontology (DIO) (Footnote 6), which allows users to express the design intent or design rationale while undertaking the design of an artefact. DIO [9] is a generic ontology that provides the conceptualisation needed to capture the knowledge generated during various phases of the overall design lifecycle. DIO provides definitions for design artefacts such as requirements, designs, design issues, solutions, justifications and evidence, and relationships between them.

  • Software engineering: This model defines the major agents (e.g. project roles), activities (e.g. lifecycle stages), and entities (design artefacts) involved in a software engineering project and their relations, with a special focus on capturing the engineering lifecycle. Two ontologies make up this model: the software process ontology (SPO) (Footnote 7) and the software implementation processes ontology (SIP) (Footnote 8).

  • Data engineering: This model mirrors the software engineering model above, but with a focus on data engineering and data lifecycles. Two ontologies are used: the data lifecycle ontology (DLO) (Footnote 9), defined within ALIGNED, and the DataID (Footnote 10) ontology, defined by ALIGNED for the DBpedia association, for describing datasets. DLO provides a set of conceptual entities, agents, activities, and roles to represent the general data engineering process. Furthermore, it is the basis for deriving specific domain ontologies that represent the lifecycles of concrete data engineering projects such as DBpedia or Seshat. DataID is a multi-layered metadata system which, at its core, describes datasets and their different manifestations, as well as their relations to agents such as persons or organisations, with regard to their rights and responsibilities. Depending on the context, type of data and use case, this core ontology can be augmented by multiple existing extensions (e.g. Linked Data, repository descriptions etc.).

  • Unified quality reports: This model defines a unified reporting representation for data quality metrics, ontology reasoning errors, test cases, and test case results based on the W3C SHACL reporting vocabulary. It is based on four ontologies/vocabularies, three of which are externally developed: W3C SHACL (Footnote 11), W3C Data Quality (Footnote 12), and the University of Leipzig's test-driven RDF validation ontology [5] (RUT); and one ontology developed within ALIGNED: the reasoning violation ontology (RVO) (Footnote 13). RUT is designed to capture the lifecycle of RDF validation with the test-driven validation methodology. It is implemented by the RDFUnit tool. RVO describes both ABox and TBox reasoning errors for the integration of reasoners into data lifecycle tool-chains. The ontology covers violations of the OWL 2 direct semantics and syntax detected at both the schema and instance level over the full range of OWL 2 and RDFS language constructs. An overview of RVO and its design, implementation and use cases has been published in [1].

  • Domain data model: This describes the domain of the data-intensive application being developed and is specific to that application, e.g. the Seshat ontology for historical time-series describing human societies.

The lower layer includes the domain-specific extensions to the metamodels. ALIGNED has developed four domain-specific metamodels, one for each of our use cases, with a focus on the model elements needed for the ALIGNED phase 2 trials:

  • Enterprise information processing: extensions and models for the JURION use case.

  • E-research in the Social Sciences and Humanities: extensions and models for the Seshat use case.

  • Crowd-sourced public datasets: extensions and models for the DBpedia use case.

  • Enterprise software development: extensions and models for the PoolParty use case.
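To make the idea of a unified quality report more concrete, the following Python sketch serialises findings from two kinds of checks into a single validation report in Turtle. The `sh:` terms come from the W3C SHACL validation report vocabulary; the `Finding` class, the source labels and the `example.org` IRIs are assumptions made for illustration, and the output is deliberately simplified (for instance, it omits `sh:sourceConstraintComponent`). It is not the reporting format used by the ALIGNED tools.

```python
from dataclasses import dataclass

SH = "http://www.w3.org/ns/shacl#"


@dataclass
class Finding:
    """One quality problem, whatever tool produced it (hypothetical structure)."""
    focus_node: str  # IRI of the resource the finding is about
    message: str     # human-readable description
    source: str      # illustrative label, e.g. "test-case" or "reasoner"


def escape(text):
    """Escape characters that are not allowed raw in a Turtle string literal."""
    return (text.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n"))


def to_turtle(findings):
    """Serialise all findings as one simplified SHACL-style validation report."""
    lines = [f"@prefix sh: <{SH}> .", "", "[] a sh:ValidationReport ;"]
    if not findings:
        lines.append("    sh:conforms true .")
        return "\n".join(lines)
    lines.append("    sh:conforms false ;")
    results = []
    for finding in findings:
        results.append(
            "    sh:result [\n"
            "        a sh:ValidationResult ;\n"
            "        sh:resultSeverity sh:Violation ;\n"
            f"        sh:focusNode <{finding.focus_node}> ;\n"
            f'        sh:resultMessage "{escape(finding.message)} ({finding.source})"\n'
            "    ]"
        )
    lines.append(" ;\n".join(results) + " .")
    return "\n".join(lines)
```

Calling `to_turtle` with one test-case failure and one reasoning error yields a single report with two `sh:result` entries, so a downstream tool only needs to understand one report structure regardless of which check raised the problem.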

Fig. 2. Usage of the ALIGNED suite of ontologies in the JURION semantics-based legal information system

3 Example Deployment: The ALIGNED Suite in Wolters Kluwer’s JURION

JURION is an innovative legal information platform developed by Wolters Kluwer Germany that merges and interlinks over 1 million documents of content and data from diverse sources such as national and European legislation and court judgements, extensive internally authored content and local customer data, as well as social media and semantic web data (e.g. from DBpedia). This data is then presented to users (such as law offices) in the form of highly customised applications for semantic search, annotation, case management and legal information retrieval.

Currently, the software development process and the data lifecycle are highly independent of each other. Figure 2 illustrates where ontologies from the ALIGNED suite contribute towards facilitating interoperability between the software and data engineering processes and tools used to build and maintain JURION. The two main uses are tool integration and unified governance. Tool integration includes both cases within a single domain (data or software engineering) and cross-domain tool-chain integration. Unified governance uses ALIGNED provenance records, data extraction and uplift from enterprise engineering tools, and data fusion to provide end-to-end and cross-domain views of the JURION platform engineering processes. We elaborate on the deployment of ALIGNED ontologies for these use cases below.
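As a rough illustration of what a cross-domain view over provenance records could look like, the Python sketch below traces a dataset back through both data and software engineering activities. The `prov:used` and `prov:wasGeneratedBy` properties are from the W3C provenance ontology; every resource IRI is hypothetical and does not describe the actual JURION environment.

```python
PROV = "http://www.w3.org/ns/prov#"
EX = "http://example.org/platform/"  # hypothetical namespace

# Hypothetical provenance records as (subject, predicate, object) triples.
TRIPLES = [
    # Software engineering side: a build turns source code into a tool.
    (EX + "release-build", PROV + "used", EX + "extraction-code"),
    (EX + "extractor-binary", PROV + "wasGeneratedBy", EX + "release-build"),
    # Data engineering side: the tool is run over source documents.
    (EX + "metadata-extraction", PROV + "used", EX + "extractor-binary"),
    (EX + "metadata-extraction", PROV + "used", EX + "court-judgements"),
    (EX + "metadata-dataset", PROV + "wasGeneratedBy", EX + "metadata-extraction"),
]


def lineage(entity, triples):
    """Return every entity that the given entity transitively depends on."""
    generated_by = {s: o for s, p, o in triples if p == PROV + "wasGeneratedBy"}
    used = {}
    for s, p, o in triples:
        if p == PROV + "used":
            used.setdefault(s, []).append(o)

    found = []
    stack = [entity]
    while stack:
        current = stack.pop()
        activity = generated_by.get(current)
        for source in used.get(activity, []):
            if source not in found:  # also prevents looping on cycles
                found.append(source)
                stack.append(source)
    return found
```

Here `lineage(EX + "metadata-dataset", TRIPLES)` reaches the source documents on the data side and the source code on the software side, because both kinds of activity are recorded with the same provenance properties.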

RUT has been used in JURION for validating and verifying the extraction of metadata [6]. In particular, RDFUnit is used as a data validation tool integrated into JURION’s continuous integration (CI) platform (Jenkins). RVO, the reasoning violations ontology, has been used to integrate advanced OWL reasoning-based data quality checks with RDFUnit’s triple-query oriented tests, expanding the scope of testing. DataID descriptors of all the JURION datasets are under evaluation; the plan is to use these to provide consistent metadata that is available to all tools, thus facilitating further integration. The EIP (enterprise information processing) ontology has been used to describe the JURION environment, systems, artefacts and engineering processes in terms of the ALIGNED software and data lifecycle models.
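One way such a combined check could gate a CI build is sketched below in Python. The function names, the two result lists and the threshold are hypothetical; the sketch does not call RDFUnit or a reasoner and does not reproduce the JURION pipeline. It only relies on the general convention that a build step which exits with a non-zero status is treated as failed.

```python
def summarise(test_failures, reasoning_violations):
    """Count findings per source so one summary covers both kinds of check."""
    return {
        "test-case": len(test_failures),
        "reasoner": len(reasoning_violations),
    }


def exit_code(summary, max_allowed=0):
    """Return 0 if the total number of findings is acceptable, otherwise 1."""
    total = sum(summary.values())
    return 0 if total <= max_allowed else 1


def describe(summary):
    """Render a one-line summary suitable for a build log."""
    parts = [f"{source}: {count}" for source, count in sorted(summary.items())]
    return "data quality findings - " + ", ".join(parts)
```

A build script could print `describe(summary)` and then pass `exit_code(summary)` to `sys.exit`, so that either a failed triple-query test or a reasoning violation stops the build.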

Table 1. Evaluating the ALIGNED suite of ontologies

An upcoming feature in JURION is the integration of search requirements with design issues/software bugs arising during their implementation. The goal is to express integrated requirements and issues as linked data, which is semantically annotated using the DIO and DIO-PP ontologies from the ALIGNED suite. This would further enable the development of customised Confluence interfaces which can be used to provide enhanced query features over the integrated data and produce bespoke reports using visual and statistical analytics.

4 Evaluation

Table 1 presents the evaluation of the ALIGNED suite in accordance with the desired criteria (Footnote 14).

5 Related Work

SEON (Footnote 15) is a family of ontologies that describe concepts in the context of software engineering, software evolution and software maintenance. SWO (Footnote 16) is a resource for describing software tools, their types, tasks, versions and provenance. While they cover some general aspects of software engineering, they do not address the description of design intents and software lifecycles. Ontological representations of design intents or design rationales have been developed for various specialised domains such as software engineering [2]; however, there is no generic, domain-independent design intent capture model available as a design pattern. OOPS! [8] is a tool with a catalogue for validating ontologies by spotting common pitfalls; however, it detects design flaws rather than logical errors and does not use an ontology for error reporting. The dcat vocabulary includes the special class Distribution for the representation of the available materialisations of a dataset. These distributions cannot be described further within dcat. The Asset Description Metadata Schema (Footnote 17) (adms) is a profile of dcat, which only describes a specialised class of datasets: so-called Semantic Assets.

6 Conclusions

Combining data and software engineering processes to increase productivity and agility is a challenge faced by many organisations aiming to exploit the benefits of big data. Ontologies and vocabularies developed in accordance with competency questions, objective criteria and ontology engineering principles can provide useful support to data scientists and software engineers undertaking this challenge. In this paper, we have proposed the ALIGNED suite of ontologies, which provides semantic models of design intents, domain-specific datasets, software engineering processes, quality heuristics and error handling mechanisms. The suite contributes towards enabling interoperability and alleviating some of the complexities involved. We have exemplified the usage of the suite on a real-world use case from the legal domain and evaluated it against the desired criteria. As ontologies from the suite are now in various stages of adoption by the ALIGNED use cases, the next steps will include their empirical evaluation.