Enabling Combined Software and Data Engineering at Web-Scale: The ALIGNED Suite of Ontologies

Solanki, Monika; Božić, Bojan; Freudenberg, Markus; Kontokostas, Dimitris; Dirschl, Christian; Brennan, Rob

doi:10.1007/978-3-319-46547-0_21

Conference paper
First Online: 23 September 2016

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9982)

Abstract

Effective, collaborative integration of software and big data engineering for Web-scale systems is now a crucial technical and economic challenge. This requires new combined data and software engineering processes and tools. Semantic metadata standards and linked data principles provide a technical grounding for such integrated systems, given an appropriate model of the domain. In this paper we introduce the ALIGNED suite of ontologies, specifically designed to model the information exchange needs of combined software and data engineering. These ontologies are deployed in web-scale, data-intensive system development environments in both the commercial and academic domains. We exemplify the usage of the suite on a complex collaborative software and data engineering scenario from the legal information system domain.

Resource type: Set of ontologies
Permanent URL: https://github.com/aligned-h2020/ALIGNED_Ontologies

1 Introduction

Recent years have seen a significant increase in the demand for data-intensive applications based on large-scale sources of data. However, our engineering techniques for building data-intensive systems are immature and are often partitioned into separate software engineering and data engineering processes, tasks or teams. There is a need for integrated engineering approaches. The data itself must also be of high quality, which entails a curatorial process to improve and manage data over time. The expressivity of semantic models makes them useful both for addressing data quality [5] and for applying model-driven approaches [3] to software engineering. Semantic data, in the form of enterprise linked data, is also useful for describing, fusing and managing the combined data and software engineering lifecycles to increase productivity, agility and system quality.

In this paper, we present a suite of ontologies developed within the ALIGNED^{Footnote 1} project that aim to align the divergent processes encapsulating data and software engineering. The key aim of the ALIGNED ontology suite is to support the generation of combined software and data engineering processes and tools for improved productivity, agility and quality. The suite contains linked data ontologies/vocabularies designed to: (1) support semantics-based, model-driven software engineering, by documenting additional system context and constraints for RDF-based data or knowledge models in the form of design intents, software lifecycle specifications and data lifecycle specifications; (2) support data quality engineering techniques, by documenting data curation tasks, roles, datasets, workflows and data quality reports at each data lifecycle stage in a data-intensive system; and (3) support the development of tools for unified views of software and data engineering processes and for software/data test case interlinking, by providing the basis for enterprise linked data describing software and data engineering activities (tasks), agents (actors) and entities (artefacts) based on the W3C provenance ontology^{Footnote 2}.
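The activity/agent/entity pattern in point (3) can be illustrated with a minimal sketch. It uses only core PROV-O terms (`prov:Activity`, `prov:Agent`, `prov:Entity`, `prov:used`, `prov:wasGeneratedBy`, `prov:wasAssociatedWith`); the `EX` namespace and all resource names are hypothetical placeholders, not identifiers from the ALIGNED ontologies.

```python
# Illustrative sketch only: EX and the resource names below are hypothetical.
PROV = "http://www.w3.org/ns/prov#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://example.org/aligned-demo/"


def describe_activity(activity, agent, used, generated):
    """Return PROV-O triples linking one engineering activity to its agent and artefacts."""
    a, ag, u, g = (EX + name for name in (activity, agent, used, generated))
    return [
        (a, RDF_TYPE, PROV + "Activity"),
        (ag, RDF_TYPE, PROV + "Agent"),
        (u, RDF_TYPE, PROV + "Entity"),
        (g, RDF_TYPE, PROV + "Entity"),
        (a, PROV + "wasAssociatedWith", ag),  # who carried out the task
        (a, PROV + "used", u),                # input artefact
        (g, PROV + "wasGeneratedBy", a),      # output artefact
    ]


def to_ntriples(triples):
    """Serialise IRI-only triples as N-Triples lines."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


triples = describe_activity("run-data-tests", "ci-server", "dataset-v1", "quality-report-1")
print(to_ntriples(triples))
```

Because both a software build and a data curation step fit the same three-part shape, records of this kind can in principle be queried together, which is the kind of unified view the suite is intended to support.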

This ontology suite has been deployed for validation and incremental improvement in the ALIGNED project on four large-scale, data-intensive systems engineering use cases: the Seshat Global History Databank [10], which is compiling linked data time series relating to all human societies over the past 12,000 years; JURION^{Footnote 3}, a legal information platform developed by Wolters Kluwer Germany; PoolParty^{Footnote 4}, a semantic technology middleware developed by the Semantic Web Company; and the DBpedia+^{Footnote 5} data quality and release processes.

The paper is structured as follows: Sect. 2 presents an overview of the ALIGNED suite. It provides a brief description of the core ontologies in the suite. Section 3 shows how the vocabularies have been applied to a complex collaborative software and data engineering scenario from the legal information system domain. Section 4 presents an evaluation of the ontologies in the suite. Section 5 briefly discusses related work. Finally, Sect. 6 presents conclusions.

2 Overview of the ALIGNED Suite

Figure 1 illustrates the ALIGNED suite of ontologies, split into the provenance, generic, and domain-specific layers. As can be seen from the figure, strong emphasis has been placed on reusing existing, well-known and standardised specifications where available. At the top layer, the W3C provenance standard forms the baseline for all our specifications, and all our models extend it in some way. The split of the ALIGNED ontology suite between a generic layer and a domain-specific extensions layer allows rapid evolution of domain-specific extensions for the ALIGNED use cases/trial environments (JURION, Seshat, DBpedia, PoolParty) based on a stable set of core concepts modelled in the generic layer. As the project progresses, these extensions will be evaluated and incorporated into the generic layer if they prove valuable or more widely applicable than a single domain. Within the project the suite of ontologies is known as the “ALIGNED metamodel” due to the links with software engineering practices.
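The three-layer arrangement can be pictured as a chain of subclass links from a domain-specific term, through a generic-layer term, up to a PROV term. The sketch below assumes this reading; the class names prefixed `jurion:` and `generic:` are invented for illustration and are not taken from the published ontologies.

```python
# Hypothetical subclass links, one per layer boundary (names are illustrative only).
SUBCLASS_OF = {
    "jurion:LegalMetadataExtraction": "generic:DataLifecycleActivity",  # domain -> generic
    "generic:DataLifecycleActivity": "prov:Activity",                   # generic -> provenance
}


def ancestors(cls, subclass_of=SUBCLASS_OF):
    """Follow subclass links upwards and return the chain of superclasses."""
    chain = []
    while cls in subclass_of:
        cls = subclass_of[cls]
        chain.append(cls)
    return chain


print(ancestors("jurion:LegalMetadataExtraction"))
```

Under this arrangement, a tool that only understands the generic or provenance layer can still process domain-specific records, since every domain term resolves to a term it knows.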

We briefly present here the core ontologies from the suite. Further details of the ontologies are available from the individual deployments at their persistent URIs, including the axiomatisations, graphical representations, serialisations in multiple formats via content negotiation, examples illustrating the usage of the ontologies, typical SPARQL queries that can be formulated using the ontologies as the data model, and HTML documentation. Due to space constraints, we do not include these in this paper. The ontologies are grouped as follows:
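Retrieving a particular serialisation through content negotiation amounts to sending an HTTP `Accept` header with the desired media type. The following sketch builds such a request with Python's standard library and does not contact the network; which media types each deployment actually serves is an assumption here, and the `MEDIA_TYPES` table and `negotiation_request` helper are illustrative names.

```python
from urllib.request import Request

# Common RDF media types; whether a given ontology deployment offers each one is assumed.
MEDIA_TYPES = {
    "turtle": "text/turtle",
    "rdfxml": "application/rdf+xml",
    "html": "text/html",
}


def negotiation_request(ontology_uri, serialisation="turtle"):
    """Build (but do not send) an HTTP request asking for one serialisation."""
    return Request(ontology_uri, headers={"Accept": MEDIA_TYPES[serialisation]})


req = negotiation_request("https://w3id.org/dio")
print(req.full_url, req.get_header("Accept"))
```

Passing the request object to `urllib.request.urlopen` would perform the actual fetch, following any redirects issued by the persistent-URI service.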

Design intent: This model is used to document the design decisions about data intensive system artefacts such as requirements, designs or datasets. It is based on the design intent ontology (DIO)^{Footnote 6}, which allows users to express the design intent or design rationale while undertaking the design of an artefact. DIO [9] is a generic ontology that provides the conceptualisation needed to capture the knowledge generated during various phases of the overall design lifecycle. DIO provides definitions for design artefacts such as requirements, designs, design issues, solutions, justifications and evidence, and relationships between them.
Software engineering: This model defines the major agents (e.g. project roles), activities (e.g. lifecycle stages), and entities (design artefacts) involved in a software engineering project and their relations, with a special focus on capturing the engineering lifecycle. Two ontologies make up this model: the software lifecycle ontology (SLO)^{Footnote 7} and the software implementation processes ontology (SIP)^{Footnote 8}.
Data engineering: As software engineering above but with a focus on data engineering and data lifecycles. Two ontologies are used: the data lifecycle ontology (DLO)^{Footnote 9} defined within ALIGNED and the DataID^{Footnote 10} ontology, defined by ALIGNED for the DBpedia association, for describing datasets. DLO provides a set of conceptual entities, agents, activities, and roles to represent the general data engineering process. Furthermore, it is the basis for deriving specific domain ontologies which represent lifecycles of concrete data engineering projects such as DBpedia or Seshat. DataID is a multi-layered meta-data system, which, in its core, describes datasets and their different manifestations, as well as relations to agents like persons or organisations, in regard to their rights and responsibilities. Depending on context, type of data and use case, this core ontology can be augmented by multiple existing extensions (e.g. Linked Data, repository descriptions etc.).
Unified quality reports: Defines a unified reporting representation for data quality metrics, ontology reasoning errors, test cases, and test case results based on the W3C SHACL reporting vocabulary. It is based on four ontologies/vocabularies, three of which are externally developed: W3C SHACL^{Footnote 11}, W3C Data Quality^{Footnote 12}, and the University of Leipzig’s test-driven RDF validation ontology [5] (RUT); and one ontology developed within ALIGNED: the reasoning violations ontology (RVO)^{Footnote 13}. RUT is designed to capture the lifecycle of RDF validation with the test-driven validation methodology. It is implemented by the RDFUnit tool. RVO describes both ABox and TBox reasoning errors for the integration of reasoners into data lifecycle tool-chains. The ontology covers violations of the OWL 2 direct semantics and syntax detected on both the schema and instance level over the full range of OWL 2 and RDFS language constructs. An overview of RVO and its design, implementation and use cases has been published in [1].
Domain data model: This describes the domain of the data-intensive application being developed and is specific to that application, e.g. the Seshat ontology for historical time-series describing human societies.

The lower layer includes the domain-specific extensions to the metamodels. ALIGNED has developed four domain-specific metamodels based on each of our use cases, with a focus on model elements needed for the ALIGNED phase 2 trials:
Enterprise information processing: extensions and models for the JURION use case.
E-research in the Social Sciences and Humanities: extensions and models for the Seshat use case.
Crowd-sourced public datasets: extensions and models for the DBpedia use case.
Enterprise software development: extensions and models for the PoolParty use case.
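The idea behind the unified quality reports group, that findings from different tools are expressed in one common shape, can be sketched as follows. The field names loosely mirror SHACL's validation-result properties (focus node, severity, message), but the `QualityResult` class, the input dictionaries and the rule that every reasoning error counts as a violation are assumptions made for this example, not part of the ALIGNED specifications.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class QualityResult:
    source: str      # which kind of tool produced the finding (hypothetical labels)
    focus_node: str  # the resource the finding is about
    severity: str    # "Violation", "Warning" or "Info"
    message: str


def from_test_failure(failure):
    """Map a (hypothetical) test-case failure record to the common shape."""
    return QualityResult("test-case", failure["resource"], failure["level"], failure["description"])


def from_reasoning_error(error):
    """Map a (hypothetical) reasoner error record; assumed to always be a violation."""
    return QualityResult("reasoner", error["subject"], "Violation", error["explanation"])


def unified_report(test_failures, reasoning_errors):
    """Merge findings from both tool types into one list."""
    results = [from_test_failure(f) for f in test_failures]
    results += [from_reasoning_error(e) for e in reasoning_errors]
    return results


report = unified_report(
    [{"resource": "ex:doc1", "level": "Warning", "description": "missing title"}],
    [{"subject": "ex:doc2", "explanation": "instance of two disjoint classes"}],
)
severity_counts = Counter(r.severity for r in report)
print(severity_counts)
```

Once every finding has the same fields, downstream dashboards or build gates need only one code path regardless of which tool raised the issue.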

3 Example Deployment: The ALIGNED Suite in Wolters Kluwer’s JURION

JURION is an innovative legal information platform developed by Wolters Kluwer Germany that merges and interlinks over 1 million documents of content and data from diverse sources such as national and European legislation and court judgements, extensive internally authored content and local customer data, as well as social media and semantic web data (e.g. from DBpedia). This data is then presented to users (such as law offices) in the form of highly customised applications for semantic search, annotation, case management and legal information retrieval.

Currently, the software development process and the data lifecycle are highly independent of each other. Figure 2 illustrates where ontologies from the ALIGNED suite contribute towards facilitating interoperability between the software and data engineering processes and tools used to build and maintain JURION. The two main uses are tool integration and unified governance. Tool integration includes both cases within a single domain (data or software engineering) and cross-domain tool-chain integration. Unified governance uses ALIGNED provenance records, data extraction and uplift from enterprise engineering tools, and data fusion to provide end-to-end and cross-domain views of the JURION platform engineering processes. We elaborate on the deployment of ALIGNED ontologies for these use cases below.

RUT has been used in JURION for validating and verifying the extraction of metadata [6]. In particular, RDFUnit is used as a data validation tool integrated in JURION’s continuous integration (CI) platform (Jenkins). RVO, the reasoning violations ontology, has been used to integrate advanced OWL reasoning-based data quality checks with RDFUnit’s triple-query oriented tests, expanding the scope of testing possible. DataID descriptors of all the JURION datasets are under evaluation, and it is planned to use them to provide consistent metadata that will be available to all tools, thus facilitating further integration. The EIP (enterprise information processing) ontology has been used to describe the JURION environment, systems, artefacts and engineering processes in terms of the ALIGNED software and data lifecycle models.
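The way a data test can act as a gate in a CI build can be sketched in a few lines. This is not RDFUnit's implementation or API; it is a hand-written stand-in for one triple-pattern test ("every resource of a given class must have a given property"), and the `ex:` data and the `missing_property` and `ci_exit_code` helpers are hypothetical.

```python
# Hypothetical extracted metadata, as (subject, predicate, object) triples.
DATA = [
    ("ex:doc1", "rdf:type", "ex:Judgement"),
    ("ex:doc1", "ex:title", "Case A"),
    ("ex:doc2", "rdf:type", "ex:Judgement"),  # no title: should be flagged
]


def missing_property(triples, rdf_class, prop):
    """Return instances of rdf_class that have no value for prop."""
    typed = {s for s, p, o in triples if p == "rdf:type" and o == rdf_class}
    with_prop = {s for s, p, o in triples if p == prop}
    return sorted(typed - with_prop)


def ci_exit_code(violations):
    """A non-zero exit code is the usual way to make a CI job fail."""
    return 1 if violations else 0


violations = missing_property(DATA, "ex:Judgement", "ex:title")
print(violations, ci_exit_code(violations))
```

In a real pipeline the script would end with `sys.exit(ci_exit_code(violations))`, so that a failed data test stops the build in the same way a failed unit test does.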

Table 1. Evaluating the ALIGNED suite of ontologies

An upcoming feature in JURION is the integration of search requirements with design issues/software bugs arising during their implementation. The goal is to express integrated requirements and issues as linked data, which is semantically annotated using the DIO and DIO-PP ontologies from the ALIGNED suite. This would further enable the development of customised Confluence interfaces which can be used to provide enhanced query features over the integrated data and produce bespoke reports using visual and statistical analytics.
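A report over integrated requirements and issues could, for instance, list the open issues attached to each requirement. The sketch below shows that query over an in-memory list of triples; the predicates `ex:hasIssue` and `ex:status` and all identifiers are placeholders chosen for the example and are not DIO terms.

```python
from collections import defaultdict

# Hypothetical linked data connecting search requirements to implementation issues.
TRIPLES = [
    ("ex:req-search-1", "ex:hasIssue", "ex:bug-101"),
    ("ex:req-search-1", "ex:hasIssue", "ex:bug-102"),
    ("ex:req-search-2", "ex:hasIssue", "ex:bug-103"),
    ("ex:bug-101", "ex:status", "open"),
    ("ex:bug-102", "ex:status", "closed"),
    ("ex:bug-103", "ex:status", "open"),
]


def open_issues_by_requirement(triples):
    """Group issues whose status is 'open' under the requirement they relate to."""
    status = {s: o for s, p, o in triples if p == "ex:status"}
    report = defaultdict(list)
    for s, p, o in triples:
        if p == "ex:hasIssue" and status.get(o) == "open":
            report[s].append(o)
    return dict(report)


print(open_issues_by_requirement(TRIPLES))
```

With the data held in a triple store, the same grouping would normally be expressed as a SPARQL query rather than in application code.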

4 Evaluation

Table 1 presents the evaluation of the ALIGNED suite in accordance with the desired criteria^{Footnote 14}.

5 Related Work

SEON^{Footnote 15} is a family of ontologies that describe concepts in the context of software engineering, software evolution and software maintenance. SWO^{Footnote 16} is a resource for describing software tools, their types, tasks, versions and provenance. While they cover some general aspects of software engineering, they do not address the description of design intents and software lifecycles. Design intents or design rationales have been represented as ontologies for various specialised domains such as software engineering [2]; however, there is no generic, domain-independent design intent capture model available as a design pattern. OOPS! [8] is a tool with a catalogue for validating ontologies by spotting common pitfalls; however, it detects design flaws rather than logical errors and does not use an ontology for error reporting. The dcat vocabulary includes the special class Distribution for the representation of the available materialisations of a dataset. These distributions cannot be described further within dcat. The Asset Description Metadata Schema^{Footnote 17} (adms) is a profile of dcat, which only describes a specialised class of datasets: so-called Semantic Assets.

6 Conclusions

Combining data and software engineering processes to increase productivity and agility is a challenge faced by many organisations aiming to exploit the benefits of big data. Ontologies and vocabularies developed in accordance with competency questions, objective criteria and ontology engineering principles can provide useful support to data scientists and software engineers undertaking the challenge. In this paper we have proposed the ALIGNED suite of ontologies, which provide semantic models of design intents, domain-specific datasets, software engineering processes, quality heuristics and error handling mechanisms. The suite contributes towards enabling interoperability and alleviating some of the complexities involved. We have exemplified the usage of the suite on a real-world use case from the legal domain and evaluated it against the desired criteria. As ontologies from the suite are now in various stages of adoption by the ALIGNED use cases, the next steps will include their empirical evaluation.

Notes

1. http://aligned-project.eu
2. http://www.w3.org/ns/prov-o
3. https://www.jurion.de/
4. https://www.poolparty.biz/
5. http://wiki.dbpedia.org/
6. https://w3id.org/dio
7. https://w3id.org/slo
8. https://w3id.org/sip
9. https://w3id.org/dlo
10. http://dataid.dbpedia.org/ns/core#
11. https://www.w3.org/TR/shacl/
12. https://www.w3.org/TR/vocab-dqv/
13. https://w3id.org/rvo
14. https://figshare.com/articles/ISWC2016_Resources_Track_Review_Instructions/2016852
15. http://se-on.org/#publications
16. http://theswo.sourceforge.net/
17. https://www.w3.org/TR/vocab-adms/

References

1. Bozic, B., Brennan, R., Feeney, K., Mendel-Gleason, G.: Describing reasoning results with RVO, the reasoning violations ontology. In: ESWC 2016 (2016, to appear)
2. de Medeiros, A.P., Schwabe, D., Feijó, B.: Kuaba ontology: design rationale representation and reuse in model-based designs. In: Delcambre, L., Kop, L., Mayr, H.C., Mylopoulos, J., Pastor, O. (eds.) Conceptual Modeling - ER 2005. LNCS, vol. 3716, pp. 241–255. Springer, Heidelberg (2005)
3. Gasevic, D., Djuric, D., Devedzic, V.: Model Driven Engineering and Ontology Development, 2nd edn. Springer, Heidelberg (2009)
4. Gruber, T.R.: Toward principles for the design of ontologies used for knowledge sharing. Int. J. Hum. Comput. Stud. 43(5–6), 907–928 (1995)
5. Kontokostas, D., Brümmer, M., Hellmann, S., Lehmann, J., Ioannidis, L.: NLP data cleansing based on linguistic ontology constraints. In: Presutti, V., d’Amato, C., Gandon, F., d’Aquin, M., Staab, S., Tordai, A. (eds.) ESWC 2014. LNCS, vol. 8465, pp. 224–239. Springer, Heidelberg (2014). doi:10.1007/978-3-319-07443-6_16
6. Kontokostas, D., Mader, C., Dirschl, C., Eck, K., Leuthold, M., Lehmann, J., Hellmann, S.: Semantically enhanced quality assurance in the JURION business use case. In: Sack, H., Blomqvist, E., d’Aquin, M., Ghidini, C., Ponzetto, S.P., Lange, C. (eds.) ESWC 2016. LNCS, vol. 9678, pp. 661–676. Springer, Heidelberg (2016). doi:10.1007/978-3-319-34129-3_40
7. Noy, N.F., McGuinness, D.L.: Ontology development 101: a guide to creating your first ontology. Technical report, Stanford Center for Biomedical Informatics Research (BMIR) (2001)
8. Poveda-Villalón, M., Suárez-Figueroa, M.C., Gómez-Pérez, A.: Validating ontologies with OOPS! In: Teije, A., et al. (eds.) EKAW 2012. LNCS (LNAI), vol. 7603, pp. 267–281. Springer, Heidelberg (2012). doi:10.1007/978-3-642-33876-2_24
9. Solanki, M.: A pattern for capturing the intents underlying designs. In: Proceedings of the 6th Workshop on Ontology and Semantic Web Patterns (WOP 2015), vol. 1461. CEUR-WS.org (2015)
10. Turchin, P., Brennan, R., Currie, T., Feeney, K., Francois, P., Hoyer, D., Manning, J., Marciniak, A., Mullins, D., Palmisano, A., Peregrine, P., Turner, E.A., Whitehouse, H.: Seshat: the global history databank. Cliodynamics 6(1), 77–107 (2015)

Acknowledgment

This research has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 644055, the ALIGNED project (www.aligned-project.eu).

Author information

Authors and Affiliations

Department of Computer Science, University of Oxford, Oxford, UK
Monika Solanki
KDEG, School of Computer Science and Statistics, Trinity College, Dublin, Dublin, Ireland
Bojan Božić & Rob Brennan
AKSW/KILT, University of Leipzig, Leipzig, Germany
Markus Freudenberg & Dimitris Kontokostas
Wolters Kluwer, Munich, Germany
Christian Dirschl

Corresponding author

Correspondence to Monika Solanki.

Editor information

Editors and Affiliations

Elsevier Labs, Amsterdam, The Netherlands
Paul Groth
University of Southampton, Southampton, United Kingdom
Elena Simperl
Heriot-Watt University, Edinburgh, United Kingdom
Alasdair Gray
Vienna University of Technology, Vienna, Austria
Marta Sabou
Technische Universität Dresden, Dresden, Germany
Markus Krötzsch
IBM Research Ireland, Dublin 4, Ireland
Freddy Lecue
GESIS-Leibniz Institute for the Social Sciences, Köln, Germany
Fabian Flöck
University of Southern California, Marina del Rey, California, USA
Yolanda Gil

Cite this paper

Solanki, M., Božić, B., Freudenberg, M., Kontokostas, D., Dirschl, C., Brennan, R. (2016). Enabling Combined Software and Data Engineering at Web-Scale: The ALIGNED Suite of Ontologies. In: Groth, P., et al. The Semantic Web – ISWC 2016. ISWC 2016. Lecture Notes in Computer Science, vol 9982. Springer, Cham. https://doi.org/10.1007/978-3-319-46547-0_21

DOI: https://doi.org/10.1007/978-3-319-46547-0_21
Published: 23 September 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-46546-3
Online ISBN: 978-3-319-46547-0
eBook Packages: Computer Science, Computer Science (R0)
