Resource type: Dataset

Permanent URL: https://datahub.io/dataset/lodstats

1 Introduction

Over the past years, the size of the Data Web has increased significantly, which makes obtaining general insights into its growth and structure both more challenging and more desirable. The expansion of the Data Web can be attributed to a large extent to the efforts of the Semantic Web and Open Government communities. Both communities have a common goal: to provide 5-star (Footnote 1) RDF datasets to end users. To achieve this goal, the Semantic Web community introduced a number of requirements that a dataset must fulfil in order to be included in the LOD Cloud (Footnote 2). The Semantic Web community has a main dataset registry hub, the datahub (Footnote 3) data catalog, while Open Government initiatives usually distribute RDF datasets through their own data catalogs (e.g. data.gov, publicdata.eu and open.canada.ca).

All of the mentioned data catalogs use CKAN, an open-source data portal platform that is a de facto standard for Open Data. CKAN provides a solid framework to organize datasets and to expose metadata about them in various formats, including RDF. However, CKAN does not provide analytics over the registered datasets and depends heavily on user input. Moreover, no single aggregation point exists. These factors limit the possibility of obtaining general insights into the Data Web, and the lack of such insights hinders important data management tasks such as quality, privacy and coverage analysis.

For this reason, several attempts to analyze the Data Web have been made previously. SPARQL Endpoint Status (SPARQLES, Footnote 4) [3] addresses the availability of SPARQL endpoints over time. SPARQLES aggregates 553 SPARQL endpoints and exposes information on their availability and features (e.g. support for SPARQL 1.0/1.1, availability of VoID/Service descriptions). Linked Open Vocabularies (LOV, Footnote 5) [6] is a project for building an RDF vocabulary ecosystem that supports the reuse of vocabulary terms. LOV aggregates vocabularies from various publishers and establishes relationships between them using the VOAF vocabulary. The project has collected 548 vocabularies (e.g. DCMI Metadata Terms, Friend of a Friend and others) and enables vocabulary search using metrics derived from the analysis of the vocabularies and their relationships. The vocab.cc project attempted to fill the gap in vocabulary usage statistics. Based on the Billion Triples Challenge (BTC) dataset of 2012, vocab.cc introduced four metrics to evaluate that dataset. However, the project has a limited scope (i.e. it is restricted to the BTC dataset) and was a one-shot evaluation, and therefore does not provide sustainable statistics over time.

In this paper, we address the above-described gap in the analysis of the Data Web. We present the LODStats dataset, which provides a comprehensive picture of the current state of a significant part of the Data Web. At the time of writing, LODStats aggregates 9960 RDF datasets from the data.gov, publicdata.eu and datahub.io data catalogs. For each RDF dataset, LODStats collects comprehensive statistics adhering to the RDF data model. This analysis has been regularly published and enhanced over the past five years at http://lodstats.aksw.org. We extend our previous work [4, 5] as follows: (i) we include the data.gov and publicdata.eu data catalogs, which account for 45 % of the RDF datasets, (ii) we publish the LDSO vocabulary, which describes the LODStats data schema, and (iii) we enrich the dataset with CKAN metadata. Overall, our contributions are as follows:

  • We provide a 5-star RDF dataset containing statistical facts about the Data Web, which is interlinked with CKAN metadata.

  • We showcase the usage of the dataset via five use case descriptions.

  • We describe insights into the Data Web gained from the analysis of the LODStats dataset.

  • We have maintained LODStats over the past five years, delivering a sustainable solution to the Semantic Web community.

The rest of the paper is structured as follows: Sect. 2 introduces the LODStats web application, Sect. 3 outlines the design of the LODStats dataset, Sect. 4 describes the use cases supported by the dataset, Sect. 5 presents the interfaces to access the dataset, Sect. 6 discusses the insights gained from the Data Web analysis, and Sect. 7 concludes and outlines future work.

2 LODStats: Web Scale RDF Data Analytics

In this section, we briefly outline the inner workings of the LODStats application and show the evolution of the technical solution.

A general overview of the LODStats architecture is depicted in Fig. 1. The LODStats Statistics Evaluation (LSE) module executes the statistical metrics on a dataset and is described in more detail in previous work [4, 5] (Footnote 6). In this paper, we introduce the following new modules. To aggregate the datasets from the data catalogs, we implemented the CKAN Aggregator (Footnote 7). The Messaging Broker (Footnote 8) makes it possible to schedule processing and to scale it horizontally (i.e. to distribute dataset processing between LSE modules running in parallel).
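The horizontal scaling described above can be illustrated with a minimal Python sketch. It is not the LODStats implementation: an in-process queue stands in for the messaging broker, threads stand in for parallel LSE modules, and the function evaluate is a hypothetical placeholder for a real statistics run.

```python
import queue
import threading


def evaluate(dataset_url):
    """Hypothetical stand-in for one LSE run; returns a dummy result record."""
    return {"dataset": dataset_url, "status": "evaluated"}


def run_workers(dataset_urls, n_workers=3):
    """Distribute dataset URLs over n_workers threads via a shared queue."""
    tasks = queue.Queue()
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            url = tasks.get()
            if url is None:  # sentinel: no more work for this worker
                break
            record = evaluate(url)
            with lock:  # protect the shared result list
                results.append(record)

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for url in dataset_urls:
        tasks.put(url)
    for _ in threads:
        tasks.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    return results
```

Adding workers increases throughput without changing the producer side, which is the property a broker-based design is meant to provide; a real deployment would replace the in-process queue with a networked broker.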

We provide interfaces both for human users and machine agents. The RDB2RDF (Footnote 9) module provides virtual RDF views, accessible through the LODStats SPARQL endpoint, for consumption by machine agents. For human users, a web front-end is available at http://lodstats.aksw.org.

Fig. 1. LODStats architecture overview.

Moreover, we make a Docker image of the whole system publicly available (Footnote 10). With the LODStats Docker image, the application can be deployed on any Docker-enabled host with a single command, namely docker-compose up -d.

3 Dataset Modelling

In this section, we describe the LODStats DataSet vOcabulary (LDSO) (Footnote 11), depicted in Fig. 2. We designed LDSO as an extension of the Data Catalog Vocabulary (DCAT) [7] and the Vocabulary of Interlinked Datasets (VoID) [1], following the best practices for vocabulary design, preservation and governance described in [2, 6]. In the following, we describe the structure of the vocabulary.

The ldso:Dataset class is a representation of a dataset from a CKAN data catalog. To model ldso:Dataset, we therefore extend dcat:Dataset by adding the ldso:active property and reusing general metadata properties such as dc:identifier and dc:modified. ldso:active is a boolean property that separates up-to-date datasets (i.e. those still existing in the CKAN data catalog) from out-dated ones. ldso:Dataset connects to the data.gov, publicdata.eu and datahub.io data portals (ldso:CkanCatalog) via the dc:isPartOf property. We also interlink instances of ldso:Dataset with the corresponding RDF representations in the data portals using owl:sameAs. To process a ldso:Dataset, the LODStats application uses the value of the dcat:downloadURL property to retrieve dumps. Subsequently, a ldso:Dataset is linked directly to the last evaluation result via ldso:currentStats. The modelling of ldso:Dataset instances supports, for example, the following queries: (i) How many RDF datasets are in a particular CKAN data catalog?, (ii) What is the ratio between out-dated and up-to-date datasets?, (iii) Who is the dataset maintainer and what is her email address?
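Queries (i) and (ii) can be illustrated with a small Python sketch over in-memory records. The records are invented sample data, and representing a ldso:Dataset as a dictionary keyed by property name is an assumption made only for this illustration.

```python
from collections import Counter

# Hypothetical sample records; keys mirror the properties named in the text.
datasets = [
    {"dc:identifier": "a", "dc:isPartOf": "datahub.io", "ldso:active": True},
    {"dc:identifier": "b", "dc:isPartOf": "datahub.io", "ldso:active": False},
    {"dc:identifier": "c", "dc:isPartOf": "data.gov", "ldso:active": True},
    {"dc:identifier": "d", "dc:isPartOf": "publicdata.eu", "ldso:active": True},
]


def datasets_per_catalog(records):
    """Query (i): number of datasets registered in each catalog."""
    return Counter(r["dc:isPartOf"] for r in records)


def outdated_to_active_ratio(records):
    """Query (ii): ratio of out-dated to up-to-date datasets."""
    active = sum(1 for r in records if r["ldso:active"])
    outdated = len(records) - active
    if active == 0:
        raise ValueError("no up-to-date datasets to compare against")
    return outdated / active
```

Against the real dataset, the same questions would be posed as SPARQL aggregate queries over ldso:Dataset instances rather than over Python dictionaries.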

A ldso:StatResult represents a single evaluation result for a ldso:Dataset. ldso:StatResult extends void:Dataset by adding a set of statistical metrics in the LDSO namespace, such as ldso:literals, ldso:blanks and ldso:subclasses. We connect ldso:StatResult to ldso:Dataset using the foaf:primaryTopic property. The VoID vocabulary introduces the concept of property and class partitions, which represent the subsets of a dataset that use particular properties or classes. We extend this design pattern by introducing new partitions based on datatypes, vocabularies and languages. We interlink ldso:StatResult instances with the VoID descriptions of the datasets, which are generated automatically on dataset evaluation. The modelling of ldso:StatResult allows, for example, the following queries: (i) How many triples (literals, blanks, subclasses) are contained in the dataset?, (ii) How many triples in the dataset adhere to a particular vocabulary (language, datatype)?, (iii) What is the size of the dataset dump (in bytes)?
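The idea of a vocabulary partition can be sketched in Python as follows. The sketch assumes that the vocabulary of a triple is taken from the namespace of its predicate and that the namespace is the IRI up to the last '#' or '/'; whether LODStats derives partitions in exactly this way is not stated here, and the sample triples are invented.

```python
from collections import Counter


def namespace(iri):
    """Return the IRI up to and including its last '#' or '/'."""
    cut = max(iri.rfind("#"), iri.rfind("/"))
    return iri[: cut + 1] if cut >= 0 else iri


def vocabulary_partition(triples):
    """Count triples per vocabulary, using the namespace of the predicate."""
    return Counter(namespace(p) for _, p, _ in triples)


# Hypothetical sample triples as (subject, predicate, object) strings.
triples = [
    ("http://example.org/alice", "http://xmlns.com/foaf/0.1/name", '"Alice"'),
    ("http://example.org/alice", "http://xmlns.com/foaf/0.1/knows",
     "http://example.org/bob"),
    ("http://example.org/alice",
     "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
     "http://xmlns.com/foaf/0.1/Person"),
]
```

Datatype and language partitions would follow the same pattern, grouping by the datatype IRI or the language tag of literal objects instead of by the predicate namespace.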

4 Relevance of the Dataset

A comprehensive statistical analysis of the datasets made available on the Web of Data facilitates a number of important use cases (UC) and provides crucial benefits. These include:

Vocabulary Reuse (UC1). One of the advantages of semantic technologies is that they simplify data integration via common vocabularies. However, it is often difficult to identify relevant vocabulary elements. The LODStats web interface stores the usage frequency of vocabulary elements (e.g. property usage count in [4]) and provides search functionality. This allows knowledge engineers to find the most frequently used schema elements, which can be used to model the task at hand. Such functionality encourages the reuse of schema elements and therefore simplifies data integration. LODStats also provides a web service for this functionality, so that third-party tools can easily integrate search for similar classes and properties. For instance, Linked Open Vocabularies (LOV) (Footnote 12) uses the vocabulary usage frequency as an indicator that shows users the popularity of a specific vocabulary inside the Linked Open Vocabularies catalogue.
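A frequency-ranked search of this kind can be sketched in a few lines of Python. The usage counts below are invented, and simple substring matching on the IRI is an assumption; the actual LODStats search may work differently.

```python
def search_properties(usage_counts, keyword, limit=5):
    """Return (IRI, count) pairs whose IRI contains keyword, most used first."""
    key = keyword.lower()
    hits = [(iri, n) for iri, n in usage_counts.items() if key in iri.lower()]
    # Sort by descending count; break ties alphabetically for stable output.
    hits.sort(key=lambda pair: (-pair[1], pair[0]))
    return hits[:limit]


# Hypothetical usage counts (number of triples using each property).
usage_counts = {
    "http://xmlns.com/foaf/0.1/name": 1200,
    "http://www.w3.org/2000/01/rdf-schema#label": 5400,
    "http://schema.org/name": 800,
    "http://purl.org/dc/terms/title": 3100,
}
```

A knowledge engineer looking for a naming property would then see the candidates ordered by how widely they are already used, which favours reuse of established terms.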

Fig. 2. LODStats vocabulary schema.

Quality analysis (UC2). A major problem when using Web Data is quality. However, the quality of the datasets themselves is not so much the problem as assessing the expected quality and deciding whether it is sufficient for a certain application. The traditional Web also exhibits widely varying quality, but means have been established (e.g. PageRank) to assess the quality of information on the document web. In order to establish similar measures on the Web of Data, it is crucial to assess datasets with regard to incoming and outgoing links, but also with regard to the vocabularies and properties used, adherence to property range restrictions, property values, etc. The links can be used directly as a data quality indicator (e.g. more links suggest better quality). The other metrics can, for example, be compared over time and between datasets. Hence, a statistical analysis of datasets can provide important insights with regard to the quality that can be expected.

Coverage analysis (UC3). As important as quality is the coverage that a certain dataset provides. LODStats can be used to compute several coverage dimensions. For instance, the most frequent properties of a particular dataset can be computed to give an overview of the instance data, e.g. whether it contains address information (i.e. vcard:adr usage count > 0). Furthermore, the frequency of namespaces may also be an indicator of the domain of a dataset. The ranges of properties can give insights into whether spatial or temporal information is present in the dataset. In the case of spatial data, for example, we would like to know the region the dataset covers, which can easily be derived from the minimum, maximum and average of the longitude and latitude properties.
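Deriving the covered region from coordinate statistics can be sketched as follows. The input is assumed to be a list of (latitude, longitude) pairs already extracted from the dataset; the sample points are invented.

```python
def coverage_region(points):
    """Return (min, max, mean) of latitude and longitude for (lat, long) pairs."""
    if not points:
        raise ValueError("no coordinates given")
    lats = [lat for lat, _ in points]
    longs = [lon for _, lon in points]
    return {
        "lat": (min(lats), max(lats), sum(lats) / len(lats)),
        "long": (min(longs), max(longs), sum(longs) / len(longs)),
    }


# Hypothetical sample coordinates.
points = [(51.0, 12.0), (52.0, 13.0), (53.0, 14.0)]
region = coverage_region(points)
```

The minimum and maximum values span a bounding box, while the mean gives a rough centre of the data. The sketch ignores datasets whose coordinates wrap around the 180° meridian.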

Privacy analysis (UC4). To decide quickly whether a dataset that potentially contains personal information can be published on the Data Web, we need a swift overview of the information contained in the dataset without looking at every individual data record. An analysis and summary of all the properties and classes used in a dataset (e.g. whether the dataset uses the vcard vocabulary) can quickly reveal the type of information and thus help to prevent the violation of privacy rules.
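Such a check can be sketched in Python over per-property usage counts. The choice of the vCard and FOAF namespaces as indicators of personal data is an assumption made for this illustration, as are the sample counts.

```python
# Namespaces treated as indicators of personal data (an assumption).
PERSONAL_NAMESPACES = (
    "http://www.w3.org/2006/vcard/ns#",
    "http://xmlns.com/foaf/0.1/",
)


def personal_data_indicators(property_usage):
    """Return the used properties that belong to a watched namespace."""
    return sorted(
        iri
        for iri, count in property_usage.items()
        if count > 0 and iri.startswith(PERSONAL_NAMESPACES)
    )


# Hypothetical usage counts for one dataset.
property_usage = {
    "http://www.w3.org/2006/vcard/ns#adr": 12,
    "http://xmlns.com/foaf/0.1/mbox": 0,
    "http://purl.org/dc/terms/title": 40,
}
```

A non-empty result does not prove that personal data is present; it only flags the dataset for a closer manual review before publication.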

Link target identification (UC5). Establishing links between datasets is a fundamental requirement for many Linked Data applications (e.g. data integration and fusion). However, as we learned, the Web of Linked Data currently still lacks coherence (with less than 10 % of the entities actually being linked). Meanwhile, a number of tools are available that support the automatic generation of links (e.g. [8, 9]). An obstacle to the broad use of these tools is, however, the difficulty of identifying suitable link targets on the Data Web. Attaching proper statistics about the internal structure of a dataset (in particular about the vocabularies, properties, etc. used) makes it dramatically simpler to identify suitable target datasets for linking. For example, the use of longitude and latitude properties in a dataset indicates that this dataset might be a good candidate for linking spatial objects. If we additionally know the minimum, maximum and average values for these properties, we can even identify datasets that are suitable link targets for a certain region.
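The region-based selection of link targets can be sketched as a bounding-box overlap test. The box layout (min_lat, min_long, max_lat, max_long), the dataset names and the coordinates are all hypothetical; datasets without longitude and latitude properties are represented by None.

```python
def boxes_overlap(a, b):
    """Test two boxes given as (min_lat, min_long, max_lat, max_long)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def candidate_targets(region, dataset_boxes):
    """Return names of datasets whose bounding box overlaps the region."""
    return sorted(
        name
        for name, box in dataset_boxes.items()
        if box is not None and boxes_overlap(region, box)
    )


# Hypothetical bounding boxes derived from min/max coordinate statistics.
dataset_boxes = {
    "geo-a": (51.0, 12.0, 53.0, 14.0),
    "geo-b": (40.0, -5.0, 44.0, 3.0),
    "no-geo": None,  # dataset without longitude/latitude properties
}
targets = candidate_targets((50.0, 10.0, 54.0, 15.0), dataset_boxes)
```

The overlap test is only a coarse filter: it narrows down the candidates that a link discovery tool then has to examine in detail, and it does not handle boxes that cross the 180° meridian.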

5 Availability, Interfaces and Sustainability

In this section, we describe the interfaces to access the dataset as well as how we support sustainability. We publish our dataset in the datahub.io data catalog (Footnote 13). The datahub.io entry for LODStats includes:

  • VoID description. Machine readable description of the dataset.

  • LDSO vocabulary. LODStats Dataset Vocabulary.

  • LODStats SPARQL endpoint. SPARQL endpoint for the application.

  • LODStats RDF dump. The RDF dump of LODStats dataset (April 2016).

  • VoID descriptions RDF dump. Automatically generated VoID descriptions from the LODStats application (April 2016).

  • Data.gov, PublicData.eu, Datahub.io RDF dumps. RDF dumps of the crawled data catalogs (April 2015).

The SPARQL endpoint serves the latest output of the RDB2RDF module and exposes up-to-date data. We announced the LODStats dataset on public Semantic Web lists and created a Web forum (Footnote 14) to support community feedback. The sustainability of LODStats is demonstrated by the following: (i) the LODStats project has been running for the last five years, (ii) an evaluation of the state of the Data Web has been performed every two months, or at least once every half year, during this period, and (iii) the last evaluation was performed just recently.

6 Data Web Statistics Summary

In this section, we provide a brief overview of the insights into the Data Web, based on the statistics collected over the past five years for the RDF dumps. General current statistics such as the number of triples, entities, literals, etc. are available on the LODStats web portal (Footnote 15).

Over the past five years, the number of datasets has increased from 422 in 2011 to 9644 in 2015. The sharp increase in the number of datasets occurred in 2015, when we included the Open Government data catalogs in LODStats. However, only a small part of the overall number of triples can be attributed to the governmental data catalogs: 1 % for the PublicData.eu and 3 % for the Data.gov portals. This can be explained by the fact that Open Government portals publish short documents, such as monthly energy consumption or salary rates for governmental facilities. The connectedness of the Data Web has increased to 40 % since 2011, when only 3 % of the overall number of triples were links between different datasets.
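The connectedness figure, i.e. the share of triples that link different datasets, can be approximated with a short Python sketch. Treating two IRIs as belonging to different datasets when their hosts differ is an assumption of this sketch, not necessarily the criterion used by LODStats; the sample triples are invented.

```python
from urllib.parse import urlsplit


def host(term):
    """Return the host of an http(s) IRI, or None for literals and blank nodes."""
    if term.startswith(("http://", "https://")):
        return urlsplit(term).netloc
    return None


def link_share(triples):
    """Fraction of triples whose subject and object IRIs have different hosts."""
    if not triples:
        return 0.0
    links = 0
    for s, _, o in triples:
        hs, ho = host(s), host(o)
        if hs is not None and ho is not None and hs != ho:
            links += 1
    return links / len(triples)


P = "http://www.w3.org/2002/07/owl#sameAs"
# Hypothetical sample triples as (subject, predicate, object) strings.
sample = [
    ("http://a.example/x", P, "http://b.example/y"),  # cross-host link
    ("http://a.example/x", P, "http://a.example/z"),  # same host
    ("http://a.example/x", P, '"a literal"'),         # literal object
    ("_:b0", P, "http://a.example/x"),                # blank node subject
]
```

Only the first sample triple counts as a link, so the share for this sample is one quarter.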

Further Data Web statistics can be accessed via the LODStats SPARQL endpoint (Footnote 16). For instance, the datasets in 2011 can be requested as follows:

figure a
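A request of this kind could be assembled roughly as in the following Python sketch, which is not the authors' listing. The LDSO namespace IRI and the endpoint URL are passed in as parameters because they are not given here, the values used below are placeholders, binding the dc prefix to the DCMI Terms namespace is an assumption, and selecting datasets by the year of dc:modified is only one plausible reading of "datasets in 2011".

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_query(year, ldso_ns):
    """Build a SPARQL query for datasets whose dc:modified falls in year."""
    return f"""PREFIX ldso: <{ldso_ns}>
PREFIX dc: <http://purl.org/dc/terms/>
SELECT ?dataset WHERE {{
  ?dataset a ldso:Dataset ;
           dc:modified ?modified .
  FILTER (YEAR(?modified) = {int(year)})
}}"""


def request_url(endpoint, query):
    """Encode the query as a GET request (SPARQL 1.1 Protocol, 'query' parameter)."""
    return endpoint + "?" + urlencode({"query": query})


# Placeholder values; replace with the real LDSO namespace and endpoint.
query = build_query(2011, "http://example.org/ldso#")
url = request_url("http://example.org/sparql", query)
```

The resulting URL can be fetched with any HTTP client; the sketch stops short of sending the request so that it stays independent of the availability of a particular endpoint.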

7 Conclusion and Future Work

We presented LODStats – The Data Web Census Dataset, which exposes statistics about the Data Web over the last five years. We exposed the dataset via a SPARQL endpoint and as an RDF dump, providing a single point of access in the DataHub.io data catalog. We created a mailing list to collect feedback from the community and announced the dataset on the major mailing lists.

In the future, we will process very large datasets with hundreds of millions of triples or more, which are expensive to process on a single machine. Additionally, we plan to include metrics for data streams (standing queries and observing their change over time) and to extend the metrics to compute complex graph properties and properties related to inference. The timestamps of all individual measurements are available as RDF data via the SPARQL endpoint, which we plan to use to provide timeline views for the different statistics available via LODStats.