LODStats: The Data Web Census Dataset

Ermilov, Ivan; Lehmann, Jens; Martin, Michael; Auer, Sören

doi:10.1007/978-3-319-46547-0_5

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9982)

Included in the following conference series:

International Semantic Web Conference

Abstract

Over the past years, the size of the Data Web has increased significantly, which makes obtaining general insights into its growth and structure both more challenging and more desirable. The lack of such insights hinders important data management tasks such as quality, privacy and coverage analysis. In this paper, we present the LODStats dataset, which provides a comprehensive picture of the current state of a significant part of the Data Web. LODStats is based on RDF datasets from the data.gov, publicdata.eu and datahub.io data catalogs and at the time of writing lists over 9000 RDF datasets. For each RDF dataset, LODStats collects comprehensive statistics and makes them available in a form adhering to the LDSO vocabulary. This analysis has been regularly published and enhanced over the past five years on the public platform lodstats.aksw.org. We give a comprehensive overview of the resulting dataset.

Resource type: Dataset
Permanent URL: https://datahub.io/dataset/lodstats

1 Introduction

Over the past years, the size of the Data Web has increased significantly, which makes obtaining general insights into its growth and structure both more challenging and more desirable. The expansion of the Data Web can be attributed to a large extent to the efforts of the Semantic Web and Open Government communities. Both communities have a common goal: to provide 5-star^{Footnote 1} RDF datasets to end-users. To achieve this goal, the Semantic Web community introduced a number of requirements that a dataset must fulfil to be included in the LOD Cloud^{Footnote 2}. The Semantic Web community has a main dataset registry hub, the datahub^{Footnote 3} data catalog, while Open Government initiatives usually distribute RDF datasets through their own data catalogs (e.g. data.gov, publicdata.eu and open.canada.ca).

All of the mentioned data catalogs utilize CKAN, an open-source data portal platform that is a de facto standard for Open Data. CKAN provides a solid framework to organize datasets and to expose metadata about them in various formats, including RDF. However, CKAN does not provide analytics over the registered datasets and depends heavily on user input. Moreover, no single aggregation point exists. These factors limit the possibility of obtaining general insights into the Data Web. The lack of such insights hinders important data management tasks such as quality, privacy and coverage analysis.

For this reason, there have been previous attempts to analyze the Data Web. SPARQL Endpoint Status^{Footnote 4} (SPARQLES) [3] addresses the problem of the availability of SPARQL endpoints over time. SPARQLES aggregates 553 SPARQL endpoints and exposes information on their availability and features (e.g. support for SPARQL 1.0/1.1, availability of VoID/Service descriptions). Linked Open Vocabularies^{Footnote 5} (LOV) [6] is a project for building an RDF vocabulary ecosystem that can support the reuse of vocabulary terms. LOV aggregates vocabularies from various publishers and establishes relationships between them using the VOAF vocabulary. The project collected 548 vocabularies (e.g. DCMI Metadata Terms, Friend of a Friend and others) and enabled vocabulary search by utilizing metrics derived from the analysis of the vocabularies and their relationships. The vocab.cc project attempted to fill the gap of vocabulary usage statistics. Based on the Billion Triples Challenge (BTC) dataset of 2012, vocab.cc introduced four metrics to evaluate the BTC dataset. However, the project has a limited scope (i.e. it is restricted to the BTC dataset) and was a one-shot evaluation, and therefore does not provide sustainable statistics over time.

In this paper, we address the gap in Data Web analysis described above. We present the LODStats dataset, which provides a comprehensive picture of the current state of a significant part of the Data Web. At the time of writing, LODStats aggregates 9960 RDF datasets from the data.gov, publicdata.eu and datahub.io data catalogs. For each RDF dataset, LODStats collects comprehensive statistics adhering to the RDF data model. This analysis has been regularly published and enhanced over the past five years at http://lodstats.aksw.org. We extend our previous work [4, 5] as follows: (i) we include the data.gov and publicdata.eu data catalogs, which account for 45 % of the RDF datasets, (ii) we publish the LDSO vocabulary, describing the LODStats data schema, and (iii) we enrich the dataset with CKAN metadata. Overall, our contributions are as follows:

We provide a 5-star RDF dataset containing statistical facts about the Data Web, which is interlinked with CKAN metadata.
We showcase the usage of the dataset via five use case descriptions.
We describe insights into the Data Web gained from the analysis of the LODStats dataset.
We have maintained LODStats over the past five years, delivering a sustainable solution to the Semantic Web community.

The rest of the paper is structured as follows. Sect. 2 introduces the LODStats web application and Sect. 3 outlines the design of the LODStats dataset. Sect. 4 describes the use cases supported by the dataset and Sect. 5 presents the interfaces to access it. Sect. 6 discusses the insights from the Data Web analysis, and Sect. 7 concludes and outlines future work.

2 LODStats: Web Scale RDF Data Analytics

In this section, we briefly outline the inner workings of the LODStats application and show the evolution of the technical solution.

A general overview of the LODStats architecture is depicted in Fig. 1. The LODStats Statistics Evaluation (LSE) module computes the statistical metrics on a dataset and is described in more detail in previous work [4, 5].^{Footnote 6} In this paper, we introduce the following new modules. To aggregate the datasets from the data catalogs, we implemented the CKAN Aggregator^{Footnote 7}. The Messaging Broker^{Footnote 8} allows processing to be scheduled and scaled horizontally (i.e. dataset processing is distributed between LSE modules running in parallel).
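The scheduling idea can be illustrated with a minimal sketch. It is not the LODStats implementation: an in-process queue stands in for the RabbitMQ broker, threads stand in for parallel LSE modules, and the names `evaluate_dataset` and `run_workers` as well as the example URLs are hypothetical.

```python
import queue
import threading


def evaluate_dataset(url):
    # Hypothetical stand-in for one LSE run; the real module would
    # retrieve the dump and compute the statistical metrics.
    return {"dataset": url, "status": "evaluated"}


def run_workers(dataset_urls, n_workers=3):
    """Distribute dataset URLs over n_workers threads via a shared queue."""
    tasks = queue.Queue()
    results = []
    lock = threading.Lock()

    for url in dataset_urls:
        tasks.put(url)

    def worker():
        while True:
            try:
                url = tasks.get_nowait()
            except queue.Empty:
                return  # no work left, so this worker stops
            outcome = evaluate_dataset(url)
            with lock:
                results.append(outcome)
            tasks.task_done()

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


# Example input: placeholder dump locations, not real datasets.
urls = ["http://example.org/dump-%d.nt" % i for i in range(5)]
processed = run_workers(urls)
```

Adding workers increases throughput without changing the producer side, which is the property a message broker provides across machines rather than threads.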

We provide interfaces both for human users and machine agents. The RDB2RDF ^{Footnote 9} module provides virtual RDF views accessible through the LODStats SPARQL Endpoint for the consumption of machine agents. For human users, a web front-end is available at http://lodstats.aksw.org.

Moreover, we publicly provide a Docker image of the whole system.^{Footnote 10} With the LODStats Docker image, the application can be deployed on any Docker-enabled host with one command, namely docker-compose up -d.

3 Dataset Modelling

In this section, we describe the LODStats DataSet vOcabulary (LDSO)^{Footnote 11}, depicted in Fig. 2. We designed LDSO as an extension of the Data Catalog Vocabulary (DCAT) [7] and the Vocabulary of Interlinked Datasets (VoID) [1], following the best practices of vocabulary design, preservation and governance described in [2, 6]. In the following, we describe the structure of the vocabulary.

The ldso:Dataset class is a representation of a dataset from a CKAN data catalog. Thus, to model ldso:Dataset we extend dcat:Dataset by adding the ldso:active property and reusing general metadata properties such as dc:identifier and dc:modified. ldso:active is a boolean property, which separates up-to-date datasets (i.e. those existing in the CKAN data catalog) from out-dated ones. ldso:Dataset connects to the data.gov, publicdata.eu and datahub.io data portals (ldso:CkanCatalog) via the dc:isPartOf property. Also, we interlink instances of ldso:Dataset to the corresponding RDF representations in the data portals using owl:sameAs. To process a ldso:Dataset, the LODStats application utilizes the value of the dcat:downloadURL property to retrieve dumps. Subsequently, a ldso:Dataset is linked directly to the last evaluation result via ldso:currentStats. The modelling of ldso:Dataset instances, for example, supports the following queries: (i) How many RDF datasets are in a particular CKAN data catalog?, (ii) What is the ratio between out-dated and up-to-date datasets?, (iii) Who is the dataset maintainer and what is her email address?

A ldso:StatResult represents a single evaluation result for a ldso:Dataset. ldso:StatResult extends void:Dataset by adding a set of statistical metrics in the LDSO namespace such as ldso:literals, ldso:blanks and ldso:subclasses. We connect ldso:StatResult to ldso:Dataset using the foaf:primaryTopic property. The VoID vocabulary introduces the concept of property and class partitions, which represent the subsets of a dataset that utilize particular properties/classes. We extend this design pattern by introducing new partitions based on datatypes, vocabularies and languages. We interlink ldso:StatResult instances to the VoID descriptions of the datasets, which are generated automatically on dataset evaluation. The modelling of ldso:StatResult allows, for example, the following queries: (i) How many triples (literals, blanks, subclasses) are contained in the dataset?, (ii) How many triples in the dataset adhere to a particular vocabulary (language, datatype)?, (iii) What is the size of the dataset dump (in bytes)?
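To make the idea of metrics and partitions concrete, the following minimal sketch counts literals, blank nodes and language/datatype partitions over a handful of in-memory triples. It is an illustration only, not the LSE code: the dictionary-based term representation, the helper names and the counting rule for blanks (triples with a blank node as subject or object) are assumptions.

```python
from collections import Counter

XSD = "http://www.w3.org/2001/XMLSchema#"


def uri(value):
    return {"kind": "uri", "value": value}


def bnode(label):
    return {"kind": "bnode", "value": label}


def literal(value, lang=None, datatype=None):
    return {"kind": "literal", "value": value, "lang": lang, "datatype": datatype}


def summarize(triples):
    """Return overall counts plus language and datatype partition sizes."""
    stats = {"triples": 0, "literals": 0, "blanks": 0}
    by_language = Counter()
    by_datatype = Counter()
    for s, p, o in triples:
        stats["triples"] += 1
        # Assumed rule: a triple counts towards blanks if it touches a blank node.
        if s["kind"] == "bnode" or o["kind"] == "bnode":
            stats["blanks"] += 1
        if o["kind"] == "literal":
            stats["literals"] += 1
            if o["lang"]:
                by_language[o["lang"]] += 1
            if o["datatype"]:
                by_datatype[o["datatype"]] += 1
    return stats, by_language, by_datatype


# Example data with placeholder URIs.
sample = [
    (uri("http://example.org/a"), uri("http://www.w3.org/2000/01/rdf-schema#label"),
     literal("Leipzig", lang="de")),
    (uri("http://example.org/a"), uri("http://example.org/population"),
     literal("500000", datatype=XSD + "integer")),
    (bnode("b0"), uri("http://example.org/p"), uri("http://example.org/a")),
    (uri("http://example.org/a"), uri("http://example.org/q"), bnode("b0")),
]
stats, by_language, by_datatype = summarize(sample)
```

Each counter corresponds to one partition type: the keys play the role of the partition (a language tag or a datatype URI) and the values give the number of triples in it.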

4 Relevance of the Dataset

Obtaining comprehensive statistical analysis about datasets made available on the Web of Data facilitates a number of important use cases (UC) and provides crucial benefits. These include:

Vocabulary Reuse (UC1). One of the advantages of semantic technologies is that they simplify data integration via common vocabularies. However, it is often difficult to identify relevant vocabulary elements. The LODStats web interface stores the usage frequency of vocabulary elements (e.g. property usage count in [4]) and provides search functionality. This allows knowledge engineers to find the most frequent schema elements that can be used to model the task at hand. This functionality encourages reuse of schema elements and, therefore, simplifies data integration. LODStats also provides a web service for this functionality, so that third-party tools can easily integrate search for similar classes and properties. For instance, Linked Open Vocabularies (LOV)^{Footnote 12} utilizes vocabulary usage frequency as an indicator that shows users the popularity of a specific vocabulary inside the Linked Open Vocabularies catalogue.
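As a rough illustration of usage-frequency search, the sketch below counts how often each property occurs and ranks the properties matching a keyword. The function names and the sample triples are hypothetical; the LODStats web service itself is not modelled here.

```python
from collections import Counter

FOAF = "http://xmlns.com/foaf/0.1/"
VCARD = "http://www.w3.org/2006/vcard/ns#"


def property_usage(triples):
    """Count how many triples use each property URI."""
    return Counter(p for _, p, _ in triples)


def search(usage, keyword, limit=5):
    """Return (property, count) pairs whose URI contains keyword, most used first."""
    hits = [(term, n) for term, n in usage.items()
            if keyword.lower() in term.lower()]
    return sorted(hits, key=lambda pair: (-pair[1], pair[0]))[:limit]


# Example triples as plain (subject, property, object) strings.
triples = [
    ("http://example.org/a", FOAF + "name", "Alice"),
    ("http://example.org/b", FOAF + "name", "Bob"),
    ("http://example.org/a", FOAF + "nick", "al"),
    ("http://example.org/a", VCARD + "fn", "Alice A."),
]
usage = property_usage(triples)
matches = search(usage, "name")
```

Ranking by count surfaces the schema element that most publishers already use, which is the candidate a knowledge engineer would reuse first.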

Quality analysis (UC2). A major problem when using Web Data is quality. However, the quality of the datasets themselves is less of a problem than assessing the expected quality and deciding whether it is sufficient for a certain application. Quality also varies widely on the traditional Web, but means were established (e.g. page rank) to assess the quality of information on the document web. In order to establish similar measures on the Web of Data, it is crucial to assess datasets with regard to incoming and outgoing links, but also regarding the used vocabularies, properties, adherence to property range restrictions, their values etc. The links can be used directly as a data quality indicator (e.g. more links, better quality). The other metrics can, for example, be compared between datasets and over time. Hence, a statistical analysis of datasets can provide important insights with regard to the expected quality.

Coverage analysis (UC3). As important as quality is the coverage that a certain dataset provides. LODStats can be used to compute several coverage dimensions. For instance, the most frequent properties of a particular dataset can be computed to give an overview of the instance data, e.g. whether it contains address information (i.e. vcard:adr usage count > 0). Furthermore, the frequency of namespaces may also be an indicator of the domain of a dataset. The ranges of properties can give insights into whether spatial or temporal information is present in the dataset. In the case of spatial data, for example, we would like to know the region the dataset covers, which can be easily derived from the minimum, maximum and average of the longitude and latitude properties.
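The spatial case can be sketched as follows, assuming the dataset uses the W3C WGS84 `geo:lat` and `geo:long` properties with numeric literal values. The function name and the sample coordinates are hypothetical.

```python
from statistics import mean

GEO = "http://www.w3.org/2003/01/geo/wgs84_pos#"


def numeric_summary(triples, prop):
    """Return min, max and average of the values of prop, or None if unused."""
    values = [float(o) for _, p, o in triples if p == prop]
    if not values:
        return None
    return {"min": min(values), "max": max(values), "avg": mean(values)}


# Example triples with placeholder subjects and coordinates.
triples = [
    ("http://example.org/s1", GEO + "lat", "51.0"),
    ("http://example.org/s1", GEO + "long", "12.0"),
    ("http://example.org/s2", GEO + "lat", "52.0"),
    ("http://example.org/s2", GEO + "long", "13.0"),
    ("http://example.org/s3", GEO + "lat", "53.0"),
    ("http://example.org/s3", GEO + "long", "14.0"),
]
latitude = numeric_summary(triples, GEO + "lat")
longitude = numeric_summary(triples, GEO + "long")
```

The minimum and maximum of both properties together give a bounding box for the region the dataset covers.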

Privacy analysis (UC4). To decide quickly whether a dataset that potentially contains personal information can be published on the Data Web, we need a swift overview of the information contained in the dataset without looking at every individual data record (e.g. whether the dataset uses the vcard vocabulary). An analysis and summary of all the properties and classes used in a dataset can quickly reveal the type of information and thus help prevent the violation of privacy rules.
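A minimal sketch of such a check is shown below. It assumes that the list of properties and classes used in a dataset is already known, and it flags vocabularies from a watchlist. The watchlist contents and the function names are illustrative assumptions, not a rule set defined by LODStats.

```python
# Assumed watchlist of vocabularies that often describe persons.
PERSONAL_VOCABULARIES = {
    "http://www.w3.org/2006/vcard/ns#",
    "http://xmlns.com/foaf/0.1/",
}


def namespace(term_uri):
    """Cut a term URI after its last '#' or '/' to obtain the namespace."""
    cut = max(term_uri.rfind("#"), term_uri.rfind("/"))
    return term_uri[:cut + 1]


def personal_vocabularies_used(term_uris):
    """Return the watchlisted namespaces that occur among the given terms."""
    return sorted({namespace(u) for u in term_uris} & PERSONAL_VOCABULARIES)


flagged = personal_vocabularies_used([
    "http://www.w3.org/2006/vcard/ns#fn",
    "http://purl.org/dc/terms/title",
])
```

A non-empty result only signals that a closer manual review is advisable; it does not prove that personal data is present.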

Link target identification (UC5). Establishing links between datasets is a fundamental requirement for many Linked Data applications (e.g. data integration and fusion). However, as we learned, the Web of Linked Data currently still lacks coherence (with less than 10 % of the entities actually being linked). Meanwhile, a number of tools are available that support the automatic generation of links (e.g. [8, 9]). An obstacle to the broad use of these tools is, however, the difficulty of identifying suitable link targets on the Data Web. Attaching proper statistics about the internal structure of a dataset (in particular about the used vocabularies, properties etc.) makes it much simpler to quickly identify suitable target datasets for linking. For example, the use of longitude and latitude properties in a dataset indicates that this dataset might be a good candidate for linking spatial objects. If we additionally know the minimum, maximum and average values for these properties, we can even identify datasets which are suitable link targets for a certain region.
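The regional selection can be sketched as a bounding-box overlap test over per-dataset minimum and maximum coordinates. The dataset names, the coordinate values and the function names below are made up for illustration.

```python
def overlaps(box, region):
    """True if two latitude/longitude bounding boxes intersect."""
    return (box["min_lat"] <= region["max_lat"]
            and box["max_lat"] >= region["min_lat"]
            and box["min_long"] <= region["max_long"]
            and box["max_long"] >= region["min_long"])


def candidate_targets(dataset_boxes, region):
    """Return names of datasets whose coordinate range intersects the region."""
    return sorted(name for name, box in dataset_boxes.items()
                  if overlaps(box, region))


# Hypothetical per-dataset statistics (min/max of latitude and longitude).
dataset_boxes = {
    "dataset-a": {"min_lat": 47.0, "max_lat": 55.0, "min_long": 5.0, "max_long": 15.0},
    "dataset-b": {"min_lat": -40.0, "max_lat": -10.0, "min_long": 110.0, "max_long": 155.0},
}
region = {"min_lat": 50.0, "max_lat": 52.0, "min_long": 11.0, "max_long": 13.0}
targets = candidate_targets(dataset_boxes, region)
```

Because only four numbers per dataset are compared, the pre-selection is cheap and can run before any link discovery tool is started.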

5 Availability, Interfaces and Sustainability

In this section, we describe the interfaces to access the dataset as well as how we support sustainability. We publish our dataset on the datahub.io data catalog^{Footnote 13}. The datahub.io entry for LODStats includes:

VoID description. Machine readable description of the dataset.
LDSO vocabulary. LODStats Dataset Vocabulary.
LODStats SPARQL endpoint. SPARQL endpoint for the application.
LODStats RDF dump. The RDF dump of LODStats dataset (April 2016).
VoID descriptions RDF dump. Automatically generated VoID descriptions from the LODStats application (April 2016).
Data.gov, PublicData.eu, Datahub.io RDF dumps. RDF dumps of the crawled data catalogs (April 2015).

The SPARQL endpoint serves the latest output of the RDB2RDF module and exposes up-to-date data. We announced the LODStats dataset on public Semantic Web mailing lists and created a Web forum^{Footnote 14} to support community feedback. The sustainability of LODStats is demonstrated by the following: (i) the LODStats project has been running for over five years, (ii) an evaluation of the state of the Data Web was performed every two months, or at least once every half year, during this period, and (iii) the last evaluation was performed recently.

6 Data Web Statistics Summary

In this section, we provide a brief overview of the insights into the Data Web, based on the statistics collected over the past five years for the RDF dumps. General current statistics such as the number of triples, entities, literals etc. are available on the LODStats web portal^{Footnote 15}.

Over the past five years, the number of datasets has increased from 422 in 2011 to 9644 in 2015. The number of datasets surged in 2015, when we included data catalogs from Open Government initiatives in LODStats. However, only a small part of the overall number of triples can be attributed to the governmental data catalogs: 1 % for PublicData.eu and 3 % for Data.gov. This can be explained by the fact that Open Government portals publish short documents, such as monthly energy consumption or salary rates for governmental facilities. The connectedness of the Data Web has increased to 40 % since 2011, when only 3 % of the overall number of triples were links between different datasets.

Further Data Web statistics can be accessed through the LODStats SPARQL endpoint^{Footnote 16}. For instance, the datasets from 2011 can be requested with a SPARQL query against this endpoint.
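The original query listing is not reproduced in this text. The sketch below only illustrates how such a request could be assembled. It rests on several assumptions: the LDSO namespace URI is guessed from the published ontology location, the dc prefix is bound to the Dublin Core terms namespace, "datasets in 2011" is read as datasets whose dc:modified value falls in 2011, and that value is taken to be an xsd:dateTime so that the SPARQL 1.1 YEAR function applies. The function names are hypothetical and no network request is made.

```python
from urllib.parse import urlencode

ENDPOINT = "http://lodstats.aksw.org/sparql"  # endpoint named in the paper
# Assumed namespace URIs, not confirmed by the paper.
LDSO = "http://lodstats.aksw.org/ontology/ldso.owl#"
DCT = "http://purl.org/dc/terms/"


def datasets_in_year_query(year):
    """Build a SPARQL query for ldso:Dataset instances modified in a year."""
    return f"""PREFIX ldso: <{LDSO}>
PREFIX dc: <{DCT}>
SELECT DISTINCT ?dataset WHERE {{
  ?dataset a ldso:Dataset ;
           dc:modified ?modified .
  FILTER (YEAR(?modified) = {int(year)})
}}"""


def request_url(year):
    """Encode the query as a GET request URL for the endpoint."""
    return ENDPOINT + "?" + urlencode({"query": datasets_in_year_query(year)})


query_2011 = datasets_in_year_query(2011)
url_2011 = request_url(2011)
```

The resulting URL follows the SPARQL protocol convention of passing the query text in a `query` parameter; the actual schema terms should be checked against the LDSO vocabulary before use.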

7 Conclusion and Future Work

We presented LODStats, the Data Web Census Dataset, which exposes statistics about the Data Web over the last five years. We exposed the dataset via a SPARQL endpoint and as an RDF dump, providing a single point of access at the DataHub.io data catalog. We created a mailing list to collect feedback from the community and announced the dataset on the major mailing lists.

In the future, we will process very large datasets with hundreds of millions of triples or more, which are expensive to process on a single machine. Additionally, we plan to include metrics for data streams (standing queries and observing their change over time) and to extend the metrics to compute complex graph properties and properties related to inference. The timestamps of all individual measurements are available as RDF data via the SPARQL endpoint, which we plan to use to provide timeline views for the different statistics available via LODStats.

Notes

1. According to the 5-star data model available at http://5stardata.info.
2. http://linkeddatacatalog.dws.informatik.uni-mannheim.de/state/.
3. http://datahub.io/.
4. http://sparqles.ai.wu.ac.at/.
5. http://lov.okfn.org/.
6. For SPARQL endpoints, LSE can only infer the number of triples.
7. https://github.com/aksw/ckan-aggregator-py.
8. We use rabbitmq as a messaging broker: https://www.rabbitmq.com/.
9. For the RDB2RDF transformation we utilize Sparqlify: http://sparqlify.org/.
10. https://github.com/aksw/lodstats.docker.
11. LDSO is published at http://lodstats.aksw.org/ontology/ldso.owl.
12. http://lov.okfn.org.
13. Available at http://datahub.io/dataset/lodstats.
14. https://groups.google.com/forum/#!forum/lodstats.
15. Statistics can be accessed at http://lodstats.aksw.org/stats.
16. http://lodstats.aksw.org/sparql.

References

1. Alexander, K., Cyganiak, R., Hausenblas, M., Zhao, J.: Describing linked datasets. In: LDOW (2009)
2. Allemang, D., Hendler, J.: Semantic Web for the Working Ontologist: Effective Modeling in RDFS and OWL. Morgan Kaufmann Publishers Inc., San Francisco (2011). ISBN: 9780123859662
3. Buil-Aranda, C., Hogan, A., Umbrich, J., Vandenbussche, P.-Y.: SPARQL web-querying infrastructure: ready for action? In: Alani, H., et al. (eds.) ISWC 2013. LNCS, vol. 8219, pp. 277–293. Springer, Heidelberg (2013). doi:10.1007/978-3-642-41338-4_18
4. Auer, S., Demter, J., Martin, M., Lehmann, J.: LODStats – an extensible framework for high-performance dataset analytics. In: Teije, A., Völker, J., Handschuh, S., Stuckenschmidt, H., d’Acquin, M., Nikolov, A., Aussenac-Gilles, N., Hernandez, N. (eds.) EKAW 2012. LNCS (LNAI), vol. 7603, pp. 353–362. Springer, Heidelberg (2012). doi:10.1007/978-3-642-33876-2_31
5. Ermilov, I., Martin, M., Lehmann, J., Auer, S.: Linked open data statistics: collection and exploitation. In: Klinov, P., Mouromtsev, D. (eds.) KESW 2013. CCIS, vol. 394, pp. 242–249. Springer, Heidelberg (2013). doi:10.1007/978-3-642-41360-5_19
6. Greenberg, E.M.R., Bueno, J.G., de la Fuente, T., Baker, P.-Y.V., Vatant, B.: Requirements for vocabulary preservation and governance. Libr. Hi Tech 31(4), 657–668 (2013)
7. Maali, F., Erickson, J., Archer, P.: Data catalog vocabulary (dcat). In: W3C Recommendation (2014)
8. Ngonga Ngomo, A.-C., Auer, S.: Limes - a time-efficient approach for large-scale link discovery on the web of data. In: Proceedings of IJCAI (2011)
9. Volz, J., Bizer, C., Gaedke, M., Kobilarov, G.: Discovering and maintaining links on the web of data. In: Bernstein, A., Karger, D.R., Heath, T., Feigenbaum, L., Maynard, D., Motta, E., Thirunarayan, K. (eds.) ISWC 2009. LNCS, vol. 5823, pp. 650–665. Springer, Heidelberg (2009). doi:10.1007/978-3-642-04930-9_41

Acknowledgments

This work was partly supported by the German Federal Ministry of Education and Research (BMBF) for the LEDS Project (GA no. 03WKCG11C) and by a grant from the European Union’s Horizon 2020 research and innovation programme for the project Big Data Europe (GA no. 644564).

Author information

Authors and Affiliations

AKSW, Institute of Computer Science, University of Leipzig, Leipzig, Germany
Ivan Ermilov & Michael Martin
University of Bonn and Fraunhofer IAIS, Bonn, Germany
Jens Lehmann & Sören Auer

Corresponding author

Correspondence to Ivan Ermilov.

Editor information

Editors and Affiliations

Elsevier Labs, Amsterdam, The Netherlands
Paul Groth
University of Southampton, Southampton, United Kingdom
Elena Simperl
Heriot-Watt University, Edinburgh, United Kingdom
Alasdair Gray
Vienna University of Technology, Vienna, Austria
Marta Sabou
Technische Universität Dresden, Dresden, Germany
Markus Krötzsch
IBM Research Ireland, Dublin 4, Ireland
Freddy Lecue
GESIS-Leibniz Institute for the Social Sciences, Köln, Germany
Fabian Flöck
University of Southern California, Marina del Rey, California, USA
Yolanda Gil

About this paper

Cite this paper

Ermilov, I., Lehmann, J., Martin, M., Auer, S. (2016). LODStats: The Data Web Census Dataset. In: Groth, P., et al. The Semantic Web – ISWC 2016. ISWC 2016. Lecture Notes in Computer Science, vol 9982. Springer, Cham. https://doi.org/10.1007/978-3-319-46547-0_5

DOI: https://doi.org/10.1007/978-3-319-46547-0_5
Published: 23 September 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-46546-3
Online ISBN: 978-3-319-46547-0
eBook Packages: Computer Science, Computer Science (R0)
