
Resource type: Datasets

Permanent URL: http://w3id.org/sw4ml-datasets

 

1 Introduction

In recent years, applying machine learning to Semantic Web data has drawn a lot of attention. Many approaches have been proposed for various tasks, ranging from reformulating machine learning problems on the Semantic Web as traditional, propositional machine learning tasks to developing entirely novel algorithms. However, systematic comparative evaluations of different approaches are scarce; approaches are typically evaluated on a handful of often project-specific datasets and compared to a baseline and/or one or two other systems.

In contrast, evaluations in the machine learning area are often more rigorous. Approaches are usually compared using a larger number of standard datasets, most often from the UCI repositoryFootnote 1. With a larger set of datasets used in the evaluation, statements about statistical significance are possible as well [3].
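
As a rough illustration of how such significance statements can be made over many datasets, the sketch below applies the Friedman test discussed in [3] to invented accuracy scores of three hypothetical approaches; the scores and approach names are placeholders for illustration only.

    # Minimal sketch: comparing several approaches across many datasets,
    # following the methodology discussed in [3]. The accuracy values and
    # approach names below are purely illustrative placeholders.
    from scipy.stats import friedmanchisquare

    # One list of accuracies per approach, one entry per benchmark dataset.
    approach_a = [0.81, 0.74, 0.90, 0.66, 0.78, 0.85]
    approach_b = [0.79, 0.70, 0.88, 0.64, 0.75, 0.83]
    approach_c = [0.72, 0.69, 0.85, 0.60, 0.71, 0.80]

    # The Friedman test checks whether the approaches differ significantly
    # when ranked per dataset; a post-hoc test (e.g., Nemenyi) would follow.
    statistic, p_value = friedmanchisquare(approach_a, approach_b, approach_c)
    print(f"Friedman chi-square = {statistic:.3f}, p = {p_value:.3f}")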

At the same time, collections of benchmark datasets have become quite well accepted in other areas of Semantic Web research. Notable examples include the Ontology Alignment Evaluation Initiative (OAEI) for ontology matchingFootnote 2, the Berlin SPARQL Benchmark Footnote 3 for triple store performance, the Lehigh University Benchmark (LUBM)Footnote 4 for reasoning, or the Question Answering over Linked Data (QALD) datasetFootnote 5 for natural language query systems.

In this paper, we introduce a collection of datasets for benchmarking machine learning approaches for the Semantic Web. Those datasets are either existing RDF datasets, or external classification or regression problems, for which the instances have been enriched with links to the Linked Open Data cloud [14]. Furthermore, by varying the number of instances for a dataset, scalability evaluations are also made possible.

2 Related Work

Recent surveys on the use of the Semantic Web for machine learning organize the proposed approaches into several categories, i.e., approaches that use Semantic Web data for machine learning [16], approaches that perform machine learning on the Semantic Web [11], and approaches that use machine learning techniques to create and improve Semantic Web data [8, 16]. Furthermore, there are challenges, like the Linked Data Mining Challenge Footnote 6 or the Semantic-Web enabled Recommender Systems Challenge Footnote 7, which usually focus on only a few datasets and a very specific problem setting.

3 Datasets

Our dataset collection has three categories: (i) existing datasets that are commonly used in machine learning experiments, (ii) datasets that were generated from official observations, and (iii) datasets generated from existing RDF datasets. Each of the datasets in the first two categories is initially linked to DBpediaFootnote 8. This has two main reasons: (1) DBpedia is a cross-domain knowledge base and can thus be used for datasets from very different topical domains, and (2) tools like DBpedia Lookup and DBpedia Spotlight make it easy to link external datasets to DBpedia. However, DBpedia can be seen as an entry point to the Web of Linked Data, with many datasets linking to and from DBpedia. In fact, we use the RapidMiner Linked Open Data extension [9] to retrieve external links for each entity to YAGOFootnote 9 and WikidataFootnote 10. Such links can be exploited for a systematic evaluation of the relevance of the data in different LOD datasets for different learning tasks.
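
The following sketch illustrates one way such cross-dataset links could be retrieved outside of RapidMiner, by querying the public DBpedia SPARQL endpoint for owl:sameAs links of a given entity; the endpoint URL and the example entity are assumptions made for illustration and are not part of the published resource.

    # Minimal sketch (not the RapidMiner LOD extension itself): retrieve
    # owl:sameAs links for a DBpedia entity, from which YAGO and Wikidata
    # URIs can be filtered. Endpoint and example entity are assumptions.
    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    sparql.setQuery("""
        PREFIX owl: <http://www.w3.org/2002/07/owl#>
        SELECT ?same WHERE {
            <http://dbpedia.org/resource/Berlin> owl:sameAs ?same .
        }
    """)
    sparql.setReturnFormat(JSON)

    results = sparql.query().convert()
    for binding in results["results"]["bindings"]:
        uri = binding["same"]["value"]
        # Keep only links into YAGO and Wikidata.
        if "yago-knowledge.org" in uri or "wikidata.org" in uri:
            print(uri)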

In the dataset collection, there are four datasets that are commonly used for machine learning. For these datasets, we first enrich the instances with links to LOD datasets, and reuse the already defined target variable to perform machine learning experiments:

  • The Auto MPG datasetFootnote 11 captures different characteristics of cars, and the target is to predict the fuel consumption (MPG) as a regression task.

  • The AAUP (American Association of University Professors) dataset contains a list of universities, including eight target variables describing the salary of different staff at the universitiesFootnote 12. We use the average salary as a target variable both for regression and classification, discretizing the target variable into “high”, “medium” and “low”, using equal frequency binning.

  • The Auto 93 datasetFootnote 13 captures different characteristics of cars, and the target is to predict the price of the vehicles as a regression task.

  • The Zoo dataset captures different characteristics of animals, and the target is to predict the type of the animals as a classification task.

For those datasets, the cars, universities, and animals are linked to DBpedia based on their names.
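
As a rough illustration of such name-based linking, the sketch below matches an instance name against English rdfs:label values on the public DBpedia SPARQL endpoint; the endpoint, the example name, and the exact-match strategy are simplifying assumptions, not a description of the actual linking pipeline (which relied on tools such as DBpedia Lookup).

    # Minimal sketch of name-based linking to DBpedia via rdfs:label.
    # This is an illustrative assumption, not the pipeline used to build
    # the published datasets.
    from SPARQLWrapper import SPARQLWrapper, JSON

    def link_by_name(name: str) -> list[str]:
        """Return DBpedia resources whose English label exactly matches `name`."""
        sparql = SPARQLWrapper("https://dbpedia.org/sparql")
        sparql.setQuery("""
            PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
            SELECT ?resource WHERE {
                ?resource rdfs:label ?label .
                FILTER (lang(?label) = "en" && str(?label) = "%s")
            }
        """ % name.replace('"', '\\"'))
        sparql.setReturnFormat(JSON)
        results = sparql.query().convert()
        return [b["resource"]["value"] for b in results["results"]["bindings"]]

    # Example: link one animal from the Zoo dataset by its name.
    print(link_by_name("Aardvark"))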

The second category of datasets contains a list of datasets where the target variable is an observation from different real-world domains, as captured by official sources. Again, the instances were enriched with links to LOD datasets. There are thirteen datasets in this category:

  • The Forbes dataset contains a list of companies, including several features of the companies, which was generated from the Forbes list of leading companies 2015Footnote 14. The target is to predict the company’s market value as a classification and regression task. For the classification task, we discretize the target variable into “high”, “medium”, and “low”, using equal frequency binning.

  • The Cities dataset contains a list of cities and their quality of living, as captured by Mercer [7]. We use the dataset both for regression and classification.

  • The Endangered Species dataset classifies animals into endangered speciesFootnote 15.

  • The Facebook Movies dataset contains a list of movies and the number of Facebook likes for each movieFootnote 16. We first selected 10,000 movies from DBpedia, which were then linked to the corresponding Facebook pages based on the movie’s name and director. The final dataset of 1,600 movies was created by ordering the movies by the number of Facebook likes and selecting the top 800 and the bottom 800 movies. We use the dataset for regression and classification.

  • Similarly, the Facebook Books dataset contains a list of books and the number of Facebook likes. Each book was linked to the corresponding Facebook page using the book’s title and the book’s author. Again, we selected the top 800 books and the bottom 800 books, based on the number of Facebook likes.

  • The Metacritic Movies dataset is retrieved from Metacritic.comFootnote 17, which provides the average rating of all-time reviews for a list of movies [12]. The initial dataset contained around 10,000 movies, from which we selected the 1,000 movies from the top of the list and the 1,000 movies from the bottom of the list. We use the dataset both for regression and classification.

  • Similarly, the Metacritic Albums dataset is retrieved from Metacritic.comFootnote 18, which provides the average rating of all-time reviews for a list of albums [13].

  • The HIV Deaths Country dataset contains a list of countries with the number of deaths caused by HIV, as captured by the World Health OrganizationFootnote 19. We use the dataset both for regression and classification.

  • Similarly, the Traffic Accidents Deaths Country dataset contains a list of countries with the number of deaths caused by traffic accidentsFootnote 20.

  • The Energy Savings Country dataset contains a list of countries with the total amount of primary energy savings in 2010Footnote 21, which was downloaded from WorldBankFootnote 22. We use the dataset both for regression and classification.

  • Similarly, the Inflation Country dataset contains a list of countries with the inflation rate for 2011Footnote 23.

  • The Scientific Journals Country dataset contains a list of countries with the number of scientific and technical journal articles published in 2011Footnote 24.

  • The Unemployment French Region dataset contains a list of regions in France with the unemployment rate, used in the SemStats 2013 challenge [10].

Again, for those datasets, the instances (cities, countries, etc.) are linked to DBpedia. For datasets used for both classification and regression, the regression target was discretized using equal frequency binning, usually into a high and a low class.
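
A minimal sketch of such equal frequency binning is given below, using pandas.qcut on an invented numeric target; the data, column names, and bin labels are illustrative assumptions.

    # Minimal sketch of equal frequency binning with pandas.qcut.
    # The data and column names are invented for illustration.
    import pandas as pd

    df = pd.DataFrame({"market_value": [1.2, 3.4, 0.8, 5.6, 2.1, 4.3, 0.5, 2.9, 3.8]})

    # Two equally sized classes (high/low), as used for most datasets ...
    df["target_2"] = pd.qcut(df["market_value"], q=2, labels=["low", "high"])

    # ... or three classes, as used e.g. for the AAUP and Forbes datasets.
    df["target_3"] = pd.qcut(df["market_value"], q=3, labels=["low", "medium", "high"])

    print(df)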

The third, and final, category contains datasets that were generated from existing RDF datasets, where the value of a certain property is used as a classification target. There are five datasets in this category:

  • The Drug-Food Interaction dataset contains a list of drug-recipe pairs and their interaction, i.e., “negative” and “neutral” [6]. The dataset was retrieved from FinkiLODFootnote 25. Furthermore, each drug is linked to DrugBankFootnote 26. We drew a stratified random sample of 2,000 instances from the complete dataset. When generating the features, we ignore the foodInteraction property in DrugBank, since it highly correlates with the target variable.

  • The AIFB dataset describes the AIFB research institute in terms of its staff, research groups, and publications. In [1], the dataset was first used to predict the affiliation (i.e., research group) for the people in the dataset. The dataset contains 178 members of research groups; however, the smallest group contains only 4 people and is removed from the dataset, leaving 4 classes. Also, we remove the employs relation, which is the inverse of the affiliation relation.

  • The AM dataset contains information about artifacts in the Amsterdam Museum [2]. Each artifact in the dataset is linked to other artifacts and details about its production, material, and content. It also has an artifact category, which serves as a prediction target. We have drawn a stratified random sample of 1,000 instances from the complete dataset. We also removed the material relation, since it highly correlates with the artifact category.

  • The MUTAG dataset is distributed as an example dataset for the DL-Learner toolkitFootnote 27. It contains information about complex molecules that are potentially carcinogenic, which is given by the isMutagenic property.

  • The BGS dataset was created by the British Geological Survey and describes geological measurements in Great BritainFootnote 28. It was used in [17] to predict the lithogenesis property of named rock units. The dataset contains 146 named rock units with a lithogenesis, from which we use the two largest classes.

An overview of the datasets is given in Tables 1, 2, and 3. For each dataset, we depict the number of instances, the machine learning tasks in which the dataset is used (C stands for classification and R stands for regression), the source of the dataset, and the LOD datasets to which the dataset is linked. In addition, we report basic statistics of the properties in the LOD datasets, i.e., the average, median, maximum, and minimum number of types, categories, outgoing relations (rel out), incoming relations (rel in), outgoing relations including values (rel-vals out), and incoming relations including values (rel-vals in). The datasets, as well as a detailed description, a link quality evaluation, and licensing information, can be found onlineFootnote 29.
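
To make the reported per-entity statistics concrete, the sketch below counts types, categories, and distinct outgoing/incoming relations for a single DBpedia entity via SPARQL; the endpoint, the example entity, and the choice of rdf:type and dct:subject as the “types” and “categories” predicates are assumptions made for illustration.

    # Minimal sketch: per-entity counts similar to those reported in the
    # tables (types, categories, outgoing and incoming relations). The
    # endpoint, example entity, and predicate choices are assumptions.
    from SPARQLWrapper import SPARQLWrapper, JSON

    ENDPOINT = "https://dbpedia.org/sparql"
    ENTITY = "http://dbpedia.org/resource/Berlin"  # illustrative example

    def count(query: str) -> int:
        sparql = SPARQLWrapper(ENDPOINT)
        sparql.setQuery(query)
        sparql.setReturnFormat(JSON)
        res = sparql.query().convert()
        return int(res["results"]["bindings"][0]["n"]["value"])

    stats = {
        "types": count(f"SELECT (COUNT(?t) AS ?n) WHERE {{ <{ENTITY}> a ?t }}"),
        "categories": count(
            f"SELECT (COUNT(?c) AS ?n) WHERE {{ <{ENTITY}> "
            f"<http://purl.org/dc/terms/subject> ?c }}"),
        "rel out": count(
            f"SELECT (COUNT(DISTINCT ?p) AS ?n) WHERE {{ <{ENTITY}> ?p ?o }}"),
        "rel in": count(
            f"SELECT (COUNT(DISTINCT ?p) AS ?n) WHERE {{ ?s ?p <{ENTITY}> }}"),
    }
    print(stats)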

From the given statistics, we can make the following observations: (i) DBpedia contains significantly fewer owl:sameAs links to YAGO than to Wikidata; (ii) DBpedia provides the highest number of types and categories per entity on average; (iii) Wikidata contains the highest number of outgoing and incoming relations for most of the datasets; (iv) YAGO contains the highest number of outgoing and incoming relation values for most of the datasets.

Table 1. Datasets statistics
Table 2. Datasets statistics
Table 3. Datasets statistics

4 Conclusion and Outlook

In this paper, we have introduced a collection of 22 benchmark datasets for machine learning on the Semantic Web. So far, we have concentrated on classification and regression tasks. There are methods to derive clustering and outlier detection benchmarks from classification and regression datasets [4, 5], so that extending the dataset collection for such unsupervised tasks is possible as well. Furthermore, as many datasets on the Semantic Web use extensive hierarchies in the form of ontologies, building benchmark datasets for tasks like hierarchical multi-label classification [15] would also be an interesting extension.