
Resource type: Datasets

Permanent URL: http://w3id.org/sw4ml-datasets

 

1 Introduction

In recent years, applying machine learning to Semantic Web data has drawn a lot of attention. Many approaches have been proposed for various tasks, ranging from reformulating machine learning problems on the Semantic Web as traditional, propositional machine learning tasks to developing entirely novel algorithms. However, systematic comparative evaluations of different approaches are scarce; approaches are typically evaluated on a handful of often project-specific datasets and compared to a baseline and/or one or two other systems.

In contrast, evaluations in the machine learning area are often more rigorous. Approaches are usually compared using a larger number of standard datasets, most often from the UCI repositoryFootnote 1. With a larger set of datasets used in the evaluation, statements about statistical significance are possible as well [3].
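
As a rough illustration of how such significance statements can be made over many datasets, the sketch below applies the Friedman test discussed in [3] to invented accuracy scores of three hypothetical approaches; the scores and approach names are placeholders for illustration only.

    # Minimal sketch: comparing several approaches across many datasets,
    # following the methodology discussed in [3]. The accuracy values and
    # approach names below are purely illustrative placeholders.
    from scipy.stats import friedmanchisquare

    # One list of accuracies per approach, one entry per benchmark dataset.
    approach_a = [0.81, 0.74, 0.90, 0.66, 0.78, 0.85]
    approach_b = [0.79, 0.70, 0.88, 0.64, 0.75, 0.83]
    approach_c = [0.72, 0.69, 0.85, 0.60, 0.71, 0.80]

    # The Friedman test checks whether the approaches differ significantly
    # when ranked per dataset; a post-hoc test (e.g., Nemenyi) would follow.
    statistic, p_value = friedmanchisquare(approach_a, approach_b, approach_c)
    print(f"Friedman chi-square = {statistic:.3f}, p = {p_value:.3f}")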

At the same time, collections of benchmark datasets have become quite well accepted in other areas of Semantic Web research. Notable examples include the Ontology Alignment Evaluation Initiative (OAEI) for ontology matchingFootnote 2, the Berlin SPARQL Benchmark Footnote 3 for triple store performance, the Lehigh University Benchmark (LUBM)Footnote 4 for reasoning, or the Question Answering over Linked Data (QALD) datasetFootnote 5 for natural language query systems.

In this paper, we introduce a collection of datasets for benchmarking machine learning approaches for the Semantic Web. Those datasets are either existing RDF datasets, or external classification or regression problems, for which the instances have been enriched with links to the Linked Open Data cloud [14]. Furthermore, by varying the number of instances for a dataset, scalability evaluations are also made possible.

2 Related Work

Recent surveys on the use of the Semantic Web for machine learning organize the proposed approaches into several categories, i.e., approaches that use Semantic Web data for machine learning [16], approaches that perform machine learning on the Semantic Web [11], and approaches that use machine learning techniques to create and improve Semantic Web data [8, 16]. Furthermore, there are challenges, like the Linked Data Mining Challenge Footnote 6 or the Semantic-Web enabled Recommender Systems Challenge Footnote 7, which usually focus on only a few datasets and a very specific problem setting.

3 Datasets

Our dataset collection has three categories: (i) existing datasets that are commonly used in machine learning experiments, (ii) datasets that were generated from official observations, and (iii) datasets generated from existing RDF datasets. Each of the datasets in the first two categories is initially linked to DBpediaFootnote 8. This has two main reasons: (1) DBpedia is a cross-domain knowledge base and can thus be used for datasets from very different topical domains, and (2) tools like DBpedia Lookup and DBpedia Spotlight make it easy to link external datasets to DBpedia. However, DBpedia can be seen as an entry point to the Web of Linked Data, with many datasets linking to and from DBpedia. In fact, we use the RapidMiner Linked Open Data extension [9] to retrieve external links for each entity to YAGOFootnote 9 and WikidataFootnote 10. Such links can be exploited for a systematic evaluation of the relevance of the data in different LOD datasets for different learning tasks.
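
The following sketch illustrates one way such cross-dataset links could be retrieved outside of RapidMiner, by querying the public DBpedia SPARQL endpoint for owl:sameAs links of a given entity; the endpoint URL and the example entity are assumptions made for illustration and are not part of the published resource.

    # Minimal sketch (not the RapidMiner LOD extension itself): retrieve
    # owl:sameAs links for a DBpedia entity, from which YAGO and Wikidata
    # URIs can be filtered. Endpoint and example entity are assumptions.
    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    sparql.setQuery("""
        PREFIX owl: <http://www.w3.org/2002/07/owl#>
        SELECT ?same WHERE {
            <http://dbpedia.org/resource/Berlin> owl:sameAs ?same .
        }
    """)
    sparql.setReturnFormat(JSON)

    results = sparql.query().convert()
    for binding in results["results"]["bindings"]:
        uri = binding["same"]["value"]
        # Keep only links into YAGO and Wikidata.
        if "yago-knowledge.org" in uri or "wikidata.org" in uri:
            print(uri)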

In the dataset collection, there are four datasets that are commonly used for machine learning. For these datasets, we first enrich the instances with links to LOD datasets, and reuse the already defined target variable to perform machine learning experiments:

  • The Auto MPG datasetFootnote 11 captures different characteristics of cars, and the target is to predict the fuel consumption (MPG) as a regression task.

  • The AAUP (American Association of University Professors) dataset contains a list of universities, including eight target variables describing the salary of different staff at the universitiesFootnote 12. We use the average salary as a target variable both for regression and classification, discretizing the target variable into “high”, “medium” and “low”, using equal frequency binning.

  • The Auto 93 datasetFootnote 13 captures different characteristics of cars, and the target is to predict the price of the vehicles as a regression task.

  • The Zoo dataset captures different characteristics of animals, and the target is to predict the type of the animals as a classification task.

For those datasets, the cars, universities, and animals are linked to DBpedia based on their names.
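
As a rough illustration of such name-based linking, the sketch below matches an instance name against English rdfs:label values on the public DBpedia SPARQL endpoint; the endpoint, the example name, and the exact-match strategy are simplifying assumptions, not a description of the actual linking pipeline (which relied on tools such as DBpedia Lookup).

    # Minimal sketch of name-based linking to DBpedia via rdfs:label.
    # This is an illustrative assumption, not the pipeline used to build
    # the published datasets.
    from SPARQLWrapper import SPARQLWrapper, JSON

    def link_by_name(name: str) -> list[str]:
        """Return DBpedia resources whose English label exactly matches `name`."""
        sparql = SPARQLWrapper("https://dbpedia.org/sparql")
        sparql.setQuery("""
            PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
            SELECT ?resource WHERE {
                ?resource rdfs:label ?label .
                FILTER (lang(?label) = "en" && str(?label) = "%s")
            }
        """ % name.replace('"', '\\"'))
        sparql.setReturnFormat(JSON)
        results = sparql.query().convert()
        return [b["resource"]["value"] for b in results["results"]["bindings"]]

    # Example: link one animal from the Zoo dataset by its name.
    print(link_by_name("Aardvark"))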

The second category of datasets contains a list of datasets where the target variable is an observation from different real-world domains, as captured by official sources. Again, the instances were enriched with links to LOD datasets. There are thirteen datasets in this category:

  • The Forbes dataset contains a list of companies, including several features of the companies, which was generated from the Forbes list of leading companies 2015Footnote 14. The target is to predict the company’s market value as a classification and regression task. For the classification task, we discretize the target variable into “high”, “medium”, and “low”, using equal frequency binning.

  • The Cities dataset contains a list of cities and their quality of living, as captured by Mercer [7]. We use the dataset both for regression and classification.

  • The Endangered Species dataset classifies animals into endangered speciesFootnote 15.

  • The Facebook Movies dataset contains a list of movies and the number of Facebook likes for each movieFootnote 16. We first selected 10,000 movies from DBpedia, which were then linked to the corresponding Facebook pages based on the movie’s name and director. The final dataset of 1,600 movies was created by ordering the movies by the number of Facebook likes and selecting the top 800 and the bottom 800 movies. We use the dataset for regression and classification.

  • Similarly, the Facebook Books dataset contains a list of books and the number of Facebook likes. Each book was linked to the corresponding Facebook page using the book’s title and the book’s author. Again, we selected the top 800 books and the bottom 800 books, based on the number of Facebook likes.

  • The Metacritic Movies dataset is retrieved from Metacritic.comFootnote 17, which provides the average rating of all-time reviews for a list of movies [12]. The initial dataset contained around 10,000 movies, from which we selected the 1,000 movies from the top of the list and the 1,000 movies from the bottom of the list. We use the dataset both for regression and classification.

  • Similarly, the Metacritic Albums dataset is retrieved from Metacritic.comFootnote 18, which provides the average rating of all-time reviews for a list of albums [13].

  • The HIV Deaths Country dataset contains a list of countries with the number of deaths caused by HIV, as captured by the World Health OrganizationFootnote 19. We use the dataset both for regression and classification.

  • Similarly, the Traffic Accidents Deaths Country dataset contains a list of countries with the number of deaths caused by traffic accidentsFootnote 20.

  • The Energy Savings Country dataset contains a list of countries with the total amount of primary energy savings in 2010Footnote 21, which was downloaded from WorldBankFootnote 22. We use the dataset both for regression and classification.

  • Similarly, the Inflation Country dataset contains a list of countries with the inflation rate for 2011Footnote 23.

  • The Scientific Journals Country dataset contains a list of countries with the number of scientific and technical journal articles published in 2011Footnote 24.

  • The Unemployment French Region dataset contains a list of regions in France with the unemployment rate, used in the SemStats 2013 challenge [10].

Again, for those datasets, the instances (cities, countries, etc.) are linked to DBpedia. For datasets used for both classification and regression, the regression target was discretized using equal frequency binning, usually into a high and a low class.
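
A minimal sketch of such equal frequency binning is given below, using pandas.qcut on an invented numeric target; the data, column names, and bin labels are illustrative assumptions.

    # Minimal sketch of equal frequency binning with pandas.qcut.
    # The data and column names are invented for illustration.
    import pandas as pd

    df = pd.DataFrame({"market_value": [1.2, 3.4, 0.8, 5.6, 2.1, 4.3, 0.5, 2.9, 3.8]})

    # Two equally sized classes (high/low), as used for most datasets ...
    df["target_2"] = pd.qcut(df["market_value"], q=2, labels=["low", "high"])

    # ... or three classes, as used e.g. for the AAUP and Forbes datasets.
    df["target_3"] = pd.qcut(df["market_value"], q=3, labels=["low", "medium", "high"])

    print(df)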

The third, and final, category contains datasets that were generated from existing RDF datasets, where the value of a certain property is used as a classification target. There are five datasets in this category:

  • The Drug-Food Interaction dataset contains a list of drug-recipe pairs and their interaction, i.e., “negative” and “neutral” [6]. The dataset was retrieved from FinkiLODFootnote 25. Furthermore, each drug is linked to DrugBankFootnote 26. We drew a stratified random sample of 2,000 instances from the complete dataset. When generating the features, we ignore the foodInteraction property in DrugBank, since it highly correlates with the target variable.

  • The AIFB dataset describes the AIFB research institute in terms of its staff, research groups, and publications. In [1], the dataset was first used to predict the affiliation (i.e., research group) for the people in the dataset. The dataset contains 178 members of research groups; however, the smallest group contains only 4 people and is removed from the dataset, leaving 4 classes. Also, we remove the employs relation, which is the inverse of the affiliation relation.

  • The AM dataset contains information about artifacts in the Amsterdam Museum [2]. Each artifact in the dataset is linked to other artifacts and details about its production, material, and content. It also has an artifact category, which serves as a prediction target. We have drawn a stratified random sample of 1,000 instances from the complete dataset. We also removed the material relation, since it highly correlates with the artifact category.

  • The MUTAG dataset is distributed as an example dataset for the DL-Learner toolkitFootnote 27. It contains information about complex molecules that are potentially carcinogenic, which is given by the isMutagenic property.

  • The BGS dataset was created by the British Geological Survey and describes geological measurements in Great BritainFootnote 28. It was used in [17] to predict the lithogenesis property of named rock units. The dataset contains 146 named rock units with a lithogenesis, from which we use the two largest classes.

An overview of the datasets is given in Tables 1, 2, and 3. For each dataset, we depict the number of instances, the machine learning tasks in which the dataset is used (C stands for classification and R stands for regression), the source of the dataset, and the LOD datasets to which the dataset is linked. In addition, we report basic statistics of the properties in the LOD datasets, i.e., the average, median, maximum, and minimum number of types, categories, outgoing relations (rel out), incoming relations (rel in), outgoing relations including values (rel-vals out), and incoming relations including values (rel-vals in). The datasets, as well as a detailed description, a link quality evaluation, and licensing information, can be found onlineFootnote 29.
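
To make the reported per-entity statistics concrete, the sketch below counts types, categories, and distinct outgoing/incoming relations for a single DBpedia entity via SPARQL; the endpoint, the example entity, and the choice of rdf:type and dct:subject as the “types” and “categories” predicates are assumptions made for illustration.

    # Minimal sketch: per-entity counts similar to those reported in the
    # tables (types, categories, outgoing and incoming relations). The
    # endpoint, example entity, and predicate choices are assumptions.
    from SPARQLWrapper import SPARQLWrapper, JSON

    ENDPOINT = "https://dbpedia.org/sparql"
    ENTITY = "http://dbpedia.org/resource/Berlin"  # illustrative example

    def count(query: str) -> int:
        sparql = SPARQLWrapper(ENDPOINT)
        sparql.setQuery(query)
        sparql.setReturnFormat(JSON)
        res = sparql.query().convert()
        return int(res["results"]["bindings"][0]["n"]["value"])

    stats = {
        "types": count(f"SELECT (COUNT(?t) AS ?n) WHERE {{ <{ENTITY}> a ?t }}"),
        "categories": count(
            f"SELECT (COUNT(?c) AS ?n) WHERE {{ <{ENTITY}> "
            f"<http://purl.org/dc/terms/subject> ?c }}"),
        "rel out": count(
            f"SELECT (COUNT(DISTINCT ?p) AS ?n) WHERE {{ <{ENTITY}> ?p ?o }}"),
        "rel in": count(
            f"SELECT (COUNT(DISTINCT ?p) AS ?n) WHERE {{ ?s ?p <{ENTITY}> }}"),
    }
    print(stats)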

From the given statistics, we can make the following observations: (i) DBpedia contains significantly fewer owl:sameAs links to YAGO than to Wikidata; (ii) DBpedia provides the highest number of types and categories per entity on average; (iii) Wikidata contains the highest number of outgoing and incoming relations for most of the datasets; (iv) YAGO contains the highest number of outgoing and incoming relation values for most of the datasets.

Table 1. Datasets statistics
Table 2. Datasets statistics
Table 3. Datasets statistics

4 Conclusion and Outlook

In this paper, we have introduced a collection of 22 benchmark datasets for machine learning on the Semantic Web. So far, we have concentrated on classification and regression tasks. There are methods to derive clustering and outlier detection benchmarks from classification and regression datasets [4, 5], so that extending the dataset collection for such unsupervised tasks is possible as well. Furthermore, as many datasets on the Semantic Web use extensive hierarchies in the form of ontologies, building benchmark datasets for tasks like hierarchical multi-label classification [15] would also be an interesting extension.