1 Introduction

sparql is the standard query language for retrieving and manipulating data represented in the Resource Description Framework (rdf) [11]. sparql is a key technology of the semantic web and has become very popular since it became an official w3c recommendation [1].

The construction of efficient sparql query evaluators faces several challenges. First, rdf datasets are increasingly large, with some already containing more than a billion triples. To handle this growing amount of data efficiently, we need systems that are distributed and that scale. Furthermore, semantic data is often dynamic (frequently updated), so answering quickly after a change in the input data is a very desirable property for a sparql evaluator. In this context, we propose sparqlgx: an engine for evaluating sparql queries based on Apache Spark [21]. It relies on a compiler of sparql conjunctive queries which generates Scala code that is executed by the Spark infrastructure. The source code of our system is available online at the following url: https://github.com/tyrex-team/sparqlgx.

The paper is organized as follows: we first introduce the technologies that we consider in Sect. 2. Then, in Sect. 3, we describe sparqlgx, and we present additional available tools in Sect. 4. Section 5 reports on our experimental validation comparing our implementation with other open source hdfs-based rdf systems. Finally, we review related work in Sect. 6 and conclude in Sect. 7.

2 Background

The Resource Description Framework (rdf) is a language standardized by w3c to express structured information on the Web as graphs [11]. It models knowledge about arbitrary resources using Uniform Resource Identifiers (uris), Blank Nodes and Literals. rdf data is structured in sentences, or triples, written (s p o), each one having a subject s, a predicate p and an object o.

The sparql standard language has been studied under various forms and fragments. We focus on the problem of evaluating the Basic Graph Pattern (bgp) fragment over a dataset of rdf triples. The bgp fragment is composed of conjunctions of triple patterns (tps). A candidate solution is a mapping from the variables of the query to values; it satisfies a tp when replacing the variables of the tp with their values yields a triple that appears in the rdf data. A query solution is a candidate solution that satisfies all the tps of the query.

Apache Hadoop is a framework for distributed systems based on the MapReduce paradigm. The Hadoop Distributed File System (hdfs) is a popular distributed file system handling the distribution of data across a cluster and its replication [19].

Apache Spark [21] is a MapReduce-like data-parallel framework for large-scale data processing that runs on top of the jvm. Spark can be set up to use hdfs.

3 SPARQLGX: General Architecture

In this section, we explain how we translate queries from our sparql fragment into lower-level Scala code [14] that is directly executable with the Spark API. To this end, after presenting the chosen data storage model, we give a translation into a sequence of Spark-compliant Scala commands for each operator of the considered fragment.

3.1 Data Storage Model

In order to process rdf datasets with Apache Spark, we first have to adopt a convenient storage model on hdfs. From a "raw" storage (e.g. a file in the N-Triples standard, which is a simple list of all triples) to complex schemes (e.g. involving indexes or B-trees), there are many ways to store rdf data. Any storage choice is a compromise between (1) the time required to convert the original data into the target format, (2) the total disk space needed, and (3) the response time improvement it enables.

rdf triples have very specific semantics. In an rdf triple (s p o), the predicate p represents the "semantic relationship" between the subject s and the object o. Thus, there are often relatively few distinct predicates compared to the number of distinct subjects or objects. The vertically partitioned architecture introduced by Abadi et al. in [2] takes advantage of this observation by storing each triple (s p o) in a file named p whose contents keep only the s and o entries. With respect to the three criteria listed above:

  1. Converting rdf data into a vertically partitioned dataset does not involve complex computation: each triple is read once and the pair (subject, object) is appended to the file of its predicate.

  2. For large datasets with only a few distinct predicates, two uris are stored instead of three, which reduces the disk footprint compared with the input dataset.

  3. Having vertically partitioned data reduces the evaluation time of triple patterns whose predicate is a constant (i.e. not a variable): searches are limited to the relevant files. In practice, most sparql queries have triple patterns with a constant predicate: [7] showed that graph patterns where all predicates are constant represent 77.81 % of the queries asked to dbpedia and 98.08 % of the ones asked to swdf.

We believe that vertical partitioning is very well suited to rdf: it requires a single pass over the data with only simple computation, reduces the size of the dataset, and provides a form of indexing.
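The conversion itself fits naturally in the Spark API. The following is only a minimal illustrative sketch of such a preprocessing step, not the actual sparqlgx loader: the hdfs paths, the whitespace-based parsing of N-Triples lines and the per-predicate loop are assumptions made for the example.

    // Sketch: vertically partition an N-Triples-like file with Spark.
    val triples = sc.textFile("hdfs:///data/dataset.nt")
      .map(_.split(" ", 3))                               // subject, predicate, rest
      .map(t => (t(1), (t(0), t(2).stripSuffix(" ."))))   // keyed by predicate

    // Write one (subject, object) file per distinct predicate.
    // (A real loader would sanitize predicate uris before using them as paths.)
    triples.keys.distinct().collect().foreach { p =>
      triples.filter(_._1 == p)
        .map { case (_, (s, o)) => s + " " + o }
        .saveAsTextFile("hdfs:///vp/" + p)
    }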

3.2 SPARQL Fragment Translation

We compute the solutions of a conjunction of tps recursively. Given a conjunction of n tps, we recursively compute the set of solutions of the first \(n-1\) tps and then combine this set with the solutions of the last tp by joining them on their common variables.

To compute the solutions of a single tp: when the predicate is a constant, we open the relevant hdfs file using textFile; otherwise, we have to open all predicate files. Then, using the constants of the tp, we apply a filter to keep only the matching elements. Finally, the variable names appearing in the tp are used to name the retained fields. For instance, the following tp {?s age 21 .}, matching people that are 21 years old, is translated into:

figure a (generated Scala code)
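The generated listing itself is not reproduced here; the following is only an illustrative sketch of what such a translation can look like, assuming predicate files are stored as space-separated (subject, object) lines under a hypothetical hdfs:///vp/ directory:

    // Sketch for {?s age 21 .}: open the "age" predicate file, keep pairs
    // whose object is the constant 21, and keep the value bound to ?s.
    val tp1 = sc.textFile("hdfs:///vp/age")
      .map(_.split(" ")).map(t => (t(0), t(1)))   // (subject, object)
      .filter { case (_, o) => o == "21" }
      .map { case (s, _) => s }                   // solutions for ?s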

In order to translate a conjunction of tps (i.e. a bgp), the tps are joined. Two sets of partial solutions are joined using their common variables as a key: keyBy in Spark. Joining tps is then realized with join in Spark. For example, the following tps {?s age 21 . ?s gender ?g .} are translated into:

figure b (generated Scala code)
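Again as a sketch only, under the same assumptions as above, the two tps can be evaluated, keyed by their shared variable ?s, and joined:

    // Sketch for {?s age 21 . ?s gender ?g .}.
    val tp1 = sc.textFile("hdfs:///vp/age")
      .map(_.split(" ")).map(t => (t(0), t(1)))
      .filter { case (_, o) => o == "21" }
      .map { case (s, _) => s }                    // bindings of ?s

    val tp2 = sc.textFile("hdfs:///vp/gender")
      .map(_.split(" ")).map(t => (t(0), t(1)))    // bindings of (?s, ?g)

    val bgp = tp1.keyBy(identity)                  // key by the common variable ?s
      .join(tp2.keyBy { case (s, _) => s })
      .map { case (s, (_, (_, g))) => (s, g) }     // solutions (?s, ?g)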

A join with no common variables corresponds to a cross product (therefore a cartesian in Spark). For a conjunction of n tps we perform \((n-1)\) joins.

The obtained translation (the Scala code) thus depends on the initial order of the tps since the joins will be performed in the same order. This allows us to develop optimizations based on join commutativity such as the ones presented in Sect. 4.1.

3.3 SPARQL Fragment Extension

Once the tps are translated, we use a map to retain only the desired fields (i.e. the distinguished variables) of the query. At that stage, we can also modify results according to the sparql solution modifiers [1] (e.g. removing duplicates with distinct, sorting with sortByKey, returning only a few lines with take, etc.).
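For instance, for a hypothetical query ending with SELECT DISTINCT ?s ... LIMIT 10, the tail of the generated code could be sketched as follows (bgp denotes the joined solutions from the previous example):

    // Sketch: project on the distinguished variable ?s, then apply modifiers.
    val results = bgp.map { case (s, _) => s }   // projection on ?s
      .distinct()                                // DISTINCT
      .take(10)                                  // LIMIT 10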

Furthermore, we also easily translate two additional sparql keywords: union and optional, provided they are located at top level in the where clauses. Indeed, Spark allows aggregating sets having similar structures with union and adding data when possible with leftOuterJoin. Thus sparqlgx natively supports a slight extension (unions and optionals at top level) of the extensively studied sparql fragment made of conjunctions of triple patterns.
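As a sketch only, assuming bgp1, bgp2 and opt are intermediate solution sets produced as above and keyed by their shared variable:

    // Sketch: top-level UNION and OPTIONAL.
    val unionResult    = bgp1.union(bgp2)          // UNION of compatible solution sets
    val optionalResult = bgp1.leftOuterJoin(opt)   // OPTIONAL: keep all bgp1 solutions,
                                                   // adding opt bindings when they exist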

4 Additional Features

4.1 Optimized Join Order with Statistics

The evaluation process (using Spark) first evaluates tps and then joins these subsets according to their common variables; thus, minimizing the size of the intermediate results involved in the join process reduces evaluation time (since communication between workers is then faster). Thereby, statistics on the data and on intermediate result sizes provide useful information that we exploit for optimization purposes.

Given an rdf dataset \(\mathcal {D}\) having T triples, and given a position in an rdf sentence \(k~\in ~\{subj,pred,obj\}\), we define the selectivity in \(\mathcal {D}\) of an element e located at k as: (1) the number of occurrences of e at position k in \(\mathcal {D}\) if e is a constant; (2) T if e is a variable. We denote it \(sel_\mathcal {D}^k(e)\). Similarly, we define the selectivity of a tp (a b c .) over an rdf dataset \(\mathcal {D}\) as: \(SEL_\mathcal {D}(a,b,c)\,=\, \min (sel_\mathcal {D}^{subj}(a)\,,\,sel_\mathcal {D}^{pred}(b)\,,\,sel_\mathcal {D}^{obj}(c))\).

Thereby, to rank each tp, we compute statistics on the dataset by counting the occurrences of all distinct subjects, predicates and objects. This is implemented in a compile-time module that sorts tps in ascending order of their selectivities before they are translated.

Finally, we also want to avoid cartesian products. Given an ordered list l of tps, we compute a new list \(l'\) by repeating the following procedure: remove from l and append to \(l'\) the first tp of l that shares a variable with a tp already in \(l'\); if no such tp exists, we take the first tp of l.
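The following sketch summarizes this compile-time reordering; the TP representation, the precomputed occurrence maps and all names are illustrative assumptions, not the actual sparqlgx module:

    // Illustrative sketch of the join-order optimization (hypothetical names).
    case class TP(s: String, p: String, o: String) {
      def vars: List[String] = List(s, p, o).filter(_.startsWith("?"))
    }

    // sel: occurrence count of a constant at a given position, T for a variable.
    def sel(counts: Map[String, Long], total: Long)(e: String): Long =
      if (e.startsWith("?")) total else counts.getOrElse(e, 0L)

    // SEL(a,b,c) = min of the three positional selectivities.
    def SEL(subj: Map[String, Long], pred: Map[String, Long],
            obj: Map[String, Long], total: Long)(tp: TP): Long =
      List(sel(subj, total)(tp.s), sel(pred, total)(tp.p), sel(obj, total)(tp.o)).min

    // Reorder the selectivity-sorted tps to avoid cartesian products: repeatedly
    // pick the first remaining tp sharing a variable with an already chosen one.
    def avoidCartesian(sorted: List[TP]): List[TP] = {
      def step(rest: List[TP], acc: List[TP]): List[TP] = rest match {
        case Nil => acc.reverse
        case _ =>
          val bound = acc.flatMap(_.vars).toSet
          val next  = rest.find(_.vars.exists(bound)).getOrElse(rest.head)
          step(rest.filterNot(_ == next), next :: acc)
      }
      step(sorted, Nil)
    }

Sorting the tps by ascending SEL and then applying avoidCartesian yields the order in which the tps are translated.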

4.2 SDE: SPARQLGX as a Direct Evaluator

Our tool evaluates sparql queries using Apache Spark after preprocessing rdf data. However, in certain situations, data might be dynamic (e.g. subject to updates) and/or users might only need to evaluate a single query (e.g. when evaluation is integrated into a pipeline of transformations). In such cases, it is interesting to limit as much as possible both the preprocessing time and the query evaluation time.

To take the original triple file as source, we only have to change our storage model, i.e. the way tps are treated in the translation process: instead of searching in predicate files, we directly use the initial file; the rest of the translation process remains the same. We call this variant of our evaluator the "direct evaluator" or sde.
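As a sketch, under the same illustrative assumptions as in Sect. 3.2, a tp is then evaluated against the whole triple file, filtering on its constant predicate:

    // Sketch of the sde variant for {?s age 21 .}: read the original file directly.
    val tp1 = sc.textFile("hdfs:///data/dataset.nt")
      .map(_.split(" ", 3))
      .map(t => (t(0), t(1), t(2).stripSuffix(" .")))
      .filter { case (_, p, o) => p == "age" && o == "21" }  // constants of the tp
      .map { case (s, _, _) => s }                           // bindings of ?s
    // The join machinery of Sect. 3.2 is then reused unchanged.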

5 Experimental Results

In this section, we present an excerpt of our empirical comparison of our approach with other open source hdfs-based rdf systems. RYA [16] relies on key-value tables using Apache Accumulo. CliqueSquare [8] converts queries into a list of Hadoop instructions. S2RDF [18] is a recent tool that loads rdf data according to a novel scheme called ExtVP and then queries the resulting relational tables using Apache SparkSQL [4]. Finally, PigSPARQL [17] simply translates sparql queries into an executable PigLatin [15] instruction sequence, and RDFHive is a straightforward tool we made to evaluate sparql conjunctive queries directly on Apache Hive [20] after a naive translation of sparql into HiveQL.

Table 1. General information about the datasets used.

All experiments are performed on a cluster of 10 Virtual Machines (vm) distributed on two physical machines (each one running 5 of them). The operating system is CentOS-X64 6.6-final. Each vm has 17 GB of memory and 4 cores at 2.1 GHz. We kept the default setting with which hdfs is resilient to the loss of two nodes, and we do not consider the data import into hdfs as part of the preprocessing phase.

We compare the presented systems using two popular benchmarks: LUBM [9] and WatDiv [3]. Table 1 presents the characteristics of the considered datasets. We rely on three metrics to discuss the results (Table 2): query execution times, preprocessing times (for systems that need to preprocess data), and disk footprints. For space reasons, Table 2 presents three LUBM queries: Q1 because it combines a large input with high selectivity, Q2 because it produces large intermediate results while involving a triangular pattern, and Q14 for its simplicity. Moreover, we aggregate the WatDiv queries by the categories proposed in the WatDiv paper [3]: 3 complex (QC), 5 snowflake-shaped (QF), 5 linear (QL) and 7 star queries (QS). In Table 2 we indicate "timeout" whenever the process did not complete within a certain amount of time. We indicate "fail" whenever the system crashed before this timeout delay; this covers several kinds of failure, such as the inability to evaluate queries or to preprocess the datasets. We indicate "n/a" whenever the task could not be accomplished because of a failure during the preprocessing phase.

Table 2. Compared system performance.

Table 2 shows that sparqlgx always answers all tested queries on all tested datasets, whereas this is not the case with the other rdf datastores, which either time out or fail at some point.

In addition, sparqlgx outperforms several implementations in many cases (as shown in Table 2), while implementing a simple architecture built exclusively on top of open source and publicly available technologies. Furthermore, the sde variant of our implementation, which does not require any preprocessing phase, offers performance similar to that obtained with state-of-the-art implementations that do require preprocessing.

6 Related Work

In recent years, many rdf systems capable of evaluating sparql queries have been developed [12]. These stores can be divided into two categories: centralized systems (e.g. RDF-3X [13] or Virtuoso [5]) and distributed ones, which we review further. Distributed rdf stores can in turn be divided into three categories. (1) Ad-hoc systems that are specially designed for rdf data and that distribute and store data across the nodes according to custom methods (e.g. 4store [10]). (2) Systems that use a communication layer between centralized stores deployed across the cluster and then evaluate sub-queries on each node, such as Partout with RDF-3X [6]. (3) Lastly, rdf systems [8, 16–18] built on top of distributed Cloud platforms such as Apache Hadoop. One major interest of such platforms lies in their common file system (e.g., hdfs): various applications can access the data at the same time and the distribution/replication issues are transparent. These systems [8, 16–18] then evaluate sparql conjunctive queries using various tools, as presented in Sect. 5 (e.g. Accumulo, Hive, Spark, etc.). To set up appropriate tools for pipeline applications, we choose to distribute data with a Cloud platform (hdfs) and to evaluate queries using Spark. We compared the performances of sparqlgx with the most closely related implementations in Sect. 5.

Finally, it is worth noting that sparql is a very expressive language offering a rich set of features and operators. Most evaluators based on Cloud platforms focus on the restricted sparql fragment composed of conjunctive queries. sparqlgx also natively supports a slight extension of this fragment with union and optional operators at top level.

7 Conclusion

We proposed sparqlgx: a tool for the efficient evaluation of sparql queries on distributed rdf datasets. sparql queries are translated into Spark executable code that attempts to leverage the advantages of the Spark platform in the specific setting of rdf data. sparqlgx also comes with a direct evaluator based on the same sparql translation process, called sde, for situations where preprocessing time matters as much as query evaluation time. We reported on practical experiments in which our systems outperform several state-of-the-art Hadoop-reliant systems, while implementing a simple architecture that is easily deployable across a cluster.