
1 Introduction

Reachability querying is one of the most important research problems in graph processing, especially in large-scale graph processing. It is widely used in the Semantic Web, for example in lightweight services [1], semantic query [19] and semantic mining [2], as well as in knowledge ontologies, biological networks and social networks. A directed graph can always be transformed into a directed acyclic graph (DAG) by coalescing its strongly connected components into single vertices, and reachability queries on the original graph can then be answered on the DAG [3]. Let G = (V, E) be a DAG with n vertices (n = |V|) and m edges (m = |E|). A reachability query (u → v?, with u, v \( \in \) V) asks whether there exists a path \( (v_1, v_2, \ldots, v_p) \) in G such that \( (v_i, v_{i+1}) \) is an edge in E for 1 ≤ i < p, \( u = v_1 \) and \( v = v_p \). However, the recent dramatic growth of graph data poses new challenges for reachability computation. For example, the Linked Open Data (LOD) project [4] contained 2973 open datasets and more than 149.4 billion triples as of August 2017. Therefore, a number of graph indexing approaches have been proposed to improve the efficiency of reachability queries.

2 Related Work

There are two kinds of graph indexing approaches according to the size of graph data:

  (1) Approaches for small and medium-scale graphs with fewer than 1 million vertices, including Chain-Cover [5, 6], Tree-Cover [7,8,9,10], and 2-Hop and 3-Hop [11,12,13].

  (2) Approaches for large-scale graphs with more than 1 million vertices, including Refined Online Search [14,15,16] and Bloom Filter Labeling [17].

All these approaches trade off online processing cost against offline processing cost, where the online cost is the reachability query time and the offline cost comprises the index construction time and the index size.

The above approaches present different reachability query algorithms based on different indexing approaches. However, there are several problems:

  (1) The balance between reachability query time, index construction time and index size: these algorithms try to speed up query answering while reducing the index construction time and keeping the index size reasonable.

  (2) The scalability bottleneck for handling massive graphs: some reachability algorithms cannot scale to very large real-world graphs.

  (3) The limitation of platform: most of the algorithms are implemented in C++ based on the Standard Template Library (STL), and they have not been extended to a Cloud Platform [18], which has significant advantages for large-scale data processing.

In this paper, we propose Min-Forest, which uses a forest-structured index to prune the search space of the original graph and thereby reduce query time. In addition, the Min-Forest algorithms are implemented on the Spark cloud platform to improve scalability for large-scale graphs. The main idea of Min-Forest is as follows:

  (1) (Min-Forest) The original DAG is divided into a forest with a minimal number of trees (Min-Forest) by cutting some edges from the original graph. Each tree in Min-Forest carries the major reachability information among its own vertices, and the deleted edges, called Non-Forest Edges, carry the relation information between trees. Each Non-Forest Edge is a deleted incoming edge of its ending vertex, which ensures that the in-degree of every vertex in the Min-Forest is at most 1.

  (2) (Interval Labeling) Each vertex in Min-Forest is assigned an interval label (X, Y), where X is the tree ID in the forest and Y is the vertex position on the tree. Therefore, a positive reachability query between two vertices in the Min-Forest can be answered immediately from the interval labels.

  (3) (Start Vertex Set of Non-Forest Edge) For each vertex at which Non-Forest Edges end, we record all the starting vertices of those edges in order to answer reachability queries across trees, and call this set the Start Vertex Set of Non-Forest Edge.

Therefore, the positive reachability query between any two vertices in the original DAG can further be answered by Start Vertex Set of Non-Forest Edge and Nearest Ancestor Vertex of Non-Forest Edge, because they connect trees in Min-Forest.

The rest of the paper is organized as follows. In Sect. 3, we introduce how to construct a Min-Forest from the original DAG. In Sect. 4, we assign an interval label to each vertex of Min-Forest, which identifies the tree and the branch the vertex belongs to. In Sect. 5, we assign a connectivity label to each vertex of Min-Forest to preserve the connectivity of the original DAG. In Sect. 6, we present the reachability query approach of Min-Forest, and describe the corresponding query algorithm and its optimized variant for special graphs with redundant data. In Sect. 7, we analyze experimental results on four kinds of graphs (small sparse, large sparse, small dense and large dense) in terms of query time, index size and construction time, and we also analyze the scalability of Min-Forest.

3 Construction of Min-Forest

Our study is motivated by a family of tree-based approaches, and we propose Min-Forest, which consists of tree-shaped subgraphs that cover a DAG G.

Let \( T_1 = (V_1, E_1) \) and \( T_2 = (V_2, E_2) \) be two trees in G = (V, E). We use \( T_1 \cap T_2 \) to denote the intersection of \( T_1 \) and \( T_2 \) (both vertices and edges), and \( T_1 \cup T_2 \) to denote their union (both vertices and edges). We use \( E_{T_1} \) to denote the edge set of \( T_1 \), \( E_{T_1 \cup T_2} \) to denote the union of the edges of \( T_1 \) and \( T_2 \), and \( E_{T_1 \setminus T_2} \) to denote the complement (set difference) of the edges of \( T_1 \) and \( T_2 \). We define Non-Forest Edge based on the above terminology.

Definition 1 (Forest).

Given a DAG G = (V, E), let \( T_1, T_2, \ldots, T_n \) be the trees obtained by deleting some edges from G, where \( T_i \cap T_j = \varnothing \) for \( i \ne j \), \( i, j \in [1, n] \). Then \( F_G = T_1 \cup T_2 \cup \cdots \cup T_n = (V^*, E^*) \), with \( V^* = V \), \( E^* \subseteq E \) and \( E^* = E_{T_1 \cup T_2 \cup \cdots \cup T_n} \), is called the Forest of G, and \( S = E \setminus E^* = \{(u, v) \mid (u, v) \in E \wedge (u, v) \notin E^*\} \) is called the Non-Forest Edge Set.

As an example, Fig. 2 shows a Forest \( F_G \) decomposed from the DAG G in Fig. 1, and Fig. 2(a) and (b) show the two trees \( T_1 \) and \( T_2 \) in \( F_G \). The Non-Forest Edge Set is generated during the decomposition of G: \( S = E \setminus E_{T_1 \cup T_2} \) = {(4, 6), (4, 7), (5, 9), (6, 9), (7, 10), (9, 8), (9, 15), (10, 12), (10, 13), (11, 10), (12, 13), (12, 16), (13, 16), (14, 13), (15, 16), (16, 17)}.

Fig. 1. DAG G.

Fig. 2. Decomposed Forest \( F_G \) from G.

Converting a DAG to a Forest may yield several different forests, and different forests may contain different numbers of trees. In the worst case, every vertex of the original DAG becomes its own tree, so the number of trees equals the number of vertices in the original DAG. Therefore, we define Min-Forest as a Forest of the original DAG that contains as few trees as possible.

Lemma 1 (Min-Forest Criterion).

Given DAG G, if the number of vertices with in-degree 0 in G is N, the minimal number of trees in Min-Forest F is N.

Proof:

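A vertex with in-degree 0 can only be the root of a tree. If there are N vertices with in-degree 0, then there are at least N root vertices when converting G to Min-Forest. Conversely, every other vertex has at least one incoming edge; keeping exactly one incoming edge for each of them yields, because G is acyclic, a forest whose only roots are these N vertices. That is, the number of trees in Min-Forest is N.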

From Lemma 1, the conversion from G to Min-Forest proceeds as follows: (1) traverse G to find the N vertices with in-degree 0; (2) for every vertex with in-degree greater than 1, delete its incoming edges except one, so that the in-degree of each vertex is at most 1; (3) finally, obtain the Min-Forest \( F_G \) with N trees and edge set E*, together with the set of deleted edges S, where E = E* + S.

Following this conversion process, we design the Min-Forest Construction algorithm, paying attention to its scalability and space consumption.

Algorithm 1. Min-Forest Construction.

Using Algorithm 1, we obtain the Min-Forest with Tree1 and Tree2 in Fig. 2 from the original DAG G in Fig. 1. The Non-Forest Edge set is also obtained during this conversion. Two sets of edges are generated by executing the Map operation twice: E, the edge set of G, and E*, the edge set of the converted Min-Forest. The Non-Forest Edge set is then the difference between E and E*.

Algorithm 1 has two advantages. First, the forest structure, being built from trees, helps speed up reachability queries. Second, the built-in functions of Spark filter out all isolated vertices automatically, which reduces the difficulty of dealing with large-scale graph data.
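The conversion steps of this section can be illustrated with a minimal single-machine sketch in plain Python (not the Spark implementation of Algorithm 1). The function name `build_min_forest`, the toy DAG and the rule for choosing which incoming edge to keep (the first one encountered) are assumptions made only for illustration; the text does not specify which incoming edge is retained.

```python
def build_min_forest(vertices, edges):
    """Split the edges of a DAG into forest edges E* and Non-Forest edges S.

    Assumption: when a vertex has several incoming edges, the first one
    listed in `edges` is kept and the others are deleted.
    """
    kept_parent = {}        # vertex -> the single parent kept in the forest
    forest_edges = []       # E*
    non_forest_edges = []   # S = E \ E*
    for u, v in edges:
        if v not in kept_parent:
            kept_parent[v] = u
            forest_edges.append((u, v))
        else:
            non_forest_edges.append((u, v))
    # Roots are exactly the vertices with in-degree 0 in the original DAG.
    roots = [v for v in vertices if v not in kept_parent]
    return roots, forest_edges, non_forest_edges


# Hypothetical toy DAG (not the graph of Fig. 1).
vertices = [1, 2, 3, 4, 5, 6]
edges = [(1, 3), (1, 4), (2, 4), (2, 5), (3, 6), (4, 6), (5, 6)]
roots, forest_edges, non_forest_edges = build_min_forest(vertices, edges)
print(roots)             # [1, 2]
print(forest_edges)      # [(1, 3), (1, 4), (2, 5), (3, 6)]
print(non_forest_edges)  # [(2, 4), (4, 6), (5, 6)]
```

In this toy graph the two vertices with in-degree 0 become the two roots, in line with Lemma 1, and every edge ends up in exactly one of E* and S.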

4 Interval Labeling of Min-Forest

In Sect. 3, we obtained the Min-Forest, consisting of trees, and the Non-Forest Edge set when converting the original G. In this section, we introduce how to label each vertex with a 3-tuple that covers both kinds of information.

The first two elements of the 3-tuple encode the position of the vertex in Min-Forest and help answer reachability queries within the trees of Min-Forest, while the last element encodes the connection information between trees. Together, these three elements compress the full transitive closure of G so that reachability queries on the original G can be answered.

4.1 Interval Label of Vertex in Min-Forest

In this section, we present how to assign each vertex its interval label, that is, the first two elements of the 3-tuple. Similar to the Path-Tree approach [8], we perform a Depth-First Search (DFS) to create an X label for each vertex, which denotes the tree ID given by the DFS order of the Min-Forest, and a Y label, which denotes the branch ID in the whole Min-Forest. Using the interval label (X, Y), we can easily answer reachability queries within the Min-Forest.

  • A. The DFS order of Min-Forest

We assign the X label of the interval label (X, Y) to each vertex by a DFS algorithm. The procedure is as follows: (1) find the vertices with in-degree 0 in Min-Forest, which are the root vertices of its trees (for example, the vertices in the set {1, 2} are root vertices with in-degree 0); (2) perform a DFS traversal from each root vertex in turn, until all vertices in Min-Forest are visited; (3) finally, order all vertices in Min-Forest by the DFS traversal order and use this order as their X label. Isolated vertices deleted while generating Min-Forest are labeled with X = 0 (Fig. 3).

Fig. 3. Interval label of Min-Forest.

Lemma 2.

For any two vertices u, v in Min-Forest, if u can reach v, then u.X  < v.X.

Proof:

Clearly, if u can reach v, the DFS traversal visits u before v, and it returns to u only after all descendants of v have been visited. So, if u can reach v, then u.X < v.X by the DFS order, but not vice versa.
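The DFS numbering of part A can be sketched as follows. This is a minimal illustration in plain Python under stated assumptions: the helper `dfs_order`, the `children` adjacency map and the toy forest are hypothetical and do not reproduce Fig. 3.

```python
def dfs_order(roots, children):
    """Return {vertex: X}, numbering vertices from 1 in DFS visiting order."""
    x = {}
    counter = 0
    stack = list(reversed(roots))   # the first root is processed first
    while stack:
        v = stack.pop()
        counter += 1
        x[v] = counter
        # Push children in reverse so the leftmost child is visited first.
        stack.extend(reversed(children.get(v, [])))
    return x


# Hypothetical forest with two trees rooted at 1 and 2.
children = {1: [3, 4], 3: [6], 2: [5]}
x = dfs_order([1, 2], children)
print(x)  # {1: 1, 3: 2, 6: 3, 4: 4, 2: 5, 5: 6}
```

Every parent receives a smaller X than each of its children, which is the property stated in Lemma 2.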

  • B. The Branch Order of Min-Forest

The X label alone can answer reachability queries between a root vertex and its child vertices, but not between child vertices below the same root vertex. Therefore, we assign the Y label of the interval label (X, Y) to each vertex as its branch order in Min-Forest to handle this kind of query. The branches resemble the branches of a tree growing from the root, and the Y label distinguishes them.

Definition 2 (Branch of Min-Forest).

A branch is a subdivision of Min-Forest F G that starts at the root vertex and explores vertices as far as possible along each edge until the leaf vertex with out-degree 0, and this subdivision path formed is a branch of F G .

As shown in Fig. 4, a, b, c, d, e, f and g are the seven branches of the Min-Forest \( F_G \). We observe that different branches never join back together, that a root vertex or father vertex may belong to several branches, and that each leaf vertex belongs to exactly one branch. For example, root vertex 6 belongs to branch b and branch c, but leaf vertex 16 belongs only to branch b. Therefore, we design a post-order traversal algorithm for quickly assigning the branch order Y to each vertex in Min-Forest.

Fig. 4. Branches of Min-Forest.

The procedure of the algorithm is as follows: (1) initialize the branch order Y of each vertex: a leaf vertex with out-degree 0 is initialized with its DFS order, and a non-leaf vertex with out-degree greater than 0 is initialized with 0; (2) traverse the vertices of Min-Forest in post-order, that is, always visit the child vertices from left to right first, and then visit their father vertex; (3) label the branch order Y of each vertex during the post-order traversal: if the branch order of a father vertex is less than that of one of its child vertices, it is updated with the branch order of that child.

For example, vertex 6 belongs to branch b and branch c, and it is the father of vertex 10 and vertex 13, which have branch orders 7 and 9 respectively, so vertex 6 is assigned the branch order 9, the maximal branch order of vertex 10 and vertex 13. From Fig. 4, we also note that the vertices below the same father vertex belong to different branches, and their interval labels (X, Y) are mutually exclusive. The branch order of a root vertex is always the maximal branch order among all its child vertices, and its interval label (X, Y) is larger than that of its child vertices. The Pregel iterative algorithm on the Spark cloud platform can easily label the branch order of the vertices of Min-Forest.

Algorithm 2 of BranchVisit shows the algorithm for assigning Y label of interval label (X, Y) for each vertex in Min-Forest by post-order traversal. Figure 4 shows the Y label based on the post-order traversal algorithm and the branch order of isolated vertices is labeled with 0.

Algorithm 2. BranchVisit.

Lemma 3.

For any two vertices u, v in Min-Forest, if u can reach v, then u.Y  ≥ v.Y.

Proof:

(Case 1) If u belongs only to the branch that v belongs to, then u.Y = v.Y. (Case 2) If u also belongs to branches other than the one v belongs to, then u is a root or father vertex that takes the maximal branch order of all its child vertices under the post-order traversal in Algorithm 2, so u.Y ≥ v.Y. Combining Case 1 and Case 2 proves the result.
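The post-order assignment of the branch order Y described above can be sketched as follows. This is a sequential illustration in plain Python, not the Pregel-based Algorithm 2; the helper `branch_order`, the toy forest and its X labels are hypothetical.

```python
def branch_order(roots, children, x):
    """Return {vertex: Y} computed by an iterative post-order traversal."""
    y = {}
    stack = [(r, False) for r in reversed(roots)]
    while stack:
        v, expanded = stack.pop()
        kids = children.get(v, [])
        if not kids:
            y[v] = x[v]                     # leaf: Y is its own DFS order
        elif expanded:
            y[v] = max(y[c] for c in kids)  # father: largest Y among children
        else:
            stack.append((v, True))         # revisit v after all its children
            stack.extend((c, False) for c in reversed(kids))
    return y


# Hypothetical forest and DFS order (X labels) used only for illustration.
children = {1: [3, 4], 3: [6], 2: [5]}
x = {1: 1, 3: 2, 6: 3, 4: 4, 2: 5, 5: 6}
y = branch_order([1, 2], children, x)
print(sorted(y.items()))  # [(1, 4), (2, 6), (3, 3), (4, 4), (5, 6), (6, 3)]
```

Taking the maximum over the children is equivalent to starting non-leaf vertices at 0 and raising the value whenever a child has a larger branch order, so a father never ends up with a smaller Y than any of its children (Lemma 3).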

4.2 Connectivity Between Vertices in Min-Forest

We map the connectivity between vertices in Min-Forest to a two-dimensional space, according to the interval label (X, Y) of vertices. As shown in Fig. 6, X-axis represents vertex’s DFS order, and Y-axis represents the vertex’s branch order in Min-Forest. As we know, for any two vertices u, v in Min-Forest, if u can reach v, then u.X  < v.X && u.Y  ≥ v.Y, based on Lemmas 2 and 3, that is, v is located at the lower right of u corresponding to the two-dimensional space. Therefore, we use two-dimensional map of Min-Forest to express the interval label (X, Y) of each vertex, and it also reflects the reachability between vertices. This approach is similar to the labeling approach of Path-Tree [8].

Lemma 4.

For any two vertices u, v in Min-Forest, u can reach v if and only if u.X  < v.X ∧ u.Y  ≥ v.Y.

Proof:

First, u → v ⇒ u.X < v.X ∧ u.Y ≥ v.Y follows directly from Lemmas 2 and 3. Second, we prove u.X < v.X ∧ u.Y ≥ v.Y ⇒ u → v. (Case 1) If u.Y = v.Y, then u and v are on the same branch of Min-Forest, and the DFS traversal visits u before v because u is an ancestor of v (u.X < v.X), so u → v. (Case 2) If u.Y > v.Y, we argue by contradiction. Assume u cannot reach v. Then either u is the root vertex in Min-Forest, or u and v belong to two different branches under the same father vertex. If u is the root vertex in Min-Forest, then u obviously reaches v, which contradicts the assumption. If u and v are under the same father vertex, then u.Y > v.Y means that the post-order traversal in Algorithm 2 visits v first and then u, so v.X < u.X, a contradiction. Combining Cases 1 and 2 proves the result (Fig. 5).

Fig. 5. Branches of Min-Forest.
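The test of Lemma 4 amounts to two comparisons on the interval labels. The sketch below is illustrative only: the function name `tree_reaches` and the labels of the toy forest are hypothetical, and the labels are assumed to have been produced by a DFS order (X) and a post-order branch order (Y) as described in Sect. 4.1.

```python
def tree_reaches(label, u, v):
    """Lemma 4 test: u reaches v inside the Min-Forest."""
    ux, uy = label[u]
    vx, vy = label[v]
    return ux < vx and uy >= vy


# Hypothetical (X, Y) labels for the forest 1 -> {3, 4}, 3 -> 6, 2 -> 5.
label = {1: (1, 4), 3: (2, 3), 6: (3, 3), 4: (4, 4), 2: (5, 6), 5: (6, 6)}

print(tree_reaches(label, 1, 6))  # True: 6 lies below 1
print(tree_reaches(label, 3, 4))  # False: siblings on different branches
print(tree_reaches(label, 1, 5))  # False: different trees
```

Since Y is the largest X found below a vertex, the condition is the same as asking whether v.X falls inside the range (u.X, u.Y], which is why a constant number of comparisons suffices.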

5 Connectivity Labeling for Vertices in Trees of Min-Forest

In Sect. 4, we introduced how to decide the reachability of vertices within the trees of Min-Forest. However, this does not yet answer reachability queries between arbitrary vertices of the original DAG, because some edges were deleted as Non-Forest edges when constructing Min-Forest, which weakens the connectivity of the original DAG. In this section, we restore the connections broken by the Non-Forest edges.

From the definitions of Forest, Min-Forest and Non-Forest Edge Set, we know \( E = E_F + S \), where the Non-Forest Edge Set S breaks the connectivity of the original DAG G. We define two concepts to capture the connectivity beyond Min-Forest: Start-Vertex Set of Non-Forest Edge (SVS) and Nearest Ancestor Vertex of Non-Forest Edge (NAV).

Definition 3 (SVS).

Start-Vertex Set of Non-Forest Edge (SVS) is, for each vertex at which a Non-Forest edge ends, the set of starting vertices of those Non-Forest edges in the original DAG. For an ending vertex v, \( SVS_v = \{ u_i \mid u_i \in V^* \wedge (u_i, v) \in S \} \).

We can obtain the SVS of each vertex from the Non-Forest Edge Set. Figure 6 shows the Non-Forest edges of the original G as dotted lines. For instance, \( SVS_8 \) = {9} and \( SVS_9 \) = {5, 6}. In particular, if no Non-Forest edge ends at a vertex v, we record \( SVS_v \) = Null.

Fig. 6. Deleted edges of Min-Forest (Color figure online)
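Definition 3 can be checked against the Non-Forest Edge Set S listed in Sect. 3 by grouping the edges by their ending vertex. The sketch below is a plain-Python illustration; the helper name `start_vertex_sets` is hypothetical, and vertices without an entry correspond to \( SVS_v \) = Null.

```python
from collections import defaultdict


def start_vertex_sets(non_forest_edges):
    """Group the starting vertices of Non-Forest edges by ending vertex."""
    svs = defaultdict(set)
    for u, v in non_forest_edges:
        svs[v].add(u)
    return dict(svs)


# Non-Forest Edge Set S as given in Sect. 3.
S = [(4, 6), (4, 7), (5, 9), (6, 9), (7, 10), (9, 8), (9, 15), (10, 12),
     (10, 13), (11, 10), (12, 13), (12, 16), (13, 16), (14, 13), (15, 16),
     (16, 17)]
svs = start_vertex_sets(S)
print(sorted(svs[8]), sorted(svs[9]))  # [9] [5, 6]
```

The two printed sets agree with the examples \( SVS_8 \) = {9} and \( SVS_9 \) = {5, 6} given above.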

Definition 4 (NAV).

Nearest Ancestor Vertex of Non-Forest Edge (NAV) is, for each vertex, its nearest ancestor in the tree of the Min-Forest \( F_G \) at which a Non-Forest edge ends. Suppose the vertex set \( \{u_1, \ldots, u_i, \ldots, u_n\} \) → v, where \( u_i \in V^* \wedge SVS_{u_i} \ne \varnothing \), \( i \in [1, n] \), \( n = |V^*| - 1 \), then

$$ NAV_{v} = \begin{cases} \max\{u_{1}, \ldots, u_{i}, \ldots, u_{n}\}, & i \in [1, n] \\ 0, & i = 0 \end{cases} $$
(1)

For instance, \( NAV_{15} \) = 8 and \( NAV_{11} \) = 7. For any vertex v without an NAV, we record \( NAV_v \) = Null.

We can find the NAV of each vertex by DFS traversal or BFS traversal in the same O(n’ + m’) time, where n’ and m’ are the numbers of vertices and edges in Min-Forest. However, the practical results from later experiments show that DFS traversal is better than BFS traversal, especially on large-scale data, because BFS traversal may pop and push some vertices repeatedly. Therefore, we design an algorithm that finds the NAV based on DFS traversal. The procedure is as follows: (1) first, find the ending vertices of the Non-Forest edges according to the Non-Forest Edge Set. The connectivity of these ending vertices is weakened by the deletion of the Non-Forest edges they were originally connected to. Figure 6 shows the deleted Non-Forest edges of the original G in red, and Fig. 7 shows the ending vertices of the Non-Forest edges in red in Min-Forest; (2) then find the NAV of each vertex in Min-Forest by DFS traversal. If the father vertex of vertex v is itself connected to a Non-Forest edge (that is, its SVS is non-empty, as required by Definition 4), then this father vertex is recorded as the NAV of v; otherwise, the NAV of v is its father’s NAV, determined recursively. In particular, if vertex v has no NAV, which means that none of its ancestor vertices is connected to a Non-Forest edge, we record \( NAV_v \) = 0.

Fig. 7. NAV of Min-Forest (Color figure online)
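The DFS-based NAV computation can be sketched as follows. This is a plain-Python illustration under stated assumptions: following Definition 4, an ancestor qualifies when its SVS is non-empty; `None` stands for a missing NAV; the helper `nearest_ancestor_vertices` and the toy forest (trees 1 → 3 → 5 → 7 and 2 → 4 → 6, with Non-Forest edges (4, 3) and (6, 7)) are hypothetical and do not reproduce Fig. 7.

```python
def nearest_ancestor_vertices(roots, children, svs):
    """Return {vertex: NAV}; None when no ancestor has a non-empty SVS."""
    nav = {}
    stack = [(r, None) for r in reversed(roots)]
    while stack:
        v, inherited = stack.pop()
        nav[v] = inherited
        # If a Non-Forest edge ends at v, v becomes the nearest candidate
        # for everything below it; otherwise the inherited value is passed on.
        passed_down = v if svs.get(v) else inherited
        stack.extend((c, passed_down) for c in reversed(children.get(v, [])))
    return nav


children = {1: [3], 3: [5], 5: [7], 2: [4], 4: [6]}
svs = {3: {4}, 7: {6}}
nav = nearest_ancestor_vertices([1, 2], children, svs)
print(nav[5], nav[7], nav[3])  # 3 3 None
```

Each vertex is pushed and popped once, so the traversal is linear in the size of the forest, in line with the O(n’ + m’) bound stated above.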

From the above, we can see that vertex u may reach vertex v along a branch of a tree in Min-Forest, or through deleted Non-Forest edges. Although deleting the Non-Forest edges reduces the reachability between vertices within Min-Forest, this reachability can still be recovered by using SVS and NAV. Therefore, a reachability query (u → v?, with u, v \( \in \) V) can be answered by the following three checks: (1) check the interval labels (X, Y) of u and v; (2) check the reachability from u to any vertex in \( SVS_v \); (3) check the reachability from u to \( NAV_v \).

For example, for the reachability query (5 → 9?), we first check whether vertex 5 and vertex 9 are on the same branch of the Min-Forest in Fig. 7. If they are not, we cannot immediately conclude that vertex 5 cannot reach vertex 9, because the original graph in Fig. 6 contains the edge (5, 9), which was deleted when constructing Min-Forest. Therefore, we use \( SVS_9 \) to check whether a starting vertex of a Non-Forest edge ending at vertex 9 is reachable from vertex 5. From Fig. 6, we get \( SVS_9 \) = {5, 6}, so we conclude 5 → 9.

Theorem 1.

Given vertices u and v in the Min-Forest \( F_G \) of the original DAG G, u can reach v if and only if one of the following holds: (1) u can reach v in \( F_G \), i.e., u.X < v.X ∧ u.Y ≥ v.Y; (2) u can reach v through one of the vertices in \( SVS_v \), directly or indirectly; (3) u can reach v indirectly through \( NAV_v \).

Proof:

The proposition u → v is equivalent to the proposition that u can reach v or one of v’s ancestors, and v’s ancestors fall into the following cases: (1) v’s ancestors are direct ancestors of v in Min-Forest by Definition 2, in which case u → v is equivalent to u.X < v.X ∧ u.Y ≥ v.Y; (2) v’s ancestors are vertices in \( SVS_v \) by Definition 3, in which case u → v is equivalent to u → w for some w \( \in \) \( SVS_v \); (3) v’s ancestor is \( NAV_v \) by Definition 4, in which case u → v is equivalent to u → \( NAV_v \).

In conclusion, we can answer a reachability query by the interval label (X, Y) of Min-Forest, SVS or NAV. We therefore construct the vertex index according to Theorem 1, and build the index of each vertex v from the 3-tuple (X, Y, \( NAV_v \)) and \( SVS_v \). Table 1 lists the index for each vertex in Fig. 1, and the construction time for the index is O(n’ + m).

Table 1. DAG index for vertices in DAG G of Fig. 1

6 Reachability Query of Min-Forest Approach

To answer reachability query between two vertices of u and v, Min-Forest query processing presents a 3-step querying approach, and it answers whether u can reach v after the following checking:

  (1) Whether the interval label (X, Y) of u includes that of v in Min-Forest, that is, whether u.X < v.X ∧ u.Y ≥ v.Y.

  (2) Whether u can reach some w in \( SVS_v \), where w is one of the ancestor vertices in \( SVS_v \). This step is an iterative process that finds the ancestor vertices of the vertices in \( SVS_v \).

  (3) Whether u can reach v through \( NAV_v \).

Vertex u can reach vertex v if and only if (1) or (2) or (3) holds.

We design an efficient reachability query algorithm that runs in O(d) time, where d is the number of deleted edges over all vertices, and we know that d ≪ n (n is the number of vertices in the DAG). The reachability query algorithm is described in Algorithm 3.

Algorithm 3. Reachability query.
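The three checks of this section can be sketched as a recursive procedure. The code below is a minimal plain-Python illustration, not Algorithm 3 itself: the function names, the convention that a vertex reaches itself, the `seen` set used to avoid re-expanding a vertex, and the toy index (trees 1 → 3 → 5 → 7 and 2 → 4 → 6, Non-Forest edges (4, 3) and (6, 7)) are all assumptions made for illustration.

```python
def make_query(label, svs, nav):
    """Build a reachability test from the index (X, Y, NAV) and SVS."""

    def tree_reaches(u, v):
        (ux, uy), (vx, vy) = label[u], label[v]
        return ux < vx and uy >= vy

    def reaches(u, v, seen=None):
        seen = set() if seen is None else seen
        if u == v or tree_reaches(u, v):   # check (1): interval labels
            return True
        if v in seen:                      # already expanded without success
            return False
        seen.add(v)
        for w in svs.get(v, ()):           # check (2): Non-Forest edges into v
            if reaches(u, w, seen):
                return True
        a = nav.get(v)                     # check (3): nearest ancestor with SVS
        return a is not None and reaches(u, a, seen)

    return reaches


# Hypothetical index for the toy DAG described in the lead-in.
label = {1: (1, 4), 3: (2, 4), 5: (3, 4), 7: (4, 4),
         2: (5, 7), 4: (6, 7), 6: (7, 7)}
svs = {3: {4}, 7: {6}}
nav = {1: None, 3: None, 5: 3, 7: 3, 2: None, 4: None, 6: None}
reaches = make_query(label, svs, nav)

print(reaches(2, 5))  # True: 2 -> 4 -> 3 -> 5 uses the Non-Forest edge (4, 3)
print(reaches(1, 6))  # False: no path from 1 to 6
```

The recursion always moves to a vertex that precedes v in the original DAG (either a starting vertex of a Non-Forest edge or a tree ancestor), so it terminates on acyclic input.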

7 Experiments

We conducted an extensive set of experiments to evaluate the performance of Min-Forest in comparison with state-of-the-art reachability approaches. We focused on three important measures for reachability queries: query time, index size and construction time.

7.1 Reachability Query

Our experiments are conducted on a Dell desktop computer equipped with 4 Intel Core i5 CPUs at 3.20 GHz and 36 gigabytes of main memory. The operating system is CentOS 7 (Linux), running the Spark cloud platform, version 2.1.0. We study other reachability algorithms implemented in C++, such as Grail [12], Path-Tree [8] and BFL [15]. We implement the Min-Forest approach in Scala 2.11.8 on Spark GraphX 2.1.0 in local mode, a special stand-alone mode of Spark in which the computation runs on a single machine.

As for dataset size, we consider a dataset with fewer than 500,000 vertices a small dataset, and any other a large dataset. As for dataset density, we consider a dataset whose ratio of edges to vertices is below 1.5 a sparse dataset, and any other a dense dataset. We therefore divide the 22 datasets into 4 categories based on size and density: Small Sparse Datasets, Small Dense Datasets, Large Sparse Datasets and Large Dense Datasets. We use the real graph datasets listed in Tables 2, 3, 4 and 5, which are the benchmark sets of recent reachability research. We obtained these graph datasets from https://code.google.com/arc-hive/p/grail/downloads, provided by Dr. Wei Hao of the Chinese University of Hong Kong.

Table 2. Small sparse datasets.
Table 3. Small dense datasets.

In Tables 2, 3, 4 and 5, the columns |V|, |E| and |E|/|V| record the number of vertices, the number of edges, and the ratio of edges to vertices in the original DAG, respectively. The columns \( |V_F| \) and \( |E_F| \) record the numbers of vertices and edges in the Min-Forest converted from the original DAG, and the column \( |E_d| \) records the number of deleted Non-Forest edges. All four groups of datasets are ordered by the number of vertices.

Table 4. Large sparse datasets.
Table 5. Large dense datasets.

7.2 Query Time, Index Size and Construction Time

We use four groups of real datasets to verify the efficiency of Min-Forest approach. In this section, we report these four groups of experiment results to address query time, index size and construction time.

The Min-Forest approach is implemented on Spark GraphX, unlike the other approaches, which are implemented in C++ with the STL. We mainly compare Min-Forest with the state-of-the-art reachability approaches Grail, Path-Tree and BFL+, which generally perform well as analyzed in [12, 15]. All experiments in [12] were performed on a machine with an x86_64 Dual Core AMD Opteron (tm) Processor 870 and 32 GB RAM running Linux, and all experiments in [15] were performed on a machine with a 3.60 GHz Intel Core i7-4790 CPU and 32 GB RAM running Linux, so both machines perform better than ours.

Tables 6, 7, 8 and 9 show query time, construction time and index size for four groups of datasets. As for small sparse and dense datasets, we compare Min-Forest with Grail and PT, and as for large datasets, we compare with Grail and BFL+.

Table 6. Query time (ms), index size and construction time (ms) on small sparse datasets.
Table 7. Query time (ms), index size and construction time (ms) on small dense datasets.
Table 8. Query time (ms), index size and construction time (ms) on large sparse datasets.
Table 9. Query time (ms), index size and construction time (ms) on large dense datasets.

Query Time

For the sparse graphs in Tables 6 and 8, the query time of the Min-Forest approach is on average about \( 10^{-4} \) ms, for small as well as large sparse graphs. In contrast, the query time of current state-of-the-art reachability approaches is about 10 ms for sparse graphs.

For the dense graphs in Tables 7 and 9, we divide the datasets into two groups. The first group consists of go, yago and go-uniprot, and the second group consists of arxiv, citeseer, pubmed, citeseerx and cit-patents. To obtain better query performance, we use two query methods: the original query method described in Algorithm 3 for the first group, and an improved query method that restricts query access for the second group. We observe that the query time of the first group is also on average about \( 10^{-4} \) ms, and that of the second group is less than 5.3 ms.

Overall, the query time of our Min-Forest approach can be several orders of magnitude faster than that of the other algorithms.

Index Size

We label each vertex in Min-Forest with the interval label (X, Y), which denotes the tree ID and branch ID of the vertex. We also assign connectivity labels to each vertex in Min-Forest, namely SVS and NAV. Therefore, the index size of the Min-Forest approach is \( |V| + |E_d| \), where V is the set of all vertices in Min-Forest and \( E_d \) is the set of edges deleted from the original edge set E when constructing Min-Forest.

Note that \( |E_d| \) is positively correlated with |E|, and we know \( |E_d| < |E| \), so the index size is:

$$ \text{Index Size} = |V| + \alpha |E|, \quad 0 < \alpha < 1 $$
(2)

We observe that the index size is positively correlated with |V| and |E|, with coefficients 1 and \( \alpha \), respectively. Therefore, the index size of the Min-Forest approach is not large and scales well.

Construction Time

We note that construction time increases with the density of the datasets. For small sparse datasets, the construction time of Min-Forest is in almost all cases less than that of PT, but almost 11 times higher than that of Grail. For small dense datasets, however, our construction time is always 23 times lower than that of PT and always lower than that of Grail. For large datasets, the construction time of Min-Forest is almost 2.1 times lower than that of Grail and 2.4 times higher than that of BFL+, which can keep about 75 percent of its pruning power even on the densest datasets. However, Min-Forest is several orders of magnitude faster than BFL+ in query time.

Overall, the construction time of our Min-Forest approach is acceptable and scales well on dense graphs and large graphs.

7.3 Scalability

In this section, we study the scalability for sparse graphs and dense graphs from the aspects of query time, according to the experiments results shown in Tables 2, 3, 4, 5, 6, 7, 8 and 9. We show the experiment results of query time for four kinds of datasets in Fig. 8.

Fig. 8. Query time in four kinds of datasets

Query time

For the sparse datasets, we observe from Fig. 8(a) and (c) that query time does not increase as the number of vertices increases; instead, it decreases rapidly at first and then becomes stable. The stable query time is about \( 10^{-4} \) ms for both small and large sparse datasets, whose numbers of vertices range from 5,605 to 25,037,600. This shows that the scalability of the Min-Forest approach is very good for sparse datasets.

For the dense datasets, query time does not always grow linearly with the number of vertices. For small dense datasets, query time decreases quickly at first and then grows slowly and linearly with the number of vertices. For large dense datasets, except for the outlier citeseerx, query time shows a downward trend as the number of vertices grows. Overall, the scalability of Min-Forest for dense datasets is still good.

In summary, our experimental results on query time indicate that the Min-Forest approach is scalable, especially on sparse datasets, where query time is always about \( 10^{-4} \) ms for datasets whose numbers of vertices range from 5,605 to 25,037,600.

8 Conclusion

In this paper, we propose the Min-Forest approach to answer reachability queries on large graphs. We present a 4-tuple labeling scheme to construct the index of the original DAG, with two components of interval labels for vertices in the same tree and two components of connecting labels for vertices in different trees. We design the algorithms of our Min-Forest approach in Scala and implement them on the Spark cloud platform, which also helps to speed up reachability queries on large graphs. Our experimental results on four kinds of real datasets demonstrate that the Min-Forest approach has the fastest query time and a comparable index construction time compared with state-of-the-art approaches, including Grail, Path-Tree and BFL+. Furthermore, the query time and index construction time of our approach are linear for both sparse graphs and dense graphs, and it performs quite well when graphs are large and dense, so it is scalable and applicable to large-scale datasets. In the future, we plan to apply the Min-Forest approach to dynamic large graphs, and we will further study the reachability problem in query reasoning with the Min-Forest approach.