SimRank*: effective and scalable pairwise similarity search based on graph topology
Abstract
Given a graph, how can we quantify similarity between two nodes in an effective and scalable way? SimRank is an attractive measure of pairwise similarity based on graph topologies. Its underpinning philosophy that “two nodes are similar if they are pointed to (have incoming edges) by similar nodes” can be regarded as an aggregation of similarities based on incoming paths. Despite its popularity in various applications (e.g., web search and social networks), SimRank has an undesirable trait, i.e., “zero-similarity”: it accommodates only the paths of equal length from a common “center” node, whereas a large portion of other paths are fully ignored. In this paper, we propose an effective and scalable similarity model, SimRank*, to remedy this problem. (1) We first provide a sufficient and necessary condition of the “zero-similarity” problem that exists in Jeh and Widom’s SimRank model, Li et al.’s SimRank model, Random Walk with Restart (RWR), and ASCOS++. (2) We next present our treatment, SimRank*, which can resolve this issue while inheriting the merit of the simple SimRank philosophy. (3) We reduce the series form of SimRank* to a closed form, which looks simpler than SimRank but which enriches semantics without suffering from increased computational overhead. This leads to an iterative form of SimRank*, which requires O(Knm) time and \(O(n^2)\) memory for computing all \(n^2\) pairs of similarities on a graph of n nodes and m edges for K iterations. (4) To improve the computational time of SimRank* further, we leverage a novel clustering strategy via edge concentration. Due to its NP-hardness, we devise an efficient heuristic to speed up all-pairs SimRank* computation to \(O(Kn{\tilde{m}})\) time, where \({\tilde{m}}\) is generally much smaller than m.
(5) To scale SimRank* on billion-edge graphs, we propose two memory-efficient single-source algorithms, i.e., ss-gSR* for geometric SimRank*, and ss-eSR* for exponential SimRank*, which can retrieve similarities between all n nodes and a given query on an as-needed basis. This significantly reduces the \(O(n^2)\) memory of all-pairs search to either \(O(Kn + {\tilde{m}})\) for geometric SimRank*, or \(O(n + {\tilde{m}})\) for exponential SimRank*, without any loss of accuracy, where \({\tilde{m}} \ll n^2\). (6) We also compare SimRank* with another remedy of SimRank that adds self-loops on each node and demonstrate that SimRank* is more effective. (7) Using real and synthetic datasets, we empirically verify the richer semantics of SimRank*, and validate its high computational efficiency and scalability on large graphs with billions of edges.
Keywords
Similarity search · Link analysis · Graph topology · SimRank measure
1 Introduction
Recently, SimRank [12] has received growing interest as a widely accepted measure of pairwise similarity. The triumph of SimRank is largely due to its succinct yet elegant idea that “two nodes are assessed as similar if they are pointed to by similar nodes”, together with the base case that “each node is most similar to itself”. SimRank was first proposed by Jeh and Widom [12], and has gained tremendous popularity in many vibrant communities, e.g., collaborative filtering [1], social network analysis [37], and k-nearest neighbor search [17]. Since then, there have also been some studies [10, 11, 19, 33] focusing on Li et al.’s SimRank model [19], a variant of Jeh and Widom’s model. The recent studies [16, 34] show the difference between these two SimRank models: in Jeh and Widom’s model [12], the SimRank similarity of each node with itself is always 1, whereas in Li et al.’s model [19] there is no such restriction. However, due to their self-referentiality, both SimRank models suffer from high computational overhead.
While significant efforts have been devoted to optimizing the computation of both SimRank models [9, 10, 11, 16, 19, 24, 26, 27, 32, 33], the semantic issues of SimRank have attracted little attention. We observe that both SimRank models have an undesirable property (we call it “zero-similarity”): the SimRank score s(i, j) only accommodates the paths of equal length from a common “source” node to both i and j, whereas other paths for node-pair (i, j) are fully ignored by SimRank, as shown in Example 1.
Example 1
Consider a citation network \({{{\mathcal {G}}}}\) in Fig. 1, where each node is a paper, and each edge is a citation. Given damping factor \(C=0.6\), query node f, and the number of iterations \(K=20\), we assess all SimRank similarities \(\{s(\star ,f)\}\) w.r.t. query f in \({{\mathcal {G}}}\), using both Jeh and Widom’s model [12] and Li et al.’s model [19], whose results are shown in columns JSR and LSR, respectively. We notice that, regardless of which SimRank model is used, many node-pairs in \({\mathcal {G}}\) have zero similarities when they have no incoming paths of equal length from a common “source” node. For instance, \(s(e,f)=0\) since the in-link “source” a is not in the center of the in-link path connecting e and f (see Fig. 1). This means that when we recursively compute the pairwise in-neighborhood similarities of two nodes, there is no likelihood for this recursion to reach the base case (i.e., a common in-link “source”) that a node is maximally similar to itself. Similarly, \(s(a,f)=0\) since a has no in-neighbors, not to mention that there is no common in-link “source” with equal distance to both a and f. In contrast, \(s(c,f)>0\) since there is a common in-link “source” b in the center of an in-link path connecting c and f (see Fig. 1). \(\square \)
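To make the “zero-similarity” behavior concrete, the following minimal sketch (our own illustration, not the paper's figure) runs Jeh and Widom's iterative SimRank on a hypothetical four-node graph with edges 0→1, 0→2, 1→3. Pair (1, 2) has the common in-link “source” 0 in the center of a symmetric path, whereas pair (2, 3) is joined only by the asymmetric in-link path 2 ← 0 → 1 → 3:

```python
import numpy as np

# Hypothetical toy digraph: edges 0->1, 0->2, 1->3.
n = 4
in_nbrs = {0: [], 1: [0], 2: [0], 3: [1]}
C = 0.6

S = np.eye(n)  # base case: s_0(a, a) = 1, s_0(a, b) = 0 otherwise
for _ in range(20):
    S_new = np.eye(n)
    for a in range(n):
        for b in range(n):
            if a == b:
                continue
            Ia, Ib = in_nbrs[a], in_nbrs[b]
            if not Ia or not Ib:
                S_new[a, b] = 0.0  # rule (i): empty in-neighborhood
            else:
                # rule (ii): average similarity over in-neighbor pairs
                S_new[a, b] = C * sum(S[x, y] for x in Ia for y in Ib) \
                              / (len(Ia) * len(Ib))
    S = S_new

print(S[1, 2])  # 0.6: symmetric in-link path 1 <- 0 -> 2
print(S[2, 3])  # 0.0: only an asymmetric in-link path exists
```

Pair (2, 3) stays at exactly zero no matter how many iterations are run, mirroring \(s(e,f)=0\) in Example 1.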
It is important to notice that the “zero-similarity” issue refers not only to the problem that SimRank may produce “complete zero scores” (i.e., the “completely dissimilar” issue), but also to the problem that SimRank neglects the contributions of a large class of in-link paths whose “source” node is not in the center (even when the resulting similarity scores are not zero), due to the “zero contributions” of such paths to SimRank scores (i.e., the “partially missing” issue). Indeed, as demonstrated by our experiments in Fig. 6b, both issues of “zero-similarity” commonly exist in real graphs; e.g., on CitH, \(\sim \,97.9\%\) of node-pairs have “zero-SimRank” issues, among which \(\sim \,19.2\%\) are evaluated to be “completely dissimilar”, and \(\sim \,78.7\%\) (though their SimRank scores are \(\ne 0\)) to be “partially missing” the contributions of many in-link paths. These issues have adversely affected assessment quality, which highlights the need to enhance the existing SimRank model.
A pioneering piece of work by Zhao et al. [36] proposes the rudiments of a novel approach to refining the SimRank model. Observing that SimRank may incur some unwanted “zero similarities”, they suggested P-Rank, an extension of SimRank, which takes both in- and out-links into consideration for similarity assessment, as opposed to SimRank, which merely considers in-links. Although P-Rank, to some degree, might reduce “zero-similarity” occurrences in practice, we argue that such a “zero-similarity” issue arises, not because of a biased overlook of SimRank against out-links, but because of the blemish in the SimRank philosophy that may miss the contribution of a certain kind of paths (whose in-link “source” is not in the center). In other words, P-Rank cannot, in essence, resolve the “zero-similarity” issue of SimRank. For instance, nodes a and f are similar in the context of P-Rank, as shown in column PR of Fig. 1, since there is an out-link “source” d in the center of an outgoing path connecting a and f. However, the P-Rank similarity of (e, f) is still zero, since (1) i is not in the center of the outgoing path connecting e and f, and (2) there are no other outgoing paths between pair (e, f).
Given: a graph \({\mathcal {G}}\), and a query node q in \({\mathcal {G}}\).
Retrieve: all the similarities \(\{s(\star ,q)\}\) w.r.t. query q according to our proposed similarity measure.
1.1 Main contributions

We first provide a sufficient and necessary condition of the “zero-similarity” problem for the existing similarity models, e.g., Jeh and Widom’s SimRank [12], Li et al.’s SimRank [19], Random Walk with Restart (RWR) [28], and ASCOS++ [7] (Sect. 3).

We propose SimRank*, a semantically enhanced version of SimRank, and explain its semantic richness. Our model provides a way of traversing more incoming paths that are largely ignored by SimRank, and thus enables counterintuitive “zero-SimRank” nodes to be similar while inheriting the beauty of the SimRank philosophy (Sect. 4).

We convert the series form of SimRank* to a closed form, which looks more succinct yet has richer semantics than SimRank, without suffering from increased computational cost. This leads to an iterative model for computing all-pairs SimRank* in O(Knm) time and \(O(n^2)\) memory on a graph of n nodes and m edges for K iterations (Sect. 5).

To speed up SimRank* computation further, as the existing technique [24] of partial sums memoization for SimRank optimization no longer applies, we leverage a novel clustering approach via edge concentration. Due to its NP-hardness, an efficient algorithm is devised to speed up all-pairs SimRank* computation to \(O(K n {\tilde{m}})\) time, where \({\tilde{m}}\) is the number of edges in our compressed graph, which is generally much smaller than m (Sect. 6).

To scale SimRank* over billion-edge graphs, we also propose two memory-efficient single-source algorithms for SimRank*, i.e., ss-gSR* for geometric SimRank*, and ss-eSR* for exponential SimRank*, which require \(O(K^2 {\tilde{m}})\) time and \(O(K {\tilde{m}})\) time, respectively, to compute similarities between all n nodes and a given query on an as-needed basis. This significantly reduces the \(O(n^2)\) memory of all-pairs search to either \(O(Kn + {\tilde{m}})\) for geometric SimRank*, or \(O(n + {\tilde{m}})\) for exponential SimRank*, without any compromise of accuracy, where \({\tilde{m}} \ll n^2\) (Sect. 7).

We also compare SimRank* with an alternative remedy for SimRank that adds self-loops on each node, and demonstrate that SimRank* is more effective (Sect. 8).

We evaluate the performance of SimRank* on real and synthetic datasets. Empirical results show that (i) SimRank* achieves richer semantics than existing measures (e.g., SimRank, P-Rank, and RWR); (ii) our optimization techniques for SimRank* are consistently several times faster than the baselines; (iii) SimRank* is scalable on large graphs with billions of edges, without any compromise of accuracy; and (iv) we analyze the impacts of the query size and the number of iterations on the time and memory performance of SimRank* over large-scale graphs (Sect. 9).

In Sects. 3.2 and 3.5, we provide a sufficient and necessary condition of the “zero-similarity” problem for Jeh and Widom’s SimRank model [12] and ASCOS++ (an RWR-like model that appeared recently) [7]. In contrast, the prior work [31] only focused on Li et al.’s SimRank model [19]. However, recent studies [16, 34] have pointed out that these two SimRank models are different. Thus, it is imperative to investigate whether a similar “zero-similarity” problem exists in Jeh and Widom’s SimRank model. Moreover, in Sect. 3.3, we add Corollary 2 to show that the positions of node-pairs with “zero-similarity” issues in the two SimRank models are exactly the same.

In Sect. 7, we propose two memory-efficient SimRank* single-source algorithms, ss-gSR* and ss-eSR*, that support on-demand computation of similarities between all n nodes and a given query in \(O(K^2{\tilde{m}})\) time and \(O(K{\tilde{m}})\) time, respectively. These algorithms also significantly reduce the space of all-pairs SimRank* from \(O(n^2)\) to \(O(Kn+{\tilde{m}})\) for geometric SimRank* search, and to \(O(n+{\tilde{m}})\) for exponential SimRank* search, respectively, without any sacrifice of accuracy. We also provide the complexity bounds and correctness proofs of our memory-efficient algorithms. These extensions make the earlier version of the SimRank* model in [31] highly scalable to large graphs with billions of edges.

In Sect. 8, we compare SimRank* with an alternative remedy of SimRank that adds self-loops on each node. Our analysis demonstrates that SimRank* is more effective than this straightforward treatment, since SimRank* does not repeatedly count incoming paths of different lengths when assessing pairwise similarity.

In Sects. 9.2.2 and 9.2.3–9.2.5, we conduct additional experiments on a variety of large-scale datasets, including (i) qualitative case studies of the rich semantics of SimRank* for single-source queries on real labeled datasets (DBLP and CitH); (ii) the high scalability and low computational cost, in terms of time and space, of our memory-efficient SimRank* algorithms over billion-edge graphs; (iii) the exactness of ss-gSR* and ss-eSR* as compared with the previous algorithms proposed in [31]; and (iv) the impacts of the size of the query set Q and the number of iterations K on the time and memory of ss-gSR* and ss-eSR* on large-scale datasets.

In Sect. 10, we update related work by incorporating the new research that has appeared recently.
2 Preliminaries
Symbols and description
Symbols  Description 

\({\mathcal {G}}\)  Directed graph 
\(\mathcal {\tilde{G}}\)  Induced bipartite graph from graph \({\mathcal {G}}\) 
\(\mathcal {\hat{G}}\)  Compressed graph of \(\mathcal {\tilde{G}}\) via edge concentration 
n  Number of nodes in graph \({\mathcal {G}}\) 
m  Number of edges in graph \({\mathcal {G}}\) 
\({\tilde{m}}\)  Number of edges in compressed graph \(\mathcal {\hat{G}}\) 
C  Damping factor (\(0<C<1\)) 
K  Number of iterations 
q  Query node in graph \({\mathcal {G}}\) 
\({\mathbf {e}}_q\)  \(n \times 1\) unit vector with a 1 in the qth entry and 0s elsewhere 
\({\mathbf {Q}}\)  Backward transition matrix 
\({\mathbf {S}}\)  SimRank matrix 
\({\hat{\mathbf {S}}}/{\hat{\mathbf {S}}}'\)  Geometric/exponential SimRank* matrix 
\({\mathbf {I}}_n\)  \(n \times n\) identity matrix 
\({{\mathbf {X}}}^\mathrm{T}\)  Transpose of matrix \({\mathbf {X}}\) 
\({[{\mathbf {X}}]}_{i,\star }\)  ith row of matrix \({\mathbf {X}}\) 
\({[{\mathbf {X}}]}_{\star ,j}\)  jth column of matrix \({\mathbf {X}}\) 
\({[{\mathbf {X}}]}_{i,j}\)  (i, j)th entry of matrix \({\mathbf {X}}\) 
2.1 Jeh and Widom’s SimRank model
Let \(\mathcal {G}=(\mathcal {V},\mathcal {E})\) be a given graph with a set of nodes, \(\mathcal {V}\), and a set of edges, \(\mathcal {E}\). We denote by \(\mathcal {I}(a)\) the set of all the in-neighbors of a, and by \(|\mathcal {I}(a)|\) the cardinality of \(\mathcal {I}(a)\).
 (i)
\(s(a,b)=0\), if \({\mathcal {I}}(a)=\varnothing \text { or } {\mathcal {I}}(b)=\varnothing \);
 (ii) otherwise, \(s(a,b)=\frac{C}{|{\mathcal {I}}(a)|\,|{\mathcal {I}}(b)|}\sum _{x \in {\mathcal {I}}(a)}\sum _{y \in {\mathcal {I}}(b)} s(x,y)\), where \(C \in (0,1)\) is a damping factor.
 (i)
Start with \(s_0 (a,a)=1\) and \(s_0(a,b)=0\) if \(a \ne b\).
 (ii) For \(k=0,1,2,\ldots \), iterate as indicated below:
 (a)
\(s_{k+1}( a,b )=0\), if \({\mathcal {I}}\left( a \right) =\varnothing \text { or } {\mathcal {I}}\left( b \right) =\varnothing \);
 (b) otherwise, \(s_{k+1}(a,b)=\frac{C}{|{\mathcal {I}}(a)|\,|{\mathcal {I}}(b)|}\sum _{x \in {\mathcal {I}}(a)}\sum _{y \in {\mathcal {I}}(b)} s_k(x,y)\).
2.2 Li et al.’s SimRank model
Accordingly, Eq. (4) can be readily rewritten into the following component form:
(i) \(s_L(a,b)=0\), if \({\mathcal {I}}(a)=\varnothing \text { or } {\mathcal {I}}(b)=\varnothing \);
3 “Zero-similarity” problem
In this section, we provide a sufficient and necessary condition of the “zero-similarity” problem for Jeh and Widom’s SimRank [12], Li et al.’s SimRank [19], RWR [28], and ASCOS++ [7].
Before illustrating the existence of the “zero-similarity” problem, let us first introduce the following notions.
Definition 1
(In-link path) For a node-pair (i, j), an in-link path \(\rho \) of (i, j) is a path of the form \(i \leftarrow \circ \leftarrow \cdots \leftarrow s \rightarrow \cdots \rightarrow \circ \rightarrow j\), where the “source” node s has \(l_1\) edges leading to i and \(l_2\) edges leading to j; its length is \(\textsf {len}(\rho )=l_1+l_2\).
Definition 2
An in-link path \(\rho \) is called symmetric if \(l_1=l_2\); \(\rho \) is called unidirectional if \(l_1=0\) or \(l_2=0\).
Example 2
Consider the graph \({\mathcal {G}}\) in Fig. 1: the path \(\rho :\, h \leftarrow \circ \leftarrow a \rightarrow d\) is an in-link path of node-pair (h, d), where a is the in-link “source” and \(\textsf {len}(\rho )=2+1=3\). \(\rho \) is an asymmetric in-link path as \(l_1=2\ne 1=l_2\). \(\square \)
Clearly, an in-link path \(\rho \) is symmetric if and only if there exists an in-link “source” in the center of \(\rho \). Thus, any in-link path of odd length (i.e., \(l_1+l_2\) is odd) is asymmetric, since there do not exist two integers \(l_1\) and \(l_2\) s.t. \(l_1=l_2\) and \(l_1+l_2\) is odd.
3.1 Counting in-link paths
To count the number of in-link paths in a graph \({\mathcal {G}}\), we extend the power property of an adjacency matrix.
Traditionally, let \({\mathbf {A}}\) be the adjacency matrix of \({\mathcal {G}}\). There is an interesting property of \({\mathbf {A}}^l\) [5]: The entry \({[{{\mathbf {A}}}^{l}]}_{i,j}\) counts the number of paths of length l from node i to j. This property can be generalized as follows:
Lemma 1
The proof of Lemma 1 is completed by induction on l, which is similar to the proof of the power property of the adjacency matrix [5, Page 51].
Intuitively, Lemma 1 counts the number of generic paths whose edges are not always in the same direction. For instance, consider a path \(\rho : i \rightarrow \circ \leftarrow \circ \rightarrow \circ \rightarrow \circ \leftarrow j\), where \(\circ \) denotes an arbitrary node in a graph. We can construct \(\bar{{\mathbf {A}}}={\mathbf {A}} {\mathbf {A}}^\mathrm{T} {\mathbf {A}} {\mathbf {A}} {\mathbf {A}}^\mathrm{T}\), where \({\mathbf {A}}\) (resp. \({\mathbf {A}}^\mathrm{T}\)) is at positions 1, 3, 4 (resp. 2, 5), corresponding to the positions of \(\rightarrow \) (resp. \(\leftarrow \)) in \(\rho \). Then, \({[\bar{{\mathbf {A}}}]}_{i,j}\) tallies the number of paths of the form \(\rho \) in the graph; if no such path exists, \({[\bar{{\mathbf {A}}}]}_{i,j}=0\). As another example, \({[{({\mathbf {A}}^\mathrm{T})}^{l_1} \cdot {\mathbf {A}}^{l_2}]}_{i,j}\) tallies the number of in-link paths of length \(l_1+l_2\) for node-pair (i, j). As a special case, when all \({\mathbf {A}}_k \ (\forall k\in [1,l])\) are set to \({\mathbf {A}}\), Lemma 1 reduces to the conventional power property of an adjacency matrix.
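As a quick numerical sanity check of Lemma 1 and Corollary 1, the sketch below counts in-link paths with matrix powers on a hypothetical toy graph (edges 0→1, 0→2, 1→3; not from the paper):

```python
import numpy as np

# Adjacency matrix of the toy digraph 0->1, 0->2, 1->3.
A = np.zeros((4, 4), dtype=int)
A[0, 1] = A[0, 2] = A[1, 3] = 1

# Lemma 1: [(A^T)^{l1} . A^{l2}]_{i,j} counts in-link paths of node-pair
# (i, j) with l1 edges from the "source" to i and l2 edges to j.
asym = (np.linalg.matrix_power(A.T, 1) @ np.linalg.matrix_power(A, 2))[2, 3]
print(asym)  # 1: the single asymmetric in-link path 2 <- 0 -> 1 -> 3

# Corollary 1: sum_k [(A^T)^k . A^k]_{i,j} counts symmetric in-link paths;
# truncating at k = 4 suffices here, as longer paths cannot exist in this graph.
sym = sum((np.linalg.matrix_power(A.T, k) @ np.linalg.matrix_power(A, k))[2, 3]
          for k in range(1, 5))
print(sym)  # 0: node-pair (2, 3) has no symmetric in-link path
```

Since node-pair (2, 3) has no symmetric in-link path, Theorem 1 below predicts a zero SimRank score for it, even though an asymmetric in-link path exists.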
An immediate consequence of Lemma 1 is as follows:
Corollary 1
\(\sum _{k=1}^{\infty }{[{({\mathbf {A}}^\mathrm{T})}^{k} \cdot {\mathbf {A}}^{k}]}_{i,j}\) counts the number of all symmetric in-link paths of node-pair (i, j) in \({\mathcal {G}}\).
3.2 “Zero-similarity” issue in Jeh and Widom’s model
Based on the notion of symmetric in-link paths, we next show why the “zero-similarity” issue exists in Jeh and Widom’s model. Specifically, we show the following theorem:
Theorem 1
For any two distinct nodes a and b in \({\mathcal {G}}\), Jeh and Widom’s SimRank score s(a, b) will ignore all the contributions of asymmetric in-link paths for (a, b). As an extreme case, \(s(a,b)=0\) if and only if there are no symmetric in-link paths in \({\mathcal {G}}\) for node-pair (a, b).
Proof
(Sufficiency) We first prove that
“\(\exists \) a symmetric in-link path for \((i,j) \ \Rightarrow \ [{\mathbf {S}}]_{i,j} \ne 0\)”.
(Necessity) We next prove that
“\([{\mathbf {S}}]_{i,j} \ne 0 \ \Rightarrow \ \exists \) a symmetric in-link path for (i, j)”.
If \([{\mathbf {S}}]_{i,j} \ne 0\), then it follows from Eq. (6) that there exists a term (\(l_0\)th term) s.t. \([{\mathbf {Q}}^{l_0} \cdot {\mathbf {D}} \cdot {({\mathbf {Q}}^\mathrm{T})}^{l_0}]_{i,j} > 0\).
3.3 “Zero-similarity” issue in Li et al.’s SimRank
Apart from Jeh and Widom’s SimRank model, the “zero-similarity” issue also exists in Li et al.’s SimRank model, as indicated by the following theorem:
Theorem 2
For any two distinct nodes a and b in \({\mathcal {G}}\), Li et al.’s SimRank similarity \(s_L(a,b)\) will also ignore the contributions of asymmetric in-link paths for (a, b). As an extreme case, \(s_L(a,b)=0\) whenever there are no symmetric in-link paths in \({\mathcal {G}}\) for node-pair (a, b).
(Please see “Appendix A.1” for the proof of Theorem 2).
Theorems 1 and 2 provide a sufficient and necessary condition of the “zero-similarity” problem for both SimRank models. More interestingly, the proofs of these theorems further imply that the node-pairs with the “zero-similarity” problem in both models are the same:
Corollary 2
Proof
3.4 “Zero-similarity” issue in RWR
Other non-SimRank-family models, e.g., RWR [28], also exhibit a SimRank-like “zero-similarity” problem.
Theorem 3
For any two distinct nodes a and b in \({\mathcal {G}}\), the Random Walk with Restart (RWR) similarity \(s_R(a,b)\) will ignore the contributions of non-unidirectional paths from b to a. As an extreme case, \(s_R(a,b)=0\) whenever there are no unidirectional paths in \({\mathcal {G}}\) from b to a.
(Please see “Appendix A.2” for the proof of Theorem 3).
For example, in Fig. 1, nodes e and f are assessed as dissimilar by RWR, as the path connecting them contains both directions “\(\leftarrow \)” and “\(\rightarrow \)”. However, \(s_{R}(c,f) \ne 0\) since there is a path from f to c with the single direction “\(\leftarrow \)” (see Fig. 1). Hence, both RWR and SimRank may encounter “zero-similarity” issues.
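The unidirectional-path behavior of RWR can be checked numerically. The sketch below assumes RWR's power-series form \(\mathbf{S}_R = (1-C)\sum _{l} C^l {\mathbf {Q}}^l\) over the backward transition matrix Q (consistent with the proof sketch of Theorem 3) and uses a hypothetical toy graph:

```python
import numpy as np

# Toy digraph 0->1, 0->2, 1->3; Q is the backward (in-neighbor) transition
# matrix: Q[i, j] = A[j, i] / in-degree(i), with zero rows for source nodes.
A = np.zeros((4, 4))
A[0, 1] = A[0, 2] = A[1, 3] = 1
indeg = A.sum(axis=0)
Q = np.divide(A.T, indeg[:, None], out=np.zeros_like(A),
              where=indeg[:, None] > 0)

C = 0.6
# Truncated power series (assumed RWR form: S_R = (1-C) * sum_l C^l Q^l).
S_R = sum((1 - C) * C**l * np.linalg.matrix_power(Q, l) for l in range(60))

print(S_R[3, 0] > 0)   # True: unidirectional path 3 <- 1 <- 0
print(S_R[2, 3] == 0)  # True: (2, 3) has only a non-unidirectional path
```

Pair (2, 3), joined only by the non-unidirectional path 2 ← 0 → 1 → 3, scores exactly zero under this series, mirroring \(s_R(e,f)=0\) in Fig. 1.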
3.5 “Zero-similarity” issue in ASCOS++
Theorem 4
For any two distinct nodes a and b in \({\mathcal {G}}\), the ASCOS++ similarity \(s_A(a,b)\) defined by Eq. (8) will ignore the contributions of non-unidirectional paths from b to a. As an extreme case, \(s_A(a,b)=0\) whenever there are no unidirectional paths in \({\mathcal {G}}\) from b to a.
Proof
(Sufficiency) We first prove that
“\(\exists \) a unidirectional path from j to \(i \Rightarrow [\mathbf {S_A}]_{i,j} \ne 0\)”.
(Necessity) We next prove that
“\([\mathbf {S_A}]_{i,j} \ne 0 \Rightarrow \exists \) a unidirectional path from j to i”.
If \([\mathbf {S_A}]_{i,j} \ne 0\), then it follows from Eq. (10) that there exists a term (the \(l_0\)th term) s.t. \([{\mathbf {Q}}^{l_0}]_{i,j} \cdot [{\mathbf {D}}]_{j,j} > 0\). Since \({{[{\mathbf {D}}]}_{j,j}} \ge 1-C >0 \ \ (\forall j)\), it follows that \([{\mathbf {Q}}^{l_0}]_{i,j} > 0\).
The proofs of Theorems 3 and 4 imply that the node-pairs with “zero similarities” in the RWR and ASCOS++ models are the same. Indeed, by comparing their power series forms, we notice that RWR and ASCOS++ are almost the same in tallying unidirectional paths, except for the weight assigned to each path.
In practice, the probability that the extreme cases of the “zero-similarity” problem for RWR and ASCOS++ stated in Theorems 3 and 4 occur is often small. This is especially evident for undirected graphs because, for an undirected graph, if the RWR (resp. ASCOS++) similarity \(s(a,b)=0\), then there is no connectivity between nodes a and b, i.e., nodes a and b belong to two different components of the graph. Therefore, the importance of Theorems 3 and 4 is to highlight that, in the non-extreme cases where the RWR (resp. ASCOS++) similarity between two nodes is not zero, there are still a number of non-unidirectional paths that are ignored by the RWR (resp. ASCOS++) model.
4 SimRank*: a remedy for SimRank
4.1 Geometric series form of SimRank*
As SimRank (resp. RWR) misses asymmetric (resp. non-unidirectional) in-link paths when assessing the similarity s(i, j) of node-pair (i, j), our treatment aims to compensate s(i, j) for such a loss by accommodating asymmetric (resp. non-unidirectional) in-link paths. Precisely, we add the terms \({[{\mathbf {Q}}^{l_1} \cdot {({\mathbf {Q}}^\mathrm{T})}^{l_2}]}_{i,j}\), \(\forall l_1 \ne l_2\) (resp. \(\forall l_1 \ne 0\)), with appropriate weights, into the series form of SimRank (resp. RWR) as follows:
Definition 3
Although RWR and ASCOS++ capture part of the in-link paths of odd length that are missed by SimRank, they ignore two types of non-unidirectional in-link paths that can be captured by SimRank*: (a) symmetric ones that are accommodated by SimRank; and (b) asymmetric ones whose “source” node is not at the right end.
For instance, given node-pair (i, j), Fig. 2 compares all the in-link paths of length \(l \in [1,4]\) that are captured by Jeh and Widom’s SimRank [12], Li et al.’s SimRank [19], RWR [28], ASCOS++ [7], and SimRank*. It can be noticed from the ‘SimRank*’ column that only a small number of in-link paths are captured by SimRank (dark gray cells) and RWR/ASCOS++ (light gray cells).
4.2 Weighted factors of two types
We next describe two kinds of weighted factors adopted by SimRank* model Eq. (11): (a) length weights \(\{C^l\}_{l=0}^{\infty }\); and (b) symmetry weights \(\{{l \atopwithdelims ()\alpha }\}_{\alpha =0}^{l}\).
Intuitively, the length weight \(C^l \ (0<C<1)\) measures the importance of in-link paths of different lengths. Similar to the original SimRank (Eq. (13)), the outer summation over l in SimRank* (Eq. (12)) adds up the contributions of in-link paths of different lengths l. The length weight \(C^l\) aims to reduce the contributions of long in-link paths relative to short ones, as \(\{C^l\}_{l\in [0,\infty )}\) is a decreasing sequence w.r.t. length l.
The symmetry weight uses the binomial \({l \atopwithdelims ()\alpha } \ \ (0 \le \alpha \le l)\) to assess the importance of in-link paths of a fixed length l, with \(\alpha \) edges in one direction (from the “source” node to one end of the path) and \(l-\alpha \) edges in the opposite direction, where \(\alpha \) reflects the symmetry of in-link paths of length l. As depicted in Fig. 2, when \(\alpha = 0 \text { or } l\), in-link paths become completely asymmetric, reducing to a single direction; when \(\alpha \) is close to \(\lfloor l/2 \rfloor \), the “source” node is near the center of in-link paths, which are almost symmetric.
To show that the use of the binomial \({l \atopwithdelims ()\alpha }\) is reasonable, in “Appendix B” we will answer the following questions:
(c) Why are symmetric in-link paths considered more important than less symmetric ones, for a given length?
The use of \((1-C)\) and \(\frac{1}{2^l}\) in Eq. (12) is to normalize \({[\hat{{\mathbf {S}}}]}_{i,j}\) and \({[\hat{{\mathbf {T}}}_{l}]}_{i,j}\), respectively, into [0, 1]. Specifically, we can verify that \({\Vert {\mathbf {Q}}^{l_1} \cdot {({\mathbf {Q}}^\mathrm{T})}^{l_2} \Vert }_{\max } \le 1 \ \ (\forall l_1, \forall l_2)\). Thus, (i) \({ \Vert \sum _{\alpha =0}^{l} {l \atopwithdelims ()\alpha } \cdot {{\mathbf {Q}}^{\alpha } \cdot {({\mathbf {Q}}^\mathrm{T})}^{l-\alpha }} \Vert }_{\max } \le \sum _{\alpha =0}^{l} {l \atopwithdelims ()\alpha } = 2^l\), which implies \({\Vert {\hat{{\mathbf {T}}}}_{l}\Vert }_{\max } \le 1\). (ii) As \({\Vert \sum _{l=0}^{\infty } C^l \cdot {\hat{{\mathbf {T}}}}_{l}\Vert }_{\max } \le \sum _{l=0}^{\infty } C^l = \frac{1}{1-C}\), it follows that \({\Vert \hat{{\mathbf {S}}}\Vert }_{\max } \le 1\).
By combining these two kinds of weights, the contribution of any in-link path for a given node-pair can be easily assessed. For example, in Fig. 1, the in-link path \(\rho \) of node-pair (h, d) from Example 2 has a contribution rate of \((1-0.8) \cdot {0.8}^3 \cdot \frac{1}{2^3} \cdot {3 \atopwithdelims ()2} = 0.0384\) (with \(C=0.8\)). As opposed to SimRank, which uses only the length weight \(C^l\), SimRank* considers both \(C^l\) and the symmetry weight \({l \atopwithdelims ()\alpha }\).
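The combined weight of a single in-link path is straightforward to compute; a minimal sketch (the helper name `contribution` is ours):

```python
from math import comb

def contribution(C, l, alpha):
    """Weight (1-C) * C^l / 2^l * binom(l, alpha) of one in-link path of
    length l whose "source" splits it into legs of alpha and l - alpha
    edges, following the geometric SimRank* weighting of Eq. (12)."""
    return (1 - C) * C**l / 2**l * comb(l, alpha)

# The (h, d) path of Example 2: l = 3, alpha = 2, C = 0.8.
print(round(contribution(0.8, 3, 2), 4))  # 0.0384
```

Note that, summed over all \(\alpha \in [0,l]\), the symmetry weights cancel the \(\frac{1}{2^l}\) factor, so the total weight assigned to paths of length l is exactly \((1-C)\,C^l\), matching the length-weight normalization above.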
4.3 Some extensions of SimRank* beyond counting in-link paths only
It is also worth mentioning that our proposed SimRank* model, which determines similarity by counting in-link paths, can be combined with other structural-context similarity models (e.g., RoleSim [14], which considers automorphism-based similarity relationships) to produce a comprehensive similarity measure.
4.4 Convergence of SimRank*
As SimRank* in Eq. (11) is an infinite geometric series, it is imperative to study the convergence of this series.
Theorem 5
(Please see “Appendix A.3” for the proof of Theorem 5).
4.5 Exponential series form of SimRank* variant
The choice of the length weight \(\tfrac{C^l}{l!}\) for the exponential SimRank* (Eq. (16)) plays a key role in accelerating convergence. As suggested by the proof of Theorem 5, the bound \(C^{k+1}\) in Eq. (15) (resp. \(\tfrac{C^{k+1}}{(k+1)!}\) in Eq. (17)) is actually derived from our choice of length weight \(C^l\) (resp. \(\tfrac{C^l}{l!}\)) for the geometric (resp. exponential) SimRank*. Thus, there might exist other length weights that speed up the convergence of SimRank*, as there is no sanctity in the earlier choices of length weight. That is, apart from \(C^l\) and \(\tfrac{C^l}{l!}\), any other sequence, e.g., \(\tfrac{C^l}{l}\), that is decreasingly monotonic w.r.t. length l can be regarded as another possible candidate for the length weight, since the efficacy of the length weight is to reduce the contributions of long in-link paths relative to short ones. The reasons why we select \(C^l\) and \(\tfrac{C^l}{l!}\), instead of others, are twofold: (i) The normalization factor of the length weight should have a simple form, e.g., \(\sum _{l=0}^{\infty } \frac{C^l}{l!}=e^C\). (ii) Once selected, the length weight should enable the series form of SimRank* to be simplified into a very elegant form; e.g., using \(\frac{C^l}{l!}\) allows Eq. (16) to be simplified, as will be seen in Eq. (20), into a neat closed form. In contrast, \(\tfrac{C^l}{l}\) is not a preferred length weight, as its series version may not be simplified into a neat recursive (or closed) form, even though its normalization factor \(\sum _{l=1}^{\infty } \frac{C^l}{l}= \ln {\tfrac{1}{1-C}}\) has a simple form.
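The three normalization factors mentioned above can be verified numerically; a small sketch:

```python
from math import exp, log, factorial

C = 0.6
geo = sum(C**l for l in range(200))                 # geometric weights C^l
expo = sum(C**l / factorial(l) for l in range(25))  # exponential weights C^l/l!
harm = sum(C**l / l for l in range(1, 2000))        # candidate weights C^l/l

print(abs(geo - 1 / (1 - C)) < 1e-9)        # True: sum C^l = 1/(1-C)
print(abs(expo - exp(C)) < 1e-9)            # True: sum C^l/l! = e^C
print(abs(harm - log(1 / (1 - C))) < 1e-9)  # True: sum_{l>=1} C^l/l = ln(1/(1-C))
```

All three candidate weights have simple normalization factors; the decisive difference, as noted above, is whether the resulting series admits a neat recursive or closed form.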
5 Recursive and closed forms of SimRank*
A brute-force way of computing the first k partial sums of Eq. (11) requires \(O(k\cdot l^2 \cdot n^3)\) time, involving \(l^2\) matrix multiplications in the inner summation for each fixed l in the outer summation, which is much more expensive than SimRank. In this section, we propose two simple representations of SimRank* (i.e., the recursive form of geometric SimRank*, and the closed form of exponential SimRank*).
5.1 Recursive form of geometric SimRank*
We first show the recursive form of the geometric SimRank* series in Eq. (11).
Theorem 6
(Please see “Appendix A.4” for the proof of Theorem 6).
Theorem 6 provides a time-efficient iterative algorithm to compute the SimRank* matrix \(\hat{{\mathbf {S}}}_k\), with its accuracy guaranteed by Theorem 5. The complexity of this iterative method is O(Knm) time and \(O(n^2)\) memory. Please refer to “Appendix C” for a detailed analysis.
The \(O(n^2)\) memory of Eq. (18) is the main barrier that hinders the scalability of SimRank* on large graphs. In Sect. 7, we will provide a scalable algorithm, named ss-gSR*, that substantially reduces the memory from quadratic to linear, without any loss of accuracy.
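As an illustration, the sketch below iterates a recursion of the form \(\hat{{\mathbf {S}}}_{k+1} = \frac{C}{2}({\mathbf {Q}}\hat{{\mathbf {S}}}_k + \hat{{\mathbf {S}}}_k{\mathbf {Q}}^{\mathrm T}) + (1-C){\mathbf {I}}_n\) (our reading of the recursive form behind Theorem 6, consistent with the series in Eq. (11)) on a hypothetical toy graph; note that a pair suffering from the “zero-SimRank” issue now receives a positive score:

```python
import numpy as np

# Toy digraph 0->1, 0->2, 1->3; backward transition matrix
# Q[i, j] = A[j, i] / in-degree(i), zero rows for source nodes.
A = np.zeros((4, 4))
A[0, 1] = A[0, 2] = A[1, 3] = 1
indeg = A.sum(axis=0)
Q = np.divide(A.T, indeg[:, None], out=np.zeros_like(A),
              where=indeg[:, None] > 0)

C, K = 0.6, 20
S = (1 - C) * np.eye(4)
for _ in range(K):
    # Assumed recursion of geometric SimRank* (cf. Theorem 6).
    S = (C / 2) * (Q @ S + S @ Q.T) + (1 - C) * np.eye(4)

print(S[2, 3] > 0)  # True: the asymmetric path 2 <- 0 -> 1 -> 3 contributes
```

Each iteration costs two sparse matrix products, in line with the O(Knm) time bound; the \(O(n^2)\) cost comes from materializing S itself, which is what ss-gSR* in Sect. 7 avoids.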
5.2 Closed form of exponential SimRank*
Having converted the series form of geometric SimRank* into a simple recursive form, we next present the closed form of exponential SimRank* in Eq. (16).
Theorem 7
(Please see “Appendix A.5” for the proof of Theorem 7).
The utility of Theorem 7 will be shown in Sect. 6.4 for optimizing the exponential SimRank* computation.
6 Accelerate SimRank* computation
6.1 Fine-grained memoization
6.2 Induced bigraph
Definition 4
An induced bipartite graph (bigraph) from a given graph \({{\mathcal {G}}}=({{\mathcal {V}}},{{\mathcal {E}}})\) is a bipartite graph \(\tilde{{\mathcal {G}}}=({{\mathcal {T}}}\cup {{\mathcal {B}}},\tilde{{{\mathcal {E}}}})\) such that its two disjoint node sets are \({{\mathcal {T}}}=\{x\in {{\mathcal {V}}} \mid {{\mathcal {O}}}(x) \ne \varnothing \}\) and \({{\mathcal {B}}}=\{x \in {{\mathcal {V}}} \mid {{\mathcal {I}}}(x) \ne \varnothing \}\), and for each \(u \in {{\mathcal {T}}}\) and \(v \in {{\mathcal {B}}}\), \((u,v) \in \tilde{{{\mathcal {E}}}}\) if and only if there is an edge from u to v in \({\mathcal {G}}\).
6.3 Biclique compression via edge concentration
Based on the induced bigraph \(\tilde{{\mathcal {G}}}\), we next introduce the notion of bipartite cliques (bicliques).
Definition 5
Given an induced bigraph \(\tilde{{\mathcal {G}}}=({{\mathcal {T}}}\cup {{\mathcal {B}}},\tilde{{{\mathcal {E}}}})\), a pair of two disjoint subsets \({{\mathcal {X}}} \subseteq {{\mathcal {T}}}\) and \({{\mathcal {Y}}} \subseteq {{\mathcal {B}}}\) is called a biclique if \((x,y) \in \tilde{{{\mathcal {E}}}}\) for all \(x \in {{\mathcal {X}}}\) and \(y \in {{\mathcal {Y}}}\).
Intuitively, a biclique \(({{\mathcal {X}}}, {{\mathcal {Y}}})\) is a complete bipartite subgraph of \(\tilde{{\mathcal {G}}}\), which has \(|{{\mathcal {X}}}|+|{{\mathcal {Y}}}|\) nodes and \(|{{\mathcal {X}}}| \times |{{\mathcal {Y}}}|\) edges. Each biclique \(({{\mathcal {X}}}, {{\mathcal {Y}}})\) in \(\tilde{{\mathcal {G}}}\) implies that, in \({\mathcal {G}}\), all nodes \(y \in {{\mathcal {Y}}}\) have the common in-neighbor set \({{\mathcal {X}}}\). For example, there are two bicliques in Fig. 3: \((\{b,d\},\{c,g,i\})\) in dashed lines, and \((\{e,j,k\},\{h,i\})\) in dotted lines. Biclique \((\{b,d\},\{c,g,i\})\) in \(\tilde{{\mathcal {G}}}\) implies that, in \({\mathcal {G}}\), nodes c, g, i have the two in-neighbors \(\{b,d\}\) in common.
Bicliques are introduced to compress the bigraph \(\tilde{{\mathcal {G}}}\) for optimizing SimRank* computation. In “Appendix D.1”, we present the main idea of our bigraph compression techniques. We then propose an algorithm, memogSR*, for computing all-pairs SimRank* quickly by using fine-grained memoization (“Appendix D.2”). The correctness and complexity of memogSR* are shown in “Appendix D.3”; it requires \(O(K n {\tilde{m}})\) time and \(O(n^2)\) memory. A running example is given in “Appendix D.4”.
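The gist of the compression (detailed in “Appendix D.1”) is edge concentration: the \(|{{\mathcal {X}}}| \times |{{\mathcal {Y}}}|\) edges of a biclique are rerouted through a fresh virtual node, leaving only \(|{{\mathcal {X}}}|+|{{\mathcal {Y}}}|\) edges. A hedged sketch on the biclique \((\{b,d\},\{c,g,i\})\) of Fig. 3 (the helper and the virtual-node label `w1` are ours, not the paper's):

```python
def concentrate_biclique(edges, X, Y, vnode):
    """Replace the |X| x |Y| biclique edges with |X| + |Y| edges
    routed through a fresh virtual node."""
    kept = [(u, v) for u, v in edges if not (u in X and v in Y)]
    kept += [(x, vnode) for x in sorted(X)]   # X -> virtual node
    kept += [(vnode, y) for y in sorted(Y)]   # virtual node -> Y
    return kept

# Biclique ({b, d}, {c, g, i}): 2 * 3 = 6 edges shrink to 2 + 3 = 5.
edges = [(u, v) for u in "bd" for v in "cgi"]
compressed = concentrate_biclique(edges, set("bd"), set("cgi"), "w1")
```

The saving grows with the biclique size: a \(p \times q\) biclique shrinks from \(pq\) to \(p+q\) edges, which is what makes \({\tilde{m}}\) generally much smaller than m.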
To scale memogSR* on large graphs, in Sect. 7 we will propose a memoryefficient algorithm, ssgSR*.
6.4 Exponential SimRank* optimization
The aforementioned optimization methods for (geometric) SimRank* computation can be readily extended to the exponential SimRank* variant. Please refer to “Appendix D.5” for the optimization techniques generalized to speed up exponential SimRank* search.
7 Linearize SimRank* memory
7.1 Single-source geometric SimRank*
To efficiently compute a single column of the SimRank* matrix \(\hat{{\mathbf {S}}}_k\), we first focus on geometric SimRank* search, and propose an efficient method that requires only linear memory, while minimizing duplicate computations without any loss of accuracy.
Theorem 8
Before proving Theorem 8, we first give an example to illustrate the application of this theorem to compute singlesource SimRank* efficiently.
Example 3
Recall the graph in Fig. 1. Given query node e, the decay factor \(C=0.6\), and the number of iterations \(k=3\), the single-source geometric SimRank* \({{[{{{\hat{\mathbf {S}}}}_{k}}]}_{*,e}}\) can be computed via Theorem 8 as follows:
i  j  \({\mathbf {m}}_{i}^{(j)}\)
1  0  \({\mathbf {m}}_{1}^{(0)} =\tfrac{C}{2} {{{\mathbf {Q}}}^{T}} \overbrace{{\mathbf {m}}_{0}^{(0)}}^{={\mathbf {0}}}+\overbrace{{\mathbf {m}}_{0}^{(1)}}^{={{{\mathbf {e}}}_{e}}} ={{[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]}^{T}}\)
2  0  \({\mathbf {m}}_{2}^{(0)} =\tfrac{C}{2} {{{\mathbf {Q}}}^{T}} {\mathbf {m}}_{1}^{(0)}+\overbrace{{\mathbf {m}}_{1}^{(1)}}^{={{{\mathbf {e}}}_{e}}} =[.3, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]^{T}\)
2  1  \({\mathbf {m}}_{2}^{(1)} =\tfrac{C}{2} {{{\mathbf {Q}}}^{T}} \underbrace{{\mathbf {m}}_{1}^{(1)}}_{={\mathbf {0}}}+{{\mathbf {m}}_{1}^{(0)}} ={{[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]}^{T}}\)
3  0  \({\mathbf {m}}_{3}^{(0)} =\tfrac{C}{2} {{{\mathbf {Q}}}^{T}} {\mathbf {m}}_{2}^{(0)}+\overbrace{{\mathbf {m}}_{2}^{(1)}}^{={{{\mathbf {e}}}_{e}}} =[.3, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]^{T}\)
3  1  \({\mathbf {m}}_{3}^{(1)} =\tfrac{C}{2} {{{\mathbf {Q}}}^{T}} {\mathbf {m}}_{2}^{(1)}+{\mathbf {m}}_{2}^{(0)} =[.6, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]^{T}\)
3  2  \({\mathbf {m}}_{3}^{(2)} =\tfrac{C}{2} {{{\mathbf {Q}}}^{T}} \underbrace{{\mathbf {m}}_{2}^{(2)}}_{={\mathbf {0}}}+{{\mathbf {m}}_{2}^{(1)}} =[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]^{T}\)
4  0  \({\mathbf {m}}_{4}^{(0)} =\tfrac{C}{2} {{{\mathbf {Q}}}^{T}} {\mathbf {m}}_{3}^{(0)}+\overbrace{{\mathbf {m}}_{3}^{(1)}}^{={{{\mathbf {e}}}_{e}}} =[.3, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]^{T}\)
4  1  \({\mathbf {m}}_{4}^{(1)} =\tfrac{C}{2} {{{\mathbf {Q}}}^{T}} {\mathbf {m}}_{3}^{(1)}+{{\mathbf {m}}_{3}^{(0)}} =[.6, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]^{T}\)
4  2  \({\mathbf {m}}_{4}^{(2)} =\tfrac{C}{2} {{{\mathbf {Q}}}^{T}} {\mathbf {m}}_{3}^{(2)}+{\mathbf {m}}_{3}^{(1)} =[.9, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]^{T}\)
4  3  \({\mathbf {m}}_{4}^{(3)} =\tfrac{C}{2} {{{\mathbf {Q}}}^{T}} \underbrace{{\mathbf {m}}_{3}^{(3)}}_{={\mathbf {0}}}+{\mathbf {m}}_{3}^{(2)} ={{[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]}^{T}}\)
i  \({\mathbf {u}}_{i}\)
0  \({\mathbf {u}}_{0} = {\mathbf {m}}_{4}^{(3)} ={{[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]}^{T}}\)
1  \({{{\mathbf {u}}}_{1}}={\mathbf {m}}_{4}^{(2)}+\tfrac{C}{2} {\mathbf {Q}} {{{\mathbf {u}}}_{0}} =[.9, 0, 0, 0, 1, 0, 0, .3, .1]^{T}\)
2  \({{{\mathbf {u}}}_{2}}={\mathbf {m}}_{4}^{(1)}+\tfrac{C}{2} {\mathbf {Q}} {{{\mathbf {u}}}_{1}} =[.6, .27, 0, .135, 1.27, 0, 0, .3, .1]^{T}\)
3  \({{{\mathbf {u}}}_{3}}={\mathbf {m}}_{4}^{(0)}+\tfrac{C}{2} {\mathbf {Q}} {{{\mathbf {u}}}_{2}} =[.3, .18, .061, .09, 1.18, .081, .061, .381, .168]^{T}\)
 1.
It provides a memory-efficient iterative model that allows SimRank* retrieval to scale well on large graphs, without compromising accuracy and with no need to store all \(n^2\) pairs of SimRank* scores \(\hat{{\mathbf {S}}}_{k-1}\) from the previous iteration of Eq. (23). As opposed to the \(O(n^2)\) memory of the conventional iterative model in Eq. (23), our new iterative model in Theorem 8 requires only \(O(kn+m)\) memory, which is dominated by the matrix-vector multiplications \({\mathbf {Q}}\cdot {{{\mathbf {u}}}_{i-1}}\) in Eq. (27) and \({{{\mathbf {Q}}}^{T}}\cdot {\mathbf {m}}_{i-1}^{(j)}\) in Eq. (28).
 2.
Compared with the straightforward right-to-left association in Eq. (25) that requires \(\frac{k(k+1)(k+2)}{3}\) matrix-vector multiplications, our novel iterative model in Theorem 8 evaluates \(\{{\mathbf {m}}_{i}^{(j)}\}\) in a Pascal's triangle fashion, which effectively eliminates duplicate multiplications and significantly reduces the number of matrix-vector multiplications to$$\begin{aligned} \underbrace{\Big ( \sum _{i=1}^{k} 1 \Big )}_{\text {Eq.}(27)} + \underbrace{\Big ( \sum _{i=1}^{k+1} \sum _{j=0}^{i-1} 1 \Big )}_{\text {Eq.}(28)} = k + \frac{(k+1)(k+2)}{2} \end{aligned}$$
 3.
Theorem 8 implies an efficient parallel algorithm for all-pairs SimRank* search. Indeed, the computation of all-pairs SimRank* \(\hat{{\mathbf {S}}}\) can be broken into n columns \([\hat{{\mathbf {S}}}]_{*,q} \ (q=1,\ldots ,n)\) of single-source SimRank* search, where each column can be computed concurrently on different processors via Theorem 8. In contrast, the previous iterative model in Eq. (23) for computing all-pairs SimRank* is not parallelizable.
 4.
The iterative model in Theorem 8 is query-dependent, providing an on-demand retrieval strategy for SimRank*. That is, SimRank* scores can be retrieved on an as-needed basis via Theorem 8. In comparison, the previous model Eq. (23) always outputs all-pairs scores, even if only a fraction of the scores is requested.
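The saving in remark 2 is easy to check numerically; a small sketch, with the two counting formulas taken verbatim from the remark above:

```python
def naive_mults(k):
    # Right-to-left association of Eq. (25)
    return k * (k + 1) * (k + 2) // 3

def pascal_mults(k):
    # Theorem 8: k multiplications for Eq. (27),
    # plus sum_{i=1}^{k+1} i = (k+1)(k+2)/2 for Eq. (28)
    return k + (k + 1) * (k + 2) // 2

savings = {k: (naive_mults(k), pascal_mults(k)) for k in (5, 10, 20)}
print(savings[20])   # (3080, 251)
```

For the default \(K=20\) used in our experiments, this is a reduction from 3080 to 251 matrix-vector multiplications, i.e., more than an order of magnitude.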
Based on Theorem 8, we provide a memory-efficient algorithm, ssgSR*, for single-source geometric SimRank*. We analyze its complexity and correctness below:
Theorem 9
(Complexity) Given a graph \({\mathcal {G}}\), a query q, and the number of iterations K, ssgSR* requires \(O(Kn+m)\) memory and \(O(K^2m)\) time to iteratively compute the single-source geometric SimRank* scores \({{[{{\hat{\mathbf {S}}}_{K}}]}_{\star ,q}}\).
(Please see “Appendix A.6” for the proof of Theorem 9).
It is worth mentioning that our edge concentration approach in Sect. 6 can be integrated with ssgSR* to enable a further speedup of single-source SimRank* retrieval. We just need to replace \({\mathbf {Q}}\) of \({\mathcal {G}}\) with the new backward transition matrix of the compressed graph of \({\mathcal {G}}\) in Algorithm 1. The total time of ssgSR* then becomes \(O(K^2{\tilde{m}} + {\tilde{m}} \log (2n))\), where \({\tilde{m}}\) is the number of edges in the compressed graph, and \(O({\tilde{m}} \log (2n))\) is the time required for graph compression.
Correctness To show that the results \(\hat{s}_{K}(\star ,q)\) output by ssgSR* are correct, we first present the following two lemmas, which will be used to prove Theorem 8.
Lemma 2
(Please see “Appendix A.7” for the proof of Lemma 2).
Lemma 3
Proof
Leveraging Lemmas 2 and 3, we will complete the proof of Theorem 8.
Proof of Theorem 8
7.2 Single-source exponential SimRank*
Having derived the single-source geometric SimRank* model in Sect. 7.1, we next focus on single-source exponential SimRank* assessment. To efficiently evaluate a single column of the exponential SimRank* matrix \(\hat{{\mathbf {S}}}_k'\) in Eq. (16), we propose the following iterative model, whose CPU time and memory are not only linear w.r.t. the number of edges in the graph, but also smaller than those of single-source geometric SimRank*.
Theorem 10
Proof
We first prove that \({{{\mathbf {u}}}_{k}}=\sum _{j=0}^{k}{\tfrac{{{C}^{j}}}{{{2}^{j}}\cdot j!}{{({{{\mathbf {Q}}}^{T}})}^{j}}{{{\mathbf {e}}}_{q}}}\).
Theorem 10 implies an efficient algorithm, sseSR*, for single-source exponential SimRank* search. Its computational complexity is analyzed as follows:
Theorem 11
(Complexity) Given a graph \({\mathcal {G}}\), a query node q, and the total number of iterations K, sseSR* requires \(O(m+n)\) memory and O(Km) time to iteratively compute the single-source exponential SimRank* scores \({{[{{\hat{\mathbf {S}}'}_{K}}]}_{\star ,q}}\).
Proof
The memory of sseSR* is \(O(m+n)\), which is dominated by (i) O(m) for storing sparse \({\mathbf {Q}}\) (line 1), and (ii) O(n) for storing vectors \({\mathbf {u}}\) (line 4) and \({\mathbf {v}}\) (line 7).
The time complexity of sseSR* is O(Km), which is dominated by the matrix-vector multiplications \(({\mathbf {Q}}^{T} \cdot {\mathbf {u}})\) (line 4) and \(({\mathbf {Q}} \cdot {\mathbf {v}})\) (line 7) over K iterations. \(\square \)
Compared with the \(O(K^2m)\) time of the single-source geometric SimRank* algorithm ssgSR*, the single-source exponential SimRank* algorithm further reduces the time from \(O(K^2m)\) to O(Km), linear in K. Moreover, the \(O(Kn+m)\) memory of ssgSR* is improved to \(O(n+m)\), independent of K. This is because, for single-source exponential SimRank* computation, the iterative process in Eq. (38) relies only on the resulting \({\mathbf {u}}_K\). Thus, there is no need for O(Kn) memory to store the K vectors \(\{ {\mathbf {u}}_1, \ldots , {\mathbf {u}}_K \}\) in Eq. (39).
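A compact numpy sketch of the two Horner-style passes in sseSR*. The forward pass evaluates the series \({\mathbf {u}}_{k}=\sum _{j=0}^{k}{\tfrac{{{C}^{j}}}{{{2}^{j}}\, j!}{{({{{\mathbf {Q}}}^{T}})}^{j}}{{{\mathbf {e}}}_{q}}}\) stated in the proof of Theorem 10, using the per-step coefficients \(\tfrac{C}{2(k-i+1)}\); that the backward pass applies the same coefficient schedule with \({\mathbf {Q}}\) is our reading of Example 4 below, and any final normalization in the actual Theorem 10 is omitted here:

```python
import numpy as np
from math import factorial

def sse_star_sketch(Q, q, C=0.6, K=3):
    """Horner passes: u_K = sum_{j<=K} C^j/(2^j j!) (Q^T)^j e_q,
    then v_K = sum_{i<=K} C^i/(2^i i!) Q^i u_K, one vector per pass."""
    n = Q.shape[0]
    e_q = np.zeros(n)
    e_q[q] = 1.0
    u = e_q.copy()
    for i in range(1, K + 1):
        u = C / (2 * (K - i + 1)) * (Q.T @ u) + e_q
    v = u.copy()
    for i in range(1, K + 1):
        v = C / (2 * (K - i + 1)) * (Q @ v) + u
    return v

# Cross-check against the direct double series on a random stochastic matrix.
rng = np.random.default_rng(0)
Q = rng.random((5, 5))
Q /= Q.sum(axis=1, keepdims=True)
C, K, q = 0.6, 3, 2
e_q = np.zeros(5); e_q[q] = 1.0
coef = lambda l: C**l / (2**l * factorial(l))
direct = sum(coef(i) * coef(j)
             * np.linalg.matrix_power(Q, i) @ np.linalg.matrix_power(Q.T, j) @ e_q
             for i in range(K + 1) for j in range(K + 1))
v = sse_star_sketch(Q, q, C, K)
```

Only one vector is carried through each loop, which is exactly why sseSR* needs \(O(n+m)\) memory rather than \(O(Kn+m)\).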
Example 4
Recall the graph in Fig. 1. Given query node b, the decay factor \(C=0.6\), and the number of iterations \(k=3\), the single-source exponential SimRank* \({{[{{\hat{\mathbf {S}}'}_{k}}]}_{*,b}}\) can be computed via Theorem 10 as follows:
i  \({\mathbf {u}}_{i}\)
0  \({\mathbf {u}}_{0} = {\mathbf {e}}_{b} ={{[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]}^{T}}\)
1  \({{{\mathbf {u}}}_{1}}=\tfrac{C}{2 \cdot 3} {\mathbf {Q}}^{T} {{{\mathbf {u}}}_{0}} + {\mathbf {e}}_{b} =[.1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]^{T}\)
2  \({{{\mathbf {u}}}_{2}}=\tfrac{C}{2 \cdot 2} {\mathbf {Q}}^{T} {{{\mathbf {u}}}_{1}} + {\mathbf {e}}_{b} =[.15, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]^{T}\)
3  \({{{\mathbf {u}}}_{3}}=\tfrac{C}{2 \cdot 1} {\mathbf {Q}}^{T} {{{\mathbf {u}}}_{2}} + {\mathbf {e}}_{b} =[.3, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]^{T}\)
i  \({\mathbf {v}}_{i}\)
0  \({\mathbf {v}}_{0} = {{{\mathbf {u}}}_{3}} =[.3, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]^{T}\)
1  \({{{\mathbf {v}}}_{1}}=\tfrac{C}{2 \cdot 3} {\mathbf {Q}} {{{\mathbf {v}}}_{0}} + {{{\mathbf {u}}}_{3}} =[.3, 1.03, .05, .015, .03, .1, .05, 0, .0333]^{T}\)
2  \({{{\mathbf {v}}}_{2}}=\tfrac{C}{2 \cdot 2} {\mathbf {Q}} {{{\mathbf {v}}}_{1}} + {{{\mathbf {u}}}_{3}} =[.3, 1.05, .078, .03, .045, .155, .078, .005, .054]^{T}\)
3  \({{{\mathbf {v}}}_{3}}=\tfrac{C}{2 \cdot 1} {\mathbf {Q}} {{{\mathbf {v}}}_{2}} + {{{\mathbf {u}}}_{3}} =[.3, 1.09, .161, .068, .09, .314, .161, .014, .112]^{T}\)
8 Comparison with “adding self-loops”
Apart from SimRank*, another simple remedy for the “zero-similarity” issue of SimRank is to add a self-loop on each node of the graph. In this section, we show that SimRank* is more efficacious than this “adding self-loops” SimRank method, in that the “adding self-loops” method overcounts the similarities of many node-pairs.
To elaborate on this, we consider the first two consecutive steps of the two recursive models, respectively.
Description of real datasets (\({\bar{d}} = |{{\mathcal {E}}}|/|{{\mathcal {V}}}|\))
Datasets | \(|{{\mathcal {V}}}|\) | \(|{{\mathcal {E}}}|\) | \({\bar{d}}\)
Small
  D06 (2006–2008) | 13,752 | 72,522 | 5.3
  D09 (2009–2011) | 13,124 | 73,572 | 5.6
  D02 (2002–2007) | 15,241 | 86,525 | 5.7
  CitH (cit-HepPh) | 34,546 | 421,578 | 12.2
Med
  Email (Email-EuAll) | 265,214 | 420,045 | 1.6
  WebG (web-Google) | 916,428 | 5,105,039 | 5.6
Large
  WikT (Wiki-Talk) | 2,394,385 | 5,021,410 | 2.1
  SocL (soc-LiveJournal) | 4,847,571 | 68,993,773 | 14.2
  UK05 (uk-2005) | 39,459,925 | 936,364,282 | 23.7
  IT04 (it-2004) | 41,291,594 | 1,150,725,436 | 27.9
9 Experimental evaluation
9.1 Experimental settings
Datasets We adopt both real and synthetic datasets.
(1) Real datasets The size of each dataset is shown in Table 2. A detailed description is given in “Appendix E.1”.
(2) Synthetic datasets To produce synthetic networks, we use the graph generator GTgraph,^{3} which takes as input the number of nodes \(|{{\mathcal {V}}}|\) and the number of edges \(|{{\mathcal {E}}}|\).
Compared algorithms We compare the following algorithms: (a) ssgSR* and sseSR*, our single-source geometric and exponential SimRank* algorithms in Sect. 7; (b) SLSR [27] and KMSR [16], the state-of-the-art single-source SimRank algorithms based on indexing strategies and random sampling; (c) RWR [15], a fast random walk with restart algorithm measuring node proximities w.r.t. a given query; (d) memogSR* and memoeSR*, the geometric and exponential SimRank* algorithms via partial sums memoization in Sect. 6; (e) psumSR [24] and psumPR [36], the SimRank and P-Rank algorithms via partial sums memoization; and (f) mtxSR [19], a matrix-based method that computes Li et al.'s SimRank using singular value decomposition.
Test queries For similarity ranking evaluation, we randomly select 500 query nodes from each dataset, based on the following: For each graph, we first sort all nodes in order of their importance (measured by PageRank) into 5 groups, and then randomly choose 100 nodes from each group, aiming to guarantee that the selected nodes can systematically cover a broad range of all possible queries.
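The selection procedure can be sketched as follows (a toy illustration with hypothetical PageRank scores; the paper uses 5 groups of 100 queries each):

```python
import random

def stratified_queries(pagerank, per_group=100, groups=5, seed=7):
    """Sort nodes by PageRank, split them into `groups` equal strata,
    then draw `per_group` nodes uniformly from each stratum."""
    ranked = sorted(pagerank, key=pagerank.get, reverse=True)
    size = len(ranked) // groups
    rng = random.Random(seed)
    queries = []
    for g in range(groups):
        stratum = ranked[g * size:(g + 1) * size]
        queries += rng.sample(stratum, min(per_group, len(stratum)))
    return queries

pr = {v: 1.0 / (v + 1) for v in range(1000)}   # toy PageRank scores
qs = stratified_queries(pr, per_group=10)
```

Stratifying by PageRank ensures the query set mixes high-, medium-, and low-importance nodes instead of being dominated by any one regime.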
Parameters We set the following default parameters: (a) \(C=0.6\), the decay factor, as previously used in [12]. (b) For all the iterative models, we set the number of iterations to \(K=20\) by default, to guarantee a high accuracy of \(C^{K+1}={0.6}^{21}\le 0.0000219\). (c) For KMSR, we follow the suggestion in [16] and set the three parameters \(T=11\), \(R=100\), \(L=3\), to ensure a worst-case error \(\epsilon =C^{T}/(1-C)\approx 0.01\). (d) For SLSR, we follow Theorem 1 in [27] and set \(\epsilon _d = 0.003\) and \(\theta = 0.0001\), which guarantees a maximum error \(\epsilon < 0.01\). We also set \(\delta _d = 1/n^2\), which ensures that the preprocessing of SLSR succeeds with probability at least \(1-1/n\).
Effectiveness metrics To evaluate semantics and similarity ranking, we adopt three metrics: Kendall's \(\tau \), Spearman's \(\rho \), and Normalized Discounted Cumulative Gain (NDCG). Please refer to “Appendix E.2” for their definitions.
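For reference, the three metrics can be sketched with their standard textbook definitions (tie handling is omitted; this is not the code used in our experiments):

```python
import math
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau-a over two score lists for the same items."""
    n = len(x)
    conc = disc = 0
    for i, j in combinations(range(n), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0: conc += 1
        elif s < 0: disc += 1
    return (conc - disc) / (n * (n - 1) / 2)

def spearman_rho(x, y):
    """Spearman's rho via the rank-difference formula (no ties)."""
    def ranks(s):
        order = sorted(range(len(s)), key=lambda i: s[i])
        r = [0] * len(s)
        for pos, i in enumerate(order):
            r[i] = pos
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

def ndcg(rel, k=None):
    """NDCG: DCG of the returned order over DCG of the ideal order."""
    rel = list(rel)[:k]
    dcg = sum(r / math.log2(i + 2) for i, r in enumerate(rel))
    idcg = sum(r / math.log2(i + 2)
               for i, r in enumerate(sorted(rel, reverse=True)))
    return dcg / idcg if idcg > 0 else 0.0
```

All three lie in fixed ranges (\(\tau , \rho \in [-1,1]\), NDCG \(\in [0,1]\)), with 1 meaning the evaluated ranking agrees perfectly with the ground truth.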
Ground truth (a) To assess similar authors on DBLP, we invite 20 experts from the database and data mining areas to verify the correctness of the retrieved co-authorships. The experts have strong research profiles of international stature, with sustained records of significant and world-leading publications in databases/data mining venues, e.g., ACM TODS, VLDBJ, IEEE TKDE, ACM TKDD, SIGMOD, SIGKDD, PVLDB, ICDE. We selected outstanding researchers with combined expertise in data science from all over the world (e.g., USA, Europe, Australia, Asia) according to their Google Scholar profiles, with the minimum thresholds of \(\#\text {citations}>1000\) and \(\text {H-index}>20\). The selected scholars are therefore familiar with their research domains and can well evaluate relevant authors in data science through experience. They also refer to the “Co-Author Path” in Microsoft Academic Search^{4} to see the “separations” between any two collaborators.
(b) To evaluate similar papers on CitH, we hire 15 researchers from the physics department to judge the “true” relevance of the retrieved co-citations. The scholars have proven track records of excellence in High Energy Physics research over the recent five years, with publications in, e.g., Physical Review D, Nuclear Physics B, Journal of High Energy Physics, and Physics Letters B. We selected these scholars based on their productivity (number of high-quality publications) and research impact (number of citations) in the Web of Science Core Collection (Thomson Reuters). Their consistent publications in high-impact journals indicate that the selected researchers have the knowledge of High Energy Physics needed to evaluate the similarities of papers in the e-print arXiv. Their assessment may hinge on paper contents, H-index, and the number of citations on www.ScienceDirect.com. For all the ground truth, the results are determined by a majority vote of the feedback.
9.2 Experimental results
9.2.1 Quantitative results on semantic effectiveness
We first run the algorithms on directed CitH and undirected DBLP. By randomly selecting 500 queries, we evaluate the average semantic accuracy of each algorithm via three metrics (Kendall, Spearman, NDCG). Figure 6a depicts the quantitative results. (1) On CitH, memogSR* and memoeSR* have higher accuracy (e.g., Spearman's \(\rho \approx 0.91\)) than psumSR (0.29), RWR (0.12), and psumPR (0.42) on average, i.e., the semantics of SimRank* is effective. This is because SimRank* considers all in-link paths for assessing similarity, whereas SimRank and RWR count only limited symmetric and unidirectional paths, respectively. (2) On DBLP, the accuracy of RWR is the same as that of memogSR* and memoeSR*, due to the undirectedness of DBLP. This tells us that, regardless of edge directions, both SimRank* and RWR count paths of all lengths, as opposed to SimRank, which considers only the even-length paths. Likewise, psumPR and psumSR produce the same results on the undirected DBLP. (3) On each dataset, memogSR* and memoeSR* keep almost the same accuracy, implying that the relative order of geometric SimRank* is well maintained by its exponential counterpart.
9.2.2 Qualitative case studies on semantics
Figure 7 presents the case study of qualitative results for top-k similarity ranking w.r.t. queries Q1–Q4 on DBLP D09 (2009–2011). For example, Q1 finds the most similar co-authors of Prof. Jennifer Widom, using different similarity measures, e.g., SimRank* (memogSR*, memoeSR*), Random Walk with Restart (RWR), SimRank without added self-loops (psumSR), and SimRank with added self-loops (selfloop). We observe that (1) RWR and memogSR* produce the same results on DBLP, which is due to the undirectedness of DBLP, as expected. (2) memogSR* and memoeSR* also yield the same results for our top-k similarity search, showing the relative ranking preservation of memoeSR* w.r.t. memogSR*. (3) Some close co-authors of Prof. Jennifer Widom that are undesirably ranked lower by psumSR (as shown in the brackets of the gray cells) can be well identified by memogSR*, memoeSR*, and RWR. For instance, “Anish Das Sarma”, who has many collaborative publications with Prof. Jennifer Widom during 2009–2011, is ranked among the top 5 by memogSR* and memoeSR*, but is not top ranked by psumSR and selfloop. This is because SimRank ignores the contributions of asymmetric in-link paths (i.e., the paths of odd lengths in undirected graphs), whereas SimRank* considers the contributions of all in-link paths. As a result, many close co-authors (with high degrees of one-edge connection) of Prof. Jennifer Widom (e.g., Dr. Anish Das Sarma) are missed by SimRank, but can be found effectively by SimRank*. The disparity of rankings in the gray cells shows that memogSR*, memoeSR*, and RWR can perfectly resolve the “zero-similarity” issue of psumSR on undirected graphs. (4) selfloop is more effective than SimRank, but sometimes less effective than SimRank*. For example, in Q1, “Huacheng C. Ying” and “Qi Su” are identified by both SimRank* and selfloop, but they are ignored by SimRank. However, “Anish Das Sarma”, Prof. Jennifer Widom's student, is captured by neither SimRank nor selfloop.
“Beverly Yang” is ranked \(6^{\text {th}}\) by selfloop, despite having no collaborative publications with Prof. Jennifer Widom on DBLP (2009–2011). This is due to the overcounting problem of selfloop, which counterintuitively assigns excessive length-weight coefficients to the pair (“Beverly Yang”, “Prof. Jennifer Widom”). In some cases, selfloop achieves ranking results as good as SimRank*. For instance, in Q4, the top-4 most similar author-pairs in D09 (2009–2011) by SimRank* and selfloop are the same, both of which are more reliable than SimRank as they do not have the “zero-SimRank” issue.
9.2.3 Scalability of sseSR* and ssgSR*
To evaluate the scalability of SimRank* on large graphs, we compare the computational time and memory space of sseSR* and ssgSR* with those of other algorithms on various real datasets, with m ranging from 17 K to 1.15 G. We randomly select 20 queries, Q, from each dataset, and retrieve all the similarities \(\{s(*,q)\}_{q \in Q}\). Note that our query selection is based on node PageRank values so that Q can cover a broad range of queries. Figure 9 depicts the results for \(K=20\).
9.2.4 Varying \(|Q|\) for ssgSR* and sseSR*
To evaluate the effect of the query size \(|Q|\) on the computational efficiency of sseSR* and ssgSR*, we fix \(K=20\) and vary \(|Q|\) from 200 to 600 on D02 and CitH, comparing the computation time and memory space of ssgSR* with memogSR*, and sseSR* with memoeSR*. The results on D02 and CitH are shown in Figs. 11 and 12, respectively. Since memogSR* fails on large datasets, we vary \(|Q|\) from 10 to 200 on WebB, WikT, and SocL, and show the CPU time and memory of ssgSR* and sseSR* in Figs. 13 and 14, respectively.
From the results, we notice that (1) when \(|Q|\) grows from 200 to 600, the times of sseSR* and ssgSR* increase linearly on both D02 and CitH, whereas the times of memoeSR* and memogSR* are insensitive to \(|Q|\), remaining constant on D02 and CitH, respectively. This conforms to our expectation, as sseSR* and ssgSR* adopt novel iterative models that provide on-demand retrieval w.r.t. given queries. In contrast, memoeSR* and memogSR* are query-independent algorithms that have to assess all-pairs similarities simultaneously, even if only a fraction of the pairs of similarities is needed. (2) As \(|Q|\) increases on D02 and CitH, the memory of all the algorithms remains unaltered, insensitive to the query size. The reason is that, for each single-source query q, ssgSR* immediately releases the auxiliary vector \({\mathbf {m}}_{i-1}^{(j-1)}\) once it has been used twice for iteratively generating the Pascal's triangle pattern; after each query q, ssgSR* also releases the memory to start a new retrieval w.r.t. another single-source query \(q'\). For sseSR*, within each query q, only one auxiliary vector needs memoization after each iteration. The memory space of memoeSR* and memogSR* is always dominated by the \(O(n^2)\) storage of all-pairs similarities regardless of the query size, and thereby remains constant as \(|Q|\) varies. (3) On large datasets (e.g., WebB, WikT, SocL) in Figs. 13 and 14, when \(|Q|\) varies from 10 to 200, the time and memory of sseSR* and ssgSR* exhibit a similar tendency to those on the small datasets (D02 and CitH), indicating that sseSR* and ssgSR* scale well with both the graph size and the query size \(|Q|\).
9.2.5 Varying K for ssgSR* and sseSR*
10 Related work
10.1 Link-based similarity measures
One of the most attractive link-based similarity measures is SimRank, proposed by Jeh and Widom [12]. The recursive nature of SimRank allows two nodes to be similar even without sharing common in-neighbors, which resembles PageRank [3], which recursively assigns a score for node ranking. However, SimRank has some unsatisfactory traits. One limitation is that “the similarity of two nodes will decrease as the number of their common in-neighbors increases”. To address this problem, many excellent methods have been proposed, leading to several SimRank variant models. For example, Fogaras and Rácz [8] introduced PSimRank. They (1) incorporated Jaccard coefficients, and (2) interpreted s(a, b) as the probability that two random surfers, starting from a and b, will meet at a node. Antonellis et al. [1] proposed SimRank++, by adding an evidence weight to compensate for the cardinality of in-neighbor matching. Lin et al. [22] presented MatchSim, which refines SimRank with maximum neighborhood matching. Jin et al. [14] proposed RoleSim, which generalizes Jaccard coefficients to ensure automorphic equivalence for SimRank. Yu and McCann [34] introduced SimRank#, a high-quality SimRank-based model that extends the cosine similarity measure to aggregate pairs of multi-hop paths.
Another limitation of SimRank is the “zero-similarity” problem: “\(s(a,b)=0\) if there are no nodes having equal distance to both a and b”. A special case of this problem was observed by Zhao et al. [36, Example 1.2]. They proposed P-Rank by taking both in- and out-links into account. P-Rank can indeed reduce the number of pairs of nodes with counterintuitive zero similarities. However, if there are neither equidistant in-link paths nor equidistant out-link paths from two nodes a and b, the similarity of (a, b) is still zero. Our work differs from [36] in that (1) we show that the “zero-SimRank” problem is not caused by the ignorance of out-links in SimRank, and (2) we circumvent the “zero-similarity” issue by traversing more incoming paths of node-pairs that are neglected by the original SimRank. Recently, Chen and Giles [7] also proposed a similarity model, ASCOS++, to address the SimRank issue that “if the length of a path between two nodes is an odd number, this path makes no contribution to the SimRank score”. This issue is a special case of our “zero-similarity” issue. It differs from our work in that [7] provided a sufficient condition for \(s(a,b)=0\), whereas we give a sufficient and necessary condition for \(s(a,b)=0\). That is, “an odd-length path between two nodes a and b” given by [7] is not the only condition that leads to \(s(a,b)=0\). Another condition, “an even-length in-linked path between nodes a and b whose ‘source’ node is not at the center of the path”, also leads to \(s(a,b)=0\). Therefore, ASCOS++ only partially resolves our “zero-similarity” issue of SimRank, as discussed in Sect. 3.5.
There has also been research on link-based similarity (e.g., [4, 18, 28, 29, 30]). LinkClus [30] adopted a hierarchical structure, called SimTree, for clustering multi-type objects. Blondel et al. [4] proposed an appealing measure to quantify graph-to-graph similarity. SimFusion [29] exploited a reinforcement assumption to assess similarities of multi-type objects in a heterogeneous domain, as opposed to SimRank, which focuses solely on intra-type objects in a homogeneous domain. Tong et al. [28] suggested Random Walk with Restart (RWR) for assessing node proximities, an excellent extension of Personalized PageRank (PPR). Leicht et al. [18] extended RWR by incorporating independent and sensible coefficients. However, RWR and its variants (PPR and [18]) also imply SimRank-like “zero-similarity” issues, as discussed in Sect. 3.4. The recent work of [16, 34] has shown that Jeh and Widom's SimRank model [12] and Li et al.'s SimRank model [19] are different. In the previous conference version [31], we only proved the existence of “zero-similarity” issues in Li et al.'s SimRank model [19]. In this work, we further show that “zero-similarity” issues also exist in Jeh and Widom's SimRank model [12]. Moreover, we prove in Sect. 3.3 that the affected pairs of nodes in these two SimRank models are exactly the same.
10.2 Optimization methods for computing similarities
The computational overhead of SimRank-based similarity arises from its recursive nature. To reduce the computational complexity, a number of efficient techniques have been proposed to optimize SimRank computation, including all-pairs search, single-source search, single-pair search, and partial-pairs search.
For all-pairs search, Lizorkin et al. [24] focused on iterative SimRank computation and proposed three excellent optimization approaches (i.e., essential node-pair selection, partial sums memoization, and threshold-sieved similarities). These substantially speed up SimRank computation from \(O(Kd^2n^2)\) to O(Knm) time. Later, Yu et al. [32] used a minimum spanning tree to find a topological sort for fine-grained partial sums sharing, which improved all-pairs SimRank search further to \(O(Kd'n^2)\) time (with \(d' \le d\)). However, both methods require \(O(n^2)\) memory to output all-pairs results at each iteration, which is impractical for large-scale graphs. Li et al. [19] developed an SVD-based SimRank matrix computation model to approximate SimRank results, yielding \(O(r^4 n^2)\) time, where \(r \ (\le n)\) is the target rank of the SVD. However, it does not always speed up the computation when r is large, as required for high accuracy. In contrast, our SimRank* model is fast and memory-efficient. It scales well on billion-edge graphs while tallying even more paths than SimRank to enrich semantics.
For single-source search, Lee et al. [17] first proposed a pioneering model, TopSim, that used a Monte Carlo method to retrieve top-k SimRank pairs in \(O(d^k)\) time. To trade accuracy for speed, they also presented two approximate techniques based on truncated random walks and prioritized propagation, respectively. Later, Fujiwara et al. [10] presented SimMat, which (1) retrieves the top-k similar nodes based on a Sylvester equation, and (2) prunes unnecessary search based on the Cauchy-Schwarz inequality. Kusumoto et al. [16] introduced a “linear” recursive formula for SimRank, based on which they established a novel random-walk-based method for scalable top-k single-source similarity search. Tian and Xiao [27] designed an efficient index structure, SLING, for SimRank search that guarantees the worst-case error of each returned SimRank score. Recently, Shao et al. [25] and Jiang et al. [13] devised the TSF and READS indexing schemes, respectively, to efficiently handle top-k SimRank search over dynamic graphs. Liu et al. [23] presented ProbeSim, an index-free solution for dynamic single-source and top-k SimRank queries with provable accuracy guarantees.
There has also been other work on SimRank search. Fogaras and Rácz [9] proposed PSimRank for single-pair SimRank retrieval. Li et al. [20] developed CloudWalker, a parallel algorithm for large-scale SimRank search on Spark with ten machines. Tao et al. [26] proposed an excellent two-stage approach for the top-k SimRank-based similarity join. Zhang et al. [35] conducted comprehensive experiments comparing many existing SimRank algorithms in a unified environment. Their empirical study showed that, despite recent research efforts, the computational time and precision of known algorithms still have much room for improvement.
11 Conclusions
In this article, we have proposed SimRank*, an effective and scalable model for assessing link-based similarities. In contrast to SimRank, which considers only the contributions of symmetric in-link paths, SimRank* tallies the contributions of all in-link paths between two nodes, thus resolving the “zero-SimRank” issue for semantic richness. We have also converted the series form of SimRank* into two elegant forms: the geometric SimRank* and its exponential variant, both of which look even simpler than SimRank, yet without suffering from increased computational cost. To speed up all-pairs SimRank* search, we have devised a fine-grained memoization strategy via edge concentration, with an efficient algorithm accelerating SimRank* computation from O(Knm) to \(O(K n{\tilde{m}})\) time, where \({\tilde{m}}\) is generally much smaller than m. However, the memory of this algorithm is still \(O(n^2)\), which is not applicable to sizable graphs. To scale SimRank* to billion-edge graphs, we have proposed two memory-efficient single-source algorithms, ssgSR* for geometric SimRank* search and sseSR* for exponential SimRank* search, without any loss of accuracy. ssgSR* utilizes a Pascal's triangle pattern that requires \(O(K^2 {\tilde{m}})\) time and \(O(Kn + {\tilde{m}})\) memory to iteratively retrieve SimRank* similarities between all n nodes and a given query on an as-needed basis, whereas sseSR* employs a novel iterative model that entails only \(O(K {\tilde{m}})\) time and \(O(n + {\tilde{m}})\) memory, where \({\tilde{m}} \ll n^2\). We have also compared SimRank* with an alternative remedy for SimRank that adds a self-loop on each node, and demonstrated that SimRank* is more efficacious. Our experimental results on real and synthetic data demonstrate the richer semantics, higher computational efficiency, and scalability of SimRank* on billion-scale graphs.
Acknowledgements
The work is supported by NSFC61702560, NSFC61672235, ARC DP170101628, and DP180103096.
References
 1. Antonellis, I., Molina, H.G., Chang, C.: SimRank++: query rewriting through link analysis of the click graph. PVLDB 1(1), 408–421 (2008)
 2. Benczúr, A.A., Csalogány, K., Sarlós, T.: Link-based similarity search to fight web spam. AIRWeb, 9–16 (2006)
 3. Berkhin, P.: Survey: a survey on PageRank computing. Internet Math. 2(1), 73–120 (2005)
 4. Blondel, V.D., Gajardo, A., Heymans, M., Senellart, P., Dooren, P.V.: A measure of similarity between graph vertices: applications to synonym extraction and web searching. SIAM Rev. 46(4), 647–666 (2004)
 5. Brualdi, R., Cvetkovic, D.: A Combinatorial Approach to Matrix Theory and Its Applications. Discrete Mathematics and Its Applications. Taylor & Francis, Abingdon (2008)
 6. Buehrer, G., Chellapilla, K.: A scalable pattern mining approach to web graph compression with communities. WSDM, 95–106 (2008)
 7. Chen, H., Giles, C.L.: ASCOS++: an asymmetric similarity measure for weighted networks to address the problem of SimRank. TKDD 10(2), 15:1–15:26 (2015)
 8. Fogaras, D., Rácz, B.: Scaling link-based similarity search. WWW, 641–650 (2005)
 9. Fogaras, D., Rácz, B.: Practical algorithms and lower bounds for similarity search in massive graphs. IEEE Trans. Knowl. Data Eng. 19, 585–598 (2007)
10. Fujiwara, Y., Nakatsuji, M., Shiokawa, H., Onizuka, M.: Efficient search algorithm for SimRank. ICDE, 589–600 (2013)
11. He, G., Feng, H., Li, C., Chen, H.: Parallel SimRank computation on large graphs with iterative aggregation. KDD, 543–552 (2010)
12. Jeh, G., Widom, J.: SimRank: a measure of structural-context similarity. KDD, 538–543 (2002)
13. Jiang, M., Fu, A.W., Wong, R.C., Wang, K.: READS: a random walk approach for efficient and accurate dynamic SimRank. PVLDB 10(9), 937–948 (2017)
14. Jin, R., Lee, V.E., Hong, H.: Axiomatic ranking of network role similarity. KDD, 922–930 (2011)
15. Jung, J., Shin, K., Sael, L., Kang, U.: Random walk with restart on large graphs using block elimination. ACM Trans. Database Syst. 41(2), 12:1–12:43 (2016)
16. Kusumoto, M., Maehara, T., Kawarabayashi, K.: Scalable similarity search for SimRank. SIGMOD, 325–336 (2014)
17. Lee, P., Lakshmanan, L.V.S., Yu, J.X.: On top-\(k\) structural similarity search. ICDE, 774–785 (2012)
18. Leicht, E.A., Holme, P., Newman, M.E.J.: Vertex similarity in networks. Phys. Rev. E 73(2), 026120 (2006)
19. Li, C., Han, J., He, G., Jin, X., Sun, Y., Yu, Y., Wu, T.: Fast computation of SimRank for static and dynamic information networks. EDBT, 465–476 (2010)
20. Li, Z., Fang, Y., Liu, Q., Cheng, J., Cheng, R., Lui, J.C.S.: Walking in the cloud: parallel SimRank at scale. PVLDB 9(1), 24–35 (2015)
21. Lin, X.: On the computational complexity of edge concentration. Discrete Appl. Math. 101(1–3), 197–205 (2000)
22. Lin, Z., Lyu, M.R., King, I.: MatchSim: a novel similarity measure based on maximum neighborhood matching. Knowl. Inf. Syst. 32(1), 141–166 (2012)
23. Liu, Y., Zheng, B., He, X., Wei, Z., Xiao, X., Zheng, K., Lu, J.: ProbeSim: scalable single-source and top-\(k\) SimRank computations on dynamic graphs. PVLDB 11(1), 14–26 (2017)
24. Lizorkin, D., Velikhov, P., Grinev, M.N., Turdakov, D.: Accuracy estimate and optimization techniques for SimRank computation. PVLDB 1(1), 408–421 (2008)
25. Shao, Y., Cui, B., Chen, L., Liu, M., Xie, X.: An efficient similarity search framework for SimRank over large dynamic graphs. PVLDB 8(8), 838–849 (2015)
26. Tao, W., Yu, M., Li, G.: Efficient top-\(k\) SimRank-based similarity join. PVLDB 8(3), 317–328 (2014)
27. Tian, B., Xiao, X.: SLING: a near-optimal index structure for SimRank. SIGMOD, 1859–1874 (2016)
28. Tong, H., Faloutsos, C., Pan, J.Y.: Fast random walk with restart and its applications. ICDM, 613–622 (2006)
29. Xi, W., Fox, E.A., Fan, W., Zhang, B., Chen, Z., Yan, J., Zhuang, D.: SimFusion: measuring similarity using unified relationship matrix. SIGIR, 130–137 (2005)
30. Yin, X., Han, J., Yu, P.S.: LinkClus: efficient clustering via heterogeneous semantic links. VLDB, 427–438 (2006)
31. Yu, W., Lin, X., Zhang, W., Chang, L., Pei, J.: More is simpler: effectively and efficiently assessing node-pair similarities based on hyperlinks. PVLDB, 13–24 (2014)
32. Yu, W., Lin, X., Zhang, W., McCann, J.A.: Fast all-pairs SimRank assessment on large graphs and bipartite domains. IEEE Trans. Knowl. Data Eng. 27(7), 1810–1823 (2015)
33. Yu, W., McCann, J.A.: Efficient partial-pairs SimRank search for large networks. PVLDB 8(5), 569–580 (2015)
34. Yu, W., McCann, J.A.: High quality graph-based similarity retrieval. SIGIR, 83–92 (2015)
35. Zhang, Z., Shao, Y., Cui, B., Zhang, C.: An experimental evaluation of SimRank-based similarity search algorithms. PVLDB 10(5), 601–612 (2017)
36. Zhao, P., Han, J., Sun, Y.: P-Rank: a comprehensive structural similarity measure over information networks. CIKM, 553–562 (2009)
37. Zheng, W., Zou, L., Feng, Y., Chen, L., Zhao, D.: Efficient SimRank-based similarity join over large graphs. PVLDB 6(7), 493–504 (2013)
38. Zhou, Y., Cheng, H., Yu, J.X.: Graph clustering based on structural/attribute similarities. PVLDB 2(1), 718–729 (2009)
39. Zhu, R., Zou, Z., Li, J.: SimRank computation on uncertain graphs. ICDE, 565–576 (2016)
Copyright information
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.