
1 Introduction

1.1 Motivation

kNN graphs have been widely used in graph-based learning, since they tend to capture the structure of the manifold where the data lie. However, it has recently been noted [1] that in a standard machine learning setting (\(n\rightarrow \infty \), \(k\approx \log n\) and large d, where n is the number of samples and d is their dimension) kNN graphs result in a sparse, globally uninformative representation. In particular, a kNN-based estimation of the geodesics (for instance through shortest paths, as done in ISOMAP) diverges significantly unless proper weights are assigned to the edges of the kNN graph. Finding such weights becomes very difficult as d increases. As a result, machine learning algorithms for graph-based embedding, clustering and label propagation tend to produce misleading results unless we are able to preserve the distributional information of the data in the graph-based representation. In this regard, recent experimental results with anchor graphs suggest a way to proceed. In [2, 7, 8], the predictive power of non-parametric regression rooted in the anchors/landmarks provides a way of constructing very informative weighted kNN graphs from a reduced set of representatives (anchors). Since anchor graphs are bipartite (only data-to-anchor edges exist), this representation bridges the sparsity of the pattern space: a random walk travelling from node u to node v must reach one or more anchors in advance. In other words, for a sufficient number of anchors it is possible to find links between distant regions of the space. As a result, the problem of finding suitable weights for the graph is solved through kernel-based regression.

Data-to-anchor kNN graphs are computed from \(m\ll n\) representatives (anchors), typically obtained through K-means clustering, in \(O(dmnT + dmn)\), where O(dmnT) is due to the T iterations of the K-means process. Since \(m\ll n\), the process of constructing the \(n\times n\) affinity matrix \(W=Z\varLambda ^{-1}Z^T\), where \(\varLambda ={{\mathrm{diag}}}(Z^T1)\) and Z is the \(n\times m\) data-to-anchor mapping matrix, is linear in n. As a byproduct of this construction, the leading r eigenvalue-eigenvector pairs of \(M=\varLambda ^{-1/2}Z^TZ\varLambda ^{-1/2}\), which has dimension \(m\times m\), lead to a compact solution of the spectral hashing problem [14] (see [9] for details). These eigenvalue-eigenvector pairs may also provide a meaningful estimation of the commute distances between the samples through the spectral expression of this distance [11].
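To make this construction concrete, the following is a minimal sketch of a data-to-anchor graph, assuming a simple s-nearest-anchor Gaussian regression for Z (the actual kernel and parameters used in [2, 7, 8] may differ); all function and variable names are illustrative:

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def anchor_graph(X, m=50, s=3, sigma=1.0):
    """Sketch: data-to-anchor mapping Z (n x m), Lambda = diag(Z^T 1),
    anchor-graph affinity W = Z Lambda^{-1} Z^T and the small matrix M."""
    n = X.shape[0]
    anchors, _ = kmeans2(X, m, minit='++')                     # anchors via K-means (T iterations)
    d2 = ((X[:, None, :] - anchors[None, :, :]) ** 2).sum(-1)  # squared data-to-anchor distances
    Z = np.zeros((n, m))
    for i in range(n):
        idx = np.argsort(d2[i])[:s]                            # s closest anchors of sample i
        w = np.exp(-sigma * d2[i, idx])
        Z[i, idx] = w / w.sum()                                # kernel-based regression weights
    lam = Z.sum(axis=0) + 1e-12                                # diagonal of Lambda = diag(Z^T 1)
    W = (Z / lam) @ Z.T                                        # n x n affinity W = Z Lambda^{-1} Z^T
    M = (Z / np.sqrt(lam)).T @ (Z / np.sqrt(lam))              # m x m matrix Lambda^{-1/2} Z^T Z Lambda^{-1/2}
    evals, evecs = np.linalg.eigh(M)                           # eigenpairs of M (ascending order)
    return W, evals[::-1], evecs[:, ::-1]
```

For the reduced NIST setting used later (\(n=200\)), this direct computation is perfectly feasible; for large n only Z and M would be formed explicitly.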

Despite these benefits, the use of anchor graphs remains quite empirical, since their foundations are poorly understood. For instance, the choice of the m representatives is quite open and heuristic. The K-means selection process outperforms uniform selection because it better approximates the underlying distribution. More clever oracles for estimating not only the positions of the representatives but also their number would lead to interesting improvements. However, the development of these oracles must be compatible with the underlying principle defining an anchor graph, namely densification. Densification refers to the process of increasing the number of edges (or the weights) of an input graph so that the result preserves, and even enforces, the structural properties of the input graph. This is exactly what anchors provide: given a sparse graph associated with a standard machine learning setting, they produce a more compact graph which is locally dense (especially around the anchors) and minimizes inter-class paths.

Graph densification is the principled study of how to significantly increase the number of edges of an input graph G so that the output, H, approximates G with respect to a given test function, for instance whether a given cut exists. Existing approaches [4] pose the problem in terms of semidefinite programming (SDP), where a global function is optimized. These approaches have two main problems: (a) the function to optimize is quite simple and does not impose the minimization of inter-class edges while maximizing intra-class edges, and (b) since the number of unknowns is \(O(n^2)\), i.e. all the possible edges, and SDP solvers are polynomial in the number of unknowns [10], only small-scale experiments can be performed. However, these approaches have inspired the densification solution proposed in this paper. Herein, instead of proposing an alternative oracle (top-down solution), we contribute a method for grouping sparse edges so that densification can rely on similarity diffusion (bottom-up solution). Since our long-term scientific strategy is to find a meeting point between bottom-up and top-down densifiers, here we study to what extent we can approximate the performance of anchor graphs using the input sparse graph as the unique source of information.

1.2 Contributions

In this paper, we propose a bottom-up graph densification approach which commences by grouping edges through return random walks (Sect. 2). Return random walks (RRWs) are designed to enforce intra-class edges while penalizing inter-class weights. Since our strategy is completely unsupervised, return random walks operate under the hypothesis that inter-class edges are rare events. Given an input sparse graph G (typically resulting from a thresholded similarity matrix W), RRWs produce a probabilistic similarity matrix \(W_e\). Then, high-probability edges are assumed to drive the grouping process. To this end, we exploit the random walker [3], but in the edge space (Sect. 3). The random walker minimizes the Dirichlet integral, in this case the one associated with the line graph of \(W_e\): \(Line_{W_e}\). Given a set of known edges (assumed to be the ones with maximal probability in \(W_e\)), we predict the remaining edges. The result is a locally dense graph H that is suitable for computing commute distances. In our experiments (Sect. 4), we compare our Dirichlet densifier with anchor graphs as well as with existing non-spectral alternatives relying exclusively on kNN graphs.

2 Return Random Walks

Given a set of points \(\chi = \{\varvec{x}_1,...,\varvec{x}_n\} \subset \mathbb {R}^{d} \), we map each \(\varvec{x}_i\) to a vertex of an undirected weighted graph G(V, E, W). Here V is the set of nodes, where each \(v_i\) represents a data point \(\varvec{x}_{i}\), and \(E\subseteq V\times V\) is the set of edges linking adjacent nodes. An edge \(e=(i,j)\), with \(i,j\in V\), exists if \(W_{ij} >0\), where \(W_{ij}= e^{-\sigma ||\varvec{x}_i - \varvec{x}_j||^2}\), i.e. \(W\in \mathbb {R}^{n\times n}\) is a weighted similarity matrix.
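As a reference, a direct (dense) computation of this similarity matrix might look as follows; the optional thresholding step, used to obtain the sparse input graph G, is an assumption on how sparsity is induced:

```python
import numpy as np

def gaussian_similarity(X, sigma=0.08, threshold=None):
    """W_ij = exp(-sigma * ||x_i - x_j||^2); sigma = 0.08 is the value used in Sect. 4."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    W = np.exp(-sigma * d2)
    np.fill_diagonal(W, 0.0)                               # no self-loops
    if threshold is not None:
        W[W < threshold] = 0.0                             # optional sparsification (assumed)
    return W
```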

Design of \(W_e\). Given W, we produce a reweighted similarity matrix \(W_e\) by following this rationale: (a) we explore the two-step random walks reaching a node \(v_j\) from \(v_i\) through any transition node \(v_k\); (b) on return from \(v_j\) to \(v_i\) we maximize the probability of returning through a different transition node \(v_l\ne v_k\). For the first step (going from \(v_i\) to \(v_j\) through \(v_k\)) we have \(p_{v_k}(v_j|v_i) = \frac{W_{i k}W_{kj}}{d(v_i)d(v_j)}\), as well as a standard return \(p_{v_l}(v_i|v_j) = \frac{W_{j l}W_{li}}{d(v_j)d(v_i)}\). The standard return works well if \(v_i\) and \(v_j\) belong to the same cluster (see Fig. 1-left). However, \(v_l\) (the transition node for returning) can be constrained so that \(v_l\ne v_k\). In this way, travelling out of a class is penalized, since the walker must choose a different path, which in turn is hard to find on average. Therefore, we obtain \(W_{e_{ij}}\) from \(W_{ij}\) as follows:

$$\begin{aligned} W_{e_{ij}} = \max \limits _{k} \max \limits _{l\ne k} \{p_{v_k}(v_j|v_i)\, p_{v_l}(v_i|v_j)\}\;, \end{aligned}$$
(1)

i.e. for each possible transition node \(v_k\) we compute the probability of going and returning (product of independent probabilities) through a different node \(v_l\). For each \(v_k\) we retain the maximum product of probabilities over the return nodes \(v_l\), and finally we keep the largest of these maxima. As a result, when inter-class transitions are frequent for a given edge \(e=(i,j)\) (Fig. 1-right), its weight \(W_{e_{ij}}\) is significantly reduced. Our working hypothesis is that the number of edges subject to this condition is small on average, since the number of inter-class edges tends to be small compared with the total number of edges. However, in realistic situations where patterns can be confused, due either to their intrinsic similarity or to the use of an inappropriate similarity measure, this assumption leads to a significant decrease of many weights of W.
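A direct, unoptimized sketch of Eq. (1), following the normalization given above literally (the loop over node pairs is kept for clarity):

```python
import numpy as np

def return_random_walk(W):
    """Sketch of Eq. (1): for every edge (i, j), combine the best two-step 'go'
    probability through v_k with the best 'return' probability through v_l != v_k."""
    n = W.shape[0]
    d = W.sum(axis=1)
    d[d == 0] = 1.0                                   # guard against isolated nodes
    We = np.zeros_like(W)
    for i in range(n):
        for j in range(n):
            if i == j or W[i, j] == 0:
                continue
            go = W[i, :] * W[:, j] / (d[i] * d[j])    # p_{v_k}(v_j | v_i) for all k
            ret = W[j, :] * W[:, i] / (d[j] * d[i])   # p_{v_l}(v_i | v_j) for all l
            order = np.argsort(ret)[::-1]             # best and second-best return nodes
            best_l, second_l = order[0], order[1]
            alt = np.full(n, ret[best_l])
            alt[best_l] = ret[second_l]               # if k equals the best l, fall back to the runner-up
            We[i, j] = np.max(go * alt)               # max over k of max over l != k
    return We
```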

Fig. 1. Return random walks for reducing inter-class noise.

3 The Dirichlet Graph Densifier

3.1 The Line Graph

The graph densification problem can be posed as follows: given a graph \(G=(V,E,W)\), infer another graph \(H=(V,E',W')\) with \(|E'|\ge |E|\) in such a way that the bulk of the increment in the number of edges is constrained to intra-class edges (i.e. the number of inter-class edges is minimized). Therefore, the unknowns of the problem are the new edges to infer, not the vertices. In principle we have \(O(n^2)\) unknowns, where \(n=|V|\), but working with all of them is infeasible. This motivates the selection of a small fraction of them (those with the highest values of \(W_{e_{ij}}\)) according to a given threshold \(\gamma _e\). The counterintuitive fact that the smaller this fraction, the better the accuracy, is explained below and shown later in the experimental section. Concerning efficiency, the first impact of this choice is that only \(|E''|\) edges, with \(|E''|\ll |E|\), are considered for building a graph of edges, i.e. a line graph \(Line_{W_e}\). Let A be the \(p\times n\) edge-node incidence matrix defined as follows:

$$\begin{aligned} A_{e_{ij}v_k} = \left\{ \begin{array}{ll} +1 &{} \text {if}\;i=k, \\ -1 &{} \text {if}\;j=k,\\ 0 &{} \text {otherwise}, \end{array}\right. \end{aligned}$$
(2)

Then \(C = AA^T - 2I_{p}\) is the adjacency matrix of an unweighted line graph, where \(I_{p}\) is the \(p\times p\) identity matrix: its nodes \(e_{a}\) correspond to the \(r=|E''|\) retained edges, and two nodes are linked whenever the corresponding edges share a vertex according to A. The edges of C implement second-order interactions between nodes of the original graph from which A is derived. However, C is still unattributed (although conditioned by \(W_e\)). A proper weighting for this graph is obtained through standard “go and return” random walks, i.e.

$$\begin{aligned} Line_{W_e}(e_a,e_b)=\sum _{k=1}^{r} p_{e_k}(e_b|e_a)p_{e_k}(e_a|e_b)\;, \end{aligned}$$
(3)

i.e. return random walks are not applied here because they become too restrictive. Then, there is an edge in the line graph for every pair \((e_a,e_b)\) with \(Line_{W_e}(e_a,e_b)>0\). We denote the set of edges of the line graph by \(E_{Line}\).
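The sketch below builds the incidence matrix of Eq. (2), the unweighted adjacency C, and a weighting in the spirit of Eq. (3). The one-step transition probabilities on the line graph are an assumption (here, proportional to the \(W_e\) value of the target edge-node), since only the "go and return" structure is specified above:

```python
import numpy as np

def build_line_graph(We, gamma_e=0.05):
    """Retain the top gamma_e fraction of edges of W_e, build the p x n incidence
    matrix A (Eq. 2), C = A A^T - 2 I_p, and a go-and-return weighting (cf. Eq. 3)."""
    n = We.shape[0]
    iu, ju = np.triu_indices(n, k=1)
    vals = We[iu, ju]
    order = np.argsort(vals)[::-1]
    p = max(1, int(gamma_e * np.count_nonzero(vals)))   # |E''| leading edges
    keep = order[:p]
    edges = list(zip(iu[keep], ju[keep]))
    w_edge = vals[keep]                                  # W_e value attached to each retained edge

    A = np.zeros((p, n))
    for a, (i, j) in enumerate(edges):                   # Eq. (2): +1 / -1 incidence
        A[a, i], A[a, j] = +1.0, -1.0
    C = (np.abs(A @ A.T - 2.0 * np.eye(p)) > 0).astype(float)   # unweighted line-graph adjacency

    row = C * w_edge[None, :]                            # assumed transition weights towards e_k
    P = row / row.sum(axis=1, keepdims=True).clip(min=1e-12)
    Line = np.zeros((p, p))
    for a in range(p):
        for b in range(a + 1, p):
            # go-and-return through each intermediate edge-node e_k (cf. Eq. 3)
            val = np.sum(P[a, :] * P[:, b] * P[b, :] * P[:, a])
            Line[a, b] = Line[b, a] = val
    return edges, w_edge, Line
```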

3.2 The Dirichlet Functional for the Line Graph

Given the line graph \(Line_{W_e}\) with r nodes (edges of the original graph), many of them will be highly informative according to \(W_e\) and the application of Eq. 3. We retain a fraction of them (again, those with the largest values of \(W_e\)) according to a second threshold \(\mu _e\). This threshold must be set as small as possible, since it defines the difference between the “known” and the “unknown”. More precisely, \(W_e\) acts as a function \(W_e:E''\rightarrow \mathbb {R}\) such that the larger its value, the more trustworthy a given edge is as a stable or known edge in the original graph G. Unknown edges are assumed to have small values of \(W_e\) and this is why they are not selected, since the purpose of our method is to infer them.

This is a classical inference problem, now in the space of edges and completely unsupervised, which has been posed in terms of minimizing the disagreements between the weights of existing (assumed to be “known”) edges and those of the “unknown” or inferred ones. In this regard, since unknown edges are typically neighbors of known ones, the minimization of this disagreement is naturally expressed in terms of finding a harmonic function. Harmonic functions u(x) satisfy \(\nabla ^2 u=0\), which in our discrete setting leads to the following property:

$$\begin{aligned} u(e_a) = \frac{1}{d(e_a)}\sum _{(e_a,e_b)\in E_{Line}} Line_{W_e}(e_a,e_b)u(e_b)\;, \end{aligned}$$
(4)

The harmonic function u(.) is not unconstrained, since it is known for some values of the domain (the so-called boundary). In our case, we set \(u(e_a)=W_{e_a}\) for \(e_a\in E_B\), referred to as border nodes since they are associated with assumed known edges. The harmonic function is unknown for \(e_b\in E_I=E''\setminus E_B\) (the inner nodes). Then, finding a harmonic function given boundary values is called the Dirichlet problem, and it is typically formulated in terms of minimizing the following integral:

$$\begin{aligned} D[u] = \frac{1}{2}\int _{\varOmega } |\nabla u|^2 d\varOmega , \end{aligned}$$
(5)

whose discrete version relies on the graph Laplacian [3] (in this case on the Laplacian of the line graph):

$$\begin{aligned} D_{Line}[u]= & {} \frac{1}{2}(A'u)^TR(A'u)=\frac{1}{2}u^T\mathcal{L}_{Line}u\nonumber \\= & {} \frac{1}{2}\sum _{(e_a,e_b)\in E_{Line}}Line_{W_e}(e_a,e_b)(u(e_a)-u(e_b))^2\;, \end{aligned}$$
(6)

where \(A'\) is the \(|E_{Line}|\times |E''|\) incidence matrix of the line graph, R is the \(|E_{Line}|\times |E_{Line}|\) diagonal constitutive matrix containing the weights of the edges of the line graph, and \(\mathcal{L}_{Line}=D_{Line}-Line_{W_e}\) is the Laplacian of the line graph, with diagonal degree matrix \(D_{Line}={{\mathrm{diag}}}(d(e_1),\ldots ,d(e_{|E''|}))\) and \(d(e_a)=\sum _{e_b\ne e_a}Line_{W_e}(e_a,e_b)\).

Given the Laplacian \(\mathcal{L}_{Line}\) and the combinatorial Dirichlet integral \(D_{Line}\), the nodes of the line graph are partitioned into two classes, “border” and “inner”, i.e. \(E^{\prime \prime }= E_B \cup E_I\). This partition leads to a reordering of the harmonic function, \(u=[u_B\; u_I]\), as well as of the Dirichlet integral:

$$\begin{aligned} D[u_{I}] = \frac{1}{2} \begin{bmatrix} u^{T}_{B} &{} u^{T}_{I} \end{bmatrix} \begin{bmatrix} L_{B} &{} K \\ K^{T} &{} L_{I} \end{bmatrix} \begin{bmatrix} u_{B} \\ u_{I} \end{bmatrix}\;, \end{aligned}$$
(7)

where \(D[u_{I}] = \frac{1}{2} (u^{T}_{B} L_{B} u_{B} + 2u^{T}_{I} K^{T} u_{B} + u^{T}_{I} L_{I} u_{I})\). Differentiating w.r.t. \(u_I\) leads to a linear system which relates \(u_{I}\) to \(u_{B}\):

$$\begin{aligned} L_{I}u_{I} = -K^Tu_{B}\;. \end{aligned}$$
(8)

Then, let \(s\in [0,1]\) be a label indicating to what extent a given node of the line graph (an edge in the original graph) is relevant. We define a potential function \(Q:E_B\rightarrow [0,1]\) so that for a known node \(e_a\in E_{B}\) we assign a label s, i.e. \(Q(e_a)=s\). This leads to the following vector for each label:

$$\begin{aligned} m^{s}_{a}= \left\{ \begin{array}{ll} \frac{W_{e_a}}{\max _{e_b\in E^{\prime \prime }}\{ W_{e_{b}}\}} &{} \text {if}\;Q(e_a)=s, \\ 0 &{} \text {if}\;Q(e_a)\ne s \end{array}\right. \;. \end{aligned}$$
(9)

Finally, the linear system is posed in terms of how the known labels predict the unknown ones, placed in the vector u, as follows:

$$\begin{aligned} L_{I}u^s = -K^Tm^s\;. \end{aligned}$$
(10)

If we consider simultaneously all labels instead of a single one, we have

$$\begin{aligned} L_{I}U = -K^TM\;\Rightarrow U = -L_{I}^{-1}K^TM\;, \end{aligned}$$
(11)

where U has \(|E_I|\) rows (one per unknown/inner edge, now solved) and M has \(|E_B|\) rows (one column per label). Then, let \(U_b\) be the \(b\)-th row, i.e. the weight of a previously unknown edge \(e_b\). Since there is a bijective correspondence between the nodes of the line graph (some of them denoted by \(e_a\), since they are known, and the remainder denoted by \(e_b\)) and the edges of the original graph \(G=(V,E,W)\), each \(e_k\) corresponds to an edge \((i,j)\in E\). However, since its weight has potentially changed after solving the linear system, we adopt the following densification criterion (labeling) for creating the graph \(H=(V,E',W')\):

$$\begin{aligned} H_{ij} = \left\{ \begin{array}{ll} \mathop {\max }\nolimits _{e_k\in U} U_k &{}\text {if}\; e_k \in E_{I}\\ M_{ij} &{} \text {if}\; e_{k}\in E_{B}, \\ 0 &{} \text {otherwise} \end{array}\right. \end{aligned}$$
(12)

In this way, the edges \(E'\) of the dense graph H are given by \(H_{ij}>0\).
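Putting Eqs. (6)-(12) together, a minimal sketch of the densification step could look as follows. Here each inferred (inner) edge receives its solved harmonic value; Eq. (12) as written assigns the maximum over U instead, so this is one plausible reading rather than a literal transcription:

```python
import numpy as np

def dirichlet_densify(Line, w_edge, edges, n, mu_e=0.5):
    """Solve the combinatorial Dirichlet problem on the line graph (Eqs. 6-11)
    and turn the solution into a densified graph H (cf. Eq. 12)."""
    r = Line.shape[0]
    L = np.diag(Line.sum(axis=1)) - Line                   # Laplacian of the line graph
    order = np.argsort(w_edge)[::-1]
    nB = max(1, int(mu_e * r))
    B, I = order[:nB], order[nB:]                          # border ("known") and inner ("unknown") nodes
    m = w_edge[B] / w_edge.max()                           # boundary values, Eq. (9)
    K = L[np.ix_(B, I)]                                    # off-diagonal block of the reordered Laplacian
    L_I = L[np.ix_(I, I)]
    u_I = np.linalg.solve(L_I + 1e-9 * np.eye(len(I)), -K.T @ m)   # Eq. (10); small ridge for stability

    H = np.zeros((n, n))
    for a, val in zip(B, m):                               # known edges keep their boundary weight
        i, j = edges[a]
        H[i, j] = H[j, i] = val
    for a, val in zip(I, u_I):                             # inferred edges receive their harmonic estimate
        i, j = edges[a]
        H[i, j] = H[j, i] = max(val, 0.0)
    return H
```

A full pipeline would then chain gaussian_similarity, return_random_walk, build_line_graph and dirichlet_densify before estimating commute distances on H.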

Table 1. Dirichlet densifier: accuracy for the reduced NIST database
Fig. 2. Densification result and its associated approximate commute times (ACT) matrix for different fractions of known labels \(|E_B|\) and leading edges \(|E^{\prime \prime }|\): (a) densification with \(|E_B|=5\,\%\), \(|E^{\prime \prime }|=5\,\%\), (b) corresponding ACT, (c) densification with \(|E_B|=50\,\%\), \(|E^{\prime \prime }|=5\,\%\), (d) corresponding ACT, (e) densification with \(|E_B|=50\,\%\), \(|E^{\prime \prime }|=50\,\%\), (f) corresponding ACT.

Fig. 3. Top: accuracy of anchor graphs, kNN graphs and Dirichlet densifiers. Dirichlet densifiers do not depend on the number of anchors and are completely unsupervised. Bottom: accuracy vs. spectral gap.

4 Experiments and Conclusions

In our experiments we use a reduced version of the NIST digits database: \(n=200\) (20 samples per class), and proceed to estimate commute distances. In all cases, given a similarity matrix, we use the \(O(n\log n)\) randomized algorithm proposed in [12]. We explore the behavior of the proposed Dirichlet densifier for different values of \(\gamma _e\), the threshold that preserves different fractions of the leading edges (the ones with the highest values in \(W_e\)): from \(5\,\%\) to \(50\,\%\). Concerning the threshold \(\mu _e\), which controls the fraction of leading nodes of the line graph assumed to be “known” (i.e. border data in the terminology of Dirichlet problems), we have explored the same range: from \(5\,\%\) to \(50\,\%\) (see Table 1, where we show the accuracies corresponding to each of the 100 experiments performed). A first important conclusion is that the best clustering accuracy (w.r.t. the ground truth) is obtained when the fraction of retained edges for constructing the line graph is minimal. Although the removed edges cannot be reconstructed after solving the Dirichlet equation, i.e. we bound significantly the level of densification, reducing the fraction of retained edges significantly reduces the inter-class noise. We show this effect in Fig. 2(e)–(f). For instance, a fraction of \(5\,\%\) (c) produces a better approximation of the commute distance (d) than retaining \(50\,\%\) of the edges to build the line graph. The commute distances after retaining \(50\,\%\) are meaningless (f) despite the obtained graph being denser. In all cases, the error assumed when approximating the commute times matrix is \(\epsilon =0.25\).
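For a graph of this size the commute-time matrix can also be computed exactly from the Laplacian pseudoinverse, via the standard spectral formula \(CT_{ij}=vol(G)(L^+_{ii}+L^+_{jj}-2L^+_{ij})\); the sketch below is this exact computation, not the randomized estimator of [12] used above:

```python
import numpy as np

def commute_times(W):
    """Exact commute times: CT_ij = vol(G) * (Lp_ii + Lp_jj - 2 Lp_ij),
    with Lp the pseudoinverse of the graph Laplacian; feasible for n = 200."""
    d = W.sum(axis=1)
    L = np.diag(d) - W
    Lp = np.linalg.pinv(L)                     # Moore-Penrose pseudoinverse
    vol = d.sum()
    diag = np.diag(Lp)
    return vol * (diag[:, None] + diag[None, :] - 2.0 * Lp)
```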

In a second experiment, we compare the commute distances obtained with the optimal Dirichlet densifier (fraction of retained leading edges \(|E^{\prime \prime }|=5\,\%\) and fraction of known labels \(|E_B|=50\,\%\)) with different settings of the anchor graphs. Concerning anchor graphs, in all cases we set \(\sigma =0.08\) for constructing the Gaussian graphs from the raw input data. In our Dirichlet approach we use the same setting. This provides the best result in the range \(\sigma \in [0.05,0.13]\). In Fig. 3-Left we show how the accuracy evolves while increasing the number of anchors m: from 5 to 150. The performance of anchor graphs increases with m but degrades after reaching a peak at \(m=70\) (accuracy 0.67). This peak is due to the fact that anchor graphs tend to reduce the amount of inter-class noise; however, this often leads to poor densification. On the other hand, Dirichlet densifiers are completely unsupervised and do not rely on anchor computation. Their performance is constant w.r.t. m and their best accuracy is 0.60. We outperform anchor graphs for \(m<35\) and \(m>105\), and in the range \(m\in [35,105]\) our best accuracy is very close to the anchor graphs' performance. Regarding existing approaches that compute commute distances from standard weighted kNN graphs [5, 6], we outperform them for any choice of m, since their performance degrades very fast with m due to the intrinsic inter-class noise arising in realistic databases.

Finally, we reconcile our results, and those of the anchor graphs, with von Luxburg and Radl's fundamental bounds. In principle, commute distances cannot be properly estimated from large graphs [13]. However, in this paper we show that both anchor graphs and Dirichlet densifiers provide meaningful commute times. It is well known that this can be done insofar as the spectral gap is close to zero or the minimal degree is close to the unit. Dirichlet densifiers provide spectral gaps close to zero (see Fig. 3-Right) for low fractions of leading edges, but the accuracy degrades linearly as the spectral gap increases. This means that the spectral gap is negatively correlated with increasing fractions of inter-class noise. This noise arises when the densification level increases, since Dirichlet densifiers are not yet able to confine densification to intra-class links. Concerning anchor graphs, their spectral gap is close to the unit, since the degree is also close to the unit (doubly-stochastic matrices), and they outperform Dirichlet densifiers to some extent, at the cost of computing anchors and finding the best number of them.

To conclude, we have contributed a novel method for transforming input graphs into denser versions which are more suitable for estimating meaningful commute distances in large graphs.