1 Motivation

k-Nearest-Neighbor (KNN) graphs have found use in a number of domains, including machine learning, recommenders, and search. Some applications, however, do not require the k closest nodes but the k most dissimilar ones, yielding what we term the k-Furthest-Neighbor (KFN) graph.

Virtual Machine (VM) placement, i.e. the (re-)assignment of workloads in virtualised IT environments, is a good example of where KFN graphs can be applied. The problem consists in finding an assignment of VMs to physical machines (PMs) that minimises some cost function(s) [27]. It has been described as one of the most complex and important problems for the IT industry [3], with large potential savings [20]. An important challenge is that a solution does not only consist in packing VMs onto PMs: it must also limit the interference between VMs hosted on the same PM [31]. Whatever technique is used (e.g. clustering [21]), interference-aware VM placement algorithms need to identify complementary workloads, i.e. workloads that are dissimilar enough that the interference between them is minimised. This is why KFN graphs make a lot of sense in this setting: quickly identifying complementary workloads (using KFN) would help placement algorithms decrease the risk of interference.

The construction of KNN graphs in decentralized systems has been widely studied in the past [4, 14, 17, 30]. However, existing approaches typically assume a form of “likely transitivity” of similarity between nodes: if A is close to B, and B to C, then A is likely to be close to C. Unfortunately, this property no longer holds when constructing KFN graphs. As a result, as we demonstrate in the remainder of this paper, these approaches stop working when applied to this new problem.

To address this problem, this paper proposes HyFN (standing for Hybrid KFN, pronounced hyphen), a hybrid approach for the decentralized construction of k-furthest-neighbor graphs. We show that HyFN is able to construct a KFN graph over 3200 nodes in less than 17 rounds, where a traditional greedy approach is unable to converge. We also show that our proposal is highly scalable, with a convergence time evolving in O(log(n)) for larger graphs.

The remainder of this paper is organized as follows: we first discuss some background about k-nearest-neighbor (KNN) graphs and their decentralized construction in peer-to-peer networks, before presenting our intuition for the construction of a k-furthest-neighbor graph (KFN) in Sect. 2. In Sect. 3, we describe in more detail HyFN and its variants. We evaluate our approach in Sect. 4, discuss related work in Sect. 5 and conclude in Sect. 6.

2 Decentralized Construction of a KFN Graph

2.1 Background: Decentralized KNN Graph Construction

The problem of constructing a k-furthest-neighbor (KFN) graph can be seen as a variant of k-nearest-neighbor (KNN) graph construction that uses a negated similarity.

[Algorithm 1: a round of greedy decentralized KNN graph construction]

A large body of work has been devoted to constructing KNN graphs in decentralized systems, with applications ranging from recommendation [4, 14, 19], to search [13], to news dissemination [6]. In such systems, nodes (e.g. representing a user) can connect to each other using point-to-point networking, but only maintain a small partial view of the rest of the system, typically a small-size neighborhood of other nodes. Each node also stores a profile (e.g. a user’s browsing history), and uses a peer-to-peer epidemic protocol [1, 4, 17, 30] to converge towards an optimal neighborhood, i.e. a neighborhood containing the k most similar other nodes in the system according to some similarity metric on profiles (e.g. cosine similarity, or Jaccard’s coefficient).

[Fig. 1. A round of greedy decentralized KNN construction]

The principle of a typical P2P protocol for KNN graph construction [9, 30] is shown in Algorithm 1, in its push-pull variant. Starting from a random neighborhood, individual nodes repeatedly select a random neighbor q (line 2), exchange their current neighborhood with that of q (noted \(\varGamma (q)\), line 4), and use the gained information to select more similar neighbors (line 5). Similarly, when receiving a new neighborhood pushed to them, nodes update their local view with the new nodes they have just heard of (lines 6–8). The intuition behind this greedy procedure is that if A is similar to B, and B to C, C is likely to be similar to A as well. To avoid local minima, this greedy procedure is often complemented with a few random peers (returned by a peer sampling service [18], tuned with parameter r at line 4).
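
To make this concrete, the following Python sketch shows one such push-pull round under simplifying assumptions: a global views dictionary stands in for real network messages, and knn_round, sim and peer_sampling are our own illustrative names rather than the notation of Algorithm 1.

import random

def knn_round(p, views, profiles, sim, peer_sampling, k, r):
    # One push-pull round of greedy KNN construction, run by node p.
    # views[n] holds Gamma(n), the current neighborhood of node n.
    q = random.choice(list(views[p]))              # random neighbor (line 2)
    candidates = (views[p] | views[q] | {q} | set(peer_sampling(r))) - {p}
    # greedy step: keep the k most similar candidates (line 5)
    views[p] = set(sorted(candidates,
                          key=lambda g: sim(profiles[p], profiles[g]),
                          reverse=True)[:k])
    # push: q symmetrically merges p's information into its own view (lines 6-8)
    cand_q = (views[q] | views[p] | {p}) - {q}
    views[q] = set(sorted(cand_q,
                          key=lambda g: sim(profiles[q], profiles[g]),
                          reverse=True)[:k])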

This mechanism is illustrated in Fig. 1. In this example, node Alice is interested in hearts (Alice’s profile), and is currently connected to Frank and Ellie. During this round, Alice selects Bob as her exchange partner. After exchanging her neighbor list with Bob, Alice finds out about Carl, who turns out to be a better neighbor than Ellie. Alice therefore replaces Ellie with Carl in her neighborhood. Similarly, Bob detects that Ellie is a better neighbor than Alice, and drops Alice in favor of Ellie.

2.2 Moving to Decentralized k-Furthest-Neighbor Graph Construction

Algorithm 1 can easily be adapted to compute a decentralized k-furthest-neighbor (KFN) graph by negating the similarity at line 5:

$$\begin{aligned} \varGamma (p) \leftarrow \mathop {\mathrm {argtop}}^{k}_{g\in cand}{\big (-\text {sim}(p,g)\big )} \end{aligned}$$
(1)
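
Concretely, the only change with respect to the greedy KNN protocol is the selection step. A minimal Python sketch (select_view is our own name; heapq.nlargest plays the role of the argtop operator):

import heapq

def select_view(p, cand, sim, k, furthest=False):
    # Selection step of Algorithm 1 (line 5) and of its KFN variant (Eq. 1):
    # keep the k candidates with the highest (or lowest) similarity to p.
    sign = -1 if furthest else 1
    return set(heapq.nlargest(k, cand, key=lambda g: sign * sim(p, g)))

# KNN: select_view(p, cand, sim, k)
# KFN: select_view(p, cand, sim, k, furthest=True)   # maximizes -sim(p, g)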

Unfortunately, with this modification, one of the key premises of Algorithm 1 disappears: the far neighbors of a far neighbor are not so likely to be interesting candidates to construct a KFN graph. Said differently, if A is far from B, and B far from C, this does not imply that A is far from C (or further from C than any other node taken randomly in the dataset).

[Fig. 2. The two heuristics we propose to construct a KFN graph]

Starting from this observation, we propose instead a dual strategy that maintains an intermediate KNN graph in order to construct the final KFN graph. In our approach, each node p maintains two views containing k nodes each: \(\varGamma _{{\text {close}}}(p)\) and \(\varGamma _{{\text {far}}}(p)\).

\(\varGamma _{{\text {close}}}(p)\) uses Algorithm 1 to converge towards the k most similar other nodes in the system. \(\varGamma _{{\text {far}}}(p)\) employs two greedy optimization heuristics that exploit \(\varGamma _{{\text {close}}}(p)\) to progressively discover the k furthest neighbors from p. The intuition behind these two heuristics (shown in Fig. 2 in the case of node Alice) is as follows:

  • The first heuristic (termed far-from-close and labeled 1 in the figure) requests the “far neighborhood” \(\varGamma _{{\text {far}}}(B)\) of a node Bob found in Alice’s “close neighborhood” \(\varGamma _{{\text {close}}}(A)\). The idea is that if Bob is close to Alice, then nodes that are far from Bob (such as Carl in Fig. 2) will also be far from Alice.

  • The second heuristic (termed close-to-far and labeled 2 in the figure) requests the “close neighborhood” \(\varGamma _{{\text {close}}}(D)\) of a node Dave found in Alice’s “far neighborhood” \(\varGamma _{{\text {far}}}(A)\). The idea is that if Dave is far from Alice, then nodes that are close to Dave (such as Ellie in Fig. 2) will also be far from Alice.

In the following we present HyFN, a general algorithm that combines the two heuristics described above in varying proportions.

3 Algorithms

3.1 General Framework

Algorithm 2 provides an overview of the approach we propose, termed HyFN, as executed by node p. For a fair comparison with a traditional greedy approach, we limit ourselves to one push-pull exchange per round and per node (as in Algorithm 1). This limitation is key to properly assessing the interest of our approach: an algorithm that exchanges more information is naturally advantaged over its more frugal competitors. It would for instance be unfair to compare an algorithm using multiple push-pull exchanges to maintain multiple views against Algorithm 1, as such an algorithm would be more costly in terms of network traffic.

[Algorithm 2: HyFN, the general framework]

To ensure only one push-pull exchange is performed per round, we use the construct with probability \(\alpha \) do .. otherwise at line 3. This construct executes the first statement with probability \(\alpha \), and the second with probability \((1-\alpha )\). In this particular case, Algorithm 2 randomly alternates between invoking updateCloseView() at line 4 and invoking updateFarView() at line 6. Both procedures (discussed below) generate only one network exchange per node and per round, thus enforcing our communication limit. updateCloseView() maintains \(\varGamma _{{\text {close}}}(p)\), p’s close neighborhood, while updateFarView() uses \(\varGamma _{{\text {close}}}(p)\) to construct \(\varGamma _{{\text {far}}}(p)\). The parameter \(\alpha \in [0,1]\) determines how much effort each node spends on \(\varGamma _{{\text {close}}}(p)\) rather than on \(\varGamma _{{\text {far}}}(p)\).
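
As a sketch, the top-level round of Algorithm 2 thus reduces to a single biased coin flip (Python; update_close_view and update_far_view stand for the two procedures discussed below):

import random

def hyfn_round(p, alpha):
    # One round of HyFN at node p: a biased coin flip guarantees that
    # exactly one push-pull exchange is initiated per round.
    if random.random() < alpha:
        update_close_view(p)   # maintain Gamma_close(p) via Algorithm 1
    else:
        update_far_view(p)     # use Gamma_close(p) to improve Gamma_far(p)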

updateCloseView(), shown at lines 7–11, uses Algorithm 1 (discussed in Sect. 2.1) to construct \(\varGamma _{{\text {close}}}(p)\). updateFarView() relies on a pluggable procedure farCandidatesXX(p), which uses a push-pull exchange to collect potential new candidates for p’s far neighborhood \(\varGamma _{{\text {far}}}(p)\). The current far neighborhood of p, the nodes returned by farCandidatesXX(p), and r random nodes are gathered in the intermediate \(cand_{{\text {far}}}\) variable (line 16). The k furthest nodes from \(cand_{{\text {far}}}\) then become p’s new far neighborhood (line 17; note the minus sign before \(\text {sim}(p,g)\), in contrast to line 11). (We discuss the push part of the exchange just below.)
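
Under the same simplifying assumptions as before, updateFarView() can be sketched as follows (far_candidates stands for the pluggable farCandidatesXX procedure; all names are ours):

import heapq

def update_far_view(p, far_views, profiles, sim, peer_sampling,
                    far_candidates, k, r):
    # Gather the current far view, the candidates obtained through the
    # pluggable push-pull exchange, and r random nodes (line 16)...
    cand_far = (far_views[p] | set(far_candidates(p))
                | set(peer_sampling(r))) - {p}
    # ...then keep the k furthest nodes, i.e. those maximizing the
    # negated similarity (line 17)
    far_views[p] = set(heapq.nlargest(
        k, cand_far, key=lambda g: -sim(profiles[p], profiles[g])))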

3.2 Instantiating the Selection of Far Candidates

The pluggable method farCandidatesXX(p) can be instantiated in three different ways, via the procedures farCandidatesFarFromClose(p), farCandidatesCloseToFar(p) and farCandidatesMixed(p), shown in Algorithms 3, 4 and 6 respectively.

[Algorithms 3, 4, 5, and 6: farCandidatesFarFromClose, farCandidatesCloseToFar, handling of far messages, and farCandidatesMixed]
  • farCandidatesFarFromClose(p) (Algorithm 3) implements the far-from-close strategy discussed in Sect. 2.2: the local node p first selects one of its close neighbors \(q_{{\text {close}}}\) (line 2), and returns the far neighbors of \(q_{{\text {close}}}\), \(\varGamma _{{\text {far}}}(q_{{\text {close}}})\), as new candidates to update \(\varGamma _{{\text {far}}}(p)\). In addition, the procedure pushes towards \(q_{{\text {close}}}\) the far neighbors of p, as nodes far from p are likely to lie far from \(q_{{\text {close}}}\) as well. The receipt of the corresponding far message is handled by the code shown in Algorithm 5.

  • farCandidatesCloseToFar(p) (Algorithm 4) implements the close-to-far strategy presented above: this time, p picks one of its current far neighbors \(q_{{\text {far}}}\), and returns the close neighbors of \(q_{{\text {far}}}\), \(\varGamma _{{\text {close}}}(q_{{\text {far}}})\), in order to improve \(\varGamma _{{\text {far}}}(p)\). The procedure also pushes towards \(q_{{\text {far}}}\) the close neighborhood of node p, \(\varGamma _{{\text {close}}}(p)\), as those nodes are likely to lie far from \(q_{{\text {far}}}\). The push message, of type far, is handled as above.

  • farCandidatesMixed(p) (Algorithm 6) combines the two above strategies into a single heuristic. As in Algorithm 2, we use the with probability construct to switch between the far-from-close and close-to-far strategies with probability \(\beta \), thus ensuring that only one push-pull exchange occurs every time farCandidatesMixed(p) is invoked. The parameter \(\beta \) further controls how much each strategy is used, and allows farCandidatesMixed(p) to generalize the previous two procedures: the extreme case \(\beta =0\) corresponds to the far-from-close strategy, while \(\beta =1\) implements a close-to-far approach. (All three procedures are sketched in Python right after this list.)
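
A Python sketch of the three procedures, under the same simplifying assumptions as before (global close_views/far_views dictionaries, and a send_far callback standing in for the far push message handled by Algorithm 5; all names are ours):

import random

def far_candidates_far_from_close(p, close_views, far_views, send_far):
    # Algorithm 3, sketched: pull the far view of a random close neighbor
    # q_close, and push our own far view to it, since nodes far from p
    # are likely to lie far from q_close as well.
    q_close = random.choice(list(close_views[p]))
    send_far(q_close, far_views[p])
    return far_views[q_close]

def far_candidates_close_to_far(p, close_views, far_views, send_far):
    # Algorithm 4, sketched: pull the close view of a random far neighbor
    # q_far, and push our own close view to it, since nodes close to p
    # are likely to lie far from q_far.
    q_far = random.choice(list(far_views[p]))
    send_far(q_far, close_views[p])
    return close_views[q_far]

def far_candidates_mixed(p, close_views, far_views, send_far, beta):
    # Algorithm 6, sketched: a beta-biased coin chooses between the two
    # strategies, so each invocation still triggers a single exchange
    # (beta = 0: always far-from-close; beta = 1: always close-to-far).
    if random.random() < beta:
        return far_candidates_close_to_far(p, close_views, far_views, send_far)
    return far_candidates_far_from_close(p, close_views, far_views, send_far)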

Taken together, Algorithms 2 to 6 capture a family of decentralized k-furthest-neighbor (KFN) graph construction protocols, controlled by two stochastic parameters, \(\alpha \) and \(\beta \). Parameter \(\alpha \) controls the distribution of effort between the intermediate KNN view and the final KFN view, while \(\beta \) arbitrates between the far-from-close and close-to-far strategies.

Note that some gossip protocols, such as the original T-Man, tailor the candidates they send to the specific node that requested them, while we do not. For instance, in farCandidatesFarFromClose, q sends back the same set \(\varGamma _{{\text {far}}}(q)\) as potential new neighbors for p, whichever node p sent the request. This is because those protocols work with unbounded views that keep all received data but use fixed-size messages, so they must select the best information available to send back. As our approach works with fixed-size views, we simply send the full view.

4 Evaluation

We evaluate our framework using the simulator PeerSim [23], and compare its behavior against a basic greedy epidemic protocol (Algorithm 1) that uses a negated similarity metric (Eq. 1). We term this baseline solution Far-from-Far, and note that it is strictly better than taking purely random nodes: it selects the best neighbors from a candidate set that includes not only random nodes from the peer-sampling service, but also additional nodes known from one-hop neighbors.

We are essentially interested in two aspects of our solution: (i) its convergence, i.e. how fast our framework converges to a good KFN graph, and (ii) its scalability, i.e. how this convergence speed evolves with growing network sizes. The code used for our experiments can be found on-line at https://gitlab.inria.fr/ASAP/HyFN.

4.1 Experimental Set-Up and Metrics

Unless stated otherwise, our default set-up involves 3200 nodes regularly positioned on a [0, 1) ring. By default, we use views of \(k = 14\) nodes, and fetch \(r=3\) random nodes in each round. We set the parameters of HyFN to \(\alpha =\beta =0.5\). These values mean that on average nodes spend the same number of rounds constructing their KNN and KFN views (\(\alpha \) at line 3 of Algorithm 2), and that the construction of the KFN view uses the far-from-close and close-to-far heuristics in equal measure (\(\beta \) at line 2 of Algorithm 6). We assume a random peer sampling service (RPS) [18] is available, which we use to initialize all views with random nodes before the protocol starts, and to provide r random nodes in each round.
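
On such a ring, a natural dissimilarity between two nodes is their ring distance, which is maximal (0.5) at the antipode. A minimal Python sketch of this default topology, assuming similarity is simply the negated ring distance:

def ring_distance(a, b):
    # Distance between positions a and b on a ring of perimeter 1.
    d = abs(a - b)
    return min(d, 1.0 - d)

n = 3200
positions = [i / n for i in range(n)]      # nodes regularly spaced on [0, 1)
sim = lambda a, b: -ring_distance(a, b)    # most similar = closest on the ring

# Example: ring_distance(0.1, 0.9) == 0.2 (wrapping around), while the
# furthest point from 0.1 is 0.6, at the maximal distance 0.5.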

To measure the convergence of the approximate KFN graph constructed by HyFN we use the following four metrics:

  • Number of missing links: For each node, we count how many of its true k furthest neighbors are missing from its KFN view. Summing these missing links over the whole network yields our first metric.

  • Number of converged nodes: As a second measure of convergence, we consider that a node has converged when at least 80% of its k furthest neighbors (taking ties into account) are contained in its KFN view. As a measure of the network’s convergence, we count in each round how many nodes have converged. (A sketch of these first two metrics follows this list.)

  • Average KFN distance: For each node, we compute the average distance between this node and the nodes in its KFN view. This metric should tend toward 0.5 in a ring of perimeter 1 (our default topology). Note, however, that even a perfectly converged network does not actually reach 0.5, the exact value depending on the density of the network.

  • Convergence time: Finally, we consider that the whole network is converged when at least 80% of all nodes are converged, according to the above criterion. We count the number of rounds until this convergence condition is fulfilled.
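
The first two metrics can be sketched as follows, reusing ring_distance from the sketch above (the exact k furthest neighbors are recomputed by brute force; ties are ignored for simplicity, and all names are ours):

import heapq

def true_kfn(node, positions, k):
    # Exact k furthest neighbors of a node, by brute force.
    others = (m for m in range(len(positions)) if m != node)
    return set(heapq.nlargest(
        k, others, key=lambda m: ring_distance(positions[node], positions[m])))

def missing_links(node, kfn_view, positions, k):
    # How many of the node's true k furthest neighbors its KFN view misses.
    return len(true_kfn(node, positions, k) - kfn_view)

def has_converged(node, kfn_view, positions, k, threshold=0.8):
    # A node has converged when at least 80% of its true k furthest
    # neighbors appear in its KFN view.
    return missing_links(node, kfn_view, positions, k) <= (1 - threshold) * k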

[Fig. 3. Converged nodes, missing links, and average similarity of the baseline (Far-from-Far) and of three versions of HyFN (\(\beta =1\) for Close-to-Far, \(\beta =0\) for Far-from-Close, and \(\beta =0.5\) for Hybrid) on a 3200-node regular ring.]

We do not report the communication overhead of either HyFN or our baseline: the protocols are all designed to initiate one single push-pull exchange in each round, and therefore present the same communication costs.

In the following we first evaluate HyFN on our default scenario (3200 nodes on a regular ring, \(k=14\), \(r=3\), \(\alpha =\beta =0.5\), with k and r set to the smallest values still providing functional results) and compare it against our baseline. We then analyze the impact of the mixing parameters \(\alpha \) and \(\beta \). Finally, we study the scalability of HyFN up to networks of 12800 nodes, on both ring and grid topologies. All reported values are averages computed over 25 experimental runs.

4.2 Results

Figure 3 shows the convergence of HyFN in our default scenario (3200 nodes on a regular ring), according to three convergence metrics: the percentage of converged nodes (Fig. 3a), the number of missing links (Fig. 3b), and the average KFN similarity (normalized to 1, Fig. 3c). Three variants of HyFN are shown, corresponding to the three heuristics presented in Algorithms 3 (Far-from-Close), 4 (Close-to-Far), and 6 (Hybrid), discussed in Sect. 3.2.

Comparison to the Far-from-Far Baseline. On all three convergence metrics, the three versions of HyFN clearly outperform the baseline. More precisely, all HyFN variants reach 80% of converged nodes after at most 20 rounds, whereas the baseline is unable to converge even after 65 rounds (Fig. 3a). Interestingly, the hybrid variant performs best in terms of overall convergence. On the average similarity metric (Fig. 3c), the baseline performs worst, even though it achieves decent results in a reasonable time: it does not find the farthest neighbors, but it does find far ones. The missing-links metric (Fig. 3b) shows clearly that the baseline does not work: it only converges linearly, solely thanks to the few random neighbors fetched in each round. Finally, among all HyFN variants, the Hybrid approach converges closest to the theoretically ideal network, at the price of being slightly slower than Close-to-Far.

[Fig. 4. Impact of the \(\alpha \) stochastic parameter on a 3200-node regular ring.]

[Fig. 5. Impact of the \(\beta \) stochastic parameter on a 3200-node regular ring.]

Influence of the parameters \(\varvec{\alpha }\) and \(\varvec{\beta }\). Our key aim is to evaluate the effective impact of the stochastic parameters \(\alpha \) and \(\beta \) on the KFN graph, and to set them accordingly. Figure 4 outlines the impact of the \(\alpha \) parameter, and shows that \(\alpha =0.5\) is close to optimal. This value provides (i) the best convergence time (Fig. 4a), and (ii) the best tradeoff between convergence speed and neighborhood quality (Fig. 4b). Concerning the impact of fine-tuning \(\beta \) (Fig. 5), a \(\beta \) close to 0.2 gives the best network convergence and convergence speed. Note that we are unable to reach 100\(\%\) of converged nodes with a \(\beta \) value of either 0 or 1. A non-hybrid heuristic is therefore not the most suitable choice, although such heuristics still perform better than the baseline. Furthermore, as soon as the hybrid strategy is used, the exact value of \(0<\beta <1\) has little impact on the convergence time.

Consequently, fine-tuning \(\alpha \) matters more than fine-tuning \(\beta \). In other words, once \(\alpha \) is set to its best value (i.e. 0.5), the value of \(\beta \) has little impact as long as \(0<\beta <1\), that is, as long as we actually use a hybrid approach.

[Fig. 6. Behavior of HyFN with the hybrid heuristic for networks from s = 100 to s = 12800 nodes, on ring and grid topologies.]

Scalability. We have investigated the applicability of the hybrid heuristic on both ring and grid logical networks of sizes varying from 100 to 12800 nodes (Fig. 6). The values of k and r in the default 3200-node configuration were the smallest possible while still providing good performance, and these parameters are known to evolve logarithmically with the size s of the network. For every configuration, we therefore set \(k = 1.2\times \log _2(s)\) and \(r = 0.3\times \log _2(s)\), both rounded to the nearest integer, so as to recover \(k = 14\) and \(r = 3\) for \(s = 3200\). Figure 6 shows that HyFN converges in logarithmic time relative to the total network size, demonstrating that our approach scales well.
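
For concreteness, this parameter scaling can be transcribed directly (Python; round performs the nearest-integer rounding mentioned above):

import math

def scaled_parameters(s):
    # View size k and number of random peers r as a function of the
    # network size s, rounded to the nearest integer.
    k = round(1.2 * math.log2(s))
    r = round(0.3 * math.log2(s))
    return k, r

# scaled_parameters(3200) == (14, 3), the default configuration of Sect. 4.1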

5 Related Work

To the best of our knowledge, HyFN is the first decentralized protocol specifically designed to compute a distributed k-furthest-neighbor (KFN) graph.

In terms of related mechanisms, a distributed KFN graph is a form of peer-to-peer network overlay. Peer-to-peer overlays have been widely applied in the past to implement distributed services, ranging from distributed storage [25, 26, 28], and streaming [12, 22], through to pub/sub [2, 24] and environmental sensing [15]. Among peer-to-peer overlays, k-nearest neighbor (KNN) overlays [4, 17, 30] come closest to HyFN, although they converge poorly when applied to the KFN graph construction problem, as our evaluation shows. KNN overlays have been extensively studied in the past, as they provide decentralized self-organization properties which have been exploited to implement a large number of resilient and scalable services, from recommendation systems [4, 14, 30], to collaborative caching [11] and generic topology construction [5, 17].

Epidemic topology construction protocols such as the ones presented in this work are typically highly scalable and efficient, thanks to their inherent concurrency (each node executes the protocol in parallel) and locality (nodes only perform a few interactions per round). These two properties also make these algorithms attractive for high-end parallel machines, and have given rise to several highly effective parallel KNN graph construction algorithms [7, 8, 10].

VM placement (the main technique for data centre optimisation) aims at assigning VMs to PMs in data centres so that some cost function(s) is minimised [3, 27], such as electricity cost, resource (e.g. CPU or memory) wastage, or maintenance cost. The problem is often described as an instance of the general bin packing problem, and most techniques in the literature pack as many VMs as possible onto PMs. In practice, however, piling up VMs may not be such a good idea, as not all resources can be perfectly isolated. This lack of isolation generates contention between VMs hosted on the same PM; for instance, pressure on the cache or on I/O by one VM will impact the other VMs sharing this PM. Most studies in the literature use time series analysis to compare two VMs’ workloads. For instance, Halder et al. [16] propose an interference-aware first-fit-decreasing algorithm using a large correlation matrix, keeping track of the VMs’ time series and their composition in each PM. Verma et al. [29] simplify the time series using a concept of envelope, recording only the peaks of utilisation rather than the full time series. They then cluster similar workloads and make sure they do not end up on the same PMs. Li et al. [21] propose a two-phase clustering that addresses the scalability issues previous approaches suffer from. They also propose a placement algorithm that minimises both the number of required PMs and the amount of interference. Their solution would certainly benefit from the KFN graphs and algorithms proposed in the current paper; we are working on an adaptation to an industry setting, with large hosting departments running complex workloads.

6 Conclusion

In this paper, we propose HyFN, a novel and generic decentralized protocol to compute k-furthest-neighbor (KFN) graphs. HyFN exploits an intermediate k-nearest-neighbor (KNN) graph, which is constructed in parallel, to progressively converge towards an optimal solution. We have in particular proposed three heuristics to exploit this KNN graph. Our evaluation shows that our proposal clearly outperforms a naive greedy implementation based on existing KNN epidemic protocols.

Beyond its application to decentralized and peer-to-peer systems, we believe our KFN construction framework holds strong potential for the computation of KFN graphs on highly parallel machines. Its inherent locality and high concurrency are likely to make it a worthwhile approach whenever a KFN graph is required, including for resource allocation problems such as those encountered in VM allocation services.