# A Million Edge Drawing for a Fistful of Dollars

## Abstract

In this paper we study the problem of designing a graph drawing algorithm for large graphs. The algorithm must be simple to implement and the computing infrastructure must not require major hardware or software investments. We report on the experimental analysis of a simple implementation of a spring embedder in Giraph, a vertex-centric open source framework for distributed computing. The algorithm is tested on real graphs of up to 1 million edges by using a cheap PaaS (Platform as a Service) infrastructure of Amazon. We can afford drawing graphs with about one million edges in about 8 min, by spending less than 1 USD per drawing for the cloud computing infrastructure.

## Keywords

Cloud Computing, Graphical Processing Unit, Large Graph, Cloud Computing Service, Real Graph

## 1 Introduction

Classical force-directed algorithms, like *spring embedders*, are by far the most popular graph drawing techniques (see, e.g., [4, 10]). One of the key components of this success is the simplicity of their implementation and the effectiveness of the resulting drawings. Spring embedders make the final user only a few lines of code away from an effective layout of a network. They model the graph as a physical system, where vertices are equally-charged electrical particles that repel each other and edges are modeled as springs that give rise to attractive forces. Computing a drawing corresponds to finding an equilibrium state of the force system by a simple iterative approach (see, e.g., [5, 6]).

The main drawback of spring embedders is that they are relatively expensive in terms of computational resources, which gives rise to scalability problems even for graphs with a few thousand vertices. To overcome this limit, sophisticated variants of force-directed algorithms have been proposed; they include *hierarchical space partitioning*, *multidimensional scaling* techniques, *multi-scale* techniques, and *stress-majorization* approaches (see, e.g., [1, 8, 10] for a survey). Also, both centralized and parallel multi-scale force-directed algorithms that use the power of graphical processing units (GPU) have been described [7, 9, 14, 18]. They scale to graphs with a few million edges, but their implementation is not easy and the required infrastructure could be expensive in terms of hardware and maintenance.

A few works concentrate on designing relatively simple parallel implementations of classical spring embedders. Mueller *et al.* [13] and Chae *et al.* [2] propose a graph visualization algorithm that uses multiple large displays. Vertices are evenly distributed on the different displays, each associated with a different processor, which is responsible for computing the positions of its vertices; scalability experiments are limited to graphs with a few thousand vertices. Tikhonova and Ma [15] present a parallel force-directed algorithm that scales well to graphs with a few hundred thousand edges. It is important to remark that all the above algorithms are mainly *parallel*, rather than *distributed*, force-directed techniques. Their basic idea is to partition the set of vertices among the processors and to keep data locality as much as possible throughout the computation.

Motivated by the increasing availability of scalable cloud computing services, we study the problem of adapting a simple spring embedder to a distributed architecture. We want to use such an algorithm on a cheap PaaS (Platform as a Service) infrastructure to compute drawings of graphs with a million edges. We design, engineer, and experiment with a distributed Fruchterman-Reingold (FR) spring embedder [6] in the open source *Giraph* framework [3], on a small Amazon cluster of 10 computers, each equipped with 4 vCPUs (http://aws.amazon.com/en/elasticmapreduce/). Giraph is a popular computing framework for distributed graph algorithms, used for instance by Facebook to analyze social networks (http://giraph.apache.org/).

Our distributed algorithm is based on the “*Think-Like-A-Vertex (TLAV)*” design paradigm [12] and its performance is experimentally tested on a set of real-world graphs whose size varies from tens of thousands to one million edges. The experiments measure not only the execution time and the visual complexity, but also the cost in dollars on the Amazon PaaS infrastructure. For example, computing a drawing of a graph of our test suite with 1,049,866 edges required about 8 min, which corresponds to less than 1 USD paid to Amazon. For comparison, the parallel algorithm described by Tikhonova and Ma [15] needs about 40 min for a graph of 260,385 edges, on 32 processors of the PSC’s BigBen Cray XT3 cluster.

The remainder of the paper is organized as follows. Section 2 describes the algorithmic pipeline of our distributed spring embedder. The experimental analysis is the subject of Sect. 3. Conclusions and future work can be found in Sect. 4.

## 2 A Vertex Centric Spring Embedder

The vertex-centric programming model is a paradigm for distributed processing frameworks to address computing challenges on large graphs. The main idea behind this model is to “Think-Like-A-Vertex” (TLAV), which translates to implementing distributed algorithms from the perspective of a vertex rather than the whole graph. Such an approach improves locality, demonstrates linear scalability, and provides a natural way to express and compute many iterative graph algorithms [12]. TLAV approaches overcome the limits of other popular distributed paradigms like MapReduce, which are often poor-performing for iterative graph algorithms.

The first published implementation of a TLAV framework was Google’s Pregel [11], based on the Bulk Synchronous Parallel (BSP) model [16]. *Giraph* is a popular open-source TLAV framework built on the Apache Hadoop infrastructure [3]. In Giraph, the computation is split into *supersteps* executed iteratively and synchronously. A superstep consists of two processing steps: (*i*) a vertex executes a user-defined vertex function based on both local vertex data and data coming from adjacent vertices; (*ii*) the results of such local computation are sent to the vertex neighbors along its incident edges. Supersteps always end with a synchronization barrier which guarantees that messages sent in a given superstep are received at the beginning of the next superstep. The whole computation ends after a fixed number of supersteps has occurred or when all vertices are inactive (i.e., no message has been sent).
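The superstep model described above can be illustrated with a toy, single-process simulation. This is only a sketch of the synchronous message-passing idea, not the Giraph API (Giraph itself is a Java framework); the names `run_supersteps` and `max_value` are hypothetical, and the example vertex function (propagating the largest vertex identifier) is chosen purely for illustration.

```python
from collections import defaultdict


def run_supersteps(adjacency, compute, max_supersteps):
    """Toy simulation of synchronous supersteps (not the Giraph API).

    adjacency: dict mapping each vertex to a list of its neighbors.
    compute(vertex, state, messages, superstep): user-defined vertex
    function; returns a value to send to all neighbors, or None.
    """
    state = {v: {} for v in adjacency}
    inbox = defaultdict(list)
    for superstep in range(max_supersteps):
        outbox = defaultdict(list)
        for v in adjacency:
            # Step (i): local computation on vertex data and received messages.
            out = compute(v, state[v], inbox[v], superstep)
            # Step (ii): send the result along the incident edges.
            if out is not None:
                for u in adjacency[v]:
                    outbox[u].append(out)
        # Synchronization barrier: messages become visible in the next superstep.
        inbox = outbox
        if not outbox:  # no message sent: all vertices are inactive
            break
    return state


def max_value(v, state, messages, superstep):
    """Example vertex function: keep the largest identifier seen so far."""
    old = state.get("value")
    new = max([v if old is None else old] + messages)
    state["value"] = new
    return new if new != old else None


if __name__ == "__main__":
    path = {1: [2], 2: [1, 3], 3: [2]}
    print(run_supersteps(path, max_value, 10))
```

The computation stops either when the superstep budget is exhausted or when a superstep produces no messages, mirroring the two termination conditions stated above.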

**Pruning:** For the sake of efficiency, we first remove all vertices of degree one from the graph, which will be reinserted at the end of the computation by means of an ad-hoc technique. This operation can be directly performed while loading the graph. The number of degree-one vertices adjacent to a vertex *v* is stored as local information of *v*, to be used throughout the computation.
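A minimal sketch of this pruning step is given below, assuming a simple undirected graph stored as an adjacency dictionary and a single pruning pass (vertices that become degree-one after the removal are kept). The function name `prune_degree_one` and the data layout are assumptions made for illustration.

```python
def prune_degree_one(adjacency):
    """Remove degree-one vertices in a single pass.

    adjacency: dict mapping each vertex to a list of its neighbors.
    Returns (pruned adjacency, number of removed degree-one neighbors
    per surviving vertex).
    """
    leaves = {v for v, nbrs in adjacency.items() if len(nbrs) == 1}
    pruned = {}
    leaf_count = {}
    for v, nbrs in adjacency.items():
        if v in leaves:
            continue
        pruned[v] = [u for u in nbrs if u not in leaves]
        # Stored locally at v, to be used later for the reinsertion.
        leaf_count[v] = sum(1 for u in nbrs if u in leaves)
    return pruned, leaf_count


if __name__ == "__main__":
    # Triangle 0-4-5 with three degree-one vertices attached to vertex 0.
    graph = {0: [1, 2, 3, 4, 5], 1: [0], 2: [0], 3: [0], 4: [0, 5], 5: [0, 4]}
    print(prune_degree_one(graph))
```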

**Partitioning:** We then partition the vertex set into subsets, each assigned to a computing unit, also called *worker* in Giraph; in the distributed architecture, each computer may have more than one worker. The default partitioning algorithm provided by Giraph aims at making the partition sets of uniform size, but it does not take into account the graph topology (it is based on applying a hash function). As a result, a default Giraph partition may have a very high number of edges that connect vertices of different partition sets; this would negatively affect the communication load between different computing units. To cope with this problem, we used a partitioning algorithm by Vaquero *et al.*, called Spinner [17], which creates balanced partition sets by exploiting the graph topology. It is based on iterative vertex migrations, relying only on local information.
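The effect of ignoring the topology can be seen on a small example. The sketch below does not implement Spinner or Giraph's partitioner; it only counts the edges cut by a hash-style assignment (here simply the integer identifier modulo the number of workers) and by a hand-made topology-aware assignment. The function names and the example graph are assumptions.

```python
def hash_partition(vertices, num_workers):
    """Topology-oblivious assignment: integer vertex id modulo num_workers."""
    return {v: v % num_workers for v in vertices}


def cut_edges(edges, assignment):
    """Number of edges whose endpoints lie in different partition sets."""
    return sum(1 for u, v in edges if assignment[u] != assignment[v])


if __name__ == "__main__":
    # Two triangles (0,1,2) and (3,4,5) joined by the edge (2,3).
    edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
    hashed = hash_partition(range(6), 2)
    by_topology = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
    print(cut_edges(edges, hashed), cut_edges(edges, by_topology))
```

Both assignments are perfectly balanced (three vertices per worker), yet the hash-style one cuts five of the seven edges while the topology-aware one cuts a single edge; each cut edge corresponds to messages that must travel between workers.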

**Layout:** Recall that classic spring embedders split the computation into a sequence of iterations. In each iteration every vertex updates its position based on the positions of all other vertices. The computation ends after a fixed number of iterations has occurred or when the positions of the vertices become sufficiently stable. The design of a distributed spring embedder within the TLAV paradigm must consider the following architectural constraints: (*i*) each vertex can exchange messages only with its neighbors, (*ii*) each vertex can locally store only a small amount of data, and (*iii*) the communication load, i.e., the total number and length of the messages sent at a particular superstep, should not be too high, for example linear in the number of edges of the graph. These three constraints together do not allow for simple strategies to make a vertex aware of the positions of all other vertices in the graph, and hence a distributed spring embedding approach must use some locality principle. We exploit the experimental evidence that, in a drawing computed by a spring embedder, (*a*) the graph theoretic distance between two vertices is a good approximation of their geometric distance, and (*b*) the repulsive forces between a vertex *u* and a vertex *v* tend to be less influential as the distance between *u* and *v* increases (see, e.g., [10]). Hence, we find it reasonable to adopt a locality principle where the force acting on each vertex *v* only depends on its *k-neighborhood* \(N_v(k)\), i.e., the set of vertices whose graph theoretic distance from *v* is at most *k*, where *k* is a suitably defined constant. The attractive and repulsive forces acting on a vertex are defined according to the FR spring embedder model [6]. In our distributed implementation, each drawing iteration consists of a sequence of Giraph supersteps.
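The locality principle can be sketched in a centralized form as follows. The code computes \(N_v(k)\) by a bounded breadth-first search and then the net force on *v* using the standard FR magnitudes (attraction \(d^2/\ell\) along edges, repulsion \(\ell^2/d\) between vertex pairs). The symbol `ideal_length` (\(\ell\)) is named this way to avoid a clash with the neighborhood radius *k*; the function names, the exclusion of *v* itself from the returned neighborhood, and the handling of coincident points are assumptions, not details taken from the implementation described here.

```python
import math
from collections import deque


def k_neighborhood(adjacency, v, k):
    """Vertices at graph distance between 1 and k from v (v excluded)."""
    dist = {v: 0}
    queue = deque([v])
    while queue:
        w = queue.popleft()
        if dist[w] == k:
            continue  # do not expand beyond distance k
        for u in adjacency[w]:
            if u not in dist:
                dist[u] = dist[w] + 1
                queue.append(u)
    del dist[v]
    return set(dist)


def fr_force(v, positions, neighbors, nearby, ideal_length):
    """Net FR force on v: repulsion from `nearby`, attraction to `neighbors`."""
    x, y = positions[v]
    fx = fy = 0.0
    for u in nearby:
        dx, dy = x - positions[u][0], y - positions[u][1]
        d = math.hypot(dx, dy) or 1e-9  # guard against coincident points
        rep = ideal_length ** 2 / d
        fx += dx / d * rep
        fy += dy / d * rep
    for u in neighbors:
        dx, dy = x - positions[u][0], y - positions[u][1]
        d = math.hypot(dx, dy) or 1e-9
        att = d ** 2 / ideal_length
        fx -= dx / d * att
        fy -= dy / d * att
    return fx, fy


if __name__ == "__main__":
    adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
    positions = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (2.0, 0.0), 3: (3.0, 0.0)}
    nearby = k_neighborhood(adjacency, 0, 2)
    print(nearby, fr_force(0, positions, adjacency[0], nearby, 1.0))
```

Two adjacent vertices placed exactly at distance \(\ell\) exert no net force on each other, since attraction and repulsion both equal \(\ell\) in magnitude.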

An iteration works as follows. By means of a controlled flooding technique, every vertex *v* learns the position of each vertex in \(N_v(k)\). In the first superstep, vertex *v* sends a message to its neighbors. The message contains the coordinates of *v*, its unique identifier, and an integer number, called *TTL (Time-To-Live)*, equal to *k*. In the second superstep, *v* processes the received messages and uses them to compute the attractive and repulsive forces with respect to its adjacent vertices. Then, *v* uses a data structure \(H_v\) (a hash set) to store the unique identifiers of its neighbors. The TTL of each received message is decreased by one unit, and the message is broadcast to *v*’s neighbors. In superstep *i* (\(i > 2\)), vertex *v* processes the received messages and, for each message, *v* first checks whether the sender *u* is already present in \(H_v\). If this is not the case, *v* uses the message to compute the repulsive force with respect to *u*, and then *u* is added to \(H_v\). Otherwise, the forces between *u* and *v* had already been computed in some previous superstep. In both cases, the TTL of the message is decreased by one unit, and if the TTL is still greater than zero, the message is broadcast. When no message is sent, the coordinates of each vertex are updated and the iteration ends.
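The flooding scheme of one iteration can be simulated on a single machine as sketched below. The simulation only tracks which positions each vertex learns (the place where a force would be accumulated is marked by a comment). The function name `flood_positions`, the message layout, and the choice to ignore a vertex's own message when it comes back are assumptions made for this sketch.

```python
from collections import defaultdict


def flood_positions(adjacency, positions, k):
    """Simulate the TTL-based flooding of one drawing iteration.

    Returns (known, supersteps): known[v] maps each vertex learned by v
    to its coordinates and plays the role of H_v; supersteps counts the
    supersteps used, including the initial send.
    """
    known = {v: {} for v in adjacency}
    # First superstep: every vertex sends (identifier, coordinates, TTL = k).
    inbox = defaultdict(list)
    for v, nbrs in adjacency.items():
        for u in nbrs:
            inbox[u].append((v, positions[v], k))
    supersteps = 1
    while inbox:
        outbox = defaultdict(list)
        for v, messages in inbox.items():
            for origin, pos, ttl in messages:
                # Assumption: a vertex ignores its own message coming back.
                if origin != v and origin not in known[v]:
                    known[v][origin] = pos  # the force w.r.t. origin goes here
                ttl -= 1
                if ttl > 0:
                    for u in adjacency[v]:
                        outbox[u].append((origin, pos, ttl))
        inbox = outbox  # synchronization barrier
        supersteps += 1
    return known, supersteps


if __name__ == "__main__":
    path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
    coords = {v: (float(v), 0.0) for v in path}
    known, steps = flood_positions(path, coords, 2)
    print({v: sorted(known[v]) for v in path}, steps)
```

A message starting with TTL equal to *k* is received with TTL \(k-d+1\) by vertices at distance *d*, so it reaches exactly the vertices at distance at most *k*. In this sketch, duplicates arriving along different walks are forwarded as well, as in the description above.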

**Reinsertion:** After a drawing of the pruned graph has been computed, we reinsert the degree-one vertices by means of an ad-hoc technique. The general idea is to reinsert the degree-one vertices adjacent to a vertex *v* in a region close to *v*. Namely, each angle around *v* formed by two consecutive edges will host a number of vertices that is proportional to its extent. To reduce the crossings, the edges incident to the reinserted vertices are assigned a length of one tenth of the ideal spring length. We found experimentally that this solution gives good results on graphs with many degree-one vertices.
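One possible way to realize this idea is sketched below. The text does not specify how fractional shares are rounded or how vertices are spread inside an angle, so the sketch assumes a largest-remainder rounding and an even angular spacing; the function names `allocate_leaves` and `place_leaves` are hypothetical, and the edge angles are assumed to be distinct.

```python
import math


def allocate_leaves(edge_angles, num_leaves):
    """Split num_leaves among the angular gaps between consecutive edges.

    Returns a list of (start angle, extent, count), with counts
    proportional to the extents (largest-remainder rounding, an assumption).
    """
    full = 2 * math.pi
    if not edge_angles:
        return [(0.0, full, num_leaves)]
    angles = sorted(a % full for a in edge_angles)
    gaps = []
    for i, start in enumerate(angles):
        end = angles[(i + 1) % len(angles)]
        extent = full if len(angles) == 1 else (end - start) % full
        gaps.append((start, extent))
    shares = [num_leaves * extent / full for _, extent in gaps]
    counts = [int(s) for s in shares]
    order = sorted(range(len(gaps)), key=lambda i: shares[i] - counts[i],
                   reverse=True)
    for i in order[: num_leaves - sum(counts)]:
        counts[i] += 1
    return [(start, extent, c) for (start, extent), c in zip(gaps, counts)]


def place_leaves(center, edge_angles, num_leaves, ideal_length):
    """Coordinates for the reinserted vertices around `center`."""
    cx, cy = center
    radius = ideal_length / 10  # one tenth of the ideal spring length
    points = []
    for start, extent, count in allocate_leaves(edge_angles, num_leaves):
        for j in range(1, count + 1):
            theta = start + extent * j / (count + 1)  # strictly inside the gap
            points.append((cx + radius * math.cos(theta),
                           cy + radius * math.sin(theta)))
    return points


if __name__ == "__main__":
    # Two edges at 0 and 90 degrees: the 270-degree angle hosts 3 of 4 leaves.
    print(allocate_leaves([0.0, math.pi / 2], 4))
    print(place_leaves((0.0, 0.0), [0.0, math.pi / 2], 4, 10.0))
```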

**Table 1.** Benchmark of real-world complex networks and results of our experiments.

| Network | \(\lvert V\rvert\) | \(\lvert E\rvert\) | \(\delta\) | Clint \(k=2\): Time [sec.] | Clint \(k=2\): Cost [USD] | Clint \(k=2\): CR (\(10^6\)) | Clint \(k=3\): Time [sec.] | Clint \(k=3\): Cost [USD] | Clint \(k=3\): CR (\(10^6\)) | FR: Time [sec.] | FR: CR (\(10^6\)) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| add32 | 4,960 | 9,462 | 28 | 30.3 | 0.04 | 0.26 | 40.2 | 0.06 | 0.24 | 6.9 | 0.1 |
| ca-GrQc | 5,242 | 14,496 | 17 | 112.6 | 0.16 | 2.08 | 128.9 | 0.18 | 1.2 | 4.9 | 1.85 |
| grund | 15,575 | 17,427 | 15 | 36.1 | 0.05 | 0.34 | 46.3 | 0.07 | 0.19 | 71 | 0.35 |
| pGp-giantcompo | 10,680 | 48,632 | 17 | 35.1 | 0.05 | 3.11 | 72.4 | 0.1 | 2.02 | 32.2 | 1.8 |
| p2p-Gnutella04 | 10,876 | 39,994 | 9 | 39.2 | 0.06 | 73.8 | 122 | 0.17 | 59.5 | 40 | 12.3 |
| ca-CondMat | 23,133 | 93,497 | 14 | 179.1 | 0.25 | 146.8 | 525.4 | 0.74 | 100.2 | 59 | 77.9 |
| p2p-Gnutella31 | 62,586 | 147,892 | 11 | 58.5 | 0.08 | 694.4 | 323.5 | 0.46 | 545.4 | - | - |
| amazon0302 | 262,111 | 899,792 | 32 | 203.2 | 0.29 | 5,267.4 | 1,228.7 | 1.74 | 4,213.9 | - | - |
| com-amazon | 334,863 | 925,872 | 44 | 278.9 | 0.39 | 3,314.6 | 946.8 | 1.34 | 3,130.3 | - | - |
| com-DBLP | 317,080 | 1,049,866 | 21 | 508.5 | 0.72 | 11,978.7 | - | - | - | - | - |

We conclude this section with the analysis of the time complexity of Clint. Let *G* be a graph with *n* vertices and maximum vertex degree \(\varDelta \). Recall that *k* is the integer value used to initialize the TTL of each message. Then the local function computed by each vertex costs \(O(\varDelta ^k)\), since each vertex needs to process (in constant time) one message for each of its neighbors at distance at most *k*, which are \(O(\varDelta ^k)\). Moreover, let *c* be the number of computing units. Assuming that each of them handles (approximately) *n* / *c* vertices, we have that each superstep costs \(O(\varDelta ^k)\frac{n}{c}\). Let *s* be the maximum number of supersteps that Clint performs (if no equilibrium is reached before), then the time complexity is \(O(\varDelta ^k)s\frac{n}{c}\). If we assume that *c* and *s* are two constants in the size of the graph, then we have \(O(\varDelta ^k)n\), which, in the worst case, corresponds to \(O(n^{k+1})\).
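The chain of bounds in this analysis can be written out step by step. The labels \(T_{\text{vertex}}\), \(T_{\text{superstep}}\), and \(T_{\text{total}}\) are introduced here only for readability; the last step uses the fact that in a simple graph \(\varDelta \le n-1\).

```latex
\begin{align*}
T_{\text{vertex}}    &= O(\varDelta^{k}), \\
T_{\text{superstep}} &= O(\varDelta^{k}) \cdot \frac{n}{c}, \\
T_{\text{total}}     &= O(\varDelta^{k}) \cdot s \cdot \frac{n}{c}
                      = O(\varDelta^{k}\, n)
                      \quad \text{for constant } c \text{ and } s, \\
\varDelta \le n-1    &\;\Longrightarrow\; O(\varDelta^{k}\, n) \subseteq O(n^{k+1}).
\end{align*}
```

For sparse real-world networks \(\varDelta\) is typically far smaller than *n*, so the worst-case bound is pessimistic; the cost of the flooding is governed by the actual size of the *k*-neighborhoods.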

## 3 Experiments

We experimentally studied the performance of Clint. We took into account two main experimental hypotheses:

**H1.** For small values of *k* (\(k \le 2\)), Clint can draw graphs with up to one million edges in a reasonable time, on a cloud computing platform whose cost per hour is low. This hypothesis is motivated by the fact that, for a small *k*, the amount of data stored at each vertex should be relatively small and the message traffic load should be limited.

**H2.** When the diameter of the graph is not too high, small increases of *k* may give rise to relatively high improvements of the drawing quality. Nevertheless, small diameters may cause a dramatic increase of the running time even for small changes of *k*, because the data stored at each vertex might significantly grow.

To test our hypotheses, we performed the experiments on a benchmark of 10 real graphs with up to 1 million edges, taken from the Sparse Matrix Collection of the University of Florida (http://www.cise.ufl.edu/research/sparse/matrices/) and from the Stanford Large Networks Dataset Collection (http://snap.stanford.edu/data/index.html). Previous experiments on the subject use a comparable number of real graphs (see, e.g., [15]). On each graph, we ran Clint with \(k \in \{2,3\}\). Every computation ended after at most 100 iterations (corresponding to a few hundred Giraph supersteps). The experiments were executed on the Amazon EC2 infrastructure, using a cluster of 10 memory-optimized instances (R3.xlarge) with 4 vCPUs and 30.5 GiB RAM each. The cost per hour to use this infrastructure is about 5 USD. Table 1 reports some experimental data. Each row refers to a different network, with the networks ordered according to increasing number of edges. The columns report the number |*V*| of vertices and |*E*| of edges, the network diameter \(\delta \), the running time of Clint, the Amazon cost for the drawing, and the number of crossings in the drawing.
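The relation between running time and cost can be checked with a short calculation. The sketch assumes that the cost is pro-rated linearly in the running time and uses the approximate rate of 5 USD per hour stated above; since that rate is only approximate, the results can differ slightly from the values in Table 1. The function name `drawing_cost_usd` is hypothetical.

```python
def drawing_cost_usd(seconds, hourly_rate_usd=5.0):
    """Cost of one drawing, assuming linear pro-rating of the hourly rate."""
    return seconds / 3600.0 * hourly_rate_usd


if __name__ == "__main__":
    # Running times (in seconds) for k = 2, taken from Table 1.
    for name, seconds in [("add32", 30.3), ("com-DBLP", 508.5)]:
        print(name, round(drawing_cost_usd(seconds), 2))
```

With this rate, the 508.5 seconds needed for com-DBLP correspond to roughly 0.71 USD, in line with the claim of less than 1 USD per drawing.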

Concerning **H1**, the data suggest that this hypothesis is not disproved. The computation of the biggest network of our test suite, consisting of more than one million edges, took about 8 min with \(k=2\). Most of this time was required for sending messages among the different Giraph workers. The cloud computing infrastructure cost of the computation is less than 1 USD. On graph com-DBLP, the computation for \(k=3\) failed due to a lack of storage resources, which means that more than 10 workers are necessary in this case. On the other hand, for the 4 smallest networks of the test suite we were able to compute the layout up to \(k=5\). The running time on ca-GrQc and ca-CondMat is higher than that spent on other graphs of similar size; more than \(70\,\%\) of this time was needed to compute the (many) connected components of these graphs.

Concerning **H2**, we report the quality of the drawings in terms of number of edge crossings. The improvement passing from \(k=2\) to \(k=3\) varies from \(6\,\%\) (on com-amazon) to \(44\,\%\) (on grund). As expected, the improvement is usually higher on networks with relatively small diameter. Also the increase of the running time, going from \(k=2\) to \(k=3\), is usually more severe for small diameters. For example, graph p2p-Gnutella04 has half the size of graph ca-CondMat, and its diameter is also much smaller; nevertheless, the increase of running time passing from \(k=2\) to \(k=3\) on p2p-Gnutella04 is higher (\(211\,\%\)) than on ca-CondMat (\(193\,\%\)). Again, the increase of time on graph amazon0302 (whose diameter is 32) is about twice that on graph com-amazon (whose diameter is 44), although the latter is bigger than the former. Hence, also hypothesis **H2** is not disproved.
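The percentages quoted in this paragraph follow directly from the entries of Table 1, as the following short check shows. The helper name `pct_change` is hypothetical; the numeric values are copied from the table.

```python
def pct_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


if __name__ == "__main__":
    # Reduction of crossings from k = 2 to k = 3 (CR columns of Table 1).
    print(round(-pct_change(3314.6, 3130.3)))  # com-amazon
    print(round(-pct_change(0.34, 0.19)))      # grund
    # Increase of running time from k = 2 to k = 3 (Time columns).
    print(round(pct_change(39.2, 122.0)))      # p2p-Gnutella04
    print(round(pct_change(179.1, 525.4)))     # ca-CondMat
    print(round(pct_change(203.2, 1228.7)))    # amazon0302
    print(round(pct_change(278.9, 946.8)))     # com-amazon
```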

In addition to the above experiments, we ran a centralized version of the FR algorithm against our benchmark on an Intel i7 3630QM laptop, with 2.4 GHz and 8 GB of RAM. Namely, we ran the optimized FR implementation available in the OGDF library (http://www.ogdf.net/). This algorithm was able to complete the computation for the six smallest graphs of the test suite. The last two columns of Table 1 report the time and the number of crossings of the centralized FR computations. On average, the drawings computed by Clint for \(k=3\) have about 1.8 times the number of crossings of those computed by FR. In some cases, however, Clint performed better than FR (see ca-GrQc and grund) or similarly (see pGp-giantcompo). Regarding the running time, Clint is often slower than the centralized FR, due to the time required by the flooding technique for exchanging messages and to the fixed infrastructure cost of the distributed environment, which is better amortized over the computation of bigger instances.

We also tried to estimate the *strong scalability* of Clint, that is, how the running time varies on a given instance when the number of workers increases. For each graph we also ran Clint with 6 and 8 workers. For the largest graphs and for \(k=2\), passing from 6 to 8 workers improves the running time by about \(20\,\%\), while passing from 8 to 10 workers causes a further decrease of about \(10\,\%\). These percentages increase for \(k=3\). On the smaller graphs, the benefit of using more workers becomes evident for \(k \ge 4\).

## 4 Conclusions and Future Research

We described and experimented with the first TLAV distributed spring embedder. Our results are promising, but more experiments would help to find better trade-offs between values of *k*, running time, drawing quality, and number of workers in the PaaS. Future work includes: (*a*) Developing TLAV versions of multi-scale force-directed algorithms, able to compute drawings of graphs with several million edges on a common cloud computing service; this would improve running times and drawing quality. (*b*) Designing a vertex-centric distributed service to interact with the visualizations of very large graphs; a TLAV drawing algorithm should be one of the core components of such a service.

## References

- 1. Brandes, U., Pich, C.: Eigensolver methods for progressive multidimensional scaling of large data. In: Kaufmann, M., Wagner, D. (eds.) GD 2006. LNCS, vol. 4372, pp. 42–53. Springer, Heidelberg (2007)
- 2. Chae, S., Majumder, A., Gopi, M.: Hd-graphviz: highly distributed graph visualization on tiled displays. In: ICVGIP 2012, pp. 43:1–43:8. ACM (2012)
- 3. Ching, A.: Giraph: large-scale graph processing infrastructure on Hadoop. In: Hadoop Summit (2011)
- 4. Di Battista, G., Eades, P., Tamassia, R., Tollis, I.G.: Graph Drawing. Prentice Hall, Upper Saddle River, NJ (1999)
- 5. Eades, P.: A heuristic for graph drawing. Congr. Numerant. **42**, 149–160 (1984)
- 6. Fruchterman, T.M.J., Reingold, E.M.: Graph drawing by force-directed placement. Software, Practice and Experience **21**(11), 1129–1164 (1991)
- 7. Godiyal, A., Hoberock, J., Garland, M., Hart, J.C.: Rapid multipole graph drawing on the GPU. In: Tollis, I.G., Patrignani, M. (eds.) GD 2008. LNCS, vol. 5417, pp. 90–101. Springer, Heidelberg (2009)
- 8. Hachul, S., Jünger, M.: Drawing large graphs with a potential-field-based multilevel algorithm. In: Pach, J. (ed.) GD 2004. LNCS, vol. 3383, pp. 285–295. Springer, Heidelberg (2005)
- 9. Ingram, S., Munzner, T., Olano, M.: Glimmer: multilevel MDS on the GPU. IEEE Trans. Vis. Comput. Graph. **15**(2), 249–261 (2009)
- 10. Kobourov, S.G.: Force-directed drawing algorithms. In: Tamassia, R. (ed.) Handbook of Graph Drawing and Visualization, pp. 383–408. CRC Press, Boca Raton (2013)
- 11. Malewicz, G., Austern, M.H., Bik, A.J., Dehnert, J.C., Horn, I., Leiser, N., Czajkowski, G.: Pregel: a system for large-scale graph processing. In: SIGMOD 2010, pp. 135–146. ACM (2010)
- 12. McCune, R.R., Weninger, T., Madey, G.: Thinking like a vertex: a survey of vertex-centric frameworks for large-scale distributed graph processing. ACM Comput. Surv. **1**(1), 1–35 (2015)
- 13. Mueller, C., Gregor, D., Lumsdaine, A.: Distributed force-directed graph layout and visualization. In: EGPGV 2006, pp. 83–90. Eurographics (2006)
- 14. Sharma, P., Khurana, U., Shneiderman, B., Scharrenbroich, M., Locke, J.: Speeding up network layout and centrality measures for social computing goals. In: Salerno, J., Yang, S.J., Nau, D., Chai, S.-K. (eds.) SBP 2011. LNCS, vol. 6589, pp. 244–251. Springer, Heidelberg (2011)
- 15. Tikhonova, A., Ma, K.: A scalable parallel force-directed graph layout algorithm. In: EGPGV 2008, pp. 25–32. Eurographics (2008)
- 16. Valiant, L.G.: A bridging model for parallel computation. Commun. ACM **33**(8), 103–111 (1990)
- 17. Vaquero, L.M., Cuadrado, F., Logothetis, D., Martella, C.: Adaptive partitioning for large-scale dynamic graphs. In: ICDCS 2014, pp. 144–153. IEEE (2014)
- 18. Yunis, E., Yokota, R., Ahmadia, A.: Scalable force directed graph layout algorithms using fast multipole methods. In: ISPDC 2012, pp. 180–187. IEEE (2012)