Analytics for directed contact networks
Abstract
Directed contact networks (DCNs) are temporal networks that are useful for analyzing and modeling phenomena in transportation, communications, epidemiology and social networking. Specific sequences of contacts can underlie higher-level behaviors such as flows that aggregate contacts based on some notion of semantic and temporal proximity. We describe a simple inhomogeneous Markov model to infer flows and taint bounds associated with such higher-level behaviors, and also discuss how to aggregate contacts within DCNs and/or dynamically cluster their vertices. We provide examples of these constructions in the contexts of information transfers within computer and air transportation networks, thereby indicating how they can be used for data reduction and anomaly detection.
Abbreviations
- DAG
directed acyclic graph (In “Taint bounds” section and “Clustering” section we make extensive use of International Air Transport Association airport codes, for which see https://www.iata.org/services/Pages/codes.aspx.)
- DCN
directed contact network
- FPR
false positive rate
- NPV
negative predictive value
- PPV
positive predictive value (or precision)
- RG
renormalization group
- TPR
true positive rate (or recall)
- UK
United Kingdom
Introduction
Directed contact networks (DCNs) are temporal networks in which edges are directed (Holme 2015; Masuda and Lambiotte 2016). Loosely speaking, temporal networks have time attributes associated with edges while directed contact networks have both time and direction attributes associated with edges. DCNs are a natural temporal generalization of digraphs, and we can think about them informally as collections of time-stamped and directed contacts between and among the entities represented by the nodes. A simple example of a directed contact network is a collection of call data records in which each record includes information about who placed the call, who received the call and the time of the call (Bianchi et al. 2016).
In this paper, we address the problem of how directed contacts can be aggregated and coarsened for purposes such as anomaly detection. To accomplish this, we construct a natural inhomogeneous (that is, time varying) Markov model (Huntsman 2018a) for probabilistic modeling of potential flows that aggregate contacts based on a simple notion of spatiotemporal proximity. This model involves a single parameter, which in practice we set automatically with an intuitive heuristic. Through analytical and practical examples, we illustrate the behavior of this Markov model. We emphasize that this Markov model is not statistical in the sense that it involves no learning, fitting, optimization or other estimation procedure. Instead, it starts from a small number of symmetry and invariance requirements that any model with its goals ought to obey. This follows a tradition in physics by exhibiting a general mathematical structure that is consistent with the required symmetries. Because of this generality, the model applies to a wide range of problems that can be modeled with directed contacts, including call data record analysis, network traffic analysis and disease surveillance.
We also introduce the concept of a “taint bound” that quantifies the impact of weighted contacts: for example, how much of a given information transfer can possibly propagate through the network. Using publicly available data on flight timetables, we demonstrate by analogy how such taint bounds can constrain data exfiltration within a computer system and network. Finally, we also discuss two ways of aggregating and coarsening networks of directed contacts through renormalization and clustering.
It is useful to note that a contact of any type, information or not, involves a source s, a target t, and a time τ associated with the contact. For simplicity, we assume contacts occur at a given instant in time with the resulting notion of a directed contact as an ordered triple (s,t,τ). This still enables considerable generality: for instance, a transfer from s to t over the time interval [τ_{0},τ_{1}] can be represented by two contacts involving a surrogate third node as (s,∗,τ_{0}) and (∗,t,τ_{1}) where ∗ is the surrogate placeholder for (s,t,[τ_{0},τ_{1}]).^{1}
The paper is structured as follows: “Directed contact networks and temporal digraphs” section introduces directed contact networks and temporal digraphs; “Markov chain models for DCNs” section discusses our Markov model, and “Data reduction and anomaly detection” section discusses its performance in data reduction and anomaly detection. We then turn to taint bounds in “Taint bounds” section before discussing renormalization and clustering in directed contact networks in “Renormalization” section and “Clustering” section, respectively. “Remarks” section concludes the paper.
Directed contact networks and temporal digraphs
A particularly useful family of temporal networks are directed contact networks (DCNs) (Holme 2015; Masuda and Lambiotte 2016). DCNs are a natural temporal generalization of digraphs, and we can think about them informally as collections of contacts as introduced in “Introduction” section. However, to avoid certain degenerate cases, we provide a slightly more formal and restrictive notion here.
A DCN with vertices V=[n]:={1,…,n} is a finite nonempty set \(\mathcal {C}\), where each contact \(c \in \mathcal {C}\) corresponds to a unique triple \((s(c),t(c),\tau (c)) \in [n] \times [n] \times \mathbb {R}\) with s(c)≠t(c). As a matter of convenience, we identify contacts with their corresponding triples in this manner.
The first and second sets in the union on the right hand side of (3) are respectively the sets of temporal arcs and spatial arcs. Because \(|V(T(\mathcal {C}))| = \sum _{v} |\mathcal {C}@v| \le 2|V| + 2|\mathcal {C}|\) and \(|A(T(\mathcal {C}))| = |V(T(\mathcal {C}))| - |V| + |\mathcal {C}| \le |V| + 3 |\mathcal {C}|\), it is easy to see that \(T(\mathcal {C})\) can be formed with only linear runtime and memory, though an efficient algorithm requires somewhat more care in practice than is readily apparent.
We also note that DCNs support the natural notion of a temporally coherent path based on a set of contacts of the form {(s_{i},t_{i},τ_{i})|s_{i+1}=t_{i}, τ_{i}≤τ_{i+1}}. Fig. 1 illustrates a temporally coherent path.
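The coherence condition above can be checked mechanically. The following is a minimal sketch in Python, assuming contacts are stored as (s, t, τ) triples; the names `Contact` and `is_temporally_coherent` are hypothetical and chosen only for illustration.

```python
from typing import NamedTuple, Sequence


class Contact(NamedTuple):
    s: int      # source vertex
    t: int      # target vertex
    tau: float  # contact time


def is_temporally_coherent(path: Sequence[Contact]) -> bool:
    """True iff consecutive contacts chain spatially (s_{i+1} = t_i)
    and never go backwards in time (tau_i <= tau_{i+1})."""
    return all(
        nxt.s == cur.t and cur.tau <= nxt.tau
        for cur, nxt in zip(path, path[1:])
    )


# Hypothetical contacts: 1 -> 4 at time 0, then 4 -> 3 at time 2.
good = [Contact(1, 4, 0.0), Contact(4, 3, 2.0)]
# The same contacts in reverse order break both conditions.
bad = [Contact(4, 3, 2.0), Contact(1, 4, 0.0)]
```

Here `is_temporally_coherent(good)` holds while `is_temporally_coherent(bad)` does not; an empty or single-contact sequence is trivially coherent under this definition.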
In words, the temporal digraph of a given DCN can be drawn with the horizontal axis representing time and the vertical axis representing nodes in the original DCN in some ordering. The nodes of the temporal digraph are the start and end points of individual contacts in the DCN at the associated times. As depicted, each vertical arc in the temporal digraph represents a directed contact between two underlying DCN nodes at the specified time, while horizontal arcs connect a DCN node between the times it is involved in a contact. Note that no arcs go backwards in time in this representation, so all paths run from left to right, possibly with vertical arcs.
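The verbal description above can be turned into a short construction. The sketch below is illustrative only: it assumes that each DCN vertex v receives copies at −∞, at +∞, and at every time it takes part in a contact, which is a convention chosen here because it is consistent with the vertex and arc counts quoted earlier. The function name `temporal_digraph` is hypothetical.

```python
import math
from collections import defaultdict


def temporal_digraph(n, contacts):
    """Build the vertices, temporal arcs and spatial arcs of a temporal
    digraph for contacts (s, t, tau) on DCN vertices 1..n.

    Assumption: vertex v has copies at -inf, +inf and at each time at
    which v is the source or target of a contact."""
    times = defaultdict(set)
    for v in range(1, n + 1):
        times[v].update({-math.inf, math.inf})
    for s, t, tau in contacts:
        times[s].add(tau)
        times[t].add(tau)
    vertices = {(v, tau) for v in times for tau in times[v]}
    # Temporal arcs join consecutive copies of the same DCN vertex.
    temporal = set()
    for v, ts in times.items():
        ordered = sorted(ts)
        temporal.update(((v, a), (v, b)) for a, b in zip(ordered, ordered[1:]))
    # Spatial arcs join source and target copies at the contact time.
    spatial = {((s, tau), (t, tau)) for s, t, tau in contacts}
    return vertices, temporal, spatial


# Hypothetical DCN with three vertices and three contacts.
contacts = [(1, 2, 0.0), (2, 3, 1.0), (1, 3, 1.0)]
V, A_temporal, A_spatial = temporal_digraph(3, contacts)
```

A single pass over the contacts followed by a sort per vertex is enough for a sketch; a truly linear-time construction would need contacts pre-sorted by time, in line with the remark that an efficient algorithm requires more care.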
Markov chain models for DCNs
In this section, we show that a useful probabilistic model of temporally coherent paths can easily be constructed from \(T(\mathcal {C})\) alone. The basic idea comes from traffic analysis, where tools such as pen registers or trap and trace devices generate data that enable a user to make substantive inferences about communication sources and paths in networks (Bianchi et al. 2016).
Specifically, consider two contacts of the form (A,B,0) and (B,C,τ). A natural probability model should assign a "flow" from A to C a probability that decreases from 1 to 0 as τ increases from 0 to ∞. That is, if A calls B and then B calls C immediately afterwards, there should be a high expectation that some information from A triggered the call to C; but if much time elapses between the calls, there is a lower expectation that A’s call and the information it communicated triggered B’s call to C.
In practice, we expect that enough flows of interest will involve unusual sources/targets and/or temporally localized contacts to be detected against a background of “bulk traffic” that the model will also effectively characterize. For example, in “Data reduction and anomaly detection” section we show that even sophisticated malicious cyber-activity leading to so-called “low and slow” data exfiltration involves at least some system call-scale directed contacts that can be readily detected through temporally coherent path identification and analysis.
A model of such flows should obey the following symmetry and invariance requirements:
i) probabilities on spatial arcs from the same source and time should be identical;
ii) the model should yield probabilities for flows that coherently compose over arbitrary consecutive time windows that span the same interval;
iii) probabilities on temporal arcs should only depend on their duration and the number of spatial arcs occurring with the same source and initial time;
iv) simultaneous or near simultaneous events’ corresponding probabilities should differ only infinitesimally if at all.
We describe such a physically inspired Markov model of temporally coherent random paths that essentially satisfies these properties. We expect that these properties will be reasonably evident to the mathematically inclined reader.
Define the restriction of a DCN \(\mathcal {C}\) to \(X \subset \mathbb {R}\) to be the subset of contacts with times in X, so that \(\mathcal {C} |_{X} := \tau ^{-1}(\tau (\mathcal {C}) \cap X)\). Next, for \(a_{1} \notin \tau (\mathcal {C})\) and a_{0}<a_{1}, we define the restricted temporal digraph \(T(\mathcal {C})|_{[a_{0},a_{1})}\) from \(T(\mathcal {C}|_{[a_{0},a_{1})})\) by replacing the time component of the vertices (v,−∞) with a_{0} and the time component of the vertices (v,∞) with a_{1}, while retaining all the arcs.
where \(\tau _{(a,m)}^{@v+} := \min (\inf [(a_{m},\infty) \cap \mathcal {C}@v], a_{M})\), the normalizing constants \(Z_{(v,\tau ^{@v}_{j})}\) are such that the rows of \(P_{(\mathcal {C},a,m)}^{(\beta)}\) sum to 1, and d^{+} denotes the outdegree in \(T(\mathcal {C})|_{[a_{m-1},a_{m})}\). The reader should note that terms of the form exp(−βΔτ) can easily underflow numerically if the exponential argument is large and negative so that care must be taken when implementing these formulae in finite precision arithmetic.
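One common way to guard against the underflow mentioned above is to normalize in the log domain. The sketch below is generic and does not reproduce the entries of (4): it only assumes that a row consists of weights of the form exp(−βΔτ) that must be divided by their sum. The function name and the durations are hypothetical.

```python
import math


def normalize_log_weights(log_w):
    """Return probabilities proportional to exp(log_w[j]), computed stably
    by subtracting the largest log-weight before exponentiating."""
    shift = max(log_w)
    unnorm = [math.exp(x - shift) for x in log_w]
    z = sum(unnorm)  # z >= 1 because the largest term equals exp(0) = 1
    return [u / z for u in unnorm]


beta = 1.0
dtau = [800.0, 801.0]              # hypothetical durations
log_w = [-beta * d for d in dtau]  # log of exp(-beta * dtau)
probs = normalize_log_weights(log_w)
```

Evaluated naively, both weights underflow to zero in double precision and the normalization becomes 0/0, whereas the shifted computation only depends on the difference of the exponents.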
The specific formulae of (4) arise from the four identified constraints above rather than from arbitrary choices.
In particular, the requirement \(a \cap \tau (\mathcal {C}) = \varnothing \) ensures that the Markov chain defined by (4) has exactly n absorbing states. Each of these has the form (v,a_{m}) and a corresponding “emitting” state (v,a_{m−1}), and a natural quantity to consider is the probability \(\mathcal {Q}_{(\mathcal {C},a,m)}^{(\beta)}(v,w)\) of arriving at the absorbing state (w,a_{m}) after starting in the emitting state (v,a_{m−1}). We can straightforwardly compute this quantity using the so-called fundamental matrix (Brémaud 1999). Finally, the term \(\tau _{(a,m)}^{@v+}\) mitigates artificial “boundary effect” behavior for m=M.
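For readers unfamiliar with the fundamental matrix, the absorption probabilities of an absorbing chain are the entries of (I−Q)^{−1}R, where Q and R are the transient-to-transient and transient-to-absorbing blocks of the transition matrix. The following is a minimal pure-Python sketch on a hypothetical toy chain; it is not the matrix defined by (4), and the function name is an assumption made for illustration.

```python
def absorption_probabilities(Q, R):
    """Solve (I - Q) B = R by Gauss-Jordan elimination with partial pivoting.

    Q: transient-to-transient block, R: transient-to-absorbing block.
    B[i][j] is the probability of absorption in state j from transient i."""
    k = len(Q)
    # Augmented matrix [I - Q | R].
    A = [[(1.0 if i == j else 0.0) - Q[i][j] for j in range(k)] + list(R[i])
         for i in range(k)]
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        p = A[col][col]
        A[col] = [x / p for x in A[col]]
        for r in range(k):
            if r != col and A[r][col] != 0.0:
                f = A[r][col]
                A[r] = [x - f * y for x, y in zip(A[r], A[col])]
    return [row[k:] for row in A]


# Hypothetical chain: two transient states, two absorbing states.
Q = [[0.0, 0.5],
     [0.0, 0.0]]
R = [[0.5, 0.0],
     [0.25, 0.75]]
B = absorption_probabilities(Q, R)
```

Each row of `B` sums to 1, as expected when absorption is certain. For large sparse systems a dedicated linear algebra library would be the more sensible choice.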
Taken as a whole, (4) thus leads to a natural temporal coherence property and a straightforward physical interpretation. Regarding the symmetries and invariants listed above, the terms equal to 1 and 0 in (4) merely codify i), while ii) is embodied in the following identity which can easily be verified from the construction of the probability transition matrices, \(\mathcal {Q}_{(\mathcal {C},a,m)}^{(\beta)} \).
Lemma 1
Proof
The lemma follows by applying the Chapman-Kolmogorov property to each of the inhomogeneous Markov chains \(\mathcal {Q}_{(\mathcal {C},a',\cdot)}^{(\beta)}\) and \(\mathcal {Q}_{(\mathcal {C},a,\cdot)}^{(\beta)}\). □
That is, for any DCN \(\mathcal {C}\) and parameter β, we have an associated temporally coherent family of time-inhomogeneous Markov chains. This lemma does not rely on the specific form of (4): the exp(−β·Δτ) terms could be significantly changed without breaking property ii) above.^{3}
Moreover, this form is necessary to jointly satisfy properties iii) and iv): i.e., memorylessness and self-consistency in the limit Δτ↓0. (In particular, the self-consistency requirement prohibits multiplying the exponentials by some nontrivial constant.) That is, the form of (4) is dictated by the structure of a temporal digraph along with manifestly desirable symmetries.^{4}
In the limit β→∞, the trajectories of the Markov model are so-called greedy walks (Saramäki and Holme 2015), and more generally for β>0, “spatial” transitions are preferred over “temporal” transitions. Regardless of the value of β, we shall see that the temporal coherence of trajectories is captured more faithfully and conservatively in the Markov model than in series of “projected snapshots” that characterize earlier efforts such as Perra et al. (2012); Starnini et al. (2012); Rocha and Masuda (2014); Grindrod and Higham (2013); Valdano et al. (2015) to analyze DCNs through graph time series and/or provide a substrate for random walks. A recent notable work that develops techniques for special epidemiological models can be found in Valdano et al. (2018).
The preceding lemma facilitates a computational complexity analysis as a function of a. Writing \(N := |\mathcal {C}|\) and M:=|a|, we suppose that a is approximately uniform in the sense that \(\left |\mathcal {C}|_{[a_{m},a_{m+1})} \right | \approx N/M\), which also implies that \(|V(T(\mathcal {C}|_{[a_{m},a_{m+1})}))| \approx 2(n+N/M)\). The complexity of computing \(\mathcal {Q}_{(\mathcal {C},a,m)}^{(\beta)}\) is governed by a matrix division of the form (I−Q)∖R, where here Q is the block of \(P_{(\mathcal {C},a,m)}^{(\beta)}\) whose rows and columns both correspond to transient states, and R is the block whose rows and columns respectively correspond to transient and absorbing states. Since the numbers of transient and absorbing states are respectively approximately n+2N/M and exactly n, the complexity of computing (I−Q)∖R is O(n(n+2N/M)^{ω−1}), where we take matrix multiplication and inversion to have complexity exponent ω>2 (for dense unstructured matrices, in practice ω=3). Because there are M−1 matrix multiplications, the computational complexity for the right hand side of (5) is O(Mn(n+2N/M)^{ω−1}). Now arg min_{M} Mn(n+2N/M)^{ω−1} = 2(ω−2)N/n, and this value for M yields computational complexity which is nominally linear in N. Meanwhile, it only makes sense to take M≪N/n if the complexity of the linear algebra involved is dominated by the rest of the computation. In other words, it is less expensive to invert and multiply many small matrices than to invert and multiply a few large matrices. Since taking M larger yields a more detailed picture of the dynamics of \(\mathcal {C}\), it is sensible to require M to be (at least) on the order of N/n.
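The stated minimizer can be checked by elementary calculus, treating M as a continuous variable and dropping the constant factor n:

$$ \frac{d}{dM}\left[ M \left(n + \frac{2N}{M}\right)^{\omega -1} \right] = \left(n + \frac{2N}{M}\right)^{\omega -2} \left[ n - (\omega -2)\frac{2N}{M} \right]. $$

The bracket on the right is negative for small M and positive for large M, so the unique minimum occurs where it vanishes, that is, at M = 2(ω−2)N/n. Substituting this value gives n + 2N/M = n(ω−1)/(ω−2), and hence

$$ Mn\left(n + \frac{2N}{M}\right)^{\omega -1} = 2 (\omega -2)^{2-\omega } (\omega -1)^{\omega -1} \, n^{\omega -1} N, $$

which is linear in N for fixed n. For ω=3 this reads M = 2N/n with cost 8n^{2}N.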
We exhibit the basic mechanics of the model in the following
Example 1
so that with increasing β (or equivalently, decreasing temperature) the likeliest transitions correspond to temporally coherent paths that greedily traverse spatial arcs. Now consider the digraph D with arcs (s(c),t(c)) for \(c \in \mathcal {C}\) and loops (v,v) for v∈[n]. The adjacency matrix of D has nonzero entries in the same locations as \(\mathcal {Q}_{(\mathcal {C},a^{\prime },1)}^{(\beta)}\), and also in the (2,3) and (2,4) locations. The (2,3) and (2,4) entries correspond to spurious temporally coherent paths in \(\mathcal {C}\). In particular, \(\mathcal {Q}\) gives a more faithful description of \(\mathcal {C}\) than D. □
Although the context is quite different, the closest work to the construction of this section is Ser-Giacomi et al. (2015), which shows that the most probable paths in a Markovian model of a very complicated temporal network (viz., ocean water transport in the Mediterranean) suffice to describe the network’s key features. Other works have looked at higher-order models in discrete time as a way to finesse the challenges of continuous time modeling as discussed here (Lambiotte et al. 2019; Rosvall et al. 2014). Despite the many differences of detail, our own model likewise shows that the most probable paths/flows suffice for capturing the essential dynamics of directed contact networks. In particular, this includes flows that the model assesses as highly probable, but whose associated contact motifs occur infrequently (or perhaps just once in a given data set): in our experiments, such flows reliably capture anomalous and even malicious behavior (see, for example “Data reduction and anomaly detection” section).
Embeddability
It is natural to wonder under what, if any, circumstances the Markov chain \(\mathcal {Q}_{(\mathcal {C},a,m)}^{(\beta)}\) corresponds to a continuous-time Markov process. As it happens, this instance of the Markov embeddability problem (Lencastre et al. 2016) can be answered quite effectively (if not always affirmatively) for most situations of practical interest.
On the other hand, simultaneous contacts are a possible obstruction to embeddability. For n=2, a stochastic matrix is embeddable iff it has positive determinant. But a quick calculation for \(\mathcal {C} := \{(1,2,0), (2,1,0), (1,2,\tau), (2,1,\tau)\}\) and a={−1,τ/2,2τ} shows that \(\det \mathcal {Q}_{(\mathcal {C},a,1)}^{(\beta)} < 0\) for β>0.
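The determinant criterion for n=2 is easy to make concrete. The sketch below is illustrative and uses hypothetical matrices rather than the \(\mathcal {Q}\) of the example above: for a 2×2 stochastic matrix with off-diagonal entries a and b, the determinant is 1−a−b, and when it is positive a generator is available in closed form. The function name `embed_2x2` is an assumption.

```python
import math


def embed_2x2(P):
    """Return a generator G with exp(G) = P for a 2x2 stochastic matrix P,
    or None when det(P) <= 0, in which case no generator exists."""
    a, b = P[0][1], P[1][0]      # off-diagonal transition probabilities
    det = 1.0 - a - b            # determinant of [[1-a, a], [b, 1-b]]
    if det <= 0.0:
        return None
    if a + b == 0.0:             # identity matrix: zero generator
        return [[0.0, 0.0], [0.0, 0.0]]
    scale = -math.log(det) / (a + b)
    return [[-a * scale, a * scale],
            [b * scale, -b * scale]]


G = embed_2x2([[0.9, 0.1], [0.2, 0.8]])             # determinant 0.7
no_generator = embed_2x2([[0.4, 0.6], [0.7, 0.3]])  # determinant -0.3
```

The second matrix has negative determinant, so `no_generator` is `None`; this is the same kind of obstruction that the simultaneous contacts in the example produce.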
Proposition IV.3 of Lencastre et al. (2016) immediately yields a generalization of the preceding observations:
Proposition 1
If \(T(\mathcal {C})\) is acyclic, then \(\mathcal {Q}_{(\mathcal {C},a,m)}^{(\beta)}\) is embeddable. □
Finally, when a Markov generator exists, the algorithm of §V.B of Lencastre et al. (2016) can be used to estimate it.
Data reduction and anomaly detection
In this section, we sketch how the model of “Markov chain models for DCNs” section performs data reduction well enough to be used as a practical anomaly detector. For some additional background details of this analysis, see (Huntsman 2018b).
We considered a DCN \(\mathcal {C}\) formed from N≈3.4·10^{6} kernel-level events spanning a period of four days and derived in turn from data produced by the CADETS tool (Jenkinson et al. 2017). Also, after curation, we obtained a set \(\mathcal {G} \subset \mathcal {C}\) of 216 ground truth contacts that were distributed over 54 malicious exfiltration events.
or both, depending on the semantics of event type. While many event type semantics included a natural and unambiguous (bi)directionality determining the mapping above, some did not, and in that case both contacts were conservatively included.
We obtained flows using the model of “Markov chain models for DCNs” section by setting β to the mean inter-contact time (i.e., the proposed heuristic) and used M=28423 windows of 10 s.
We let \(\hat {\mathcal {I}}(m)\) denote the set of indices corresponding to sources or targets of flows spanning the mth time window that simultaneously fell above a flow probability threshold λ∈[0,1] and below a per-window frequency threshold μ∈[0,1]. The set \(\hat {\mathcal {I}}(m)\) is an estimate of the set \(\mathcal {I}(m)\) of indices corresponding to sources or targets of ground truth events during the mth time window.^{5}
and positive predictive value (or precision) and negative predictive value
From Figs. 3 and 4, we can see that the results were insensitive to the probability threshold λ.^{6} Similarly, a cursory analysis indicated broad insensitivity to the value of β over several orders of magnitude, a fact attributable to information flow probabilities that tended to be either very near or bounded away from 1. This also underlies the insensitivity with respect to the probability threshold λ. For the value μ=10^{−3}, a majority of malicious events were detected with a false positive rate below 2 percent (by either version of the metrics).
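For reference, the four rates used in this section can be computed from sets of flagged and ground truth indices as follows. This is a minimal sketch with the textbook definitions; it does not attempt to reproduce either version of the metrics used for the figures, and the function name and the index sets are hypothetical.

```python
def detection_metrics(predicted, actual, universe):
    """Return (TPR, FPR, PPV, NPV) for sets of flagged (predicted) and
    ground-truth (actual) indices drawn from a common universe."""
    tp = len(predicted & actual)              # flagged and truly anomalous
    fp = len(predicted - actual)              # flagged but benign
    fn = len(actual - predicted)              # anomalous but missed
    tn = len(universe - predicted - actual)   # benign and not flagged

    def ratio(num, den):
        return num / den if den else float("nan")

    return (ratio(tp, tp + fn),   # true positive rate (recall)
            ratio(fp, fp + tn),   # false positive rate
            ratio(tp, tp + fp),   # positive predictive value (precision)
            ratio(tn, tn + fn))   # negative predictive value


# Hypothetical index sets for a single time window.
universe = set(range(10))
actual = {0, 1, 2}
predicted = {1, 2, 3}
tpr, fpr, ppv, npv = detection_metrics(predicted, actual, universe)
```

When ground truth events are rare, as here, NPV is close to 1 almost automatically, so TPR and FPR are usually the more informative pair.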
The results indicate that the Markov model is a sufficiently effective data reduction technique (in particular, the negative predictive value is essentially perfect) to be a useful anomaly detector. In fact, of the 57 (out of 418) files that are targets of high-probability potential information flows in the model, 27 fell below the μ=10^{−3} level and had backtracks (King and Chen 2005) with fewer than 20 (or for that matter, 90) vertices. Of these 27, 6 (in 3 pairs) corresponded to the 3 executables that the malicious attacker wrote to /tmp from its initial foothold.
Taint bounds
The notions of dynamic taint analysis (Schwartz et al. 2010) and provenance (Cheney et al. 2013) inform the context where a DCN models information flow in a computational environment. The analytic problems corresponding to these notions are generically undecidable. With this in mind, we introduce the idea of taint bounds, wherein correct nontrivial bounds on the information flow are maintained.^{7} We formalize this idea here before showing its utility as a practical guide to producing effective data-reducing path abstractions.
where a∧b := min(a,b) in this section. Note that the standard interpretation here is that not only minima but also sums over the empty set yield ∞.
Lemma 2
Proof
By Lemma 3.5 of Flaška et al. (2007), a binary relation ρ is superirreflexive iff the digraph corresponding to ρ is acyclic. Since by assumption ρ is nontrivial and X is finite, there exists some j<∞ such that \(\rho ^{\circ (j+1)} = \varnothing \) and \(\rho ^{\circ j} \ne \varnothing \), where the composition of ρ with itself is indicated. Furthermore, superirreflexivity implies irreflexivity.
Similarly, the recursion implicit in (13) also terminates, though in this case without any additional simplification. The bounds α≤β≤γ follow. □
For a DCN \(\mathcal {D}\) such that τ is injective, a natural choice for ρ is c_{1}ρc ⇔ (t(c_{1})=s(c))∧(τ(c_{1})<τ(c)). Note that this relation is superirreflexive.
Example 2
Consider once more the DCN depicted in Fig. 1 and ρ as defined immediately above. The only nontrivial taint bounds are for the last contact: we have that α(4,3,τ_{4})=(γ(1,4,τ_{1})∧γ(5,4,τ_{2}))∧γ(4,3,τ_{4}) and β(4,3,τ_{4})=(γ(1,4,τ_{1})+γ(5,4,τ_{2}))∧γ(4,3,τ_{4}). If for 3≤j≤J we add to this DCN the contacts (3,4,τ_{2j−1}) and (4,3,τ_{2j}) with τ_{k} strictly increasing, then the result is a DCN with 2J contacts and α(·,·,τ_{k}) and β(·,·,τ_{k}) nonincreasing for k>4. □
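The expressions in Example 2 suggest a simple recursive computation. The sketch below is an assumption inferred from those expressions rather than a transcription of the defining equations: it takes α(c) to be γ(c) ∧ the minimum of α over the ρ-predecessors of c, and β(c) to be γ(c) ∧ the sum of β over the same predecessors, with both minimum and sum over the empty set equal to ∞. The weights γ are hypothetical.

```python
import math
from functools import lru_cache


def taint_bounds(contacts, gamma):
    """Compute (alpha, beta) for each contact (s, t, tau) under the ASSUMED
    recursion alpha(c) = gamma(c) ^ min alpha(c'), beta(c) = gamma(c) ^ sum
    beta(c'), over predecessors c' with t(c') = s(c) and tau(c') < tau(c).
    Minimum and sum over the empty set are both taken to be infinity."""
    def preds(c):
        s, _, tau = c
        return [c1 for c1 in contacts if c1[1] == s and c1[2] < tau]

    @lru_cache(maxsize=None)
    def alpha(c):
        lower = min((alpha(c1) for c1 in preds(c)), default=math.inf)
        return min(gamma[c], lower)

    @lru_cache(maxsize=None)
    def beta(c):
        p = preds(c)
        upper = sum(beta(c1) for c1 in p) if p else math.inf
        return min(gamma[c], upper)

    return {c: (alpha(c), beta(c)) for c in contacts}


# Three contacts as in Example 2, with hypothetical times and weights.
contacts = [(1, 4, 1.0), (5, 4, 2.0), (4, 3, 4.0)]
gamma = {(1, 4, 1.0): 3.0, (5, 4, 2.0): 5.0, (4, 3, 4.0): 10.0}
bounds = taint_bounds(contacts, gamma)
```

With these weights the last contact has α = 3 ∧ 5 ∧ 10 = 3 and β = (3+5) ∧ 10 = 8, while the two source contacts have α = β = γ. The recursion terminates because predecessor times strictly decrease.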
However, in practice τ need not be injective, and for kernel-level information flows it often is not: timestamps of system-level activity in computers or network interfaces are generally precise only to milliseconds or at best microseconds, and obtaining higher precision generally entails a heavy burden or can even be practically infeasible, depending on the detailed context. Indeed, for a distributed system, even the synchronization of clocks can become an issue, and so it is desirable to have a relation ρ that accounts for more structural details. The problem is highlighted by considering directed acyclic graphs (DAGs) rather than DCNs: for a DAG D, the natural choice for ρ is uρv iff u precedes v. However, this does not generalize to arbitrary digraphs, which are the structures that essentially embody multiple contacts occurring at the same time.
This suggests two strategies: live with a fairly generic relation ρ and only seek to compute taint bounds when the digraph corresponding to ρ actually turns out to be acyclic, or build in mechanisms that enforce acyclicity (if these are artificial, we can provide warnings when they have any effect). Only the second of these strategies requires further comment here. One simple approach is to leverage some auxiliary strict order on X; another simple approach is to require a nonzero delay between contacts. In general, context will constrain and inform the construction of ρ.
Example 3
Taint bounds with β<γ for UK flight data in Example 3
Origin | Destination | Seats | Departure | Arrival | β | γ |
---|---|---|---|---|---|---|
CWL | EDI | 74 | 0850 | 1010 | 29 | 74 |
CWL | GLA | 74 | 0850 | 1010 | 29 | 74 |
BRS | BHD | 189 | 0745 | 0855 | 82 | 189 |
BLK | BFS | 142 | 1405 | 1455 | 40 | 142 |
GLO | IOM | 18 | 1025 | 1130 | 8 | 18 |
INV | LTN | 149 | 1115 | 1245 | 142 | 149 |
GLA | EMA | 136 | 0835 | 0945 | 74 | 136 |
Let us consider each of these in turn. On the day in question, one flight arrives at CWL by 0820, and it has 29 seats. Two flights arrive at BRS by 0715, each with 41 seats. Two flights arrive at BLK by 1335, each with 20 seats. One flight arrives at GLO by 0955, and it has 8 seats. Three flights arrive at INV by 1045, with 34, 34, and 74 seats, respectively. Finally, one flight arrives at GLA by 0805, and it has 74 seats. The general pattern amongst these is evident by inspection: there are a few preceding flights inbound to the origin with less total capacity than the current flight. □
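The arithmetic behind the β column can be replayed directly from the inbound seat counts just listed. The sketch below assumes that each inbound flight contributes its full seat count to the sum, so that β is the smaller of the outbound capacity γ and the total inbound capacity; the helper name `beta_bound` is hypothetical.

```python
# (origin, destination, seats gamma, seats of the inbound flights in the text)
rows = [
    ("CWL", "EDI", 74, [29]),
    ("CWL", "GLA", 74, [29]),
    ("BRS", "BHD", 189, [41, 41]),
    ("BLK", "BFS", 142, [20, 20]),
    ("GLO", "IOM", 18, [8]),
    ("INV", "LTN", 149, [34, 34, 74]),
    ("GLA", "EMA", 136, [74]),
]


def beta_bound(seats, inbound):
    """Upper taint bound: outbound capacity capped by total inbound capacity."""
    return min(seats, sum(inbound))


betas = [beta_bound(seats, inbound) for _, _, seats, inbound in rows]
```

The resulting values are 29, 29, 82, 40, 8, 142 and 74, in agreement with the β column of the table.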
The preceding example illustrates by way of analogy how differences between β and γ can serve as a preliminary indicator of anomalously asymmetric flows (which might correspond to, for example, the original dissemination of material and/or unauthorized data exfiltration), particularly at vertices corresponding to sensitive objects or locations.
Renormalization
We want \(0 \le \Delta \le \sum _{j} \delta _{j}\), so that the notional renormalized contact (1,2,τ_{N}−Δ) can replace \(\mathcal {C}\) in a self-consistent way. While the first inequality holds, the second is equivalent to \(\prod _{j} [e^{\delta _{j}}+1] \le \prod _{j} e^{\delta _{j}} + 1\), which is impossible for β>0. That is, our goal of associating \(\mathcal {C}\) with a single renormalized contact is generally impossible.
In light of the preceding considerations, it seems necessary to resort to algorithmic and computational rather than analytical approaches to coarse-graining or renormalizing DCNs. At the same time, it is helpful to introduce some additional context. Stripped bare of its associations with physics, the renormalization group (RG; see, for example, (Barenblatt 2003; Goldenfeld 1992)) is a simple approach to understanding theories in terms of their fixed points. We also note that renormalization ideas have been applied to undirected networks with specific structures (Barrat and Cattuto 2013; Karschau et al. 2018; Newman and Watts 1999).
For a theory determined by a function f(x;θ) of data x and parameters θ, and given a suitable coarsening operator C, if there exists a function g such that f(x;θ)=f(C(x);g(θ)), then the theory is called renormalizable.^{8} In our setting, a probability cutoff takes the role of f; the underlying DCN takes the role of x; the parameter β=θ is computed from the data x according to a fixed heuristic; and the coarsening operator C is realized by the Markov model of “Markov chain models for DCNs” section along with a fixed heuristic for its remaining parameters - for instance, we can fix the number of contact times per window (with an exception provided for the last window). The use of fixed heuristics yields an RG transformation on DCNs that renormalizes probable flows into contacts in a given time window.
Iterating the RG transformation along these lines leads to an "ultraviolet cutoff" at which the process stops, essentially sublimating temporal data into a single weighted digraph. While there is a great deal of freedom in its precise specification, such an RG transformation and fixed point is surely of interest for summarizing complex DCNs.
Reduction of data similar to that described in “Data reduction and anomaly detection” section under RG transformations
RG iteration | Number of contacts |
---|---|
0 | 569480 |
1 | 10726 |
2 | 4687 |
3 (∞) | 184 |
Clustering
The problem of clustering in digraphs is much more delicate than its analogue for the undirected case (Malliaros and Vazirgiannis 2013). It should therefore come as no surprise that the problem of clustering in DCNs is more challenging than either clustering in digraphs or in undirected temporal networks. Indeed, most of the approaches purporting to address clustering in temporal networks in the literature (cf. §4.11 of (Holme 2015), §4.12 of Masuda and Lambiotte (2016) or Speidel et al. (2015)) actually cluster in time series of graphs, not the more granular notion of a DCN.
A sensible step forward is to consider the temporal digraph \(T(\mathcal {D})\) of a DCN \(\mathcal {D}\). As an “almost acyclic” digraph, it might seem natural to try to apply techniques such as those detailed in Malliaros and Vazirgiannis (2013) directly to \(T(\mathcal {D})\). While this would offer the prospect of retaining qualitative temporal structure, it still ignores the quantitative temporal details; furthermore, it is far from evident how to remove any cycles that might (and in practice frequently do) occur. We seek instead a controlled way to coarse-grain this temporal information independent of the approach in “Renormalization” section.
We note that clustering for temporal networks is a topic of much current interest (Bassett et al. 2013; Bazzi et al. 2016; Gauvin et al. 2014; Sarzynska et al. 2015), since the dynamics of communities within social networks and other applications are relevant to current social media and related topics. However, our focus here is on the specific structure of directed contact networks, which to our knowledge has not been specifically studied before.
Clustering techniques leveraging (5)
Any clustering technique ought to directly exploit the probabilistic framework that (5) offers and seek to avoid any additional model features unless they are necessary. This principle discourages–but of course does not completely rule out–techniques that require for example random walks generated by an ergodic transition matrix, which would in turn require incorporating a “teleportation” device à la PageRank. Techniques that require a unique and/or nondegenerate stationary distribution are therefore also discouraged by this principle. Such discouraged techniques include (following the precise enumeration in table 2 of Malliaros and Vazirgiannis (2013)) symmetrization and random walk simulations, LinkRank, directed Laplacians, two-step random walks, message passing, and Infomap. Meanwhile, many other techniques do not exploit (5) at all and should be completely ruled out: for example, network embedding, bipartite modularity, a modified adaptive genetic algorithm, semi-supervised learning, directed modularity, directed Gaussian random network, overlapping modularity, local modularity, cuts, attraction/repulsion, local partitioning, directed clique percolation, local density, mixture models, and community kernels.
The techniques in Malliaros and Vazirgiannis (2013) not discouraged or completely ruled out by the immediately preceding considerations essentially amount to coclustering (or the closely related notion of “blockmodels”). Among coclustering approaches, we single out (Ge et al. 2003; Chakrabarti 2004; Rohe et al. 2016) as holding particular interest. (Ge et al. 2003) focuses on reducing the number of states of a Markov chain estimated directly from a sample trajectory (and is thus not manifestly suitable in our context, where the data is a sequence of contacts rather than a sequence of vertices), while (Chakrabarti 2004; Rohe et al. 2016) explicitly address unweighted digraphs. (Ge et al. 2003; Rohe et al. 2016) cluster singular vectors of a suitable matrix, whereas (Chakrabarti 2004) optimizes a minimum description length criterion for coclustering. Bearing all this in mind, a reasonable strategy would be to look for opportunities to evaluate the singular value decomposition or an information-theoretical compression of a suitable matrix. At the same time, the notion of stochastic equivalence (Holland et al. 1983) leveraged by (Rohe et al. 2016) appears particularly relevant: vertices v and w are stochastically equivalent for (5) iff \(\mathcal {Q}_{v\cdot } = \mathcal {Q}_{w\cdot }\) and \(\mathcal {Q}_{\cdot v} = \mathcal {Q}_{\cdot w}\), where we use a shorthand for the rows and columns of \(\mathcal {Q}\). We shall exploit a very similar notion immediately below.
Any of the preceding metrics (20), (21), or (22) lends itself straightforwardly to various clustering techniques.
Example 4
If d is the total variation or Hellinger distance, then the metric (20) takes values in the unit interval. It is then reasonable to automatically select a cutoff for a hierarchical clustering technique along the same lines as described in “Data reduction and anomaly detection” section, possibly after some rescaling. Here, we will consider total variation distance and use single linkage clustering.
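As a rough sketch of the clustering step just described, the code below computes total variation distances between probability distributions attached to vertices and cuts a single linkage clustering at a fixed cutoff. The distributions, the cutoff value of 0.5, and the function names are assumptions made for illustration; in particular, the pairwise distance matrix here is only a stand-in for the metric (20), and the cutoff is fixed by hand rather than selected automatically.

```python
def total_variation(p, q):
    """Total variation distance between two distributions on the same finite set."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def single_linkage_clusters(dist, cutoff):
    """Cut a single linkage dendrogram at `cutoff`.

    This is equivalent to taking the connected components of the graph that
    joins i and j whenever dist[i][j] <= cutoff (merged here with union-find).
    """
    n = len(dist)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if dist[i][j] <= cutoff:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Hypothetical distributions attached to four vertices (illustrative values only).
rows = [
    [0.5, 0.5, 0.0],
    [0.4, 0.6, 0.0],
    [0.0, 0.1, 0.9],
    [0.0, 0.0, 1.0],
]
dist = [[total_variation(p, q) for q in rows] for p in rows]
clusters = single_linkage_clusters(dist, cutoff=0.5)
print(clusters)  # [[0, 1], [2, 3]]
```

Because total variation distance is bounded by 1, a cutoff in the unit interval has the same meaning across data sets, which is what makes an automatic choice plausible.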
- The transitions from CEG and FZO to FZO arise from a sequence CEG→FZO→CEG→FZO of three chartered flights between these airports, which host Airbus facilities and have no other scheduled passenger flights.
- The transitions from GLO, OXF, and TRE to SOU arise as follows. The first flight departing OXF departs to GLO at 0815, arriving at 0830. The first flight departing GLO not earlier than 0830 departs to IOM at 1025, arriving at 1130. The first flight departing IOM not earlier than 1130 departs to GLA at 1210, arriving at 1300. Meanwhile, the only flight departing TRE arrives at GLA at 1300. Starting from GLA at 1300 and taking the shortest possible (even instantaneous) layovers, we have the sequence of nominally connecting flights $$ \underset{1315}{\text{GLA}} \rightarrow \overset{1425}{\underset{1500}{\text{LTN}}} \rightarrow \overset{1610}{\underset{1610}{\text{GLA}}} \rightarrow \overset{1730}{\underset{1735}{\text{LCY}}} \rightarrow \overset{1910}{\underset{1920}{\text{EDI}}} \rightarrow \overset{2020}{\underset{2020}{\text{MAN}}} \rightarrow \overset{2120}{\text{SOU}} $$
that terminates at SOU when there are no more departing flights for the day.
- The transitions from EMA, HUY, and NWI to ABZ arise as follows. The first flights departing EMA, HUY, and NWI each arrive at ABZ between 0800 and 0810; meanwhile, the first flight departing ABZ not earlier than 0800 departs to MME at 0820. Taking shortest possible layovers as above eventually terminates at ABZ when there are no more departing flights for the day.
The “greedy” connections of the sort above become less likely as β decreases.
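The greedy connections traced above can be reproduced with a short deterministic walk: from the current airport, take the first departure not earlier than the current time, and stop when no departure remains. The sketch below is only an illustration of that rule, not the Markov model itself. Its schedule is restricted to the flights named in the text (the full UK domestic schedule contains many more), times are written as HHMM integers, and the function name is hypothetical.

```python
from bisect import bisect_left

# (origin, departure, destination, arrival), restricted to flights named in the text.
FLIGHTS = [
    ("OXF", 815, "GLO", 830),
    ("GLO", 1025, "IOM", 1130),
    ("IOM", 1210, "GLA", 1300),
    ("GLA", 1315, "LTN", 1425),
    ("LTN", 1500, "GLA", 1610),
    ("GLA", 1610, "LCY", 1730),
    ("LCY", 1735, "EDI", 1910),
    ("EDI", 1920, "MAN", 2020),
    ("MAN", 2020, "SOU", 2120),
]


def greedy_itinerary(flights, origin, start=0):
    """Follow shortest-layover connections until no departing flight remains.

    Assumes every flight arrives strictly after it departs, so the walk
    always terminates.
    """
    by_origin = {}
    for o, dep, d, arr in flights:
        by_origin.setdefault(o, []).append((dep, d, arr))
    for departures in by_origin.values():
        departures.sort()

    path, airport, now = [origin], origin, start
    while True:
        departures = by_origin.get(airport, [])
        # (now,) sorts before any (now, destination, arrival), so this finds
        # the first departure at or after the current time.
        i = bisect_left(departures, (now,))
        if i == len(departures):
            return path
        _, airport, now = departures[i]
        path.append(airport)


print(greedy_itinerary(FLIGHTS, "OXF"))
# ['OXF', 'GLO', 'IOM', 'GLA', 'LTN', 'GLA', 'LCY', 'EDI', 'MAN', 'SOU']
```

Instantaneous layovers are allowed here, matching the 1610 arrival and 1610 departure at GLA in the sequence above.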
Remarkably, there is a considerable amount of geographic coherence to the clusters on weekdays, particularly on Friday. For example, even the two geographically largest clusters in the center panel of Fig. 5 consist entirely of airports on the periphery of the UK and Crown dependencies. As another example, the airports in Orkney with service (viz., KOI, NDY, NRL, PPW, SOY, and WRY) are part of a single, geographically local cluster.^{9}
The time reversal (not shown) is less coherent, probabilistically and geographically. This highlights a critical distinction between DCNs with scheduled versus observed contacts. The former will generally correspond to something like a transportation network that is specifically engineered to facilitate certain connections (this in turn highlights that the UK domestic flight data is really a superposition of DCNs corresponding to individual airlines, with codeshares serving to add further complexity). It seems plausible that this distinction underlies the very limited degree of commonality between the forward and time-reversed clusters. □
Remarks
The model of “Markov chain models for DCNs” section exhibits a very sensitive dependence on the structure of the underlying DCN. In experiments not detailed here, we have seen that inserting just 1% of uniformly random contacts seriously degrades the model’s data reduction and anomaly detection characteristics. Rather than being a shortcoming, this behavior demonstrates that the model actually captures delicate yet critical aspects of flows from contact data.
An interesting perspective on this model is that it yields a time-varying geometry once a metric on probability distributions is chosen. Using the induced metric to assess the model dynamics is worth examining in future work. In a similar vein, analyzing the variation of information (Meilă 2007) between subsequent clusterings would give additional principled insight into the behavior of DCNs.
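For reference, the variation of information between two clusterings of the same vertex set is H(A) + H(B) − 2I(A;B), where H is entropy and I is mutual information. The following minimal sketch computes it from two label lists using natural logarithms; the function name and the example labelings are illustrative assumptions rather than part of the analysis above.

```python
from collections import Counter
from math import log


def variation_of_information(labels_a, labels_b):
    """Variation of information H(A) + H(B) - 2 I(A;B), in nats.

    labels_a[i] and labels_b[i] are the cluster labels of item i under
    the two clusterings being compared.
    """
    n = len(labels_a)
    if n != len(labels_b):
        raise ValueError("clusterings must label the same items")
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    vi = 0.0
    for (a, b), n_ab in joint.items():
        # n_ab / count_a[a] is the conditional probability of b's cluster
        # given a's cluster, and symmetrically for n_ab / count_b[b].
        vi -= (n_ab / n) * (log(n_ab / count_a[a]) + log(n_ab / count_b[b]))
    return vi


# Hypothetical clusterings of four vertices at two successive times.
earlier = [0, 0, 1, 1]
later = [0, 1, 0, 1]
print(variation_of_information(earlier, earlier))  # 0.0
print(variation_of_information(earlier, later))    # 2 * log(2), about 1.386
```

Because the variation of information is a metric on partitions, values from successive time steps can be compared directly.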
Regarding the taint bounds of “Taint bounds” section, we note that using alternative arithmetics/semirings along similar lines is also of interest, such as the so-called log and Viterbi semirings.
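To indicate what changing the arithmetic means in practice, the sketch below combines weights over two parallel paths in three standard semirings: the ordinary probability semiring (+, ×), the Viterbi semiring (max, ×), and the log semiring, whose addition is x ⊕ y = −log(e^{−x} + e^{−y}) and whose multiplication is ordinary addition of negative log weights. The edge weights and helper names are hypothetical, and the sketch makes no assumption about how the taint bounds themselves are computed.

```python
from dataclasses import dataclass
from functools import reduce
from math import exp, inf, log
from typing import Callable


@dataclass(frozen=True)
class Semiring:
    add: Callable[[float, float], float]
    mul: Callable[[float, float], float]
    zero: float  # identity for add
    one: float   # identity for mul


def log_add(x, y):
    """Log semiring addition -log(exp(-x) + exp(-y)), computed stably."""
    if x == inf:
        return y
    if y == inf:
        return x
    m = min(x, y)
    return m - log(exp(-(x - m)) + exp(-(y - m)))


PROBABILITY = Semiring(lambda x, y: x + y, lambda x, y: x * y, 0.0, 1.0)
VITERBI = Semiring(max, lambda x, y: x * y, 0.0, 1.0)
LOG = Semiring(log_add, lambda x, y: x + y, inf, 0.0)


def combine_paths(paths, sr):
    """Multiply weights along each path, then add the results across paths."""
    total = sr.zero
    for path in paths:
        total = sr.add(total, reduce(sr.mul, path, sr.one))
    return total


# Two hypothetical parallel paths from a source to a target (edge weights).
paths = [[0.5, 0.5], [0.5, 0.2]]
neg_log_paths = [[-log(w) for w in path] for path in paths]

print(combine_paths(paths, PROBABILITY))         # total weight, about 0.35
print(combine_paths(paths, VITERBI))             # best single path, 0.25
print(exp(-combine_paths(neg_log_paths, LOG)))   # about 0.35, via log space
```

The log semiring reproduces the probability semiring in a numerically safer representation, whereas the Viterbi semiring keeps only the single most heavily weighted path.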
Footnotes
- 1.
By analogy, consider a flight departing from s at τ_{0} and arriving at t at τ_{1}. Here the contact (s,∗,τ_{0}) corresponds to embarking, while the contact (∗,t,τ_{1}) corresponds to debarking. We can think of ∗ as the physical plane on which the passengers flew. Alternative representations involving additional contacts of the form (s,∗,τ_{∗}) with τ_{0}≤τ_{∗}<τ_{1} might also be appropriate depending on circumstances and model intent.
- 2.
Allowing a negative absolute temperature (Ramsey 1956), β=−∞ and β=∞ respectively correspond to “absolute hot” (no spatial arc traversals) and absolute zero (no temporal arc traversals). In practice, we use a physical analogy/heuristic to set β^{−1} to the average time between contacts. Further discussion of the role of β can be found below.
- 3.
That said, a dependence on Δτ is necessary. In the context of intra-computer information flows, this time difference plausibly approximates (at least for small values) a linear function of the conditional Kolmogorov complexity of the intervening computation.
- 4.
We can generalize this construction to the related notion of a weighted DCN by normalizing the sum of outbound weights and modifying the first case in (4) accordingly.
- 5.
This construction was necessary because in many cases the source or target of a ground truth event did not exist. For example, the userspace commands hostname and put /tmp/netrecon correspond to the (process name,filename) pairs \((\texttt {hostname},\varnothing)\) and \((\varnothing,\texttt {/tmp/netrecon})\), respectively. By way of comparison, the command rm -f /tmp/netrecon.log corresponds to the pair (rm,/tmp/netrecon.log).
- 6.
In more delicate situations, the approach of Huntsman (2018a) offers a principled solution to the problem of thresholding.
- 7.
Notwithstanding their fundamentally dynamic character, these bounds may be regarded as having a loose analogue in the practice of abstract interpretation in static analysis of computer programs (Nielson et al. 2010).
- 8.
Renormalizable theories in physics (and their fixed/critical points) are of great interest: indeed, renormalizability is actually a requirement for statistical and quantum field theories to be well-defined rather than “effective.”
- 9.
As an amusing aside, the shortest scheduled passenger flight in the world is between PPW and WRY: it has been completed in under a minute.
Notes
Acknowledgements
We thank Yingbo Song, Rob Ross, and Mike Weber for many helpful discussions as well as creating the summary and ground truth data used in “Data reduction and anomaly detection” section.
Authors’ contributions
GC conceived of and prototyped the techniques discussed in “Taint bounds” section, provided advice and comments on the subject matter of the entire paper, and edited the manuscript. SH conceived of the model in “Markov chain models for DCNs” section, performed the data analyses presented in the paper, and wrote the manuscript. Both authors read and approved the final manuscript.
Funding
This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) and the Air Force Research Laboratory (AFRL). Cybenko’s efforts in this work were also partially supported by ARO MURI Grant W911NF-13-1-042. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of DARPA, the US Army or AFRL.
Competing interests
The authors declare that they have no competing interests.
References
- Bang-Jensen, J, Gutin G (2009) Digraphs: Theory, Algorithms and Applications. 2nd edn. Springer, London. https://doi.org/10.1007/978-1-84800-998-1.
- Barenblatt, GI (2003) Scaling. Cambridge University Press, Cambridge. https://doi.org/10.1017/cbo9780511814921.
- Barrat, A, Cattuto C (2013) Temporal networks of face-to-face human interactions. In: Temporal Networks, 191–216. Springer, Berlin.
- Bassett, DS, Porter MA, Wymbs NF, Grafton ST, Carlson JM, Mucha PJ (2013) Robust detection of dynamic community structure in networks. Chaos: Interdisc J Nonlinear Sci 23(1):013142.
- Bazzi, M, Porter MA, Williams S, McDonald M, Fenn DJ, Howison SD (2016) Community detection in temporal multilayer networks, with an application to correlation networks. Multiscale Modeling Simul 14(1):1–41.
- Bianchi, FM, et al. (2016) Identifying user habits through data mining on call data records. Eng Appl Artif Intell 54:49–61.
- Brémaud, P (1999) Markov Chains: Gibbs Fields, Monte Carlo Simulation, and Queues. Springer, New York.
- Chan, SC, et al. (2017) Expressiveness benchmarking for system-level provenance. TaPP.
- Chakrabarti, D (2004) AutoPart: parameter-free graph partitioning and outlier detection. PKDD.
- Cheney, J, Acar UA, Perera R (2013) Toward a theory of self-explaining computation. In: Tannen V et al. (eds) In Search of Elegance in the Theory and Practice of Computation. Springer.
- Flaška, V, et al. (2007) Transitive closures of binary relations I. Acta Uni Carolinae - Math Phys 48:55.
- Gallotti, R, Barthelemy M (2015) The multilayer temporal network of public transport in Great Britain. Sci Data 2:140056.
- Gallotti, R, Barthelemy M (2015) The multilayer temporal network of public transport in Great Britain. Dryad Digit Repository. https://doi.org/10.5061/dryad.pc8m3.
- Gauvin, L, Panisson A, Cattuto C (2014) Detecting the community structure and activity patterns of temporal networks: a non-negative tensor factorization approach. PloS One 9(1):e86028.
- Ge, X, Parise S, Smyth P (2003) Clustering Markov states into equivalence classes using SVD and heuristic search algorithms. AISTATS.
- Glazek, K (2002) Selected Applications of Semirings. In: A Guide to the Literature on Semirings and their Applications in Mathematics and Information Sciences, 67–87. Springer. https://doi.org/10.1007/978-94-015-9964-1_6.
- Goldenfeld, N (1992) How Phase Transitions Occur in Principle. In: Lectures on Phase Transitions and the Renormalization Group, 23–83. Addison-Wesley. https://doi.org/10.1201/9780429493492.
- Grindrod, P, Higham DJ (2013) A matrix iteration for dynamic network summaries. SIAM Rev 55:118.
- Holland, PW, Laskey KB, Leinhardt S (1983) Stochastic blockmodels: first steps. Soc Netw 5:109.
- Holme, P (2015) Modern temporal network theory: a colloquium. Eur Phys J B 88:234.
- Huntsman, S (2018a) Topological mixture estimation. ICML.
- Huntsman, S (2018b) A Markov model for inferring flows in directed contact networks. In: Aiello L, Cherifi C, Cherifi H, Lambiotte R, Lió P, Rocha L (eds) Complex Networks and Their Applications VII. COMPLEX NETWORKS 2018. Studies in Computational Intelligence. Springer, Cham.
- Jenkinson, G, et al. (2017) Applying provenance in APT monitoring and analysis. TaPP.
- Karschau, J, Zimmerling M, Friedrich BM (2018) Renormalization group theory for percolation in time-varying networks. Sci Rep 8(1):8011.
- King, ST, Chen PM (2005) Backtracking intrusions. ACM Trans Comp Sys 23:51.
- Lambiotte, R, Rosvall M, Scholtes I (2019) From networks to optimal higher-order models of complex systems. Nat Phys 15(4):313–320. https://doi.org/10.1038/s41567-019-0459-y.
- Lencastre, P, et al. (2016) From empirical data to continuous Markov processes: a systematic approach. Phys Rev E 93:032135.
- Malliaros, FD, Vazirgiannis M (2013) Clustering and community detection in directed networks: a survey. Phys Rep 533:95–142.
- Masuda, N, Lambiotte R (2016) Models of temporal networks. In: A Guide to Temporal Networks. World Scientific. https://doi.org/10.1142/q0033.
- Meilă, M (2007) Comparing clusterings-an information based distance. J Multivariate Anal 98:873.
- Newman, ME, Watts DJ (1999) Renormalization group analysis of the small-world network model. Phys Lett A 263(4-6):341–346.
- Nielson, F, Nielson HR, Hankin C (2010) Principles of Program Analysis. Springer, Berlin.
- Perra, N, et al. (2012) Random walks and search in time-varying networks. Phys Rev Lett 109:238701.
- Ramsey, NF (1956) Thermodynamics and statistical mechanics at negative absolute temperatures. Phys Rev 103:20.
- Rocha, LEC, Masuda N (2014) Random walk centrality for temporal networks. New J Phys 16:063023.
- Rohe, K, Qin T, Yu B (2016) Co-clustering directed graphs to discover asymmetries and directional communities. Proc Nat Acad Sci 113:12679.
- Rosvall, M, Esquivel AV, Lancichinetti A, West JD, Lambiotte R (2014) Memory in network flows and its effects on spreading dynamics and community detection. Nat Commun 5:4630.
- Saramäki, J, Holme P (2015) Exploring temporal networks with greedy walks. Eur Phys J B 88:334.
- Sarzynska, M, Leicht EA, Chowell G, Porter MA (2015) Null models for community detection in spatially embedded, temporal networks. J Compl Netw 4(3):363–406.
- Schwartz, EJ, Avgerinos T, Brumley D (2010) All you ever wanted to know about dynamic taint analysis and forward symbolic execution (but might have been afraid to ask). In: 2010 IEEE Symposium on Security and Privacy. https://doi.org/10.1109/sp.2010.26.
- Ser-Giacomi, E, et al. (2015) Most probable paths in temporal weighted networks: an application to ocean transport. Phys Rev E 92:012818.
- Speidel, L, Takaguchi T, Masuda N (2015) Community detection in directed acyclic graphs. Eur Phys J B 88:203.
- Starnini, M, et al. (2012) Random walks on temporal networks. Phys Rev E 85:056115.
- Valdano, E, Poletto C, Colizza V (2015) Infection propagator approach to compute epidemic thresholds on temporal networks: impact of immunity and of limited temporal resolution. Eur Phys J B 88:341.
- Valdano, E, Fiorentin MR, Poletto C, Colizza V (2018) Epidemic threshold in continuous-time evolving networks. Phys Rev Lett 120(6):068302.
Copyright information
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.