# Testing the hypothesis of preferential attachment in social network formation

## Abstract

The hypothesis of preferential attachment (PA) - whereby better connected individuals make more connections - is hotly debated, particularly in the context of epidemiological networks. The simplest models of PA, for example, are incompatible with the eradication of any disease through population-level control measures such as random vaccination. Typically, evidence has been sought for the presence or absence of preferential attachment via asymptotic power-law behaviour. Here, we present a general statistical method to test directly for evidence of PA in count data and apply this to data for contacts relevant to the spread of respiratory diseases. We find that while standard methods for model selection prefer a form of PA, careful analysis of the best fitting PA models allows for a level of contact heterogeneity that in fact allows control of respiratory diseases. Our approach is based on a flexible but numerically cheap likelihood-based model that could in principle be applied to other integer data where the hypothesis of PA is of interest.

### Keywords

MLE, Phase-type distribution, model selection, spectral methods

## 1 Introduction

### 1.1 Contact heterogeneity in infectious disease epidemiology

Infectious pathogens that spread via contact between people are a major cause of human disease, driving attempts to understand their epidemiology [1]. Much theoretical work on infectious disease dynamics has been focused on the role of heterogeneity in the human population [2], which is often conceptualised as a network of epidemiologically relevant contacts [3, 4, 5].

*K* from the same *degree distribution*, but transmission is otherwise homogeneous,

### 1.2 Data

Of course, whether such a theoretical possibility matters for the study of infectious diseases depends on the actual variance in degree for epidemiologically relevant contacts. While 20th century models of infectious disease were often based on strong a priori assumptions about mixing patterns [1], various methods for measurement of contact patterns now exist and were reviewed by Read et al. [10]. As well as direct measurement of individuals through surveys [11] it is possible to improve coverage through snowball and respondent-driven sampling [12, 13], to make use of the extremely large datasets produced by electronic sensors [14, 15], and also to combine aggregate data [16, 17].

*k*, a randomly selected node obeys

These empirical observations of high heterogeneity in contact number, together with theoretical results about \(R_{0}\), present a paradox for infectious disease epidemiology. Is the extreme heterogeneity in observed contact patterns indicative of PA, and does that imply that \(R_{0}>1\) for almost any finite level of person-to-person transmissibility, meaning that our theoretical understanding of infectious disease epidemiology is somehow severely lacking?

### 1.3 Preferential attachment and power laws in empirical data

Recent years have seen a debate about the level of heterogeneity that exists in a variety of observed networks. A particularly influential paper by Barabási and Albert [22] considered a model of network formation in which many new nodes are added to a small existing network. These new nodes connect preferentially to nodes that have more links in the existing network, leading to the asymptotic result (2) with \(\gamma=3\). In this way preferential attachment is intimately linked with, but not always equivalent to, asymptotic power-law behaviour.

Simple power-law relationships have been claimed for numerous real-world systems, and a critical review of these claims by Clauset et al. [23] used maximum-likelihood fitting of distribution tails to power-law distributions to show varying levels of statistical support for claims in the literature. In the context of discrete data, pioneering work by Zipf [24] found power-laws in word frequencies; considering the count of unique words in *Moby Dick* both Newman [25] and Clauset et al. [23] agree that the statistical evidence for Zipf’s power-law distribution in this context is strong. On the other hand, the in- and out-degrees of *E. coli* metabolic networks have been claimed to follow a power law [26], but this is disputed by the analyses of Huss and Holme [27] and Clauset et al. [23].
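As a rough illustration of the tail-fitting approach described above, the sketch below estimates a power-law exponent for integer data by maximum likelihood. It uses the closed-form approximation for discrete data given by Clauset et al. [23], which treats the data as a rounded continuous power law and is generally considered less reliable when the lower cut-off is small. The function name, the cut-off `x_min` and the example counts are illustrative assumptions and are not taken from any of the analyses cited here.

```python
import math


def discrete_power_law_exponent(data, x_min):
    """Approximate MLE of the exponent for the tail x >= x_min of integer data.

    Uses the approximation 1 + n / sum(log(x_i / (x_min - 1/2))).
    """
    if x_min < 1:
        raise ValueError("x_min must be a positive integer")
    tail = [x for x in data if x >= x_min]
    if not tail:
        raise ValueError("no observations at or above x_min")
    log_sum = sum(math.log(x / (x_min - 0.5)) for x in tail)
    return 1.0 + len(tail) / log_sum


# Hypothetical counts, for illustration only.
counts = [1, 1, 2, 1, 3, 1, 2, 8, 1, 4, 1, 2, 1, 15, 1]
print(discrete_power_law_exponent(counts, x_min=1))
```

An estimate of this kind describes only the tail above `x_min`; it does not by itself test whether a power law is a better description than the alternatives.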

The debate around presence or absence of power laws in real data continues, perhaps most strongly in the context of networks. For example, Barabási [28] writes that preferential attachment is network science’s “most profuse concept,” and that “the impact of preferential attachment is hard to miss.” At the same time, Stumpf and Porter [29] argue that “most reported power laws lack statistical support and mechanistic backing.”

### 1.4 Testing preferential attachment directly

In this work, we attempt to test the hypothesis of preferential attachment in social contact data directly, rather than via asymptotic power law behaviour. We make use of previously collected data on social encounters specifically designed to measure heterogeneity in numbers of contacts amongst the British population, and fit mechanistic models of different complexity to these data. We determine that models with significant levels of preferential attachment have better evidential support from the data than models without.

## 2 Methods

### 2.1 Social Contact Survey data

A cross-sectional study was conducted between May 2009 and October 2010, recruiting households and individuals through postal and online questionnaires, supported by a large random-address mailshot and a modest online and media promotion [30, 31]. Questionnaires asked respondents to report on the number of distinct individuals they encountered the previous day: their contacts. Respondents were able to report contacts either as individuals or as members of a group with a reported size. Allowing the reporting of groups of individuals was a deliberate methodological design to permit the easy reporting of large numbers of contacts, to avoid the approach taken by previous studies [11], which imposed a high burden on respondents with large numbers of contacts, and to ensure the best capture of the right-hand tail of the degree distribution. In general, we expect that such data will become increasingly available due to the epidemiological importance of this tail (e.g. the study of Read et al. [21]).

In total, completed questionnaires were received from 5,388 participants in Great Britain, 3,901 of which were from postal surveys. There was some bias in demographic representation: most notably, younger age groups and males were generally under-represented (see Danon et al. [31] for more details). The data are available at http://wrap.warwick.ac.uk/54273/.

### 2.2 Generalised preferential attachment

*N* individuals indexed by *i*, each individual has an integer-valued random variable \(K_{i}(t)\) for its number of contacts. Individual *i* starts with \(K_{i}(0)=0\) and makes new contacts over a time period \(T_{i}\), which is given by a positive real-valued random variable with probability density function \(\rho(t)\). The generation of new social contacts is modelled by a continuous-time Markov chain with the following events and rates:

*t* but not *k* and the asymptote holds as *k* becomes large. This is not a simple power-law relationship, and so the asymptotic behaviour of the moments is not determined by the power-law exponent, but rather through the moment generating function \(M(t,z) = g(t,\mathrm{e}^{z})\), \(z\in(-\infty,0]\), such that the *r*th moment of the degree distribution, conditional on \(t< T\), is

*r*th moment of the degree distribution will be
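To make the role of \(\tau\) concrete, the following sketch computes the distribution of the number of contacts at a fixed time *t*. Because the event rates referred to above are not reproduced in this text, it assumes that an individual with *k* contacts gains a new one at rate \(1+\tau k\). Under that assumption the count at time *t* is negative binomial (Poisson when \(\tau=0\)) with mean \((\mathrm{e}^{\tau t}-1)/\tau\), so moments grow exponentially in *t* at a rate set by \(\tau\). The function names are hypothetical.

```python
import math


def contact_pmf_fixed_time(t, tau, k_max):
    """P(K(t) = k) for k = 0..k_max, starting from K(0) = 0.

    ASSUMPTION: new contacts arrive at rate 1 + tau * k when K = k.
    """
    pmf = [math.exp(-t)]  # probability that no contact has been made by time t
    if tau == 0:
        # No preferential attachment: Poisson with mean t.
        for k in range(k_max):
            pmf.append(pmf[-1] * t / (k + 1))
        return pmf
    q = -math.expm1(-tau * t)  # 1 - exp(-tau * t)
    for k in range(k_max):
        # Negative binomial recursion with shape 1/tau.
        pmf.append(pmf[-1] * q * (k + 1.0 / tau) / (k + 1))
    return pmf


def mean_contacts(t, tau):
    """Expected number of contacts at time t under the same assumed rate."""
    return t if tau == 0 else math.expm1(tau * t) / tau


# Illustrative values only: compare the mean with and without PA.
print(mean_contacts(10.0, 0.0), mean_contacts(10.0, 0.02))
```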

### 2.3 Phase-type holding times

The question is then posed as to an appropriate distribution from which to draw the holding times \(\{T_{i}\}\) for the amount of time spent making new contacts on the day for which individuals provide data. In previous work [30] on a related model of contact formation we considered holding times \(T_{i}\) that were log-normally distributed. This provided a good fit to data, but was computationally intensive and lacked a mechanistic interpretation. We therefore consider here a class of distributions for the holding times that is highly flexible, but which has analytic and numerical benefits - the distributions of *phase type* [35]. Phase-type distributions are dense in the space of positive-valued probability distributions [36], meaning that they can be made arbitrarily close to any other distribution. They have a mechanistic interpretation and allow for analytic manipulations that greatly reduce the numerical cost of likelihood evaluation.

*a* is \(\nu_{a}\) (meaning these parameters must sum to unity); the rate of stopping making new social contacts is \(\mu_{a}\) for an individual in phase *a*; and the rate of moving from phase *a* to phase *b* is \(Q_{a,b}\). Note that different rates of making contacts in different phases are not realistically distinguishable from different times spent and so are not included as parameters. The phases have a mechanistic interpretation as the activities that individuals undertake on a given day.
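The mechanistic reading of these parameters can be sketched as a direct simulation: pick a starting phase with probabilities \(\nu_{a}\), then repeatedly wait an exponential time and either stop (rate \(\mu_{a}\)) or move to another phase (rate \(Q_{a,b}\)). The code below is a minimal illustration of sampling a phase-type holding time in this way; the function name and the example parameter values are assumptions made for illustration and are not fitted values.

```python
import random


def sample_holding_time(nu, mu, Q, rng):
    """Draw one holding time T from a phase-type distribution.

    nu[a]   : probability of starting in phase a (entries sum to 1)
    mu[a]   : rate of stopping while in phase a
    Q[a][b] : rate of moving from phase a to phase b (a != b)
    """
    m = len(nu)
    a = rng.choices(range(m), weights=nu)[0]
    t = 0.0
    while True:
        move_rates = [Q[a][b] if b != a else 0.0 for b in range(m)]
        total = mu[a] + sum(move_rates)
        if total <= 0:
            raise ValueError("phase %d can never be left" % a)
        t += rng.expovariate(total)  # time until the next event
        if rng.random() * total < mu[a]:
            return t  # the individual stops making contacts
        a = rng.choices(range(m), weights=move_rates)[0]


# Hypothetical two-phase example: start in phase 0, move to phase 1, then stop.
rng = random.Random(2015)
nu, mu, Q = [1.0, 0.0], [0.0, 1.0], [[0.0, 1.0], [0.0, 0.0]]
draws = [sample_holding_time(nu, mu, Q, rng) for _ in range(5)]
print(draws)
```

Simulation of this kind is useful for checking intuition, but it is not how likelihoods would be evaluated in practice, since the analytic structure of phase-type distributions allows much cheaper calculation.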

*r*th moment of the degree distribution will involve a term like

**I** is the identity matrix. Let \(\mathbf{A} = \mathbf{M} + r\tau \mathbf{I}\); this matrix is triangular and so its eigenvalues are equal to its diagonal elements; in particular the *a*th eigenvalue of **A** is \(\lambda_{a} = -M_{a} + r \tau\). If we let **R** be a matrix whose *a*th column is the *a*th right eigenvector of **A** and **L** be a matrix whose *a*th row is the *a*th left eigenvector of **A** then

**D** is a diagonal matrix whose *a*th diagonal element is \(\mathrm{e}^{\lambda_{a} t}\). The integral \(\mathcal{I}_{r}\) therefore converges exactly when ∀ *a*, \(\lambda_{a}<0\), which implies that the condition for divergence of the *r*th moment is

In general, however, combination of (10) and (6) is not the most numerically efficient method for calculation of the overall probability mass function for final number of contacts \(K_{i}(T_{i})\) and a different approach is needed.
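The convergence criterion stated above, that \(\mathcal{I}_{r}\) converges exactly when every \(\lambda_{a} = -M_{a} + r\tau\) is negative, translates directly into a check for the lowest divergent moment: the *r*th moment diverges as soon as \(r\tau \geq M_{a}\) for some phase *a*. The sketch below applies that rule; the function name, the search cap `r_max` and the example values of \(M_{a}\) are hypothetical.

```python
def lowest_divergent_moment(M, tau, r_max=1000):
    """Smallest positive integer r with -M[a] + r * tau >= 0 for some phase a.

    M   : the positive quantities M_a appearing in lambda_a = -M_a + r * tau
    tau : strength of preferential attachment
    Returns None if tau <= 0 or no moment up to r_max diverges.
    """
    if tau <= 0:
        return None  # without PA every lambda_a stays negative
    slowest = min(M)  # the phase that is slowest to leave decides divergence
    for r in range(1, r_max + 1):
        if r * tau - slowest >= 0:
            return r
    return None


# Hypothetical M_a values combined with an illustrative tau.
print(lowest_divergent_moment([0.05, 0.3, 1.2], tau=0.018))
```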

### 2.4 Numerically efficient model solution

*a* and with *k* social contacts at time *t*, \(p_{a,k}(t)\). These ODEs take the form

*k* social contacts make new contacts, given in (3). We are then interested in \(d_{k}\), the probability mass function for a randomly selected individual having made *k* social contacts by the end of the process. A series of manipulations directly analogous to those of Bailey [37] shows that
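For intuition about \(d_{k}\), consider the special case of a single phase with stopping rate \(\mu\). Assuming (since the rates in (3) are not reproduced in this text) that an individual with *k* contacts makes a new one at rate \(1+\tau k\), each step is a race between making a contact and stopping, so \(d_{k}\) can be built up by a simple recursion. This is only a sketch of the one-phase case under that assumption and is not the general multi-phase solution; the function name is hypothetical.

```python
def final_contact_pmf_one_phase(mu, tau, k_max):
    """d_k for k = 0..k_max in a one-phase model.

    ASSUMPTION: contacts are made at rate 1 + tau * k and the process
    stops at constant rate mu.
    """
    pmf = []
    reach = 1.0  # probability of reaching k contacts before stopping
    for k in range(k_max + 1):
        total = mu + 1.0 + tau * k
        pmf.append(reach * mu / total)      # stop while holding k contacts
        reach *= (1.0 + tau * k) / total    # make contact k + 1 first
    return pmf


# Illustrative values only: PA fattens the tail relative to tau = 0.
no_pa = final_contact_pmf_one_phase(1.0, 0.0, 50)
with_pa = final_contact_pmf_one_phase(1.0, 0.1, 50)
print(no_pa[20], with_pa[20])
```

With \(\tau=0\) the recursion gives a geometric distribution, while for \(\tau>0\) the tail decays like \(k^{-(1+\mu/\tau)}\), so the *r*th moment diverges when \(r\tau\geq\mu\), in line with the eigenvalue condition of Section 2.3 for a single phase.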

### 2.5 Model likelihood, fitting and selection

*k* social contacts when surveyed. A model \(\mathcal{M}\) is therefore specified by a number of phases *m* and the presence or absence of PA, meaning the general parameters are \(\boldsymbol{\theta} = (\tau,\nu_{a},\mu_{a},Q_{a,b})\), with *τ* present only if there is PA. The number *n* of individuals sampled from the British population *N* is

We consider the use of the likelihood function (20) using standard statistical methodology. Numerical maximum likelihood estimation was performed using simulated annealing run from multiple starting points to ensure the global optimum was obtained. Model selection was performed using AIC [38] and BIC [39], as well as likelihood ratio tests [40] on pairs of models where this test was informative. This was done since each approach involves different trade-offs between model fit and complexity, and to check that our conclusions about PA are not overly sensitive to the precise method used. Uncertainty in model parameters was quantified using confidence intervals obtained through bootstrapping the data, and uncertainty in model outputs such as the predicted degree distribution was quantified using a parametric bootstrap.
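The model selection criteria named above can be sketched in a few lines. AIC is \(2k - 2\ln\hat{L}\) and BIC is \(k\ln n - 2\ln\hat{L}\), where *k* is the number of parameters and \(\hat{L}\) the maximised likelihood; adding PA to a model with a given number of phases adds the single parameter \(\tau\), so the corresponding likelihood ratio test has one degree of freedom. The parameter-count formula below reproduces the counts listed in the results table, but its decomposition into \(m-1\) free initial probabilities, *m* stopping rates and \(m(m-1)/2\) transition rates is an inference rather than something stated in the text. The log-likelihood values are hypothetical.

```python
import math


def n_parameters(m, with_pa):
    """Parameter count for m phases (formula inferred from the results table)."""
    return (m - 1) + m + m * (m - 1) // 2 + (1 if with_pa else 0)


def aic(log_lik, k):
    return 2.0 * k - 2.0 * log_lik


def bic(log_lik, k, n):
    return k * math.log(n) - 2.0 * log_lik


def lrt_p_value_one_df(log_lik_null, log_lik_alt):
    """Chi-squared (1 d.f.) tail probability for a pair of nested models."""
    stat = max(2.0 * (log_lik_alt - log_lik_null), 0.0)
    return math.erfc(math.sqrt(stat / 2.0))


n = 5388  # number of survey participants
# HYPOTHETICAL maximised log-likelihoods, for illustration only.
fits = {(3, False): -17060.0, (3, True): -17015.0}
for (m, pa), log_lik in fits.items():
    k = n_parameters(m, pa)
    print(m, pa, k, aic(log_lik, k), bic(log_lik, k, n))
print(lrt_p_value_one_df(fits[(3, False)], fits[(3, True)]))
```

Because \(\tau=0\) lies on the boundary of the parameter space, a p-value computed this way should be read as approximate.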

## 3 Results and discussion

**Comparison of models with different numbers of phases, with and without preferential attachment (PA), together with: number of parameters; differences in AIC and BIC values compared to the overall minimum; and the lowest divergent moment for models with PA**

| Model (phases, PA) | Number of parameters | ΔAIC | ΔBIC | Lowest divergent moment |
|---|---|---|---|---|
| (1, No) | 1 | 2.2 × 10 | 2.1 × 10 | – |
| (2, No) | 4 | 2.1 × 10 | 1.5 × 10 | – |
| (3, No) | 8 | 1.2 × 10 | 83 | – |
| (4, No) | 13 | 42 | 38 | – |
| (5, No) | 19 | 23 | 58 | – |
| (6, No) | 26 | 27 | 1.1 × 10 | – |
| (1, Yes) | 2 | 1.9 × 10 | 1.1 × 10 | 3 |
| (2, Yes) | 5 | 1.3 × 10 | 72 | 4 |
| (3, Yes) | 9 | 31 | **0** | 3 |
| (4, Yes) | 14 | 11 | 14 | 3 |
| (5, Yes) | 20 | **0** | 42 | 3 |
| (6, Yes) | 27 | 9 | 97 | 3 |

For the 3-phase model with PA, \(\tau= 0.018\ [0.012,0.026]\); and if we set \(\tau=0\) but leave the other parameters at their fitted values, then the total number of contacts per person is reduced to 64% of its original value. For the 5-phase model with PA \(\tau= 0.026\ [0.019,0.036]\); and if we set \(\tau=0\) but leave the other parameters at their fitted values, then the total number of contacts per person is reduced to 58% of its original value. This shows that in both of these models, we can attribute a substantial fraction of the contacts to PA.

We also calculate that the second moment does not diverge in any of the fitted models, which helps to resolve the epidemiological paradox that we introduced at the start of this paper. PA is empirically supported, and is also mechanistically plausible since existing social contacts give more opportunities for future social contact. Combined with a sufficiently detailed phase-based mechanistic model of the contexts in which social contacts are made, however, PA does not imply a divergent second moment for the distribution of contacts relevant for the spread of directly transmitted infections. This means that our understanding of how basic epidemiological quantities like the basic reproductive ratio, \(R_{0}\), are related to contact networks does not need to be revised in the light of empirical evidence.

As a final observation, we believe that as computational resources for fitting models to data improve, it will in general be easier to test the hypothesis of PA directly in all kinds of data, rather than looking for asymptotic power laws.

## Notes

### Acknowledgements

The Social Contact Survey was funded by the Medical Research Council, grant number G0701256. TH and MJK are supported by the Engineering and Physical Sciences Research Council. JMR and MJK are supported by the Economic and Social Research Council, grant ES/K004255/1. LD is supported by the Leverhulme Trust.

### References

- 1. Anderson RM, May RM (1991) Infectious diseases of humans. Oxford University Press, Oxford
- 2. Diekmann O, Heesterbeek JAP, Britton T (2012) Mathematical tools for understanding infectious disease dynamics. Princeton University Press, Princeton
- 3. Bansal S, Grenfell BT, Meyers LA (2007) When individual behaviour matters: homogeneous and network models in epidemiology. J R Soc Interface 4(16):879-891
- 4. Danon L, Ford AP, House T, Jewell CP, Keeling MJ, Roberts GO, Ross JV, Vernon MC (2011) Networks and the epidemiology of infectious disease. Interdiscip Perspect Infect Dis 2011:284909
- 5. Pellis L, Ball F, Bansal S, Eames K, House T, Isham V, Trapman P (2014) Eight challenges for network epidemic models. Epidemics 10:58-62. doi: 10.1016/j.epidem.2014.07.003
- 6. Diekmann O, Heesterbeek JAP (2000) Mathematical epidemiology of infectious diseases: model building, analysis and interpretation. Wiley, New York
- 7. Pastor-Satorras R, Vespignani A (2001) Epidemic dynamics and endemic states in complex networks. Phys Rev E 63:066117
- 8. May RM, Lloyd AL (2001) Infection dynamics on scale-free networks. Phys Rev E 64:066112
- 9. Durrett R (2010) Some features of the spread of epidemics and information on a random graph. Proc Natl Acad Sci USA 107(10):4491-4498
- 10. Read JM, Edmunds WJ, Riley S, Lessler J, Cummings DAT (2012) Close encounters of the infectious kind: methods to measure social mixing behaviour. Epidemiol Infect 140(12):2117-2130
- 11. Mossong J, Hens N, Jit M, Beutels P, Auranen K, Mikolajczyk R, Massari M, Salmaso S, Tomba GS, Wallinga J, Heijne J, Sadkowska-Todys M, Rosinska M, Edmunds WJ (2008) Social contacts and mixing patterns relevant to the spread of infectious diseases. PLoS Med 5(3):381-391
- 12. Goodman LA (1961) Snowball sampling. Ann Math Stat 32:148-170
- 13. Heckathorn DD (1997) Respondent-driven sampling: a new approach to the study of hidden populations. Soc Probl 44:174-199
- 14. Salathé M, Kazandjieva M, Lee JW, Levis P, Feldman MW, Jones JH (2010) A high-resolution human contact network for infectious disease transmission. Proc Natl Acad Sci USA 107(51):22020-22025
- 15. Isella L, Stehlé J, Barrat A, Cattuto C, Pinton J, Van den Broeck W (2011) What’s in a crowd? Analysis of face-to-face behavioral networks. J Theor Biol 271(1):166-180
- 16. Eubank S, Guclu H, Kumar VSA, Marathe MV, Srinivasan A, Toroczkai Z, Wang N (2004) Modelling disease outbreaks in realistic urban social networks. Nature 429(6988):180-184
- 17. Eubank S, Barrett C, Beckman R, Bisset K, Durbeck L, Kuhlman C, Lewis B, Marathe A, Marathe M, Stretz P (2010) Detail in network models of epidemiology: are we there yet? J Biol Dyn 4(5):446-455
- 18. Fournet J, Barrat A (2014) Contact patterns among high school students. PLoS ONE 9(9):e107878
- 19. Schneeberger A, Mercer CH, Gregson SAJ, Ferguson NM, Nyamukapa CA, Anderson RM, Johnson AM, Garnett GP (2004) Scale-free networks and sexually transmitted diseases: a description of observed patterns of sexual contacts in Britain and Zimbabwe. Sex Transm Dis 31(6):380-387
- 20. Leigh Brown AJ, Lycett SJ, Weinert L, Hughes GJ, Fearnhill E, Dunn DT (2011) Transmission network parameters estimated from HIV sequences for a nationwide epidemic. J Infect Dis 204(9):1463-1469
- 21. Read JM, Lessler J, Riley S, Wang S, Tan LJ, Kwok KO, Guan Y, Jiang CQ, Cummings DAT (2014) Social mixing patterns in rural and urban areas of southern China. Proc R Soc B 281(1785):20140268
- 22. Barabási AL, Albert R (1999) Emergence of scaling in random networks. Science 286(5439):509-512
- 23. Clauset A, Shalizi CR, Newman MEJ (2009) Power-law distributions in empirical data. SIAM Rev 51(4):661-703
- 24. Zipf GK (1949) Human behavior and the principle of least effort: an introduction to human ecology. Addison-Wesley, Reading
- 25. Newman MEJ (2005) Power laws, Pareto distributions and Zipf’s law. Contemp Phys 46(5):323-351
- 26. Jeong H, Tombor B, Albert R, Oltvai ZN, Barabási AL (2000) The large-scale organization of metabolic networks. Nature 407(6804):651-654
- 27. Huss M, Holme P (2007) Currency and commodity metabolites: their identification and relation to the modularity of metabolic networks. IET Syst Biol 1(5):280-285
- 28. Barabási AL (2012) Network science: luck or reason. Nature 489(7417):507-508
- 29. Stumpf MPH, Porter MA (2012) Critical truths about power laws. Science 335(6069):665-666
- 30. Danon L, House T, Read JM, Keeling MJ (2012) Social encounter networks: collective properties and disease transmission. J R Soc Interface 9(76):2826-2833
- 31. Danon L, Read JM, House T, Vernon MC, Keeling MJ (2013) Social encounter networks: characterizing Great Britain. Proc R Soc B 280(1765):20131037
- 32. Durrett R (2007) Random graph dynamics. Cambridge University Press, Cambridge
- 33. Simkin MV, Roychowdhury VP (2011) Re-inventing Willis. Phys Rep 502(1):1-35
- 34. Yule GU (1925) A mathematical theory of evolution, based on the conclusions of Dr. J. C. Willis, F.R.S. Philos Trans R Soc Lond B, Contain Pap Biol Character 213:21-87
- 35. Neuts MF (1981) Matrix-geometric solutions in stochastic models: an algorithmic approach. Johns Hopkins University Press, Baltimore
- 36. Neuts MF (1975) Probability distributions of phase type. In: Liber amicorum Professor emeritus Dr. H. Florin. Katholieke Universiteit Leuven, Departement Wiskunde, Leuven, pp 173-206
- 37. Bailey NTJ (1957) The mathematical theory of epidemics. Griffin, London
- 38. Akaike H (1974) A new look at the statistical model identification. IEEE Trans Autom Control 19(6):716-723
- 39. Schwarz G (1978) Estimating the dimension of a model. Ann Stat 6(2):461-464
- 40. Neyman J, Pearson ES (1933) On the problem of the most efficient tests of statistical hypotheses. Philos Trans R Soc Lond A, Math Phys Eng Sci 231:289-337

## Copyright information

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.