1 Introduction

The spreading of memes over social networks has become a subject of interdisciplinary research in many fields of science. From linguistics and sociology to statistical physics, numerous works analyze the sociolinguistic phenomenon of virality, trying to explain the key mechanisms of meme propagation and to predict future meme popularity [3, 15, 18].

In the common understanding, a meme is a piece of information or a unit of cultural transmission that replicates over a population. In the context of social media, a meme can be a link, hashtag, phrase, image or video. A viral is defined as a long-lasting meme with a high probability of reposting. The process of meme proliferation typically differs from epidemic spreading and is therefore described as complex contagion. The deviation from simple epidemic patterns is caused by multiple social factors and cognitive constraints such as homophily, confirmation bias, social reinforcement, triadic closure or echo chambers [15, 17]. In this work, we consider the spreading of hashtags - concise string-type memes used extensively to tag Twitter microblog posts and to decorate them with additional semantics, personal affect or extended context. The substrate for hashtag propagation is formed by the social network of Twitter users and posts, connected by three types of relations: follow, retweet and mention.

Modeling meme dynamics is essential not only in broadly understood marketing and business analysis but also, more recently, in the public security sector, where new challenges related to election manipulation, fake news, terror attacks or riots have been raised [4]. What factors should be considered in modeling hashtag propagation and recognizing virality? Recent studies suggest that while the content and the related affect seem a natural driver of virality, social influence [8] and network structure [6, 18] play a more important role in the meme diffusion process. This is particularly visible for hashtags, which are intended to be ultra-concise.

The impact of network structure on meme dynamics is typically modeled as a social reinforcement effect, in which the probability of meme adoption increases with the number of exposures [17]. Motivated by recent blog posts about overused memes, we argue that the social reinforcement mechanism should be supplemented with a negative feedback loop, which reflects that users can become annoyed with a meme when they see it too frequently. We hypothesise that by adding an overexposure effect to the simulation, we will obtain meme dynamics closer to those observed for real tweets. To test our hypothesis, we first, in Sect. 2, introduce basic definitions and describe five new models of hashtag propagation over a microblog network. These models use the baseline structural factors known from other works [14, 17, 18], but additionally take into account that multiple exposures to a meme can have both a positive and a negative effect on adoption. In Sect. 3, we present how a one-day Twitter network dump with real hashtags was used for analyzing our models and comparing them with the classical models described in [17]. The results of the experiments, visualizations and a quantitative model comparison are presented in Sect. 4. The wider context of meme spreading is described in Sect. 5, where an overview of related works is presented. We conclude in Sect. 6 by presenting a summary of our findings and discussing the limitations of our approach.

The key contributions of this work are new models of hashtag spreading, which take into account not only community structure and social reinforcement but also meme overexposure. Five different mechanisms of incorporating overexposure are proposed. We also perform validation of the new models against the models described in [17], based on Twitter data.

2 Spreading Models

Herein we introduce definitions required in the rest of this work and describe meme spreading models including five new models combining the influence of communities and cognitive limitations of individuals.

2.1 Networks with Communities

We define an undirected, static graph as a tuple \(G=(V,E)\) where V is a set of vertices and \(E = \{\{v_1,v_2\}: v_1, v_2 \in V\}\) is a set of edges. We also define a community as a subset of vertices \(C_i \subseteq V, i \in \{1,...,n\}\), where n is the number of all communities. We assume that the community structure is fixed in time. The set of all communities \(C=\{C_i\}\) must contain all nodes from the graph: \(\bigcup C_i = V, C_i \in C\). Similarly to [17, 18], we assume that the communities are disjoint, meaning that each vertex is a member of exactly one community \(\forall v \in V \,\exists ! C_i: v \in C_i\), or in other words \(\forall u,v \in V; C_i, C_j \in C, C_i \ne C_j: u \in C_i \wedge v \in C_j \Rightarrow u \ne v \).
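
The two conditions on C (coverage of V and pairwise disjointness) can be checked mechanically. The following is a minimal sketch under the assumption that communities are stored as Python sets; the helper names `is_partition` and `membership` are hypothetical and do not come from the original work.

```python
from typing import Dict, Hashable, List, Set


def is_partition(vertices: Set[Hashable], communities: List[Set[Hashable]]) -> bool:
    """Return True if the communities are pairwise disjoint and cover all vertices."""
    seen: Set[Hashable] = set()
    for community in communities:
        if seen & community:  # some vertex was already assigned to another community
            return False
        seen |= community
    return seen == vertices


def membership(communities: List[Set[Hashable]]) -> Dict[Hashable, int]:
    """Map each vertex to the index of its (unique) community."""
    return {v: i for i, community in enumerate(communities) for v in community}


# Toy example: five vertices split into three disjoint communities.
V = {1, 2, 3, 4, 5}
C = [{1, 2}, {3}, {4, 5}]
```

Storing the inverse map from vertex to community index makes the later "same community" tests constant-time lookups.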

2.2 Spreading Process

The defined graph is the base network on which the meme spreading processes take place. Spreading itself can be defined as an iterative function taking any known or unknown parameters of the network and returning the state of the spreading. This requires an assumption about similar timescales of all spreading processes, which is not always true - the same number of tweets can be produced very quickly during an intensive discussion and rather slowly in a marketing campaign. For our work it is enough to assume that the spreading process uses knowledge about the topology of the network, the community structure and the state of spreading from the previous iteration.

We borrow the language of epidemiology and define the generation of a tweet as an infection. More formally, we define the state of the spreading \(S_t\) at a given time t as a subset of infected nodes \(S_t \subseteq V\). If the spreading process is more complex, i.e. there are more possible states of a node, then new disjoint subsets of nodes must be defined. The spreading process is then defined as a function \(f_s: S_t \times G \times C \rightarrow S_{t+1}\).

For this work we assume that at any given iteration t only one node can be infected and that the same node can be infected multiple times during the process. We define the infected node \(v_t \in V\) through \(S_{t} = S_{t-1} \cup \{v_t\}\). Then, the spreading process function can be simplified to a sequence of nodes \((v_t)\) infected in each iteration: \(f_s: S_{t-1} \times G \times C \rightarrow \{v_t\}\). For describing spreading processes, we define the neighborhood of a set of nodes \(S \subseteq V\) as all nodes adjacent to at least one node of S that are not in S themselves: \(N(S) = \{v_1: \exists v_2 \in S: \{v_1, v_2\} \in E \wedge v_1 \notin S\}\). We also define the inclusive neighborhood of a set \(S \subseteq V\) as \(N_{incl}(S) = N(S) \cup S\).
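
As an illustration, the two neighborhood operators can be sketched on an adjacency-set representation. This reads N(S) as the vertices outside S that have at least one neighbor inside S, which is an interpretation of the definition; the function names and the toy graph are assumptions made for the example.

```python
def neighborhood(adj, S):
    """N(S): vertices adjacent to some member of S that are not in S themselves."""
    S = set(S)
    return {v for u in S for v in adj.get(u, ()) if v not in S}


def inclusive_neighborhood(adj, S):
    """N_incl(S) = N(S) united with S."""
    return neighborhood(adj, S) | set(S)


# Path graph 1 - 2 - 3 - 4 stored as adjacency sets.
adj = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
```

For the path graph, `neighborhood(adj, {2, 3})` yields the two end vertices, while the inclusive variant returns the whole path.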

2.3 Spreading Models

In this section, we provide spreading process functions for baseline models (M1 to M4 from [17]) and for five models proposed by us.

Random Sampling Model (M1). In the random model, at each iteration, we randomly choose the infected node with a uniform probability. It assumes that the network topology does not affect the spreading. This model will be used as a baseline for calculating spreading metrics.

$$\begin{aligned} v_t = rand_t(V) \end{aligned}$$
(1)

Network Structure Model (M2). Here we assume that the network structure has an impact on spreading. This effect is reflected by choosing a random, already infected vertex and then selecting one of its topological neighbors:

$$\begin{aligned} v_t = rand(\{v: v \in N_{incl}(\{u\}); u = rand(S_{t-1})\}) \end{aligned}$$
(2)

In addition, with constant probability \(p_{restart} = 0.15\), the process is restarted from a random node, as in Eq. (1).
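
A single M2 iteration could be sketched as follows. This is only an illustrative reading of Eq. (2) combined with the restart rule; the function names, the seeding of the first iteration through a restart, and the toy graph are assumptions rather than the authors' implementation.

```python
import random


def m2_step(adj, infected, rng, p_restart=0.15):
    """One iteration: restart uniformly over V, or pick from N_incl({u}) of a random infected u."""
    if not infected or rng.random() < p_restart:
        return rng.choice(sorted(adj))      # restart, as in the random model M1
    u = rng.choice(sorted(infected))        # random already infected vertex
    return rng.choice(sorted(adj[u] | {u}))  # u itself or one of its neighbors


def simulate_m2(adj, steps, seed=0):
    """Run the process and return the sequence of infected vertices (v_t)."""
    rng = random.Random(seed)
    sequence, infected = [], set()
    for _ in range(steps):
        v = m2_step(adj, infected, rng)
        sequence.append(v)
        infected.add(v)
    return sequence


adj = {1: {2, 3}, 2: {1}, 3: {1}}
sequence = simulate_m2(adj, steps=20, seed=1)
```

Sorting before sampling keeps the run reproducible for a fixed seed, because set iteration order is not guaranteed.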

Social Reinforcement Model (M3). At each step of the spreading process, the node with the most infected neighbors is infected. If there is no unique vertex satisfying this requirement, then a random one (among all those with the same number of infected neighbors) is selected. The restart process is included similarly to the model M2.

$$\begin{aligned} v_t = rand(\{v: v \in N_{incl}(S_{t-1}) \wedge argmax_v(N(v) \cap S_{t-1})\} ) \end{aligned}$$
(3)

Homophily Model (M4). The homophily model uses the community structure to define nodes having similar interests. We randomly choose an already infected vertex and then randomly select one of its neighbors from the same community. The restart process is included as in M2 and M3.

$$\begin{aligned} v_t = rand(\{v: v \in N_{incl}(\{u\}) \wedge u,v \in C_i; u = rand(S_{t-1})\}) \end{aligned}$$
(4)
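
The community filter of Eq. (4) can be illustrated with a short sketch. It assumes a dictionary `community_of` that maps every vertex to its community index; this name, the helper functions and the toy data are hypothetical.

```python
import random


def m4_candidates(adj, community_of, u):
    """Inclusive neighbors of u that share u's community (u itself always qualifies)."""
    return sorted(v for v in adj[u] | {u} if community_of[v] == community_of[u])


def m4_step(adj, community_of, infected, rng, p_restart=0.15):
    """One iteration of the homophily step with a constant restart probability."""
    if not infected or rng.random() < p_restart:
        return rng.choice(sorted(adj))  # restart from a uniformly random vertex
    u = rng.choice(sorted(infected))
    return rng.choice(m4_candidates(adj, community_of, u))


# Vertices 1 and 2 form one community, 3 and 4 another; edge 1-3 crosses communities.
adj = {1: {2, 3}, 2: {1}, 3: {1, 4}, 4: {3}}
community_of = {1: 0, 2: 0, 3: 1, 4: 1}
```

Because the inclusive neighborhood contains u, the candidate list is never empty, so without a restart the meme can only leave a community through the restart branch.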

Unique Information Overexposure Model (M5). In meme spreading in social networks, some hashtags are so unique that they are unlikely to appear in multiple places of the graph, but they are still affected by social reinforcement. To include that effect, we change the M3 model by removing the restart process and adding a requirement not to infect already infected nodes. To some extent this is similar to the Susceptible-Infected-Resistant (SIR) epidemic model, but it is a more directed approach, because we choose a vertex with the most infected neighbors. This is an example of simple overexposure, because no node is infected multiple times with the same hashtag.

$$\begin{aligned} v_t = rand(\{v: v \in N(S_{t-1}) \wedge argmax_v(N(v) \cap S_{t-1})\} ) \end{aligned}$$
(5)
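
The selection rule shared by M3 and M5 can be sketched with a single function. The sketch reads the argmax in Eqs. (3) and (5) as maximizing the number of infected neighbors, which is an interpretation; the flag `allow_reinfection` (True for the M3 candidate set, False for M5) and the toy graph are assumptions, and the M3 restart is left out.

```python
import random


def most_exposed(adj, infected, rng, allow_reinfection):
    """Pick a candidate with the most infected neighbors; ties are broken at random."""
    frontier = {v for u in infected for v in adj[u]} - infected  # N(S)
    candidates = frontier | infected if allow_reinfection else frontier
    if not candidates:
        return None  # M5 only: every reachable vertex is already infected
    score = {v: len(adj[v] & infected) for v in candidates}
    best = max(score.values())
    return rng.choice(sorted(v for v, s in score.items() if s == best))


# Vertex 3 touches both infected vertices, vertex 4 touches only one of them.
adj = {1: {3, 4}, 2: {3}, 3: {1, 2}, 4: {1}}
infected = {1, 2}
```

With `allow_reinfection=False` the process stops naturally once the frontier is empty, which mirrors the removal of the restart mechanism in M5.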

Global Increase of Generation Probability Model (M6). One of the cognitive limitations in social interactions is that people get bored when they see the same meme too many times. To model such a mechanism, we infect the node with the most exposures, but if a neighboring vertex was infected multiple times, we count all of these occurrences as exposures. Secondly, the restart probability grows linearly from 0 to 1 over the iterations. This means that initially the spreading is rapid and confined to the local neighborhood. After a period of time, when the members of a community are overexposed, new random vertices outside the community are likely to be infected.

This definition requires setting a maximum number of iterations for which the spreading occurs. Similarly to choosing \(p_{restart}\), this is strictly connected to finding the timescale of the process. In this model, we assume that the spreading lasts as long as the longest spreading time observed in the real data, denoted as \(T_{max}\). We define the number of exposures for a given node v at iteration t as:

$$\begin{aligned} n_{exp}(v, t) = |\{u: u \in N(v) \wedge u = v_\tau , \tau < t\}|. \end{aligned}$$
(6)

Then, selection of the currently infected node can be represented as:

$$\begin{aligned} v_t = rand(\{v: argmax_v(n_{exp}(v, t))\}). \end{aligned}$$
(7)

The restart probability is:

$$\begin{aligned} p_{restart} = \frac{t}{T_{max}}. \end{aligned}$$
(8)
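
A sketch of the exposure count and the global restart probability is given here. It follows the prose description, in which every repeated infection of a neighbor counts as a new exposure; if Eq. (6) is instead read as a set of distinct neighbors, the loop would run over `set(history)`. The function names and the toy data are assumptions.

```python
from collections import Counter


def exposures(adj, history):
    """n_exp for every vertex: one exposure per past infection of a neighbor."""
    counts = Counter()
    for infected_vertex in history:  # repeated infections are counted again
        for neighbor in adj[infected_vertex]:
            counts[neighbor] += 1
    return counts


def p_restart_global(t, t_max):
    """Restart probability of Eq. (8), capped at 1 for safety."""
    return min(1.0, t / t_max)


adj = {1: {2, 3}, 2: {1}, 3: {1}}
history = [1, 2, 1]  # vertex 1 tweeted twice, vertex 2 once
n_exp = exposures(adj, history)
```

In this toy run vertices 2 and 3 were each exposed twice through vertex 1, so they would tie in Eq. (7) and one of them would be drawn at random.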

Local Increase of Generation Probability Model (M7). Similarly to M6, we increase the restart probability linearly, but it is reset to 0 after each restart.

$$\begin{aligned} v_t = rand(\{v: argmax_v(n_{exp}(v, t))\}) \end{aligned}$$
(9)

The restart probability is:

$$\begin{aligned} p_{restart} = \frac{i}{T_{max}}, \end{aligned}$$
(10)

where i is the number of iterations since last restart.
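
The effect of resetting the counter can be seen in a small simulation of the restart schedule alone. This is a sketch under the assumption that i is incremented once per iteration without a restart; `restart_schedule` is a hypothetical helper, not part of the original models.

```python
import random


def restart_schedule(t_max, steps, seed=0):
    """Return the iterations at which a restart fires under p_restart = i / t_max."""
    rng = random.Random(seed)
    restarts, i = [], 0
    for t in range(steps):
        if rng.random() < i / t_max:
            restarts.append(t)
            i = 0   # the counter is reset after each restart
        else:
            i += 1  # one more iteration since the last restart
    return restarts


restarts = restart_schedule(t_max=10, steps=200, seed=1)
```

Because the probability reaches 1 after `t_max` iterations without a restart, consecutive restarts in this sketch can never be more than `t_max + 1` iterations apart, in contrast to M6, where restarts concentrate at the end of the run.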

Inverse Exposure Model (M8). This is a more complex overexposure model, where the exposures are counted separately for all nodes. Then, the infected node is selected randomly, with the probability inversely proportional to the number of its exposures:

$$\begin{aligned} v_t = rand_{ipe}(N_i(S_{t-1})), \end{aligned}$$
(11)

where \(rand_{ipe}\) is a function choosing a node with probability inversely proportional to the number of exposures of each node. The restart has a constant probability of

$$\begin{aligned} p_{restart} = 0.15. \end{aligned}$$
(12)
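
One possible way to realize \(rand_{ipe}\) is weighted sampling. The sketch uses the weight 1 / (1 + n_exp) so that nodes with zero exposures remain well defined; this smoothing term is an assumption and is not specified in the text, as are the function signature and the toy counts.

```python
import random
from collections import Counter


def rand_ipe(candidates, n_exp, rng):
    """Sample one candidate with weight 1 / (1 + exposures); the +1 is an assumed smoothing."""
    ordered = sorted(candidates)
    weights = [1.0 / (1 + n_exp.get(v, 0)) for v in ordered]
    return rng.choices(ordered, weights=weights, k=1)[0]


# Node "a" has never been exposed, node "b" has been exposed nine times.
n_exp = {"a": 0, "b": 9}
rng = random.Random(0)
draws = Counter(rand_ipe({"a", "b"}, n_exp, rng) for _ in range(10_000))
```

With these weights the heavily exposed node is chosen in roughly one draw out of eleven, which is the negative feedback the model is meant to capture.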

Inverse Exposure Model with No Restarts (M9). Similar to M8 model but without a restart mechanism. Choosing the next infected node is described using the same function from Eq. (11).

2.4 Spreading Metrics

To compare different models, we calculate the following metrics. Most of them are taken from [17].

Intra- and Inter-community Activity. The patterns of within- and between-community meme activity reflect the influence of these sub-structures on the diffusion process and can be used to quantify the overall spreading dynamics. In models M2 and M4 the meme transfer happens between a source node (u - infected in one of the previous iterations) and a destination node (\(v_t\)). If the nodes belong to the same community \(u, v_t \in C_i\), the transition is classified as intra-community activity \((a_{intra})\). If they belong to different communities \(u \in C_i \wedge v_t \in C_j \wedge i \ne j\), it is called inter-community activity (\(a_{inter}\)). To be able to compare these metrics between different networks, we normalize them by dividing by the number of intra-community (\(edges_{intra}\)) and inter-community edges (\(edges_{inter}\)) in the network, respectively.

$$\begin{aligned} a_{intra} = \frac{c_{intra}}{edges_{intra}}\end{aligned}$$
(13)
$$\begin{aligned} a_{inter} = \frac{c_{inter}}{edges_{inter}} \end{aligned}$$
(14)

Here \(c_{intra}\), \(c_{inter}\) are the numbers of spreading events occurring within the same and between the different communities.
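
Eqs. (13) and (14) can be computed directly from a list of transfers. The sketch assumes that transfers are given as (source, destination) pairs and that `community_of` maps vertices to community indices; both names and the guard against empty edge classes are assumptions.

```python
def community_activity(transfers, edges, community_of):
    """Return (a_intra, a_inter): spreading events per intra- and inter-community edge."""
    def same(a, b):
        return community_of[a] == community_of[b]

    c_intra = sum(1 for u, v in transfers if same(u, v))
    c_inter = len(transfers) - c_intra
    edges_intra = sum(1 for u, v in edges if same(u, v))
    edges_inter = len(edges) - edges_intra
    a_intra = c_intra / edges_intra if edges_intra else 0.0
    a_inter = c_inter / edges_inter if edges_inter else 0.0
    return a_intra, a_inter


community_of = {1: 0, 2: 0, 3: 1, 4: 1}
edges = [(1, 2), (2, 3), (3, 4)]               # two intra-community edges, one inter
transfers = [(1, 2), (2, 1), (1, 2), (2, 3)]   # three intra events, one inter event
a_intra, a_inter = community_activity(transfers, edges, community_of)
```

Here three intra-community events over two intra-community edges give a_intra = 1.5, and one inter-community event over the single bridging edge gives a_inter = 1.0.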

Usage Dominance. We define usage dominant community as a community with the biggest number of spreading events: \(C_d = \max _{C_i}\sum _{t=0}^{T_{max}}|\{v_t \in C_i\}|\) and a number of spreading events produced in this community as \(T_{C_d} = \sum _{t=0}^{T_{max}}|\{v_t \in C_d\}|\). Then usage dominance is defined as

$$\begin{aligned} D_u = \frac{|T_{C_d}|}{T_{max}}. \end{aligned}$$
(15)

High values of usage dominance indicate that the meme was trapped mostly within one community and that its dynamics is more local. The probability of such a meme becoming viral is low.

Adoption Dominance. We define adoption dominant community (\(C_a\)) as a community with the biggest number of vertices infected by spreading. Note that it is different than \(C_d\) for which we count all spreading events, here we count only unique vertices: \(C_a = max_{C_i}|S_{T_{max}} \cap C_i|\). The adoption dominance can be defined as:

$$\begin{aligned} D_a = \frac{|C_a|}{|S_{T_{max}}|}. \end{aligned}$$
(16)
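
Both dominance measures can be sketched from a sequence of infected vertices. The sketch reads \(T_{max}\) in Eq. (15) as the number of spreading events in the sequence and the numerator of Eq. (16) as the number of unique adopters inside the dominant community; these readings, like the function names, are assumptions.

```python
from collections import Counter


def usage_dominance(sequence, community_of):
    """Share of all spreading events produced inside the usage dominant community."""
    events = Counter(community_of[v] for v in sequence)
    return max(events.values()) / len(sequence)


def adoption_dominance(sequence, community_of):
    """Share of unique adopters located inside the adoption dominant community."""
    adopters = set(sequence)
    per_community = Counter(community_of[v] for v in adopters)
    return max(per_community.values()) / len(adopters)


community_of = {1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
sequence = [1, 1, 1, 2, 3, 4, 5]  # vertex 1 tweets the hashtag three times
d_u = usage_dominance(sequence, community_of)
d_a = adoption_dominance(sequence, community_of)
```

The toy sequence shows why the two measures can point to different communities: community 0 produces most events (4 of 7), while community 1 holds most unique adopters (3 of 5).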

3 Simulation Details

Each model was run at least twice on the network recreated from real tweets. Ground truth data were obtained from the same set of tweets and analyzed on the same network, to obtain metrics reflecting real-world spreading dynamics.

3.1 Recreated Network

We gathered 2 045 413 tweets using the public Twitter API. From these, we selected only the ones containing a mention relation, which involves at least two users: one source and possibly multiple targets. Each source-target pair was treated as an undirected edge \(e \in E\), to form the initial graph G. Then, the largest connected component of G was extracted. In our case it consisted of 710 195 vertices and 919 022 edges. This was the base graph on which the simulations were run. In Table 1, the details of the recreated network are summarised. The visualisation of the graph using IVGA, a fast force-directed method from [5], is presented in Fig. 1.
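
The construction of the base graph can be sketched in a few lines. The input format (pairs of a source user and a list of mentioned users), the function names and the decision to drop self-mentions are assumptions made for illustration.

```python
from collections import defaultdict, deque


def build_graph(mentions):
    """mentions: iterable of (source, [targets]); returns undirected adjacency sets."""
    adj = defaultdict(set)
    for source, targets in mentions:
        for target in targets:
            if target != source:  # self-mentions are ignored (assumption)
                adj[source].add(target)
                adj[target].add(source)
    return adj


def largest_component(adj):
    """Return the vertex set of the largest connected component (breadth-first search)."""
    seen, best = set(), set()
    for start in adj:
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in component:
                    component.add(v)
                    queue.append(v)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


mentions = [("ann", ["bob", "cat"]), ("dan", ["eve"]), ("cat", ["fay"])]
giant = largest_component(build_graph(mentions))
```

In the toy data the pair dan-eve forms a separate component and is therefore discarded, just as tweets outside the largest component are excluded from the simulations.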

Fig. 1. Visualization of the giant component of sampled Twitter network (mention relation) using IVGA algorithm [5]. The color of the node represents the community. (Color figure online)

Table 1. Recreated network details.

Static networks do not completely model the temporal relationships occurring between nodes in real networks. In our case, the timescale of spreading is much shorter than the timescale of changes in relationships between users, which is why we can assume that the relationships do not change over time. We use the mention relation because it is bidirectional in nature: the user mentioning another user is aware of the message content as its creator, and the mentioned user is notified about that message as well. This is why we can assume that our base network is undirected.

Community Detection. We used INFOMAP algorithm to obtain partitioning of the nodes into communities C. Its main advantage is linear time complexity O(|E|), which is practical for large graphs. Secondly, it does not require the number of expected communities as an input. This is not achievable, e.g., for Label Propagation algorithm.

The network was split into 32 948 communities, with the smallest communities consisting of just two nodes and the largest containing 5394 nodes. The distribution of communities is presented in Fig. 2. We also ran simulations on the same network with community structure created using label propagation algorithm. The results were similar but we do not present them here because of space constraints.

Fig. 2. The frequencies of community sizes detected using INFOMAP algorithm on sampled Twitter network (a) and Hashtag popularity (defined as number of occurrences) distribution (b) in logarithmic scale. Broader distribution is visible for highly popular hashtags. It is caused by insufficient time to spread more. With the time passing most of these points would update the trend upwards.

3.2 Ground Truth Data

From all the tweets, we filtered out the ones without a hashtag or with at least one vertex not present in the largest component. For each unique hashtag, the list of tweets using it was extracted and sorted by the tweet’s time \(t \in T\). To form the ground truth for measuring and comparing hashtag spreading models, we simulated meme propagation on the sample Twitter network by infecting nodes according to their order in that list. The case when two consecutive nodes are not connected is modeled by the restart mechanism described in Sect. 2.
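
Replaying one hashtag on the network can be sketched as follows. The sketch assumes tweets are available as (timestamp, user) pairs and counts a restart whenever two consecutive users are not adjacent in the graph; this is one possible reading of the procedure, and `replay` is a hypothetical helper.

```python
def replay(tweets, adj):
    """tweets: (timestamp, user) pairs for one hashtag; returns (sequence, restarts)."""
    sequence = [user for _, user in sorted(tweets)]  # infection order by tweet time
    restarts = sum(
        1
        for prev, cur in zip(sequence, sequence[1:])
        if cur != prev and cur not in adj.get(prev, ())
    )
    return sequence, restarts


# Path graph 1 - 2 - 3 plus an isolated vertex 4; tweets arrive out of order.
adj = {1: {2}, 2: {1, 3}, 3: {2}, 4: set()}
tweets = [(3, 3), (1, 1), (2, 2), (4, 1)]
sequence, restarts = replay(tweets, adj)
```

The resulting sequence plays the role of \((v_t)\) for the ground truth, so the same metric code can be applied to real and simulated spreading.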

The distribution of hashtag popularity does not completely follow the power law, as presented in Fig. 2. Because of the large number of hashtags (41 356), we calculated spreading metrics only for a subset of them. The subset was chosen so that the hashtags were distributed evenly across the popularity bins. For our calculations of usage dominance and adoption dominance we used data for 280 unique hashtags spread across 54 578 iterations in total.

4 Results

Based on the simulation results, we found that our models M5–M9 follow the real-world hashtag spreading dynamics more accurately than models M2–M4 (both for unpopular and viral hashtags). For metric calculation, we define hashtags with low \(a_{intra}\) as those whose value is lower than average. Analogously, we define hashtags with high \(a_{intra}\) as those whose value is higher than average.

Fig. 3. Adoption (a) and usage (b) dominance values for models M4–M9. Each line represents changes in the proportion of hashtags produced inside the adoption (\(C_a\)) and usage dominant communities (\(C_d\)) to the number of all hashtags produced in the network. The results are averaged across 30 logarithmic bins and normalized by the random sampling model (M1). ground_truth represents ground truth spreading including all hashtags. The series ground_truth_small_intra and ground_truth_high_intra show the results for hashtags with \(a_{intra}\) lower and higher than average, respectively. The standard error slightly exceeds 2% only for one model. For the rest of the models, as well as for the ground truth, its value is below 1%.

Table 2. Correlation of non normalized and root mean square error of normalized adoption dominance \(D_a\). Highlighted entries represent the best two values in each column.
Table 3. Correlation of non normalized and root mean square error of normalized usage dominance \(D_u\). Highlighted entries represent the best two values in each column.

The usage dominance and adoption dominance measures are normalized by dividing them by the values obtained for the random sampling model (M1) under the same conditions, similarly to [17]. As our intention was to analyze multi-use memes rather than rare memes, we removed hashtags with fewer than 10 adoptions. This also prevents including artifacts such as hashtags that started to become popular just before the end of data collection. Next, we aggregated the dominance metrics into 30 logarithmic bins. The results of the simulation quantified using adoption dominance and usage dominance are shown in Fig. 3. Overall, the inverse exposure (M9), unique information overexposure (M5) and local increase of generation probability (M7) models perform best when compared to the ground truth results for all hashtags (ground_truth). This is confirmed by both the highest correlation and the lowest error, as shown in Table 2 and Table 3. There is one exception - a large error for the M9 model in usage dominance, which we discuss below. Our models adopting different overexposure mechanisms (especially M7) generally give better results than the baseline M2–M4 models from the literature [17]. The significance level used for testing the statistical significance of the correlations is 0.05. We performed additional tests to verify how well the models reflect the dynamics of hashtags divided into two groups: high intra-community activity (specific, trapped memes) and low intra-community activity (generic, viral candidates).
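
The aggregation and comparison steps can be sketched with standard-library Python. The bin boundaries (`p_min`, `p_max`), the function names and the use of the Pearson coefficient are assumptions; the text only states that 30 logarithmic bins, a correlation and a root mean square error were used.

```python
import math


def log_bin_index(popularity, p_min, p_max, n_bins=30):
    """Index of the logarithmic bin that holds a popularity value in [p_min, p_max]."""
    span = math.log(p_max) - math.log(p_min)
    frac = (math.log(popularity) - math.log(p_min)) / span
    return min(n_bins - 1, int(frac * n_bins))  # p_max falls into the last bin


def rmse(model, truth):
    """Root mean square error between two equally long series."""
    return math.sqrt(sum((m - t) ** 2 for m, t in zip(model, truth)) / len(truth))


def pearson(x, y):
    """Pearson correlation coefficient of two equally long series."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def normalize(values, baseline):
    """Divide binned dominance values by the random sampling (M1) values bin by bin."""
    return [v / b for v, b in zip(values, baseline)]
```

Correlation captures whether a model reproduces the shape of the ground truth curve, while the error captures the distance between the curves, which is why the two are reported side by side.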

Comparing the results of spreading with the ground truth for below-average \(a_{intra}\) (ground_truth_small_intra), we see that the existing models slightly outperform our overexposure models when it comes to correlation. This is visible mostly for M2 (correlation 0.93 and 0.94). Based on the plots, we observe that this is mostly due to the initial phase of spreading. Errors are still much lower for our models (especially M5). It is worth noting that the correlations and errors are almost on the same level for the M5 and M8 models.

The correlations of the spreading pattern ground_truth_high_intra (higher than average \(a_{intra}\)) with all the simulated trends are either statistically insignificant or negative. Nevertheless, based only on the error value, we can see that the overexposure models mimic real-world spreading more precisely. The much higher absolute values of the errors suggest that this type of spreading is hard to model with any of the frameworks M1–M9.

A significant difference is observed between the usage and adoption dominance measures for model M9 in the initial phase of spreading. As visible in the high and similar correlation values (0.94 and 0.92, respectively), the shape of the function is preserved, but the error value is much higher for usage dominance, especially for low \(a_{intra}\). The effect is also present in the ground truth data, but with a much smaller amplitude. Most likely this is because the probability of the same user tweeting the same hashtag again is lower in real life than in the models. The M9 model is more susceptible to that, because it can spread more freely inside a community. This is supported by a comparison with M5, which is a more directed spreading model demonstrating almost no difference between the values of usage and adoption dominance.

Table 4. Correlation of non normalized and root mean square error of normalized usage dominance \(D_u\) for highly popular tweets (popularity > 500). Highlighted entries represent the best two values in each column.

Because viral or very popular hashtags are the most important in our analysis, in Table 4 we also present adoption dominance \(D_a\) results for hashtags with more than 500 spreading events. The good results for the M5 model show that it is the most robust framework presented in this article. Interestingly, highly popular hashtags with lower than average \(a_{intra}\) are modeled much better with the inclusion of overexposure. This is visible in the low error and high correlation values for models M5, M7 and M9. Among the baseline models from the literature, M4 has the best results.

5 Related Works

In the broad literature on the subject we find works focusing on different aspects of meme popularity. These can be grouped into four major categories: content, social influence, time characteristics and network structure. Content appeal is typically analyzed for longer posts, for which sentiment, affect or emotional load can be established. In [7], the authors present a study of sample tweets and, based on news/non-news classification and sentiment analysis, conclude that negative sentiment for news messages and positive sentiment for non-news messages improve retweetability. The influence of psychological arousal (emotions resulting from content) on information diffusion was analyzed in [2]. The authors provide evidence for the claim that high-arousal content triggers significant activation of users and boosts meme proliferation. The exhaustive sociolinguistic analysis of social media communication presented in [15] reveals important observations regarding the meme lifecycle and virality. The authors emphasize the importance of phatic communication in maintaining the vitality of a social network. This type of communication does not transport quality content or meaning but still enables the emergence of conviviality, understood as the production of social engagement, based on temporal and elastic collectives consuming virals. Simple interactions and networking seem to be more important drivers of social network dynamics than sharing valuable information. A comprehensive approach to meme burst prediction is presented in [1], where features based on content, network and time are used to determine future meme popularity. An interesting method of meme identification based on content and time was presented in [13]. The authors represent content stream dumps in the form of a Graph-of-Words and use the k-core decomposition algorithm to identify viral memes.

The notion of a social network influencer is widespread in mass culture and affects modern online marketing strategies, causing the emergence of semi-professional content producers on Twitter or Instagram. The importance of seed agents responsible for initiating information spread over a social network was studied in many works, including [10, 16]. Cognitive constraints caused by heavy information flows, such as limited attention or confirmation bias, force individuals to aggressively filter content and optimize evaluation time by adopting the opinions of others [9]. The high authority of the peer providing the information and multiple exposures increase the probability of meme adoption (retweet, mention). From a structural perspective, the influence can be quantified using vertex centrality measures such as degree, betweenness centrality, clustering coefficient or PageRank. Nevertheless, a more complete model of meme propagation should also include non-structural external knowledge such as forwarding activity and interests.

Temporal patterns of meme diffusion were also analyzed in the context of process modeling and virality prediction. In [19], the authors used wavelet transform and trajectory clustering techniques to discover six major types of growth-decay characteristics of user attention curves. Different types of time series were identified for professional blogging, micro-blogging and news agencies. Long-term time variations in Twitter hashtag exposure and adoption were studied in [11]. The authors found that timing characteristics highly depend on the topic and the frequency of hashtag exposure. The features extracted from the time series of first tweet adopters were used in [18] for classification-based virality prediction. It was also shown that structural features of the social network are more discriminative than time-based features.

The structure of the social network was identified as a key component of meme spreading dynamics [3, 6, 12, 14]. In particular, community structures can trap information flows and transform simple epidemic spreading into complex contagion, affected additionally by the social mechanisms of homophily, reinforcement or overexposure. Inter-community concentration of early meme adopters is one of the most important features in predicting virality [17, 18]. Surprisingly, as presented in [14], the local structural properties of communities are less important than the mesoscopic community view. In the context of hashtag propagation over the Twitter network, it was shown that the spreading of less popular hashtags should be described as complex contagion, subject to homophily and social reinforcement, while virals propagate similarly to diseases [17]. Apart from communities, the structure of subgraphs induced by early adopters is also crucial in modeling future meme dynamics. In [3], the authors perform extensive feature engineering for hashtag virality prediction and build a classification model which uses subgraph conductance as one of the most discriminative features. More recently, in [6], the authors analyze structural properties of meme adoption graphs (MAGs) and formulate a generic MAG-based framework for comparing spreading models.

In our work we were inspired by structural models taking into account social reinforcement and homophily [17, 18], but we decided to additionally include a negative feedback loop related to overused memes.

6 Conclusion

We demonstrated that hashtag spreading is a complex process which cannot be accurately modeled based on the network structure only. Many factors, such as intra-community activity or popularity, play an important role. In general, our models taking into account meme overexposure follow the dynamics of hashtag spreading more precisely than the models from other works. The unique information overexposure model (M5) performs best when compared to the ground truth dynamics obtained from all hashtags.

The good results obtained for the M9 model should also be emphasised. This model takes the overexposure constraint into account by generating a tweet with a probability inversely proportional to the number of exposures. This is the clearest indication that a negative feedback loop for social reinforcement should be considered when modeling meme spreading.

As visible in the large error values and the negative or insignificant correlations, hashtags with higher than average intra-community activity are the hardest to reproduce. Interestingly, the overexposure models have smaller RMSE values than the social reinforcement and homophily models. Spreading with lower than average intra-community activity is overall best described by the homophily model (M4). This is mostly due to its accuracy for unpopular hashtags. For virals, defined as having popularity greater than 500 spreading events, M5 and M7 (the local increase of generation probability model) are much more precise. The accuracy of the M5 model is an order of magnitude better than that obtained for the M2–M4 models.

To obtain a more complete overview of hashtag dynamics, further work should focus on analysing the timescale of spreading and the stability of the solutions when working with missing data.