5.1 Datasets
To ensure the generalizability of the insights from our coherence study, we evaluate coherence scores over two datasets. The first, the ISOT fake news dataset, is a publicly available dataset comprising 10k+ articles focused on politics. The second, which we curated ourselves, comprises 1k articles on health and well-being (HWB) drawn from various online sources. The two datasets differ markedly in both size and topical focus. It may be noted that the nature of our task makes datasets comprising long text articles more suitable: most fake news datasets are built from tweets and are thus sparse with respect to text data (Footnote 5), making them unsuitable for textual coherence studies such as ours. We therefore limit our attention to the two aforementioned datasets, both of which comprise full textual articles, and describe each of them below.
ISOT Fake News Dataset. The ISOT Fake News dataset (Footnote 6) [1] is the largest public dataset of textual articles with fake and legitimate labellings that we are aware of. The dataset spans several categories of articles, of which politics is the only category that appears under both the fake and real/legitimate labellings. We therefore use the politics subset from both the fake and legitimate categories for our study. Details of the dataset, along with statistics from the sentence segmentation and entity linking, appear in Table 1.
HWB Dataset. Health and well-being is another domain frequently targeted by fake news sources; fake news on topics such as vaccinations has raised significant concerns [13] in recent times, not least during the COVID-19 pandemic. We therefore curated a set of articles with fake/legitimate labellings tailored to the health domain. For the legitimate news articles, we crawled 500 news documents on health and well-being from reputable sources such as CNN, NYTimes, Washington Post and New Indian Express. For fake news, we crawled 500 articles on similar topics from well-reported misinformation websites such as BeforeItsNews, Nephef and MadWorldNews. All articles were manually verified for category suitability, avoiding blind reliance on source-level labellings. This dataset, which we refer to as HWB (short for health and well-being), will be made available at https://dcs.uoc.ac.in/cida/resources/hwb.html. HWB dataset statistics also appear in Table 1.
On the Article Size Disparity. The disparity in article sizes between fake and legitimate news articles deserves reflection in the context of our comparative study. In particular, coherence assessments may be influenced by the number of sentences and entity references within each article: since all pairs of sentences/entities enter the comparison, coherence quantification may intuitively be expected to yield lower values for longer documents than for shorter ones. Our study stems from the hypothesis that fake articles are less coherent. Given that the fake articles in our datasets are the shorter ones, the null hypothesis (that coherence is similar across fake and legitimate news), if true, would yield higher coherence scores for the fake news documents. Consequently, any empirical evidence of lower coherence scores for fake articles, as we observe in our results, indicates a stronger departure from the null hypothesis than it would in a dataset where fake and legitimate articles were of similar sizes. The article length distribution thus only deepens the significance of our finding of lower coherence among fake articles.
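For concreteness, if coherence is quantified as an average over all pairs, as we assume here purely for illustration, the score for a document \(d\) with sentence (or entity) representations \(s_1, \ldots, s_n\) takes the form
\[
\mathit{coherence}(d) = \frac{2}{n(n-1)} \sum_{1 \le i < j \le n} \mathit{sim}(s_i, s_j).
\]
Although this is an average and does not shrink mechanically with \(n\), distant pairs tend to be less related, and longer documents contribute proportionally more such distant pairs among the \(\binom{n}{2}\) comparisons, pulling the average down.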
Table 1. Dataset Statistics (SD = Standard Deviation)
5.2 Experimental Setup
We now describe the details of the experimental setup we employed. The code was written in Python. NLTK (Footnote 7), a popular natural language toolkit, was used for sentence splitting and further processing. The word embedding coherence assessments were performed using Google's pre-trained word2vec model (Footnote 8), which was trained over news articles. Explicit Semantic Analysis (ESA) was run using the publicly available EasyESA implementation (Footnote 9). For the entity linking method, named entities were identified using the NLTK toolkit, and their vectors were looked up in the Wikipedia2vec (Footnote 10) pre-trained model (Footnote 11).
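As an illustration of how such a pipeline can be wired up, the following is a minimal sketch of the word embedding coherence computation. The model filename and the mean-pairwise-cosine aggregation are illustrative assumptions, not our exact implementation.

```python
import numpy as np
from itertools import combinations
from nltk.tokenize import sent_tokenize, word_tokenize  # requires nltk.download('punkt')
from gensim.models import KeyedVectors

# Google's pre-trained news-domain word2vec model (filename is an assumption).
w2v = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin.gz", binary=True)

def sentence_vector(sentence):
    """Average of the word2vec vectors of in-vocabulary tokens
    (one plausible sentence representation)."""
    vecs = [w2v[tok] for tok in word_tokenize(sentence) if tok in w2v]
    return np.mean(vecs, axis=0) if vecs else None

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def word_embedding_coherence(article_text):
    """Mean pairwise cosine similarity over sentence vectors."""
    vecs = [v for v in map(sentence_vector, sent_tokenize(article_text))
            if v is not None]
    if len(vecs) < 2:
        return None  # coherence undefined for single-sentence articles
    return float(np.mean([cosine(u, v) for u, v in combinations(vecs, 2)]))

# For the entity linking variant, entity mentions identified via NLTK's
# ne_chunk can be looked up on a pre-trained Wikipedia2vec model, e.g.:
#   from wikipedia2vec import Wikipedia2Vec
#   wiki2vec = Wikipedia2Vec.load("enwiki_w2v_model.pkl")  # hypothetical file
#   entity_vec = wiki2vec.get_entity_vector("Influenza vaccine")
```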
5.3 Analysis of Text Coherence
We first analyze the mean and standard deviation of the coherence scores of fake and legitimate news articles as assessed by each of our three methods; higher scores indicate higher textual coherence. Table 2 summarizes the results. In each of the three coherence assessments across the two datasets, i.e., six combinations overall, the fake news articles were found to be less coherent than the legitimate ones on average. The difference was statistically significant with \(p<0.05\) under the two-tailed t-test (Footnote 12) in five of the six combinations, with very low p-values (i.e., strong differences) in many cases.
Table 2. Coherence Results Summary (SD = Standard Deviation). Statistically significant results with \(p<0.05\) (two-tailed t-test) in bold. XE-Y is a commonly used mathematical abbreviation to stand for \(X\times 10^{-Y}\)
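For reference, the significance assessment amounts to a standard two-sample t-test over the per-article coherence scores; a minimal SciPy sketch follows, with purely illustrative score arrays rather than actual values from our datasets.

```python
from scipy.stats import ttest_ind

# Illustrative per-article coherence scores only; not values from our datasets.
fake_scores  = [0.52, 0.48, 0.55, 0.50, 0.47, 0.53, 0.49, 0.51]
legit_scores = [0.58, 0.61, 0.57, 0.60, 0.59, 0.62, 0.56, 0.60]

t_stat, p_value = ttest_ind(fake_scores, legit_scores)  # two-tailed by default
print(f"t = {t_stat:.3f}, p = {p_value:.3g}")  # significant if p < 0.05
```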
Trends Across Methods. The largest differences in means are observed for the word embedding coherence scores, with legitimate news articles being around 4% and 8% more coherent than fake news articles in the ISOT and HWB datasets respectively. The lowest p-values are observed for the entity linking method, where legitimate articles are 3+% more coherent than fake news articles; for the ISOT dataset, the p-value being in the region of 1E-100 indicates a highly consistent difference in coherence between fake and legitimate news articles. The coherence scores for ESA, on the other hand, vary only slightly in magnitude between fake and legitimate news articles. This is likely because ESA is primarily intended to separate articles from different domains; articles within the same domain are thus often judged to be very close to each other, as we will see in the more detailed analysis in a later section. These smaller differences are still statistically significant for the ISOT dataset (p-value \(<0.05\)), but not for the much smaller HWB dataset. It may be noted that statistical significance assessments depend on the degrees of freedom, roughly interpretable as the number of independent samples; this makes statistical significance harder to attain on small datasets, as the sketch below illustrates. Overall, these results also indicate that the word embedding perspective is the best suited, among the three methods, to discern textual coherence differences between legitimate and fake news.
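The degrees-of-freedom effect can be seen directly from summary statistics: with identical group means and standard deviations, a smaller sample yields a much larger p-value. A sketch using SciPy's summary-statistics t-test, with illustrative numbers loosely in the ESA score range:

```python
from scipy.stats import ttest_ind_from_stats

# Identical group means and SDs; only the sample sizes differ (illustrative numbers).
for n in (50, 500, 10000):
    t, p = ttest_ind_from_stats(mean1=0.9990, std1=0.004, nobs1=n,
                                mean2=0.9992, std2=0.004, nobs2=n)
    print(f"n = {n:5d}: p = {p:.3g}")  # the same gap becomes significant as n grows
```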
Coherence Spread. The spread of coherence scores across articles was generally broader for fake news articles than for legitimate ones, as evidenced by the higher standard deviations in the majority of cases. From observations over the datasets, we find that the coherence scores for legitimate articles generally form a unimodal distribution, whereas the distribution for fake news articles shows some minor deviations from unimodality. In particular, a small number of scores cluster in the low range of the spectrum, while a much larger set forms a unimodal distribution centered at a lower score than the corresponding centre for the legitimate news articles. This difference in character, detailed further in the next section, is reflected in the higher standard deviations of coherence scores for the fake news documents.
Coherence Score Histograms. We further visualize the nature of the coherence score distributions for the various methods by way of histogram plots, shown in Figs. 1, 2 and 3 for the ISOT dataset; the corresponding histograms for the HWB dataset appear in Figs. 4, 5 and 6. The histogram buckets were chosen so as to magnify the region where most documents fall, since scores concentrate around different ranges for different methods: around 0.5 for word embedding scores, 0.999 for ESA scores, and around 0.3 for entity linking scores. Since the numbers of articles differ across the fake and legitimate subsets, the Y-axis indicates the percentage of articles in each range (rather than raw frequency counts), to enable meaningful visual comparison. Documents falling outside the plotted range are folded into the leftmost or rightmost pillar of the histogram as appropriate; a minimal plotting sketch follows.
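The sketch below reproduces the plotting choices just described (clamping out-of-range scores into the edge buckets and normalizing the Y-axis to percentages); the bucket ranges and function name are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_coherence_histograms(fake_scores, legit_scores, lo, hi, n_bins=20):
    """Overlayed percentage-normalized histograms of coherence scores."""
    bins = np.linspace(lo, hi, n_bins + 1)
    for scores, colour, label in [(fake_scores, "red", "fake"),
                                  (legit_scores, "green", "legitimate")]:
        clipped = np.clip(np.asarray(scores), lo, hi)  # fold outliers into edge pillars
        weights = np.full(len(clipped), 100.0 / len(clipped))  # Y-axis as % of articles
        plt.hist(clipped, bins=bins, weights=weights,
                 color=colour, alpha=0.5, label=label)
    plt.xlabel("coherence score")
    plt.ylabel("% of articles")
    plt.legend()
    plt.show()

# e.g. plot_coherence_histograms(fake, legit, lo=0.4, hi=0.7)  # word embedding range
```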
The most striking high-level trend across the six histograms is as follows: following the histogram pillars from left to right, the red pillar (corresponding to fake news articles) is consistently taller than the green pillar (corresponding to legitimate news articles) until a point beyond which the trend reverses, with the green pillar consistently taller than the red pillar thereafter. Thus, the lower coherence scores cover a larger fraction of fake articles than legitimate ones, and vice versa. For example, this point of reversal is at 0.55 for Fig. 1, 0.9997 for Fig. 3 and 0.9996 for Fig. 5. Only two histogram points deviate from this trend, both for the entity linking method: 0.36 in Fig. 3 and 0.34 in Fig. 6. Even in those cases, the high-level trend remains consistent with our analysis.
The second observation, albeit unsurprising, is the largely single-peaked (i.e., unimodal) distribution of coherence scores in each of the six charts for both fake and legitimate news articles; this also vindicates our choice of test statistic in the previous section, the t-test being best suited for comparing unimodal distributions. A slight departure from unimodality, as alluded to earlier, is visible for the fake article coherence scores. This is most pronounced for the ESA method, with the leftmost red pillar being quite tall in Figs. 2 and 5; note, however, that the leftmost pillar counts all documents scoring below its range, so the small peak at the lower end of the fake news scores is overemphasized by the nature of the plots.
Thirdly, the red-green reversal trend observed earlier may be interpreted largely as an artifact of the relative positioning of the centres of the unimodal score distributions of fake and legitimate news articles. The red peak appears to the left (i.e., at a lower coherence score) of the green peak; this is easy to see by examining the red and green distributions separately in the various charts. For example, the red peak in Fig. 1 is at 0.55 whereas the green peak is at 0.60; similarly, the red peak in Fig. 3 is at 0.28 whereas the green peak is at 0.30. Similar trends are even easier to observe in the HWB results.
Summary. The main observations from our computational social science study are:

- Fake news articles are less coherent: The trends across all the coherence scoring mechanisms over two widely different datasets (different domains, different sizes) indicate that fake news articles are less coherent than legitimate news ones. The trends are statistically significant in all but one case.

- Word embeddings are most suited: Across our results, we observe that the word embedding based mechanism is the most suited, among our methods, to discern the difference in coherence between fake and legitimate news articles.

- Unimodal distributions with different peaks: The high-level trend points to a unimodal distribution of coherence scores (with a slight departure observed for fake articles), with the score distribution for fake news peaking at a lower score.