Abstract
The Yule-Simpson paradox refers to the fact that the outcomes of comparisons between groups may be reversed when the groups are combined. Using country-level data from Essential Science Indicators, a part of InCites (Clarivate), it is shown that although the Yule-Simpson phenomenon in citation analysis and research evaluation is not common, it is not extremely rare either. The Yule-Simpson paradox is a phenomenon one should be aware of; otherwise one may encounter unforeseen surprises in scientometric studies.
Introduction: COVID-19 victims
This work is meant as an illustration of Simpson’s paradox, also known as the Yule-Simpson paradox (Yule 1903; Simpson 1951). We use the COVID-19 pandemic as an occasion to show how some basic mathematical observations apply to many aspects of life, in this case the victims of the COVID-19 pandemic and the scientific contributions of countries as measured by citations per publication.
On June 22, 2020, R.R.’s local Flemish newspaper, De Standaard, mentioned that in every age group men have a higher COVID-19 infection fatality rate (IFR) than women, but in the total population of Belgium women have a higher IFR, see Table 1, as taken from this newspaper article (De Smet 2020). The IFR is the probability that one dies, given that one is infected.
The fact that in every age group men have a higher IFR than women, while the opposite holds when all age groups are brought together, seems contradictory. This phenomenon is well-known in statistics as Simpson’s paradox. Being a quality newspaper, De Standaard also mentioned that the data of Table 1 reflect Simpson’s paradox (a term not often used in dailies). In this case, the underlying reason is that there are many more older women than older men in Belgium. The reporter got his information from a then unpublished article by Flemish colleagues (Molenberghs et al. 2020). The data in Table 1 have since been updated, but this does not detract from the fact that in June 2020 the best available data exhibited the Yule-Simpson paradox.
The Yule-Simpson paradox
Simpson’s theoretical example
The Yule-Simpson paradox (Yule 1903; Simpson 1951; Blyth 1972; Gardner 1976) is an expression of a counterintuitive result that may occur in statistical aggregations. The paradox refers to the fact that outcomes of comparisons between groups are reversed when groups are combined. Real-world examples have been observed in surgery (Charig et al. 1986), clinical trials (Rücker and Schumacher 2008), ecological studies (Allison and Goldberg 2002; Clark et al. 2011) and citation analysis (Ramanana-Rahary et al. 2009), among others.
Let us consider the following example shown by Simpson (1951). It appears that the two sets of data separately support a certain hypothesis, but, considered together, support the opposite hypothesis. Simpson provided the following fictitious case related to the outcome of a medical treatment (Table 2).
There are 52 cases in total. Among the male population 4/7 ≈ 0.57 of the untreated survived, while 8/13 ≈ 0.62 of the treated ones did. Hence the treatment had a positive effect among males. Among the females, 2/5 = 0.4 of the untreated survived, while 12/27 ≈ 0.44 of the treated ones did. So, also among the female population, the treatment had a positive effect. However, if we consider the whole population (bringing males and females together), we see that among the untreated 6 survived and 6 died, and among the treated 20 survived and 20 died, which points to no effect of the treatment.
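The arithmetic of Simpson’s example can be checked with a few lines of Python. This is only a minimal sketch: the counts are the ones quoted in the paragraph above, while the variable and function names are ours and purely illustrative.

```python
from fractions import Fraction

# (survivors, cases) per stratum, using the counts quoted in the text
untreated = {"male": (4, 7), "female": (2, 5)}
treated = {"male": (8, 13), "female": (12, 27)}


def rate(survivors, cases):
    """Exact survival rate for one group."""
    return Fraction(survivors, cases)


def pooled(groups):
    """Survival rate after merging all strata of one treatment arm."""
    survivors = sum(s for s, _ in groups.values())
    cases = sum(n for _, n in groups.values())
    return Fraction(survivors, cases)


for sex in ("male", "female"):
    # Within each stratum the treated group has the higher survival rate
    print(sex, float(rate(*untreated[sex])), float(rate(*treated[sex])))

# After pooling, both arms have exactly the same survival rate
print("pooled", float(pooled(untreated)), float(pooled(treated)))
```

Exact fractions are used instead of floating-point numbers so that the pooled equality is tested without rounding.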
The general framework
The Yule-Simpson paradox occurs in the following situation. Three stochastic variables are involved: X, Y and Z. In Simpson’s example X takes two values: surviving or not surviving; Y also takes two values: being treated or not; and Z represents males or females (these are the ones that are brought together). For the COVID-19 case, X represents dying or not from COVID-19; Y represents males and females and Z represents different age groups (here the age groups are brought together).
Now the Yule-Simpson paradox occurs if the following happens (Blyth 1972): X takes values A and A’ (the complement of A); Y takes values B and B’ (the complement of B); Z takes values C_{1}, C_{2}, C_{3}, … (and if there are only two outcomes possible, we denote them by C and C’).
For all j = 1, 2, 3, … : P(A | B and C_{j}) > P(A | B’ and C_{j})
and yet: P(A | B) ≤ P(A | B’)
Here P(. | .) represents a conditional probability. We also say that the Yule-Simpson paradox occurs if:
For all j = 1, 2, 3, … : P(A | B and C_{j}) ≥ P(A | B’ and C_{j})
and yet: P(A | B) < P(A | B’)
Intuitively one might think that, as P(A | B) is an average of the P(A | B and C_{j}), and similarly P(A | B’) is an average of the P(A | B’ and C_{j}), the paradox is not possible. Yet, the point is that these averages have different weightings (Blyth 1972). We further note that if Y and Z are independent then the Yule-Simpson paradox is not possible (Blyth 1972).
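The remark about different weightings can be made explicit with the law of total probability. The display below is our own spelling-out of that step, not a formula quoted from Blyth (1972); it assumes that the C_{j} form a partition and that all conditioning events have positive probability.

```latex
\begin{align*}
P(A \mid B)  &= \sum_{j} P(A \mid B \cap C_j)\, P(C_j \mid B), \\
P(A \mid B') &= \sum_{j} P(A \mid B' \cap C_j)\, P(C_j \mid B').
\end{align*}
```

The two averages use the weights P(C_{j} | B) and P(C_{j} | B’), which generally differ. If Y and Z are independent, both sets of weights reduce to P(C_{j}), so a strict inequality in every stratum carries over to the aggregate and no reversal can occur.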
The Yule-Simpson paradox and its interpretation can be illustrated graphically using so-called median fractions. As we did that already in our previous article (Ramanana-Rahary et al. 2009), published in this journal, we refer the interested reader to that publication.
A short overview of some historical cases of the Yule-Simpson paradox
As suggested by a reviewer we provide some details on other historical cases of the Yule-Simpson paradox.
A well-known historical example relates to tuberculosis deaths in 1910. Referring to Cohen and Nagel (1934, page 449), Wagner (1982) shows that although the overall mortality rate was lower in New York City than in Richmond (VA), the opposite held when data were stratified into whites and non-whites.
One of the best-known examples of the Yule-Simpson paradox is a study of possible gender bias among graduate school admissions in 1973 to the University of California, Berkeley. On the whole, male students were more likely than female ones to be admitted. However, when examining the individual departments, it appeared that six out of 85 departments were significantly biased against male applicants, whereas four were significantly biased against female ones. A detailed study of the data by Bickel et al. (1975) revealed that female students tended to apply to more competitive departments with low rates of admission whereas men tended to apply to less competitive departments with high rates of admission. It was concluded that there was no bias from the side of the university, but a selection bias on the part of the applicants.
Julious and Mullee (1994) analyzed data obtained by Charig et al. (1986) on the effectiveness of two treatments to remove kidney stones (open surgery vs. percutaneous nephrolithotomy). The new technique, percutaneous nephrolithotomy, appeared more successful on the whole, but stratification by the size of the kidney stones led to the opposite conclusion. The confounding factor was that the surgeons’ choice of treatment was not random but influenced by the size of the stone. This example supports the necessity of using randomized trials.
We next illustrate this example with the real data in the form of a contingency table (Table 3).
The stochastic variable X takes the values successful or not; Y takes the values open surgery or percutaneous nephrolithotomy and Z takes the values large stones or small ones.
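As Table 3 is not reproduced in this text, the sketch below uses the success counts as they are commonly reported for the Charig et al. (1986) data. These numbers are an assumption on our side and should be checked against Table 3; the names used in the code are illustrative only.

```python
from fractions import Fraction

# (successes, patients) per stone size.
# ASSUMPTION: counts as commonly reported for Charig et al. (1986);
# verify them against Table 3 before relying on them.
open_surgery = {"small": (81, 87), "large": (192, 263)}
percutaneous = {"small": (234, 270), "large": (55, 80)}


def success_rate(successes, patients):
    """Exact success rate for one stratum."""
    return Fraction(successes, patients)


def overall(treatment):
    """Success rate of a treatment after merging both stone sizes."""
    successes = sum(s for s, _ in treatment.values())
    patients = sum(n for _, n in treatment.values())
    return Fraction(successes, patients)


for size in ("small", "large"):
    # Stratified comparison: open surgery versus percutaneous nephrolithotomy
    print(size,
          round(float(success_rate(*open_surgery[size])), 3),
          round(float(success_rate(*percutaneous[size])), 3))

# Aggregated comparison over all stone sizes
print("all", round(float(overall(open_surgery)), 3),
      round(float(overall(percutaneous)), 3))
```

Under these assumed counts, open surgery has the higher success rate for small as well as for large stones, whereas percutaneous nephrolithotomy has the higher rate once the two strata are merged.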
Finally, we discuss Yule’s original example (Yule 1903, p. 133). This case is related to the study of inheritance and is formulated differently. Yule provides (fictitious) data on a trait that is hereditary in neither the male nor the female line, but occurs with a different probability in each (Tables 4, 5 and 6). Bringing the data together in equal proportions suggests inheritance, which is the wrong conclusion.
Whether or not the father has the trait, the probability that his son has it is 25/50 = 50%; whether or not the mother has the trait, the probability that her daughter has it is 1/10 = 9/90 = 10%. Yet making these calculations in the sum table yields 26/60 ≈ 43% and 34/140 ≈ 24%. Yule writes that a large but illusory inheritance is created simply by mixing the two distinct records. He then warns against pooling data about heterogeneous material in general.
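The pooling step in Yule’s example can be reproduced as follows. The cell counts are read off the fractions quoted above (25/50, 1/10, 9/90); the dictionary layout and the function names are our own and only serve as an illustration.

```python
from fractions import Fraction

# (children with the trait, all children), keyed by whether the parent has the trait
father_son = {True: (25, 50), False: (25, 50)}
mother_daughter = {True: (1, 10), False: (9, 90)}


def share(cell):
    """Fraction of children showing the trait in one cell."""
    with_trait, total = cell
    return Fraction(with_trait, total)


def merge(table_1, table_2):
    """Add two records cell by cell, as in the sum table."""
    return {
        key: (table_1[key][0] + table_2[key][0], table_1[key][1] + table_2[key][1])
        for key in table_1
    }


combined = merge(father_son, mother_daughter)

# Within each record the trait is independent of the parent's status ...
print(share(father_son[True]) == share(father_son[False]))
print(share(mother_daughter[True]) == share(mother_daughter[False]))
# ... but the merged record shows an apparent association
print(round(float(share(combined[True])), 3), round(float(share(combined[False])), 3))
```

The merged cells are 26 out of 60 and 34 out of 140, which are the two fractions given in the text.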
Mittal (1991) refers to this form of the paradox as Yule’s association paradox, while he refers to the case shown by Simpson (1951) as Yule’s reversal paradox (because the signs in the aggregated table are reversed). Mittal (1991) quotes Nagel as the source for attaching the name of Yule to these two types of paradoxes.
An interpretation related to impact
Direct impact: a fictitious example
The boxes in Table 7, taken from (Ramanana-Rahary et al. 2009), represent direct impact (citations per publication). Research is performed by two countries in two related disciplines. We add a row for ‘All countries’ (here the two countries). We see that Country 2 is better than Country 1 in Discipline 1 as well as in Discipline 2. Yet adding the results leads to the opposite conclusion.
Relative impact: an example
The example above also produces an inversion for relative impacts. If, instead of comparing Country 1 and Country 2, we compare each country separately to their aggregate ‘All countries’, say ‘the World’, we see that in the above example (score of Country 1) < (score of All countries) in Disciplines 1 and 2, i.e. the relative impact of Country 1 with respect to the world is below unity, but (score of Country 1) > (score of All countries) for all disciplines together, i.e. its world relative impact is above unity. For Country 2 the opposite holds. Numerical values of relative impacts are given in Table 8.
An abstract framework
Let us put this in an abstract framework. The Yule-Simpson paradox occurs if Table 9 is given, together with the requirements that A/U < C/W and B/V < D/X while (A + B)/(U + V) ≥ (C + D)/(W + X).
Note that A, B, C, D, U, V, W and X are given, not just the numerical values of the fractions. From now on, we assume that the reader understands the Yule-Simpson paradox and hence we will simply refer to it as the Yule-Simpson phenomenon.
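The three requirements can be turned into a small test. The sketch below assumes a particular reading of Table 9, which is not reproduced here: A and B are the citations of Player 1 in Disciplines 1 and 2 with U and V the corresponding numbers of publications, and C, D, W and X play the same roles for Player 2. The function name and the sample numbers are hypothetical and are not taken from the tables of this article.

```python
from fractions import Fraction


def yule_simpson(a, u, b, v, c, w, d, x):
    """Check A/U < C/W and B/V < D/X while (A+B)/(U+V) >= (C+D)/(W+X)."""
    behind_in_1 = Fraction(a, u) < Fraction(c, w)
    behind_in_2 = Fraction(b, v) < Fraction(d, x)
    pooled_not_behind = Fraction(a + b, u + v) >= Fraction(c + d, w + x)
    return behind_in_1 and behind_in_2 and pooled_not_behind


# Hypothetical data: Player 1 publishes mostly in the high-impact discipline,
# Player 2 mostly in the low-impact one.
print(yule_simpson(a=10, u=10, b=270, v=90, c=99, w=90, d=31, x=10))

# Same impacts per discipline, but equal publication counts everywhere:
# the weights coincide and no reversal occurs.
print(yule_simpson(a=10, u=10, b=30, v=10, c=11, w=10, d=31, x=10))
```

The function needs the eight raw counts and not only the four ratios, which is the point of the note above: the same ratios can lead to a reversal or not, depending on how the publications are distributed over the disciplines.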
We recall from (Ramanana-Rahary et al. 2009) two simple mathematical results, using the general term ‘player’ instead of ‘country’.
Proposition 1
The Yule-Simpson phenomenon is present for the pair (Player 1, Player 2) for direct impact, if and only if it is also present for relative impact.
This follows immediately from the fact that if both sides of an inequality are multiplied or divided by the same positive number, the inequality is preserved.
Proposition 2
The Yule-Simpson phenomenon for the pair (Player 1, Player 2) is present if and only if it is present for the pair (Player 1, Both Players).
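A short argument for Proposition 2 runs as follows. It is our own sketch, not the proof given in (Ramanana-Rahary et al. 2009), and it assumes that the score of ‘Both Players’ is obtained by simply adding numerators and denominators, with all denominators positive.

```latex
\begin{align*}
\frac{A}{U} < \frac{C}{W}
  &\iff AW < CU \\
  &\iff AU + AW < AU + CU \\
  &\iff A\,(U + W) < U\,(A + C) \\
  &\iff \frac{A}{U} < \frac{A + C}{U + W}.
\end{align*}
```

The same chain holds with ≥ in place of <, and with (B, V, D, X) or with the pooled values (A + B, U + V, C + D, W + X) in place of (A, U, C, W). Hence each of the three inequalities defining the phenomenon for (Player 1, Player 2) is equivalent to the corresponding inequality for (Player 1, Both Players).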
Real-world citation examples
The Essential Science Indicators
The examples we will show are retrieved from the Essential Science Indicators (ESI). Data from the Science Citation Index Expanded (SCIE) and the Social Sciences Citation Index (SSCI) in the Web of Science (WoS) Core Collection are subdivided into 22 broad fields based on publication and citation performance (Essential Science Indicators 2020). These 22 broad fields are shown in the appendix (Table 16). Data, only articles and reviews, cover a rolling 10-year period and include bimonthly updates. For our investigation, it is important to recall that articles are classified according to the journal in which they are published and that each journal is assigned to only one field. Multidisciplinary journals, however, are an exception to this rule: here a reclassification is performed at the paper level, based on an analysis of the cited references. Data were collected in September 2020. We restricted data to countries that have at least 500 publications (over a 10-year period), except for a few cases where we compared a country with all the other countries in the database, for which the Yule-Simpson phenomenon occurs rarely.
The role of a discipline (as in Table 9) is played by one ESI field. We do not intend to be complete and to combine each ESI field with each other ESI field (it makes little sense to combine, e.g., chemistry with social sciences, general). We just provide some examples in fields for which it may be acceptable to combine them (Tables 10, 11, 12, 13 and 14).
Differences in impact (cites per publication) are often rather small so that one may say that they are not statistically significant. Yet, we do not step into the statistical morass of significance testing (Schneider 2015) and just stick to rankings.
Examples where the Yule-Simpson phenomenon occurs
Although it is a rather rare event, we also found examples of the Yule-Simpson phenomenon between a country and all other countries in the database, see Table 15. Here, we included cases with fewer than 500 publications.
Remarks

1. We found several more examples involving countries and fields with fewer publications.
2. The cases shown in this contribution are just examples of a phenomenon that might not be well-known to all colleagues. We did not check the ‘correctness’ of the data in the database used.
3. Countries that are compared have relative citations that do not differ much, although their absolute numbers of publications and citations may differ considerably. As countries are rarely compared in this way, this leads to unexpected ‘relatives’. So we see India and China, the Netherlands and England, Qatar and Argentina and the USA and Greece, to name a few.
4. The ESI categories are disjoint and hence it makes sense to add publications and citations. A similar exercise is not directly possible with WoS categories or SCImago categories.
5. InCites uses whole counting and hence, when two countries are compared, a part of their data overlaps. The Yule-Simpson phenomenon between two countries might or might not occur if fractional counting were used. Moreover, if one removes all articles co-authored by the two countries, then again the Yule-Simpson phenomenon may or may not occur. Indeed, inequalities may reverse when joint publications and their citations are removed. Let A/U < C/W, as in 400/300 < 500/350. If these countries have 200 publications with 100 citations in common and these are removed, then we have 300/100 > 400/150, with the inequality reversed.
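The numerical illustration in the last remark can be verified directly. The sketch below uses only the numbers given there (400/300, 500/350, and 200 joint publications with 100 citations); the function names are ours.

```python
from fractions import Fraction


def impact(citations, publications):
    """Citations per publication as an exact fraction."""
    return Fraction(citations, publications)


def impact_without_joint(citations, publications, joint_citations, joint_publications):
    """Impact after removing the co-authored publications and their citations."""
    return impact(citations - joint_citations, publications - joint_publications)


# Whole counting: the joint publications are credited to both countries
before_1, before_2 = impact(400, 300), impact(500, 350)

# The 200 joint publications with 100 citations are removed from both countries
after_1 = impact_without_joint(400, 300, 100, 200)
after_2 = impact_without_joint(500, 350, 100, 200)

print(before_1 < before_2)  # the first country trails under whole counting
print(after_1 > after_2)    # the order is reversed once joint work is removed
```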
Conclusions
Although the Yule-Simpson phenomenon in citation analysis is not common, it is not extremely rare either. This is shown in this contribution. It is a phenomenon one should be aware of; otherwise one may encounter unforeseen surprises. Assume, for instance, that it is the scientific aim of a country to do better, citation-wise, than the world average in the two related fields F_{1} and F_{2}. Then this aim may be reached for the union of the two fields, but for neither of the fields separately. Such a possibility is just a mathematical fact. The COVID-19 example and the historical examples illustrated that the Yule-Simpson phenomenon may occur in any aspect of life.
From the historical examples, we learned that one can make a distinction between two cases. Sometimes, such as in the Berkeley students case and in the kidney stone case, there is a clear (human) selection procedure at work and it makes no sense to aggregate data. Sometimes, as in the COVID-19 example, there is a natural stratification (age groups), but again it is not important at all to collect information on the aggregated data. So, in general, we think that it is not a good idea to aggregate data as it leads to a clear loss of information.
In the citation analysis presented here, we artificially aggregated fields, yet these fields themselves are aggregates and we did not try to find the relation, e.g., between mathematics and its subfields (algebra, geometry, topology, analysis, etc.). So for citation analysis, the answer to the question “Should one aggregate or not?” depends on the aim of the investigation.
As an aside we showed that in terms of relative citations, i.e. citations per publication, large, well-known countries such as England and the USA may, in some fields, become comparable with smaller ones such as the Netherlands and Greece.
References
Allison, V. J., & Goldberg, D. E. (2002). Species-level versus community-level patterns of mycorrhizal dependence on phosphorus: an example of Simpson’s paradox. Functional Ecology, 16(3), 346–352.
Bickel, P. J., Hammel, E. A., & O’Connell, J. W. (1975). Sex bias in graduate admissions: Data from Berkeley. Science, 187(4175), 398–404.
Blyth, C. R. (1972). On Simpson’s paradox and the sure-thing principle. Journal of the American Statistical Association, 67(338), 364–366.
Charig, C. R., Webb, D. R., Payne, S. R., & Wickham, O. E. (1986). Comparison of treatment of renal calculi by operative surgery, percutaneous nephrolithotomy and extracorporeal shock wave lithotripsy. BMJ, 292(6524), 879–882.
Clark, J. S., Bell, D. M., Hersh, M. H., Kwit, M. C., Moran, E., Salk, C., et al. (2011). Individual-scale variation, species-scale differences: inference needed to understand diversity. Ecology Letters, 14(12), 1273–1287.
Cohen, M. R., & Nagel, E. (1934). An Introduction to Logic and Scientific Methods. New York: Harcourt, Brace and World.
De Smet, D. (2020). Is corona erger dan de griep? (Is corona worse than the flu?) De Standaard, 22 June 2020.
Essential Science Indicators. (2020). Essential science indicators. Clarivate analytics. Retrieved from https://clarivate.com/webofsciencegroup/solutions/essentialscienceindicators/. Accessed September 2020.
Gardner, M. (1976). Mathematical games: On the fabric of inductive logic and some probability paradoxes. Scientific American, 234(3), 119–124.
Julious, S. A., & Mullee, M. A. (1994). Confounding and Simpson’s paradox. BMJ, 309(6967), 1480–1481.
Mittal, Y. (1991). Homogeneity of subpopulations and Simpson’s paradox. Journal of the American Statistical Association, 86(413), 167–172.
Molenberghs, G., Faes, C., Aerts, J., Theeten, H., Devleesschauwer, B., Bustos Sierra, N., et al. (2020). Belgian COVID-19 mortality, excess deaths, number of deaths per million, and infection fatality rates. MedRxiv Preprint Server for Health Sciences. https://doi.org/10.1101/2020.06.20.20136234.
Ramanana-Rahary, S., Zitt, M., & Rousseau, R. (2009). Aggregation properties of relative impact and other classical indicators: convexity issues and the Yule-Simpson paradox. Scientometrics, 79(2), 311–327.
Rücker, G., & Schumacher, M. (2008). Simpson’s paradox visualized: The example of the Rosiglitazone meta-analysis. BMC Medical Research Methodology, 8, 34.
Schneider, J. W. (2015). Null hypothesis significance tests. A mix-up of two different theories: the basis for widespread confusion and numerous misinterpretations. Scientometrics, 102(1), 411–432.
Simpson, E. H. (1951). The interpretation of interaction in contingency tables. Journal of the Royal Statistical Society Series B, 13(2), 238–241.
Wagner, C. H. (1982). Simpson’s paradox in real life. The American Statistician, 36(1), 46–48.
Yule, G. U. (1903). Notes on the theory of association of attributes in statistics. Biometrika, 2(2), 121–134.
Acknowledgements
The authors thank the reviewers for useful suggestions to improve the original manuscript.
Appendix
See Table 16.
Cite this article
Wang, Z., Rousseau, R. COVID-19, the Yule-Simpson paradox and research evaluation. Scientometrics (2021). https://doi.org/10.1007/s1119202003830w
Keywords
 Yule-Simpson paradox
 Relative citations
 Scientometric comparisons between countries