Introduction

The essence of much of the statistical treatment of data is making inferences about an underlying population from a sample. For example, to explore the foraging behaviour of bumblebees we might collect a sample of 25 Bombus terrestris and examine the relationship between distance from the nest and body mass in these 25 individuals. We might expect that heavier individuals forage more widely. A natural way to quantify such a relationship would be through Pearson’s product moment correlation coefficient (hereafter called Pearson’s r). Advice on the effective use of this statistical measure was recently summarised by Puth et al. (2014), who also presented the results of a survey of published papers suggesting that this measure of association is commonly used across biology. We found 26 papers published in Oecologia in the last 12 months for which a primary outcome of the study involved calculation of this statistic (see Supplementary Information).

In the hypothetical bumblebee example, interest lies not in the association between foraging range and body mass in this sample of 25 individuals, but in the underlying population. That is, we want to use the sample to make inferences about the association between these two traits in the underlying population of all individuals of this species that could theoretically have been included in the sample. In fact, Pearson’s r is unusual among commonly used statistical measures in that the sample measure is not an unbiased estimator of the population value. Specifically, the correlation measured on the sample tends to underestimate the correlation that exists in the whole population. This phenomenon is well known in the statistics literature (see below), but is generally not mentioned in statistics texts aimed at biologists. Consequently, the effect generally goes unacknowledged and unappreciated in the biology literature [but see the brief mention on p. 566 of Sokal and Rohlf (1981) and the fuller treatment in DeGhett (2014) for exceptions].

The large spatial scale at which ecologists work often makes manipulative experiments impractical, so correlative studies are more common than in fields such as animal behaviour. For this reason, it is vital that ecologists use the classical measure of correlation (Pearson’s r) as effectively as they can. Our aim here is to provide a summary of existing evidence, supplemented by our own investigations, to offer researchers in ecology clear advice on what to do about the bias in Pearson’s r.

Materials and methods

Review of the existing literature

A range of correction factors is available in the statistics literature that can be applied to the value of r calculated from a sample to reduce this bias, i.e., to make it more reflective of the population value. Shieh (2010) compared five such measures and found that the most effective was that of Olkin and Pratt (1958). Under this correction (generally called OPA after its original authors), the sample value r is replaced by OPA(r), where

$$ {\text{OPA}}(r) = r\left( {1 + \frac{{1 - r^{2} }}{{2\left( {N - 4} \right)}}} \right). $$
(1)

Here, N is the sample size. However, Shieh points out that while such corrections can reduce the bias in estimation, they can increase the mean square error (MSE). That is, the corrected value is less likely to be consistently lower than the population value, but will on average lie further from it (bias is reduced at the cost of reduced precision). Shieh further argued that the problem of increased MSE is particularly acute for weaker correlations, and offered the rule of thumb that if the magnitude of the sample r is less than 0.6 then no correction should be applied, because the issue of increased uncertainty dominates that of bias, whereas if the magnitude of r is greater than 0.6 the OPA correction should be considered. If the sample size is very small (ten or fewer), Sinsomboonthong et al. (2013) offer a correction based on jackknife sampling that might be more effective than OPA, but any improvement in performance is relatively modest compared with the considerable increase in computational complexity. Gorsuch and Lehmann (2010), on the basis of their simulations and a review of the literature, offer the rule of thumb that bias is strongest for moderate levels of r (magnitudes between 0.3 and 0.7), but that when N > 30 the issue of underestimation can be considered trivial. Zimmerman et al. (2003) also recommended the OPA correction after comparing it to alternatives in a simulation study (although note that they, in common with some other authors, use a formula with a “3” rather than a “4” in the denominator). Finally, although Pearson’s r is generally quite robust to deviations from the assumption of normality in the underlying traits (Bishara and Hittner 2012), corrections designed to reduce bias in bivariate normal data (such as OPA) increase bias when the underlying populations are non-normal (Bishara and Hittner 2015).
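
For concreteness, Eq. (1) can be expressed as a short function. The sketch below (in Python) is our own illustration, not code from any of the cited papers; the function name and example values are arbitrary.

```python
def opa_correction(r, n):
    """Approximate Olkin-Pratt (OPA) correction of Eq. (1) applied to a
    sample Pearson correlation r based on n observations (requires n > 4)."""
    return r * (1.0 + (1.0 - r ** 2) / (2.0 * (n - 4)))

# Shieh's (2010) rule of thumb: only consider the correction when |r| > 0.6,
# since for weaker correlations the inflation of mean square error outweighs
# the reduction in bias. The values below are purely illustrative.
print(opa_correction(0.75, 20))   # approximately 0.760
```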

Thus, on the basis of previous literature, it is already possible to offer clear advice to the researcher in many situations. Firstly, correction for the issue of underestimation should not be adopted if either or both of the underlying distributions deviate from normality: in such a situation the violation of the assumption of normality is more of a concern than the underestimation, alternative measures of association may be more appropriate, and Bishara and Hittner (2012) and Puth et al. (2014) provide clear advice on how to deal with this. Secondly, if the sample size is greater than around 30, then the issue of underestimation is trivial, and there is no benefit in complicating the analysis of data by applying a correction. In the next section, we focus on closing the gap in the literature by offering advice on correction for the situation where both distributions are well approximated by the normal distribution and the sample size is low (N ≤ 30). In our survey of 26 recent Oecologia papers, sample size was 30 or less in 6 cases and could not be determined from the paper in 12.

Plan of our simulation studies

We evaluate the performance of different statistical approaches over 1000 samples drawn from a population with normal marginal distributions and a specified correlation (ρ), using the same methodology as Puth et al. (2014). We first consider estimation of the 95% confidence interval for the population value of Pearson’s r. Puth et al. (2014) considered three methods for calculating this confidence interval: the BCa method of bootstrapping, the method due to both Muddapur (1988) and Jeyaratnam (1992) utilising F statistics, and the most commonly used version (due to Fisher 1925) based on a z-statistic. For the first two of these, we compare the uncorrected versions used by Puth et al. (2014) with modifications in which the OPA correction is applied to all calculated r values. For the z-method, we compare the uncorrected method used by Puth et al. (2014) with one in which, after the value of z is calculated, it is replaced by a value (z*) designed to correct for a bias that makes z slightly larger than it should be. This correction, originally due to Hotelling (1953) and recommended by DeGhett (2014), is given by:

$$ z^{*} = \begin{cases} z - \dfrac{3z + r}{4(N - 1)}, & \text{if } N > 10 \\[2.5ex] z - \dfrac{3z + r}{4(N - 1)} - \dfrac{23z + 33r - 5r^{3}}{96(N - 1)^{2}}, & \text{if } N \le 10. \end{cases} $$
(2)
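
Equation (2) is straightforward to compute. The following Python sketch (our own, with assumed function names) applies Fisher’s transform and then the Hotelling correction.

```python
import math

def fisher_z(r):
    """Fisher's z transform of a sample correlation r (Eq. 4 below)."""
    return 0.5 * math.log((1 + r) / (1 - r))

def hotelling_z_star(r, n):
    """Hotelling's (1953) bias-corrected z* of Eq. (2) for sample size n."""
    z = fisher_z(r)
    z_star = z - (3 * z + r) / (4 * (n - 1))
    if n <= 10:  # second-order term used only for very small samples
        z_star -= (23 * z + 33 * r - 5 * r ** 3) / (96 * (n - 1) ** 2)
    return z_star
```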

In Table 1, we evaluate these methods for samples drawn using the procedure described by Puth et al. (2014), with sample sizes N = 10, 20 and 30 and population correlations ρ = 0, 0.1, 0.3, 0.5, 0.7 and 0.9. For each of the six methods, we calculate the mean coverage of the confidence intervals, defined as the fraction of the 1000 confidence intervals that include the actual population value ρ. Values higher than 0.95 suggest that the confidence interval is too wide, and values lower than 0.95 suggest that it is too narrow. For each combination of sample size and underlying correlation, we present a 3 × 2 set of numbers. For each of the three methods, we embolden whichever of the corrected or uncorrected versions offers coverage closer to 0.95, and we underline whichever of the six values is closest to 0.95.
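
As a rough illustration of this procedure (not the code used to generate Table 1, and shown for the z-method interval only), the coverage calculation can be sketched in Python as follows; the other methods share the same outer loop, and the corrected variants substitute OPA(r) or z* before the interval is formed.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)       # seed chosen arbitrarily
rho, n, reps = 0.5, 20, 1000         # illustrative values

hits = 0
for _ in range(reps):
    # bivariate normal sample with population correlation rho
    x, y = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=n).T
    r = np.corrcoef(x, y)[0, 1]
    z = 0.5 * np.log((1 + r) / (1 - r))              # Fisher's z (Eq. 4)
    half = stats.norm.ppf(0.975) / np.sqrt(n - 3)    # half-width of the 95% CI
    lo, hi = np.tanh(z - half), np.tanh(z + half)    # back-transform (Eq. 6)
    hits += lo <= rho <= hi

print(hits / reps)   # coverage; values near 0.95 are desirable
```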

Table 1 Coverage of 95% confidence intervals for the population value of Pearson’s r estimated using three methods: BCa bootstrapping, F statistics and z statistics

We then turn to testing the null hypothesis ρ = 0 (at the significance level α = 0.05) in Table 2, for the same combinations of sample sizes and underlying ρ values. For ρ = 0 we give the type I error rate; otherwise we give the power. Again, there is a 3 × 2 set of numbers in each cell, the first column being uncorrected and the second corrected. The three rows again refer to the three methods considered in Puth et al. (2014). Firstly, we consider the standard method, where t* is given by:

$$ t^{*} = \frac{r}{{\sqrt {\frac{{1 - r^{2} }}{N - 2}} }}. $$
(3)

t* is compared to a t-distribution with N − 2 degrees of freedom, the null hypothesis being rejected if the absolute value of t* is greater than the (1 − α/2) quantile of that t-distribution. Secondly, we consider a permutation test, where the null hypothesis is rejected if the observed value of r lies outside the 2.5th and 97.5th percentiles of a distribution of r values calculated from permutations of the original sample. Finally, we use Fisher’s method, first calculating a z score as:

$$ z = 0.5\log_{\text{e}} \left( {\frac{1 + r}{1 - r}} \right). $$
(4)
Table 2 Testing the null hypothesis ρ = 0 (at the significance level α = 0.05) for N = 10, 20 and 30 for population correlations ρ = 0, 0.1, 0.3, 0.5, 0.7, 0.9

Then we compare

$$ Z = \frac{z}{{\sqrt {\frac{1}{N - 3}} }}, $$
(5)

with the (1 − α/2) quantile of the standard normal distribution (i.e., 1.96 if α = 0.05), rejecting the null hypothesis if the absolute value of Z is greater than or equal to 1.96. To implement correction for underestimation, for the first two methods we replace all calculated values of r with the OPA-corrected value at all stages of the procedure; for the final method, we replace z with the corrected value z* as defined above. Results are shown in Table 2; for each combination of sample size and each of the three methods, we calculate the power (or type I error rate for ρ = 0) using both the uncorrected and corrected methods (columns “r” and “r*”, respectively). For each pair of uncorrected and corrected values, we embolden whichever offers the higher power (or type I error rate closest to the nominal 0.05 level). For each group of six values, we underline whichever of the uncorrected or corrected methods performs best of the six (in terms of highest power or type I error rate closest to the nominal level).
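
The three uncorrected tests can be sketched in Python as below (our own illustrative functions, not the original code); the corrected versions simply replace r with OPA(r), or z with z*, wherever those quantities appear. The number of permutations is an arbitrary choice.

```python
import numpy as np
from scipy import stats

def t_test(x, y, alpha=0.05):
    """Standard t-test of rho = 0 using Eq. (3); returns True if rejected."""
    n, r = len(x), np.corrcoef(x, y)[0, 1]
    t_star = r / np.sqrt((1 - r ** 2) / (n - 2))
    return abs(t_star) > stats.t.ppf(1 - alpha / 2, df=n - 2)

def permutation_test(x, y, alpha=0.05, n_perm=2000, seed=1):
    """Reject if the observed r lies outside the 2.5th-97.5th percentiles
    of r values obtained by permuting the pairing of x and y."""
    rng = np.random.default_rng(seed)
    r_obs = np.corrcoef(x, y)[0, 1]
    r_perm = [np.corrcoef(x, rng.permutation(y))[0, 1] for _ in range(n_perm)]
    lo, hi = np.percentile(r_perm, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return r_obs < lo or r_obs > hi

def z_test(x, y, alpha=0.05):
    """Fisher z-test of rho = 0 using Eqs. (4) and (5)."""
    n, r = len(x), np.corrcoef(x, y)[0, 1]
    z = 0.5 * np.log((1 + r) / (1 - r))
    Z = z / np.sqrt(1 / (n - 3))
    return abs(Z) >= stats.norm.ppf(1 - alpha / 2)
```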

In Fig. 1a, we then plot the OPA-corrected value divided by the original r value calculated from a sample, for sample sizes 8–30 and r values 0.1, 0.3, 0.5, 0.7 and 0.9. In Fig. 1b, we do the same for the z correction, where, after the correction has been applied to Fisher’s z, the corrected r value is recovered using:

$$ r = \frac{{\exp \left( {2z^{*} } \right) - 1}}{{\exp \left( {2z^{*} } \right) + 1}}. $$
(6)
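
Equation (6) is simply the inverse of Fisher’s transform. As a minimal illustration (our own, with arbitrary example values), a single Fig. 1b-style ratio can be computed as:

```python
import math

def z_to_r(z):
    """Back-transform a (corrected) Fisher z value to a correlation, Eq. (6)."""
    return (math.exp(2 * z) - 1) / (math.exp(2 * z) + 1)   # equivalent to tanh(z)

r, n = 0.5, 15                                # illustrative values
z = 0.5 * math.log((1 + r) / (1 - r))         # Fisher's z (Eq. 4)
z_star = z - (3 * z + r) / (4 * (n - 1))      # Eq. (2), N > 10 branch
print(z_to_r(z_star) / r)                     # corrected r divided by original r
```
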
Fig. 1

a OPA-corrected values divided by the original r value, for sample sizes 8–30 and r values 0.1, 0.3, 0.5, 0.7 and 0.9. b Corrected r values divided by the original r values produced via the z-method using Eq. (2), for sample sizes 8–30 and r values 0.1, 0.3, 0.5, 0.7 and 0.9. The corrected r values are recovered after the correction has been made to Fisher’s z by the formula given in Eq. (6)

Finally, in Fig. 2a, we investigate the spread of sample values by plotting the frequency of r values calculated from 1000 samples with N = 15 and ρ = 0.25, drawing attention to the mean, standard deviation and mean squared error. In Fig. 2b, we show the same for the OPA(r)-corrected values for the same sample size and population correlation.
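
A sketch of the Fig. 2 comparison (our own illustration in Python, not the original code) is given below: it draws 1000 samples of size 15 from a bivariate normal population with ρ = 0.25 and summarises the mean, standard deviation and mean squared error of r and of OPA(r).

```python
import numpy as np

rng = np.random.default_rng(2)     # seed chosen arbitrarily
rho, n, reps = 0.25, 15, 1000

r_vals = np.empty(reps)
for i in range(reps):
    x, y = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=n).T
    r_vals[i] = np.corrcoef(x, y)[0, 1]
opa_vals = r_vals * (1 + (1 - r_vals ** 2) / (2 * (n - 4)))   # Eq. (1)

for label, vals in (("r", r_vals), ("OPA(r)", opa_vals)):
    print(label, vals.mean(), vals.std(), np.mean((vals - rho) ** 2))  # mean, SD, MSE
```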

Fig. 2

a Histogram of r values calculated from 1000 samples with N = 15 and ρ = 0.25; mean (full line), standard deviation (SD) (dashed line) and mean squared error (MSE) are shown. b Histogram of OPA(r) values calculated from 1000 samples with N = 15 and ρ = 0.25; mean (full line), standard deviation (SD) (dashed line) and mean squared error (MSE) are shown

Results

Table 1 gives no evidence to support adoption of the OPA correction for the calculation of confidence intervals. Regardless of the method used, correction shows no general tendency to bring coverage closer to the nominal 0.95 value. There is perhaps a tendency for correction to lead to confidence intervals that are too wide (hence with coverage above 0.95), but this tendency is not consistent.

We now turn to Table 2 for testing the null hypothesis that ρ = 0. Considering the type I error rate first, we find that all methods are conservative in the overwhelming majority of cases, with type I error rates mostly below 0.05, something that correction does not substantially change. Turning to power (with ρ = 0.1, 0.3, 0.5, 0.7, 0.9), we find, unsurprisingly, that the power of all (corrected and uncorrected) methods increases with sample size and with the population value of ρ. Puth et al. (2014) did not find a strong difference in power between the three uncorrected methods, and our results agree with this. We find the same to be true when comparing the powers of the three corrected versions. Most importantly, for any specific method we do not observe correction offering a conspicuous and consistent improvement in power. Hence, we do not find strong evidence in support of correcting calculated correlation coefficients as part of null hypothesis testing.

Figure 1a shows that, irrespective of the size of r, where sample sizes are > 15 there is very little difference between r and OPA(r); a similar trend can be seen for the correction to z in Fig. 1b. From Fig. 2, it can then be observed, firstly, that such small samples can produce a broad range of different r values across our 1000 samples. Secondly, the mean r of the 1000 samples is lower than the population value of 0.25 (i.e. it is downwardly biased, as expected), but the mean value of OPA(r) is slightly closer to 0.25 (so the correction reduces bias somewhat on average). Finally, the standard deviation and the mean squared error of the OPA-corrected values are larger than for the r values; this suggests that the reduction in bias through the use of the OPA correction comes at a cost in precision, and that imprecision is a more dominant feature than bias in this example situation.

Discussion

On the basis of our survey of the literature and our own simulations, we can offer clear advice to the many researchers in ecology who use Pearson’s r in the statistical treatment of their data.

Firstly, they should be aware that the value measured on their sample will more often underestimate than overestimate the true value in the underlying population they are interested in. This possible bias was not discussed in any of the papers in our survey.

Further, they should be aware that testing the null hypothesis of no association is conservative, rejecting the null hypothesis when it is true at a rate lower than the nominal α. This hypothesis was tested in 21 of the 26 papers in our survey, but none of these discussed the conservatism of the test.

Next, they should not apply any of the methods offered in the literature for correcting this bias, since no method yet developed offers consistently reliable performance. Additionally, the fact that the standard deviation of the OPA-corrected values (Fig. 2b) was greater than that of the r values (Fig. 2a) illustrates that any reduction in bias through correction can come at the cost of increased imprecision.

Finally, when discussing the importance of this bias towards underestimation for the biological conclusions to be drawn from their study, they should quantify its likely extent. We see in Fig. 1a that (regardless of the size of the actual correlation ρ), as long as N > 15, the difference between r and OPA(r) is less than 5% of r. Sample size was less than 15 in 3 of the 26 papers in our survey. Thus, unless sample size is very small, the issue of sample bias is unlikely to call for substantial modification of biological conclusions. For such small sample sizes, statistical power is likely to be very low (see Tables 1, 2), and thus imprecision may often be a greater concern than bias even in this situation. In our survey of 26 papers, only 1 provided a confidence interval, and none of the others discussed precision in any way. We have demonstrated here three simple and general ways in which such a confidence interval can be calculated, as a very useful aid to discussing imprecision of estimation.