# Estimating the mean and variance from the median, range, and the size of a sample


## Abstract

### Background

Researchers performing a meta-analysis of continuous outcomes from clinical trials usually need the mean and the variance (or standard deviation) of each outcome in order to pool data. However, published reports of clinical trials sometimes give only the median, the range, and the size of the trial.

### Methods

In this article we use simple and elementary inequalities and approximations in order to estimate the mean and the variance for such trials. Our estimation is distribution-free, i.e., it makes no assumption on the distribution of the underlying data.

### Results

We found two simple formulas that estimate the mean using the values of the median (*m*), the low and high ends of the range (*a* and *b*, respectively), and the sample size (*n*). Using simulations, we show that the median can be used to estimate the mean when the sample size is larger than 25. For smaller samples our new formula, devised in this paper, should be used. We also estimated the variance of an unknown sample using the median, the low and high ends of the range, and the sample size. Our estimate performs best in our simulations for very small samples (*n* ≤ 15). For moderately sized samples (15 < *n* ≤ 70), our simulations show that the formula range/4 is the best estimator for the standard deviation (and hence the variance). For large samples (*n* > 70), the formula range/6 gives the best estimator for the standard deviation (and hence the variance).

We also include an illustrative example of the potential value of our method using reports from the Cochrane review on the role of erythropoietin in anemia due to malignancy.

### Conclusion

Using these formulas, we hope to help meta-analysts use clinical trials in their analysis even when not all of the information is available and/or reported.

### Keywords

Erythropoietin; Cochrane Review; Skewed Distribution; Good Estimator; Estimation Formula

## Background

To perform a meta-analysis of continuous data, meta-analysts need the mean value and the variance (or standard deviation) in order to pool data. However, published reports of clinical trials sometimes give only the median, the range, and the size of the trial. In this article we use simple and elementary inequalities in order to estimate the mean and the variance for such trials. Our estimation is distribution-free, i.e., it makes no assumption on the distribution of the underlying data. In fact, the value of our approximations lies in giving a method for estimating the mean and the variance exactly when there is no indication of the underlying distribution of the data. In current practice, the median is often substituted for the mean, and Range/4 or Range/6 for the standard deviation. However, it has not been shown that the median can indeed replace the mean, nor when each of the range formulas is appropriate.

## Methods

### Assumptions

Suppose a clinical trial reports the following summary measures for a certain event:

*m* = Median

*a* = The smallest value (minimum)

*b* = The largest value (maximum)

*n* = The size of the sample.

In this article, we want to estimate the mean and the standard deviation of this sample of size *n*. First we order the sample by size:

$$a = x_1 \le x_2 \le x_3 \le \dots \le x_{M-1} \le x_M = m \le x_{M+1} \le \dots \le x_{n-1} \le x_n = b,$$

where the *M*^{th} number is the median, and $M = \frac{n+1}{2}$ (for the sake of simplicity, we will assume that *n* is an odd number).

## Results

### Estimating the sample mean $\bar{x}$

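We begin with several simple inequalities:

$$
\begin{aligned}
a &\le x_1 \le a \\
a &\le x_i \le m, \qquad i = 2, \dots, M-1 \\
m &\le x_M \le m \\
m &\le x_i \le b, \qquad i = M+1, \dots, n-1 \\
b &\le x_n \le b
\end{aligned}
\tag{1}
$$

Adding up the middle column and dividing by *n* gives exactly the sample mean, $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$.

Adding up and dividing by *n* for all three columns, and using $M - 1 = \frac{n-1}{2}$, we get the following inequality:

$$\frac{\frac{n-1}{2}(a+m) + b}{n} \le \bar{x} \le \frac{a + \frac{n-1}{2}(m+b)}{n}.$$

Therefore, the lower bound for the sample mean is

$$LB = \frac{a+m}{2} + \frac{2b - a - m}{2n}.$$

The upper bound for the sample mean is

$$UB = \frac{m+b}{2} + \frac{2a - m - b}{2n}.$$

The sample mean can then be estimated as the midpoint of these bounds:

$$\bar{x} \approx \frac{LB + UB}{2} = \frac{a + 2m + b}{4} + \frac{a - 2m + b}{4n}. \tag{4}$$

When the size of the sample is fairly large, the second fraction becomes negligible and the estimate can be written in a simplified form:

$$\bar{x} \approx \frac{a + 2m + b}{4}. \tag{5}$$

We can use this simple expression even if we do not know the size of the sample. The length of the interval which contains the sample mean (the interval [LB, UB]) is

$$UB - LB = \frac{b-a}{2}\left(1 - \frac{3}{n}\right) \approx \frac{b-a}{2}.$$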
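The bounds and the two mean estimates can be sketched in a few lines of Python. This is only an illustration: the function names are ours, and the formulas are the ones obtained by summing the inequalities (1) for odd *n*, namely LB = (a+m)/2 + (2b-a-m)/(2n), UB = (m+b)/2 + (2a-m-b)/(2n), formula (4) as their midpoint, and formula (5) as its large-*n* limit.

```python
def mean_bounds(a, m, b, n):
    """Lower and upper bound for the sample mean (n assumed odd)."""
    lb = (a + m) / 2 + (2 * b - a - m) / (2 * n)
    ub = (m + b) / 2 + (2 * a - m - b) / (2 * n)
    return lb, ub


def estimate_mean(a, m, b, n=None):
    """Formula (4) when n is known, formula (5) when it is not."""
    simple = (a + 2 * m + b) / 4          # formula (5)
    if n is None:
        return simple
    return simple + (a - 2 * m + b) / (4 * n)  # formula (4)


# Hypothetical sample 1, 2, 3, 4, 10: a = 1, m = 3, b = 10, n = 5, true mean 4.
lb, ub = mean_bounds(1, 3, 10, 5)
print(lb, ub)                      # the true mean 4 lies between the bounds
print(estimate_mean(1, 3, 10, 5))  # formula (4), the midpoint of [LB, UB]
print(estimate_mean(1, 3, 10))     # formula (5)
```

For this small, right-skewed example the true mean (4) falls inside [LB, UB], and formula (4) returns the midpoint of that interval.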

### Estimating the sample variance

Even when the only information we have about a set of data is its range, *R* = *b* - *a*, we can still estimate the standard deviation. If our data are normally distributed, then *P*[-2*σ* < *X* - *μ* < 2*σ*] ≈ 0.95, and therefore the range covers approximately 4*σ*, i.e., $\sigma \approx \frac{R}{4}$.

When the data we are dealing with are not normally distributed, we can still use Chebyshev's inequality [1, 2], $P[\,|X - \mu| < k\sigma\,] \ge 1 - \frac{1}{k^2}$, and obtain the following for *k* = 3: $P[-3\sigma < X - \mu < 3\sigma] \ge \frac{8}{9} \approx 0.89$. Therefore, the range covers approximately 6*σ*, i.e., $\sigma \approx \frac{R}{6}$.
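The two coverage figures behind Range/4 and Range/6 can be checked numerically. The sketch below uses the standard normal identity P[|Z| < k] = erf(k/√2) and the Chebyshev bound 1 - 1/k²; the function names are illustrative only.

```python
import math


def normal_coverage(k):
    """P[-k*sigma < X - mu < k*sigma] for normally distributed X."""
    return math.erf(k / math.sqrt(2))


def chebyshev_bound(k):
    """Distribution-free lower bound on the same probability."""
    return 1 - 1 / k ** 2


def sd_from_range(a, b, divisor):
    """Range rule: divisor 4 (normal data) or 6 (Chebyshev with k = 3)."""
    return (b - a) / divisor


print(normal_coverage(2))   # roughly 0.95, so the range spans about 4 sigma
print(chebyshev_bound(3))   # 8/9, so the range spans about 6 sigma
```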

On the other hand, if the summary results for a clinical trial include the median and the size of the sample, we can presumably do better than the two range approximations above. The next section deals with that situation.

### The variance $S^2$: distribution-free inequalities

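Using the inequalities (1), and taking into consideration that all the data are non-negative, we can multiply row *i* by the value *x*_{i} (*i* = 1, 2, 3, ..., *n*). We obtain the following inequalities:

$$
\begin{aligned}
a\,x_1 &\le x_1^2 \le a\,x_1 \\
a\,x_i &\le x_i^2 \le m\,x_i, \qquad i = 2, \dots, M-1 \\
m\,x_M &\le x_M^2 \le m\,x_M \\
m\,x_i &\le x_i^2 \le b\,x_i, \qquad i = M+1, \dots, n-1 \\
b\,x_n &\le x_n^2 \le b\,x_n
\end{aligned}
$$

Adding up by columns, and using *x*_{1} = *a*, *x*_{M} = *m*, *x*_{n} = *b*, we have the following:

$$
\underbrace{a^2 + a\sum_{i=2}^{M-1} x_i + m^2 + m\sum_{i=M+1}^{n-1} x_i + b^2}_{LB}
\;\le\; \sum_{i=1}^{n} x_i^2 \;\le\;
\underbrace{a^2 + m\sum_{i=2}^{M-1} x_i + m^2 + b\sum_{i=M+1}^{n-1} x_i + b^2}_{UB}
\tag{7}
$$

Using the inequalities (1) again, we estimate the sums in LB and UB as the midpoints of their own bounds:

$$\sum_{i=2}^{M-1} x_i \approx (M-2)\,\frac{a+m}{2}, \qquad \sum_{i=M+1}^{n-1} x_i \approx (M-2)\,\frac{m+b}{2}.$$

Therefore, the expressions in (7) can be estimated as

$$
\begin{aligned}
LB &\approx a^2 + m^2 + b^2 + \frac{M-2}{2}\left(a^2 + am + m^2 + mb\right), \\
UB &\approx a^2 + m^2 + b^2 + \frac{M-2}{2}\left(am + m^2 + mb + b^2\right).
\end{aligned}
$$

The sum of squares can therefore be estimated as the midpoint of LB and UB. With $M - 2 = \frac{n-3}{2}$,

$$\sum_{i=1}^{n} x_i^2 \approx a^2 + m^2 + b^2 + \frac{n-3}{2}\cdot\frac{(a+m)^2 + (m+b)^2}{4}.$$

The sample variance can be evaluated from the computational formula $S^2 = \frac{1}{n-1}\left(\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right)$. Substituting the estimate of the sum of squares and the mean estimate (4) gives

$$S^2 \approx \frac{1}{n-1}\left(a^2 + m^2 + b^2 + \frac{n-3}{2}\cdot\frac{(a+m)^2 + (m+b)^2}{4} - n\left(\frac{a+2m+b}{4} + \frac{a-2m+b}{4n}\right)^2\right). \tag{12}$$

Note that if we let *n* grow without bound, the expression (12) becomes the well-known range formula $S \approx \frac{b-a}{4}$.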
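As a rough check of this limit, the sketch below evaluates the distribution-free variance estimate for increasing *n*. It assumes that formula (12) has the form S² ≈ [a² + m² + b² + ((n-3)/2)·((a+m)² + (m+b)²)/4 - n·(mean from formula (4))²]/(n-1); the function name and the example values are ours.

```python
import math


def variance_formula_12(a, m, b, n):
    """Distribution-free variance estimate (assumed form of formula (12))."""
    sum_of_squares = (a * a + m * m + b * b
                      + (n - 3) / 2 * ((a + m) ** 2 + (m + b) ** 2) / 4)
    mean_4 = (a + 2 * m + b) / 4 + (a - 2 * m + b) / (4 * n)  # formula (4)
    return (sum_of_squares - n * mean_4 ** 2) / (n - 1)


# Hypothetical summary data: a = 0, m = 1, b = 4, so Range/4 = 1.
for n in (9, 99, 9999):
    print(n, math.sqrt(variance_formula_12(0, 1, 4, n)))
```

For *n* = 3 the sample is exactly {*a*, *m*, *b*} and the estimate reproduces its sample variance; as *n* grows, the square root of the estimate approaches (*b* - *a*)/4.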

### The variance $S^2$: equidistantly spaced data

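Formula (4) can also be obtained by dividing the range [*a*, *b*] into two parts, [*a*, *m*) and [*m*, *b*], and subdividing each of these two parts into subintervals using equally spaced partition points. In other words, we approximate each of the data points (except for *a*, *m*, and *b*) by uniformly spaced points:

$$x_i = a + (i-1)\,\frac{m-a}{M-1}, \qquad i = 1, \dots, M-1,$$

and

$$y_i = m + (i-1)\,\frac{b-m}{M-1}, \qquad i = 1, \dots, M.$$

Therefore our sample is approximately given as

$$a = x_1 \le x_2 \le x_3 \le \dots \le x_{M-1} \le y_1 = m \le y_2 \le \dots \le y_{M-1} \le y_M = b.$$

The mean of these *n* = 2*M* - 1 points is exactly the estimate (4). Applying the computational formula for the sample variance to the same points gives, after a little algebra, the estimate

$$S^2 \approx \frac{n+1}{48\,n\,(n-1)^2}\left((n^2+3)(a-2m+b)^2 + 4n^2(b-a)^2\right). \tag{15}$$

If we let the number of estimation points increase without bound, i.e., assume that *n* in the expression (15) is very large, we obtain a simplified version of the expression above:

$$S^2 \approx \frac{1}{12}\left(\frac{(a-2m+b)^2}{4} + (b-a)^2\right). \tag{16}$$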
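The equidistant construction is easy to reproduce numerically. The following sketch builds the approximate sample, computes its sample variance directly, and compares it with closed forms. It assumes that formula (15) is S² ≈ (n+1)/(48n(n-1)²)·((n²+3)(a-2m+b)² + 4n²(b-a)²) and that formula (16) is S² ≈ ((a-2m+b)²/4 + (b-a)²)/12; all function names and example values are illustrative.

```python
from statistics import variance


def equidistant_sample(a, m, b, n):
    """M-1 equally spaced points on [a, m) and M points on [m, b]; n odd."""
    M = (n + 1) // 2
    xs = [a + i * (m - a) / (M - 1) for i in range(M - 1)]
    ys = [m + i * (b - m) / (M - 1) for i in range(M)]
    return xs + ys


def variance_formula_15(a, m, b, n):
    """Finite-n closed form (assumed form of formula (15))."""
    return ((n + 1) / (48 * n * (n - 1) ** 2)
            * ((n * n + 3) * (a - 2 * m + b) ** 2 + 4 * n * n * (b - a) ** 2))


def variance_formula_16(a, m, b):
    """Large-n limit (assumed form of formula (16))."""
    return ((a - 2 * m + b) ** 2 / 4 + (b - a) ** 2) / 12


# Hypothetical summary data: a = 1, m = 3, b = 10, n = 11.
sample = equidistant_sample(1, 3, 10, 11)
print(variance(sample))                 # direct sample variance
print(variance_formula_15(1, 3, 10, 11))
print(variance_formula_16(1, 3, 10))    # limit as n grows
```

The first two numbers should agree up to rounding, while formula (16) differs slightly because it ignores the finite sample size.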

## Discussion

### Analysis and performance of estimates

In order to verify the accuracy of these estimates, we ran several simulations using the computer package Maple where the data were variously distributed, and obtained the tables below.

We drew samples from five different distributions: Normal, Log-normal, Beta, Exponential and Weibull. The size of the sample ranged from 8 to about 100. In the first subsection we present the results of our estimation for a normal distribution, which is what meta-analysts would commonly assume. We also show the results of simulations where the data were selected from skewed distributions. In each case we compared the relative error made by estimating the sample mean with the approximations given by formulas (4) and (5), as well as by the median, and the relative error made by estimating the sample variance by the formulas (12) and (16), as well as by the well-known standard deviation estimators Range/4 and Range/6.

### Normal distribution

We drew 200 random samples of sizes ranging from 8 to 100 from a Normal distribution with a population mean of 50 and a standard deviation of 17. Then we graphed the average relative error vs. the sample size. Both estimators for the mean, formulas (4) and (5), are very close to the sample mean (within 4%). For sample sizes smaller than 29, formula (5) actually outperforms the median as a mean estimator. For larger sample sizes, however, the median is a more consistent estimator for a normally distributed sample.

The variance estimators, however, show greater distinction. For a very small sample size (up to 15), formula (16) performs best (within 10% of the real sample standard deviation). When the sample size is between 16 and 70, the formula Range/4 is the best estimator of the sample standard deviation, with a relative error between 10% and 15%. For larger sample sizes, however, the formula Range/6 performs best for this distribution. To compare the precision of these estimates on average, we collected the results of our simulation in Additional file 1.
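A simulation of this kind can be approximated with the Python standard library. The sketch below is not the Maple code used for the paper: the function names, the seed, and the use of `random.Random.gauss` are our choices, and it assumes formula (5) is (a+2m+b)/4 and formula (16) is ((a-2m+b)²/4 + (b-a)²)/12. It draws 200 samples of a given size from Normal(50, 17) and reports the average relative error of each estimator against the actual sample mean and sample standard deviation.

```python
import random
import statistics


def estimators(sample):
    """Estimates computed from the median, minimum, and maximum only."""
    xs = sorted(sample)
    a, m, b = xs[0], statistics.median(xs), xs[-1]
    return {
        "mean_formula_5": (a + 2 * m + b) / 4,
        "mean_median": m,
        "sd_formula_16": (((a - 2 * m + b) ** 2 / 4 + (b - a) ** 2) / 12) ** 0.5,
        "sd_range_4": (b - a) / 4,
        "sd_range_6": (b - a) / 6,
    }


def average_relative_errors(n, draw, trials=200, seed=0):
    """Average |estimate - truth| / |truth| over `trials` samples of size n."""
    rng = random.Random(seed)
    totals = {}
    for _ in range(trials):
        sample = [draw(rng) for _ in range(n)]
        true_mean = statistics.mean(sample)
        true_sd = statistics.stdev(sample)
        for name, estimate in estimators(sample).items():
            truth = true_mean if name.startswith("mean") else true_sd
            totals[name] = totals.get(name, 0.0) + abs(estimate - truth) / abs(truth)
    return {name: total / trials for name, total in totals.items()}


for n in (9, 25, 75):
    errors = average_relative_errors(n, lambda rng: rng.gauss(50, 17))
    print(n, {name: round(err, 3) for name, err in errors.items()})
```

Exact figures depend on the random seed, so the output should be read as a qualitative comparison of the estimators rather than a reproduction of the published tables.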

### Simulation with a skewed distribution (Log-Normal, Beta, Exponential and Weibull)

We drew samples from four skewed distributions: a Log-normal distribution with parameters *μ* = 4 and *σ* = 0.3, a Beta distribution with parameters *a* = 9 and *b* = 4, an Exponential distribution with the parameter *λ* = 10, and a Weibull distribution with parameters *a* = 2 and *b* = 35. These parameters were chosen arbitrarily, and the simulation results did not differ when we used different parameters (naturally, a larger variance translates into a larger relative error for the mean estimators for any distribution). Just as in the case of the Normal distribution, we ran our algorithm 200 times for each sample size ranging from 8 to 100. For each of the estimation formulas we then calculated the average relative error. We summarize the best formula for estimation in Table 1.

Table 1. The best formula for estimation by distribution, as a function of the sample size (*n*).

| Distribution | Mean: Formula (5) | Mean: Median | SD: Formula (16) | SD: Range/4 | SD: Range/6 |
|---|---|---|---|---|---|
| Log-Normal | *n* ≤ 23 | 23 < *n* | *n* ≤ 15 | 15 < *n* ≤ 64 | 64 < *n* |
| Beta | *n* ≤ 30 | 30 < *n* | *n* ≤ 15 | 15 < *n* ≤ 100 | 100 < *n* |
| Exponential | *n* ≤ 21 | 21 < *n* | *n* ≤ 15 | 15 < *n* ≤ 66 | 66 < *n* |
| Weibull | *n* ≤ 25 | 25 < *n* | *n* ≤ 16 | 16 < *n* ≤ 110 | 110 < *n* |
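The comparison between the median and formula (5) for skewed data can be sketched with the samplers in Python's `random` module. Several things here are assumptions rather than the paper's setup: the seed, the function names, the form of formula (5) as (a+2m+b)/4, and in particular the reading of the Weibull parameters as shape 2 and scale 35 (Python's `weibullvariate` takes the scale first, then the shape).

```python
import random
import statistics

# Samplers for the four skewed distributions; parameter mapping is assumed.
SAMPLERS = {
    "Log-Normal": lambda rng: rng.lognormvariate(4, 0.3),
    "Beta": lambda rng: rng.betavariate(9, 4),
    "Exponential": lambda rng: rng.expovariate(10),
    "Weibull": lambda rng: rng.weibullvariate(35, 2),  # scale 35, shape 2 (assumed)
}


def mean_errors(draw, n, trials=200, seed=1):
    """Average relative error of the median and of formula (5) as mean estimators."""
    rng = random.Random(seed)
    err_median = err_formula_5 = 0.0
    for _ in range(trials):
        xs = sorted(draw(rng) for _ in range(n))
        a, m, b = xs[0], statistics.median(xs), xs[-1]
        mean = statistics.mean(xs)
        err_median += abs(m - mean) / mean
        err_formula_5 += abs((a + 2 * m + b) / 4 - mean) / mean
    return err_median / trials, err_formula_5 / trials


for name, draw in SAMPLERS.items():
    for n in (9, 25, 75):
        e_med, e_f5 = mean_errors(draw, n)
        print(f"{name:12s} n={n:3d}  median: {e_med:.3f}  formula (5): {e_f5:.3f}")
```

The sample size at which the median starts to beat formula (5) will vary with the seed and the parameter mapping, so the thresholds in Table 1 should not be expected to be reproduced exactly.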

Therefore, counterintuitively, even for the skewed distributions we tested, it seems that for a larger sample size (usually more than 25) simply replacing the sample mean with the reported median gives the best estimate of the sample mean. This is an interesting result and we are not aware that it has been demonstrated previously. It gives assurance to meta-analysts that simple replacement of means with medians in a meta-analysis is a viable option. Formula (5), even though it takes more information into account (the range), on average only outperforms the median for small sample sizes. However, a large number of trials used in meta-analyses do have a very small number of patients in each arm (as few as 10–15). For these trials, formula (5) seems to give an alternative to just using the median.

When estimating the standard deviation, formula (16) is the best estimate for very small sample sizes (less than 16), after which the range formulas (Range/4 and Range/6) are better. Range/4 formula works best for samples of moderate size (between 16 and about 70), while for really large samples, Range/6 is the best estimator.

Detailed results of each simulation with a skewed distribution are given in Additional files 2, 3, 4, and 5.

If the reader wants to try these formulas with a different set of data, we have provided an Excel spreadsheet file with the formulas at http://www.iun.edu/~mathiho/medmath/Estimating.xls

## Effect on the mean difference in meta-analysis

In this section we discuss how using these estimating formulas affects the effect size computed by meta-analysts. When pooling the means from various sources for a meta-analysis, the usual procedure is to calculate, for each study, the difference in means between the control arm and the experimental arm, *m*_{p} = *m*_{c} - *m*_{e}, and the combined variance for each study (for example, see [3]). The pooled mean difference is then calculated as a weighted average of these differences, where the weight is the reciprocal of the combined variance for each study.
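The weighting scheme described here can be sketched as a fixed-effect, inverse-variance pooled mean difference. The combined variance formula is not reproduced in this text, so the sketch assumes the common choice s_c²/n_c + s_e²/n_e for the variance of a difference of two independent means; the function name, the tuple layout, and the example numbers are hypothetical.

```python
import math


def pooled_mean_difference(studies):
    """Inverse-variance weighted mean difference with a 95% confidence interval.

    Each study is (mean_c, sd_c, n_c, mean_e, sd_e, n_e).
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for mean_c, sd_c, n_c, mean_e, sd_e, n_e in studies:
        difference = mean_c - mean_e                 # m_p = m_c - m_e
        combined_variance = sd_c ** 2 / n_c + sd_e ** 2 / n_e  # assumed form
        weight = 1 / combined_variance
        weighted_sum += weight * difference
        total_weight += weight
    pooled = weighted_sum / total_weight
    standard_error = math.sqrt(1 / total_weight)
    return pooled, (pooled - 1.96 * standard_error, pooled + 1.96 * standard_error)


# Two made-up studies with equal precision and differences of 2 and 4.
studies = [(10, 2, 4, 8, 2, 4), (12, 2, 4, 8, 2, 4)]
print(pooled_mean_difference(studies))
```

When a study reports only the median and range, its mean and standard deviation entries would be replaced by the estimates discussed above before pooling.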

To determine whether our estimates make a large difference when compared to the actual mean difference and variance, we drew two samples of the same size from the same distribution. We applied our methods to the Log-Normal [4, 0.3] distribution since this skewed distribution is frequently encountered in biology and medicine.

Results of our meta-analysis with the real sample data as one subgroup, and our estimates of the sample as the second subgroup.

| Subgroup | WMD [95% CI] | % Weight |
|---|---|---|
| Actual sample | -0.37 [-37.17, 36.44] | 42.00 |
| Our estimate | 0.41 [-30.92, 31.73] | 58.00 |
| Overall | 0.08 | 100.00 |

| Test for heterogeneity | Heterogeneity statistic | Degrees of freedom | P | I-squared |
|---|---|---|---|---|
| Sample | 0.04 | 9 | 1.000 | 0.0% |
| Estimate | 0.04 | 9 | 1.000 | 0.0% |
| Overall | 0.08 | 19 | 1.000 | 0.0% |
| Between sub-groups | 0.00 | 1 | 0.975 | |

| Significance test(s) of WMD = 0 | z | p |
|---|---|---|
| Sample | 0.02 | 0.984 |
| Estimate | 0.03 | 0.980 |
| Overall | 0.01 | 0.995 |

In order to capture a more consistent measure of the effect of our estimation on pooled mean difference, we repeated this process by varying the number of trials in the meta-analysis from 8 to 100. In particular we are interested in the difference between the real pooled weighted mean difference in the sample group and the pooled weighted mean difference from a meta-analysis using estimated means and variances.

As seen in Figure 2, the estimates of the mean were fairly accurate and useful. On the other hand, the estimates for the variance were a lot less precise, missing the actual value of the variance by 10% to 20% (see Additional files 1, 2, 3, 4, 5). However, in some situations, using these estimates might still be better than the alternative of excluding the trials that reported a median instead of a mean. Using our estimation method, we can see the effect of such trials on pooled summary measures. In the next section we illustrate our method in an actual systematic review.

## An illustrative example of the potential value of our methods

The American Society of Hematology/American Society of Clinical Oncology (ASH/ASCO) developed practice guidelines for the use of erythropoietin (Epo), a drug whose annual sales exceed several billion dollars in the US alone, based on a systematic review of the effects of Epo on various clinical outcomes of interest, including improvement of anemia by an increase in hemoglobin [4]. The results were expressed as the mean increase in hemoglobin in the Epo arm compared with the control. However, a number of the papers reported the median increase instead of the mean increase and standard deviation. Due to the lack of available methods for using median values, the authors of this important review decided not to use these papers in their meta-analysis. Recently, a Cochrane review was published attempting to provide a more up-to-date analysis of the effects of Epo in anemia related to malignancy [5]. The Cochrane reviewers did meta-analyze data to calculate an average weighted mean increase in hemoglobin as the result of Epo treatment. However, the Cochrane investigators could not include the totality of evidence in relation to this outcome since a number of the trials reported data as medians instead of means. Therefore, published meta-analyses of the effect of Epo in anemia due to malignancy suffer from a phenomenon akin to outcome reporting bias [6], simply because methods have not yet been developed that allow researchers to use data reported as medians.

Here we illustrate that it is actually possible to use medians and pool, and improve inclusiveness of meta-analyses. For example, the Cochrane investigators were only able to pool 2 studies [7, 8] to evaluate the effect of Epo on change in hemoglobin in the patients with the baseline level of hemoglobin >12 g/dl who underwent chemotherapy. Their results show that on average Epo increases hemoglobin by 2.05 g/dl. However, the Cochrane investigators could not pool data from other available studies in the literature with similar eligibility. ASH/ASCO guidelines listed two other studies that were eligible for the meta-analysis (and two that were not).

Our estimates come with some uncertainty. To see what effect this uncertainty has on the outcome of our meta-analysis, we varied the estimated means in Thatcher et al [10] by 4% and the estimated standard deviations in both Thatcher et al [10] and Welch et al [9] by 10% to 15% (according to sample sizes, as indicated in Additional files 1, 2, 3, 4, 5). The summary pooled estimate then ranged from a low of 1.09 to a high of 1.32, which represents a decrease of between 36% and 47%.

This example outlines how our method can be potentially useful for meta-analysts. It is important to realize that this example is provided only to illustrate our method. Our goal here is not to challenge the Cochrane review or the ASH/ASCO guidelines. Nevertheless, we believe that this example is a good illustration of the potential of our method. While it is common practice for investigators simply to pool what is available to them, it is actually not known how often studies are excluded because they report a different summary statistic. In the future we will attempt to address this issue systematically and evaluate, for example, how often Cochrane reviews did not pool data from the available median values when they pooled data on continuous outcomes. We hope that making our methods available to the wider meta-analytic audience may further improve the inclusiveness of all relevant studies in Cochrane and other meta-analyses.

## Conclusion

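The mean can be estimated using only the values of the median (*m*) and the low and high end of the range (*a* and *b*, respectively), as in formula (5):

$$\bar{x} \approx \frac{a + 2m + b}{4}.$$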

Using simulation methods we were able to determine that formula (5) is the best estimator for the mean when dealing with a small sample size. As soon as the sample size exceeds 25, the median itself is the best estimator.

The variance can be estimated using the formula (16):

$$S^2 \approx \frac{1}{12}\left(\frac{(a-2m+b)^2}{4} + (b-a)^2\right).$$

Together with the well-known estimators (Range/4 for a normal distribution, and Range/6 for any random distribution) this formula provides a useful tool for meta-analysts. Using simulations, we determined that for very small samples (up to 15) the best estimator for the variance is the formula (16). When the sample size increases, Range/4 is the best estimator for the standard deviation (and variance) until the sample sizes reach about 70. For large samples (size more than 70) Range/6 is actually the best estimator for the standard deviation (and variance).

The best estimating formula for an unknown distribution.

| Sample size | *n* ≤ 15 | 15 < *n* ≤ 25 | 25 < *n* ≤ 70 | 70 < *n* |
|---|---|---|---|---|
| Estimate mean | Formula (5) | Formula (5) | Median | Median |
| Estimate standard deviation | Formula (16) | Range/4 | Range/4 | Range/6 |
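The rule of thumb summarized above can be written as a small helper. This is a sketch under the thresholds stated in this section (25 for the mean; 15 and 70 for the standard deviation), with formula (5) taken as (a+2m+b)/4 and formula (16) as ((a-2m+b)²/4 + (b-a)²)/12; the function name and the example values are ours.

```python
import math


def estimate_mean_and_sd(a, m, b, n):
    """Pick the recommended estimators from the median, range, and sample size."""
    # Mean: formula (5) for small samples, the median itself otherwise.
    mean = (a + 2 * m + b) / 4 if n <= 25 else m

    # Standard deviation: formula (16), then Range/4, then Range/6.
    if n <= 15:
        sd = math.sqrt(((a - 2 * m + b) ** 2 / 4 + (b - a) ** 2) / 12)
    elif n <= 70:
        sd = (b - a) / 4
    else:
        sd = (b - a) / 6
    return mean, sd


# Hypothetical trial arms reporting median 1 and range 0 to 4.
for n in (9, 30, 100):
    print(n, estimate_mean_and_sd(0, 1, 4, n))
```

The thresholds are averages over the distributions simulated in this article, so for a specific distribution the cut-offs in Table 1 may be a better guide.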

Using these formulas, we hope to enable meta-analysts to use clinical trials even when not all of the information is available and/or reported.

## References

- 1. Hogg RV, Craig AT: Introduction to mathematical statistics. 5th ed. 1995, New York, Toronto: Macmillan College Pub. Co.; Maxwell Macmillan Canada; Maxwell Macmillan International, xi, 564.
- 2. Mood AMF, Graybill FA, Boes DC: Introduction to the theory of statistics. 3rd ed. 1974, New York: McGraw-Hill, xvi, 564.
- 3. Petitti DB: Meta-analysis, decision analysis and cost-effectiveness analysis. Methods for quantitative synthesis in medicine. 2nd ed. 2000, New York: Oxford University Press.
- 4. Rizzo JD, Lichtin AE, Woolf SH, Seidenfeld J, Bennett CL, Cella D, Djulbegovic B, Goode MJ, Jakubowski AA, Lee SJ, Miller CB, Rarick MU, Regan DH, Browman GP, Gordon MS: Use of epoetin in patients with cancer: evidence-based clinical practice guidelines of the American Society of Clinical Oncology and the American Society of Hematology. J Clin Oncol. 2002, 20 (19): 4083-4107. 10.1200/JCO.2002.07.177.
- 5. Bohlius J, Langensiepen S, Schwarzer G, Seidenfeld J, Piper M, Bennet C, Engert A: Erythropoietin for patients with malignant disease. Cochrane Database Syst Rev. 2004, CD003407.
- 6. Chan AW, Hrobjartsson A, Haahr MT, Gotzsche PC, Altman DG: Empirical Evidence for Selective Reporting of Outcomes in Randomized Trials: Comparison of Protocols to Published Articles. JAMA. 2004, 291 (20): 2457-2465. 10.1001/jama.291.20.2457.
- 7. Del Mastro L, Venturini M, Lionetto R, Garrone O, Melioli G, Pasquetti W, Sertoli MR, Bertelli G, Canavese G, Costantini M, Rosso R: Randomized phase III trial evaluating the role of erythropoietin in the prevention of chemotherapy-induced anemia. J Clin Oncol. 1997, 15 (7): 2715-2721.
- 8. Kunikane H, Watanabe K, Fukuoka M, Saijo N, Furuse K, Ikegami H, Ariyoshi Y, Kishimoto S: Double-blind randomized control trial of the effect of recombinant human erythropoietin on chemotherapy-induced anemia in patients with non-small cell lung cancer. Int J Clin Oncol. 2001, 6 (6): 296-301. 10.1007/s10147-001-8031-y.
- 9. Welch RS, James RD, Wilkinson PM, Fb: Recombinant Human Erythropoietin and Platinum-Based Chemotherapy In Advanced Ovarian Cancer. Cancer J Sci Am. 1995, 1 (4): 261-
- 10. Thatcher N, De Campos ES, Bell DR, Steward WP, Varghese G, Morant R, Vansteenkiste JF, Rosso R, Ewers SB, Sundal E, Schatzmann E, H. S: Epoetin alpha prevents anaemia and reduces transfusion requirements in patients undergoing primarily platinum-based chemotherapy for small cell lung cancer. Br J Cancer. 1999, 80 (3-4): 396-402. 10.1038/sj.bjc.6990369.

### Pre-publication history

- The pre-publication history for this paper can be accessed here: http://www.biomedcentral.com/1471-2288/5/13/prepub

## Copyright information

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.