Sample size calculations in pediatric clinical trials conducted in an ICU: a systematic review
At the design stage of a clinical trial, several assumptions have to be made. These usually include guesses about parameters that are not of direct interest but must be accounted for in the analysis of the treatment effect and also in the sample size calculation (nuisance parameters, e.g. the standard deviation or the control group event rate). We conducted a systematic review to investigate the impact of misspecification of nuisance parameters in pediatric randomized controlled trials conducted in intensive care units. We searched MEDLINE through PubMed. We included all publications concerning two-arm RCTs where efficacy assessment was the main objective. We included trials with pharmacological interventions. Only trials with a dichotomous or a continuous outcome were included. This led to the inclusion of 70 articles describing 71 trials. In 49 trial reports a sample size calculation was reported. Relative misspecification could be calculated for 28 trials, 22 with a dichotomous and 6 with a continuous primary outcome. The median [inter-quartile range (IQR)] overestimation was 6.9 [-12.1, 57.8]% for the control group event rate in trials with dichotomous outcomes and -1.5 [-15.3, 5.1]% for the standard deviation in trials with continuous outcomes. Our results show that there is room for improvement in the clear reporting of sample size calculations in pediatric clinical trials conducted in ICUs. Researchers should be aware of the importance of nuisance parameters in study design and in the interpretation of the results.
Keywords: clinical trials, sample size, power, standard deviation, event rate, study design
control group event rate
clinically relevant effect size
International Committee of Medical Journal Editors
minimum detectable effect size
pediatric intensive care unit
randomized controlled trial
In randomized controlled trials (RCTs), a priori sample size calculations aim at enrolling sufficient participants to detect a clinically relevant treatment effect. Including too many participants may expose some to an inferior treatment unnecessarily. Including too few may make the likelihood of reaching a definite conclusion too small. The importance of adequate sample size calculations has been widely stressed in the biomedical literature [1, 2, 3, 4], including internationally recognized guidelines [5, 6, 7]. Most sample size calculations are easily conducted nowadays using specialized software.
In recent years, increasing attention has been given to pediatric RCTs [8, 9, 10, 11, 12, 13] for pharmacological interventions because many drugs used in children have not (yet) been tested. Drug regulatory agencies implemented guidance for sponsors to promote drug research in children, leading to more trials being designed and conducted. Recruitment difficulties [16, 17] and ethical considerations [18, 19] make pediatric trials more challenging, especially with critically ill children, e.g. children being treated in ICUs. In such cases, a rigorously designed RCT is all the more important.
In the design phase of an RCT, the sample size is calculated based on the primary outcome variable. Sample size depends on parameters that are estimated or assumed, in addition to the set criteria of type I error and power. Besides the clinically relevant treatment effect to be detected, assumptions need to be made about so-called nuisance parameters (NPs). An NP is a parameter that is not of direct interest but must be accounted for in the analysis of the treatment effect and thus also in the sample size calculation. Examples of NPs are the event rate in the control group (control group event rate, CER) when the clinical outcome of interest is dichotomous and the standard deviation (SD, assumed equal across groups) when the clinical outcome is continuous.
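As a rough illustration of how these inputs combine, the sketch below uses the standard normal-approximation formulas for a two-arm trial with equal allocation. The function names and the defaults (two-sided type I error of 0.05, power of 0.80) are assumptions made for illustration and are not taken from any of the reviewed trials; calculations based on the t-test or the chi-square test give slightly different numbers.

```python
from math import ceil
from statistics import NormalDist


def _z_sum(alpha, power):
    """Sum of the normal quantiles for a two-sided test at level alpha and the given power."""
    std_normal = NormalDist()
    return std_normal.inv_cdf(1 - alpha / 2) + std_normal.inv_cdf(power)


def n_per_arm_continuous(delta, sd, alpha=0.05, power=0.80):
    """Patients per arm to detect a mean difference `delta` given a common SD (the NP)."""
    return ceil(2 * (sd * _z_sum(alpha, power) / delta) ** 2)


def n_per_arm_dichotomous(cer, rr, alpha=0.05, power=0.80):
    """Patients per arm to detect a risk ratio `rr` given the control event rate (the NP).

    Uses the unpooled-variance normal approximation for two proportions.
    """
    p_control = cer
    p_experimental = cer * rr
    variance = p_control * (1 - p_control) + p_experimental * (1 - p_experimental)
    return ceil(_z_sum(alpha, power) ** 2 * variance / (p_control - p_experimental) ** 2)


# Illustrative values only: the required size grows quickly with the assumed SD.
print(n_per_arm_continuous(delta=10, sd=30))   # 142
print(n_per_arm_continuous(delta=10, sd=40))   # 252
print(n_per_arm_dichotomous(cer=0.5, rr=0.6))  # 91
```

In both functions the treatment effect and the NP enter the formula together, which is why an unreliable NP assumption changes the required sample size even when the targeted effect is unchanged.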
The value of a NP substantially affects the sample size calculation; therefore the value assumed should be as reliable as possible. Of course, the observed value once the trial is completed will differ from the assumed value at the design stage. If the assumed value is different from the (unknown) population value, we refer to it as the misspecification of the nuisance parameter. Misspecification can have serious consequences for the actual power of the trial and the smallest possible effect size that can be detected.
When a sample size calculation is performed, the value of the NP used corresponds to its assumed population value. Therefore, misspecification can be shown on a per-trial basis in terms of statistical significance; that is, one can test whether the observed value is significantly different from the assumed population value. However, our focus will be on systematic misspecification. We are interested in exploring whether there is systematic over- or underestimation of NPs in a specific population of pediatric RCTs, and what the consequences of such a systematic misspecification are for the design aspects of these RCTs and the inference that can be drawn from them. There are various ways to arrive at an assumption about the value of an NP. One can estimate it based on data from earlier trials or other types of studies, or conduct a pilot study. However, all these methods can lead to misspecification of the NP [20, 21, 22, 23, 24].
Previous research has shown that RCTs in general use sample sizes that are too small due to unduly optimistic a priori assumptions. This optimism is partly reflected in the assumed clinically relevant treatment effect, but can also occur as a direct effect of misspecifying an NP. For example, the value of the risk ratio (RR), which is the event rate in the experimental group divided by the event rate in the control group, is directly dependent on the event rate in the control group, which has to be estimated before the start of the trial. Similarly, for a continuous outcome, the value of the SD determines how large the difference in means is relative to the variability of the outcome. For instance, in a sample of 100 patients per arm, a difference of 10 units in some continuous measurement would be significant (P = 0.047) if the SD was equal to 30. If the SD was 40, this difference would no longer be statistically significant (P = 0.11).
When comparing two groups with respect to a dichotomous outcome, the absolute risk difference is customarily used in sample size calculations as the effect size that is considered clinically relevant. The absolute risk difference is easier to interpret for clinical purposes, since it can be translated into a number needed to treat. However, for our present research we consider the RR to be a more consistent way to compare the efficacy of two treatments regardless of the CER; its value represents a relative measure of difference, taking into account the level of efficacy in the control group. For example, one could argue it is not logical to expect the same absolute difference, e.g. 20%, if the CER is 50% or 30%.
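The arithmetic behind this argument can be sketched as follows. The function name is hypothetical, and the example assumes that the event is an unfavorable outcome, so that the 20% absolute difference is a reduction in the event rate.

```python
def risk_ratio_for_fixed_reduction(cer, absolute_reduction):
    """Risk ratio implied by a fixed absolute risk reduction at a given control event rate."""
    experimental_rate = cer - absolute_reduction
    if not 0 <= experimental_rate <= 1:
        raise ValueError("absolute reduction is not attainable at this control event rate")
    return experimental_rate / cer


absolute_reduction = 0.20
for cer in (0.50, 0.30):
    rr = risk_ratio_for_fixed_reduction(cer, absolute_reduction)
    # The number needed to treat depends only on the absolute difference.
    nnt = 1 / absolute_reduction
    print(f"CER={cer:.2f}  RR={rr:.2f}  NNT={nnt:.0f}")
```

The same absolute difference corresponds to an RR of 0.60 when the CER is 50% but to an RR of about 0.33 when the CER is 30%, which is a far more demanding relative effect.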
There is published research addressing the accuracy and quality of sample size calculations and their reporting in clinical trials [22, 26, 27]. These papers reported several discrepancies between protocols and reports but also inadequate reporting and inaccuracies in general. Important guidelines for the reporting of RCTs are the CONSORT statement and the statement from the International Committee of Medical Journal Editors (ICMJE) on clinical trial registration. Reporting of sample size calculations would be expected to have improved, since it is explicitly required by these statements.
Besides the above mentioned papers, which cover the general spectrum of sample size calculation in RCTs, little is known about the misspecification of NPs in pediatric RCTs in particular. To investigate the impact of systematic misspecification of NPs in pediatric RCTs, we reviewed published papers reporting results of pediatric RCTs. We focused on trials conducted in neonatal intensive care units and pediatric intensive care units (PICUs) due to the vulnerability of the target populations in such studies. We furthermore focused on trials evaluating pharmacological interventions because of the increased interest from regulators and the ethical considerations mentioned above. These aspects require a high standard of clinical trial design. Finally, we will provide guidance about what can be done to prevent misspecification and its consequences.
We searched MEDLINE through PubMed, following the sensitivity- and precision-maximizing search strategy for identifying RCTs as suggested by the Cochrane Handbook for Systematic Reviews of Interventions. We searched for papers between 1 January 2006 and 31 October 2011, which covers a 5-year span from the application of the clinical trial registration statement from the ICMJE. Further limits imposed were ‘Humans’ for species, ‘English’ for language and ‘All Child: 0–18 years’ for age. Additional keywords ‘Intensive care’, ‘ICU’, ‘PICU’ or ‘NICU’ were used.
Selection and data extraction were performed by two authors (SN and IvdT) independently. Disagreements were discussed to reach consensus. Selection was restricted to publications concerning two-arm parallel group RCTs where efficacy assessment was the main objective. We only included trials with pharmacological interventions. Only trials with a dichotomous or a continuous outcome were included. Trials that were specifically described as Phase I or II, pilot or exploratory were excluded. We excluded trials that were designed with more than two groups (e.g. factorial designs and dose–response trials).
General characteristics of each study, namely, year of publication, included patients, experimental and control interventions, primary outcome, type of primary outcome (dichotomous/continuous), registration (yes/no and if yes, registration code) and use of a crossover design were extracted. For the a priori sample size calculations, the following information was extracted: type I error, power, one- or two-sided testing, the assumed value of NPs (since we only considered dichotomous and continuous outcomes, the NPs recorded were the assumed CER and common SD, respectively), expected effect size (i.e., the standardized effect size, expressed as Cohen’s d for continuous outcomes, which is the mean difference between the two groups divided by the common standard deviation, and the risk ratio for dichotomous outcomes), the required sample size (with and without accounting for dropout, if applicable) and, if reported, the information source on which the assumptions concerning the NP were based, e.g. literature, own experience or pilot study. From the results sections of the articles, we extracted information on the actual sample size randomized, the one used in the analysis (irrespective of whether an intention-to-treat or per-protocol analysis was conducted), the observed value of the NP and the observed effect size.
Some papers were included in this review because the outcomes measured were continuous or dichotomous, but it was not made clear, either in the sample size calculations or in the text, which outcome was the primary one. In these cases, the primary outcome type was coded as ‘unclear’. For a trial to be considered as reporting an a priori sample size calculation, at least the power should have been mentioned in the methods section of the publication. When the type I error was not reported, a value of 0.05 (two-sided) was assumed. The reported assumed NP value was taken into consideration when it was explicitly mentioned or traceable from a cited publication; thus, we did not attempt to (re-) calculate the assumed NP from the information provided in the methods section of the article.
Two authors (SN and IvdT) replicated the sample size calculations independently, based on the assumed parameters. These replicated sample sizes were calculated based on Student’s t-test for continuous variables and based on the chi-square test for dichotomous variables, which is equivalent to the two-sample binomial test (Z-test). We also recalculated required sample sizes based on the empirical values of the NPs as published in the paper. For a continuous outcome for which median and range were reported instead of mean and SD, the SD was calculated according to Hozo et al.
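A minimal sketch of the conversion from median and range to an SD is given below. It assumes the rules as they are commonly summarized from Hozo et al. (a variance formula for samples of up to 15, range/4 for samples of 16 to 70, and range/6 for larger samples); these thresholds and the function name are assumptions of this sketch and should be checked against the original publication before use.

```python
from math import sqrt


def sd_from_median_range(minimum, median, maximum, n):
    """Approximate the SD from the median, the range and the sample size.

    Assumed rules (to be verified against Hozo et al.):
      n <= 15      : variance = ((min - 2*median + max)^2 / 4 + range^2) / 12
      15 < n <= 70 : SD = range / 4
      n > 70       : SD = range / 6
    """
    value_range = maximum - minimum
    if n <= 15:
        variance = ((minimum - 2 * median + maximum) ** 2 / 4 + value_range ** 2) / 12
        return sqrt(variance)
    if n <= 70:
        return value_range / 4
    return value_range / 6


# Hypothetical summary statistics, for illustration only.
print(round(sd_from_median_range(0, 5, 10, n=10), 3))
print(sd_from_median_range(0, 5, 10, n=40))
```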
As mentioned before, empirically obtained estimators of nuisance parameters are also subject to random variation; therefore any systematic trend in the direction of possible misspecifications was our main interest. Statistical analyses were conducted with R, version 2.13.1 and SPSS (PASW statistics) version 17.
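A summary of the direction of misspecification across trials could be computed along the following lines. The definition used here, the assumed value minus the observed value as a percentage of the observed value, is an assumption of this sketch, and the pairs of values are hypothetical rather than extracted from the reviewed trials.

```python
from statistics import median, quantiles


def relative_misspecification(assumed, observed):
    """Percentage by which the assumed NP exceeds the observed NP (positive = overestimation)."""
    return 100.0 * (assumed - observed) / observed


# Hypothetical (assumed CER, observed CER) pairs, one per trial.
trials = [(0.30, 0.25), (0.50, 0.55), (0.20, 0.10), (0.40, 0.40), (0.15, 0.12)]

values = sorted(relative_misspecification(a, o) for a, o in trials)
q1, _, q3 = quantiles(values, n=4, method="inclusive")

print(f"median = {median(values):.1f}%, IQR = [{q1:.1f}, {q3:.1f}]%")
```

A median near zero with quartiles on either side would be consistent with random variation only, whereas a median clearly away from zero points to a systematic trend.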
Table: Basic characteristics of the 70 included papers. Rows: neonates (0 to 1 years old); children (>1 years old); intervention in the control group.
Table: Characteristics of sample size calculations of the 71 included trials (70 papers). Rows: type of primary outcome; a priori sample size calculation (at least power reported and, of these, NP reported as proportion of total; no report of a priori sample size calculation); information source of NP assumption (of the 49 trials that report at least power; of the 35 trials that report an assumed NP value).
In the reports of all 12 registered trials, an a priori sample size calculation (at least power mentioned) was reported; this was the case in 35 reports out of 58 trials (60%) that did not report a registration. The rate of reporting the expected NP was 10 out of 12 for papers that reported registration (83%) and 24 out of 58 (41%) for papers that did not. Note that for these figures the number of papers is the total sample size (70) rather than the number of trials (71).
Misspecification of nuisance parameters
The effect of the misspecification of the NP was apparent in the average power of the studies reviewed. The average power required by design was 83%, while the average power taking the observed NPs into account, based on the sample sizes calculated in the papers, would have been 73.9%. Based on our replicated sample sizes, the power achieved would be 71.8%. However, these results should only be taken as indicative and exploratory, as we share the concerns of other authors about power calculations after data have been collected. More specifically, researchers should be very careful with interpreting the post-hoc power, which is the power calculated for the observed treatment effect, and the same applies to the observed NP. There was no evidence of a relation between the source used to make assumptions for the NP and the magnitude of misspecification.
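The kind of recalculation described above can be sketched as follows: the targeted effect and the sample size are held at their design values and only the NP is replaced by its observed value. The functions use a normal approximation and ignore the negligible probability of rejecting in the wrong direction; the names and the numerical values are illustrative assumptions, not data from the review.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def power_continuous(n_per_arm, delta, sd, alpha=0.05):
    """Approximate power to detect a mean difference `delta` given the common SD."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    standard_error = sd * sqrt(2 / n_per_arm)
    return STD_NORMAL.cdf(delta / standard_error - z_alpha)


def power_dichotomous(n_per_arm, cer, rr, alpha=0.05):
    """Approximate power to detect a risk ratio `rr` given the control event rate."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    p_control, p_experimental = cer, cer * rr
    variance = p_control * (1 - p_control) + p_experimental * (1 - p_experimental)
    standard_error = sqrt(variance / n_per_arm)
    return STD_NORMAL.cdf(abs(p_control - p_experimental) / standard_error - z_alpha)


# Design: SD assumed to be 30, n = 142 per arm for 80% power to detect 10 units.
print(f"{power_continuous(142, delta=10, sd=30):.2f}")  # as designed
print(f"{power_continuous(142, delta=10, sd=40):.2f}")  # if the SD is really 40

# Design: CER assumed to be 0.5, n = 91 per arm for 80% power to detect RR = 0.6.
print(f"{power_dichotomous(91, cer=0.5, rr=0.6):.2f}")  # as designed
print(f"{power_dichotomous(91, cer=0.3, rr=0.6):.2f}")  # if the CER is really 0.3
```

Because the targeted effect stays at its design value, this is not the post-hoc power for the observed treatment effect; the caution about the observed NP being itself an estimate still applies.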
Minimum detectable effect size in trials with dichotomous outcomes
In this review of 71 pediatric clinical trials, our main goal was to assess the presence and magnitude of systematic misspecification of NPs in sample size calculations. Deviations between assumed and realized values of NPs can lead to undesirable trial characteristics like underestimated sample size and overestimated power. This can possibly lead to important clinical improvements being missed and to an increased number of trials unnecessarily considered negative or failures. It also reduces the value of individual patients participating in clinical trials. Some experts consider underpowered trials to be unethical.
Of course, observed parameter values deviate from the assumed ones due to random fluctuation, and this is incorporated in sample size estimation. If estimation is accurate, these discrepancies are expected to occur in both directions (both over- and underestimation), causing no overall effect on the design characteristics of the RCTs reviewed. However, as the results of our review show, there is systematic misspecification of nuisance parameters, resulting in an average power about 10 percentage points lower than required at the design stage. As a result, more patients should have been studied for the conclusions of the studies to be in compliance with their design characteristics. The loss in power theoretically means that roughly an additional 10% of studies with promising interventions are expected to conclude incorrectly that there is no benefit.
An important issue of concern is that reporting of sample size calculations is still not adequate. This is in accordance with the findings by Charles et al. We assumed that the CONSORT statement and the clinical trials registration would have led to more transparent reporting, but the percentage of registered trials was very low (17%). It should be noted though that while trial registration was stated as a requirement for publication by ICMJE, we did not restrict our search to these journals. The rate of registration may in reality have been higher, since our information depended on explicit reporting in journal articles.
Misspecification of the NP has more severe consequences for trials with a dichotomous outcome than for those with a continuous outcome. As the results of our review show, the CER was found to be up to 200% misspecified. One way around this is to avoid dichotomizing continuous outcomes, if possible, and also to avoid treating time-to-event outcomes as binary. Misspecification, especially underestimation, of the SD for a trial with a continuous outcome also has considerable consequences. However, we are unable to draw reliable conclusions about the SD from our study, because of the very small number of trials with a continuous endpoint reporting both assumed and realized values of the standard deviation.
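One way to make the consequences of a misspecified CER concrete is to ask which risk ratio a trial of a given size can still detect with the planned power. The sketch below finds that value by bisection under a normal approximation; the function names, the restriction to risk ratios below 1 and the numerical values are assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def power_dichotomous(n_per_arm, cer, rr, alpha=0.05):
    """Approximate power to detect a risk ratio `rr` given the control event rate."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    p_control, p_experimental = cer, cer * rr
    variance = p_control * (1 - p_control) + p_experimental * (1 - p_experimental)
    return STD_NORMAL.cdf(abs(p_control - p_experimental) / sqrt(variance / n_per_arm) - z_alpha)


def minimum_detectable_rr(n_per_arm, cer, alpha=0.05, power=0.80, tol=1e-6):
    """Risk ratio closest to 1 (weakest effect) that is still detected with the target power.

    Returns None if even RR = 0 cannot be detected with the target power.
    """
    if power_dichotomous(n_per_arm, cer, 0.0, alpha) < power:
        return None
    low, high = 0.0, 1.0  # power is adequate at `low` and inadequate at `high`
    while high - low > tol:
        middle = (low + high) / 2
        if power_dichotomous(n_per_arm, cer, middle, alpha) >= power:
            low = middle   # still detectable: try a weaker effect
        else:
            high = middle  # not detectable: require a stronger effect
    return low


# A trial sized at 91 per arm for RR = 0.6 under an assumed CER of 0.5.
print(f"{minimum_detectable_rr(91, cer=0.5):.2f}")
print(f"{minimum_detectable_rr(91, cer=0.3):.2f}")
```

In this illustration, a trial planned to detect an RR of about 0.60 can, if the CER turns out to be 0.3 instead of 0.5, only detect an RR of roughly 0.44 with the same power.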
Further limitations of our study are the quite specific inclusion criteria (trials with pharmacological interventions and conducted in an ICU). The findings may not be generalizable beyond this group of trials. Additionally, RCTs that are not analyzed by the intention-to-treat principle are likely to introduce bias in estimation of the treatment effect, which could also have implications for sample size calculations. However, it was seldom reported whether the trial was analyzed by the intention-to-treat principle (in only 19 papers was it clearly stated). Furthermore, even though the search was conducted in a systematic way, the possibility that some trials that could fit our inclusion criteria were missed, cannot be excluded. However, we do not expect this to affect the validity of our results since the scope of this review is to explore the state of affairs rather than, e.g., evaluate the effectiveness of an intervention where missing a trial would be considered a caveat.
Misspecification of NPs occurs frequently in pediatric clinical trials conducted in ICUs. Failure to report a priori assumptions about NPs appeared to be more common in the trial reports included in this review than in trials published in high-impact medical journals, even though these trials included an extremely vulnerable population. Awareness of this matter should be raised, and journal editors should be more demanding in enforcing the reporting standards adopted by the high-impact journals.
Methodologies exist that are less sensitive to assumptions of NPs, such as using a more flexible design and analysis (e.g. sequential trials) or re-estimation of the sample size (internal pilot). Another way would be to state the expectations for the clinically relevant effect size in a standardized way (e.g. use of Cohen’s standardized effect size, [25, 33]). This allows one not to make specific assumptions for the NP but rather state the magnitude of the effect size considered clinically relevant (e.g. small, medium or large effect size).
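Two of these options can be sketched briefly. The first function sizes a trial directly from Cohen's standardized effect size, so that no separate assumption about the SD is needed; the second re-estimates the sample size from the pooled variance of interim data, as in an internal pilot. Both use a normal approximation. The function names, the interim data and the rule of never going below the planned sample size are assumptions of this sketch, not a description of any particular published procedure.

```python
from math import ceil
from statistics import NormalDist, variance


def _z_sum_squared(alpha, power):
    """Squared sum of the normal quantiles for a two-sided test and the target power."""
    std_normal = NormalDist()
    return (std_normal.inv_cdf(1 - alpha / 2) + std_normal.inv_cdf(power)) ** 2


def n_per_arm_from_cohens_d(d, alpha=0.05, power=0.80):
    """Patients per arm for a standardized effect size d (mean difference divided by SD)."""
    return ceil(2 * _z_sum_squared(alpha, power) / d ** 2)


def reestimated_n_per_arm(delta, interim_a, interim_b, planned_n, alpha=0.05, power=0.80):
    """Re-estimate the per-arm sample size from the pooled variance of interim data.

    Design choice of this sketch: the sample size is never reduced below the planned one.
    """
    df_a, df_b = len(interim_a) - 1, len(interim_b) - 1
    pooled_variance = (df_a * variance(interim_a) + df_b * variance(interim_b)) / (df_a + df_b)
    n_new = ceil(2 * pooled_variance * _z_sum_squared(alpha, power) / delta ** 2)
    return max(planned_n, n_new)


# Cohen's conventional small, medium and large effect sizes.
for d in (0.2, 0.5, 0.8):
    print(d, n_per_arm_from_cohens_d(d))

# Hypothetical interim observations from the two arms.
arm_a = [0, 10, 20, 30, 40]
arm_b = [5, 15, 25, 35, 45]
print(reestimated_n_per_arm(delta=10, interim_a=arm_a, interim_b=arm_b, planned_n=30))
```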
Research in vulnerable populations, like children, is challenging and demanding. Cumulative knowledge is difficult to acquire but necessary for evidence-based evaluation of medical interventions. This should be done in the most efficient and ethical way possible and a well-thought-out study design is a crucial step towards this goal. We would strongly advise editors of all medical journals to adopt the reporting standards guidance and be more demanding that authors conform to these standards.
This research was partly funded by the Netherlands Organization for Health Research and Development (ZonMW) through grant number 152002035, for ‘Optimal design and analysis for clinical trials in orphan diseases’.
- 6. ICH Harmonised Tripartite Guideline: Statistical principles for clinical trials. International Conference on Harmonisation E9 Expert Working Group. Stat Med. 1999, 18: 1905-1942.
- 7. De Angelis CD, Drazen JM, Frizelle FA, Haug C, Hoey J, Horton R, Kotzin S, Laine C, Marusic A, Overbeke AJ, Schroeder TV, Sox HC, Van Der Weyden MB: Is this clinical trial fully registered? A statement from the International Committee of Medical Journal Editors. N Engl J Med. 2005, 352: 2436-2438.
- 13. Food and Drug Administration: International conference on harmonisation; guidance on E11 clinical investigation of medicinal products in the pediatric population; availability: notice. Fed Regist. 2000, 65: 78493-78494.
- 19. Gill D, Ethics Working Group of the Confederation of European Specialists in Paediatrics: Ethical principles and operational guidelines for good clinical practice in paediatric research. Recommendations of the Ethics Working Group of the Confederation of European Specialists in Paediatrics (CESP). Eur J Pediatr. 2004, 163: 53-57.
- 28. The Cochrane Collaboration. [http://www.cochrane-handbook.org]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.