In an article recently published in Intensive Care Medicine, Constant et al. [1] report the results of their observational study investigating the effectiveness of targeted temperature management (TTM) in improving outcome in patients who were successfully resuscitated after cardiac arrest during surgical procedures [2]. The authors retrospectively reviewed 101 cases that occurred between 2008 and 2013 in 11 centres, 30 of which were treated with TTM. In the logistic regression analysis, TTM did not turn out to be an independent predictor of favourable neurological outcome. Consistent with data from the literature, shockable rhythms were strongly protective while emergency surgery worsened the prognosis [2, 3].

The use of logistic regression was advisable to compensate for the imbalances between the study and the control group in terms of important prognostic variables, estimating the prognostic weight of each variable independently of the others. The multivariable approach increases the reliability of the findings, although it accounts only for measured variables, whereas randomized controlled trials also balance unmeasured ones (including those that are unknown).

Moreover, an observational study investigating the efficacy of a treatment faces a further complication. The probability of receiving the study treatment differs between patients and may be related to important prognostic factors. For example, physicians may not prescribe a treatment, especially when it is expensive or demanding in terms of workload, to patients whose condition is too severe. In such cases, the better outcome of treated patients would be falsely attributed to the treatment rather than to their more favourable baseline condition.
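The mechanism described above can be illustrated with a small arithmetic sketch. All counts below are invented for illustration and are not taken from the study by Constant et al.; the treatment is assumed to have no effect at all, and physicians are assumed to treat less severe patients more often.

```python
# Hypothetical counts (good outcomes, patients) per severity stratum and arm.
# These numbers are invented for illustration only.
strata = {
    "less severe": {"treated": (64, 80), "untreated": (16, 20)},
    "more severe": {"treated": (4, 20), "untreated": (16, 80)},
}


def stratum_rate(stratum, arm):
    """Proportion of good outcomes in one arm of one severity stratum."""
    good, total = strata[stratum][arm]
    return good / total


def crude_rate(arm):
    """Proportion of good outcomes in one arm, ignoring severity."""
    good = sum(s[arm][0] for s in strata.values())
    total = sum(s[arm][1] for s in strata.values())
    return good / total


# Within each stratum the two arms do equally well (no treatment effect) ...
for name in strata:
    print(name, stratum_rate(name, "treated"), stratum_rate(name, "untreated"))

# ... yet the crude comparison suggests a large benefit of treatment.
print("crude", crude_rate("treated"), crude_rate("untreated"))
```

In this toy example the crude comparison shows 68% versus 32% good outcomes, although the rates are identical within each severity stratum. The apparent benefit arises only because treated patients were less severe to begin with.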

From the statistical perspective, the study by Constant et al. has two weak points.

First, its small sample size does not allow one to draw conclusions about causal relations between the variables included in the multivariable model and the outcome. When this is the purpose of the analysis, all the most important predictors should be included in the model [4]. An underfitted model (i.e. one including an insufficient number of variables) may omit important causal factors and wrongly estimate the weights of the included variables. On the other hand, overfitting can occur when too many variables in relation to the number of outcomes are included in the model, with a high risk of generating biased estimates of the weights of the variables [5]. It has been demonstrated that when dealing with binary outcomes (yes or no events, the outcome in logistic regression) having at least ten outcomes for each variable is a safe threshold [6, 7], although it has been suggested that the risk of bias could be acceptable when the outcome/variable ratio is between 5 and 10 [8]. What researchers often disregard is that this ratio should be computed using the initial number of variables entering the model and not the number left after the selection process has been carried out.
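The outcome/variable ratio discussed above can be sketched as a short calculation. The function names and the example counts are hypothetical and chosen only for illustration. The sketch also assumes the common convention that the number of "outcomes" is the size of the less frequent outcome category.

```python
def events_per_variable(n_events, n_non_events, n_candidate_variables):
    """Outcome/variable ratio based on the less frequent outcome category.

    n_candidate_variables is the number of variables initially considered,
    not the number retained after variable selection.
    """
    if n_candidate_variables <= 0:
        raise ValueError("at least one candidate variable is required")
    limiting_outcomes = min(n_events, n_non_events)
    return limiting_outcomes / n_candidate_variables


def classify_ratio(ratio):
    """Apply the thresholds quoted in the text (10, and 5 to 10)."""
    if ratio >= 10:
        return "conventionally safe"
    if ratio >= 5:
        return "possibly acceptable"
    return "high risk of overfitting"


# Hypothetical example: 40 events among 100 patients.
final_model = events_per_variable(40, 60, 6)   # only the 6 retained variables
screened = events_per_variable(40, 60, 12)     # all 12 candidates screened
print(round(final_model, 2), classify_ratio(final_model))
print(round(screened, 2), classify_ratio(screened))
```

With these invented numbers, counting only the six retained variables gives a ratio of about 6.7, whereas counting the twelve candidates initially screened gives about 3.3. This is the discrepancy that is easily overlooked when the ratio is computed after selection.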

Thus, Constant et al. were dealing with a small sample for explanatory purposes (i.e. seeking causal relations between TTM and long-term neurological outcome) and were caught between the risk of overfitting and that of underfitting. The researchers included only six variables in their logistic regression model (eventually, only two variables turned out to be statistically significant) and could not include more because of the risk of overfitting. Moreover, because of the limited sample size the model was probably underpowered to detect important predictors. For example, time from cardiac arrest to restoration of spontaneous circulation, a plausible predictor of poor neurological outcome [2], did not reach statistical significance (p = 0.11). Is this because of insufficient power or because of specific features of intraoperative cardiac arrest? Although the no-flow time was very short, and it is reasonable to think it had little if any influence on the outcome, technically the study cannot provide the answer.

The second weak point is that their logistic regression does not account for prognostically important variables linked to indications for TTM. This can be done by developing a propensity score, usually using logistic regression with baseline variables as predictors and treatment as the dependent variable [7]. Patients with the same propensity score will, hence, have the same estimated probability of receiving the treatment. The propensity score is then included in the final logistic regression model, which thus measures the prognostic weight of model variables (including TTM) given the same probabilities of receiving the treatment [9].
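The first step described above, estimating a propensity score by logistic regression with treatment as the dependent variable, can be sketched as follows. This is a minimal illustration using plain gradient ascent rather than a statistical package. The 16 patients, the two baseline covariates and all function names are hypothetical and are not the data or the model of Constant et al.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def linear_predictor(w, x):
    """Intercept w[0] plus the weighted sum of the covariates."""
    return w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))


def fit_logistic(X, y, lr=0.5, n_iter=5000):
    """Fit a logistic regression by gradient ascent on the mean log-likelihood."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(n_iter):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = yi - sigmoid(linear_predictor(w, xi))
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj + lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w


# Hypothetical baseline covariates: (shockable rhythm, emergency surgery).
X = [(1, 0)] * 4 + [(0, 0)] * 4 + [(1, 1)] * 4 + [(0, 1)] * 4
# Hypothetical treatment assignment (1 = received the treatment).
treated = [1, 1, 1, 0,  1, 1, 0, 0,  1, 0, 0, 0,  0, 0, 0, 1]

w = fit_logistic(X, treated)
propensity = [sigmoid(linear_predictor(w, xi)) for xi in X]
print([round(p, 2) for p in propensity])
```

Each patient's propensity score is the fitted probability of having received the treatment given the baseline covariates. In this invented dataset emergency surgery lowers that probability, so its coefficient is negative. The score could then be entered as a covariate in the outcome model, as the text describes.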

Alternatively, the propensity score can be used to create matched pairs of treated and untreated patients, yielding a study and a control arm with the same probabilities of receiving the treatment and thus resembling a randomized controlled trial, with the (not negligible) limitation of being based only on measured variables. Constant et al. adopted this approach as a sensitivity analysis. Propensity score matching was probably a secondary analysis because it allows one to assess the effectiveness of TTM but does not measure the prognostic weight of other variables (such as shockable rhythm) as logistic regression does. However, because of the small sample size, the propensity score was limited to three variables and did not account for the complexity of the criteria that govern the decision whether or not to give a specific treatment, with a high risk of generating a biased model [10].
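Matching on the propensity score can be sketched with a simple greedy 1:1 nearest-neighbour algorithm without replacement. This is only one of several possible matching strategies and is not necessarily the one used by Constant et al. The caliper value, the scores and the function name are hypothetical.

```python
def greedy_match(ps_treated, ps_control, caliper=0.05):
    """Greedy 1:1 nearest-neighbour matching without replacement.

    Returns a list of (treated_index, control_index) pairs. A treated
    patient stays unmatched when no remaining control has a propensity
    score within the caliper.
    """
    available = set(range(len(ps_control)))
    pairs = []
    for i, p in enumerate(ps_treated):
        if not available:
            break
        # Closest remaining control; ties are broken by the lower index.
        j = min(available, key=lambda k: (abs(ps_control[k] - p), k))
        if abs(ps_control[j] - p) <= caliper:
            pairs.append((i, j))
            available.remove(j)
    return pairs


# Hypothetical propensity scores.
ps_treated = [0.80, 0.55, 0.30]
ps_control = [0.78, 0.50, 0.52, 0.05]

pairs = greedy_match(ps_treated, ps_control)
print(pairs)  # [(0, 0), (1, 2)]
```

In this invented example the third treated patient remains unmatched because no control lies within the caliper. This shows how matching can further reduce an already small sample.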

The authors honestly acknowledge the limitations and the exploratory nature of their study, calling for confirmatory research, without overemphasizing their results [11].

Why then should readers be interested in a study providing very weak evidence in support of or against TTM in intraoperative cardiac arrest?

In my opinion this study deserves credit for raising the important issue of how evidence should be applied in clinical practice. We have evidence from two randomized controlled trials that TTM is effective in improving prognosis after cardiac arrest [12, 13]. However, the authors stress the importance of not automatically transferring this evidence to intraoperative cardiac arrest, which has specific features. I agree with this interpretation for several reasons. First, general anaesthesia may itself have a neuroprotective effect. Second, in monitored patients detection of cardiac arrest is immediate, and consequently treatment is timely. Third, in patients under anaesthesia it is difficult to assess the presence of coma. Fourth, patients may frequently die because of causes related to surgical procedures and not because of anoxia following cardiac arrest. The combination of these factors can affect the severity and prognostic relevance of anoxia, the correct diagnosis of coma, and consequently the effectiveness of TTM.

Besides fuelling a debate on a clinically meaningful question, the study provides detailed descriptive data on the subject. Although the sample is limited, we have a clear picture of the characteristics of patients experiencing intraoperative cardiac arrest. Permanent registries may provide larger samples but, not being focused on the specific subject, they usually lack important information.

Thus, small observational studies, based on clinically relevant hypotheses and carried out in unexplored fields where it is difficult to collect detailed data, are a valuable resource for medical science for the hypotheses they generate and for the descriptive data they provide, rather than for inferential analyses that are inherently weak. Their value would, however, be even greater if researchers made their data publicly available [14], stimulating the replication of their studies and favouring the progressive expansion of a common dataset. This would allow analyses carried out at patient level with greater statistical power and greater external validity, providing a contribution to evidence that each single small observational study could never provide.

Under the paradigm of data sharing, small observational studies could thus be saved in a common database as coins in the piggy bank of evidence.