Background

The coronavirus disease 2019 (COVID-19) pandemic, with its overwhelming resource use, has been a major challenge for clinicians and health care institutions worldwide. Identifying patients at high risk of disease progression may help allocate resources more efficiently. Since the presentation and course of the infection can vary considerably (including asymptomatic cases), no single trait is sufficient to categorise patients appropriately [1,2,3,4,5,6,7,8,9]. Thus, several scores have been developed to improve the identification of patients at high risk of progression of, or death from, COVID-19. Among these scores, the CALL, CHOSEN, HA2T2 and ANDC scores have generated much interest [10,11,12,13].

The CALL score (Comorbidity, Age, Lactate dehydrogenase (LDH) and Lymphocyte count) showed great discriminatory potential for disease progression with an area under the curve (AUC) of 0.91 (95%-CI 0.86–0.94) in its derivation cohort [10]. Disease progression was defined as respiratory rate ≥ 30 breaths per minute (bpm), peripheral oxygen saturation (SpO2) ≤ 93%, arterial partial oxygen pressure (PaO2)/fraction of inspired oxygen (FiO2) ≤ 300 mmHg, mechanical ventilation or worsening of lung computed tomography (CT) findings [10]. The CHOSEN score used age, FiO2 and albumin to predict progression defined as requiring supplemental oxygen, admission to the intensive care unit (ICU) or death [11]. The authors reported a good discriminative capacity for their score with an AUC of 0.89 (95%-CI 0.87–0.91) in their derivation cohort and 0.87 (95%-CI 0.81–0.93) in their validation cohort [11]. The HA2T2 score was used to predict all-cause in-hospital mortality in COVID-19 patients based on need for supplemental oxygen, age and troponin [12]. It showed good discriminative power in both its derivation cohort (AUC 0.83, 95%-CI 0.79–0.88) and its validation cohort (AUC 0.78, 95%-CI 0.72–0.84) [12]. The ANDC score, based on age, neutrophil-to-lymphocyte ratio (NLR), d-dimer and C-reactive protein (CRP), predicted all-cause in-hospital mortality with an excellent AUC of 0.92 (95%-CI 0.84–0.97) in its derivation cohort and 0.98 (95%-CI 0.95–1.00) in its validation cohort [13].

So far, only the CALL score has undergone external validation, in which it performed markedly worse than in the original cohort (AUC 0.62 vs. 0.91) [14]. Thus, independent external validation of all these scores is mandatory before widespread implementation. Herein, we validated four severity scores (i.e., the CALL, CHOSEN, HA2T2 and ANDC scores) in patients with COVID-19 hospitalised in a tertiary care centre in Switzerland.

Methods

Study design and participants

This retrospective observational analysis included all consecutive adult patients (≥ 18 years) with a confirmed Severe Acute Respiratory Syndrome Corona Virus type 2 (SARS-CoV-2) infection that required hospitalisation for at least 24 h at the Medical University Clinic of the Cantonal Hospital Aarau (Switzerland) between February 26, 2020 and April 30, 2020 (first wave) and between October 1, 2020 and December 31, 2020 (second wave). In this tertiary care centre with 130 medical ward beds, indications for in-hospital treatment of COVID-19 were respiratory distress with need for oxygen supplementation, high fever or relevant clinical deterioration. This study was approved by the local ethics committee (EKZN, 2020-01306).

A detailed description of the study methodology has been reported previously [6, 15]. A confirmed SARS-CoV-2 infection was defined as a combination of typical clinical symptoms (e.g., respiratory symptoms with or without fever, and/or pulmonary infiltrates and/or anosmia/dysgeusia) and a positive real-time reverse-transcription polymerase chain reaction (RT-PCR) test, obtained from nasopharyngeal swabs or lower respiratory tract samples, according to guidance by the World Health Organization (WHO) [16, 17]. Data for the second wave also included patients with positive rapid-antigen tests. However, due to the lower positive predictive value of these tests, we excluded asymptomatic patients unless their rapid-antigen results were confirmed by a positive RT-PCR test. We further excluded patients from the analysis if they did not provide general informed consent or if they had not yet been discharged when data collection was closed (January 20, 2021). This study adheres to the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) statement.

Data collection

All analysed data were collected as part of the clinical routine during the hospitalisation (from admission to discharge or death). We performed chart reviews and automatic export from electronic health records (EHR), including vital signs and clinical characteristics upon admission as well as sociodemographic factors, comorbidities based on pre-existing diagnoses and home medication. COVID-19-specific inpatient medication was assessed until hospital discharge or death and exported from the EHR. Experimental treatment was offered to all suitable patients according to ongoing clinical trials and WHO guidelines [16,17,18]. During the second wave, this also included the application of high-dose glucocorticoids [19]. The age-adjusted Charlson comorbidity index (ACCI) [20] and the Clinical Frailty Scale score (CFS) [21] were calculated for all patients as part of the clinical routine or through chart review. Laboratory values were available according to clinical routine and derived from the first blood draw obtained within 7 days from admission.

Definition of endpoints

All-cause in-hospital mortality was defined as the primary endpoint. The secondary endpoint, disease progression, had different definitions in the original studies. For easier comparability between the scores, we defined disease progression as needing invasive ventilation, ICU admission or death in our own analysis. Originally, the CALL score defined progression as respiratory rate ≥ 30 bpm, SpO2 ≤ 93%, PaO2/FiO2 ≤ 300 mmHg, requiring mechanical ventilation or worsening of lung CT findings. CT findings were not available for our analysis and thus not considered. The definition of progression for the CHOSEN score was requirement of supplemental oxygen, admission to the ICU or death. Validation results were based on these original definitions.
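The three progression definitions described above differ in which events they count, so the same patient can be classified differently depending on the definition. The following minimal Python sketch makes this explicit. The `Admission` record and all of its field names are hypothetical and chosen for illustration only, missing values are not handled, and the CT criterion of the CALL definition is omitted because it was not available in this analysis.

```python
from dataclasses import dataclass


@dataclass
class Admission:
    """Hypothetical per-patient record; all field names are illustrative."""
    died_in_hospital: bool
    icu_admission: bool
    invasive_ventilation: bool
    mechanical_ventilation: bool
    supplemental_oxygen: bool
    respiratory_rate_bpm: float
    spo2_percent: float
    pao2_fio2_mmhg: float


def progression_harmonised(p: Admission) -> bool:
    """Harmonised definition: invasive ventilation, ICU admission or death."""
    return p.invasive_ventilation or p.icu_admission or p.died_in_hospital


def progression_call(p: Admission) -> bool:
    """Original CALL definition, without the CT criterion (not available)."""
    return (
        p.respiratory_rate_bpm >= 30
        or p.spo2_percent <= 93
        or p.pao2_fio2_mmhg <= 300
        or p.mechanical_ventilation
    )


def progression_chosen(p: Admission) -> bool:
    """Original CHOSEN definition: supplemental oxygen, ICU admission or death."""
    return p.supplemental_oxygen or p.icu_admission or p.died_in_hospital


# Hypothetical patient: on supplemental oxygen with an SpO2 of 92%, never
# ventilated, never admitted to the ICU, discharged alive.
example = Admission(
    died_in_hospital=False, icu_admission=False, invasive_ventilation=False,
    mechanical_ventilation=False, supplemental_oxygen=True,
    respiratory_rate_bpm=22, spo2_percent=92, pao2_fio2_mmhg=310,
)
print(progression_harmonised(example),
      progression_call(example),
      progression_chosen(example))
```

For this hypothetical patient, the harmonised definition records no progression, whereas both original definitions do, which illustrates why event rates are not directly comparable across definitions.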

Statistical analysis

Discrete variables are expressed as frequencies (percentages) and continuous variables as medians with interquartile ranges (IQR, for skewed data) or means with standard deviations (SD, for normally distributed data). We used the Wilcoxon rank-sum test to compare continuous variables and Pearson's chi-squared test to compare categorical or binary variables. Odds ratios (OR) were calculated with corresponding 95% confidence intervals (CI) as measures of association. We assessed calibration for mortality numerically by tabulating the observed risks against those reported in the original studies. The reported risks were not available for the CALL and CHOSEN scores. We considered a two-sided p-value of < 0.05 significant and calculated the unadjusted area under the receiver operating characteristic curve (AUC) as a measure of discrimination. Statistical analysis was performed as a complete-case analysis based on the original regression coefficients using Stata 15.1 (StataCorp, College Station, TX, USA).
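To illustrate what a complete-case analysis and an unadjusted AUC amount to, the following is a minimal Python sketch. It is not the Stata code used in this study. It relies on the standard equivalence between the AUC and the normalised Mann-Whitney (Wilcoxon rank-sum) statistic, i.e. the probability that a randomly chosen patient with the event has a higher score than a randomly chosen patient without it, with ties counted as one half. The function names, field names and data are hypothetical.

```python
def complete_cases(rows, required):
    """Keep only rows in which every required field is present (not None)."""
    return [r for r in rows if all(r.get(k) is not None for k in required)]


def auc(scores, outcomes):
    """AUC as P(score of event > score of non-event), ties counted as 1/2."""
    events = [s for s, y in zip(scores, outcomes) if y == 1]
    non_events = [s for s, y in zip(scores, outcomes) if y == 0]
    if not events or not non_events:
        raise ValueError("need at least one event and one non-event")
    wins = 0.0
    for e in events:
        for n in non_events:
            if e > n:
                wins += 1.0
            elif e == n:
                wins += 0.5
    return wins / (len(events) * len(non_events))


# Hypothetical data: one patient lacks a value needed to compute the score.
rows = [
    {"score": 2, "died": 0},
    {"score": 5, "died": 1},
    {"score": 3, "died": 0},
    {"score": None, "died": 1},  # dropped: score not computable
    {"score": 3, "died": 1},
    {"score": 1, "died": 0},
]
kept = complete_cases(rows, ["score", "died"])
value = auc([r["score"] for r in kept], [r["died"] for r in kept])
print(len(kept), round(value, 3))  # 5 0.917
```

Because patients with missing components are dropped separately for each score, each score is evaluated in a different subset of patients.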

Results

Figure 1 provides an overview of the study flow. Table 1 shows patient demographics, comorbidities, laboratory values and vital signs on admission, both overall and stratified by score cohort. In total, 399 patients hospitalised with a confirmed SARS-CoV-2 infection were included in this analysis (mean age 66.6 ± 13.4 years [SD], 68% male). Complete data sets allowing calculation of the CALL and CHOSEN scores were available for 297 and 380 patients, respectively. Fewer patients had all values necessary to calculate the HA2T2 (n = 151) and ANDC scores (n = 124). There were several noticeable differences between the score cohorts, for example in transfer rates from other hospitals (ranging from 14.5% for ANDC to 28.5% for HA2T2), supplemental oxygen (29.8% for CALL to 45.7% for HA2T2), obesity (30.8% for CHOSEN to 41.7% for ANDC) and ICU admission (19.5% for CHOSEN to 46.4% for HA2T2). However, overall comorbidity and frailty were similar.

Fig. 1 Overview of study flow. In total, 399 patients were included in the final analysis, 67 of whom had complete data sets available

Table 1 Baseline characteristics and treatment of patients hospitalised with confirmed SARS-CoV-2 infection

Table 2 shows the discriminative power of each score for mortality and disease progression (defined as requiring invasive ventilation, ICU admission or death for all scores for easier comparability). For mortality, the HA2T2 score performed best (AUC 0.78, 95%-CI 0.70–0.85). For progression, overall discriminative capacity was lower, with the CHOSEN score performing slightly better than the others (AUC 0.66, 95%-CI 0.60–0.72). All scores were associated with mortality.

Table 2 Score values stratified by survivorship with corresponding OR and AUC

Sensitivity and specificity as well as positive and negative predictive value for each proposed cut-off are summarised in Table 3 and visualised in Fig. 2. The negative predictive value of the CALL score was highest (≥ 6 points: 100%, 95%-CI 75.3–100), while the highest positive predictive value was found for the HA2T2 score (≥ 3 points: 58.6%, 95%-CI 38.9–76.5).
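The measures reported for each cut-off follow directly from a 2x2 table of score category against outcome. The following minimal Python sketch shows the arithmetic. All counts are hypothetical and are not taken from Table 3. The second function gives the closed form of the exact (Clopper-Pearson) two-sided lower confidence bound for the special case in which all n patients in a group are event-free. Both the choice of n = 13 and the use of the exact method are assumptions made here only because they reproduce a lower bound of 75.3% for a proportion of 100%; the source states neither the denominator nor the interval method.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Sensitivity, specificity, PPV and NPV from 2x2 counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


def exact_lower_bound_all_successes(n, alpha=0.05):
    """Two-sided Clopper-Pearson lower bound when n of n are successes."""
    return (alpha / 2) ** (1.0 / n)


# Hypothetical counts (not from Table 3): no deaths below the cut-off.
m = diagnostic_metrics(tp=12, fp=39, fn=0, tn=13)
print(m["npv"], m["specificity"])  # 1.0 0.25

# Assumed denominator of 13 event-free patients below the cut-off.
lower = exact_lower_bound_all_successes(13)
print(round(100 * lower, 1))  # 75.3
```

A negative predictive value of 100% that rests on a small group therefore still comes with a wide confidence interval.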

Table 3 Sensitivity, specificity, positive and negative predictive values for mortality and disease progression for all scores and their original cut-offs

Fig. 2 Survival time analysis for (a) CALL score, (b) CHOSEN score, (c) HA2T2 score and (d) ANDC score, with their respective cut-off subgroups

The direct comparison with the original outcomes can be found in Table 4. Only the HA2T2 score performed similarly, with an AUC of 0.78 (95%-CI 0.72–0.84) in the original validation cohort and an AUC of 0.78 (95%-CI 0.70–0.85) in our sample. The discriminative power of all other scores was markedly worse than in their respective original cohorts. These results persisted when the analysis was repeated in the cohort with full data sets for all scores (n = 67, data not shown).

Table 4 Comparison of current analysis with original study results and outcomes

The calibration assessment for mortality for the HA2T2 and ANDC scores can be found in additional files 1 and 2 (Tables S1 and S2). Overall, calibration was poor, with the ANDC score performing slightly better (overprediction of up to 18 percentage points) than the HA2T2 score (underprediction of up to 30 percentage points). Calibration could not be assessed for the CALL and CHOSEN scores because the required data had not been published.
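The numerical calibration assessment amounts to comparing, for each score value, the mortality observed in the validation sample with the risk published for that score value. The following minimal Python sketch shows one way to tabulate this. The function name, the data and the published risks are hypothetical and do not reproduce Tables S1 and S2. A positive difference indicates overprediction by the published risk and a negative difference indicates underprediction.

```python
from collections import defaultdict


def calibration_table(scores, died, published_risk):
    """Observed vs. published mortality risk per score value, in percent."""
    counts = defaultdict(lambda: [0, 0])  # score -> [deaths, patients]
    for s, d in zip(scores, died):
        counts[s][0] += d
        counts[s][1] += 1
    table = []
    for s in sorted(counts):
        deaths, n = counts[s]
        observed = 100.0 * deaths / n
        expected = published_risk[s]
        # Difference in percentage points: expected minus observed.
        table.append((s, n, observed, expected, expected - observed))
    return table


# Hypothetical score values, outcomes (1 = died) and published risks (%).
scores = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2]
died = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]
published_risk = {0: 5.0, 1: 60.0, 2: 80.0}

table = calibration_table(scores, died, published_risk)
for row in table:
    print(row)
```

In this hypothetical example, the published risk underpredicts mortality by 20 percentage points at score values 0 and 2 and overpredicts it by 10 percentage points at score value 1.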

Discussion

In this validation study, four currently available scores to predict mortality and disease progression in COVID-19 patients performed markedly worse in patients hospitalised at a Swiss tertiary care centre than in their original cohorts. The HA2T2 score showed the best discrimination for mortality (AUC 0.78, 95%-CI 0.70–0.85) and was the only score whose results were similar to those of its original validation cohort.

Some loss of predictive ability can be explained by the differences between our study population and the original derivation cohorts. This is most apparent when comparing age, which has been recognised as an important risk factor for worse outcomes [22] and is included in all four scores. Mean age ranged from 44 to 65 years for the CALL, CHOSEN, HA2T2 and ANDC scores in the original publications, whereas the mean age in our population was 67 years. However, even when comparing the scores among the 67 patients who had all parameters required for all scores, the HA2T2 score showed the best discriminative power (data not shown). Apart from the small sample size, further limitations in this comparison arise from the fact that the study populations also differed in their origins. The CALL and ANDC scores were based on Chinese patients, while the CHOSEN and HA2T2 scores were derived in US patients. Interestingly, the other currently available external validations of the CALL score in Italian and Turkish patients resulted in AUCs that were very similar to our own (original AUC for disease progression 0.91 vs. Italian AUC 0.62, Turkish AUC 0.59, our AUC 0.61) [14, 23]. Hence, it seems that compatibility and comparability of these scores across different populations cannot be assumed.

Further difficulties are rooted in the novelty of COVID-19. Much is still unknown about the disease, including which factors best predict progression or mortality. This is reflected in the very different factors included in the scores. Still, these more recent approaches are already an improvement over initial scores, which included up to 12 different items, making them difficult to use in a clinical setting [24]. In a busy environment such as the emergency department, ease of use is crucial. The scores discussed here all use no more than four variables that are relatively readily available in middle- to high-income countries. There is also a simplified version of the CHOSEN score that does not rely on laboratory values, but it did not perform as well in the original cohort [11].

All scores were significantly associated with mortality and their respective discriminative capacities were moderate to good but calibration was poor due to considerable population differences. Furthermore, the negative predictive value of the CALL score was particularly high and could thus help identify patients who are not at risk. The CHOSEN score, whose explicit aim was to differentiate between patients who needed hospitalisation and those who could be sent home safely, also had a high negative predictive value and, in addition, showed a relatively balanced relation between sensitivity and specificity, making it a potentially valuable tool for risk stratification. Since we did not include outpatients in our study, our results are likely to underestimate the true value of the CHOSEN score.

Limitations

There are certain limitations to our study. First, our findings are limited to hospitalised patients in a single centre in Switzerland, limiting generalisability. In addition, baseline parameters of our population were markedly different from the original study populations including ethnicity and important predictors such as age. Unfortunately, regression coefficients could not be updated based on the available data. Similarly, we could not calculate calibration for the CALL and CHOSEN score. Internal validity is also limited due to the retrospective design, which meant that a considerable proportion of patients had to be excluded from certain score cohorts because the required data were missing. Additional validation analyses should be conducted in larger data sets. Furthermore, troponin and d-dimer values (required for the HA2T2 and ANDC scores, respectively) were usually available for sicker patients who reached the primary and secondary endpoints more often, which not only limited study population sizes but also comparability between scores. Finally, we had to exclude four patients due to missing outcome data, thus increasing the risk for selection bias.

Conclusions

In our independent validation, the four analysed scores performed worse than in their original cohorts regarding prediction of mortality and disease progression. However, all scores were significantly associated with mortality. While the HA2T2 score identified high-risk patients, the negative predictive values of the CALL and CHOSEN scores allowed reliable identification of patients at low risk, which may make them suitable for outpatient management.