Background

Mathematical and simulation models of infectious disease dynamics are essential for understanding and forecasting the development of epidemics [1]. The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) pandemic has drawn increased attention to epidemiological modeling, both as a method of scientific inquiry and as a tool to inform political decision making [2, 3].

Among epidemiological modeling methods, a distinction is frequently made between mechanistic and phenomenological approaches. Mechanistic approaches model transmission dynamics on the basis of substantive concepts from biology, virology, infectology, and related disciplines, whereas phenomenological (sometimes termed ‘statistical’) models seek a mathematical function that fits the observed data well without explicit assumptions about the underlying processes [1, 2]. Mechanistic models are usually used to compare possible scenarios and to estimate the relative effects of different interventions rather than to produce precise predictions. In contrast, phenomenological models are commonly optimized for forecasting. From a broader perspective, mechanistic and phenomenological approaches can be considered the epidemiological modeling representatives of the long-standing explanation-prediction controversy [4]. It should be noted that although the distinction between these two model classes is instructive and one side usually predominates, most approaches have both mechanistic and phenomenological components, and some are explicitly balanced (so-called ‘semi-mechanistic’ or ‘hybrid’ models).

Although the value of any predictive model is ultimately determined by whether it improves critical decision making [5, 6], a rigorous scientific appraisal should also include a comparison of what was predicted with what actually happened [1, 7, 8]. Unfortunately, the predictive accuracy of infectious disease prediction models is rarely evaluated during or after outbreaks [7, 8]. Notable exceptions include systematic evaluations of models of the epidemiology of severe acute respiratory syndrome (SARS) [9, 10], influenza [11, 12], Ebola [5, 7, 13, 14], dengue [8, 15], foot-and-mouth disease [6], and trachoma [16].

The SARS-CoV-2 pandemic has prompted a large number of epidemiological modeling efforts, including studies with primarily mechanistic (e.g., references [17,18,19,20,21]) and primarily phenomenological (e.g., references [22, 23]) approaches. To the author's knowledge as of October 2020, a truly prognostic evaluation of existing models against new data that were not yet available at the time of model development is not common practice. In order to start closing this gap, the objective of the present study was to evaluate the predictive accuracy of a phenomenologically oriented model, calibrated on data up to the end of March 2020, for forecasting the development of the cumulative number of reported SARS-CoV-2 cases in countries and administrative regions worldwide [24].

Methods

Data

As described in detail elsewhere [24], the model was fitted using information on the cumulative number of confirmed SARS-CoV-2 infections in the COVID-19 data repository of the Johns Hopkins University Center for Systems Science and Engineering [25, 26]. Cumulative case count data from 251 countries and administrative regions were used for calibrating the model, with daily time series from the day of the first reported case to 29 March 2020 in each region. For evaluation, data on confirmed cases were extracted from the same database two weeks, one month, and two months after model development (12 April, 29 April, and 29 May 2020). Sufficient information for creating predictions of the most likely number of cases in all investigated countries and administrative regions for any time horizon was made publicly available at the beginning of April 2020 [24].

Model

A hierarchical logistic model was fitted to the observed data [24]. The logistic part of the model was based on the ecological concept of self-limiting population growth [27] and used a formulation with five parameters [28], controlling the expected final case count at the end of the outbreak (parameter a), the maximum speed of reaching the expected final case count (parameter b), the approximate time point of the transition of the outbreak from an accelerating to a decelerating dynamic (parameter c), the case count at the beginning of the outbreak (parameter d), and the degree of asymmetry between the accelerating and decelerating phases of the outbreak (parameter g). The predicted cumulative case count in region i at day t after the first reported case was estimated as

$$ n_{pred,it} = d_i + \frac{a_i - d_i}{\left(1 + \left(\frac{t}{c_i}\right)^{b_i}\right)^{g_i}} $$

with log-normally distributed errors.
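
As a minimal illustration of this functional form (not the author's estimation code), the curve can be written as a small Python function; the parameter names follow the notation above and the example values are purely hypothetical. Note that, with this parameterization, a curve that rises from d toward a requires a negative exponent b.

```python
import numpy as np

def logistic5(t, a, b, c, d, g):
    """Five-parameter logistic curve for the expected cumulative case count.

    a: expected final case count, b: growth-speed exponent,
    c: approximate time of the transition from acceleration to deceleration,
    d: case count at the beginning of the outbreak,
    g: asymmetry between the accelerating and decelerating phases.
    """
    return d + (a - d) / (1.0 + (t / c) ** b) ** g

# Hypothetical example: expected cumulative counts over the first 100 days.
# b is negative so that the curve increases from d toward a.
t = np.arange(1, 101)
n_pred = logistic5(t, a=50_000, b=-5.0, c=40.0, d=1.0, g=1.0)
```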

The hierarchical part of the model was inspired by random-effects meta-analysis, assuming that the parameters of the logistic equation are similar, but not necessarily identical, across the investigated regions [29, 30]. This was implemented by restricting the parameters of the logistic equation to follow a normal distribution in the population of regions. With respect to interpretation, this means that the model was based on the hypothesis that the pandemic runs a similar course in all countries and regions, even though they are expected to differ to a certain degree regarding the number of cases in their first report, the expected final case count, the time point and speed of the accelerating and decelerating phases of the outbreak, as well as the time point, extent, and effects of control measures.
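
The following simulation sketch (again, not the original WinBUGS implementation) illustrates this hierarchical assumption: region-level logistic parameters are drawn from shared population-level normal distributions, and observations are generated with log-normally distributed errors. All hyperparameter values are hypothetical; in the fitted model they are estimated from the data.

```python
import numpy as np

rng = np.random.default_rng(1)
n_regions, n_days = 5, 60
t = np.arange(1, n_days + 1, dtype=float)

# Hypothetical population-level means and SDs for the logistic parameters (a, b, c, d, g)
pop = {"a": (20_000.0, 5_000.0), "b": (-5.0, 1.0), "c": (40.0, 8.0),
       "d": (3.0, 1.0), "g": (1.0, 0.2)}

simulated = []
for i in range(n_regions):
    # Region-level parameters drawn from the shared population-level normal distributions
    a, b, c, d, g = (rng.normal(m, s) for m, s in pop.values())
    mu = d + (a - d) / (1.0 + (t / c) ** b) ** g                              # expected cumulative counts
    y = np.exp(np.log(np.maximum(mu, 1.0)) + rng.normal(0.0, 0.1, n_days))   # log-normal error
    simulated.append(y)
```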

Estimation

The statistical procedures and program code are described in detail elsewhere [24]. Computations were performed in a Bayesian framework using Markov chain Monte Carlo sampling in WinBUGS version 1.4.3 [31]. Parameters were given uninformative priors, and results were obtained from three independent Markov chains with a total of 60,000 iterations and a thinning rate of 60, after discarding 40,000 burn-in iterations.

Evaluation metrics

For evaluating each individual prediction (for region i at time point t), four measures were calculated.

The difference between logarithmic predicted and observed counts (“error in logs”, EIL) was defined as

$$ EIL_{it} = \ln n_{pred,it} - \ln n_{obs,it}, $$

with ln being the natural logarithm, and npred and nobs being the predicted and the observed cumulative case counts, respectively.

The absolute error in logs (AEIL) was calculated as

$$ AEIL_{it} = \left| \ln n_{pred,it} - \ln n_{obs,it} \right|. $$

The percentage error (PE) was calculated as

$$ PE_{it} = \frac{n_{pred,it} - n_{obs,it}}{n_{obs,it}}, $$

and the absolute percentage error (APE) as

$$ APE_{it} = \frac{\left| n_{pred,it} - n_{obs,it} \right|}{n_{obs,it}}. $$
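
These four per-prediction measures follow directly from the definitions above; a minimal Python/numpy sketch is given below (array names are hypothetical, with one entry per region at a given time point).

```python
import numpy as np

def individual_errors(n_pred, n_obs):
    """EIL, AEIL, PE, and APE for predicted vs. observed cumulative case counts."""
    n_pred = np.asarray(n_pred, dtype=float)
    n_obs = np.asarray(n_obs, dtype=float)
    eil = np.log(n_pred) - np.log(n_obs)   # error in logs
    aeil = np.abs(eil)                     # absolute error in logs
    pe = (n_pred - n_obs) / n_obs          # percentage error
    ape = np.abs(pe)                       # absolute percentage error
    return {"EIL": eil, "AEIL": aeil, "PE": pe, "APE": ape}

# Hypothetical example with three regions
errors = individual_errors(n_pred=[120.0, 950.0, 14_000.0], n_obs=[100.0, 1_200.0, 13_500.0])
```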

Summary estimates of predictive accuracy across all k regions at a given time point t were defined as follows.

The root mean squared error in logs (RMSE) was defined as

$$ RMSE_t = \sqrt{\frac{1}{k}\sum \limits_{i=1}^{k} \left(\ln n_{pred,it} - \ln n_{obs,it}\right)^2}, $$

and the mean absolute percentage error (MAPE) was calculated as

$$ MAPE_t = \frac{1}{k}\sum \limits_{i=1}^{k} \frac{\left| n_{pred,it} - n_{obs,it} \right|}{n_{obs,it}}. $$

The coefficient of determination (R²) was additionally determined at each time point t from a linear model regressing the logarithmic observed values on the logarithmic predictions with the intercept fixed at zero. Furthermore, the intraclass correlation coefficient ICC(3,1) was calculated at each time point to quantify the level of absolute agreement between predicted and observed values based on a two-way mixed-effects model [32]. Bootstrapping with 1000 samples was used to construct 95% confidence intervals for the summary estimates of predictive accuracy.
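
A sketch of these summary metrics in Python/numpy is given below, under stated assumptions: RMSE and MAPE follow the formulas above, the through-origin R² uses the uncentered sum of squares (the usual convention when the intercept is fixed at zero), and a percentile bootstrap with 1000 resamples is shown for the MAPE as an example. The ICC(3,1) is omitted here; dedicated packages (e.g., pingouin in Python) provide it.

```python
import numpy as np

def summary_accuracy(n_pred, n_obs, n_boot=1000, seed=0):
    """RMSE in logs, MAPE, through-origin R^2 on logs, and a bootstrap 95% CI for MAPE."""
    n_pred = np.asarray(n_pred, dtype=float)
    n_obs = np.asarray(n_obs, dtype=float)
    log_p, log_o = np.log(n_pred), np.log(n_obs)
    k = n_obs.size

    rmse = np.sqrt(np.mean((log_p - log_o) ** 2))     # root mean squared error in logs
    mape = np.mean(np.abs(n_pred - n_obs) / n_obs)    # mean absolute percentage error

    # R^2 from regressing log observed on log predicted with the intercept fixed at zero
    slope = np.sum(log_p * log_o) / np.sum(log_p ** 2)
    ss_res = np.sum((log_o - slope * log_p) ** 2)
    r2 = 1.0 - ss_res / np.sum(log_o ** 2)             # uncentered (no-intercept) convention

    # Percentile bootstrap over regions, illustrated for the MAPE
    rng = np.random.default_rng(seed)
    boot = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, k, k)
        boot[b] = np.mean(np.abs(n_pred[idx] - n_obs[idx]) / n_obs[idx])
    mape_ci = tuple(np.percentile(boot, [2.5, 97.5]))

    return {"RMSE": rmse, "MAPE": mape, "R2": r2, "MAPE_95CI": mape_ci}
```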

Factors associated with accuracy

In order to identify factors associated with the accuracy of the predictions, the AEIL was regressed on the number of available data points, the difference between the logarithms of the first and the last case count at the time of estimation (as a proxy for the progress of the epidemic), and their interaction term. Estimates are reported with 95% parametric confidence intervals.
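
A minimal sketch of this regression, assuming pandas and statsmodels and using hypothetical per-region values, could look as follows; the `*` in the formula expands to both main effects plus their interaction.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: one row per region
# aeil       - absolute error in logs at a given forecast horizon
# n_points   - number of available data points at estimation
# log_growth - difference of log case counts between first report and estimation
df = pd.DataFrame({
    "aeil":       [0.2, 1.5, 0.4, 2.1, 0.1, 0.9],
    "n_points":   [60, 10, 45, 8, 68, 25],
    "log_growth": [9.1, 2.3, 7.5, 1.8, 10.2, 4.4],
})

model = smf.ols("aeil ~ n_points * log_growth", data=df).fit()
print(model.params)                # coefficients, including the interaction term
print(model.conf_int(alpha=0.05))  # 95% parametric confidence intervals
```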

Furthermore, strongly affected regions (defined by a minimum of 10,000 cases at the forecasted time point) with the most extreme under- and overestimation were identified to gain additional qualitative insights into model performance.

Results

Data

In 251 regions, the number of available data points at estimation ranged from 2 to 68 with a median of 25 and a mean of 31.48 days. The cumulative number of reported cases at the point of the first non-zero count ranged from 1 to 444 with a median of 1 and a mean of 4.09 across regions. The cumulative number of reported cases at model estimation (29 March 2020) ranged from 1 to 140,886 with a median of 139 and a mean of 2869.

Individual estimates of predictive accuracy

The probability density function of the percentage error (PE) at the day of estimation as well as at the forecasts after two weeks, one month, and two months, respectively, is displayed in Fig. 1. At the day of estimation, the median relative error indicated an underestimation of the cumulative case count by about one third across regions. The relative error distribution was rather narrow, with only a tenth of predictions showing an underestimation exceeding − 62.8% and none of the predictions having an error above 36.9%. Across forecasts, the median percentage error was always less than 20%, although an overestimation by more than two hundred percent was observed in 7.2, 19.1, and 19.5% of the cases at the two weeks, one month, and two months forecasts, respectively. The proportion of regions with an underestimation exceeding minus two thirds (− 66.6%) was 12.4, 19.5, and 28.7% at the two weeks, one month, and two months forecasts, respectively.

Fig. 1 Probability density function of the percentage error at different forecast horizons. The solid line shows the median, the dashed lines show the first and third quartiles, and the dotted lines show the first and ninth deciles. The x-axis is trimmed at 2.5

The calibration plots suggest that the number of regions for which case counts are substantially under- or overestimated increases with the length of the forecast period (Fig. 2). Nevertheless, a strong positive association between predicted and observed case counts is apparent even after two months.

Fig. 2 Calibration plots at different forecast horizons. Points refer to regions. The solid black line indicates no prediction error, the blue area indicates a prediction error by a factor of two or less, and the green area indicates a prediction error by a factor of ten or less. Both axes are log-transformed

Summary estimates of predictive accuracy

All summary metrics show an increasing amount of error with increasing length of the forecast period (Table 1). The MAPE shows that, on average, estimates are off by more than one hundred, two hundred, and four hundred percent at the two weeks, one month, and two months forecasts, respectively. The coefficient of determination indicates a very strong relative association between predicted and observed case counts, and the intraclass correlation coefficient suggests that the level of absolute agreement is excellent after two weeks and still high after one month, but drops to a moderate level after two months.

Table 1 Summary estimates of predictive accuracy

Factors associated with accuracy

Visual analysis suggests that a larger number of available data points at estimation (Fig. 3) and a more extensive growth of the logarithmic case counts from the first reported case until estimation (Fig. 4) are associated with a lower prediction error. This is confirmed by regression analyses indicating statistically significant associations that become stronger with increasing forecast horizon (Table 2). The two factors also have a multiplicative effect, as indicated by the statistically significant interaction term.

Fig. 3 Association of the amount of available data at estimation and predictive accuracy (AEIL) at different forecast horizons. AEIL = absolute difference between logarithmic predicted and observed case counts. Points refer to regions. The grey line corresponds to a linear smoothing curve

Fig. 4 Association of growth in logarithmic case counts until estimation and predictive accuracy (AEIL) at different forecast horizons. AEIL = absolute difference between logarithmic predicted and observed case counts. Points refer to regions. The grey line corresponds to a linear smoothing curve

Table 2 Linear regression coefficients for factors associated with prediction accuracy (AEIL)

Strongly affected regions (a minimum of 10,000 cases) with extreme under- or overestimation of the cumulative case counts are presented in Table 3. Among the listed regions, the extent of underestimation was considerable (an EIL below − 1.6, roughly corresponding to an underestimation by a factor of five) at the one and two months forecasts, with most regions being located in Asia. Among strongly affected regions, overestimation was rather moderate (an EIL below 0.7, roughly corresponding to an overestimation by a factor of two) in most cases. Substantial overestimation (an EIL between 0.7 and 1.6) was present in Austria and Switzerland at the one and two months forecasts and in the United States at the one month forecast. No strongly affected region with a considerable overestimation (EIL above 1.6) was identified.

Table 3 Most extreme under- or overestimation for regions with a minimum number of 10,000 cases

Discussion

In the present study, a hierarchical logistic model was used to predict cumulative counts of confirmed SARS-CoV-2 cases in 251 countries and administrative regions with two weeks, one month, and two months forecasting horizons in the early phase of the pandemic. Several metrics were used to evaluate the predictions visually and statistically. In summary, case counts could be predicted in the majority of the regions with surprising accuracy. Despite the fact that at the time of estimation (29 March 2020) only about one month's data were available on average in each region, and that most regions were at the very beginning of the epidemic, a massive difference between forecast and observation was the exception rather than the rule. Summary metrics of predictive accuracy suggested very strong prognostic validity of the model for a horizon of two weeks, substantial accuracy after one month, and still notable, although markedly lower, accuracy after two months. This is in good agreement with studies finding that the horizon for reasonable epidemiological predictions covers a few weeks at most [7, 15].

Although most predictions were fairly accurate, some were still considerably off. These were most likely to be found in regions with fewer available data points at the date of estimation and/or with more limited growth between the date of the first case and the date of estimation. In general, underestimation seems to be somewhat more pronounced than overestimation, particularly in strongly affected regions (i.e., those with cumulative case counts above 10,000 at the point of validation). The strongly affected regions for which the model's predictions were too low included several countries in which mitigation strategies might have been less effective than in other regions, as suggested by cumulative case growth curves that were decelerating only slowly or not at all at the beginning of June 2020 (e.g., India, Bangladesh, Qatar). On the other hand, the strongly affected regions with a substantial overestimation of cumulative case counts are characterized by an extremely successful mitigation of the initial phase of the epidemic (mainly Austria and Switzerland). Hence, prediction errors are likely to be closely related to one of the central assumptions of the model, namely that the timing, extent, and effectiveness of control measures are comparable across regions. Obviously, forecasts based on the presented model are likely to reach their limits in regions that deviate too strongly from the average case. As shrinking individual estimates towards the group mean is a statistical property of hierarchical models [33], extreme cases are likely to fall outside the scope of validity of the presented approach. As the variation in epidemic trajectories among regions is likely to increase with time, the similarity assumption is expected to become more and more problematic as an epidemic progresses. In consequence, generalizing the presented findings beyond the initial phase of epidemics is not warranted.

A notable feature of the model is that it provides predictions without any reference to the measures taken to control the epidemic. This “ignorance” of interventions, paired with fairly accurate predictions, may be misinterpreted as evidence that the mitigation and containment measures implemented in most countries were dispensable. However, it is far more likely that the key model assumption, namely that the course of the epidemic and the control measures taken were similar across regions in the early phase of the epidemic, holds to a substantial extent. In cases where it does not, model performance is very poor, as discussed above. Bringing these issues together, the hierarchical structure of the model appears to have both benefits and risks: sufficiently accurate predictions for a large number of regions even at a very early stage of the epidemic come at the price of considerably erroneous predictions for atypical regions. Consequently, if the aim is to generate locally applicable predictions for a particular region, forecasts may be improved by relying more heavily on data from comparable regions than on data from rather dissimilar regions [34].

The presented evaluation study has several limitations. First, the case counts were not standardized in any form. Expressing them as cumulative incidence rates (e.g., per 100,000 persons) is likely to have increased homogeneity across regions and enhanced interpretability. As shown in an analysis of the development of the SARS-CoV-2 epidemic in German federal states, such standardization rendered the log-transformation of case counts for homogenization superfluous [35]. Second, the uncertainty of the predictions was not considered in the present study, although measures of uncertainty, such as reliability and sharpness, can be just as important for forecasting as bias [7]. Third, predictions were analyzed only at selected time points, and it cannot be excluded that choosing other time points would have led to different results. Nevertheless, the general pattern of findings is unlikely to have changed substantially.

The forecasting model itself has some weaknesses as well [24, 35]. Most importantly, it models the reported rather than the true number of cases and can therefore be subject to different forms of testing and reporting bias. Considerable improvement regarding this point can realistically be expected only once regional findings from well-conducted epidemiological studies become available. Second, using cumulative rather than new case counts for modeling can lead to serious errors [36]. Another major limitation of the model is that it works only as long as the conditions of the epidemic remain largely unchanged in each region, i.e., within a single epidemic wave with fairly constant testing and reporting practices and without serious disruptions. This issue could perhaps be addressed by using dynamic (time-dependent) rather than fixed (time-invariant) model parameters [37]. Finally, the primarily phenomenological nature of the model calls for caution in interpretation [38,39,40] and for integration with mechanistic components, in order to create a hybrid approach capable of producing widely generalizable conclusions [41].

Conclusions

As stated by one of the most prominent epidemiologists of the SARS-CoV-2 pandemic, Neil Ferguson, models are “not crystal balls” [3]. However, without rigorous scientific evaluation, they run the risk of becoming just that, characterized not by correct predictions but by obscurity. Some state that epidemiological forecasting is “more challenging than weather forecasting” [42], and the complexity of the models and their reliance on assumptions make it difficult to assess their trustworthiness based solely on their inherent structure. Just as we trust weather forecasts that prove to be accurate by experience, empirical comparison of model predictions with actual observations should become an essential step of epidemiological model evaluation.