Online detection and quantification of epidemics
- 7.6k Downloads
Time series data are increasingly available in health care, especially for the purpose of disease surveillance. The analysis of such data has long used periodic regression models to detect outbreaks and estimate epidemic burdens. However, implementation of the method may be difficult due to lack of statistical expertise. No dedicated tool is available to perform and guide analyses.
We developed an online computer application allowing analysis of epidemiologic time series. The system is available online at http://www.u707.jussieu.fr/periodic_regression/. The data is assumed to consist of a periodic baseline level and irregularly occurring epidemics. The program allows estimating the periodic baseline level and associated upper forecast limit. The latter defines a threshold for epidemic detection. The burden of an epidemic is defined as the cumulated signal in excess of the baseline estimate. The user is guided through the necessary choices for analysis. We illustrate the usage of the online epidemic analysis tool with two examples: the retrospective detection and quantification of excess pneumonia and influenza (P&I) mortality, and the prospective surveillance of gastrointestinal disease (diarrhoea).
The online application allows easy detection of special events in an epidemiologic time series and quantification of excess mortality/morbidity as a change from baseline. It should be a valuable tool for field and public health practitioners.
KeywordsInfluenza Training Period Baseline Model Public Health Surveillance Public Health Practitioner
The generalization of electronic data capture in health care has made time series data increasingly available for public health surveillance . How to best analyse these data will likely be case dependent and require expert statistical advice. There is however a well agreed "good analysis practice" in particular classes of surveillance problems, so that less expert users may consider undertaking the analysis themselves. This requires making software available online and providing guidance on its use: this is exactly what was done with online tools for DNA sequences alignment (BLAST, FASTA), allowing biologists to successfully use these methods on their own data.
Here, we focus on epidemic detection and quantification from time series data. There is a widely used approach for this purpose originating from Serfling's work on influenza . He proposed calculating excess P&I mortality due to seasonal influenza using deviations from a periodic regression model that captured the annual seasonality of the data. It was first necessary to (subjectively) select years without excess death to train the baseline regression model. The approach has then been extended to address several issues: refining regression equations and extracting baseline model information without subjective filtering of the data [3, 4, 5]. Algorithms for prospective outbreak detection were also proposed in this framework [6, 7, 8].
In this paper, we describe an online tool allowing users to detect unexpected events, eg outbreaks, in a seasonal epidemiologic time series. Two applications are detailed to illustrate how results are obtained.
Required inputs from the user for baseline model fitting
Length of the training period
Number of years, number of observations
Retrospective : All data
Prospective : Half dataset
Purge of the training period
Data above selected percentile, above cut-off value, or in user-defined periods
Above the 15% highest percentile
Linear, quadratic or cubic terms. Annual, semi-annual or quarterly periodicity
Automated model selection
Upper forecast limit (UFL)
Percentile between 50% and 100%
Minimum duration above UFL defining an unexpected change
Number of observations
14 days/2 weeks/1 month
Even if long time series are available, it is not generally the case that all data should be included in the training period . Indeed, changes in case reporting and demographics will likely be present over long time periods, and this may affect how well the baseline model fits the data. Modelling of influenza mortality typically uses the five preceding years in baseline determination [2, 10, 11]. Including more past seasons improves the seasonal components estimates, while limiting the quantity of data allows capturing recent trends. In our system, we propose using the whole dataset in the model fitting for retrospective analysis (as done, for example, in [12, 13]), and to limit to a past few years in the case of prospective detection of epidemics (as, for example, in [7, 8]). In the latter case, the user is invited to specify the length of the training data in an input field. He can define it in number of years or in number of observations. In either case, the minimal time span accepted is one year.
Purge of the training period
In order to model the non-epidemic baseline level, the model must be fitted on non-epidemic data. For seasonal diseases such as influenza in the Northern hemisphere, it is difficult to find long epidemic-free periods since epidemics typically occur every year. There are two choices to deal with the presence of epidemics in the training data: excluding the corresponding data from the series, or explicitly modelling the epidemics.
In the first choice, epidemics must first be identified. Several rules have been suggested in this respect. Viboud et al. excluded the 25% highest values from the training period . Costagliola et al. removed all data above a given threshold (more than three influenza-like illness cases per sentinel general practitioner) . Olson et al. excluded the months with "reported increased respiratory disease activity or a major mortality event" . Others deleted entire periods: e.g. December to April , or September to mid-April .
The second choice, less common, requires explicit modelling of the epidemic periods during the training data. In this case, an epidemic indicator must be included as a covariable in the model. For influenza epidemic, one may choose the number of laboratory influenza A and B isolates [5, 16]. However, the availability of an independent epidemic indicator is uncommon in practice.
In summary, data points may be excluded either because they exceed a (possibly data determined) threshold, because they were collected during a period known to be epidemic prone (for example winters), or because the user wishes to exclude the points. These three options are available in our system.
A variety of formulations may be used for the regression equation, including linear regression , linear regression on the log-transformed series , Poisson regression , and Poisson regression allowing for over-dispersion . Linear regression is suitable when working with large frequencies or incidences, while working with the log transformed series or applying Poisson regression is advised when observations are small in magnitude.
In the regression equation, the trend is generally modelled using a linear term [2, 4, 11], or a second degree polynomial [3, 7, 19]. In our application we propose these two trends plus the third degree polynomial, to offer more flexibility. When the model is used for prospective detection of epidemics, it is often safer to use only a linear trend to avoid inconsistencies when the model will be extrapolated into the future. Thus, the application restrains the user's choice to the models that have linear trend. For retrospective analysis, where extrapolation is not an issue, more complex trends may improve the fit of the baseline model. So, the application allows the user to choose among all the proposed models with linear, quadratic and cubic trends. For the seasonal component, a simple yet effective description may be obtained using sine and cosine terms with period one year . Refined models are found in the literature, often with terms of period 6 months , sometimes 3 months , and, rarely, smaller . In our application, we chose to propose the most widely used periodicities, ie 12, 6 and 3 months. As a result, all regression equations for the observed value Y(t) are special cases of the following model: Y(t) = α0 + α1 t + α2 t2 + α3 t3 + γ1 cos(2πt/n) + δ1 sin(2πt/n) +γ2 cos(4πt/n) + δ2 sin(4πt/n) + γ3 cos(8πt/n) + δ3 sin(8πt/n) + ε(t). For prospective modelling, α2 and α3 are always 0. Model coefficients are estimated by least squares regression.
Retrospective evaluation of the excess P&I mortality in France for 1968–1999, using nine periodic regression models. The components included in each model are indicated by a*. #Model options: exclusion of the top 15% percentile from the training period; forecast interval: 95%
Cumulated excess mortality over the whole period
As the baseline model is fitted to the observations, the variation around the model fit may be estimated by the standard deviation of the residuals (difference between observed and model value). It is therefore possible to calculate forecast intervals for future observations, assuming that the baseline model holds in the future. Thresholds signalling an unexpected change are typically obtained by taking an upper percentile for the prediction distribution (assumed to be normal), typically the upper 95th percentile , or upper 90th percentile . A rule is then used to define when epidemic alerts are produced: for example as soon as an observation exceeds the threshold , or if a series of observations fall above the threshold, for example during 2 weeks , or 1 month .
Users may input their own dataset (eg incidences, mortalities, medication sales) as a plain text file (ie ASCII file) containing the time series as a single column, i.e. the values are separated by a carriage return. Observations must be aggregated by day, week or month. The user will be invited to specify this time step in a scrolling list. Missing values are allowed, provided they are coded by "NA". It is assumed that the dataset will contain at least one year of data. Several example datasets from France are included in the system: incidence rates per 100,000 population for influenza-like illness and diarrhoea for 1991–2001, and P&I mortality series for 1968–1999 . They are available as daily, weekly or monthly time series.
Retrospective analysis of influenza epidemics
The first example uses monthly P&I mortality in France over the period 1968–1999. The user wishes to retrospectively identify the epidemic periods and quantify the cumulated mortality in these epidemics. Use of the system begins with selecting the corresponding dataset on the main page.
After data input, the user is taken through three successive webpages to specify the baseline model parameters (Table 1). The first page allows choosing the type of analysis. Here, the user selected to conduct retrospective analysis, therefore the whole time series is included in the training period.
The third page allows the user to select the mathematical form for the baseline model. This page is dependent on the type of analysis, prospective or retrospective. For the retrospective analysis, nine models are available, combining the three choices for the trend and periodicity (see Table 2). Using the automated selection feature, the model with cubic trend and annual periodicity is chosen for baseline P&I mortality. Figure 1 presents the detail of the selection algorithm.
In the third page, the user defines the epidemic threshold by selecting a percentile of the prediction distribution, between 50% and 100%. Here, default value (95%) was selected. Increasing this value will lead to less observations outside the thresholds and more specific detection. On the contrary, decreasing the threshold will increase sensitivity and timeliness of the alerts.
To avoid making alerts for isolated data points, a minimum duration above the threshold may be required. Default values are 14 days (2 weeks) for daily and weekly data, and 1 month if the data are monthly-aggregated. The beginning of the epidemic is the first date the observations exceed the threshold, and the end the first time observations return below the threshold. Here, the default value (1 month) was selected.
Prospective surveillance of gastrointestinal diseases
In this analysis, the user wishes to define epidemic thresholds for prospective monitoring of diarrhoea. We briefly summarize the differences between this analysis and the retrospective case. As above, a time series must be first provided (here we selected diarrhoea with weekly observations). After choosing "prospective analysis", the user must select the duration of the training period, typically a few years. We select five years for the training data. Data exclusion before fitting the baseline model is carried out as in the first example. The regression equation is limited to a linear trend, but all three periodicities are available. Here, the automated selection leads to a model with a linear trend, and annual, semi-annual and quarterly periodic terms.
Alert thresholds are defined by selecting a percentile of the prediction distribution, between 50% and 100%. A typical choice is 95%. The application then generates a plot showing the whole data, the baseline and threshold values over the training period and model extrapolation for the following year (Figure 3b). An output table contains the expected baseline and threshold values for each date in the dataset and the following year.
We have presented an online application for analysing epidemiologic surveillance time series. The program can be used to extract the dates and size of past epidemics, or to establish epidemic thresholds for prospective surveillance. We intend this application to be a practical tool for field and public-health practitioners. We designed a user-friendly interface that provides default-values options and interactive graphical feedback. Since all the parameters can be changed by the user, the program provides an easy way to check how the analysis changes with different choices.
The epidemiologic time series most suitable for analysis are those where the monitored signal consists of a seasonal background with outbreaks. This is clearly the case for influenza surveillance data. Influenza-like syndromes occur at all times of the year, although typically more in the winter than the summer, even when no influenza viral strain is circulating. Viral testing is considered the gold standard method to provide the real number of influenza-affected patients but since this test is not part of routine diagnoses, morbidity and mortality in a population can not be specifically attributed to influenza. One way to estimate the impact of influenza in a population from surveillance data including surveillance of influenza-like syndromes, pneumonia or influenza associated admissions, or cause-specific mortality, is to use statistical methods such as periodic regression. This hypothesis also holds for other infectious diseases, for example gastroenteritis where syndromes under surveillance (diarrhoea, fever) can be due to various pathogens which are more active in some seasons than others. Alternative detection methods exist that do not rely on the hypothesis of a seasonal baseline. For instance, Hidden Markov Models assume that the observations are generated from a finite mixture of distributions governed by an underlying Markov chain [25, 26]. These methods have shown good aptitude in distinguishing epidemic and non epidemic phases in seasonal and non-seasonal time series. Another alternative is control-chart methods, which may be calibrated on data from recent months rather from previous years .
A minimum of one year historical data is required to fit the models discussed here, but we note that more reliable predictions require at least two or three year historical data to calculate the baseline level. Other methods have been developed for disease surveillance with limited historical data sets [27, 28]. We also recommend, for the prospective setting, to make sure that the one year long predictions begin outside the epidemic season, in order to highlight the incoming epidemic in its entirety. While first and second degree polynomial trends are frequently used in periodic regression models in the literature [2, 3], we have added the option of a third degree polynomial to offer more flexibility, only for the retrospective analysis. For the seasonal components, we included the most widely used periodicities, ie 12, 6 and 3 months. We did not propose higher degree polynomials or seasonal terms because higher order terms may be more prone to result in unidentifiable models or other problems with model fit.
The application is based on a general periodic regression model that contains most previous published models as special cases. Yet, we did not implement some specialised models encountered in the literature. For example, some authors modelled the secular trend with a smoothing spline fitted on summer months [12, 29]. Others included autoregressive terms in their models [5, 30, 31]. Additional variables may also be incorporated into the regression model, for example day of the week, holiday, and post-holiday effects , sex and age , or temperature and humidity . A few authors replaced the epidemic values in the training period by expected non-epidemic values, rather than deleting them [10, 33]. We have not included these options in the application for reasons of parsimony. One of the most important features of an online tool such as the one presented here is that it should allow inferences to be made by front-line practitioners who often do not have detailed knowledge of statistical software. We have attempted to balance the desire to provide a user-friendly interface while at the same time offering sufficient options to cover the needs of most surveillance datasets.
The online application presented here should be a valuable tool for public health surveillance. Its user-friendly interface facilitates fairly complex modelling, offering public health practitioners the possibility to rapidly investigate the burden of epidemics, or to utilise the same statistical approaches to set epidemic thresholds for prospective surveillance.
Availability and requirements
Project name: Periodic regression models
Project home page: http://www.u707.jussieu.fr/periodic_regression/
Operating systems: Web based application
This work was supported by the EU Sixth Framework Programme for research for policy support (contract SP22-CT-2004-511066). The funding source had no involvement in the work process.
- 8.Tsui FC, Wagner MM, Dato V, Chang CC: Value of ICD-9 coded chief complaints for detection of epidemics. Proc AMIA Symp. 2001, 711-715.Google Scholar
- 9.Sebastiani P, Mandl K: Biosurveillance and Outbreak Detection. Data Mining: Next Generation Challenges and Future Directions. Edited by: Press. MIT. 2004, 185-198.Google Scholar
- 20.Burnham KP, Anderson D: Model Selection and Multi-Model Inference. 2003, Springer, 3rdGoogle Scholar
- 22.R: A language and environment for statistical computing. [http://www.R-project.org]
- 23.An online tool for detecting and measuring epidemics in time series data. [http://www.u707.jussieu.fr/periodic_regression/]
- 30.Ozonoff A, Forsberg L, Bonetti M, Pagano M: Bivariate method for spatio-temporal syndromic surveillance. MMWR Morb Mortal Wkly Rep. 2004, 53 Suppl: 59-66.Google Scholar
- 33.Grigoryan VV, Wagner MM, Waller K, Wallstrom GL, Hogan WR: The Effect of Spatial Granularity of Data on Reference Dates for Influenza Outbreaks. RODS Laboratory Technical Report. 2005Google Scholar
- The pre-publication history for this paper can be accessed here:http://www.biomedcentral.com/1472-6947/7/29/prepub
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.