Climatological assessment
Over Europe, the ECMWF ENS model provides skillful forecasts of temperature with a low bias (not shown; Haiden et al. 2014). Using a concatenation of 2-week lead time forecasts, the correlation of the temperature quantiles of both \({\text {T}}_{\text {min}}\) and \({\text {T}}_{\text {max}}\) is above 0.65 (up to 0.78 over south-eastern Europe, see Fig. S1 in the supplementary material). Note that using the quantiles of temperature removes the seasonal variability and thus focuses on the high-frequency variability that is more difficult to predict. Only over Spain, Greece and the UK do the correlations fall below 0.6. The temporal variability at the yearly time scale (Fig. S2a in the supplementary material) shows small differences over the last 20 years. This is mainly due to the use of hindcasts that are based on the same version of the model over the entire period. Somewhat higher skill is found in 1998, 2002 and 2010, and lower correlations in 1997 and 2009. These differences could be related to more intense large-scale forcing or more stationary conditions during these years, which are more predictable than high-frequency variability. The same analysis (see Fig. S2b in the supplementary material) is done at the monthly scale to show how the correlation varies with the month.
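The quantile transform used above amounts to an empirical rank mapping; a minimal Python sketch (the function name and per-sample normalisation are our own illustration, not the paper's processing chain):

```python
import numpy as np

def to_quantiles(temps):
    """Empirical quantile (rank) transform: each temperature is replaced by
    its non-exceedance frequency in the sample. Applied per calendar day
    across the years, this removes the mean seasonal cycle and leaves the
    high-frequency variability on which the correlations are computed."""
    temps = np.asarray(temps, float)
    ranks = np.argsort(np.argsort(temps))   # 0-based rank of each value
    return (ranks + 1) / temps.size
```

Correlating such quantile series rather than raw temperatures prevents the shared seasonal cycle from inflating the scores.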
To analyze the predictability of the ensemble for relative extreme temperatures (defined as exceeding the quantile Q90 or falling below Q10), reliability diagrams for different lead times are presented for summer and winter in Fig. 2a. Generally, the ensemble model is overconfident and tends to converge too quickly toward the same solution (curves located under the diagonal line in the right part of the graph). By adding the skillful region between the skill lines (in grey in Fig. 2a, b; Toth et al. 2003; Wilks 2011), it is possible to analyse the reliability of the forecasts at each lead time. In summer, the reliability diagrams are skillful up to a 10-day lead time, and up to a 15-day lead time for the winter Q10. The tests are not designed to distinguish between two individual lead times; nevertheless, we can consider that there are two groups, one with skillful reliability and a second one that converges to the climatology. The figure highlights a certain limitation of the predictability when forecasting a single hot day, although HWs/CWs, associated with persistence and large-scale features, could be more predictable. It is worth noting that the system is slightly underconfident for low forecast probabilities. The ROC area scores for the same variables (Fig. 2c) agree with the reliability diagrams, showing an abrupt decrease of the score beyond a 15-day lead time. Two main characteristics appear: (1) there is a diurnal cycle of the score due to the better forecast of \({\text {T}}_{\text {max}}\) compared to \({\text {T}}_{\text {min}}\); (2) the ROC area skill score for cold waves is better than the one for heat waves, and appears skillful up to two weeks, consistent with Buizza et al. (1999).
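A reliability diagram compares binned forecast probabilities with the observed event frequency in each bin; a minimal sketch of how such points can be computed (the binning scheme and names are illustrative, not the paper's exact procedure):

```python
import numpy as np

def reliability_points(prob_fcst, obs_event, n_bins=10):
    """Bin forecast probabilities and return, per bin, the mean forecast
    probability and the observed relative frequency of the event.
    Points on the 1:1 diagonal indicate perfect reliability; points below
    the diagonal at high probabilities indicate overconfidence."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # digitize gives bins 1..n_bins; shift and clip so 1.0 lands in the last bin
    idx = np.clip(np.digitize(prob_fcst, edges) - 1, 0, n_bins - 1)
    mean_p, obs_freq = [], []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            mean_p.append(prob_fcst[mask].mean())
            obs_freq.append(obs_event[mask].mean())
    return np.array(mean_p), np.array(obs_freq)

# synthetic perfectly reliable forecasts lie on the diagonal
rng = np.random.default_rng(0)
p = rng.uniform(0, 1, 50000)
o = (rng.uniform(0, 1, 50000) < p).astype(float)
mp, of = reliability_points(p, o)
```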
For the next validation, a continuous 20-year time series of hindcasts is reconstructed by concatenating the 15-day lead time forecasts of \({\text {T}}_{\text {min}}\) and \({\text {T}}_{\text {max}}\) from runs separated by two weeks. The choice of using the first two weeks follows from the skill scores of the predictability of temperature quantiles above Q90 (or below Q10) shown previously. This time window is long enough to ensure a consistent detection of HWs/CWs by the forecast system and to limit the influence of abrupt changes or drifts of the model. Sensitivity tests are conducted to assess the influence of this climatological reconstruction strategy. To do so, two additional time series are built by merging, every week, either the first week of forecast (i.e., from 1- to 7-day lead time) or the second week of forecast (i.e., from 8- to 14-day lead time). The results, shown in the supplementary material (Figs. S3 and S4), indicate a decrease of HW/CW detection during the second week using identical methodologies (ensemble mean, ensemble median, same percentage). This is mainly explained by the increase in ensemble spread. Nevertheless, by using a methodology adapted to the lead time (e.g., changing the percentage threshold of ensemble members associated with extreme events), the results show that the model does not present an important drift during the first two weeks. The occurrences are comparable and the spatial variability is well represented. Because of this stability, and since the focus of this study is on long lead times, we use the entire 15-day lead time forecast to build the climatology. The main objective here is to compare the spatial and temporal climatologies and variabilities of the observations and the model. Indeed, it is important to evaluate the quality of the climatology used as a reference before assessing the predictability of events, which is done in the following subsection.
To detect HWs/CWs in the ensemble, ten methods are tested to transform the probabilistic forecast into a deterministic one. These methods are defined in Table 1. For each method, the climatological occurrences are compared to those obtained with the observations. The optimum forecasts (i.e., those with the lowest bias and standard error) for the climatology depend on the characteristics of the HW/CW analyzed. For example, to predict the occurrences, the optimum forecast flags an event when 50% of the members for HWs (40% for CWs) are associated with an extreme event. This bias can be minimized by choosing the best percentage threshold. Once the best methods are defined, it is interesting to note the good representation of the spatial distributions, with a clear gradient pointing toward north-eastern Europe for both HWs and CWs (Fig. 3). The main differences are located over the Alps for HWs (negative bias) and Poland for CWs (positive bias). The same tests were performed to define the optimum forecasts of HW/CW intensities, resulting in the median and the Q25 for HWs and CWs, respectively. To assess the forecasts, the strongest intensities recorded over each grid point were compared between the observations and the forecasts for both Idev and Ihum (Fig. 4). The correlations are better for Idev than for Ihum, and for HWs than for CWs. These scores show that the spatial variability is reasonably well represented, while also highlighting that wave intensities pose a specific assessment problem and need to be evaluated separately.
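The member-percentage conversion can be sketched as follows (Table 1 defines the actual ten methods; the array shape and names below are assumptions for illustration):

```python
import numpy as np

def ensemble_to_binary(member_events, pct_threshold):
    """Convert an (n_members, n_days) boolean array of per-member extreme-day
    detections into a deterministic daily forecast: a day is flagged when at
    least pct_threshold (as a fraction) of the members agree on the event."""
    frac = member_events.mean(axis=0)   # fraction of members flagging each day
    return frac >= pct_threshold

# toy ensemble of 10 members over 4 days
members = np.zeros((10, 4), dtype=bool)
members[:6, 0] = True   # 60% agreement on day 0
members[:4, 1] = True   # 40% on day 1
members[:5, 2] = True   # 50% on day 2
members[0, 3] = True    # 10% on day 3
hw_fcst = ensemble_to_binary(members, 0.5)   # 50% rule, as found best for HWs
```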
The inter-annual variability over the entire period is then analysed to assess the temporal stability of the forecast skill (Fig. 5). Note that, since 1995 and 2015 are not complete, these two years are removed from this analysis. The yearly mean occurrences of HWs/CWs (Fig. 5, left panels) reveal the low negative (positive) bias of the forecasted HWs (CWs). The inter-annual variability of HWs is well represented, except in 1998 and 2007, when the events are underestimated, and in 2014, when they are overestimated by the forecasts. In this figure, the widespread Russian event of 2010 is clearly visible as an increase in HW occurrence. In 2003, however, there is no signal related to the extreme event in France and Western Europe, mainly because the spatial extent of this HW was not exceptional with respect to the climatology. This event was defined as extreme because of its intensity (related to the deviation of the temperature anomalies) and the number of people exposed. The CW evolution is also well represented, despite a small overestimation at the end of the period that leads to an underestimation of the linear trend toward a decrease of CWs with time. The Pearson correlations of the temporal variability between the observations and the forecasts are significant and equal to 0.82 (0.88) for the HWs (CWs), with a 90% confidence interval of [0.72; 0.97].
The long-term trend is assessed using the yearly values, Loess and linear regressions. The linear trend provides a global overview, with an increase (decrease) of the HW (CW) occurrences in both observations and forecasts. Using the Mann-Kendall trend and Sen's slope tests, the positive trends of the HWs in summer are significant at a confidence level of 90% (Fig. 5, top left). The tests reveal no significant trends for the CWs (bottom left). The Loess regression, which reveals the low-frequency variations in more detail, highlights an increase of CWs from 2010 to 2013 that could limit the significance of the trends.
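For reference, the Mann-Kendall statistic and Sen's slope can be sketched as follows (a simplified version without tie correction; the paper's exact implementation may differ):

```python
import math
import numpy as np

def mann_kendall_sen(y):
    """Mann-Kendall trend test (no tie correction) and Sen's slope.
    Returns (S, z, sen_slope): S is the sum of signs of all pairwise
    differences, z its normal approximation (|z| > 1.645 roughly marks a
    significant trend at the 90% level), and sen_slope the median of all
    pairwise slopes (y[j] - y[i]) / (j - i)."""
    y = np.asarray(y, float)
    n = len(y)
    s = 0.0
    slopes = []
    for i in range(n - 1):
        for j in range(i + 1, n):
            s += np.sign(y[j] - y[i])
            slopes.append((y[j] - y[i]) / (j - i))
    var_s = n * (n - 1) * (2 * n + 5) / 18.0   # variance of S (no ties)
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    return s, z, float(np.median(slopes))
```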
The last climatological verification assesses the ability of the model to represent the interannual variability of the occurrence frequencies at the monthly time scale. This analysis evaluates the reliability of the model in representing occurrence anomalies and checks that there is no bias for each month along the hindcast period. Despite some errors for specific months, there is no systematic bias in the monthly occurrences, and the model correctly represents periods with both small and large occurrence frequencies (Fig. 5, right panels). This is true for both the HWs and CWs, with Pearson correlations equal to 0.79. Only September is associated with a non-significant temporal correlation, mainly due to the small interannual variability (black dots in Fig. 5, top right).
Predictability of HWs and CWs
Temporal and spatial distributions
To assess the predictability of the forecasted HWs and CWs, 10 methods to transform the probabilistic forecasts into dichotomous solutions are tested (see Table 1) to find the most reliable prediction. Figure 6 displays the POD, FAR and GSS scores as a function of lead time and method. It shows that HWs generally have shorter predictability than CWs. According to the positive values of GSS, the forecasts are skillful up to 15 days for the HWs and up to 21 days for the CWs. This finding is validated using two significance tests, one based on the standard error of the threat scores and the second based on the single population proportion test. Both indicate skillful positive values of GSS at a confidence level of 90%. Nevertheless, the two scores display a drastic decrease during the first week of forecast. The importance of the method is also clearly visible, with differences that reach 30% for the POD and FAR and 25% for the GSS depending on the method used. Based on these results, it is also possible to define the best method as a function of lead time. Considering only the first 10 days of lead time, the best method to forecast the detection of HWs (CWs) is based on 50% of the members (40%, respectively). For longer lead times, missed events increase drastically due to the increase in ensemble spread.
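POD, FAR and GSS are standard measures derived from the 2 × 2 contingency table of event forecasts against observations; a compact sketch (variable names are ours):

```python
import numpy as np

def categorical_scores(fcst, obs):
    """POD, FAR and GSS (Gilbert skill score, a.k.a. equitable threat score)
    from boolean forecast/observation series of equal length."""
    fcst = np.asarray(fcst, bool)
    obs = np.asarray(obs, bool)
    hits = np.sum(fcst & obs)
    misses = np.sum(~fcst & obs)
    false_al = np.sum(fcst & ~obs)
    total = fcst.size
    pod = hits / (hits + misses)            # probability of detection
    far = false_al / (hits + false_al)      # false alarm ratio
    hits_rand = (hits + misses) * (hits + false_al) / total  # chance hits
    gss = (hits - hits_rand) / (hits + misses + false_al - hits_rand)
    return pod, far, gss
```

A positive GSS means the forecast beats random chance, which is why the lead time where GSS drops to zero marks the predictability limit.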
The spatial variability of the GSS versus lead time is assessed in Fig. 7 using the most accurate method for each lead time (indicated at the top of each panel). From 1- to 5-day lead times, the spatial structures and intensities are close, with an increase of the score over eastern Europe and Russia for both HWs and CWs, and over a band from northern France to northern Poland for CWs. After a 7-day lead time, the GSS scores halve over almost the entire European continent. The maxima of predictability are located in the eastern part of the domain (with predominantly temperate continental climates) and along the coasts of the North and Baltic Seas. These positive values of GSS are also significant at a confidence level of 90% according to the test for comparing two proportions. At a 2-week lead time, skillful scores (i.e., positive GSS) are found only over northern France and Russia for the HWs, and over northern and central Europe (except Scandinavia) for the CWs. At a 3-week lead time, there is no better predictability than the climatology for HWs. For CWs, low and non-significant scores are found over central Europe. Finally, regarding the method used to forecast these events, we note a decrease in the percentage of members associated with the best forecasts, from 60% (70%) to 40% (40%) for the HWs (CWs, respectively).
Sensitivity of the temporal and spatial scales
Two sensitivity tests are conducted to measure the influence of the temporal and spatial scales of the HWs/CWs. The temporal sensitivity test considers different temporal resolutions, from 1 to 6 days. A period is defined (in the observations and forecasts) as affected by HWs/CWs if at least one day within the window is affected. This modifies the time series and decreases the number of time steps (from 32 time steps at the 1-day resolution to 7 time steps at the 6-day resolution). Once the observations and the hindcast signals are transformed, the same scores are applied. Results for the entire period, using the best forecasts, are provided in Fig. S5 in the supplementary material. According to these results, the influence of the time resolution is negligible, demonstrating that most of the forecast errors are not related to a shift in time of the predicted event but rather to misses or false alarms.
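The window aggregation described above can be sketched as follows (a hypothetical helper assuming a daily boolean event series):

```python
import numpy as np

def coarsen_time(daily_events, window):
    """Aggregate a daily boolean event series into non-overlapping windows:
    a window counts as affected if at least one of its days is affected.
    Trailing days that do not fill a complete window are dropped."""
    daily_events = np.asarray(daily_events, bool)
    n = (daily_events.size // window) * window
    return daily_events[:n].reshape(-1, window).any(axis=1)
```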
A second test is then conducted to evaluate the influence of the spatial scale of the HWs/CWs on their predictability. The original resolution of the forecasts and the observations is 1 square degree. Our algorithm to define events requires only a minimum duration of 3 days. This means that isolated, small-scale events (i.e., a single isolated grid point) can be detected, and these could be more difficult to forecast than large-scale events. To analyze the spatial sensitivity, four methods that keep the same resolution but smooth the signals are tested in order to distinguish small- and large-scale events. The boolean files (with values equal to 0 for normal conditions and 1 when HWs/CWs are detected) of each forecast are used as input to these methods. The first two methods use the surrounding grid cells with different matrix sizes (3 \(\times\) 3 and 5 \(\times\) 5, respectively). The central grid cell of the moving matrix is set to the matrix mean; the values thus indicate the percentage of the matrix affected by the HWs/CWs. A threshold (65%) then determines whether the central grid cell is considered affected. This choice means that roughly two thirds of the window must be affected, which allows large-scale events to be detected while keeping a robust number of events. Note that this method also allows detection over coastal regions, since the undefined values are not taken into account. With these two methods, the weight of each grid cell is equal to 1. The third method is a Gaussian smoothing applied with a 2-D convolution operator defined in a 5 \(\times\) 5 matrix. This method is similar to the mean filter, but it uses different weightings that represent the shape of a Gaussian hump. It better represents the large-scale features of the waves but may be too strict over coastal regions, since there is no compensation for undefined values. Finally, the last method is a Nagao–Matsuyama filter (Nagao 2000), which smooths using the sub-region of a moving window with the lowest spatial spread.
Nevertheless, this method, better suited to smoothing satellite images, produces spurious results (with HWs/CWs detected in the wrong places) and tends to minimize the number of cases. For these reasons, this method was discarded. The results of the first three methods (see Fig. S6 in the supplementary material) reveal a small but significant improvement of the forecasts when only the widest events are studied. The best results are found with the Gaussian filter, which tends to reduce the number of cases over the coastal regions. Nevertheless, for lead times longer than 10 days, the forecast skill does not depend on the spatial resolution.
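A minimal sketch of the first method (3 × 3 moving mean with the 65% threshold), with NaN marking undefined cells (e.g., sea points) excluded from the mean as described above (names and the loop-based implementation are illustrative only):

```python
import numpy as np

def large_scale_filter(event_map, size=3, frac=0.65):
    """Moving size x size mean of a 2-D boolean event map (NaN = undefined,
    excluded from the mean), thresholded at `frac`: the centre cell is kept
    only when at least frac of the valid neighbourhood is affected, which
    retains large-scale events and drops isolated single-cell events."""
    a = np.asarray(event_map, float)
    pad = size // 2
    padded = np.pad(a, pad, constant_values=np.nan)  # NaN border = undefined
    out = np.zeros(a.shape, dtype=bool)
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            win = padded[i:i + size, j:j + size]
            out[i, j] = np.nanmean(win) >= frac      # NaNs not counted
    return out
```

An isolated affected grid point is removed (its neighbourhood mean is 1/9), while the core of a contiguous patch survives.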
In conclusion of these sensitivity tests, the skill scores of the HW/CW forecasts are not sensitive to the temporal scale. Regarding the spatial resolution, the skill scores are significantly better when the HWs/CWs possess large-scale features, although this holds only for lead times shorter than 7 days. Therefore, the forecast errors are mainly due to misses or false alarms associated with forecast errors of the large-scale patterns, and not to uncertainties related to a spatial shift or a temporal delay.
Onset and end of HWs and CWs
This section evaluates the skill scores of the forecasts of the onset and end of the HWs and CWs. Indeed, due to the link between HWs/CWs and blocking situations (Matsueda 2011; Trigo et al. 2005), the predictability may be explained by the correct forecast of the duration of an existing event. Moreover, for human health and economic impacts, users and decision makers need to know these specific onset and end dates to trigger the proper mitigation measures. To assess these scores, the dates of onsets and ends are derived from the boolean event files. The starting (ending) date is defined as the first day with (without) a HW/CW detected after a non-event (event, respectively). Thanks to the concatenation of the three previous daily observations with the forecasts, mentioned earlier, the starting and ending dates can be detected from the first day of forecast.
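The date extraction can be sketched from the boolean series via first differences (a simplified illustration; an event already active at the very first index cannot be classified without the prepended observation days mentioned in the text):

```python
import numpy as np

def onset_end_dates(event_series):
    """Return indices of onsets (first event day after a non-event day) and
    ends (first non-event day after an event day) of a daily boolean series."""
    e = np.asarray(event_series, int)
    d = np.diff(e)
    onsets = np.where(d == 1)[0] + 1   # day the event starts
    ends = np.where(d == -1)[0] + 1    # first day after the event
    return onsets, ends
```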
The best methods to define forecasted onsets/ends from the ensemble are selected based on the GSS scores of the first two weeks (not shown) and are defined as follows: 20% (30%) of members for the HWs (CWs, respectively) to detect the onset, and 40% (30%) for the HWs (CWs, respectively) to forecast the end of the event. With these methods, the POD of the onsets reaches 70% at a 1-day lead time and decreases abruptly to 25% after a 5-day lead time. The time resolution (different line colors in Fig. 8) has a significant impact up to a 3-day window (POD larger than 80%); beyond that, the improvements are not significant. The conclusion is similar for the FAR, while the GSS confirms these behaviors, with large improvements in accuracy up to a 4-day window resolution. Finally, there is no significant advantage in using forecasts for the onset of HWs/CWs beyond a 2-week lead time, even at coarse temporal resolutions. For the end of the HWs/CWs (Fig. 8b), the scores are slightly better (except at the 1-day resolution) than for the onsets at the same lead time. This is due to the ability of the model to predict the durations of the HWs/CWs relatively well. The forecast skill for the end of the waves extends up to an 18-day lead time for HWs and a 21-day lead time for CWs.
Forecast skills of HW and CW intensities
The intensity of the HWs/CWs is one of the most important characteristics of the events to predict, since it is closely related to the impacts on human health, especially for the intensity calculation Ihum. To assess the forecast ability, an adapted method of predictability assessment is needed; it is described in Fig. 1. First, the ability to correctly predict the events is verified. This requires at least one day of overlap between the forecasted and observed events. In total, about 25% of all the observed cases are well forecasted and are used for this validation. Based on this approach, the intensities, onsets and durations are compared in Fig. 9. The delays in the onset and the durations are first compared (Fig. 9, first and second columns) to detect any bias. The study of the onset delay reveals that the model predicts it well. These scores are mainly explained by the short-term forecasts, and it is important to note that the peaks in the histograms represent about 9% of all the observed events for the HWs and about 12% for the CWs. No specific positive or negative bias is found in the delays. Finally, the evaluation reveals longer durations in the forecasts, highlighted by a positive duration bias for both HWs and CWs. As a consequence, the forecasted end of the waves is slightly delayed in both cases (not shown).
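The one-day-overlap matching criterion amounts to a simple interval intersection test (representing events as inclusive (start, end) day-index pairs is our assumption):

```python
def events_overlap(obs_event, fcst_event):
    """True when the forecasted event shares at least one day with the
    observed one; events are (start_day, end_day) inclusive index pairs."""
    (o0, o1), (f0, f1) = obs_event, fcst_event
    return max(o0, f0) <= min(o1, f1)
```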
Then, the observed and forecasted intensities are compared. The intensity of a forecasted event is calculated from the median value of the members. This forecasted intensity is first unbiased using a quantile-quantile matching method. The correlation scores are 0.61 (0.65) for the HWs (CWs, respectively), and the scatter plots in the right panels of Fig. 9 reveal relatively good agreement for intensities up to 5 and 10 for HWs and CWs, respectively, while stronger intensities could not be evaluated due to large uncertainties. Nevertheless, these cases represent fewer than 1000 out of the 160,000 observed cases. Because the study period is too short to apply robust statistics to such rare cases, waves with intensities higher than 5 are considered extreme events and no further distinction is made.
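The quantile-quantile matching step can be sketched with an empirical mapping between sorted climatologies (equal-length samples are assumed; the paper's exact implementation may differ):

```python
import numpy as np

def qq_unbias(fcst_values, fcst_clim, obs_clim):
    """Quantile-quantile matching: each forecast value is replaced by the
    observed value sharing its quantile in the respective climatological
    samples, removing systematic intensity bias. Assumes fcst_clim and
    obs_clim have the same length."""
    fc_sorted = np.sort(fcst_clim)
    ob_sorted = np.sort(obs_clim)
    # interpolate on matched sorted pairs: a value at quantile q of the
    # forecast climatology maps to quantile q of the observations
    return np.interp(fcst_values, fc_sorted, ob_sorted)
```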