Human-induced changes in the climate system are already detectable on daily-to-decadal timescales (Santer et al. 2018; Sippel et al. 2020). Changing weather patterns (Rosenzweig et al. 2008), expanding dryland regions (Huang et al. 2016) and the dramatic reduction of Arctic sea ice (Screen and Simmonds 2010; Dai et al. 2019), are just a few from the vast set of climatic changes attributed to rising greenhouse gas concentrations. Besides global impacts, alterations in regional scale climate patterns are also observable. Specifically, East-Central Europe is believed to become more susceptible to the incidence of climate extremities in response to increased radiative forcing (Seneviratne et al. 2006; Bartholy and Pongrácz 2007; Beniston et al. 2007; Hirschi et al. 2010). Nevertheless, internal variability—an inherent feature of chaotic climate dynamics (Zeng et al. 1993)—is also known to have substantially contributed to, or masked the effects of anthropogenically forced climate change (Ding et al. 2014, 2019; Swart et al. 2015; Baxter et al. 2019; Topál et al. 2020), although conclusions on how internal variability might behave under future warming scenarios are still controversial (Deser et al. 2012, 2020; Dai et al. 2015; Haszpra and Herein 2019; Haszpra et al. 2020a,b).

Global climate models (GCMs) are elaborate tools for simulating past, present, and future climatic and environmental changes on various timescales; however, any projection is riddled with three commonly mentioned uncertainties: scenario uncertainty, model uncertainty, and internal variability (Stainforth et al. 2007; Knutti 2008; Hawkins and Sutton 2009). Thus, coordinated modeling experiments have launched to address these uncertainties. The Coupled Model Intercomparison Project Phase 5 (CMIP5, Taylor et al. 2012b) collected a number of GCMs with differing model physics, manifested chiefly in the different parametrization schemes applied in them. This dissimilarity tends to lead to structural differences between the models and thought to account for most of the uncertainty regarding their performance (Knutti 2008; Knutti and Sedláček 2013; Harrison et al. 2015), while the choice of the external forcing scenario plays a subtle role (Reichler and Kim 2008). One practice to acknowledge GCM limitations is to use a multi-model ensemble (Suh et al. 2012; L'Heureux et al. 2017) and consider each GCM with equal weight. However, to abandon “model democracy” and weight or give preference to certain models in a multi-model ensemble based on performance, ranking was also proposed (Knutti 2010; Merrifield et al. 2019). The latter has proved effective in constraining model uncertainty (Knutti et al. 2017) and is of particular importance when studying climatic variables (e.g., precipitation) whose future projections show large spread between different models (Garfinkel et al. 2020).

Action taken in this direction introduced the application of diverse model ranking methodologies, ranging from studies using correlation, root-mean-square error, and variance ratio (Boer and Lambert 2001; Gleckler et al. 2008) to the application of prediction indices (Murphy et al. 2004) or to those taking a Bayesian approach (Min and Hense 2006). In addition, the concern of interdependency of CMIP models (Sanderson et al. 2015) has been re-evaluated recently (Olson et al. 2019). Regarding the target area of model evaluation Garfinkel et al. (2020) studied the sources of CMIP5 intermodel spread in precipitation changes globally, however, ample analyses are targeted at more regional areas, e.g., the North-Atlantic (Perez et al. 2014), parts of Europe (Coppola et al. 2010; Pieczka et al. 2017), Africa (Brands et al. 2013; Dyer et al. 2019; Yapo et al. 2020), South-America (Lovino et al. 2018), or Asia (Ahmed et al. 2019).

Policy and management actions taken in response to the environmental hazards linked to climatic change, especially the consideration of the societal and agricultural impacts of extreme climate events, are limited by the uncertainties around GCM performance and internal variability too (Knutti and Sedláček 2013). There is growing body of evidence that the regional accuracy of GCM simulations is critical for regional climate model (RCM) projections. An RCM nested in a GCM that lacks the skillful representation of the observed large-scale climatic modes and circulation (i.e., boundary conditions for the RCM) cannot be expected to generate realistic results (Gautam and Mascaro 2018; Verfaillie et al. 2019). Despite a model’s agreement with current or more distant past observations does not always guarantee credibility to its future projections, the fact that it is based on physical principles supports the idea of using past observations as constraints in hope of selecting models with more reliable future projections (Reichler and Kim 2008; Sheffield et al. 2013; Sillmann et al. 2013; Wang et al. 2014; Barnes and Polvani 2015). Consequently, boosting the robustness of regional GCM projections via the rigorous evaluation of their uncertainties is crucial, especially for those that drive RCMs in the MED-CORDEX project (Ruti et al. 2016).

Several studies indicate that structural differences, namely the land-atmosphere feedback strength, between models can indeed be a source of uncertainty in future precipitation projections (e.g., Schwingshackl et al. 2018). However, the exact physical mechanisms such as how changes in soil moisture affect precipitation or temperature extremes remain uncertain (Boberg and Christensen 2012; Taylor et al. 2012a; Berg et al. 2016). Recently Vogel et al. (2018)—based on CMIP5 models with more realistic land-atmosphere couplings—concluded a future reduction in summer drying and warm extremes in Central Europe in line with a study by Selten et al. (2020). Nevertheless, how internal variability may influence the selection of best performing models and thus the uncertainty in future precipitation projections remains unaddressed.

Internal variability can be considered the parallel existence of numerous climate states at a given time (Lorenz 1963). It is an inherent feature of the climate system driven by chaotic dynamics (Bódai and Tél 2012; Drótos et al. 2015, 2016, 2017; Herein et al. 2016, 2017). In single-model initial condition large ensemble (SMILE) simulations, unlike the CMIP5 multi-model ensemble, the same model is run several times with perturbations in the initial condition. The single runs—differing in their initial conditions exclusively—constitute the members of the ensemble whose spread is related to internal variability with the ensemble mean reflecting the forced component.

In this paper, we aim to complement the inconclusive literature on East-Central European future precipitation projections and assess CMIP5 historical model performance based on long instrumental records from the HISTALP database (Auer et al. 2007) and the National Oceanic and Atmospheric Administration (NOAA) twentieth century reanalysis (Compo et al. 2011; Slivinski et al. 2019) for 1861–2005 and compare our constrained model ensemble’s future precipitation projections to six SMILEs (Deser et al. 2020). Within the CMIP5 multi-model ensemble, we cannot estimate the relative role for internal variability and model structural differences in influencing the spread of future precipitation projections since, for example, land-atmosphere feedbacks appear with different precision in the 32 CMIP5 models (Cheruy et al. 2014). However, with the inclusion of SMILEs, new opportunities open: we can explore the range of future precipitation projections solely due to internal variability (per model) and thus place CMIP5 model structural differences in the context of internal variability when assessing future hydroclimate uncertainty.

We identify three CMIP5 models with outstanding performance in simulating both regional past hydroclimate variability and land-atmosphere feedbacks (Seneviratne et al. 2013; Vogel et al. 2018), which unanimously indicate significantly less dry future summer conditions relative to the spread of CMIP5 and six SMILE simulations. This emphasizes the role for land-atmosphere coupling in regulating future summer hydroclimate uncertainty and a possible limitation affecting the state-of-the-art SMILE simulations that requires future work to disentangle. Our paper provides new insights into how those models that show better skills in reproducing observed climate variability can help refine future hydroclimate uncertainty in the presence of internal variability and advocates new efforts dedicated to improving model performance in simulating land-atmosphere feedbacks in Central Europe.

Data and methods

Study area description

The primary target area consists of the northeast (NE) and southeast (SE) subregions of the Greater Alpine Region (43° N-50° N; 13° E-19.5° E; Fig. 1), which have been delineated by Auer et al. (2007) based on the regionalization of certain climatic variables. It was chosen to cover the region of interest (East-Central Europe), where precipitation projections of CMIP5 models show large spread for both summer (Fig. 2a) and winter (Fig. 2b). To ensure the robustness of results based on the primary target area, supplementary calculations were performed on an extended domain (43° N-57° N; 4° E-20° E; Fig. 1) corresponding to the East-Central European part of the area used in Vogel et al. (2018).

Fig. 1
figure 1

Location of the study area on an orography map. The purple and blue dashed lines represent the HISTALP Northeast (NE) and Southeast (SE) regions. The primary target area is selected as the rectangle overlapping the NE and SE regions, i.e., 43° N-50° N;13° E-19.5° E. The black rectangle represents the Central Europe (CEU) domain (extended target area) used to evaluate CMIP5 models against the NOAA twentieth century reanalysis

Fig. 2
figure 2

Time series of standardized (relative to 1971–2000 mean) seasonal mean precipitation projections of 31 CMIP5 models for 2021–2085 (31-year moving averaged) for the primary target area (43° N-50° N; 13° E-19.5° E) a for summer (June-July-August: JJA) and b for winter (December-January-February: DJF)

HISTALP instrumental and the NOAA twentieth century reanalysis data

For the basis of model assessment, we used monthly surface temperature (TS) and precipitation (PR) data from the HISTALP coarse resolution subregional mean (CRSM) series for the NE and SE subregions of the Greater Alpine Region (Fig. 1) for 1861–2005 (Auer et al. 2007; we refer to this data as observations). The CRSMs are arithmetic means of the homogenized anomaly series (reference period: 1961–1990) for the stations situating within the boundaries of NE and SE subregions. We also utilize TS and PR data from the NOAA twentieth century reanalysis version 3 (Compo et al. 2011; Slivinski et al. 2019) for the assessment of model performance.

CMIP5 models and single-model initial condition large ensembles

For the analysis, we selected CMIP5 models that have so-called historical or future RCP8.5 simulations following the historical experimental design for 1861–2005 and RCP8.5 forcing scenario for 2006–2100 (Lamarque et al. 2010; Taylor et al. 2012b). This resulted in the selection of 32 different model versions for the historical and 31 models for the future timeframe from 16 modeling centers worldwide (Table 1). In addition, we made use of six SMILE simulations’ future (2006–2080) precipitation simulations: Max Planck Institute Grand Ensemble (MPI-GE), Canadian Earth System Model Large Ensemble (CanESM-LE), Community Earth System Model Large Ensemble (CESM-LE), Geophysical Fluid Dynamics Laboratory Earth System Model version 2 Large Ensemble (GFDL_ESM2M-LE), Commonwealth Scientific and Industrial Research Large Ensemble (CSIRO-LE), and EC-EARTH Large Ensemble (EC_EARTH-LE). In addition, we used two historical (1861–2005) precipitation simulations of the MPI-GE and EC_EARTH-LE (further details and references are found in Table 2).

Table 1 32 CMIP5 models used in the study (excluding NorESM1-M for the future timeframe). Expansions/definitions of the models are available online (
Table 2 Large ensemble simulations used in the study

Ranking the individual CMIP5 models

As a preliminary step, model output was interpolated onto the same regular 1.5° grid, and anomalies (relative to 1961–1990 to match the HISTALP anomaly time series) were calculated for all the individual historical CMIP5 simulations. Boreal summer (June-July-August: JJA) and winter (December-January-February: DJF) averages were derived annually for both the CMIP5 historical (1861–2005) and future (2006–2100) simulations and the observations of TS and PR. Additionally, both observational and model data were smoothed with a centralized 31-year moving average to mostly account for multidecadal low-frequency variability (as is the standard practice to minimize the effect of internal variability in single model realizations; e.g., McCabe and Palecki 2006; Senftleben et al. 2020) and to ensure comparability with the GCM data with relatively coarse grid resolution. Data preparation resulted in area-averaged and smoothed time series for the two subregions (NE and SE) for each model, variable, and season along with the observed time series.

We used three statistics for the individual CMIP5 models’ assessment with root-mean square error (RMSE) being the primary one in addition to the fraction of temporal Pearson correlation coefficient and mean-absolute error (referred to as: rank) and the Nash-Sutcliffe efficiency (NSE, Nash and Sutcliffe 1970) calculated between the observed and simulated time series. The reason we include temporal correlation is to measure to what extent simulated long-term (the time series are smoothed with a 31-year moving average) changes in PR and TS are in-phase with observations as it is expected for a model to reproduce observed low-frequency TS and PR changes. The NSE (Eq. 1) is calculated based on the observed (obs) and simulated (sim) time series pairs as:

$$ NSE=1-\frac{\sum_{i=1}^n{\left( obs- sim\right)}^2}{\sum_{i=1}^n{\left[\left( obs-\left(\overline{obs}\right)\right.\right]}^2} $$

where n is the length of the timeseries and \( \left(\overline{\mathrm{obs}}\right) \) indicates the time-mean of the observed timeseries. The NSE ranges from -∞ to 1, where 1 would mean the perfect observation-simulation match (which is not possible) and NSE = 0 indicates that the modeled time series’ mean-square-error is commensurable with the variance of the observed time series.

For simplicity, we now only go through the ranking steps for the RMSE as the calculations are the same for the other two statistics. First, RMSE corresponding to each of the two seasons (JJA and DJF) was calculated for both variables (referred to as TS and PR seasonal RMSE) for the two subregions separately. Then, the RMSE values were averaged for the two subregions and seasons for the two variables separately (referred to as TS and PR mean RMSE). We also assess the overall performance of a model in reproducing the observed past hydroclimate variability in the target region and introduce the grand-RMSE, which is the average of the TS and PR mean RMSE values. To ensure comparability of the RMSE of PR and TS, we rescaled the values (for both variables) to range between 0 and 1 (Eq. 2) before averaging them into the grand-RMSE, which is the arithmetic mean of the scaled mean RMSE of TS and PR.

$$ {RMSE}_{scaled}=\frac{meanRMSE-\underset{m=1,\cdots, M}{\min }(meanRMSE)}{\max_{m=1,\cdots, M}(meanRMSE)-\underset{m=1,\cdots, M}{\min }(meanRMSE)} $$

where m goes through the M = 32 CMIP5 models.Note that the applied rescaling is based on the maximum and minimum values of the mean RMSE to maintain the relative differences between each model’s performances.

Rank histogram to assess the performance of an ensemble

Additionally, to assess the performance of an ensemble as a whole, we apply the rank histogram on year-to-year seasonal (JJA and DJF) averaged HISTALP and simulated data (Talagrand et al. 1997; Annan and Hargreaves 2010; Maher et al. 2019) for the two SMILEs with sufficiently long historical simulations (MPI-GE and EC_EARTH-LE) and for the CMIP5 ensemble. To do so, let us consider an ensemble with n members and initially let the rank = 1. At each time-step (1861–2005), we count the number of members of a given ensemble that are greater than the observed value at that time-step, which can be between count = 0 and count = n. If count = 0, then the rank = 1, or if count = n, then the rank = n + 1, else the rank = count. We plot the histogram of the ranks and check for consistency with uniformity based on a chi-squared test (Annan and Hargreaves 2010). If the ensemble underestimates the observed variability, then observations will frequently lie close to, or outside the edges of the ensemble resulting in a u-shaped rank histogram, while a well performing ensemble would yield a flat rank histogram.

Using observations to constrain the CMIP5 ensemble

Ranking CMIP5 models based on their historical performance (1861–2005)

To begin with, we assess the historical performance of 32 CMIP5 models based on the HISTALP observations (1861–2005) and use the above described ranking method. At first, seasonal RMSE, rank, and NSE were calculated for the NE and SE subregions for both seasons separately, then averaged over the primary target area (Fig. 3). The 32 CMIP5 models show diverging performance in capturing past seasonal TS and PR variability (Fig. 3). Some models (e.g., FGOALS-s2; MRI-CGCM3) stand out from others, suggesting that abandoning the “one model one vote” approach (Knutti 2010) is a right decision for the target area. The spread between the performances of the models are larger in summer (Fig. 3a) than in winter (Fig. 3b), which discrepancy might be rooted in that (i) summertime convective precipitation is more challenging for GCMs to capture (Dai 2006) and (ii) that the regional surface temperature warming signal over the past decades is more pronounced in summer than in winter. For some of the models (e.g., CanESM2; CSIRO-Mk3.6) the seasonal rank is negative, along with higher seasonal RMSE values. The negative rank means negative correlation between the observed and simulated time series, which is indicative of that the model is out of phase with the long-term observed changes.

Fig. 3
figure 3

Surface temperature (TS: red crosses) and precipitation (PR: blue circles) seasonal ranks (uppermost panel), NSE (middle panel), and RMSE (lower panel) for the historical era (1861–2005) across 32 CMIP5 climate models (listed below the x-axis) a for summer (June-July-August: JJA) and b for winter (December-January-February: DJF)

In the next steps, first, we average the seasonal statistics and obtain the mean RMSE, rank, and NSE (Fig. 4a) per variable, second, re-scale them to range from 0 to 1 based on Eq. 2 and third, via averaging the re-scaled mean RMSE, rank, and NSE values, we obtain the grand-RMSE, -rank, and -NSE (Fig. 4b). Those models that performed above the 90th percentile of the CMIP5 ensemble based on any of the three metrics—a total of six models (FGOALS-s2; IPSL-CM5B-LR; MPI-ESM-LR; MRI-CGCM3; MRI-ESM1; GISS-E2-R-CC)—are selected that skillfully reproduce multidecadal TS and PR variability over the past ~ 150 years in East-Central Europe (Fig. 4b; Table 3). To get a more visual picture of the six top performing models’ past climate variability, 31-year moving averaged TS and PR time series for 1861–2005 are plotted against the HISTALP observations for both JJA and DJF for the two Greater Alpine subregions, separately (Fig. S1). Overall, models show large spread in their historical projections in the two subregions for both variables and seasons, which are visibly reduced among the six selected models (see the colored solid lines in Fig. S1).

Fig. 4
figure 4

a Surface temperature (TS) and precipitation (PR) mean ((JJA + DJF)/2) ranks (upper panel), mean NSE (middle panel), and mean RMSE (lower panel) for East-Central Europe for the historical era (1861–2005) across 32 CMIP5 climate models (indicated below the x-axis). b Box-and-whiskers plot of the scaled [0;1] grand-rank, grand-NSE, and grand-RMSE (indicated below the x-axis) showing the 10th and 90th percentiles of 32 CMIP5 model’s performances. Red circles show models above 90th and below 10th percentiles, respectively. Those six models above the 90th percentile in the grand-rank or -NSE or -RMSE are highlighted with green on a

Table 3 Scaled [0; 1] grand-RMSE, -NSE, and -rank of 32 CMIP5 models for East-Central Europe. Models performing above the 90th percentile of the CMIP5 ensemble are italicized

Validation of the ranking based on the NOAA twentieth century reanalysis

To account for possible obscuring effects of the moderate size of the primary target area on the selection of the best performing models, we repeat the ranking using only the RMSE statistic for the extended domain (section 2.1; Fig. 1) based on the NOAA twentieth century version 3 gridded reanalysis (Slivinski et al. 2019). The calculation method is equivalent to the one applied to the HISTALP records except the 32 CMIP5 models are evaluated against the gridded reanalysis product. In Fig. 5, we demonstrate the spatial distribution of the reanalysis-based TS/PR mean RMSE relative to the CMIP5 multi-model ensemble mean for the six previously selected models (Fig. 5a–l) and show the TS/PR mean RMSE and the grand-RMSE for each model averaged over the extended target area (Fig. 1) as a box-and-whiskers plot (Fig. 5m). Spatial maps of the TS/PR mean RMSE for each of the 32 models are additionally shown in Fig. S2. Based on the grand-RMSE for the extended target area (Fig. 5m), only three out of the previously selected six models exhibit similar good overall performance; thus, we further reduce the range of selected models to the MRI-CGCM3, MRI-ESM1, and FGOALS-s2 and refer to them as the constrained ensemble.

Fig. 5
figure 5

Spatial map of af surface temperature (TS) and gl precipitation (PR) RMSE based on the NOAA twentieth century reanalysis (1861–2005) relative to the CMIP5 ensemble mean shown for the top 6-ranked models (based on historical instrumental data, see Methods) and m box-and-whiskers plot of the TS (violet) and PR (light blue) RMSE averaged for the Central European (CEU) domain (43°-57° N;4° E-20° E) and the average of the TS and PR RMSE values (Grand RMSE with gray) for 32 CMIP5 models each of which is marked as in the legend. The whiskers extend to the minimums and maximums. The median of each group is indicated with orange horizontal lines. The means are marked with ×

Rank histograms

Furthermore, since internal variability cannot be correctly assessed in a multi-model ensemble because of the initial condition problem and differences in model structures (Branstator and Teng 2010; Knutti 2010; Bódai and Tél 2012), it must be considered that it may leave its fingerprint on our ranking and study rank histograms of historical precipitation projections of the MPI-GE and EC_EARTH-LE in the primary target area. Figure 6a–d exhibit that both SMILEs underestimate the observed summer and winter precipitation variability (histograms are u-shaped), which is reinforced by the chi-squared tests indicating significant differences from uniformity (on the 99% confidence level). Additionally, the CMIP5 multi-model ensemble shows similar rank histograms to the SMILEs’ (Fig. 6e–f), except that the winter rank histogram does not differ significantly from a flat one. These indicate that (i) conclusions based on simulated internal variability by these two state-of-the-art SMILEs (and possibly by the others as well) should be treated with caution and that (ii) observational constraints may indeed be helpful in revealing models with structural advances relative to other models. In the upcoming sections, we further elucidate these issues.

Fig. 6
figure 6

Rank histograms based on the year-to-year seasonal mean HISTALP observed precipitation in the target area for two large ensembles ab MPI-GE and cd EC-EARTH-LE, and the CMIP5 multi-model ensemble ef

A possible source for a reduced projection spread: land-atmosphere couplings

We are particularly concerned with how future projections of the constrained model ensemble look like in East-Central Europe. We find that not only did the ranking result in a reduced spread in historical simulations (Fig. S1), but the members of the constrained ensemble also show reduced spread in their future projections relative to the CMIP5 ensemble mean for both summer (Fig. 7a) and winter (Fig. 7b). Moreover, the difference between the CMIP5 ensemble mean (28 models’ mean: − 3.9%/decade) and the constrained ensemble mean (3 models’ mean: − 0.1%/decade) future precipitation trend is significant based on a two-sample t test (99% confidence level). The three top-ranked models indicate less dry summer and wetter winter conditions in the upcoming decades not only in the primary target area but also on the extended domain in parallel with considerable surface temperature rise (Fig. 8). Members of the constrained CMIP5 ensemble indicate − 0.7 to + 1%/decade summer and + 1 to + 5%/decade winter precipitation change for East-Central Europe relative to 1971–2000 (Fig. 8). Examining the constrained ensemble members’ future seasonal surface temperature projections, we find no noticeable differences relative to the CMIP5 ensemble mean; therefore, we rule out the possibility that the discrepancy in future precipitation projections may be due to a negative surface temperature bias in those models (Fig. 8).

Fig. 7
figure 7

Time series of standardized (relative to 1971–2000 mean) seasonal mean precipitation (PR) for 2021–2085 (31-year moving averaged) for the members of the constrained CMIP5 ensemble (colored solid lines) and the mean of 31 CMIP5 models (thick solid gray line) in addition to the 31 individual models in CMIP5 (thin solid gray lines) for the primary target area (blue box on Fig. 8: 43° N-50° N;13° E-19.5° E) a for JJA and b for DJF

Fig. 8
figure 8

Spatial map of the linear trend (relative to 1971–2000 mean) of ac and gi surface temperature (TS: K/decade) and df and jl precipitation (PR: %/decade) for 2021–2085 (31-year moving averaged) under RCP8.5 scenario in the members of the constrained CMIP5 ensemble (see Methods for ranking details) for JJA and DJF

Our results are partly at odds with previous expectations that project extensive summer drying in the Central European region (Feng and Fu 2013; Sheerwood and Fu 2014; Polade et al. 2015; Pfleiderer et al. 2019). One mechanism for the advanced summer aridification in the region has been associated with the moist lapse-rate feedback due to global warming (Brogli et al. 2019). A warmer atmosphere, deduced from Clausius-Clapeyron relation, can hold more moisture, which, during moist adiabatic vertical motions, allows enhanced latent heat release and thus upper-tropospheric warming. These altogether result in an increased dry atmospheric static stability as the thermal stratification remains close to the moist adiabat during summer (Schneider 2007; Brogli et al. 2019). Another mechanism regarding changes in atmospheric circulation regimes, such as the poleward shifted subsidence zone with the projected expansion of the Hadley-cell, has also been suggested to influence future hydroclimate in the region due to enhanced radiative forcing (Perez et al. 2014; Mann et al. 2018). Nevertheless, inconclusive literature (e.g., Kröner et al. 2017) hinders us from a complete understanding of possible future precipitation changes in transitional climatic zones, such as Central Europe.

Recent studies highlight a competing role for land-atmosphere interactions and the extent of its realistic representation in climate models in determining future hydroclimate uncertainty in the Mediterranean and Central Europe, where soil moisture largely affects temperature and precipitation via the partitioning of net radiation into sensible and latent heat fluxes (Boberg and Christensen 2012; Lorenz et al. 2016; Vogel et al. 2018; Al-Yaari et al. 2019; Selten et al. 2020). It has also been proposed that it is not enough for a model to faithfully represent observed soil-atmosphere feedbacks because convection, land-surface, and cloud parametrization schemes not only influence how soil moisture-precipitation feedbacks are handled in a model but also affect soil moisture-temperature feedbacks in turn (Christensen and Boberg 2012). This further complicates and highlights the importance of land-atmosphere interactions in determining future hydroclimate uncertainty in our target region.

Members of our constrained CMIP5 ensemble belong to the group of CMIP5 models that was identified by Vogel et al. (2018) with (i) more fidelity in representing land-atmosphere couplings and (ii) less pronounced summer hot and dry extremes for central Europe. A physical mechanism strongly connected to land-atmosphere feedbacks that might balance the decrease in precipitation during future transition into semi-arid regions in Central Europe was also suggested (Taylor et al. 2012a). In an early study, Dai (2006) showed that a previous version of MRI-CGCM3 (the MRI-CGCM version 2.3.2a) better captured observed global rainfall patterns than other models indicating that some basic features rooted in the model physics (most likely the convective and stratiform precipitation parametrization schemes) can indeed be sources of intermodel spread.

These lines of evidence reinforce the idea of ranking to constrain future hydroclimate projections of different CMIP5 models based on evaluating their historical performance and suggest an important physical mechanism that can explain why our selected models perform better regionally. Furthermore, presented results provide valuable implications for future RCM simulations and advocate future research to revisit the problem of the fidelity of land-atmosphere feedbacks in RCM simulations, where the enhanced resolution allows for a more detailed picture of regional feedback mechanisms.

Although the 32 GCMs differ in their external forcing components for their historical simulations, no pronounced differences are observable between the members of the constrained and the CMIP5 ensemble (not shown). We argue that the varying historical model projection skills are not primarily rooted in the differences between the external forcing components in line with previous studies, e.g., Reichler and Kim (2008). Nevertheless, we note that the choice of the external forcing scenario does influence future model projections (Santer et al. 2018), especially under a changing climate, when the external forcing components are time-dependent (Bódai et al. 2020; Haszpra et al. 2020a,b).

Placing future precipitation projections of the constrained ensemble in the context of SMILE projections

Based on the ranking, we identified a constrained CMIP5 multi-model ensemble that shows reduced spread in their historical and future precipitation projections indicating less dry summer and wetter winter conditions in the upcoming decades (Figs. 7 and 8). We have also seen that land-atmosphere feedbacks may be of key importance in explaining why some models perform better than others. The advantage of including SMILE simulations in our study is to provide an estimate (i) for the forced response (ensemble mean) in precipitation to greenhouse gas emissions and (ii) for all possible states allowed by internal variability in a certain model (ensemble spread), which allows us to place the observationally constrained CMIP5 ensemble (three top-ranked models) in the context of internal variability. What is more, with the inclusion of six SMILEs, we can compare the internal variability of projected precipitation of various models, thus, get a more robust estimate of future states of hydroclimate allowed by internal variability in the region. We are aware of caveats added by the coarse spatial and topography resolution of SMILEs; however, currently, it is our best estimate for projected hydroclimate uncertainty due to internal variability.

We plot the spatial map of the ensemble mean future precipitation projections’ linear trends for Europe and the spread across all members of the ensembles as a box-and-whiskers plot for our primary target area (Fig. 1) for summer (Fig. 9) and winter (Fig. 10). In summer, all SMILE mean simulations show drier future conditions in East-Central Europe indicating − 2 to − 7%/decade precipitation decrease during the upcoming decades relative to 1971–2000 (Fig. 9), while the constrained CMIP5 ensemble mean trend indicates less pronounced summer drying (− 0.1%/decade). However, the magnitudes of the ensemble mean projections and the ensemble spread of different SMILE simulations vary considerably across the six SMILEs that implies a role for model uncertainty in regulating future hydroclimate changes on top of internal variability. Furthermore, the constrained ensemble’s mean future (2021–2085) precipitation trend (− 0.1%/decade) emerges from the interquartile range of simulated internal variability by six SMILEs ((− 8%, − 1%)/decade). The difference between the group of future precipitation trends spanned by all the members of the six SMILEs (a total of 256 members) and the constrained ensemble (3 members) is significant based on a two-sample t test on the 99% confidence level (the means of the two groups’ trends: − 4.8%/decade for the six SMILEs and − 0.1%/decade for the constrained ensemble).

Fig. 9
figure 9

Above: af spatial map of the ensemble mean (forced component) linear trend (relative to 1971–2000) of summer (June-July-August: JJA) precipitation for 2021–2085 (31-year moving averaged) for the six SMILEs. Below: g box-and-whiskers plot (with the whiskers extending to 1.5 × interquartile range) of JJA precipitation linear trends (relative to 1971–2000) for 2021–2085 (31-year moving averaged) for the CMIP5 multi-model and the six SMILEs (indicated below the x-axis) for the primary target area (indicated by the red rectangles on af: 43° N-50° N; 13° E-19.5° E). The median of each ensemble is indicated with numbers above the boxes in addition to the orange lines. The means are marked with ×, while the outliers (extending 1.5 × interquartile range) are marked with +. Trend values of the members of the constrained CMIP5 ensemble are indicated with markers on the first box-and-whiskers

Fig. 10
figure 10

Above: af spatial map of the ensemble mean (forced component) linear trend (relative to 1971–2000) of winter (December-January-February: DJF) precipitation for 2021–2085 (31-year moving averaged) for the six SMILEs. Below: g box-and-whiskers plot (with the whiskers extending to 1.5×interquartile range) of DJF precipitation linear trends (relative to 1971–2000) for 2021–2085 (31-year moving averaged) for the CMIP5 multi-model and the six SMILEs (indicated below the x-axis) for the primary target area (indicated by the red rectangles on af: 43° N-50° N; 13° E-19.5° E). The median of each ensemble is indicated with numbers above the boxes in addition to the orange lines. The means are marked with ×, while the outliers (extending 1.5 ×interquartile range) are marked with +. Trend values of the members of the constrained CMIP5 ensemble are indicated with markers on the first box-and-whiskers

Since we used observations to constrain the CMIP5 ensemble, which resulted in the selection of models with more realistic representations of land-atmosphere feedbacks, we assume that the difference between the constrained ensemble’s and the six SMILEs’ future summer precipitation trends may be attributable to land-atmosphere coupling discrepancies between the models. Importantly, except for the CESM1, the base models of the large ensemble simulations were either involved in the ranking, or we evaluated their historical simulations (see Sect. 3.3). Thus, it is unlikely that the SMILE simulations would regionally outperform the members of the constrained CMIP5 ensemble in capturing observed precipitation variability. Although this needs further efforts to clarify, these lines of evidence suggest less extreme summer drying in East-Central Europe and that land-atmosphere coupling may play a key role in regulating future summer hydroclimate uncertainty in line with several recent studies (Boberg and Christensen 2012; Vogel et al. 2018; Selten et al. 2020).

On the other hand, in winter, the six SMILE mean simulations show future regional precipitation changes ranging from − 0.2 to + 4.5%/decade relative to 1971–2000 (Fig. 10a–f). For winter, the SMILEs show a similar range (− 3.7 to 7.1%/decade) of possible future precipitation conditions in our region to both the unconstrained and the constrained CMIP5 ensemble; however, the differences between the six SMILEs are discernible (Fig. 10g). Unlike in summer, the top 3–ranked CMIP5 models’ future winter precipitation trend values are well within 1.5×interquartile range of SMILE simulations (Fig. 10g). However, the relative role for internal variability compared with model structural differences, or the exact physical mechanism responsible for the spread in either among the different ensembles or among the members of a SMILE, remains uncertain and needs future work to untangle. For example, based on Fig. 10a–f, we note the importance of the exact geographical location of the simulated transition zone between regions with drier and wetter future conditions in the different models. This suggests that internal atmospheric circulation changes may play an important role (e.g., via regulating the extent of the northward expansion of the Hadley-cell and thus the subsidence zone (Lu et al. 2007)) in determining the geographical location of the transition between projected drier and wetter conditions that might also be dependent on the amount of emitted greenhouse gases in the future (Haszpra et al. 2020b). There are plenty of rooms for future studies in these directions, which is strongly encouraged in hope of a more comprehensive understanding of future hydroclimate uncertainties.

Summary and conclusions

In this paper, we applied a ranking method to account for the possible role of structural differences between 32 CMIP5 GCMs in regulating hydroclimate projection uncertainty. The assessment of historical performance of GCM projections resulted in a constrained CMIP5 ensemble with reduced future seasonal precipitation projection spread for East-Central Europe. Moreover, the members of the constrained ensemble belong to a group of models previously identified with more realistically simulated land-atmosphere coupling (Vogel et al. 2018) and the mean of their future summer precipitation projections—indicating little-to-no changes (− 0.1%/decade)—significantly differ from the unconstrained CMIP5 ensemble mean (− 3.9%/decade) and from the mean of the spread of six SMILEs (− 4.8%/decade). These altogether suggest an important role for land-atmosphere coupling differences among climate models in regulating future summer hydroclimate uncertainty on top of the irreducible internal variability and calls for caution when interpreting future summer precipitation projections of the state-of-the-art SMILE simulations. We urge coordinated efforts to further quantify the relative contribution of internal variability and model structural differences in regulating future seasonal hydroclimate uncertainty in Central Europe.

Our results also shed more light on how future efforts toward reducing hydroclimate uncertainty based on regional climate models may be organized. Recent studies note that RCMs driven by GCMs with more realistic precipitation variability are more likely to have reliable precipitation projections (Syed et al. 2019). Therefore, the careful selection of driving GCMs for RCMs and the thorough evaluation of RCMs based on their land-atmosphere coupling feedbacks (e.g., soil moisture-temperature/precipitation couplings) may be a useful step toward alleviating RCM projection uncertainty, which physical constraint must also be taken into account before downscaling SMILEs to get regional ensemble simulations. In light of our results, we emphasize the possibility of less than previously thought dry summer conditions in the upcoming decades and advocate the parallel application of SMILE simulations with multi-model ensembles when producing inputs for future policymaking.