Relative model score: a scoring rule for evaluating ensemble simulations with application to microbial soil respiration modeling

  • Ahmed S. Elshall
  • Ming Ye (corresponding author)
  • Yongzhen Pei
  • Fan Zhang
  • Guo-Yue Niu
  • Greg A. Barron-Gafford
Original Paper


This paper defines a new scoring rule, the relative model score (RMS), for evaluating ensemble simulations of environmental models. RMS implicitly incorporates measures of ensemble mean accuracy, prediction interval precision, and prediction interval reliability to evaluate overall model predictive performance. RMS is evaluated numerically from the probability density functions of ensemble simulations given by individual models, or by several models via model averaging. We demonstrate the advantages of RMS through an example of soil respiration modeling. The example considers two alternative models of different fidelity, and for each model Bayesian inverse modeling is conducted using two different likelihood functions. This gives four single-model ensembles of model simulations. For each likelihood function, Bayesian model averaging is applied to the ensemble simulations of the two models, resulting in two multi-model prediction ensembles. The predictive performance of these six ensembles is evaluated using various scoring rules. Results show that RMS outperforms the commonly used scoring rules of log-score, pseudo Bayes factor based on Bayesian model evidence (BME), and continuous ranked probability score (CRPS). RMS avoids the rounding-error problem specific to the log-score. Because it is applicable to any likelihood function, RMS has broader applicability than BME, which can only compare multiple models that share the same likelihood function. By directly considering the relative score of candidate models at each cross-validation datum, RMS yields a more plausible model ranking than CRPS. Therefore, RMS is considered a robust scoring rule for evaluating the predictive performance of single-model and multi-model prediction ensembles.
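For readers unfamiliar with the baseline scoring rules named above, the following is a minimal sketch of two of them, CRPS and the log-score, evaluated for a single observation against a finite ensemble. It is not the RMS defined in the paper, whose formula is not given in this abstract. The function names `crps_ensemble` and `gaussian_log_score`, the sample ensembles, and the choice of a Gaussian fit for the predictive density are assumptions made for illustration only; the CRPS estimator used is the standard ensemble form E|X - y| - 0.5 E|X - X'|.

```python
import math


def crps_ensemble(members, obs):
    """Empirical CRPS of an ensemble for one observation (lower is better).

    Uses the standard ensemble estimator E|X - y| - 0.5 * E|X - X'|.
    """
    n = len(members)
    # Mean absolute distance between ensemble members and the observation.
    term_obs = sum(abs(x - obs) for x in members) / n
    # Mean absolute distance between all pairs of members (ensemble spread).
    term_spread = sum(abs(a - b) for a in members for b in members) / (n * n)
    return term_obs - 0.5 * term_spread


def gaussian_log_score(members, obs):
    """Log predictive density at obs (higher is better).

    Assumption for this sketch: the predictive density is a Gaussian fitted
    to the ensemble mean and sample variance.
    """
    n = len(members)
    mean = sum(members) / n
    var = sum((x - mean) ** 2 for x in members) / (n - 1)
    return -0.5 * math.log(2.0 * math.pi * var) - (obs - mean) ** 2 / (2.0 * var)


if __name__ == "__main__":
    # Hypothetical ensembles for one cross-validation datum.
    observation = 1.0
    narrow = [0.8, 0.9, 1.0, 1.1, 1.2]   # accurate and precise
    wide = [-1.0, 0.0, 1.0, 2.0, 3.0]    # accurate mean, wide interval
    for name, ens in (("narrow", narrow), ("wide", wide)):
        print(name, crps_ensemble(ens, observation), gaussian_log_score(ens, observation))
```

Both ensembles here have the same mean, so the difference in their scores comes only from spread, which is the precision component that the abstract says RMS also accounts for.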


Scoring rule, Continuous ranked probability score, Bayes factor, Log-score, Dispersion, Reliability



This work was supported by the Department of Energy Early Career Award DE-SC0008272 and NSF-EAR Grant 1552329.



Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2018

Authors and Affiliations

  1. Department of Scientific Computing, Florida State University, Tallahassee, USA
  2. Department of Earth, Ocean, and Atmospheric Science, Florida State University, Tallahassee, USA
  3. School of Computer Science and Software Engineering, Tianjin Polytechnic University, Tianjin, China
  4. Key Laboratory of Tibetan Environmental Changes and Land Surface Processes, Institute of Tibetan Plateau Research, Chinese Academy of Sciences, Beijing, China
  5. Biosphere 2, University of Arizona, Tucson, USA
  6. Department of Hydrology and Water Resources, University of Arizona, Tucson, USA
  7. School of Geography and Development, University of Arizona, Tucson, USA
  8. Department of Geology and Geophysics, and Water Resources Research Center, University of Hawaii Manoa, Honolulu, USA
