We argued in the previous section that in order to judge the credibility of a prediction we should predict its performance. As with any non-trivial prediction, we should not expect our predictions of performance to be perfect, but we should aim to minimize errors. To this end, we now scrutinize the arguments that we might employ to justify predictions of performance.
One way to justify a prediction of performance might be to argue that performance can be deduced logically from facts or assumptions about the predictor and predictand without reference to the past performance of the predictor. This is called a design- or construction-based argument by Parker (2010, 2011).
Construction-based arguments are valid in some circumstances. If the performance measure is independent of the predictand then performance can be deduced from the prediction alone. Examples include criteria that check the prediction for self-contradictions (e.g. daily minimum temperatures exceeding daily maximum temperatures), or that assess the confidence (e.g. sharpness) of the prediction.
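As a minimal illustration of performance measures that do not depend on the predictand, the following Python sketch checks a prediction for the self-contradiction mentioned above and computes one possible sharpness measure. The function names and the use of mean interval width as a proxy for sharpness are our own illustrative assumptions.

```python
from statistics import mean


def has_self_contradiction(daily_min, daily_max):
    """Return True if any predicted daily minimum exceeds the predicted maximum."""
    return any(lo > hi for lo, hi in zip(daily_min, daily_max))


def mean_interval_width(lower, upper):
    """Mean width of prediction intervals (assumed sharpness proxy; narrower is sharper)."""
    return mean(u - l for l, u in zip(lower, upper))


# Hypothetical predicted temperatures for three days (illustrative values only).
pred_min = [2.0, 3.5, 1.0]
pred_max = [8.0, 3.0, 6.5]  # second day: minimum 3.5 exceeds maximum 3.0

print(has_self_contradiction(pred_min, pred_max))   # True
print(mean_interval_width([1.0, 2.0], [3.0, 6.0]))  # 3.0
```

Neither function takes an observation as input, which is what allows such measures to be evaluated from the prediction alone.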
If the performance measure does depend on the predictand then additional information must be used in order to justify a prediction of performance. One such construction-based argument that has been discussed in the context of climate predictions concerns the ‘perfect model scenario’ (Smith 2002, 2006). The argument here is that an ensemble of predictions will be reliable if the ensemble-generating model is structurally similar to the predicted system in all relevant aspects and if its inputs are sampled to represent the uncertainty (e.g. the known distribution of measurement error) in the initial conditions of the predicted system. See also Betz (2006, ch. 9) and Stainforth et al. (2007). Which aspects of the predicted system are relevant depends on the prediction problem, so a model may be adequate for one type of prediction but not for others. Some authors, however, consider the structure of today’s climate models to be insufficiently similar to that of the climate system for such an argument to provide a strong justification of reliability for most prediction problems (e.g. Betz 2006; Frame et al. 2007; Parker 2010). Smith (2006) doubts that weather and climate models sufficiently isomorphic to the climate system will ever be realized. See Parker (2010), Allen et al. (2006) and Otto (2012) for similarly negative conclusions regarding arguments based on ‘imperfect model scenarios’.
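The logic of the perfect model scenario can be illustrated with a toy simulation. In the Python sketch below we assume, purely for illustration, that the 'truth' and every ensemble member are independent draws from the same standard normal distribution, which is a stand-in for a structurally perfect model with correctly sampled inputs. Under that assumption the rank of the truth among the ensemble members should be approximately uniformly distributed, which is one common way of expressing reliability. The function name, the ensemble size and the distribution are hypothetical choices, and no climate model is involved.

```python
import random


def rank_histogram(n_trials=20000, n_members=4, seed=0):
    """Count how often the truth takes each rank among the ensemble members."""
    rng = random.Random(seed)
    counts = [0] * (n_members + 1)
    for _ in range(n_trials):
        # Toy perfect model scenario: truth and members share one distribution.
        truth = rng.gauss(0.0, 1.0)
        members = [rng.gauss(0.0, 1.0) for _ in range(n_members)]
        rank = sum(m < truth for m in members)  # number of members below the truth
        counts[rank] += 1
    return counts


counts = rank_histogram()
print(counts)  # each of the 5 ranks should occur roughly 4000 times
```

If the members were instead drawn from a distribution that differed from that of the truth, as with a structurally imperfect model, the histogram would generally depart from uniformity.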
Unless rather trivial measures of performance are of interest, it seems that construction-based arguments are unable to justify predictions of the performance of most climate model predictions, because climate models are imperfect and it is difficult to trace the effects of imperfections through complex systems. This applies equally to predictions of good and bad performance: construction-based arguments do not provide strong justification for predictions of poor performance either, and so we must look to other arguments.
Another way to justify a prediction of performance might be to argue that performance can be extrapolated from information about the past performance of the predictor. This is called a performance-based argument by Parker (2010).
Performance-based arguments proceed by identifying a class of predictions that contains the prediction, p, whose performance is under consideration, such that one has no reason to believe in advance that any particular prediction in the class will perform better than any other prediction in the class. The performances of a sample of predictions in the class are then measured and the performance of p is subsequently inferred following standard statistical procedures.
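The inference step described above can be sketched as follows. We assume, for illustration only, that performance is summarized by a non-negative error score, that the sampled predictions are exchangeable with p, and that a rough interval of two standard errors around the sample mean is an acceptable 'standard statistical procedure'. The scores are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def infer_performance(sample_scores):
    """Estimate expected performance in a reference class from a sample of scores.

    Returns the sample mean and a rough interval of two standard errors
    around it (an assumed, approximate procedure).
    """
    n = len(sample_scores)
    m = mean(sample_scores)
    se = stdev(sample_scores) / sqrt(n)
    return m, (m - 2 * se, m + 2 * se)


# Hypothetical error scores of six past predictions in the reference class.
past_scores = [0.8, 1.1, 0.9, 1.3, 1.0, 0.9]
estimate, interval = infer_performance(past_scores)
print(estimate, interval)
```

The estimate applies to p only insofar as the choice of reference class is justified; the calculation itself says nothing about whether p belongs to the class.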
The key stage in performance-based arguments is to justify the choice of reference class. Membership of the class should be determined by characteristics of the prediction problems, derived from knowledge about both the predictor and the predictand. For example, if future conditions are expected to differ from past conditions only in ways that are unlikely to affect the performance of predictions then past predictions and future predictions may be judged to belong to the same class.
Performance-based arguments are familiar in weather forecasting, where past performance is often taken to be a face-value indicator of future performance. Parker (2010) contends, however, that such arguments fail to justify predictions of the performance of climate predictions, because past cases whose performances can be measured differ significantly from predictions into the future. In particular, the relevance of hindcasts is compromised by the possibility that climate models have been tuned to perform well in the hindcast period, and by the fact that the future forcings (such as concentrations of greenhouse gases) prescribed in climate predictions differ significantly from the forcings prescribed in the hindcasts. The studies by Reifen and Toumi (2009) and Weigel et al. (2010) attest to changes in performance over time. See also Frame et al. (2007).
While performance-based arguments may not provide strong justifications for predictions of the performance of climate predictions, there does seem to be scope for them to provide some justification. Even a small number of past predictions similar to the future predictions under consideration can help to give a rough indication of likely performance, and this may be very informative if the indication differs markedly from any prior expectations of performance.
In this subsection we propose a third type of argument for justifying quantitative predictions of the performance of climate predictions. While there may be few similar prediction problems from which we can infer the performance of climate predictions via the performance-based arguments of the previous subsection, there are many other climate-model experiments providing data that are commonly used to justify judgments about credibility. We might judge that some of these experiments are sufficiently similar to one another to form a reference class, and the statistical properties of the performance of members of the class can then be inferred from a sample of cases, as before. Even if the prediction whose performance is in question is not judged to be a member of this class, we might be willing to judge that it is either a harder or easier prediction problem than those in the reference class. In other words, we expect the performance of the prediction in question to be either worse or better than the performance of randomly selected members of the reference class. In this case, the inferred performance for the members of the reference class provides an upper or lower bound on the performance of the prediction in question.
To be precise, let S denote the performance of a randomly selected prediction from a reference class C and suppose that the value of S must lie on the positive real line with smaller values indicating better performance. Suppose also that we have used a sample of cases from C to estimate the probability distribution of performances in C. Let this distribution be denoted by the cumulative distribution function \(F(s)=\Pr(S\leq s)\) for all s > 0. Now let S′ denote the performance of a randomly selected prediction from a class C′ that contains the prediction under consideration, and let \(F'(s)=\Pr(S'\leq s)\) denote the unknown distribution of performances in C′. If we judge that the prediction problems in C′ are harder than the prediction problems in C then we obtain the bound F′(s) ≤ F(s) for all s > 0, i.e. the chance of the prediction in question achieving a performance at least as good as s is at most F(s). Similarly, if we judge that the prediction problems in C′ are easier than the prediction problems in C then we obtain the bound F′(s) ≥ F(s) for all s > 0, i.e. the chance of the prediction in question achieving a performance at least as good as s is at least F(s). We shall give numerical examples of such bounds in Section 5. Simultaneous upper and lower bounds may be obtained if both a harder reference class and an easier reference class can be identified.
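The bounding argument can be sketched in a few lines of Python. We assume here that F is estimated by the empirical distribution function of the sampled scores, which is one possible estimator among several, and that the harder/easier judgments have already been made on other grounds. All scores and names are hypothetical.

```python
from bisect import bisect_right


def empirical_cdf(sample):
    """Return F with F(s) = proportion of sampled scores less than or equal to s."""
    ordered = sorted(sample)
    n = len(ordered)

    def F(s):
        return bisect_right(ordered, s) / n

    return F


def bounds_on_target(s, easier_ref=None, harder_ref=None):
    """Bounds on F'(s) = Pr(S' <= s) for the prediction in question.

    easier_ref: scores from a class judged easier than the target, so the
                target is harder and F'(s) is at most F_easier(s).
    harder_ref: scores from a class judged harder than the target, so the
                target is easier and F'(s) is at least F_harder(s).
    A missing class leaves the trivial bound (0 or 1) in place.
    """
    lower = empirical_cdf(harder_ref)(s) if harder_ref else 0.0
    upper = empirical_cdf(easier_ref)(s) if easier_ref else 1.0
    return lower, upper


# Hypothetical scores (smaller is better) from two bounding reference classes.
easier_scores = [0.4, 0.7, 0.9, 1.2, 1.5, 2.0, 2.4, 3.1]
harder_scores = [0.9, 1.6, 2.2, 2.8, 3.5, 4.0, 4.4, 5.0]

print(bounds_on_target(1.0, easier_ref=easier_scores))  # (0.0, 0.375)
print(bounds_on_target(1.0, easier_ref=easier_scores,
                       harder_ref=harder_scores))       # (0.125, 0.375)
```

With only the easier class available, the chance of achieving a score of 1.0 or better is bounded above by 0.375; adding the harder class also bounds it below by 0.125. Sampling uncertainty in the estimated distribution functions is ignored in this sketch.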
The key stage in the performance-based argument in the previous subsection is to justify the choice of reference class. Such a justification is also needed here to define the bounding reference class, but now the class need not include the prediction under consideration, and so there is more scope to identify such classes. On the other hand, the bounding argument proposed here also requires us to justify a judgment that the prediction problem of interest is either harder or easier than the prediction problems in the bounding reference class. This is a similar type of judgment to that used to define reference classes as it should be based on similar information, namely characteristics of the prediction problems. However, the judgment that a prediction problem is harder or easier than other problems is a stronger judgment than that required to define a reference class because the direction of departure from the reference class must be specified. Furthermore, some of the characteristics of the prediction problem under consideration may suggest that the problem is easier than those in the reference class, while other characteristics may suggest that the problem is harder. A further complication is the possibility that some levels of performance may be judged to be harder to achieve for the prediction in question than for the predictions in the reference class, while other levels of performance may be judged to be easier to achieve. In such circumstances, the ordering of the probabilities F(s) and F′(s) would be judged to vary with s. Justifying such detailed judgments is likely to be difficult, at least for climate predictions. There is no straightforward solution to these complications but we reiterate that we should not expect perfect predictions of performance; we merely seek to improve current predictions, and believe that the simple harder/easier judgment proposed above has the potential to meet this goal in the case of climate predictions at least.