Suppose you have evaluated a classifier’s performance on an independent testing set. To what extent can you trust your findings? When a flipped coin comes up heads eight times out of ten, any reasonable experimenter will suspect a fluke and expect another set of ten tosses to give a result closer to reality. Similar caution is in order when measuring classification performance. Evaluating classification accuracy on a testing set is not enough; just as important is to develop some notion of how likely the measured value is to be a reliable estimate of the classifier’s true behavior.
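
To make the coin intuition concrete, the following minimal sketch computes how often a fair coin would produce at least eight heads in ten tosses, using the binomial distribution. The function name `prob_at_least` and the accuracy value 0.7 in the second call are assumptions introduced only for illustration; they do not come from the text.

```python
from math import comb


def prob_at_least(k: int, n: int, p: float = 0.5) -> float:
    """Probability of k or more successes in n independent trials,
    each succeeding with probability p (binomial model)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Fair coin: chance of seeing at least eight heads in ten tosses.
p_fluke = prob_at_least(8, 10)
print(f"{p_fluke:.4f}")  # 0.0547

# Hypothetical classifier with true accuracy 0.7 (an assumed value):
# chance that it labels at least 8 of 10 testing examples correctly.
p_lucky = prob_at_least(8, 10, p=0.7)
print(f"{p_lucky:.4f}")
```

Under these assumptions, a fair coin yields eight or more heads in roughly one series of ten tosses out of eighteen. The same reasoning suggests that an accuracy measured on a very small testing set can easily overstate a classifier’s true behavior.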