Table 2. For each classifier and threshold combination (thresholds picked using validation data), we report three numbers: the number of "passing" problems (out of 30), i.e., those where some test instance obtained a probability of at least the threshold τ; the number of "valid" problems, i.e., those passing problems for which the ratio of (true) positive test instances scoring at least τ to all instances scoring at least τ is itself at least τ; and the average recall at threshold τ (averaged over the valid problems only). Note that if we instead average recall over all 30 problems, at τ = 0.99 Append+ obtains 0.06 (i.e., \(0.6 \times \frac{3}{30}\), since Append+ achieves 3 valid problems), while NoisyOR and AVG obtain 0.21 and 0.26 respectively. Both the number of valid problems and the recall are indicative of performance.

From: On using nearly-independent feature families for high precision and confidence

                        Threshold τ
                        ≥ 0.99                   ≥ 0.95
                        (pass, valid, recall)    (pass, valid, recall)
Audio                   (0, 0, 0)                (8, 4, 0.32)
Visual                  (8, 3, 0.653)            (24, 20, 0.56)
Append (early fuse)     (3, 1, 0.826)            (26, 16, 0.50)
Append+ (early fuse)    (7, 3, 0.60)             (23, 20, 0.63)
NoisyOR                 (24, 18, 0.35)           (29, 22, 0.56)
AVG                     (0, 0, 0)                (13, 13, 0.19)
Calibrated AVG          (17, 12, 0.65)           (30, 26, 0.62)
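The pass/valid/recall bookkeeping described in the caption can be sketched as follows. This is a hypothetical helper (the function name, and the per-instance score/label representation, are our assumptions, not the paper's code); it computes the triple reported in each table cell for a single problem:

```python
def pass_valid_recall(scores, labels, tau):
    """Compute the (passing, valid, recall) triple for one problem.

    scores: predicted probabilities for the test instances (hypothetical input)
    labels: 1 for true positive instances, 0 otherwise
    tau:    the confidence threshold from the table header
    """
    # Instances whose score reaches the threshold
    above = [(s, y) for s, y in zip(scores, labels) if s >= tau]
    passing = len(above) > 0  # "passing": some instance scores at least tau
    if not passing:
        return passing, False, 0.0
    # "valid": the precision among instances scoring >= tau is itself >= tau
    precision = sum(y for _, y in above) / len(above)
    valid = precision >= tau
    # Recall at tau: fraction of all positives that score >= tau
    positives = sum(labels)
    recall = sum(y for _, y in above) / positives if positives else 0.0
    return passing, valid, recall
```

Averaging the recall over the valid problems only gives the third number in each cell; averaging over all 30 problems instead (treating non-valid problems as contributing 0) yields the caption's 0.6 × 3/30 = 0.06 figure for Append+ at τ = 0.99.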