Decomposing the Brier score as simple expectations
There are plenty of articles out there on decomposing the Brier score, but they usually use notation that's non-standard for probability and statistics. I give the decomposition in more standard notation, for the simple case where we predict a binary outcome. I use expectations, instead of the empirical means that would be calculated in practice, since this simplifies the notation. Converting them to empirical means is straightforward.
Posterior / conditional decomposition
Suppose we are predicting the value of some variable $Y \in \{0, 1\}$, using some data $X$. The prediction is a function $f(X) \in [0, 1]$, interpreted as the probability we assign to $Y = 1$, and its Brier score is the squared error $(f(X) - Y)^2$.

Once we observe the data $X$, the prediction is fixed, and the expected score conditional on the data is

$$\mathbb{E}\big[(f(X) - Y)^2 \mid X\big] = \operatorname{Var}(Y \mid X) + \big(f(X) - \mathbb{E}[Y \mid X]\big)^2.$$

This works by inserting intermediate expressions for the posterior mean $\mathbb{E}[Y \mid X]$,

$$\mathbb{E}\big[(f(X) - Y)^2 \mid X\big] = \mathbb{E}\Big[\big(f(X) - \mathbb{E}[Y \mid X] + \mathbb{E}[Y \mid X] - Y\big)^2 \,\Big|\, X\Big],$$

expanding the square, and noting that the cross term has zero conditional expectation, since $f(X) - \mathbb{E}[Y \mid X]$ is fixed given $X$ and $\mathbb{E}\big[\mathbb{E}[Y \mid X] - Y \mid X\big] = 0$.
As an alternative approach, we notice that the starting expectation is the expectation of a squared expression, so we can split it into the variance plus the square of the expectation:

$$\mathbb{E}\big[(f(X) - Y)^2 \mid X\big] = \operatorname{Var}\big(f(X) - Y \mid X\big) + \big(\mathbb{E}[f(X) - Y \mid X]\big)^2.$$

Again, this simplifies because $f(X)$ is fixed given $X$: the variance term reduces to $\operatorname{Var}(Y \mid X)$, and the squared expectation to $\big(f(X) - \mathbb{E}[Y \mid X]\big)^2$.

This is the standard variance-bias decomposition for the posterior mean squared error. It attains its minimum, $\operatorname{Var}(Y \mid X)$, when $f(X) = \mathbb{E}[Y \mid X]$.
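To see the identity numerically, here is a minimal Python sketch; the toy model and numbers are made up for illustration, with $X$ taking two values, $Y \mid X$ Bernoulli, and a slightly miscalibrated prediction.

```python
# Toy check of the conditional decomposition: X takes two values, Y | X is
# Bernoulli(p_true[x]), and f[x] is an arbitrary (possibly miscalibrated)
# predicted probability. All numbers are illustrative only.
p_true = {0: 0.2, 1: 0.7}   # E[Y | X = x]
f = {0: 0.3, 1: 0.6}        # the prediction f(x)

for x, p in p_true.items():
    # E[(f(X) - Y)^2 | X = x], averaging over the two outcomes of Y
    expected_score = (1 - p) * (f[x] - 0) ** 2 + p * (f[x] - 1) ** 2
    # Var(Y | X = x) + (f(x) - E[Y | X = x])^2
    decomposed = p * (1 - p) + (f[x] - p) ** 2
    assert abs(expected_score - decomposed) < 1e-12
```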
Prior / unconditional decomposition
If we also take the expectation over the data $X$, we get the prior (unconditional) expected score, $\mathbb{E}\big[(f(X) - Y)^2\big]$, which is what the empirical Brier score estimates.
We can obtain a clearer decomposition by taking the expectation of the posterior decomposition, using the tower property and the law of total variance:

$$
\begin{aligned}
\mathbb{E}\big[(f(X) - Y)^2\big]
&= \mathbb{E}\big[\operatorname{Var}(Y \mid X)\big] + \mathbb{E}\Big[\big(f(X) - \mathbb{E}[Y \mid X]\big)^2\Big] \\
&= \underbrace{\operatorname{Var}(Y)}_{\text{uncertainty}}
 - \underbrace{\operatorname{Var}\big(\mathbb{E}[Y \mid X]\big)}_{\text{resolution}}
 + \underbrace{\mathbb{E}\Big[\big(f(X) - \mathbb{E}[Y \mid X]\big)^2\Big]}_{\text{calibration}},
\end{aligned}
$$

where the second line uses $\operatorname{Var}(Y) = \mathbb{E}[\operatorname{Var}(Y \mid X)] + \operatorname{Var}(\mathbb{E}[Y \mid X])$.
These decomposition terms are all non-negative, and have intuitive meanings.
- Uncertainty, $\operatorname{Var}(Y)$, is the inherent variability of the value of $Y$. Even if your predictions were perfect, with perfect information, this term would still remain. Since $Y$ is a Bernoulli variable, the uncertainty $\operatorname{Var}(Y) = \mathbb{E}[Y]\,(1 - \mathbb{E}[Y])$ is largest when $\mathbb{E}[Y] = 1/2$. You want the uncertainty to be low, but it's usually outside of your control.
- Resolution, $\operatorname{Var}\big(\mathbb{E}[Y \mid X]\big)$, measures how much the posterior mean $\mathbb{E}[Y \mid X]$ varies over different values of the data $X$. The more the posterior mean varies, the more information the data is providing. At zero resolution, the posterior mean does not vary, but is always equal to the prior mean $\mathbb{E}[Y]$: the data provides no information. Resolution can therefore be thought of as how potentially helpful a prediction can be, since it reflects the quality of the data used to make it. You want this to be large, and it is sometimes within your control, depending on whether you're in control of the data collection process.
- Calibration, or reliability, $\mathbb{E}\big[(f(X) - \mathbb{E}[Y \mid X])^2\big]$, measures the deviation of the prediction function $f(X)$ from the optimal prediction $\mathbb{E}[Y \mid X]$. You want the calibration to be low, which is confusingly the opposite of resolution, the other positively-phrased term. Whoops! It should really have been called miscalibration. This is also within your control: resolution depends on the quality of the data, calibration depends on the quality of the prediction made with that data.
- Refinement is the difference between the uncertainty and the resolution, $\operatorname{Var}(Y) - \operatorname{Var}\big(\mathbb{E}[Y \mid X]\big) = \mathbb{E}\big[\operatorname{Var}(Y \mid X)\big]$. It's the expected score for a perfectly-calibrated prediction. At zero refinement, the data gives enough information to predict $Y = 1$ or $Y = 0$ with probability one; this is generally not possible, outside of degenerate cases such as knowing the outcome in advance, i.e. $X = Y$. Refinement is another positively-phrased term that you actually want to be small.
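As the introduction notes, converting the expectations to empirical means is straightforward. Here is a sketch of that in Python, with a made-up three-valued $X$ and a hypothetical, deliberately miscalibrated predictor; with empirical means and ddof-0 variances the decomposition holds exactly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup (not from the post): X takes three values, Y | X is
# Bernoulli(p_true[X]), and the prediction f(X) is deliberately miscalibrated.
p_true = np.array([0.1, 0.5, 0.8])
f_pred = np.array([0.2, 0.5, 0.7])

n = 100_000
x = rng.integers(0, 3, size=n)
y = (rng.random(n) < p_true[x]).astype(float)
f = f_pred[x]

# Empirical posterior mean E[Y | X], one value per observation
post_mean = np.array([y[x == k].mean() for k in range(3)])[x]

brier       = np.mean((f - y) ** 2)
uncertainty = np.var(y)                      # Var(Y)
resolution  = np.var(post_mean)              # Var(E[Y | X])
calibration = np.mean((f - post_mean) ** 2)  # E[(f(X) - E[Y | X])^2]
refinement  = uncertainty - resolution       # E[Var(Y | X)]

# The empirical decomposition holds exactly, not just approximately
assert np.isclose(brier, uncertainty - resolution + calibration)
```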
Note that the resolution and the reliability conveniently split two different modelling decisions: the resolution reflects the quality of the data used, and the reliability reflects the quality of the prediction made with that data.
A prediction with low resolution and low calibration is like a mathematician’s answer: technically correct, but unhelpful. A prediction with high resolution and high calibration is like a bad pundit’s answer: it takes account of lots of information, but in such a way that any prediction is likely to be far off the mark.
We can also see why we shouldn’t use Brier scores to compare predictions on different variables: since the uncertainty term depends only on what is being predicted, rather than the performance of the prediction, a predictor for a variable with lower uncertainty has the advantage of a smaller uncertainty term. This is true even if all the predictors being compared are perfectly calibrated, and use the same data.
As a simple example, consider a predictor of whether a single flip of a fair coin lands heads, and a predictor of whether two flips of a fair coin both land heads. Both predictors are perfectly calibrated, but the former predicts a variable with a larger uncertainty, so it has a larger expected score.
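Concretely, if neither predictor uses any data, the calibrated predictions are the constant base rates $1/2$ and $1/4$, and the expected scores are just the uncertainties: $\tfrac{1}{2}\cdot\tfrac{1}{2} = 0.25$ for the single flip versus $\tfrac{1}{4}\cdot\tfrac{3}{4} = 0.1875$ for the pair of flips.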
The effect of binning
This section follows Stephenson et al. 2008, with changed notation.
One issue with empirically calculating the resolution and calibration is that, unless the predictions come from a small number of possible values, they are difficult to estimate well in practice, since you're estimating the posterior mean $\mathbb{E}[Y \mid X = x]$ separately for each value $x$ appearing in the data.
Binning addresses this: we effectively choose a summary function $g$, and condition on the bin $B = g(X)$ rather than on the data $X$ itself, so that each conditional mean is estimated from all of the observations falling into the same bin.
We begin as above for the prior decomposition, but we use the expectation conditional on $B$ in place of the expectation conditional on $X$. Since $f(X)$ is no longer fixed given $B$, two extra within-bin terms appear:

$$
\begin{aligned}
\mathbb{E}\big[(f(X) - Y)^2\big] ={}& \underbrace{\operatorname{Var}(Y)}_{\text{uncertainty}}
 - \underbrace{\operatorname{Var}\big(\mathbb{E}[Y \mid B]\big)}_{\text{resolution}}
 + \underbrace{\mathbb{E}\Big[\big(\mathbb{E}[f(X) \mid B] - \mathbb{E}[Y \mid B]\big)^2\Big]}_{\text{calibration}} \\
 &+ \underbrace{\mathbb{E}\big[\operatorname{Var}(f(X) \mid B)\big]}_{\text{within-bin variance}}
 - \underbrace{2\,\mathbb{E}\big[\operatorname{Cov}(f(X), Y \mid B)\big]}_{\text{within-bin covariance}}.
\end{aligned}
$$
Special cases:
- If the prediction $f(X)$ only depends on $X$ through $g(X)$ as a summary statistic, then $f(X)$ is constant conditional on $B$, and the within-bin terms are equal to zero. In that case $\mathbb{E}[f(X) \mid B] = f(X)$, so the expressions $\mathbb{E}\big[(\mathbb{E}[f(X) \mid B] - \mathbb{E}[Y \mid B])^2\big]$ and $\mathbb{E}\big[(f(X) - \mathbb{E}[Y \mid B])^2\big]$ are equal.
- If $g$ is the identity function, every possible value of the data (and hence every possible prediction value) gets its own bin. The within-bin terms are equal to zero, and we recover the unconditional decomposition from before.
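Continuing the empirical sketch from before (again with made-up probabilities, and a hypothetical binning $g$ that is strictly coarser than the data, so that predictions vary within a bin), we can check that the two within-bin terms restore exact equality:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative setup (not from the post): X takes four values, but the summary
# function g only distinguishes two bins.
p_true = np.array([0.1, 0.3, 0.6, 0.9])
f_pred = np.array([0.15, 0.35, 0.55, 0.85])
g = np.array([0, 0, 1, 1])                 # bin B = g(X)

n = 200_000
x = rng.integers(0, 4, size=n)
y = (rng.random(n) < p_true[x]).astype(float)
f = f_pred[x]
b = g[x]

def per_sample_bin_mean(values, bins, n_bins=2):
    """Conditional mean of `values` given the bin, broadcast back to each sample."""
    return np.array([values[bins == j].mean() for j in range(n_bins)])[bins]

y_bar = per_sample_bin_mean(y, b)   # E[Y | B]
f_bar = per_sample_bin_mean(f, b)   # E[f(X) | B]

brier         = np.mean((f - y) ** 2)
uncertainty   = np.var(y)
resolution    = np.var(y_bar)                            # binned resolution
calibration   = np.mean((f_bar - y_bar) ** 2)            # binned calibration
wb_variance   = np.mean((f - f_bar) ** 2)                # E[Var(f(X) | B)]
wb_covariance = 2 * np.mean((f - f_bar) * (y - y_bar))   # 2 E[Cov(f(X), Y | B)]

# The two within-bin terms restore exact equality
assert np.isclose(
    brier, uncertainty - resolution + calibration + wb_variance - wb_covariance
)
```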
References:
- Bröcker, J. (2009). Reliability, sufficiency, and the decomposition of proper scores. Q. J. R. Meteorol. Soc., 135, 1512–1519. doi:10.1002/qj.456
- Stephenson, D. B., Coelho, C. A., and Jolliffe, I. T. (2008). Two extra components in the Brier score decomposition. Wea. Forecasting, 23, 752–757. doi:10.1175/2007WAF2006116.1