January 4, 2022

In search of a better way to score human predictors

If you track your predictions using Metaculus, you have three tools for comparing yourself to other players: the track record (Brier score or log score), calibration, and Metaculus point total. Unfortunately, none of these is a satisfying indicator of prediction quality.

Ideally, I’d like to have a proper scoring rule that gives you the same reward in expectation regardless of how likely you believe the event is. That way you won’t be incentivized to avoid predictions close to 50%. Does such a scoring rule exist?
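To make the incentive concrete, here is a small sketch (mine, not from Metaculus) using the Brier score, a standard proper scoring rule. An honest forecaster's expected Brier score works out to \(p^*(1-p^*)\), which is worst at \(p^* = 0.5\), so coin-flip questions drag down your track record even when you predict them perfectly:

```python
# Sketch: under the Brier score (lower is better), the honest forecaster's
# expected score depends on the true probability p*, so questions near 50%
# look bad no matter how well you forecast them.

def expected_brier(p_star, p):
    """Expected Brier score for predicting p when the event
    occurs with probability p_star."""
    return p_star * (1 - p) ** 2 + (1 - p_star) * p ** 2

# Honest predictions (p = p_star) on an easy vs. a hard question:
print(expected_brier(0.95, 0.95))  # 0.0475 -- near-certain question
print(expected_brier(0.50, 0.50))  # 0.25   -- coin-flip question
```

Even with a perfectly calibrated forecast, the coin-flip question scores more than five times worse in expectation.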

Let \(S \colon [0,1] \to \mathbb{R}\) be a strictly increasing differentiable function. If we assign an event a probability of \(p\), and that event actually happens, we will give ourselves a score of \(S(p)\). And if the event doesn’t happen, we will give ourselves a score of \(S(1-p)\). If the event in question actually has probability \(p^*\), then our expected score is \[ p^*S(p) + (1-p^*)S(1-p). \]

For this to be a proper scoring rule, we need our expected score to be maximized at \(p = p^*\), regardless of the value of \(p^*\). So if \(p^* \in (0,1)\), the derivative of the above expression with respect to \(p\) should vanish at that point. This means \[ p^*S'(p^*) - (1-p^*)S'(1-p^*) = 0. \] Since this equation must hold for any \(p^* \in (0,1)\), we might as well write \(p\) instead of \(p^*\). Rearranging gives \[ S'(1-p) = \frac{p}{1-p}S'(p). \]

We also want our expected score to be constant as a function of \(p^*\). And since this is a proper scoring rule, we only care about our expected score when we choose \(p = p^*\). In other words, we need \[ p^*S(p^*) + (1-p^*)S(1-p^*) = C. \] Again, since this must be true for all \(p^* \in [0,1]\), we’ll just use \(p\) instead. Differentiating this equation gives us \[ S(p) + pS'(p) - S(1-p) - (1-p)S'(1-p) = 0. \] But using the equation for \(S'(1-p)\) from earlier, we can derive \[ \begin{align*} S(p) + pS'(p) - S(1-p) - (1-p)\cdot\frac{p}{1-p}S'(p) &= 0 \\ S(p) + pS'(p) - S(1-p) - pS'(p) &= 0 \\ S(p) - S(1-p) &= 0 \\ S(p) &= S(1-p). \end{align*} \]

This contradicts the assumption that \(S\) is strictly increasing. We could drop that assumption, but then we could be punished for assigning a higher probability to an event that actually happens, which seems obviously wrong to me. Now, you might argue that if you predicted something with a probability of, say, 100%, and your prediction comes true, then you should be punished for being ridiculous.
But imagine someone who made only 0% and 100% predictions, and not just on trivial questions, and every one of their predictions came true. Such a person would clearly deserve a high score: it’s not ridiculous if you can do it consistently.

Anyway, if my math is right, it looks like there is no proper scoring rule with the properties I want for human predictors. I can think of two possible ways around this:

  1. Have the scoring rule take into account your predictions on other events. I’m not sure how this would work.

  2. Come up with a rule that incentivizes predictions that are closer to 50% instead. I think these kinds of predictions are harder to make anyway, so I would prefer this to what we get with the Brier score.