January 4, 2022

# In search of a better way to score human predictors

If you track your predictions using Metaculus, you have three tools for comparing yourself to other players: the track record (Brier score or log score), calibration, and Metaculus point total. Unfortunately, none of these is a satisfying indicator of prediction quality.

• The track record (and in particular the Brier score, which you want to minimize) encourages you to predict only events you think are very likely or very unlikely. This is because your expected Brier score rises (that is, gets worse) as the probability of the event approaches 50%. Roman Cheplyaka has also noticed this. The log score has the same problem.

• Poor calibration means you’re doing badly, but good calibration doesn’t mean you’re doing well. Zvi has explained this.

• A person who makes a lot of predictions will tend to have a lot of Metaculus points, regardless of how good a predictor they are. Metaculus rewards participation, which is fine, but it makes for a crappy scoring rule.
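To make the first point concrete, here is a minimal sketch of the expected score of a perfectly honest, perfectly informed forecaster as a function of the event's probability. The function names are my own, and I'm assuming the log score is written as a loss (negated, natural log) so that lower is better for both rules; Metaculus may use a different sign or scaling.

```python
import math

def expected_brier(p):
    """Expected Brier score when the event has probability p and you forecast p.

    With probability p the event happens and you score (1 - p)^2;
    with probability 1 - p it doesn't and you score p^2.
    Lower is better.
    """
    return p * (1 - p) ** 2 + (1 - p) * p ** 2  # simplifies to p * (1 - p)

def expected_log_loss(p):
    """Expected negated log score under the same assumptions (lower is better).

    Assumed convention: natural log, sign flipped so it behaves like a loss.
    """
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))

# Both losses are worst at p = 0.5 and improve as p moves toward 0 or 1.
for p in (0.5, 0.7, 0.9, 0.99):
    print(f"p={p:.2f}  Brier={expected_brier(p):.4f}  log loss={expected_log_loss(p):.4f}")
```

So even a forecaster who knows the true probability exactly ends up with a worse track record the closer their questions sit to 50%.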

Ideally, I’d like to have a proper scoring rule that gives you the same reward in expectation regardless of how likely you believe the event is. That way you won’t be incentivized to avoid predictions close to 50%. Does such a scoring rule exist?

Let $$S \colon [0,1] \to \mathbb{R}$$ be a strictly increasing differentiable function. If we assign an event a probability of $$p$$, and that event actually happens, we will give ourselves a score of $$S(p)$$. And if the event doesn’t happen, we will give ourselves a score of $$S(1-p)$$. If the event in question actually has probability $$p^*$$, then our expected score is

$p^*S(p) + (1-p^*)S(1-p).$

For this to be a proper scoring rule, we need our expected score to be maximized at $$p = p^*$$, regardless of the value of $$p^*$$. So if $$p^* \in (0,1)$$, the derivative of the above expression should vanish at that point. This means

$p^*S'(p^*) - (1-p^*)S'(1-p^*) = 0.$

Since this equation must be true for any $$p^* \in (0,1)$$, we might as well write $$p$$ instead of $$p^*$$. We can rewrite the equation to get

$S'(1-p) = \frac{p}{1-p}S'(p).$

We also want our expected score to be a constant as a function of $$p^*$$. And since this is a proper scoring rule, we only care about our expected score when we choose $$p = p^*$$. In other words, we need

$p^*S(p^*) + (1-p^*)S(1-p^*) = C.$

Again, since this must be true for all $$p^* \in [0,1]$$, we’ll just use $$p$$ instead. Differentiating this equation gives us

$S(p) + pS'(p) - S(1-p) - (1-p)S'(1-p) = 0.$

But using the equation for $$S'(1-p)$$ from earlier, we can derive

\begin{align*}
S(p) + pS'(p) - S(1-p) - (1-p)\cdot\frac{p}{1-p}S'(p) &= 0 \\
S(p) + pS'(p) - S(1-p) - pS'(p) &= 0 \\
S(p) - S(1-p) &= 0 \\
S(p) &= S(1-p).
\end{align*}

This contradicts the assumption that $$S$$ is strictly increasing. We could drop that assumption. But if we did, then we could be punished for assigning a higher probability to an event that actually happens. This seems obviously wrong to me.

Now, you might argue that if you predicted something with a probability of, say, 100%, and your prediction comes true, then you should be punished for being ridiculous. But imagine there were someone who only made 0% and 100% predictions, and not just on trivial things, and every one of their predictions came true. Such a person would clearly deserve a high score, since it’s not ridiculous if you can do it consistently.
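As a sanity check on the derivation above, here is a rough numerical sketch that tests the two requirements separately for the usual rules. The helper names are hypothetical, and I'm assuming both rules are written as rewards to be maximized, to match $$S$$: the Brier rule as $$S(p) = -(1-p)^2$$ and the log rule as $$S(p) = \ln p$$. The grid search is only an approximation of the true maximizer.

```python
import math

def brier_reward(p):
    """Assumed sign convention: negated Brier score, so higher is better."""
    return -(1 - p) ** 2

def log_reward(p):
    """Log score with natural log; higher is better."""
    return math.log(p)

def expected_score(S, p, p_true):
    """Expected score for reporting p when the event has probability p_true."""
    return p_true * S(p) + (1 - p_true) * S(1 - p)

def best_report(S, p_true, steps=999):
    """Grid-search the report p in (0, 1) that maximizes the expected score."""
    grid = [i / (steps + 1) for i in range(1, steps + 1)]
    return max(grid, key=lambda p: expected_score(S, p, p_true))

for name, S in (("Brier", brier_reward), ("log", log_reward)):
    for p_true in (0.5, 0.7, 0.9):
        p_hat = best_report(S, p_true)               # properness: should match p_true
        honest = expected_score(S, p_true, p_true)   # constancy: should not depend on p_true
        print(f"{name:5s} p*={p_true:.1f}  best report={p_hat:.3f}  honest expected score={honest:.4f}")
```

Both rules pass the first test (the best report is the true probability) and fail the second (the honest expected score changes with $$p^*$$), which is what the argument says must happen for any strictly increasing $$S$$.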

Anyway, if my math is right, it looks like there aren’t any good scoring rules for human predictors. I can think of two possible ways around this:

1. Have the scoring rule take into account your predictions on other events. I’m not sure how this would work.

2. Come up with a rule that incentivizes predictions that are closer to 50% instead. I think these kinds of predictions are harder to make anyway, so I would prefer this to what we get with the Brier score.