January 18, 2022

Followup on scoring human predictors

There’s nothing new under the sun

It turns out everything I said in my last article was already stated in greater generality in this easy-to-find 2007 article. Oh well.

That same article also answers my question from last time—whether you can actually reward people for predicting tricky questions—in the negative. Any proper scoring rule you can think of will have convex expectation. That’s too bad.

Since finding that article, I’ve decided that it’s probably a bad idea to grade human predictors using just one score. At the very least, I think a calibration plot combined with a histogram would help discourage people from choosing “easy” questions to predict on.

On the other hand, sometimes just one score is what you really need. And in that case, I find the “skill scores” mentioned in the linked article quite attractive. They’re not proper—not in any interesting cases, anyway. But they’re asymptotically proper, and that sounds good enough for me. (Actually, since Metaculus only allows predictions with limited precision, they don’t even need asymptotic propriety. Some kind of “bounded propriety” would do.) So let’s come up with a skill score for Metaculus.

A Metaculus skill score

First of all, let \(\ell\) denote the log score. If you give an event \(X\) a probability prediction \(p\), then your log score is \[ \ell(X, p) = X\log p + (1-X)\log(1-p). \] (Note that this is the negative of the usual log loss, so higher is better.) Now suppose you’ve already predicted the events \(X_1, \ldots, X_n\), which you assigned the probabilities \(p_1, \ldots, p_n\). Meanwhile, suppose the Metaculus community assigned those same events probabilities \(m_1, \ldots, m_n\). Then the community’s and your total log scores are \[ \begin{aligned} M &= \sum_{i=1}^n \ell(X_i, m_i) \\ Y &= \sum_{i=1}^n \ell(X_i, p_i) \end{aligned} \] and your score \(S\) is decided by the following formula: \[ S = \frac{Y+n\log2}{M+n\log2}. \] Why the \(n\log2\)? Because guessing 50% for everything gives a total log score of exactly \(-n\log2\), which the \(n\log2\) cancels out. So if you do that, your score will just be 0. On the other hand, if you copy the community then your score will be 1. This makes it easy to tell whether someone is a terrible predictor (near zero) or a great predictor (over 1).
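As a sanity check, here’s a minimal sketch of the score in code. The function names and the use of NumPy are my choices, nothing official:

```python
import numpy as np

def log_score(x, p):
    # ell(X, p) = X log p + (1 - X) log(1 - p); higher is better,
    # since this is the negative of the conventional log loss.
    return x * np.log(p) + (1 - x) * np.log(1 - p)

def skill_score(outcomes, yours, community):
    # S = (Y + n log 2) / (M + n log 2), where Y and M are the
    # total log scores of your predictions and the community's.
    x = np.asarray(outcomes, dtype=float)
    n = len(x)
    Y = log_score(x, np.asarray(yours)).sum()
    M = log_score(x, np.asarray(community)).sum()
    return (Y + n * np.log(2)) / (M + n * np.log(2))
```

Predicting 50% everywhere gives \(Y = -n\log 2\) and hence a score of 0, and copying the community gives exactly 1, matching the two anchor points above.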

For the record, my score is \(0.96\), based on final predictions. At this point it’s natural to ask how we can average these scores over time, the way Metaculus does. I think it can be done, but at the moment I’m not sure of the best way to do it. Also, I’m not going to include a proof that this score is asymptotically proper, for three reasons:

  1. It’s similar to the proof given here.
  2. It’s easy and boring.
  3. I think you need to assume nobody makes catastrophic predictions. The log loss is harsh near 0 and 1, and the proof may not go through without a bound on how extreme people’s predictions can be. I don’t feel like dealing with this, and it’s not an issue on Metaculus anyway.

Another possibility

I don’t think the skill score actually solves the problem I mentioned in my last post—having an incentive to choose easy questions over hard ones. I was screwing around in Desmos, and it looks like the skill score rewards you for choosing questions where you strongly disagree with the Metaculus prediction. So if the Metaculus prediction tends to be close to 50%, the only way to signal strong disagreement is by making extreme predictions. Moreover, if you agree with the Metaculus prediction, then the skill score rewards you more for questions where you both have extreme predictions.

So here’s another idea I haven’t tested yet: score someone by the mutual information between their predictions and the events they predict. This would punish people who always predict things that are obviously going to happen, since that sequence of events will have very low entropy. On the other hand, it wouldn’t catch someone who predicts a mix of obvious-yes and obvious-no events. But the chain rule lets you break the mutual information down. Writing \(D\) for the indicator variable \([p > 50\%]\), \[ I(X; p) = I(X; D) + I(X; p \:\vert\: D) \] The first term tells you how informative the direction of someone’s prediction is, and the second term tells you how informative the extremity of it is.
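I haven’t tried this on real data, but here’s roughly what a plug-in estimate of the direction term would look like, treating predictions as discrete (which they effectively are on Metaculus). Everything here, including the toy data, is my own sketch:

```python
import math
from collections import Counter

def plug_in_mi(xs, ys):
    # Plug-in estimate of I(X; Y) in bits from paired samples,
    # treating both sequences as discrete.
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

# Direction term of the decomposition: I(X; D) with D = [p > 50%].
outcomes = [1, 1, 0, 0, 1, 0]               # toy data
preds = [0.9, 0.99, 0.1, 0.01, 0.6, 0.45]   # toy data
direction_info = plug_in_mi(outcomes, [p > 0.5 for p in preds])
```

In this toy example the direction of every prediction matches the outcome, so the direction term comes out to the full entropy of the outcome sequence, 1 bit.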

I worry this will be impossible to estimate, or I’ve messed up the interpretation somehow. I’d like to test this idea, but this post is already pretty long, so that’ll have to come later.

Getting some data

With all of this in mind, we’ll need some more data to see if any of these ideas are worth pursuing further. Because clearly I’m not making enough predictions on my own.

Jgalt is the most active user on Metaculus. Who is Jgalt? I don’t know. But that’s not important. The important thing is he has over 1000 predictions on questions that have resolved. And one can access those predictions, for a price.

Was it worth it? Let’s find out.

Of course, Jgalt’s track record isn’t secret; he keeps a recent-ish screenshot of it as his Twitter banner. But I want the kind of detail a picture can’t provide.

After “purchasing” his track record, I can get the raw data by downloading his profile page. In the <head> there’s a <script> defining the variable window.metacData.trackRecord. There lies the data, all on one very long line: an array of objects, one per question, each containing a list of Jgalt’s predictions along with the community prediction quartiles at various points in time. This means we can compare Jgalt’s predictions with the community’s.
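For the curious, pulling the data out is a small regex-plus-JSON job. This is a sketch under my assumptions about the markup (a one-line assignment ending in a semicolon), not a description of Metaculus’s actual HTML:

```python
import json
import re

def extract_track_record(html):
    # Grab the JSON array assigned to window.metacData.trackRecord.
    # Assumes the assignment sits on one line and that no string
    # inside the array happens to contain a literal "];".
    m = re.search(r'window\.metacData\.trackRecord\s*=\s*(\[.*?\]);', html)
    if m is None:
        raise ValueError("trackRecord assignment not found")
    return json.loads(m.group(1))
```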

The annoying thing about these data is that each question has multiple predictions. It would be cool to make an animated graph showing the evolution of Jgalt’s and the community’s predictions over time. But I don’t feel like doing that, so instead let’s look at initial predictions.

Jgalt really likes making 1% and 99% predictions. Is this what everyone else does? Have I been doing it wrong this whole time?

And now let’s look at final predictions.

This just raises further questions.

Sometimes Jgalt only makes a single prediction, so these two plots share some data. It looks to me like Jgalt chooses all sorts of questions to make predictions on, but his predictions are nonetheless really extreme. (I’m not showing his calibration plot here, but his 1% predictions tend to fare much better than his 99% ones. In fact, he’s decently well-calibrated in the bottom 20%.)

Anyway, Jgalt gets a skill score of 0.45. (Again, based on final predictions.) I’m guessing this is because of all the 99% predictions that didn’t pan out. I’m also guessing Jgalt cares a lot more about the Brier score than the log score, since a failed 99% prediction doesn’t hurt the Brier score so much. In fact, Jgalt has a better Brier score than Metaculus itself.
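The asymmetry is easy to see with a single number: a failed 99% prediction costs at most 1 under the Brier score, but blows up under the log loss:

```python
import math

p = 0.99  # predicted probability for an event that didn't happen
brier = (p - 0) ** 2          # 0.9801 -- bounded above by 1
log_loss = -math.log(1 - p)   # about 4.6 -- grows without bound as p -> 1
```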

In the future I’d like to test my mutual information idea on Jgalt’s predictions. But I also want to get other people’s track records, since Jgalt strikes me as kind of a weirdo.