January 10, 2023
This is about that paper I mentioned in the previous article. It describes some hypothesis tests for “Bayesianess”—whether someone is using Bayes’ rule to update their beliefs. I found the idea ironic, but some people at Metaculus thought this could be useful. Now that I’ve had more time to read the paper, I think it’s actually a terrible idea and not useful at all.
The first hint that something is wrong comes from this statement:
A Bayesian can only move from 5% to 100% with 5% probability.
This is only true from that Bayesian’s perspective. But people can disagree about probabilities. If someone says “The chance of a regime change in Russia this year is 50%,” but I think the chance is 20%, then from their perspective they have a 50% chance of moving to 100%, while from my perspective they have a 20% chance.
The main issue with the paper, which this statement hints at, is that the tests all depend on two very strong assumptions:

1. The model the person uses to interpret the evidence (the likelihood) is correct.
2. The person's prior is correct.
The authors acknowledge the latter assumption, but they try to spin it as a good thing, saying it “might be a positive feature insofar as researchers are more interested in testing flawed reasoning rather than bad priors.” But this is ridiculous. If we’re assuming our model and priors are good, then surely “flawed reasoning” isn’t a concern, right?
To see how fragile the authors’ results are, I came up with a simple experiment. Suppose we’re trying to predict a binary event whose probability, which is unknown to us, is \(p\). Our evidence consists of several independent draws of a \(\mathrm{Binomial}(n, p)\) variable. But we don’t know what \(n\) is. Let’s say \(n\) is actually 10, but we think it’s 11. This means we will tend to underestimate \(p\). The following is a plot of how our beliefs might evolve over time, using different values of \(p\), starting with a uniform prior.
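For the curious, here is a minimal sketch of the simulation behind the plot (the helper name `belief_stream` and the particular values of \(p\) are mine). With a uniform \(\mathrm{Beta}(1, 1)\) prior and a binomial likelihood, the posterior stays a Beta distribution, so the belief stream is just the running posterior mean, computed with the wrong \(n\):

```python
import numpy as np

rng = np.random.default_rng(0)

TRUE_N = 10     # the actual binomial n behind each draw
ASSUMED_N = 11  # what we mistakenly believe n to be

def belief_stream(p, n_draws=20, a=1.0, b=1.0):
    """Posterior mean of p after each draw, under a Beta(a, b) prior and an
    assumed Binomial(ASSUMED_N, p) likelihood, while the data actually come
    from Binomial(TRUE_N, p)."""
    means = []
    for _ in range(n_draws):
        x = rng.binomial(TRUE_N, p)
        a += x               # conjugate Beta update ...
        b += ASSUMED_N - x   # ... but with the wrong number of trials
        means.append(a / (a + b))
    return np.array(means)

for p in (0.2, 0.5, 0.8):
    print(f"p = {p}: final belief = {belief_stream(p)[-1]:.3f}")
```

Because each draw has mean \(10p\) but the update divides by an effective 11 trials, the posterior mean settles near \(10p/11\) rather than \(p\), which is where the underestimation comes from.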
The tendency to underestimate is obvious. It causes problems for the authors' test; specifically, I'm referring to the test described on page 16, which uses the statistic \[ Z = \frac{\sqrt{n}}{s_{t,t+1}}(\overline{m}_{t,t+1} - \overline{r}_{t,t+1}) \rightsquigarrow N(0, 1). \]
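The notation takes some unpacking. As I read the paper (this is my paraphrase, so treat it as an assumption rather than the exact definitions), \(m_{t,t+1}\) measures how much beliefs move between periods \(t\) and \(t+1\), \(r_{t,t+1}\) measures how much uncertainty is reduced over the same step, the bars denote averages over the observed belief streams, \(s_{t,t+1}\) is the corresponding sample standard deviation, and \(n\) is the number of observations going into the averages. Roughly, \[ m_{t,t+1} = (\pi_{t+1} - \pi_t)^2, \qquad r_{t,t+1} = \pi_t(1 - \pi_t) - \pi_{t+1}(1 - \pi_{t+1}), \] where \(\pi_t\) is the reported belief at time \(t\). For a Bayesian, beliefs form a martingale, so these two quantities have the same expectation, which is why the standardized difference of their averages should look like a standard normal.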
There are two ways this statistic can go bad. The first is by having a bad model, and the second is by having a bad prior.
I ran two experiments: for the first I used the misspecified \(\mathrm{Binomial}(11, p)\) model as before, with a uniform prior; for the second I started with a \(\mathrm{Beta}(2, 1)\) prior but used the correct \(\mathrm{Binomial}(10, p)\) model. I drew values of \(p\) uniformly and used 50 belief streams to calculate each \(Z\). In neither experiment did the \(Z\) statistic have the desired distribution. I was surprised to see that the bad prior has a much stronger effect than the bad model. But this may depend on the amount of evidence. (I used 20 draws.)
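Here is roughly how I set this up. The \(Z\) computation follows my paraphrase of \(m\) and \(r\) above rather than the paper's exact estimator, I'm assuming the 20 draws are per stream and that a fresh \(p\) is drawn for each stream, and the number of repetitions is just whatever made the histograms smooth, so treat this as a sketch rather than a faithful replication:

```python
import numpy as np

rng = np.random.default_rng(1)

TRUE_N = 10
T = 20          # draws per belief stream
N_STREAMS = 50  # belief streams per Z value
N_REPS = 500    # how many Z values to simulate per experiment (my choice)

def belief_stream(p, assumed_n, a, b):
    """Beliefs pi_0, ..., pi_T: posterior means of p under a Beta(a, b) prior
    and an assumed Binomial(assumed_n, p) likelihood; the data actually come
    from Binomial(TRUE_N, p)."""
    pis = [a / (a + b)]
    for _ in range(T):
        x = rng.binomial(TRUE_N, p)
        a += x
        b += assumed_n - x
        pis.append(a / (a + b))
    return pis

def z_statistic(streams):
    """Standardized difference between average movement and average
    uncertainty reduction, pooled over all streams and periods
    (my reading of the paper's statistic, not its exact estimator)."""
    pis = np.asarray(streams)
    movement = np.diff(pis, axis=1) ** 2
    u = pis * (1 - pis)
    reduction = u[:, :-1] - u[:, 1:]
    d = (movement - reduction).ravel()
    return np.sqrt(d.size) * d.mean() / d.std(ddof=1)

def experiment(assumed_n, a, b):
    zs = []
    for _ in range(N_REPS):
        ps = rng.uniform(size=N_STREAMS)  # a fresh p for each stream (my guess)
        zs.append(z_statistic([belief_stream(p, assumed_n, a, b) for p in ps]))
    return np.array(zs)

# Experiment 1: bad model (assumed n = 11), uniform prior.
z_bad_model = experiment(assumed_n=11, a=1.0, b=1.0)
# Experiment 2: correct model (n = 10), bad Beta(2, 1) prior.
z_bad_prior = experiment(assumed_n=TRUE_N, a=2.0, b=1.0)

for name, zs in (("bad model", z_bad_model), ("bad prior", z_bad_prior)):
    print(f"{name}: mean Z = {zs.mean():+.2f}, sd of Z = {zs.std():.2f}")
```

Comparing the two arrays of \(Z\) values against a standard normal is then straightforward.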
The authors did an experiment similar to my second one and got a similar result (but no pretty pictures). Like I said, they acknowledge but downplay the problem of priors. On the other hand, I found no mention of the model issue in their paper. But it’s a really long paper, so maybe I missed it.
So what are we to make of this paper? Certainly it won’t work for what the Metaculus people wanted to do—detect cognitive biases in Metaculus users. It might work as a test of the Metaculus prediction itself, which is supposed to be super accurate. But it’s not even clear what it would be a test of. A useful test will have to make much weaker assumptions. I don’t know if such a test exists or if it’s even possible to make one.