February 15, 2022
In order to answer this question on Metaculus, I’ve been learning about Supreme Court prediction. And when you study Supreme Court prediction, the first thing you realize is that people aren’t very good at it. Take for example this article about how a group of experts were only able to correctly predict affirm/reverse decisions about 60% of the time. Of course, 60% isn’t nothing, but you have to realize the Supreme Court reverses lower court decisions more often than not. So 60% is roughly what you would get by ignoring the details of each case and just guessing “reverse.” The authors of that article came up with some decision trees that did a bit better: 75% correct. But the trees only had very limited information to work with, namely
So a simple model outperforms an ensemble of experts. (Tetlock was right again!) It makes me wonder what you could do by, say, text-mining amicus briefs. Sadly I haven’t found a place with a large enough archive of these briefs.
One other way we can try to improve the model is by allowing each justice’s political attitudes to change over time. In fact, two of the authors of that paper, Andrew Martin and Kevin Quinn, did exactly that by developing the Martin-Quinn score. Any time you see the MSM reporting on the political attitudes of the Supreme Court or a particular justice, there’s a good chance they’re using MQ scores.
How do these scores come about? They were introduced in this article, and basically they're parameters in a statistical model. The model assumes that in each term \(t\), each justice \(j\) has a number, \(\theta_{t,j}\), that indicates how conservative or liberal they are. Sort of. Actually, there's hardly anything about politics in the model; the priors are pretty much the only place politics enters at all. (And those priors are pretty strong, frankly.) The model also assumes the decisions made by each justice are a product of their \(\theta_{t,j}\) plus a couple of parameters unique to each case, with some normally-distributed noise thrown in. So it's essentially a probit model.
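To make that concrete, here's my reading of the setup (the exact sign conventions might differ from the paper's): each case \(k\) decided in term \(t\) gets two parameters, \(\alpha_k\) and \(\beta_k\), and the probability that justice \(j\) votes in the conservative direction is

\[
\Pr(y_{k,j} = 1) = \Phi\left(\beta_k \, \theta_{t,j} - \alpha_k\right),
\]

where \(\Phi\) is the standard normal CDF, which is where the probit comes from. The \(\theta_{t,j}\) then get the random-walk prior discussed next.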
One of the assumptions of the model is that the attitudes of each justice are a random walk. To test how viable this assumption is, I made a bunch of plots of actual random walks and compared them with the MQ scores available on the authors’ website.
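Generating the comparison walks only takes a few lines. Every number here (the number of terms, nine walkers, the step size) is just something I picked to roughly match the scale of the MQ plots, not anything taken from the paper:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

n_terms = 30      # how many terms to simulate; chosen to match the span of the MQ plot
n_walkers = 9     # one walk per "justice"
step_sd = 0.1     # step size of the random walk; the real model has its own scale

# theta[t, j] = theta[t-1, j] + Normal(0, step_sd), starting from 0
steps = rng.normal(0.0, step_sd, size=(n_terms, n_walkers))
theta = np.cumsum(steps, axis=0)

plt.plot(theta)
plt.xlabel("term")
plt.ylabel("simulated ideal point")
plt.title("Nine true random walks")
plt.show()
```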
OK, so I thought the difference was pretty obvious, but then I scrambled the names and asked my girlfriend if she could tell which one was the real plot. She couldn't. So maybe it's only obvious to people who stare at random walks a lot. But it looks to me like the MQ scores have a lot of trends, which you wouldn't expect to see if they were true random walks. It also kind of bugs me that a justice can move from anywhere to anywhere, but only gradually. So after many years, Thomas and Sotomayor could switch places, but not because of some sudden change brought about by old age or a personal revelation?
There’s also this tidbit from the article:
The dynamic ideal point model correctly classifies 76% of the decisions. This is an improvement over the observed marginal percentage of reversals (63%). As a point of comparison, the model in which ideal points are assumed constant over time classifies 74% of the decisions correctly.
The dynamic model, where justices can change over time, is so much more complex that you’d hope for more than a 2% gain, right? Hundreds of extra parameters, and you only get another one or two cases per year? And it doesn’t sound like this came from testing on a holdout set, so the more complex model might just be fitting its training data better, at the expense of general performance.
Anyway, I wanted to try building a model like theirs in Stan, and in the process I learned that Stan is really bad at handling this type of data. The problem is, each term has a variable number of cases, and each case has a variable number of justices. It would be natural to store this data in ragged arrays, which Stan doesn't have, so instead you have to store everything in flat arrays. Then you have to keep a separate array of indices to track where one term ends and another begins. Good Stan code can look very declarative, but trying to vectorize this was like programming in C without the good parts. Here's the code.

I ended up making a few changes, and I'll probably make even more later. First, I used a logit model instead of their probit, because who cares. And instead of giving each case two parameters, α and β, I just stuck with α. I don't think the β was actually doing anything important, besides slowing down convergence. Instead, I'm using the lcDispositionDirection variable from the Supreme Court Database. This also saves me from having to put my thumb on the scale just so the model knows Thomas is the conservative one.
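To give a flavor of the flattening, here's a toy version of the data prep in Python. The column names and rows are made up for illustration (except lcDispositionDirection, which really is an SCDB variable): every justice-vote becomes one row of a few flat arrays, with lookup indices plus start/end markers for each term standing in for the ragged structure Stan can't express.

```python
import numpy as np
import pandas as pd

# Toy long-format data: one row per (case, justice) vote. Values are invented.
votes = pd.DataFrame({
    "term":    [2018, 2018, 2018, 2019, 2019],
    "case_id": ["18-001", "18-001", "18-002", "19-001", "19-001"],
    "justice": [1, 2, 3, 1, 2],
    "vote":    [1, 0, 1, 1, 1],   # say 1 = voted to reverse
    "lc_dir":  [1, 1, 2, 2, 2],   # lcDispositionDirection of the lower court
}).sort_values("term").reset_index(drop=True)

terms = np.sort(votes["term"].unique())
cases = votes["case_id"].unique()

# 1-based lookup indices, one entry per vote, as Stan expects
term_of_vote = votes["term"].map({t: i + 1 for i, t in enumerate(terms)}).to_numpy()
case_of_vote = votes["case_id"].map({c: i + 1 for i, c in enumerate(cases)}).to_numpy()

# The annoying part: where each term starts and ends inside the flat arrays
term_col = votes["term"].to_numpy()
term_start = np.searchsorted(term_col, terms, side="left") + 1   # 1-based, inclusive
term_end = np.searchsorted(term_col, terms, side="right")        # 1-based, inclusive

stan_data = {
    "K": len(votes),                   # total votes
    "T": len(terms),                   # terms
    "C": len(cases),                   # cases
    "J": votes["justice"].nunique(),   # justices
    "term": term_of_vote,
    "case": case_of_vote,
    "justice": votes["justice"].to_numpy(),
    "y": votes["vote"].to_numpy(),
    "lc_dir": votes["lc_dir"].to_numpy(),
    "term_start": term_start,
    "term_end": term_end,
}
```

All of that bookkeeping exists only so the Stan program can loop over term_start/term_end segments of the flat arrays, which is exactly the part that felt like C without the good parts.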
Ultimately I think I’ll move away from the big, parameter-heavy MQ model and try to bring in more sources of information instead. It really is amazing how little information people are trying to predict the Supreme Court with. Oyez has little biographies for each justice. These include some demographic information like ethnicity, religion, nominating president, and sometimes mother’s and father’s occupation. Yeah, weird, but maybe something can be done with that. I’m sure there’s plenty of information you could extract from Wikipedia, too. And I continue to believe there’s low-hanging fruit to be found in amicus briefs, if only they were more accessible.