November 5, 2021

In this, the zillionth critique of null hypothesis significance testing ever written, I want to try to summarize the biggest problems with NHST, along with some smaller ones, in plain English. Because a non-statistician asked me. So, starting with the most important problem …

You have a hypothesis, \(H\). You think your hypothesis might be true. Specifically, you give it a credence of \(\Pr(H)\). Then you collect some data, \(D\), and you update your credence to \(\Pr(H \:{\mid}\: D)\). This update works according to Bayes’ theorem:

\[ \Pr(H \:{\mid}\: D) = \frac{{\color{red}\Pr(D \:{\mid}\: H)}\Pr(H)}{\Pr(D)} \]

This is the obvious way to do things. But NHST doesn’t do things the obvious way. Instead, NHST has you calculate \(\color{red}\Pr(D \:{\mid}\: H)\) (the red term from Bayes’ theorem—it’s called the *p-value*).
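To make the contrast concrete, here is a minimal numerical sketch of the Bayesian update above. The scenario and all numbers are invented purely for illustration: \(H\) = "this coin is biased toward heads, landing heads 70% of the time," with the only alternative being a fair coin, and the data \(D\) is 14 heads in 20 flips.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k heads in n flips of a coin with P(heads) = p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

prior_H = 0.5                                   # Pr(H): initial credence in the biased-coin hypothesis
heads, flips = 14, 20                           # the observed data D

likelihood_H = binom_pmf(heads, flips, 0.7)     # Pr(D | H)
likelihood_alt = binom_pmf(heads, flips, 0.5)   # Pr(D | not H)

# Pr(D) via the law of total probability
pr_D = likelihood_H * prior_H + likelihood_alt * (1 - prior_H)

# Bayes' theorem: Pr(H | D) = Pr(D | H) Pr(H) / Pr(D)
posterior_H = likelihood_H * prior_H / pr_D
print(round(posterior_H, 3))
```

Starting from a 50/50 prior, 14 heads in 20 flips pushes the credence for the biased coin up to about 0.84. That's the whole update: prior in, posterior out.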

OK, the p-value *technically* isn’t the red term. It’s actually this:

\[ \Pr(\text{$D$ or “more extreme” $D$} \:{\mid}\: H) \]

(And \(H\) would be the *null* hypothesis, which comes with its own issues, but whatever.) Close enough. So why are things done this way? Two reasons:
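Here is what that tail sum looks like in practice, continuing the same hypothetical coin example (14 heads in 20 flips), now with \(H\) as the null hypothesis "the coin is fair":

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k heads in n flips of a coin with P(heads) = p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

heads, flips = 14, 20

# Pr(D or "more extreme" D | H): sum the probability of the observed count
# and of every count at least as extreme, all computed under the null.
p_value = sum(binom_pmf(k, flips, 0.5) for k in range(heads, flips + 1))
print(round(p_value, 4))
```

This one-sided p-value comes out to about 0.058. Note the oddity the definition bakes in: it sums the probabilities of outcomes (15 heads, 16 heads, …) that were never actually observed.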

- It’s easy.
- Many decades ago, some influential statisticians believed expressions like \(\Pr(H)\) were meaningless because they require you to treat probabilities as
*degrees of belief*. It turns out this is a perfectly sensible thing to do, but we had to wait for a few very stubborn people to die before the truth could prevail.

NHST doesn’t have many defenders these days, so I don’t want to dwell on it. Why do people still use it? Because change takes time. Because it’s still in our textbooks and taught in our schools. But this is changing, albeit very slowly.

> A new scientific truth does not generally triumph by persuading its opponents and getting them to admit their errors, but rather by its opponents gradually dying out and giving way to a new generation that is raised on it.
>
> — Max Planck

Before I continue, I want to point out that NHST is actually *not* the biggest problem with statistics these days. Not even close. And replacing it *won’t* fix the bigger problems. It just happens to be an easy target. Just off the top of my head, here are some of the bigger problems:

- bad models
- no model validation
- researcher degrees of freedom
- publication bias
- just making shit up

People who are trying to fix these problems include Andrew Gelman, Nick Brown, the guys at Data Colada, and others as yet unknown to me. Everyone with an interest in statistics should follow them.

From Gwern: “The null hypothesis is false.”

This is something Gelman also likes to bring up. But the thing is, *a lot* of statistics is based on assumptions that are only approximately true. Is this a problem? Well …

I tend to think of it this way: The things we do are (probably) justified by a more sophisticated theory; we just don’t know the theory yet.

Another minor problem: NHST encourages people to think of results as being either significant or not, with no in-between. This is a problem not just for the general public, but also for the people who *do statistics*. Imagine: You do an experiment, collect some data, and when it comes time to analyze your results it turns out nothing is “statistically significant.” What are you going to do? Scrap the whole project and try something else? (And waste all that effort?) Try to get your null results published? (Lol, good luck with that!) Or go fishing until you find *something* with a low p-value?

One final problem: There’s no loss function! No matter what you’re testing, if \(p > 0.05\) then you’re supposed to accept—excuse me, *fail to reject* 🙄—the null hypothesis.

At this point some people would say “But you can adjust the level to suit your needs!” There are a few problems with this, but the biggest one is: nobody does it! For some reason, no matter how many times you tell people that 0.05 is arbitrary and they should choose their own level, they never actually do it. This explains a lot of what is wrong with statistics.
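Here's a toy sketch of what a loss function buys you, with made-up costs: suppose acting on a nonexistent effect costs 1 unit, while missing a real one costs 20. Then the sensible rule is to act whenever the expected loss of acting is smaller than the expected loss of not acting, and the decision threshold falls out of the costs instead of being fixed at 0.05 forever.

```python
def should_act(pr_effect, cost_false_alarm=1.0, cost_miss=20.0):
    """Act iff the expected loss of acting beats the expected loss of skipping.

    pr_effect is your credence that the effect is real; the costs are
    hypothetical numbers chosen for illustration.
    """
    expected_loss_act = (1 - pr_effect) * cost_false_alarm   # pay if the effect isn't real
    expected_loss_skip = pr_effect * cost_miss               # pay if it is and you ignored it
    return expected_loss_act < expected_loss_skip

print(should_act(0.10))  # with a miss 20x as costly, even 10% credence justifies acting
print(should_act(0.02))
```

With these costs the break-even credence is \(1/21 \approx 0.048\); double the cost of a miss and it drops to \(1/41\). The threshold moves with the stakes, which is exactly what a fixed 0.05 cutoff can't do.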