October 2, 2021

The formula for BIC is wrong

The usual formula for the Bayesian information criterion is as follows:

\[ -2\ell(\hat{\theta}) + k\log n \]

Here \(\ell\) is the log-likelihood function and \(\hat{\theta}\) is the MLE of the parameter(s) \(\theta\). I also use \(k\) for the number of parameters and \(n\) for the size of the dataset.

This formula is wrong! It should actually be this:

\[ -2\ell(\hat{\theta}) + k\log \frac{n}{2\pi} \]

For reference, check out the bottom of page 131 here or the upper and lower bounds in this paper. Let’s call this CBIC (corrected BIC), to distinguish it from the first one.
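
In brief, the modern derivation goes something like this (the notation here is mine, so take it as a sketch rather than a quotation). Apply the Laplace approximation to the marginal likelihood, with prior \(p(\theta)\):

\[ \int e^{\ell(\theta)}\, p(\theta)\, d\theta \approx e^{\ell(\hat{\theta})}\, p(\hat{\theta})\, (2\pi)^{k/2}\, \big|n\hat{I}\big|^{-1/2} \]

Here \(\hat{I}\) is the average observed information matrix. Taking \(-2\log\) of both sides and using \(\log|n\hat{I}| = k\log n + \log|\hat{I}|\) gives

\[ -2\ell(\hat{\theta}) + k\log\frac{n}{2\pi} + \log|\hat{I}| - 2\log p(\hat{\theta}) \]

The last two terms stay bounded as \(n\) grows, so dropping them yields CBIC. Dropping the \(-k\log(2\pi)\) as well yields the usual BIC, even though that term scales with \(k\) just like the rest of the penalty.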

I’m sorry to say that I’m not the first person to notice this. I found this SE question after just a quick search, and I could no doubt have found many more examples if I’d searched harder. But this raises a question: Why hasn’t the corrected formula caught on?

I get the feeling that this is just due to inertia. In the paper that introduced BIC, Schwarz didn’t use a Gaussian approximation the way most modern derivations do. So instead of \(k\log(2\pi)\) he got \(k\log(\pi/\lambda)\), and he chose to fold it into his “remainder” term because it didn’t depend on \(n\). I suspect Schwarz would’ve kept this term if it weren’t for the \(\lambda\), but that is speculation. The important thing is that the formula at the top of this page is the one implemented in all of our software, written in our textbooks, taught in our schools, and so on. It would take a lot of effort to switch to CBIC. Can that effort be justified?

One might argue that there is nothing to be gained by switching, since the difference becomes negligible when \(n\) is “large.” But what is large, exactly? While \(\log(2\pi) \approx 1.84\) may look small, remember that even when \(n = 10^5\), \(\log n \approx 11.51\). So even with a substantial amount of data, the usual BIC formula has a penalty term that is nearly 20% larger than it should be. That rises to 36% if you have merely \(n = 1000\).
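
Here is a quick sketch of that arithmetic in Python, in case you want to check other sample sizes (the function names are mine, not from any library):

```python
import numpy as np

def bic_penalty(k, n):
    """Penalty term of the usual BIC."""
    return k * np.log(n)

def cbic_penalty(k, n):
    """Penalty term of the corrected BIC; smaller by exactly k*log(2*pi)."""
    return k * np.log(n / (2 * np.pi))

# The ratio of the two penalties doesn't depend on k, so k=1 suffices.
for n in [1000, 10**5, 10**8]:
    pct = 100 * (bic_penalty(1, n) / cbic_penalty(1, n) - 1)
    print(f"n = {n}: BIC penalty is {pct:.0f}% larger than it should be")
# n = 1000: BIC penalty is 36% larger than it should be
# n = 100000: BIC penalty is 19% larger than it should be
# n = 100000000: BIC penalty is 11% larger than it should be
```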

People often say that BIC has a preference for simpler models. Well, here’s one explanation for that: the penalty has been too large all along.

Still, it remains to be seen whether the two formulas make a practical difference. If we change the formula, will that affect any of the decisions we make in the future? Perhaps not. In fact, if everyone used BIC correctly, treating it as merely one heuristic among many, then I don’t think this would be a problem.

HA!

As I write this, the current version of the Wikipedia page for BIC suggests that a difference of more than 10 is “very strong” evidence for one model over another. This means treating ΔBIC as essentially a transformed Bayes factor, which is wrong for many reasons. For instance, if one model has at least 6 more parameters than the other (which is not unusual in some settings), this “very strong” evidence could disappear completely by using CBIC instead!
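
To see why, note that for a fixed dataset each model’s CBIC is exactly \(k\log(2\pi)\) smaller than its BIC, so

\[ \Delta\mathrm{CBIC} = \Delta\mathrm{BIC} - \Delta k \cdot \log(2\pi) \approx \Delta\mathrm{BIC} - 1.84\,\Delta k \]

With \(\Delta k = 6\), the gap shrinks by about 11. So if the model with 6 extra parameters has a BIC that is, say, 10.5 higher (“very strong” evidence against it on the scale above), its CBIC is about 0.5 lower, which points the other way entirely.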

I’d like to see some empirical studies comparing CBIC with BIC and AIC. Maybe I’ll do that myself when I have the time. For now I’ll just content myself with correcting Wikipedia.
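
If you want to experiment along those lines yourself, here is a toy setup one could start from, assuming Gaussian models fit by least squares (my own sketch, not an actual study):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: a linear trend, to be fit with polynomials of varying degree.
n = 50
x = np.linspace(0.0, 1.0, n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)

def gaussian_loglik(resid, n):
    """Profile log-likelihood of a Gaussian model at the MLE of the variance."""
    sigma2 = np.mean(resid**2)
    return -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)

for degree in range(1, 6):
    coeffs = np.polyfit(x, y, degree)
    resid = y - np.polyval(coeffs, x)
    k = degree + 2  # polynomial coefficients plus the noise variance
    ll = gaussian_loglik(resid, n)
    bic = -2 * ll + k * np.log(n)
    cbic = -2 * ll + k * np.log(n / (2 * np.pi))
    print(f"degree {degree}: BIC = {bic:.1f}, CBIC = {cbic:.1f}")
```

Since the two criteria differ by \(k\log(2\pi)\), they can only disagree when the competing models have different \(k\); the interesting empirical question is how often that happens near the decision boundary.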