April 22, 2022

I’ve been learning about Dirichlet processes in an effort to solve the problem posed in my last post. Dirichlet processes give us a way of defining *infinite* mixture models. These are useful whenever you have clustered data, but you aren’t sure how many clusters there are, how big they are, or where they’re located. Unfortunately for me, Dirichlet processes aren’t *quite* flexible enough for what I have in mind. But you have to crawl before you can walk, so here is a collection of things I’ve learned in the last few weeks.

Credit for the invention of the Dirichlet process generally goes to Thomas Ferguson in the early 1970s, but David Blackwell and James MacQueen also made important early contributions. And special mention should go to Ferguson’s student, Charles Antoniak, who was the first to talk about *mixtures* of Dirichlet processes. This was a huge development. As I’ll explain later, essentially all applications of Dirichlet processes involve mixtures of them.

A Dirichlet process is basically a random discrete probability measure: a distribution whose samples are themselves discrete probability measures. Here is how Ferguson formalized this idea: Let \((\Omega, \Sigma)\) be a measurable space. Choose a *base (probability) measure* \(G_0\) and positive *concentration parameter* \(\alpha\). We say a random probability measure \(G\) is a *Dirichlet process* if, for all measurable partitions \(E_1 \sqcup \cdots \sqcup E_n = \Omega\), \[
\bigl( G(E_1), \ldots, G(E_n) \bigr) \sim \mathrm{Dirichlet}\bigl( \alpha G_0(E_1), \ldots, \alpha G_0(E_n) \bigr).
\] And in this case we write \[
G \sim \mathrm{DP}(\alpha G_0).
\] From this property we can derive many others. For instance, if \(E \in \Sigma\) then \[
\operatorname{E}\bigl[G(E)\bigr] = G_0(E).
\] This fact, which Ferguson derived in his original paper, tells us that Dirichlet processes will, on average, resemble their base distribution. But unlike the base distribution, a Dirichlet process is necessarily discrete. Now, Ferguson didn’t prove that this follows from the abstract definition above. (For starters, he would’ve needed a finer \(\sigma\)-algebra than the one mentioned in his paper.) But every *concrete* Dirichlet process certainly is discrete. And concrete Dirichlet processes come in two flavors:

- Pólya urn/Chinese restaurant processes
- stick-breaking processes

These are both adequately described on Wikipedia. For some reason, people tend to treat Pólya urns and Chinese restaurants as completely different processes, but they are really two descriptions of the same thing. In my mind, the stick-breaking process is “distribution-first” while the urn/restaurant process is “sample-first”: stick-breaking builds the random measure \(G\) explicitly, atom by atom, while the urn/restaurant draws the samples \(X_1, X_2, \ldots\) directly, with \(G\) integrated out.
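To make the “distribution-first” view concrete, here is a minimal stick-breaking sketch (the function name, truncation tolerance, and seed are my own choices). As a sanity check, it also verifies \(\operatorname{E}[G(E)] = G_0(E)\) numerically, averaging \(G((-\infty, 0])\) over many draws with a standard normal base measure:

```python
import numpy as np

rng = np.random.default_rng(0)

def stick_breaking(alpha, base_sampler, tol=1e-8):
    """Sample a (truncated) draw G ~ DP(alpha * G0) by stick-breaking.

    Returns atom locations and weights; stops once the leftover stick
    mass drops below `tol`, so the weights sum to ~1.
    """
    weights, atoms = [], []
    remaining = 1.0
    while remaining > tol:
        beta = rng.beta(1.0, alpha)   # break off a Beta(1, alpha) fraction
        weights.append(remaining * beta)
        atoms.append(base_sampler())  # atom location drawn from G0
        remaining *= 1.0 - beta
    return np.array(atoms), np.array(weights)

# Check E[G(E)] = G0(E) for E = (-inf, 0], where G0((-inf, 0]) = 0.5.
vals = []
for _ in range(200):
    atoms, w = stick_breaking(alpha=2.0, base_sampler=rng.standard_normal)
    vals.append(w[atoms <= 0].sum())
print(np.mean(vals))  # should be close to 0.5
```

In an exact draw the stick-breaking loop never terminates, so any implementation has to truncate somewhere; here I stop once the unallocated mass is negligible.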

The Dirichlet process is used for *infinite* mixture models, as I explain in the next section. So it allows for models where data can come from infinitely many different clusters. But those clusters will all be of different sizes, and almost all of them will be so small that they don’t matter. Essentially, the concentration parameter \(\alpha\) controls how many clusters matter. Larger \(\alpha\) means more clusters. This is easiest to understand if you think of the Chinese restaurant process. If there are already \(n\) people in the restaurant, the probability that a new customer sits at an empty table is \(\alpha / (\alpha + n)\).
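The seating rule above is only a few lines of code. Here is a sketch (the function name and seed are my own) that seats \(n\) customers and returns the resulting table sizes:

```python
import numpy as np

rng = np.random.default_rng(1)

def chinese_restaurant(alpha, n):
    """Return the list of table sizes after seating n customers."""
    tables = []                # tables[k] = number of people at table k
    for i in range(n):         # customer i arrives; i people are already seated
        # New table with probability alpha / (alpha + i),
        # existing table k with probability tables[k] / (alpha + i).
        probs = np.array(tables + [alpha]) / (alpha + i)
        choice = rng.choice(len(tables) + 1, p=probs)
        if choice == len(tables):
            tables.append(1)
        else:
            tables[choice] += 1
    return tables

sizes = chinese_restaurant(alpha=1.0, n=1000)
print(len(sizes))  # the number of occupied tables grows like log n
```

Note the rich-get-richer effect: an existing table attracts new customers in proportion to its current size, which is why a few tables end up holding almost everyone.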

We can also think of \(\alpha\) as controlling the *expected* number of clusters represented in a sample. Let \(G \sim \mathrm{DP}(\alpha G_0)\), and let \(X_1, \ldots, X_n \sim G\). Finally, let \(Z_n\) be the number of distinct values in \(\{X_1, \ldots, X_n\}\). We can find \(\operatorname{E}[Z_n]\) using the linearity of the expectation. Define \[
N_i = \begin{cases}
1 & \text{if $X_i \notin \{X_1, \ldots, X_{i-1}\}$} \\
0 & \text{otherwise.}
\end{cases}
\] Then \(Z_n = N_1 + \cdots + N_n\), so \[
\begin{align*}
\operatorname{E}[Z_n]
&= \operatorname{E}[N_1] + \cdots + \operatorname{E}[N_n] \\
&= \Pr(N_1 = 1) + \cdots + \Pr(N_n = 1) \\
&= \frac{\alpha}{\alpha} + \cdots + \frac{\alpha}{\alpha + n - 1},
\end{align*}
\] where the last step uses the Chinese restaurant process: the \(i\)th sample starts a new cluster with probability \(\alpha / (\alpha + i - 1)\). (In particular, the first term is \(\alpha / \alpha = 1\), since the first sample always starts a new cluster.) So \(\operatorname{E}[Z_n] \to n\) as \(\alpha \to \infty\), and \(\operatorname{E}[Z_n] \to 1\) as \(\alpha \to 0\). If \(\alpha = 1\) then \(\operatorname{E}[Z_n] = H_n\), the \(n\)th harmonic number, so \(\operatorname{E}[Z_n] \approx \log n + \gamma\) for large \(n\).
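The formula is easy to evaluate directly. A small sketch (function name mine) checking the \(\alpha = 1\) harmonic-number case and the two limits:

```python
import numpy as np

def expected_clusters(alpha, n):
    """E[Z_n] = sum_{i=1}^{n} alpha / (alpha + i - 1)."""
    i = np.arange(n)
    return float(np.sum(alpha / (alpha + i)))

n = 1000
# alpha = 1 gives the harmonic number H_n, which is ~ log n + gamma.
print(expected_clusters(1.0, n))        # H_1000
print(np.log(n) + 0.5772156649)         # close to the line above
# The limits: E[Z_n] -> n as alpha -> infinity, -> 1 as alpha -> 0.
print(expected_clusters(1e9, n))
print(expected_clusters(1e-9, n))
```

Even at \(n = 1000\) the log approximation is accurate to a few parts in ten thousand.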

The discreteness of Dirichlet processes makes them totally inappropriate when you want a continuous distribution. But there’s a fix for that: a mixture model. Assume your continuous distribution is a mixture of simple distributions indexed by a parameter \(\theta\), and let the mixing distribution of \(\theta\) be a Dirichlet process. Here is how this looks for a mixture of Gaussian distributions with a common variance:

\[ \begin{align*} X_i \:\vert\: \mu_i &\sim N(\mu_i, \sigma^2) \\ \mu_i \:\vert\: G &\sim G \\ G &\sim \mathrm{DP}(\alpha G_0) \end{align*} \]

(Writing \(\mu_i \:\vert\: G\) in the second line is necessary. If we don’t condition on \(G\) then \(\mu_i \sim G_0\). Remember that \(G\) is discrete while \(G_0\) usually isn’t.) When conditioning on your sample \(\{X_1, \ldots, X_n\}\), the posterior for \(G\) is a mixture of Dirichlet processes, and this is why Antoniak’s contribution is so important.
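One nice thing about the stick-breaking view is that it lets you sample from this generative model directly. Here is a hedged sketch (the names, truncation level, seed, and the particular base measure are my own choices) that draws data from a truncated version of the three-line model above:

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_dp_gaussian_mixture(n, alpha, sigma, base_sampler, trunc=500):
    """Draw X_1..X_n from the DP Gaussian mixture via truncated stick-breaking.

    Assumptions: G0 is given by `base_sampler`, sigma is common and known,
    and G is truncated to `trunc` atoms (the leftover mass is negligible
    for moderate alpha).
    """
    # Stick-breaking: G = sum_k w_k * delta_{mu_k}, with mu_k ~ G0.
    betas = rng.beta(1.0, alpha, size=trunc)
    w = betas * np.cumprod(np.concatenate(([1.0], 1.0 - betas[:-1])))
    w /= w.sum()                       # renormalize the truncated weights
    mus = base_sampler(trunc)
    # mu_i | G ~ G, then X_i | mu_i ~ N(mu_i, sigma^2).
    mu_i = rng.choice(mus, size=n, p=w)
    return mu_i + sigma * rng.standard_normal(n)

# Base measure G0 = N(0, 10^2), so cluster centers are widely separated
# relative to the within-cluster spread sigma = 0.5.
x = sample_dp_gaussian_mixture(n=500, alpha=2.0, sigma=0.5,
                               base_sampler=lambda k: 10.0 * rng.standard_normal(k))
print(x[:5])
```

A histogram of `x` shows a handful of well-separated bumps: discrete cluster centers from \(G\), smoothed by the Gaussian likelihood into a continuous distribution.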

A combination of Gibbs sampling and Metropolis-Hastings gets you pretty far. **But** if you do it naïvely, your chains won’t mix well. Neal 2000 is a must-read reference, and is what the `dirichletprocess` R package is based on. (See Neal’s Algorithm 8 especially.) But it doesn’t tell the full story: Neal’s algorithms are based on the Pólya urn process, and they have trouble splitting and merging clusters (i.e. poor mixing). Jain and Neal 2004 offer a split-merge solution to this problem; another solution is to use the stick-breaking process instead. But this article is already getting long, so that’ll have to wait for a future post.
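To give a flavor of the Gibbs approach: in the conjugate case, the cluster parameters can be integrated out, and each label is resampled from the CRP prior times a posterior-predictive likelihood. That is the idea behind Neal’s conjugate algorithms; the sketch below (its names, priors, and the Gaussian model are my own minimal choices, not the `dirichletprocess` implementation) assumes a \(N(0, \tau^2)\) base measure and known \(\sigma^2\):

```python
import numpy as np
from math import sqrt, pi, exp

rng = np.random.default_rng(3)

def predictive_pdf(x, cluster, sigma2=1.0, tau2=10.0):
    """p(x | cluster members) under mu ~ N(0, tau2), x | mu ~ N(mu, sigma2).

    Pass an empty list for a brand-new cluster (the prior predictive).
    """
    m = len(cluster)
    post_var = 1.0 / (1.0 / tau2 + m / sigma2)
    post_mean = post_var * sum(cluster) / sigma2
    var = post_var + sigma2
    return exp(-(x - post_mean) ** 2 / (2 * var)) / sqrt(2 * pi * var)

def gibbs_sweep(x, z, alpha=1.0):
    """One collapsed-Gibbs sweep over the cluster labels z (Neal-style)."""
    n = len(x)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        labels = sorted({z[j] for j in others})
        # Weight for each existing cluster: CRP count times predictive...
        weights = [sum(z[j] == c for j in others)
                   * predictive_pdf(x[i], [x[j] for j in others if z[j] == c])
                   for c in labels]
        # ...and for a new cluster: alpha times the prior predictive.
        weights.append(alpha * predictive_pdf(x[i], []))
        weights = np.array(weights) / sum(weights)
        k = rng.choice(len(weights), p=weights)
        z[i] = labels[k] if k < len(labels) else max(labels, default=-1) + 1
    return z

# Two well-separated clusters; start with every point in its own cluster.
x = np.concatenate([rng.normal(-5, 1, 20), rng.normal(5, 1, 20)])
z = list(range(len(x)))
for _ in range(20):
    z = gibbs_sweep(x, z)
print(len(set(z)))  # typically collapses to a small number of clusters
```

Even this toy version shows the mixing problem: labels move one point at a time, so splitting a large cluster in two requires a long sequence of individually unlikely moves.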