February 26, 2023

Exploring Gaussian process kernels

Introduction

Inspired by the cherry blossom prediction contest, I was thinking about time series models for daily temperature, and that led me to learn about Gaussian processes. Together with Dirichlet processes, which I wrote about before, Gaussian processes are a cornerstone of Bayesian nonparametric statistics. Basically, if we want to model something that involves an unknown function (such as temperature over time), we can use a Gaussian process as a prior for that function.

Five samples from a GP with the squared exponential kernel.

This sounds like it should be complicated, but GPs are actually really simple. Essentially, if the function \(f\) is GP-distributed, then any finite set of values \(f(x_1), \ldots, f(x_n)\) follows a multivariate normal distribution. The parameters of that distribution—the mean and covariance—are up to you. For demonstration purposes, people usually just assume the mean is zero. The covariance is where things get interesting. To choose the covariance, you need a function, usually called \(k\), that takes any two points, say \(x_i\) and \(x_j\), and returns the covariance of the corresponding function values, \(\operatorname{cov}\bigl(f(x_i), f(x_j)\bigr)\). This function is called a kernel, for reasons I’m not going to get into here.

There are several popular kernels to choose from. For example, there’s the squared exponential kernel, \[ k_{\mathrm{SE}}(x, x') = \exp\biggl(-\frac{(x - x')^2}{2\ell^2}\biggr). \]

This kernel makes smooth functions, and it gives you a parameter \(\ell\), the length-scale, that you can fiddle with: small values of \(\ell\) make the function wiggle quickly, large values make it vary slowly. What you probably want to do, instead of assigning \(\ell\) just one fixed value, is to assign it a prior, which affords you some extra flexibility.
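To make the “any finite set of values follows a multivariate normal distribution” idea concrete, here’s a minimal sketch of a Stan program that draws from this prior on a grid of points. It’s my own illustration, not code from the original analysis: it assumes N, x, and \(\ell\) arrive as data, fixes the kernel amplitude at 1 so that Stan’s built-in gp_exp_quad_cov matches the formula above, and has to be run with the fixed-parameter sampler since there is nothing to infer.

// Sketch: draw from a GP prior with the squared exponential kernel.
// Run with Stan's fixed_param algorithm; each iteration is one independent draw.
data {
  int<lower=1> N;          // number of input points
  array[N] real x;         // input locations
  real<lower=0> ell;       // length-scale
}
transformed data {
  // gp_exp_quad_cov(x, alpha, ell) is alpha^2 * exp(-(x - x')^2 / (2 * ell^2))
  matrix[N, N] K = gp_exp_quad_cov(x, 1.0, ell);
  for (n in 1:N) {
    K[n, n] += 1e-9;       // tiny jitter so K stays numerically positive definite
  }
}
generated quantities {
  // the finite-dimensional distribution: zero-mean multivariate normal with covariance K
  vector[N] f = multi_normal_rng(rep_vector(0, N), K);
}

The jitter on the diagonal is there because a squared exponential covariance matrix gets numerically singular when input points sit close together.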

Samples using the squared exponential kernel with different values of \(\ell\).

This is all relatively easy in Stan, which allows models like this: \[ \begin{aligned} \ell &\sim \mathrm{Gamma}(3, 0.5) \\ f \:\vert\: \ell &\sim \mathrm{GP}(0, k_{\mathrm{SE},\ell}) \\ y_i \:\vert\: f &\sim N\bigl(f(x_i), 0.1\bigr) \end{aligned} \]

Here we’re imagining we have some pairs \((x_i, y_i)\) that came from an unknown function, with some added noise. It would be even better to put a prior on the amount of noise, but I don’t feel like it.
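Concretely, a Stan program for that model might look something like the sketch below. This is my own rough translation of the math, not battle-tested code: it reads the 0.1 as the noise standard deviation (Stan’s convention), fixes the kernel amplitude at 1, and builds \(f\) with the usual non-centered trick, a Cholesky factor times independent standard normal draws.

data {
  int<lower=1> N;
  array[N] real x;
  vector[N] y;
}
parameters {
  real<lower=0> ell;       // length-scale
  vector[N] eta;           // standard normal draws, rescaled into f below
}
transformed parameters {
  vector[N] f;
  {
    matrix[N, N] K = gp_exp_quad_cov(x, 1.0, ell);
    for (n in 1:N) {
      K[n, n] += 1e-9;     // jitter for numerical stability
    }
    f = cholesky_decompose(K) * eta;   // non-centered parameterization of the GP
  }
}
model {
  ell ~ gamma(3, 0.5);
  eta ~ std_normal();
  y ~ normal(f, 0.1);      // 0.1 treated as the noise standard deviation
}

The non-centered parameterization is the standard recommendation for latent GPs in Stan, since sampling \(f\) directly tends to give the sampler trouble.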

Three \((x,y)\) pairs and 100 samples from the posterior distribution.

More power

The typical kernels only come with one or two parameters. How can we get more expressive power? We could add or multiply two kernels together, but that’s not a big improvement.

Samples from a GP with a spectral mixture kernel.

This paper by Wilson and Adams shows us a better way: pick any probability distribution whose density is symmetric around the origin, apply the inverse Fourier transform, and you get a valid kernel. It’s true (it’s a consequence of Bochner’s theorem)! The authors suggest using a mixture of Gaussian distributions, with each component mirrored around the origin to keep the density symmetric. Suppose we use a mixture of \(Q\) components, \[ w_1 N(\mu_1, \sigma_1^2) + \cdots + w_Q N(\mu_Q, \sigma_Q^2), \] each averaged with its mirror image at \(-\mu_q\). Then our kernel will be \[ k(\Delta x) = \sum_{q=1}^{Q} w_q \exp\bigl(-2\pi^2\sigma_q^2(\Delta x)^2\bigr)\cos\bigl(2\pi \mu_q \Delta x\bigr), \] where the cosine comes from the mirroring. The authors call this a “spectral mixture” kernel. This method always gives you a stationary kernel, meaning it depends only on \(\Delta x = x - x'\). The popular kernels are all stationary too, so this isn’t a big deal.
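For what it’s worth, here’s a sketch of how this kernel could be written as a Stan function. The name spectral_mixture_cov and the argument layout, with \(Q\) components described by weights w, spectral means mu, and bandwidths sigma, are mine, not the paper’s:

functions {
  // Spectral mixture covariance matrix for a one-dimensional input array x.
  matrix spectral_mixture_cov(array[] real x, vector w, vector mu, vector sigma) {
    int N = num_elements(x);
    int Q = num_elements(w);
    matrix[N, N] K;
    for (i in 1:N) {
      for (j in 1:i) {
        real d = x[i] - x[j];
        real kij = 0;
        for (q in 1:Q) {
          kij += w[q] * exp(-2 * square(pi()) * square(sigma[q]) * square(d))
                      * cos(2 * pi() * mu[q] * d);
        }
        K[i, j] = kij;
        K[j, i] = kij;     // fill both triangles so K comes out exactly symmetric
      }
    }
    return K;
  }
}

You would use it the same way as gp_exp_quad_cov in the model above: compute the matrix in the transformed parameters block, add jitter to the diagonal, and take the Cholesky decomposition.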

Samples from a GP with a Dirichlet process spectral mixture kernel.

Now we have a way of making kernels with periodicity on several different time scales at once. And, like before, we can have our model infer the appropriate time scales and frequencies for us, if we want. And, hey, how about a Dirichlet process prior for our mixture? I mean, why not?

A problem

I tried this out with a toy model, and oh boy, Stan really did not like the spectral mixture kernel. Divergent transitions out the wazoo, gnarly traceplots, rock-bottom effective sample sizes, and … floating-point arithmetic issues? I don’t know, I was getting slightly asymmetric covariance matrices somehow. Possibly I programmed it wrong, but I don’t think that’s the issue. My hypothesis is that the posterior over the kernel parameters has lots of modes related by aliasing: if the observations are evenly spaced, a frequency \(\mu_q\) and its aliases fit them equally well. That sort of thing always makes Stan angry.

Can we fix it?

I’m not sure. The original paper that proposed the spectral mixture kernel used maximum likelihood to get a point estimate. I’ll have to do more research to see if anyone has successfully used this kernel with MCMC. Or I programmed it wrong and there’s no problem after all. I think in any case we can say the spectral mixture kernel is powerful and expressive, but not easy to work with.