Gaussian processes

Lecture 9

Course: DS-GA 1018: Probabilistic Time Series Analysis
Date: November 5, 2025 (Wednesday)
Name:
NYU NetID:

Gaussian process regression

So far, our modeling of time-series data has focused on proposing a specific probabilistic structure and then fitting the parameters of the model induced by that structure. That is, given the parameters of our model \boldsymbol{\theta} and some observed dataset \boldsymbol{X}, we calculate:

\begin{equation*} p(\boldsymbol{\theta} \mid \boldsymbol{X})=\frac{p(\boldsymbol{X} \mid \boldsymbol{\theta}) p(\boldsymbol{\theta})}{p(\boldsymbol{X})} \tag{9.1} \end{equation*}

and then we use this posterior on our model parameter to predict future data \boldsymbol{X}_{\star} :

\begin{equation*} p\left(\boldsymbol{X}_{\star} \mid \boldsymbol{X}\right)=\int p\left(\boldsymbol{X}_{\star} \mid \boldsymbol{\theta}\right) p(\boldsymbol{\theta} \mid \boldsymbol{X}) d \boldsymbol{\theta} \tag{9.2} \end{equation*}

What if, instead of picking a specific parametric form for our function, we marginalized over functions themselves:

\begin{equation*} p\left(\boldsymbol{X}_{\star} \mid \boldsymbol{X}\right)=\int p\left(\boldsymbol{X}_{\star} \mid f\right) p(f \mid \boldsymbol{X}) d f \tag{9.3} \end{equation*}

This may seem like a nearly impossible task, but it turns out that there are tools to accomplish this inference. In this lecture we will focus on one of the most popular and powerful options: Gaussian processes (GPs). The key insight behind the GP method is that we do not need to represent a distribution over a function, only the distribution over the function’s values at an arbitrary and finite set of points. That is, we need to specify:

\begin{equation*} p\left(f\left(\boldsymbol{x}_{1}\right), f\left(\boldsymbol{x}_{2}\right), \ldots, f\left(\boldsymbol{x}_{T}\right)\right), \tag{9.4} \end{equation*}

for a set of points \boldsymbol{x}_{1}, \ldots, \boldsymbol{x}_{T}. Note that we are shifting our notation a bit here: \boldsymbol{x}_{1} is an input of arbitrary dimension, of which only one dimension may be our time t_{1}. The GP assumes that this distribution is jointly Gaussian with some mean \boldsymbol{\mu}(\boldsymbol{X}) and covariance \boldsymbol{\Sigma}(\boldsymbol{X}) of the form:

\begin{equation*} \Sigma_{i j}=\kappa\left(\boldsymbol{x}_{i}, \boldsymbol{x}_{j}\right), \tag{9.5} \end{equation*}

where \kappa is some positive-definite kernel function. Our prior assumptions about the distribution of functions stem from our specification of the mean and covariance of our GP. As we will see, it is common to specify a mean function \boldsymbol{\mu}(\boldsymbol{x})=0, so in practice the kernel function we assume will often be the lens through which we interpret our functional prior.

So what does it mean for this to be a prior? How would we draw from this prior? Well, in the absence of data, our prior over our function values at \boldsymbol{x}_{1: T} is simply (Equation 9.6):

\begin{align*} p\left(\left[\begin{array}{c} f\left(\boldsymbol{x}_{1}\right) \\ \vdots \\ f\left(\boldsymbol{x}_{T}\right) \end{array}\right]\right) &=\mathcal{N}\left(\left[\begin{array}{c} \mu\left(\boldsymbol{x}_{1}\right) \\ \vdots \\ \mu\left(\boldsymbol{x}_{T}\right) \end{array}\right],\left[\begin{array}{cccc} \kappa\left(\boldsymbol{x}_{1}, \boldsymbol{x}_{1}\right) & \kappa\left(\boldsymbol{x}_{1}, \boldsymbol{x}_{2}\right) & \ldots & \kappa\left(\boldsymbol{x}_{1}, \boldsymbol{x}_{T}\right) \\ \vdots & \vdots & \vdots & \vdots \\ \kappa\left(\boldsymbol{x}_{T}, \boldsymbol{x}_{1}\right) & \kappa\left(\boldsymbol{x}_{T}, \boldsymbol{x}_{2}\right) & \ldots & \kappa\left(\boldsymbol{x}_{T}, \boldsymbol{x}_{T}\right) \end{array}\right]\right) \tag{9.6} \end{align*}

The marginals of this distribution may seem fairly boring, but samples from our prior can actually exhibit some rich structure from the covariance between points.
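To make this concrete, here is a minimal sketch (an illustration, not part of the lecture) of drawing samples from a zero-mean GP prior via Equation 9.6, using the squared-exponential kernel introduced later in Equation 9.32. The small diagonal `jitter` term is a standard numerical fix, not part of the model.

```python
import numpy as np

def kappa_se(x1, x2, sigma_f=1.0, length=1.0):
    """Squared-exponential kernel on two 1-D input arrays."""
    sq_dist = (x1[:, None] - x2[None, :]) ** 2
    return sigma_f**2 * np.exp(-sq_dist / (2 * length**2))

rng = np.random.default_rng(0)
x = np.linspace(0.0, 5.0, 100)             # the evaluation points x_{1:T}
K = kappa_se(x, x) + 1e-10 * np.eye(100)   # prior covariance plus jitter
mu = np.zeros(100)                         # zero mean function

# Each row is one draw of (f(x_1), ..., f(x_T)) from the prior, Eq. 9.6
samples = rng.multivariate_normal(mu, K, size=3)
```

Although each marginal is just a one-dimensional Gaussian, the strong off-diagonal covariance between nearby points makes each sample a smooth curve.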

Noiseless prediction

Let’s assume that we have a set of observations X=\left\{\boldsymbol{x}_{1: T}\right\} and \left\{f_{1: T}\right\}. That is, we have the observed function values at T points in our space. Let’s also start by assuming that our function is deterministic and noiseless: if we observed f(\boldsymbol{x}) for a given \boldsymbol{x} more than once, we would always get the same value. Given our GP model with corresponding kernel and mean, we would like to know the posterior over a set of future observations X_{\star}=\left\{\boldsymbol{x}_{1: N}^{\star}\right\}.

Let’s consider the full distribution:

p\left(\left[\begin{array}{c} \boldsymbol{f}(\boldsymbol{X}) \tag{9.7}\\ \boldsymbol{f}\left(\boldsymbol{X}_{\star}\right) \end{array}\right]\right)=\mathcal{N}\left(\left[\begin{array}{c} \boldsymbol{\mu}(\boldsymbol{X}) \\ \boldsymbol{\mu}\left(\boldsymbol{X}_{\star}\right) \end{array}\right],\left[\begin{array}{cc} \boldsymbol{K} & \boldsymbol{K}_{\star} \\ \left(\boldsymbol{K}_{\star}\right)^{T} & \boldsymbol{K}_{\star \star} \end{array}\right]\right)

where \boldsymbol{K}=\kappa(\boldsymbol{X}, \boldsymbol{X}), \boldsymbol{K}_{\star}=\kappa\left(\boldsymbol{X}, \boldsymbol{X}_{\star}\right), \boldsymbol{K}_{\star \star}=\kappa\left(\boldsymbol{X}_{\star}, \boldsymbol{X}_{\star}\right). We want to calculate:

\begin{equation*} p\left(\boldsymbol{f}\left(\boldsymbol{X}_{\star}\right) \mid \boldsymbol{f}(\boldsymbol{X}), \boldsymbol{X}, \boldsymbol{X}_{\star}\right) . \tag{9.8} \end{equation*}

This is just our trusty conditional Gaussian distribution! If we recall the equations:

\begin{align*} p\left(\boldsymbol{x}_{a} \mid \boldsymbol{x}_{b}\right) & =\mathcal{N}\left(\boldsymbol{x}_{a} \mid \boldsymbol{\mu}_{a \mid b}, \boldsymbol{\Sigma}_{a \mid b}\right) \tag{9.9}\\ \boldsymbol{\mu}_{a \mid b} & =\boldsymbol{\mu}_{a}+\boldsymbol{\Sigma}_{a b} \boldsymbol{\Sigma}_{b b}^{-1}\left(\boldsymbol{x}_{b}-\boldsymbol{\mu}_{b}\right) \tag{9.10}\\ \boldsymbol{\Sigma}_{a \mid b} & =\boldsymbol{\Sigma}_{a a}-\boldsymbol{\Sigma}_{a b} \boldsymbol{\Sigma}_{b b}^{-1} \boldsymbol{\Sigma}_{b a} \tag{9.11} \end{align*}

we can apply them to our GP posterior prediction:

\begin{align*} p\left(\boldsymbol{f}\left(\boldsymbol{X}_{\star}\right) \mid \boldsymbol{f}(\boldsymbol{X}), \boldsymbol{X}, \boldsymbol{X}_{\star}\right) & =\mathcal{N}\left(\boldsymbol{\mu}_{\star}, \boldsymbol{\Sigma}_{\star}\right) \tag{9.12}\\ \boldsymbol{\mu}_{\star} & =\boldsymbol{\mu}\left(\boldsymbol{X}_{\star}\right)+\boldsymbol{K}_{\star}^{T} \boldsymbol{K}^{-1}(\boldsymbol{f}(\boldsymbol{X})-\boldsymbol{\mu}(\boldsymbol{X})) \tag{9.13}\\ \Sigma_{\star} & =\boldsymbol{K}_{\star \star}-\boldsymbol{K}_{\star}^{T} \boldsymbol{K}^{-1} \boldsymbol{K}_{\star} . \tag{9.14} \end{align*}

If we make the common choice of setting \mu(\boldsymbol{x})=0 we get:

\begin{align*} p\left(\boldsymbol{f}\left(\boldsymbol{X}_{\star}\right) \mid \boldsymbol{f}(\boldsymbol{X}), \boldsymbol{X}, \boldsymbol{X}_{\star}\right) & =\mathcal{N}\left(\boldsymbol{\mu}_{\star}, \boldsymbol{\Sigma}_{\star}\right) \tag{9.15}\\ \boldsymbol{\mu}_{\star} & =\boldsymbol{K}_{\star}^{T} \boldsymbol{K}^{-1} \boldsymbol{f}(\boldsymbol{X}) \tag{9.16}\\ \boldsymbol{\Sigma}_{\star} & =\boldsymbol{K}_{\star \star}-\boldsymbol{K}_{\star}^{T} \boldsymbol{K}^{-1} \boldsymbol{K}_{\star} . \tag{9.17} \end{align*}

Let’s do a quick exercise to build intuition with these equations. At the outset, we made the assumption that our observations were noiseless. If that’s true, then setting \boldsymbol{X}=\boldsymbol{X}_{\star}=\{\boldsymbol{x}\} in the above equations should result in no covariance and a mean value equal to f(\boldsymbol{x}). Well, in this case \boldsymbol{K}=\boldsymbol{K}_{\star}=\boldsymbol{K}_{\star \star} and our equations reduce to:

\begin{align*} \boldsymbol{\mu}_{\star} & =\boldsymbol{f}(\boldsymbol{X}) \tag{9.18}\\ \boldsymbol{\Sigma}_{\star} & =0 \tag{9.19} \end{align*}

In general, if we assume no noise, our GP will interpolate between the data. This means that at each datapoint in the training set, the GP’s posterior is exactly the observed value for that datapoint.
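The noiseless posterior of Equations 9.16–9.17 is short enough to implement directly. The sketch below (using the squared-exponential kernel and a plain matrix inverse for clarity; in practice one would use a Cholesky solve) confirms the interpolation property at the training points.

```python
import numpy as np

def kappa_se(x1, x2, sigma_f=1.0, length=1.0):
    return sigma_f**2 * np.exp(-((x1[:, None] - x2[None, :]) ** 2) / (2 * length**2))

def gp_predict_noiseless(x_train, f_train, x_star):
    """Zero-mean noiseless GP posterior, Eqs. 9.16-9.17."""
    K = kappa_se(x_train, x_train) + 1e-10 * np.eye(len(x_train))  # jitter
    K_star = kappa_se(x_train, x_star)
    K_ss = kappa_se(x_star, x_star)
    K_inv = np.linalg.inv(K)
    mu_star = K_star.T @ K_inv @ f_train
    Sigma_star = K_ss - K_star.T @ K_inv @ K_star
    return mu_star, Sigma_star

x_train = np.array([0.0, 1.0, 2.5])
f_train = np.sin(x_train)
mu, Sigma = gp_predict_noiseless(x_train, f_train, x_train)
# At the training points the posterior interpolates: mu ≈ f_train, Sigma ≈ 0
```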

Noisy prediction

The next step is to introduce the concept of noise to our GP. As we have seen with our HMMs, we will often want to think of our observations as a noisy realization of some latent process. In that case, our observational dataset is now X=\left\{\boldsymbol{x}_{1: T}\right\} and \left\{y_{1: T}\right\} where:

\begin{equation*} y_{t}=f\left(\boldsymbol{x}_{t}\right)+w_{t} . \tag{9.20} \end{equation*}

We will assume that w_{t} is white noise, that is, w_{t} \sim \mathcal{N}\left(0, \sigma_{n}^{2}\right). The distribution of our observations \boldsymbol{y} is now:

\begin{equation*} p(\boldsymbol{y})=\mathcal{N}\left(\boldsymbol{\mu}(\boldsymbol{X}), \boldsymbol{K}+\sigma_{n}^{2} \mathbb{I}\right) \tag{9.21} \end{equation*}

The noise term appears as a scaled identity matrix because we have assumed no covariance between the noise in our observations.

We can now repeat our previous derivation with this noise, starting from the joint distribution:

p\left(\left[\begin{array}{c} \boldsymbol{y} \tag{9.22}\\ \boldsymbol{f}\left(\boldsymbol{X}_{\star}\right) \end{array}\right]\right)=\mathcal{N}\left(\left[\begin{array}{c} \boldsymbol{\mu}(\boldsymbol{X}) \\ \boldsymbol{\mu}\left(\boldsymbol{X}_{\star}\right) \end{array}\right],\left[\begin{array}{cc} \boldsymbol{K}+\sigma_{n}^{2} \mathbb{I} & \boldsymbol{K}_{\star} \\ \left(\boldsymbol{K}_{\star}\right)^{T} & \boldsymbol{K}_{\star \star} \end{array}\right]\right)

Our conditional distribution is therefore:

\begin{align*} p\left(\boldsymbol{f}\left(\boldsymbol{X}_{\star}\right) \mid \boldsymbol{y}, \boldsymbol{X}, \boldsymbol{X}_{\star}\right) & =\mathcal{N}\left(\boldsymbol{\mu}_{\star}, \boldsymbol{\Sigma}_{\star}\right) \tag{9.23}\\ \boldsymbol{\mu}_{\star} & =\boldsymbol{\mu}\left(\boldsymbol{X}_{\star}\right)+\boldsymbol{K}_{\star}^{T}\left(\boldsymbol{K}+\sigma_{n}^{2} \mathbb{I}\right)^{-1}(\boldsymbol{y}-\boldsymbol{\mu}(\boldsymbol{X})) \tag{9.24}\\ \boldsymbol{\Sigma}_{\star} & =\boldsymbol{K}_{\star \star}-\boldsymbol{K}_{\star}^{T}\left(\boldsymbol{K}+\sigma_{n}^{2} \mathbb{I}\right)^{-1} \boldsymbol{K}_{\star} . \tag{9.25} \end{align*}

Let’s build some intuition for these noisy posterior equations, starting with the case of infinite noise in our data, \sigma_{n} \rightarrow \infty :

\begin{align*} p\left(\boldsymbol{f}\left(\boldsymbol{X}_{\star}\right) \mid \boldsymbol{y}, \boldsymbol{X}, \boldsymbol{X}_{\star}\right) & =\mathcal{N}\left(\boldsymbol{\mu}_{\star}, \boldsymbol{\Sigma}_{\star}\right) \tag{9.26}\\ \boldsymbol{\mu}_{\star} & =\boldsymbol{\mu}\left(\boldsymbol{X}_{\star}\right) \tag{9.27}\\ \boldsymbol{\Sigma}_{\star} & =\boldsymbol{K}_{\star \star} \tag{9.28} \end{align*}

This is just the prior distribution we started with in Equation 9.6: if our data is infinitely noisy, it does not update our beliefs about the function. What if we instead assume that \boldsymbol{X}=\boldsymbol{X}_{\star}=\boldsymbol{x} again? Plugging into our equations we find:

\begin{align*} p\left(\boldsymbol{f}\left(\boldsymbol{X}_{\star}\right) \mid \boldsymbol{y}, \boldsymbol{X}, \boldsymbol{X}_{\star}\right) & =\mathcal{N}\left(\boldsymbol{\mu}_{\star}, \boldsymbol{\Sigma}_{\star}\right) \tag{9.29}\\ \boldsymbol{\mu}_{\star} & =\boldsymbol{\mu}(\boldsymbol{X})+\boldsymbol{K}^{T}\left(\boldsymbol{K}+\sigma_{n}^{2} \mathbb{I}\right)^{-1}(\boldsymbol{y}-\boldsymbol{\mu}(\boldsymbol{X})) \tag{9.30}\\ \boldsymbol{\Sigma}_{\star} & =\boldsymbol{K}-\boldsymbol{K}^{T}\left(\boldsymbol{K}+\sigma_{n}^{2} \mathbb{I}\right)^{-1} \boldsymbol{K} . \tag{9.31} \end{align*}

With noise, our model no longer reproduces the observations exactly, nor does it assign zero covariance at the observed points. The figures presented in the lecture slides build further intuition for these equations.
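A sketch of the noisy posterior of Equations 9.24–9.25 (illustrative; again using a plain inverse and the squared-exponential kernel) makes the loss of interpolation visible at the training points themselves.

```python
import numpy as np

def kappa_se(x1, x2, sigma_f=1.0, length=1.0):
    return sigma_f**2 * np.exp(-((x1[:, None] - x2[None, :]) ** 2) / (2 * length**2))

def gp_predict_noisy(x_train, y_train, x_star, sigma_n):
    """Zero-mean noisy GP posterior, Eqs. 9.24-9.25."""
    K = kappa_se(x_train, x_train)
    K_star = kappa_se(x_train, x_star)
    K_ss = kappa_se(x_star, x_star)
    A = np.linalg.inv(K + sigma_n**2 * np.eye(len(x_train)))
    mu_star = K_star.T @ A @ y_train
    Sigma_star = K_ss - K_star.T @ A @ K_star
    return mu_star, Sigma_star

x_train = np.array([0.0, 1.0, 2.5])
y_train = np.sin(x_train)
mu, Sigma = gp_predict_noisy(x_train, y_train, x_train, sigma_n=0.3)
# With noise the posterior no longer interpolates: mu is pulled toward the
# zero prior mean and the posterior variance stays strictly positive.
```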

Impact of kernel parameters

Let us quickly introduce a popular choice of kernel, the squared-exponential (SE) kernel:

\begin{equation*} \kappa_{\mathrm{SE}}\left(\boldsymbol{x}, \boldsymbol{x}^{\prime}\right)=\sigma_{f}^{2} \exp \left[\frac{-\left(\boldsymbol{x}-\boldsymbol{x}^{\prime}\right)^{T}\left(\boldsymbol{x}-\boldsymbol{x}^{\prime}\right)}{2 l^{2}}\right] \tag{9.32} \end{equation*}

We can think of the parameter \sigma_{f} as controlling the diagonal elements of our covariance matrix and therefore the marginal uncertainty. The parameter l controls the horizontal length scale over which we allow the function to vary. If we include a noise term, our covariance is:

\begin{equation*} \Sigma_{i, j}=\kappa_{\mathrm{SE}}\left(\boldsymbol{x}_{i}, \boldsymbol{x}_{j}\right)+\sigma_{n}^{2} \delta_{\boldsymbol{x}_{i}, \boldsymbol{x}_{j}} \tag{9.33} \end{equation*}
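A few direct evaluations (an illustrative sketch, not from the slides) show how the two SE parameters behave: \sigma_{f} sets the marginal variance on the diagonal, while l controls how quickly correlation decays with distance.

```python
import numpy as np

def kappa_se(x1, x2, sigma_f=1.0, length=1.0):
    """SE kernel, Eq. 9.32, for scalar inputs."""
    return sigma_f**2 * np.exp(-((x1 - x2) ** 2) / (2 * length**2))

# sigma_f sets the marginal scale: kappa(x, x) = sigma_f**2
marginal = kappa_se(0.0, 0.0, sigma_f=2.0)   # -> 4.0

# length sets how fast correlation decays with |x - x'|
short = kappa_se(0.0, 1.0, length=0.5)       # exp(-2)    ~ 0.135
long_ = kappa_se(0.0, 1.0, length=5.0)       # exp(-0.02) ~ 0.980
```

A short length scale makes points one unit apart nearly independent, so prior samples wiggle rapidly; a long length scale keeps them strongly correlated, so samples vary slowly.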

In the lecture slides, we provide some visualizations of how varying the kernel parameters changes our inferred GP posteriors.

Evaluating and fitting kernel parameters

It is clear that the choice of kernel and kernel parameters has a sizable impact on the posterior distributions we infer. Logically, we may want to calculate the likelihood of our kernel parameters given the observed data. We can then use this to get the MAP or MLE estimate for the parameters of our kernel. If we call our kernel parameters \boldsymbol{\theta}, what we want to maximize is the likelihood of our observed data:

\begin{equation*} p(\boldsymbol{y} \mid \boldsymbol{X}, \boldsymbol{\theta})=\int p(\boldsymbol{y} \mid \boldsymbol{f}) p(\boldsymbol{f} \mid \boldsymbol{X}, \boldsymbol{\theta}) d \boldsymbol{f} \tag{9.34} \end{equation*}

Here we are being explicit about our observation noise p(\boldsymbol{y} \mid \boldsymbol{f})=\mathcal{N}\left(\boldsymbol{f}, \sigma_{n}^{2} \mathbb{I}\right). This is just an integral over two Gaussians, which gives:

\begin{equation*} \log p(\boldsymbol{y} \mid \boldsymbol{X}, \boldsymbol{\theta})=-\frac{1}{2} \boldsymbol{y}^{T}\left(\boldsymbol{K}(\boldsymbol{\theta})+\sigma_{n}^{2} \mathbb{I}\right)^{-1} \boldsymbol{y}-\frac{1}{2} \log \left|\boldsymbol{K}(\boldsymbol{\theta})+\sigma_{n}^{2} \mathbb{I}\right|-\frac{T}{2} \log (2 \pi), \tag{9.35} \end{equation*}

where \boldsymbol{K}(\boldsymbol{\theta}) makes explicit our kernel’s dependence on the kernel parameters \boldsymbol{\theta}. Notice that this likelihood has two important terms: the first rewards fitting the data well, while the second penalizes the complexity of our functions. In the lecture slides we show an example for the SE kernel’s parameter l.
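Equation 9.35 can be sketched as follows (an illustration on synthetic sine data, which is an assumption for this example; a Cholesky factorization replaces the explicit inverse and determinant for numerical stability). Scanning the SE length scale l shows the data-fit / complexity trade-off: very short and very long length scales both score worse than a moderate one.

```python
import numpy as np

def kappa_se(x1, x2, sigma_f=1.0, length=1.0):
    return sigma_f**2 * np.exp(-((x1[:, None] - x2[None, :]) ** 2) / (2 * length**2))

def log_marginal_likelihood(x, y, sigma_f, length, sigma_n):
    """Eq. 9.35 for the SE kernel, computed via a Cholesky factorization."""
    T = len(x)
    C = kappa_se(x, x, sigma_f, length) + sigma_n**2 * np.eye(T)
    L = np.linalg.cholesky(C)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # C^{-1} y
    log_det = 2 * np.sum(np.log(np.diag(L)))             # log |C|
    return -0.5 * y @ alpha - 0.5 * log_det - 0.5 * T * np.log(2 * np.pi)

rng = np.random.default_rng(1)
x = np.linspace(0.0, 5.0, 30)
y = np.sin(x) + 0.1 * rng.standard_normal(30)

# Scan the length scale: the data-fit and complexity terms trade off
lls = {l: log_marginal_likelihood(x, y, 1.0, l, 0.1) for l in (0.01, 1.0, 100.0)}
```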


Gaussian Process Kernels

So far, we have framed our understanding of GPs in the function space. This framing has allowed us to construct priors, predictions, and data likelihoods with our GP models, but it is not a sufficient framework to understand the assumptions built into a choice of kernel. It turns out that the weight-space representation of the GP will be useful. Let’s imagine that we have some model defined by the linear combination of M basis functions:

\begin{equation*} y(x)=\boldsymbol{w}^{T} \boldsymbol{\phi}(x), \tag{9.36} \end{equation*}

where \boldsymbol{\phi}(x)_{m} is our m^{\text {th }} basis function and w_{m} is its weight. One concrete example is polynomials of order 2 for which:

\boldsymbol{\phi}(x)=\left[\begin{array}{c} 1 \tag{9.37}\\ x \\ x^{2} \end{array}\right]

and the weights determine the coefficients of each term in our polynomial. The prior over y will be determined by the prior over \boldsymbol{w}, which we will take to be:

\begin{equation*} p(\boldsymbol{w})=\mathcal{N}\left(0, \sigma_{w}^{2} \mathbb{I}\right) \tag{9.38} \end{equation*}

As with the function-space view, we are interested in evaluating y at some set of points x_{1}, \ldots, x_{T}. We can write this as:

\begin{equation*} \boldsymbol{y}=\boldsymbol{\Phi} \boldsymbol{w} \tag{9.39} \end{equation*}

where y_{t}=y\left(x_{t}\right) and \Phi_{t k}=\phi_{k}\left(x_{t}\right). \boldsymbol{y} is a linear combination of Gaussian random variables, so it is Gaussian. We therefore just require the mean and covariance of \boldsymbol{y} to describe its distribution:

\begin{align*} \boldsymbol{\mu}_{\boldsymbol{y}} & =\mathbb{E}[\boldsymbol{y}] \tag{9.40}\\ & =\boldsymbol{\Phi} \mathbb{E}[\boldsymbol{w}] \tag{9.41}\\ & =0 \tag{9.42}\\ \boldsymbol{\Sigma}_{\boldsymbol{y}} & =\mathbb{E}\left[\boldsymbol{y} \boldsymbol{y}^{T}\right] \tag{9.43}\\ & =\boldsymbol{\Phi} \mathbb{E}\left[\boldsymbol{w} \boldsymbol{w}^{T}\right] \boldsymbol{\Phi}^{T} \tag{9.44}\\ & =\sigma_{w}^{2} \boldsymbol{\Phi} \boldsymbol{\Phi}^{T} \tag{9.45} \end{align*}

Let’s take this result back to the function view. There we asserted that there was some distribution over functions (without specifying what functions) with a mean and a covariance. Here, we have a distribution over polynomial functions of order 2, and we found it has mean 0 and a covariance matrix with the form:

\begin{equation*} \boldsymbol{K}=\sigma_{w}^{2} \boldsymbol{\Phi} \boldsymbol{\Phi}^{T} \tag{9.46} \end{equation*}

Notice that:

\begin{equation*} K_{i j}=\kappa\left(x_{i}, x_{j}\right)=\sigma_{w}^{2} \boldsymbol{\phi}\left(x_{i}\right)^{T} \boldsymbol{\phi}\left(x_{j}\right) . \tag{9.47} \end{equation*}

If we take \sigma_{w}=1 we can write:

\begin{align*} \kappa\left(x_{i}, x_{j}\right) & =\boldsymbol{\phi}\left(x_{i}\right)^{T} \boldsymbol{\phi}\left(x_{j}\right) \tag{9.48}\\ & =\left[\begin{array}{lll} 1 & x_{i} & x_{i}^{2} \end{array}\right]\left[\begin{array}{c} 1 \\ x_{j} \\ x_{j}^{2} \end{array}\right] \tag{9.49}\\ & =x_{i}^{2} x_{j}^{2}+x_{i} x_{j}+1 \tag{9.50} \end{align*}

Using this function as our kernel for our GP gives us the posterior over order-two polynomials given our observations.
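The equivalence between the feature-space inner product (Eq. 9.48) and the closed-form kernel (Eq. 9.50) is easy to check numerically; the sketch below does so for one illustrative pair of inputs.

```python
import numpy as np

def phi(x):
    """Order-2 polynomial feature vector, Eq. 9.37."""
    return np.array([1.0, x, x**2])

def kappa_poly(xi, xj):
    """The closed-form kernel of Eq. 9.50 (sigma_w = 1)."""
    return xi**2 * xj**2 + xi * xj + 1.0

xi, xj = 0.7, -1.3
lhs = phi(xi) @ phi(xj)   # feature-space inner product
rhs = kappa_poly(xi, xj)  # kernel evaluation; the two agree
```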

For a kernel function \kappa to produce a valid kernel, it must lead to a positive semi-definite matrix \boldsymbol{K}. In general, one can show that it is necessary and sufficient for a valid kernel function to have a feature representation \boldsymbol{\phi} such that:

\begin{equation*} \kappa\left(\boldsymbol{x}, \boldsymbol{x}^{\prime}\right)=\boldsymbol{\phi}(\boldsymbol{x})^{T} \boldsymbol{\phi}\left(\boldsymbol{x}^{\prime}\right) . \tag{9.51} \end{equation*}

However, this feature representation is neither unique nor necessarily finite-dimensional. The squared-exponential kernel we introduced earlier can be shown to have an infinite-dimensional feature representation.

We now have two ways of thinking about kernels. We can either specify the specific feature vector of interest \boldsymbol{\phi} and then calculate the implied kernel \kappa\left(\boldsymbol{x}, \boldsymbol{x}^{\prime}\right), or we can start from the kernel function and reconstruct the feature vector it implies. We will generally want to do the latter, since many interesting kernels will have an infinite-dimensional feature vector. This is actually a fascinating feature of the GP: we can predict the posterior over the function space of an infinite-dimensional feature vector without ever calculating those features. Working in the kernel / function space rather than the feature / weight space is often called the kernel trick.

If we want to guarantee that a function is a valid kernel function we must either show that the produced matrix is positive semi-definite, or show that it can be written in the form of Equation 9.51. However, we can also use functions that we already know to be valid to generate new valid kernel functions. Here is a list of properties you can use to generate a new valid kernel \kappa^{\prime} from two valid kernels \kappa_{1}, \kappa_{2} :

\begin{align*} \kappa^{\prime}\left(\boldsymbol{x}, \boldsymbol{x}^{\prime}\right) & =c \kappa_{1}\left(\boldsymbol{x}, \boldsymbol{x}^{\prime}\right) & & \text { constant } c \ge 0 \tag{9.52}\\ \kappa^{\prime}\left(\boldsymbol{x}, \boldsymbol{x}^{\prime}\right) & =g(\boldsymbol{x}) \kappa_{1}\left(\boldsymbol{x}, \boldsymbol{x}^{\prime}\right) g\left(\boldsymbol{x}^{\prime}\right) & & \text { function } g \tag{9.53}\\ \kappa^{\prime}\left(\boldsymbol{x}, \boldsymbol{x}^{\prime}\right) & =q\left(\kappa_{1}\left(\boldsymbol{x}, \boldsymbol{x}^{\prime}\right)\right) & & \text { polynomial } q \text { with nonnegative coefficients } \tag{9.54}\\ \kappa^{\prime}\left(\boldsymbol{x}, \boldsymbol{x}^{\prime}\right) & =\exp \left(\kappa_{1}\left(\boldsymbol{x}, \boldsymbol{x}^{\prime}\right)\right) & & \text { exponential } \tag{9.55}\\ \kappa^{\prime}\left(\boldsymbol{x}, \boldsymbol{x}^{\prime}\right) & =\kappa_{1}\left(\boldsymbol{x}, \boldsymbol{x}^{\prime}\right)+\kappa_{2}\left(\boldsymbol{x}, \boldsymbol{x}^{\prime}\right) & & \text { sum } \tag{9.56}\\ \kappa^{\prime}\left(\boldsymbol{x}, \boldsymbol{x}^{\prime}\right) & =\kappa_{1}\left(\boldsymbol{x}, \boldsymbol{x}^{\prime}\right) \times \kappa_{2}\left(\boldsymbol{x}, \boldsymbol{x}^{\prime}\right) . & & \text { product } \tag{9.57} \end{align*}

We can also divide the input \boldsymbol{x}=\left[\boldsymbol{x}_{a}, \boldsymbol{x}_{b}\right] and start from two valid kernels, \kappa_{a} on the subset a and \kappa_{b} on the subset b :

\begin{align*} \kappa^{\prime}\left(\boldsymbol{x}, \boldsymbol{x}^{\prime}\right) & =\kappa_{a}\left(\boldsymbol{x}_{a}, \boldsymbol{x}_{a}^{\prime}\right)+\kappa_{b}\left(\boldsymbol{x}_{b}, \boldsymbol{x}_{b}^{\prime}\right) & & \text { sum of subsets } \tag{9.58}\\ \kappa^{\prime}\left(\boldsymbol{x}, \boldsymbol{x}^{\prime}\right) & =\kappa_{a}\left(\boldsymbol{x}_{a}, \boldsymbol{x}_{a}^{\prime}\right) \times \kappa_{b}\left(\boldsymbol{x}_{b}, \boldsymbol{x}_{b}^{\prime}\right) . & & \text { product of subsets } \tag{9.59} \end{align*}
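A few of these closure properties can be spot-checked numerically by confirming the smallest eigenvalue of the resulting Gram matrix is non-negative (up to floating-point error). This sketch is illustrative; the linear kernel `kappa_lin` is a hypothetical second kernel chosen only because its feature map \phi(x)=x is obvious.

```python
import numpy as np

def kappa_se(x1, x2, length=1.0):
    return np.exp(-((x1[:, None] - x2[None, :]) ** 2) / (2 * length**2))

def kappa_lin(x1, x2):
    return x1[:, None] * x2[None, :]  # implied features: phi(x) = x

def min_eig(K):
    return float(np.min(np.linalg.eigvalsh(K)))

x = np.linspace(-2.0, 2.0, 20)
K1, K2 = kappa_se(x, x), kappa_lin(x, x)

# Sums, elementwise (Hadamard) products, and exponentials of valid kernels
# should all remain positive semi-definite (cf. Eqs. 9.55-9.57)
checks = [min_eig(K1 + K2), min_eig(K1 * K2), min_eig(np.exp(K1))]
```

A numerical check is of course not a proof, but it is a cheap way to catch an invalid candidate kernel before using it in a GP.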

Let’s quickly prove Equation 9.52 and Equation 9.53 and leave the rest as an exercise. First we note that, since \kappa_{1} is a valid kernel, there must exist a feature vector such that:

\begin{equation*} \kappa_{1}\left(\boldsymbol{x}, \boldsymbol{x}^{\prime}\right)=\phi(\boldsymbol{x})^{T} \phi\left(\boldsymbol{x}^{\prime}\right) . \tag{9.60} \end{equation*}

For the product by a constant we have:

\begin{align*} \kappa^{\prime}\left(\boldsymbol{x}, \boldsymbol{x}^{\prime}\right) & =c \kappa_{1}\left(\boldsymbol{x}, \boldsymbol{x}^{\prime}\right) \tag{9.61}\\ & =c \phi(\boldsymbol{x})^{T} \phi\left(\boldsymbol{x}^{\prime}\right) \tag{9.62}\\ & =\phi^{\prime}(\boldsymbol{x})^{T} \phi^{\prime}\left(\boldsymbol{x}^{\prime}\right) \tag{9.63}\\ \phi^{\prime}(\boldsymbol{x}) & =\sqrt{c} \phi(\boldsymbol{x}) \tag{9.64} \end{align*}

Since we can express our candidate function as a scalar product in some feature space \phi^{\prime}(\boldsymbol{x}), we have shown that we have a valid kernel. For multiplication by a function, the proof follows almost identically:

\begin{align*} \kappa^{\prime}\left(\boldsymbol{x}, \boldsymbol{x}^{\prime}\right) & =g(\boldsymbol{x}) \kappa_{1}\left(\boldsymbol{x}, \boldsymbol{x}^{\prime}\right) g\left(\boldsymbol{x}^{\prime}\right) \tag{9.65}\\ & =g(\boldsymbol{x}) \phi(\boldsymbol{x})^{T} \phi\left(\boldsymbol{x}^{\prime}\right) g\left(\boldsymbol{x}^{\prime}\right) \tag{9.66}\\ & =\phi^{\prime}(\boldsymbol{x})^{T} \phi^{\prime}\left(\boldsymbol{x}^{\prime}\right) \tag{9.67}\\ \phi^{\prime}(\boldsymbol{x}) & =g(\boldsymbol{x}) \phi(\boldsymbol{x}) \tag{9.68} \end{align*}

There are two more concepts that you may run into when choosing between kernels that are worth briefly defining:


Gaussian Process Strengths and Weaknesses

With this introduction to Gaussian Processes under our belt, it is useful to discuss a few of their strengths and weaknesses at a high level. Let’s start with their strengths:

This set of strengths has driven the popularity of Gaussian processes as a tool for regression. But like any modeling tool, GPs have weaknesses: