Gaussian processes

Lecture 9

Course: DS-GA 1018: Probabilistic Time Series Analysis
Date: November 5, 2025 (Wednesday)
Name:
NYU NetID:

Gaussian process regression

So far, our modeling of time-series data has focused on proposing a specific probabilistic structure and then fitting the parameters of the model induced by that structure. That is, given the parameters of our model \boldsymbol{\theta} and some observed dataset \boldsymbol{X}, we calculate:

\begin{equation*} p(\boldsymbol{\theta} \mid \boldsymbol{X})=\frac{p(\boldsymbol{X} \mid \boldsymbol{\theta}) p(\boldsymbol{\theta})}{p(\boldsymbol{X})} \tag{9.1} \end{equation*}

and then we use this posterior on our model parameter to predict future data \boldsymbol{X}_{\star} :

\begin{equation*} p\left(\boldsymbol{X}_{\star} \mid \boldsymbol{X}\right)=\int p\left(\boldsymbol{X}_{\star} \mid \boldsymbol{\theta}\right) p(\boldsymbol{\theta} \mid \boldsymbol{X}) d \boldsymbol{\theta} \tag{9.2} \end{equation*}

What if, instead of picking a specific parametric form for our function, we marginalized over functions themselves:

\begin{equation*} p\left(\boldsymbol{X}_{\star} \mid \boldsymbol{X}\right)=\int p\left(\boldsymbol{X}_{\star} \mid f\right) p(f \mid \boldsymbol{X}) d f \tag{9.3} \end{equation*}

This may seem like a nearly impossible task, but it turns out that there are tools to accomplish this inference. In this lecture we will focus on one of the most popular and powerful options: Gaussian processes (GPs). The key insight behind the GP method is that we do not need to represent a distribution over a function, only the distribution over the function’s values at an arbitrary and finite set of points. That is, we need to specify:

\begin{equation*} p\left(f\left(\boldsymbol{x}_{1}\right), f\left(\boldsymbol{x}_{2}\right), \ldots, f\left(\boldsymbol{x}_{T}\right)\right), \tag{9.4} \end{equation*}

for a set of points \boldsymbol{x}_{1}, \ldots, \boldsymbol{x}_{T}. Note that we are shifting our notation a bit here: \boldsymbol{x}_{1} is an input of arbitrary dimension, of which only one dimension may be our time t_{1}. The GP assumes that this distribution is jointly Gaussian with some mean \boldsymbol{\mu}(\boldsymbol{X}) and covariance \boldsymbol{\Sigma}(\boldsymbol{X}) of the form:

\begin{equation*} \Sigma_{i j}=\kappa\left(\boldsymbol{x}_{i}, \boldsymbol{x}_{j}\right), \tag{9.5} \end{equation*}

where \kappa is some positive-definite kernel function. Our prior assumptions about the distribution of functions stem from our specification of the mean and covariance of our GP. As we will see, it is common to specify a mean function \boldsymbol{\mu}(\boldsymbol{x})=0, so in practice the kernel function we assume will often be the lens through which we interpret our functional prior.

So what does it mean for this to be a prior? How would we draw from this prior? Well, in the absence of data, our prior over our function values at \boldsymbol{x}_{1: T} is simply (Equation 9.6):

\begin{align*} p\left(\left[\begin{array}{c} f\left(\boldsymbol{x}_{1}\right) \\ \vdots \\ f\left(\boldsymbol{x}_{T}\right) \end{array}\right]\right) &=\mathcal{N}\left(\left[\begin{array}{c} \mu\left(\boldsymbol{x}_{1}\right) \\ \vdots \\ \mu\left(\boldsymbol{x}_{T}\right) \end{array}\right],\left[\begin{array}{cccc} \kappa\left(\boldsymbol{x}_{1}, \boldsymbol{x}_{1}\right) & \kappa\left(\boldsymbol{x}_{1}, \boldsymbol{x}_{2}\right) & \ldots & \kappa\left(\boldsymbol{x}_{1}, \boldsymbol{x}_{T}\right) \\ \vdots & \vdots & \vdots & \vdots \\ \kappa\left(\boldsymbol{x}_{T}, \boldsymbol{x}_{1}\right) & \kappa\left(\boldsymbol{x}_{T}, \boldsymbol{x}_{2}\right) & \ldots & \kappa\left(\boldsymbol{x}_{T}, \boldsymbol{x}_{T}\right) \end{array}\right]\right) \tag{9.6} \end{align*}

The marginals of this distribution may seem fairly boring, but samples from our prior can actually exhibit some rich structure from the covariance between points.
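To make this concrete, here is a minimal sketch (an illustration, not part of the lecture) of drawing samples from a zero-mean GP prior via Equation 9.6, using the squared-exponential kernel introduced later in Equation 9.32. The small diagonal `jitter` term is a standard numerical fix, not part of the model.

```python
import numpy as np

def kappa_se(x1, x2, sigma_f=1.0, length=1.0):
    """Squared-exponential kernel on two 1-D input arrays."""
    sq_dist = (x1[:, None] - x2[None, :]) ** 2
    return sigma_f**2 * np.exp(-sq_dist / (2 * length**2))

rng = np.random.default_rng(0)
x = np.linspace(0.0, 5.0, 100)             # the evaluation points x_{1:T}
K = kappa_se(x, x) + 1e-10 * np.eye(100)   # prior covariance plus jitter
mu = np.zeros(100)                         # zero mean function

# Each row is one draw of (f(x_1), ..., f(x_T)) from the prior, Eq. 9.6
samples = rng.multivariate_normal(mu, K, size=3)
```

Although each marginal is just a one-dimensional Gaussian, the strong off-diagonal covariance between nearby points makes each sample a smooth curve.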

Noiseless prediction

Let’s assume that we have a set of observations X=\left\{\boldsymbol{x}_{1: T}\right\} and \left\{f_{1: T}\right\}. That is, we have the observed function values at T points in our space. Let’s also start by assuming that our function is deterministic and noiseless: if we observed f(\boldsymbol{x}) for a given \boldsymbol{x} more than once, we would always get the same value. Given our GP model with corresponding kernel and mean, we would like to know the posterior over a set of future observations X_{\star}=\left\{\boldsymbol{x}_{1: N}^{\star}\right\}.

Let’s consider the full distribution:

p\left(\left[\begin{array}{c} \boldsymbol{f}(\boldsymbol{X}) \tag{9.7}\\ \boldsymbol{f}\left(\boldsymbol{X}_{\star}\right) \end{array}\right]\right)=\mathcal{N}\left(\left[\begin{array}{c} \boldsymbol{\mu}(\boldsymbol{X}) \\ \boldsymbol{\mu}\left(\boldsymbol{X}_{\star}\right) \end{array}\right],\left[\begin{array}{cc} \boldsymbol{K} & \boldsymbol{K}_{\star} \\ \left(\boldsymbol{K}_{\star}\right)^{T} & \boldsymbol{K}_{\star \star} \end{array}\right]\right)

where \boldsymbol{K}=\kappa(\boldsymbol{X}, \boldsymbol{X}), \boldsymbol{K}_{\star}=\kappa\left(\boldsymbol{X}, \boldsymbol{X}_{\star}\right), \boldsymbol{K}_{\star \star}=\kappa\left(\boldsymbol{X}_{\star}, \boldsymbol{X}_{\star}\right). We want to calculate:

\begin{equation*} p\left(\boldsymbol{f}\left(\boldsymbol{X}_{\star}\right) \mid \boldsymbol{f}(\boldsymbol{X}), \boldsymbol{X}, \boldsymbol{X}_{\star}\right) . \tag{9.8} \end{equation*}

This is just our trusty conditional Gaussian distribution! If we recall the equations:

\begin{align*} p\left(\boldsymbol{x}_{a} \mid \boldsymbol{x}_{b}\right) & =\mathcal{N}\left(\boldsymbol{x}_{a} \mid \boldsymbol{\mu}_{a \mid b}, \boldsymbol{\Sigma}_{a \mid b}\right) \tag{9.9}\\ \boldsymbol{\mu}_{a \mid b} & =\boldsymbol{\mu}_{a}+\boldsymbol{\Sigma}_{a b} \boldsymbol{\Sigma}_{b b}^{-1}\left(\boldsymbol{x}_{b}-\boldsymbol{\mu}_{b}\right) \tag{9.10}\\ \boldsymbol{\Sigma}_{a \mid b} & =\boldsymbol{\Sigma}_{a a}-\boldsymbol{\Sigma}_{a b} \boldsymbol{\Sigma}_{b b}^{-1} \boldsymbol{\Sigma}_{b a} \tag{9.11} \end{align*}

we can apply them to our GP posterior prediction:

\begin{align*} p\left(\boldsymbol{f}\left(\boldsymbol{X}_{\star}\right) \mid \boldsymbol{f}(\boldsymbol{X}), \boldsymbol{X}, \boldsymbol{X}_{\star}\right) & =\mathcal{N}\left(\boldsymbol{\mu}_{\star}, \boldsymbol{\Sigma}_{\star}\right) \tag{9.12}\\ \boldsymbol{\mu}_{\star} & =\boldsymbol{\mu}\left(\boldsymbol{X}_{\star}\right)+\boldsymbol{K}_{\star}^{T} \boldsymbol{K}^{-1}(\boldsymbol{f}(\boldsymbol{X})-\boldsymbol{\mu}(\boldsymbol{X})) \tag{9.13}\\ \Sigma_{\star} & =\boldsymbol{K}_{\star \star}-\boldsymbol{K}_{\star}^{T} \boldsymbol{K}^{-1} \boldsymbol{K}_{\star} . \tag{9.14} \end{align*}

If we make the common choice of setting \mu(\boldsymbol{x})=0 we get:

\begin{align*} p\left(\boldsymbol{f}\left(\boldsymbol{X}_{\star}\right) \mid \boldsymbol{f}(\boldsymbol{X}), \boldsymbol{X}, \boldsymbol{X}_{\star}\right) & =\mathcal{N}\left(\boldsymbol{\mu}_{\star}, \boldsymbol{\Sigma}_{\star}\right) \tag{9.15}\\ \boldsymbol{\mu}_{\star} & =\boldsymbol{K}_{\star}^{T} \boldsymbol{K}^{-1} \boldsymbol{f}(\boldsymbol{X}) \tag{9.16}\\ \boldsymbol{\Sigma}_{\star} & =\boldsymbol{K}_{\star \star}-\boldsymbol{K}_{\star}^{T} \boldsymbol{K}^{-1} \boldsymbol{K}_{\star} . \tag{9.17} \end{align*}

Let’s do a quick exercise to build intuition with these equations. At the outset, we made the assumption that our observations were noiseless. If that’s true, then setting \boldsymbol{X}=\boldsymbol{X}_{\star}=\{\boldsymbol{x}\} in the above equations should result in no covariance and a mean value equal to f(\boldsymbol{x}). Well, in this case \boldsymbol{K}=\boldsymbol{K}_{\star}=\boldsymbol{K}_{\star \star} and our equations reduce to:

\begin{align*} \boldsymbol{\mu}_{\star} & =\boldsymbol{f}(\boldsymbol{X}) \tag{9.18}\\ \boldsymbol{\Sigma}_{\star} & =0 \tag{9.19} \end{align*}

In general, if we assume no noise, our GP will interpolate between the data. This means that at each datapoint in the training set, the GP’s posterior is exactly the observed value for that datapoint.
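The noiseless posterior of Equations 9.16–9.17 is short enough to implement directly. The sketch below (using the squared-exponential kernel and a plain matrix inverse for clarity; in practice one would use a Cholesky solve) confirms the interpolation property at the training points.

```python
import numpy as np

def kappa_se(x1, x2, sigma_f=1.0, length=1.0):
    return sigma_f**2 * np.exp(-((x1[:, None] - x2[None, :]) ** 2) / (2 * length**2))

def gp_predict_noiseless(x_train, f_train, x_star):
    """Zero-mean noiseless GP posterior, Eqs. 9.16-9.17."""
    K = kappa_se(x_train, x_train) + 1e-10 * np.eye(len(x_train))  # jitter
    K_star = kappa_se(x_train, x_star)
    K_ss = kappa_se(x_star, x_star)
    K_inv = np.linalg.inv(K)
    mu_star = K_star.T @ K_inv @ f_train
    Sigma_star = K_ss - K_star.T @ K_inv @ K_star
    return mu_star, Sigma_star

x_train = np.array([0.0, 1.0, 2.5])
f_train = np.sin(x_train)
mu, Sigma = gp_predict_noiseless(x_train, f_train, x_train)
# At the training points the posterior interpolates: mu ≈ f_train, Sigma ≈ 0
```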

Noisy prediction

The next step is to introduce the concept of noise to our GP. As we have seen with our HMMs, we will often want to think of our observations as a noisy realization of some latent process. In that case, our observational dataset is now X=\left\{\boldsymbol{x}_{1: T}\right\} and \left\{y_{1: T}\right\} where:

\begin{equation*} y_{t}=f\left(\boldsymbol{x}_{t}\right)+w_{t} . \tag{9.20} \end{equation*}

We will assume that w_{t} is white noise, that is, w_{t} \sim \mathcal{N}\left(0, \sigma_{n}^{2}\right). The distribution of our observations \boldsymbol{y} is now:

\begin{equation*} p(\boldsymbol{y})=\mathcal{N}\left(\boldsymbol{\mu}(\boldsymbol{X}), \boldsymbol{K}+\sigma_{n}^{2} \mathbb{I}\right) \tag{9.21} \end{equation*}

The noise term appears as a scaled identity matrix because we have assumed no covariance between the noise in our observations.

We can now repeat our previous derivation with this noise, starting from the joint distribution:

p\left(\left[\begin{array}{c} \boldsymbol{y} \tag{9.22}\\ \boldsymbol{f}\left(\boldsymbol{X}_{\star}\right) \end{array}\right]\right)=\mathcal{N}\left(\left[\begin{array}{c} \boldsymbol{\mu}(\boldsymbol{X}) \\ \boldsymbol{\mu}\left(\boldsymbol{X}_{\star}\right) \end{array}\right],\left[\begin{array}{cc} \boldsymbol{K}+\sigma_{n}^{2} \mathbb{I} & \boldsymbol{K}_{\star} \\ \left(\boldsymbol{K}_{\star}\right)^{T} & \boldsymbol{K}_{\star \star} \end{array}\right]\right)

Our conditional distribution is therefore:

\begin{align*} p\left(\boldsymbol{f}\left(\boldsymbol{X}_{\star}\right) \mid \boldsymbol{y}, \boldsymbol{X}, \boldsymbol{X}_{\star}\right) & =\mathcal{N}\left(\boldsymbol{\mu}_{\star}, \boldsymbol{\Sigma}_{\star}\right) \tag{9.23}\\ \boldsymbol{\mu}_{\star} & =\boldsymbol{\mu}\left(\boldsymbol{X}_{\star}\right)+\boldsymbol{K}_{\star}^{T}\left(\boldsymbol{K}+\sigma_{n}^{2} \mathbb{I}\right)^{-1}(\boldsymbol{y}-\boldsymbol{\mu}(\boldsymbol{X})) \tag{9.24}\\ \boldsymbol{\Sigma}_{\star} & =\boldsymbol{K}_{\star \star}-\boldsymbol{K}_{\star}^{T}\left(\boldsymbol{K}+\sigma_{n}^{2} \mathbb{I}\right)^{-1} \boldsymbol{K}_{\star} . \tag{9.25} \end{align*}

Let’s build some intuition for these noisy posterior equations, starting with the case of infinite noise in our data, \sigma_{n} \rightarrow \infty :

\begin{align*} p\left(\boldsymbol{f}\left(\boldsymbol{X}_{\star}\right) \mid \boldsymbol{y}, \boldsymbol{X}, \boldsymbol{X}_{\star}\right) & =\mathcal{N}\left(\boldsymbol{\mu}_{\star}, \boldsymbol{\Sigma}_{\star}\right) \tag{9.26}\\ \boldsymbol{\mu}_{\star} & =\boldsymbol{\mu}\left(\boldsymbol{X}_{\star}\right) \tag{9.27}\\ \boldsymbol{\Sigma}_{\star} & =\boldsymbol{K}_{\star \star} \tag{9.28} \end{align*}

This is just the prior distribution we started with in Equation 9.6: if our data is infinitely noisy, it does not update our beliefs about the function. What if we instead assume that \boldsymbol{X}=\boldsymbol{X}_{\star}=\boldsymbol{x} again? Plugging into our equations we find:

\begin{align*} p\left(\boldsymbol{f}\left(\boldsymbol{X}_{\star}\right) \mid \boldsymbol{y}, \boldsymbol{X}, \boldsymbol{X}_{\star}\right) & =\mathcal{N}\left(\boldsymbol{\mu}_{\star}, \boldsymbol{\Sigma}_{\star}\right) \tag{9.29}\\ \boldsymbol{\mu}_{\star} & =\boldsymbol{\mu}(\boldsymbol{X})+\boldsymbol{K}^{T}\left(\boldsymbol{K}+\sigma_{n}^{2} \mathbb{I}\right)^{-1}(\boldsymbol{y}-\boldsymbol{\mu}(\boldsymbol{X})) \tag{9.30}\\ \boldsymbol{\Sigma}_{\star} & =\boldsymbol{K}-\boldsymbol{K}^{T}\left(\boldsymbol{K}+\sigma_{n}^{2} \mathbb{I}\right)^{-1} \boldsymbol{K} . \tag{9.31} \end{align*}

With noise, our model no longer reproduces the observations exactly, nor does it assign zero covariance at the observed points. The figures presented in the lecture slides build further intuition for these equations.
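A sketch of the noisy posterior of Equations 9.24–9.25 (illustrative; again using a plain inverse and the squared-exponential kernel) makes the loss of interpolation visible at the training points themselves.

```python
import numpy as np

def kappa_se(x1, x2, sigma_f=1.0, length=1.0):
    return sigma_f**2 * np.exp(-((x1[:, None] - x2[None, :]) ** 2) / (2 * length**2))

def gp_predict_noisy(x_train, y_train, x_star, sigma_n):
    """Zero-mean noisy GP posterior, Eqs. 9.24-9.25."""
    K = kappa_se(x_train, x_train)
    K_star = kappa_se(x_train, x_star)
    K_ss = kappa_se(x_star, x_star)
    A = np.linalg.inv(K + sigma_n**2 * np.eye(len(x_train)))
    mu_star = K_star.T @ A @ y_train
    Sigma_star = K_ss - K_star.T @ A @ K_star
    return mu_star, Sigma_star

x_train = np.array([0.0, 1.0, 2.5])
y_train = np.sin(x_train)
mu, Sigma = gp_predict_noisy(x_train, y_train, x_train, sigma_n=0.3)
# With noise the posterior no longer interpolates: mu is pulled toward the
# zero prior mean and the posterior variance stays strictly positive.
```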

Impact of kernel parameters

Let us quickly introduce a popular choice of kernel, the squared-exponential (SE) kernel:

\begin{equation*} \kappa_{\mathrm{SE}}\left(\boldsymbol{x}, \boldsymbol{x}^{\prime}\right)=\sigma_{f}^{2} \exp \left[\frac{-\left(\boldsymbol{x}-\boldsymbol{x}^{\prime}\right)^{T}\left(\boldsymbol{x}-\boldsymbol{x}^{\prime}\right)}{2 l^{2}}\right] \tag{9.32} \end{equation*}

We can think of the parameter \sigma_{f} as controlling the diagonal elements of our covariance matrix and therefore the marginal uncertainty. The parameter l controls the horizontal length scale over which we allow the function to vary. If we include a noise term, our covariance is:

\begin{equation*} \Sigma_{i, j}=\kappa_{\mathrm{SE}}\left(\boldsymbol{x}_{i}, \boldsymbol{x}_{j}\right)+\sigma_{n}^{2} \delta_{\boldsymbol{x}_{i}, \boldsymbol{x}_{j}} \tag{9.33} \end{equation*}
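A few direct evaluations (an illustrative sketch, not from the slides) show how the two SE parameters behave: \sigma_{f} sets the marginal variance on the diagonal, while l controls how quickly correlation decays with distance.

```python
import numpy as np

def kappa_se(x1, x2, sigma_f=1.0, length=1.0):
    """SE kernel, Eq. 9.32, for scalar inputs."""
    return sigma_f**2 * np.exp(-((x1 - x2) ** 2) / (2 * length**2))

# sigma_f sets the marginal scale: kappa(x, x) = sigma_f**2
marginal = kappa_se(0.0, 0.0, sigma_f=2.0)   # -> 4.0

# length sets how fast correlation decays with |x - x'|
short = kappa_se(0.0, 1.0, length=0.5)       # exp(-2)    ~ 0.135
long_ = kappa_se(0.0, 1.0, length=5.0)       # exp(-0.02) ~ 0.980
```

A short length scale makes points one unit apart nearly independent, so prior samples wiggle rapidly; a long length scale keeps them strongly correlated, so samples vary slowly.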

In the lecture slides, we provide some visualizations of how varying the kernel parameters changes our inferred GP posteriors.

Evaluating and fitting kernel parameters

It is clear that the choice of kernel and kernel parameters has a sizable impact on the posterior distributions we infer. Logically, we may want to calculate the likelihood of our kernel parameters given the observed data. We can then use this to get the MAP or MLE estimate for the parameters of our kernel. If we call our kernel parameters \boldsymbol{\theta}, what we want to maximize is the likelihood of our observed data:

\begin{equation*} p(\boldsymbol{y} \mid \boldsymbol{X}, \boldsymbol{\theta})=\int p(\boldsymbol{y} \mid \boldsymbol{f}) p(\boldsymbol{f} \mid \boldsymbol{X}, \boldsymbol{\theta}) d \boldsymbol{f} \tag{9.34} \end{equation*}

Here we are being explicit about our observation noise p(\boldsymbol{y} \mid \boldsymbol{f})=\mathcal{N}\left(\boldsymbol{f}, \sigma_{n}^{2} \mathbb{I}\right). This is just an integral over two Gaussians, which gives:

\begin{equation*} \log p(\boldsymbol{y} \mid \boldsymbol{X}, \boldsymbol{\theta})=-\frac{1}{2} \boldsymbol{y}^{T}\left(\boldsymbol{K}(\boldsymbol{\theta})+\sigma_{n}^{2} \mathbb{I}\right)^{-1} \boldsymbol{y}-\frac{1}{2} \log \left|\boldsymbol{K}(\boldsymbol{\theta})+\sigma_{n}^{2} \mathbb{I}\right|-\frac{T}{2} \log (2 \pi), \tag{9.35} \end{equation*}

where \boldsymbol{K}(\boldsymbol{\theta}) makes explicit our kernel’s dependence on the kernel parameters \boldsymbol{\theta}. Notice that this likelihood has two important terms: the first rewards fitting the data well, while the second penalizes the complexity of our functions. In the lecture slides we show an example for the SE kernel’s parameter l.
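Equation 9.35 can be sketched as follows (an illustration on synthetic sine data, which is an assumption for this example; a Cholesky factorization replaces the explicit inverse and determinant for numerical stability). Scanning the SE length scale l shows the data-fit / complexity trade-off: very short and very long length scales both score worse than a moderate one.

```python
import numpy as np

def kappa_se(x1, x2, sigma_f=1.0, length=1.0):
    return sigma_f**2 * np.exp(-((x1[:, None] - x2[None, :]) ** 2) / (2 * length**2))

def log_marginal_likelihood(x, y, sigma_f, length, sigma_n):
    """Eq. 9.35 for the SE kernel, computed via a Cholesky factorization."""
    T = len(x)
    C = kappa_se(x, x, sigma_f, length) + sigma_n**2 * np.eye(T)
    L = np.linalg.cholesky(C)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # C^{-1} y
    log_det = 2 * np.sum(np.log(np.diag(L)))             # log |C|
    return -0.5 * y @ alpha - 0.5 * log_det - 0.5 * T * np.log(2 * np.pi)

rng = np.random.default_rng(1)
x = np.linspace(0.0, 5.0, 30)
y = np.sin(x) + 0.1 * rng.standard_normal(30)

# Scan the length scale: the data-fit and complexity terms trade off
lls = {l: log_marginal_likelihood(x, y, 1.0, l, 0.1) for l in (0.01, 1.0, 100.0)}
```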


Gaussian Process Kernels

So far, we have framed our understanding of GPs in the function space. This framing has allowed us to construct priors, predictions, and data likelihoods with our GP models, but it is not a sufficient framework to understand the assumptions built into a choice of kernel. It turns out that the weight-space representation of the GP will be useful. Let’s imagine that we have some model defined by the linear combination of M basis functions:

\begin{equation*} y(x)=\boldsymbol{w}^{T} \boldsymbol{\phi}(x), \tag{9.36} \end{equation*}

where \boldsymbol{\phi}(x)_{m} is our m^{\text {th }} basis function and w_{m} is its weight. One concrete example is polynomials of order 2 for which:

\boldsymbol{\phi}(x)=\left[\begin{array}{c} 1 \tag{9.37}\\ x \\ x^{2} \end{array}\right]

and the weights determine the coefficients of each term in our polynomial. The prior over y will be determined by the prior over \boldsymbol{w}, which we will take to be:

\begin{equation*} p(\boldsymbol{w})=\mathcal{N}\left(0, \sigma_{w}^{2} \mathbb{I}\right) \tag{9.38} \end{equation*}

As with the function-space view, we are interested in evaluating y at some set of points x_{1}, \ldots, x_{T}. We can write this as:

\begin{equation*} \boldsymbol{y}=\boldsymbol{\Phi} \boldsymbol{w} \tag{9.39} \end{equation*}

where y_{t}=y\left(x_{t}\right) and \Phi_{t k}=\phi_{k}\left(x_{t}\right). \boldsymbol{y} is a linear combination of Gaussian random variables, so it is Gaussian. We therefore just require the mean and covariance of \boldsymbol{y} to describe its distribution:

\begin{align*} \boldsymbol{\mu}_{\boldsymbol{y}} & =\mathbb{E}[\boldsymbol{y}] \tag{9.40}\\ & =\boldsymbol{\Phi} \mathbb{E}[\boldsymbol{w}] \tag{9.41}\\ & =0 \tag{9.42}\\ \boldsymbol{\Sigma}_{\boldsymbol{y}} & =\mathbb{E}\left[\boldsymbol{y} \boldsymbol{y}^{T}\right] \tag{9.43}\\ & =\boldsymbol{\Phi} \mathbb{E}\left[\boldsymbol{w} \boldsymbol{w}^{T}\right] \boldsymbol{\Phi}^{T} \tag{9.44}\\ & =\sigma_{w}^{2} \boldsymbol{\Phi} \boldsymbol{\Phi}^{T} \tag{9.45} \end{align*}

Let’s take this result back to the function view. There we asserted that there was some distribution over functions (without specifying what functions) with a mean and a covariance. Here, we have a distribution over polynomial functions of order 2, and we found it has mean 0 and a covariance matrix with the form:

\begin{equation*} \boldsymbol{K}=\sigma_{w}^{2} \boldsymbol{\Phi} \boldsymbol{\Phi}^{T} \tag{9.46} \end{equation*}

Notice that:

\begin{equation*} K_{i j}=\kappa\left(x_{i}, x_{j}\right)=\sigma_{w}^{2} \boldsymbol{\phi}\left(x_{i}\right)^{T} \boldsymbol{\phi}\left(x_{j}\right) . \tag{9.47} \end{equation*}

If we take \sigma_{w}=1 we can write:

\begin{align*} \kappa\left(x_{i}, x_{j}\right) & =\boldsymbol{\phi}\left(x_{i}\right)^{T} \boldsymbol{\phi}\left(x_{j}\right) \tag{9.48}\\ & =\left[\begin{array}{lll} 1 & x_{i} & x_{i}^{2} \end{array}\right]\left[\begin{array}{c} 1 \\ x_{j} \\ x_{j}^{2} \end{array}\right] \tag{9.49}\\ & =x_{i}^{2} x_{j}^{2}+x_{i} x_{j}+1 \tag{9.50} \end{align*}

Using this function as our kernel for our GP gives us the posterior over order-two polynomials given our observations.
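The equivalence between the feature-space inner product (Eq. 9.48) and the closed-form kernel (Eq. 9.50) is easy to check numerically; the sketch below does so for one illustrative pair of inputs.

```python
import numpy as np

def phi(x):
    """Order-2 polynomial feature vector, Eq. 9.37."""
    return np.array([1.0, x, x**2])

def kappa_poly(xi, xj):
    """The closed-form kernel of Eq. 9.50 (sigma_w = 1)."""
    return xi**2 * xj**2 + xi * xj + 1.0

xi, xj = 0.7, -1.3
lhs = phi(xi) @ phi(xj)   # feature-space inner product
rhs = kappa_poly(xi, xj)  # kernel evaluation; the two agree
```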

For a kernel function \kappa to produce a valid kernel, it must lead to a positive semi-definite matrix \boldsymbol{K}. In general, one can show that it is necessary and sufficient for a valid kernel function to have a feature representation \boldsymbol{\phi} such that:

\begin{equation*} \kappa\left(\boldsymbol{x}, \boldsymbol{x}^{\prime}\right)=\boldsymbol{\phi}(\boldsymbol{x})^{T} \boldsymbol{\phi}\left(\boldsymbol{x}^{\prime}\right) . \tag{9.51} \end{equation*}

However, this feature representation is neither unique nor necessarily finite-dimensional. The squared-exponential kernel we introduced earlier can be shown to have an infinite-dimensional feature representation.

We now have two ways of thinking about kernels. We can either specify the specific feature vector of interest \boldsymbol{\phi} and then calculate the implied kernel \kappa\left(\boldsymbol{x}, \boldsymbol{x}^{\prime}\right), or we can start from the kernel function and reconstruct the feature vector it implies. We will generally want to do the latter, since many interesting kernels will have an infinite-dimensional feature vector. This is actually a fascinating feature of the GP: we can predict the posterior over the function space of an infinite-dimensional feature vector without ever calculating those features. Working in the kernel / function space rather than the feature / weight space is often called the kernel trick.

If we want to guarantee that a function is a valid kernel function we must either show that the produced matrix is positive semi-definite, or show that it can be written in the form of Equation 9.51. However, we can also use functions that we already know to be valid to generate new valid kernel functions. Here is a list of properties you can use to generate a new valid kernel \kappa^{\prime} from two valid kernels \kappa_{1}, \kappa_{2} :

\begin{align*} \kappa^{\prime}\left(\boldsymbol{x}, \boldsymbol{x}^{\prime}\right) & =c \kappa_{1}\left(\boldsymbol{x}, \boldsymbol{x}^{\prime}\right) & & \text { constant } c \ge 0 \tag{9.52}\\ \kappa^{\prime}\left(\boldsymbol{x}, \boldsymbol{x}^{\prime}\right) & =g(\boldsymbol{x}) \kappa_{1}\left(\boldsymbol{x}, \boldsymbol{x}^{\prime}\right) g\left(\boldsymbol{x}^{\prime}\right) & & \text { function } g \tag{9.53}\\ \kappa^{\prime}\left(\boldsymbol{x}, \boldsymbol{x}^{\prime}\right) & =q\left(\kappa_{1}\left(\boldsymbol{x}, \boldsymbol{x}^{\prime}\right)\right) & & \text { polynomial } q \text { with nonnegative coefficients } \tag{9.54}\\ \kappa^{\prime}\left(\boldsymbol{x}, \boldsymbol{x}^{\prime}\right) & =\exp \left(\kappa_{1}\left(\boldsymbol{x}, \boldsymbol{x}^{\prime}\right)\right) & & \text { exponential } \tag{9.55}\\ \kappa^{\prime}\left(\boldsymbol{x}, \boldsymbol{x}^{\prime}\right) & =\kappa_{1}\left(\boldsymbol{x}, \boldsymbol{x}^{\prime}\right)+\kappa_{2}\left(\boldsymbol{x}, \boldsymbol{x}^{\prime}\right) & & \text { sum } \tag{9.56}\\ \kappa^{\prime}\left(\boldsymbol{x}, \boldsymbol{x}^{\prime}\right) & =\kappa_{1}\left(\boldsymbol{x}, \boldsymbol{x}^{\prime}\right) \times \kappa_{2}\left(\boldsymbol{x}, \boldsymbol{x}^{\prime}\right) . & & \text { product } \tag{9.57} \end{align*}

We can also divide the input \boldsymbol{x}=\left[\boldsymbol{x}_{a}, \boldsymbol{x}_{b}\right] and start from two valid kernels, \kappa_{a} on the subset a and \kappa_{b} on the subset b :

\begin{align*} \kappa^{\prime}\left(\boldsymbol{x}, \boldsymbol{x}^{\prime}\right) & =\kappa_{a}\left(\boldsymbol{x}_{a}, \boldsymbol{x}_{a}^{\prime}\right)+\kappa_{b}\left(\boldsymbol{x}_{b}, \boldsymbol{x}_{b}^{\prime}\right) & & \text { sum of subsets } \tag{9.58}\\ \kappa^{\prime}\left(\boldsymbol{x}, \boldsymbol{x}^{\prime}\right) & =\kappa_{a}\left(\boldsymbol{x}_{a}, \boldsymbol{x}_{a}^{\prime}\right) \times \kappa_{b}\left(\boldsymbol{x}_{b}, \boldsymbol{x}_{b}^{\prime}\right) . & & \text { product of subsets } \tag{9.59} \end{align*}
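A few of these closure properties can be spot-checked numerically by confirming the smallest eigenvalue of the resulting Gram matrix is non-negative (up to floating-point error). This sketch is illustrative; the linear kernel `kappa_lin` is a hypothetical second kernel chosen only because its feature map \phi(x)=x is obvious.

```python
import numpy as np

def kappa_se(x1, x2, length=1.0):
    return np.exp(-((x1[:, None] - x2[None, :]) ** 2) / (2 * length**2))

def kappa_lin(x1, x2):
    return x1[:, None] * x2[None, :]  # implied features: phi(x) = x

def min_eig(K):
    return float(np.min(np.linalg.eigvalsh(K)))

x = np.linspace(-2.0, 2.0, 20)
K1, K2 = kappa_se(x, x), kappa_lin(x, x)

# Sums, elementwise (Hadamard) products, and exponentials of valid kernels
# should all remain positive semi-definite (cf. Eqs. 9.55-9.57)
checks = [min_eig(K1 + K2), min_eig(K1 * K2), min_eig(np.exp(K1))]
```

A numerical check is of course not a proof, but it is a cheap way to catch an invalid candidate kernel before using it in a GP.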

Let’s quickly prove Equation 9.52 and Equation 9.53 and leave the rest as an exercise. First we note that, since \kappa_{1} is a valid kernel, there must exist a feature vector such that:

\begin{equation*} \kappa_{1}\left(\boldsymbol{x}, \boldsymbol{x}^{\prime}\right)=\phi(\boldsymbol{x})^{T} \phi\left(\boldsymbol{x}^{\prime}\right) . \tag{9.60} \end{equation*}

For the product by a constant we have:

\begin{align*} \kappa^{\prime}\left(\boldsymbol{x}, \boldsymbol{x}^{\prime}\right) & =c \kappa_{1}\left(\boldsymbol{x}, \boldsymbol{x}^{\prime}\right) \tag{9.61}\\ & =c \phi(\boldsymbol{x})^{T} \phi\left(\boldsymbol{x}^{\prime}\right) \tag{9.62}\\ & =\phi^{\prime}(\boldsymbol{x})^{T} \phi^{\prime}\left(\boldsymbol{x}^{\prime}\right) \tag{9.63}\\ \phi^{\prime}(\boldsymbol{x}) & =\sqrt{c} \phi(\boldsymbol{x}) \tag{9.64} \end{align*}

Since we can express our candidate function as a scalar product in some feature space \phi^{\prime}(\boldsymbol{x}), we have shown that we have a valid kernel. For multiplication by a function, the proof follows almost identically:

\begin{align*} \kappa^{\prime}\left(\boldsymbol{x}, \boldsymbol{x}^{\prime}\right) & =g(\boldsymbol{x}) \kappa_{1}\left(\boldsymbol{x}, \boldsymbol{x}^{\prime}\right) g\left(\boldsymbol{x}^{\prime}\right) \tag{9.65}\\ & =g(\boldsymbol{x}) \phi(\boldsymbol{x})^{T} \phi\left(\boldsymbol{x}^{\prime}\right) g\left(\boldsymbol{x}^{\prime}\right) \tag{9.66}\\ & =\phi^{\prime}(\boldsymbol{x})^{T} \phi^{\prime}\left(\boldsymbol{x}^{\prime}\right) \tag{9.67}\\ \phi^{\prime}(\boldsymbol{x}) & =g(\boldsymbol{x}) \phi(\boldsymbol{x}) \tag{9.68} \end{align*}

There are two more concepts that you may run into when choosing between kernels that are worth briefly defining:


Gaussian Process Strengths and Weaknesses

With this introduction to Gaussian Processes under our belt, it is useful to discuss a few of their strengths and weaknesses at a high level. Let’s start with their strengths:

This set of strengths has driven the popularity of Gaussian processes as a tool for regression. But like any modeling tool, GPs have weaknesses: