Gaussian processes (student)

Instructions: Submit your completed lab worksheet to Gradescope by 1:30 PM the Monday of the next lab.

You can find a Colab version of the required code, including plotting functions here.

This lab is meant as an introduction to Gaussian Process regression.

Students should:

Be able to implement the a Gaussian Process with a square exponential kernel.
Gain intuition for how the kernel function parameters affect the prior.
Gain intuition for how the kernel function parameters affect the prediction with and without noise.
Be able to implement the data likelihood given the kernel parameters.
Gain intuition for the tradeoff between fitting the data and model complexity.

import functools
from typing import Any, Optional, Tuple

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal

Kinematics Dataset

We will start from a pre-generated dataset for this lab. This data comes from a simple kinematics simulation that follows these equations:

X_t = t * \sin\left( \frac{2 \pi t}{P} \right) + W_t

where T is a parameter of our system and W_t is white-noise with W_t \sim \mathcal{N}(0, \sigma_w^2). This equation is the product of a periodic function and a linear function, so it will be interesting to see how different kernel assumptions for our Gaussian Process adapt to the data. Our goal is to infer the true trajectory of the object from the noisy data.

We will generate a dataset where P = 0.2 and \sigma_w = 0.5.

# Let's generate the kinematics dataset.
x = np.linspace(start=0, stop=10, num=1000).reshape(-1, 1)
f_true = 0.5 * np.squeeze(x * np.sin(x / 0.2))

rng = np.random.seed(2)
obs_indices = np.random.choice(np.arange(f_true.size), size=10, replace=False)
x_obs, f_obs = x[obs_indices], f_true[obs_indices]
y_obs = f_obs + np.random.normal(loc=0.0, scale=0.5, size=f_obs.shape)

Part I: Implementing the Base Gaussian Process Class

To compare different kernel you will need to implement a GP class that:

Implement the calculation of the K matrix.
Store the data and the inverse K matrix for prediction.
Implement the mean and covariance prediction functions.
Implement calculation of likelihood of observations given kernel parameters.

We will assume that \mu(t) = 0 for our GP.

class GaussianProcess:
    """Class that implements a Gaussian Process.

    Args:
        kernel_function: Function for calculating kappa(x,x').
    """
    def __init__(self, kernel_function: Any):
        """Initialize our class."""
        # Save the vectorized kernel function. Your kernel function
        # will take as input two scalar, but the vectorized function needs
        # to take in two arrays, one of length $m$ and one of length $n$,
        # and return a mxn matrix. We use one call of numpy.vectorize and take
        # advantage of the signature options.
        self._kernel_function = (
            np.vectorize(kernel_function,
                         signature='(),(m)->()'
                        )
        )

        # Initialize to None so we know to predict the prior to start.
        self.k_matrix_inv = None
        self.observed_x = None
        self.observed_y = None

    def calc_k_matrix(self, x_rows: np.ndarray, x_cols: np.ndarray) -> np.ndarray:
        """Calculate the K matrix at the given positions.

        Args:
            x_rows: X-position for the rows.
            x_cols: X-position for the columns.

        Returns:
            K(x_rows,x_cols) matrix.
        """
        # TODO: All this function needs to do is call our vectorized kernel function
        return # TODO

    def set_observations(self, observed_x: np.ndarray, observed_y: np.ndarray,
                         sigma_w: float):
        """Store the x and y that have been observed.

        Args:
            observed_x: Observed input points x.
            observed_y: Observed (noisy) outputs y.
            sigma_w: Observational noise standard deviation.

        Notes:
            Modifies internal variables.
        """
        # Save the observations.
        self.observed_x = observed_x
        self.observed_y = observed_y

        # TODO: Save the K matrix inverse with the noise included.
        self.k_matrix_inv = # TODO

    def predict(self, predict_x: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
        """Predict the mean and covariance at given points.

        Args:
            predict_x: X-positions at which to make predictions.

        Returns:
            Mean and covariance matrix at the requested points.

        Notes:
            If no observations have been set, this will return the prior.
        """
        # TODO: Calculate the k_star_star matrix.
        k_star_star_matrix = # TODO

        # TODO: Start with the prior predictions
        mean_pred = # TODO
        cov_pred = # TODO

        # TODO: Correct for observed data if it exists.
        if self.observed_x is not None:
            # Calculate the k_star matrix. Use the saved k_matrix_inv for speed
            # of computation.
            k_star_matrix = # TODO

            # Update mean and covariance prediction.
            mean_pred += # TODO
            cov_pred -= # TODO

        return mean_pred, cov_pred

    def sample_f(self, predict_x: np.ndarray, n_samples: Optional[int] = 1) -> np.ndarray:
        """Return samples of the function space at the desired points.

        Args:
            predict_x: X-positions at which to make predictions.
            n_samples: Number of function samples to draw.

        Returns:
            Samples of the function space outputs.
        """
        # Get the mean and covariance prediction from our function.
        mean_pred, cov_pred = self.predict(predict_x)

        # Draw samples.
        f_samples = np.zeros((n_samples, len(predict_x)))
        for i in range(len(f_samples)):
            f_samples[i] = multivariate_normal.rvs(mean=mean_pred, cov=cov_pred)

        return f_samples

    def log_likelihood(self) -> float:
        """Return the log likelihood of the data for the given kernel.

        Returns:
            Log likelihood of the data.
        """
        # TODO: Implement the log likelihood of the data.
        log_likelihood_calc = # TODO

        return log_likelihood_calc

Let’s test our implementation:

# A few tests for our GP functions.
def test_kernel_function(x, x_prime):
    return np.exp(-np.sum(np.square(x - x_prime))/20)

gp_test = GaussianProcess(test_kernel_function)

x_test = np.linspace(0,10,4).reshape(-1, 1)

# Test the K matrix calculation.
k_matrix = gp_test.calc_k_matrix(x_test, x_test)
np.testing.assert_array_almost_equal(
    k_matrix,
    [[1.0, 0.573753, 0.108368, 0.006738],
     [0.573753, 1.0, 0.573753, 0.108368],
     [0.108368, 0.573753, 1.0, 0.573753],
     [0.006738, 0.108368, 0.573753, 1.0]]
)

# Test the predictions without observations.
mean_test, cov_test = gp_test.predict(x_test)
np.testing.assert_array_almost_equal(mean_test, np.zeros(4))
np.testing.assert_array_almost_equal(
    cov_test,
    [[1.0, 0.573753, 0.108368, 0.006738],
     [0.573753, 1.0, 0.573753, 0.108368],
     [0.108368, 0.573753, 1.0, 0.573753],
     [0.006738, 0.108368, 0.573753, 1.0]]
)

# Test the predictions with observations
x_obs_test = np.linspace(0,10,4).reshape(-1, 1)
y_obs_test = np.linspace(5,50,4)
gp_test.set_observations(x_obs_test, y_obs_test, sigma_w=0.3)
mean_test, cov_test = gp_test.predict(x_test)
np.testing.assert_array_almost_equal(
    mean_test,
    np.array([5.346965, 18.628744, 34.861521, 46.078638])
)
np.testing.assert_array_almost_equal(
    cov_test,
    [[ 0.078941,  0.007533, -0.003469,  0.001145],
     [ 0.007533,  0.073929,  0.009536, -0.003469],
     [-0.003469,  0.009536,  0.073929,  0.007533],
     [ 0.001145, -0.003469,  0.007533,  0.078941]]
)

# Test the log likelihood calculation.
np.testing.assert_almost_equal(
    gp_test.log_likelihood(),
    -1262.211359395657
)

Part II: Comparing Kernels for our GP.

Let’s compare how our different kernels impact our predictions. We can also test how different kernel values are favored by the model. We will:

Compare the prediction of a squared exponential and a cosine kernel.
See how different choices of the period for our cosine kernel impact the likelihood of the observed data.

The squared exponential kernel is given by:

\kappa(x,x') = A * \exp\left(\frac{-(x-x')^2}{2 l^2}) \right)

The cosine kernel is given by:

\kappa(x,x') = A * \cos\left(\frac{|x-x'|}{P}\right)

# First we need to implement our two kernels.
def cosine_kernel(x: float, x_prime: float, period: float, amplitude: float) -> float:
    """Cosine kernel function.

    Args:
        x: First x-position at which to calculate the kernel.
        x_prime: Second x-position at which to calculate the kernel.
        period: Period of the cosine kernel.
        amplitude: Amplitude of the cosine kernel.

    Returns:
        Kernel function value.
    """
    # TODO: Implement cosine kernel function.
    return # TODO

def squared_exp_kernel(x: float, x_prime: float, length_scale: float, amplitude: float) -> float:
    """Cosine kernel function.

    Args:
        x: First x-position at which to calculate the kernel.
        x_prime: Second x-position at which to calculate the kernel.
        length_scale: Length scale of the squared exponential kernel.
        amplitude: Amplitude of the squared exponential kernel.

    Returns:
        Kernel function value.
    """
    # TODO: Implement cosine kernel function.
    return # TODO

# First let's visualize the squared exponential kernel.
plt.figure(figsize=(10,5))

# TODO: Use functools.partial to set our kernel.
length_scale = 1.0
amplitude = 2.0
kernel_function = # TODO
gp = GaussianProcess(kernel_function)
sigma_w = 0.5

gp.set_observations(x_obs, y_obs, sigma_w)
y_pred_mean, y_pred_cov = gp.predict(x)
y_pred_std = np.sqrt(np.diag(y_pred_cov))
plt.plot(x, y_pred_mean, color = '#99d8c9', label = 'GP Posterior Distribution')
plt.plot(x_obs, y_obs, 'x', color = 'k', label = 'Observed Data', markersize=7)
plt.plot(x, f_true, '--', color = 'k', label = 'True Generative Function')
plt.fill_between(x[:,0], y_pred_mean - 2 * y_pred_std, y_pred_mean + 2 * y_pred_std,
                 color = '#99d8c9', alpha=0.5)
plt.legend()
plt.xlabel('Time')
plt.ylabel('f(Time)')
plt.ylim([-10,10])
plt.show()

Why are these predictions doing poorly? Limit your answer to the form of the kernel and not the specific parameters values.

The kernel doesn’t include the periodic structure of the dataset, rather it favors polynomial functions. This prior bias, combined with the small data sample, leads to poor agreement between the predicted functional distribution and the true generative function.

# Next let's visualize the cosine kernel.
plt.figure(figsize=(10,5))

# TODO: Use functools.partial to set our kernel.
period = 0.2
amplitude = 2.0
kernel_function = # TODO
gp = GaussianProcess(kernel_function)
sigma_w = 0.5

gp.set_observations(x_obs, y_obs, sigma_w)
y_pred_mean, y_pred_cov = gp.predict(x)
y_pred_std = np.sqrt(np.diag(y_pred_cov))
plt.plot(x, y_pred_mean, color = '#99d8c9', label = 'GP Posterior Distribution')
plt.plot(x_obs, y_obs, 'x', color = 'k', label = 'Observed Data', markersize=7)
plt.plot(x, f_true, '--', color = 'k', label = 'True Generative Function')
plt.fill_between(x[:,0], y_pred_mean - 2 * y_pred_std, y_pred_mean + 2 * y_pred_std,
                 color = '#99d8c9', alpha=0.5)
plt.legend()
plt.xlabel('Time')
plt.ylabel('f(Time)')
plt.ylim([-10,10])
plt.show()

How are these new predictions better than the previous? Why are the predictions still limited? Limit your answer to the form of the kernel and not the specific parameters values.

The kernel now includes the periodicity of the dataset, so the predicted functions exhibit periodic structure. However, since the kernel depends entirerly on this periodic structure, it fails to capture the growing amplitude of the signal over time.

# What if we combine our two kernels together?
def combined_kernel(x: float, x_prime: float, period_cosine: float, amplitude_cosine: float,
                    length_scale: float, amplitude_se: float) -> float:
    """Sum of cosine and squared exponential kernel.

    Args:
        x: First x-position at which to calculate the kernel.
        x_prime: Second x-position at which to calculate the kernel.
        period_cosine: Period of the cosine kernel.
        amplitude_cosine: Amplitude of the cosine kernel.
        length_scale: Length scale of the squared exponential kernel.
        amplitude_se: Amplitude of the squared exponential kernel.

    Returns:
        Kernel function value.
    """
    # TODO: Implement the combination.
    return # TODO

# Now let's visualize the combined kernel.
plt.figure(figsize=(10,5))

# TODO: Use functools.partial to set our kernel.
period_cosine = 0.2
amplitude_cosine = 2.0
length_scale = 0.2
amplitude_se = 2.0
kernel_function = # TODO
gp = GaussianProcess(kernel_function)
sigma_w = 0.5

gp.set_observations(x_obs, y_obs, sigma_w)
y_pred_mean, y_pred_cov = gp.predict(x)
y_pred_std = np.sqrt(np.diag(y_pred_cov))
plt.plot(x, y_pred_mean, color = '#99d8c9', label = 'GP Posterior Distribution')
plt.plot(x_obs, y_obs, 'x', color = 'k', label = 'Observed Data', markersize=7)
plt.plot(x, f_true, '--', color = 'k', label = 'True Generative Function')
plt.fill_between(x[:,0], y_pred_mean - 2 * y_pred_std, y_pred_mean + 2 * y_pred_std,
                 color = '#99d8c9', alpha=0.5)
plt.legend()
plt.xlabel('Time')
plt.ylabel('f(Time)')
plt.ylim([-10,10])
plt.show()

How are these new predictions better than the previous? Why might this be the case? Limit your answer to the form of the kernel and not the specific parameters values.

The kernel now includes correlations that can capture the periodic structure and the growth of the amplitude over time. Therefore the predicted distribution is in much better agreement with the true generative function.

# We will compare the quality of different period choices.
periods = np.linspace(0.1,0.3,1000)
log_likelihoods = np.zeros(periods.shape)

# Let's start by plotting few length scales.
fig, ax = plt.subplots(1, 3, figsize=(20, 5), dpi=100)

for i, period in enumerate([0.1, 0.2, 0.3]):
    # TODO: Use the looped period parameter to initialize the kernel function.
    amplitude_cosine = 2.0
    length_scale = 0.2
    amplitude_se = 2.0
    kernel_function = # TODO
    gp = GaussianProcess(kernel_function)
    gp.set_observations(x_obs, y_obs, sigma_w)
    y_pred_mean, y_pred_cov = gp.predict(x)
    y_pred_std = np.sqrt(np.diag(y_pred_cov))
    ax[i].plot(x, y_pred_mean, color = '#99d8c9', label = 'GP Posterior Distribution')
    ax[i].plot(x_obs, y_obs, 'x', color = 'k', label = 'Observed Data', markersize=7)
    ax[i].plot(x, f_true, '--', color = 'k', label = 'True Generative Function')
    ax[i].fill_between(x[:,0], y_pred_mean - 2 * y_pred_std, y_pred_mean + 2 * y_pred_std,
                       color = '#99d8c9', alpha=0.5)
    ax[i].legend()
    ax[i].set_xlabel('Time')
    ax[i].set_ylabel('f(Time)')
    ax[i].set_ylim([-10,10])
    ax[i].set_title(f'Period = {period}')

# plt.show()

# For each length scale, calculate the log likelihood.
for i, period in enumerate(periods):
    # TODO: Use the looped period parameter to initialize the kernel function.
    amplitude_cosine = 2.0
    length_scale = 0.2
    amplitude_se = 2.0
    kernel_function = # TODO
    gp = GaussianProcess(kernel_function)
    gp.set_observations(x_obs, y_obs, sigma_w)
    log_likelihoods[i] = gp.log_likelihood()

plt.figure(figsize=(20,5))
plt.plot(periods, log_likelihoods)
# plt.axvline(0.2, c='k')
plt.xlabel(r'Period $P$', fontsize=20)
plt.ylabel(r'$\log p(\mathbf{y}|\mathbf{X},P)$', fontsize=20)

plt.show()

What value of the period parameter maximizes the data likelihood? Does this make sense?

How would you describe the predictions from p=0.1? Too complex or to too simple? Does this make sense?

The maximum is around P = 0.2, which is the true periodicity of the data. This makes sense – the ideal tradeoff between fit and complexity should follow the true generative distribution.

This is a little less clear than in the squared exponential case, but these predictions are probably too complex. The functions the model predicts have rapid variations that can describe the data fairly well, but the likelihood is lower because the resulting functions are more complicated.