Instructions: Submit your completed lab worksheet to Gradescope by 1:30 PM the Monday of the next lab.
You can find a Colab version of the required code, including plotting functions here.
You’ll also need the data file arma_data.csv!
This lab is meant to further improve your understanding of the ARMA model and the correlation functions we’ve been discussing in class. To get you more comfortable with the type of work you will need to do for the project, this will be less on-the-rails than the previous lab. You should take inspiration from the previous labs to help organize your code.
By the end of the lab, students should:
1. Be able to identify a good ARMA model to apply to a dataset.
2. Be able to fit an ARMA model to the data.
3. Be able to display the conditional posterior coming from an ARMA
process fit.
from typing import Any, Optional, Tuple
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.stats import norm, uniform
from scipy.optimize import minimize
from statsmodels.tsa.stattools import acf, pacf
from statsmodels.tsa.arima_process import arma_acovf

Part I: Identifying the Best ARMA Model
We will start from a pre-generated dataset for this lab. We believe that this data comes from a stationary process, and we would like to model the data as ARMA(p,q). The question is what the correct values of p and q are for this process. Using the ACF and PACF (you are free to use the implementations from statsmodels that are imported above), make an argument for why this data corresponds to an AR(3) process. This should include:
- Calculating the ACF, PACF, and expected error on both estimates (use the formulas from class!).
- Plotting the data, the ACF, and the PACF along with the errors. This should be three separate plots, one for the data, one for the ACF, and one for the PACF.
- Labeling all of the plots. This includes axes, sensible colors, and legends for each of your plots.
Remember, the previous labs have examples of how to make nice
plots; see especially plt.subplots,
fill_between, and stem.
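If you want a quick reminder of what the statsmodels calls look like, here is a minimal, hedged sketch on a synthetic series. It assumes the usual large-sample approximation of 1/sqrt(n) for the standard error of both estimators, so double-check it against the formulas from class before relying on it.

# Hedged sketch on made-up data: ACF/PACF estimates with the large-sample
# 1/sqrt(n) error band (confirm this matches the formulas from class).
import numpy as np
from statsmodels.tsa.stattools import acf, pacf

rng = np.random.default_rng(0)
toy_signal = rng.normal(size=500)  # stand-in for the real dataset
toy_lags = 20

toy_acf = acf(toy_signal, nlags=toy_lags)
toy_pacf = pacf(toy_signal, nlags=toy_lags)
toy_error = 1.0 / np.sqrt(len(toy_signal))  # same approximation for both estimators

print(toy_acf[:4], toy_pacf[:4], toy_error)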
# TODO: Read the signal from the provided file arma_data.csv.
# Be sure to convert to a numpy array!
df = # TODO
signal = # TODO

# TODO: Plot the ACF and PACF along with the
# expected error for the estimators.
n_lags = 20
acf_estimate = # TODO
pacf_estimate = # TODO
acf_error_estimate = # TODO
pacf_error_estimate = # TODO
# TODO: Do the plotting here. See the previous labs for examples.
raise ValueError('There should be some plots here!')

Based on the plots you created, make your argument for why this data corresponds to an AR(3) process.
The ACF remains non-zero out to large lags, which is indicative of an AR process. There is no sharp cutoff in the ACF, which suggests either that the coefficients of the MA component are small or that the MA order is 0; here we assume the latter.
The PACF drops to within roughly the error bars of zero at |h| = 4. For an AR(p) process, the PACF is non-zero only for |h| <= p, so a cutoff after lag 3 suggests an AR process of order 3.
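As a quick sanity check of this cutoff property (not part of the required answer), you can simulate an AR(3) with made-up coefficients and look at where its sample PACF drops; the coefficients and sample size below are illustrative only.

# Hedged illustration with made-up AR(3) coefficients: the sample PACF should be
# consistent with zero beyond lag 3, mirroring the argument above.
import numpy as np
from statsmodels.tsa.arima_process import ArmaProcess
from statsmodels.tsa.stattools import pacf

ar_poly = np.array([1.0, -0.5, -0.2, -0.1])  # [1, -phi_1, -phi_2, -phi_3]
ma_poly = np.array([1.0])
sim = ArmaProcess(ar_poly, ma_poly).generate_sample(nsample=2000)
print(np.round(pacf(sim, nlags=6), 2))  # lags 1-3 clearly non-zero, lags 4+ near zero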
Part II: Fitting our AR(3) Model
You’ve hopefully made a reasonable argument that fitting the parameters of an AR(3) model will give us a good result. Now it’s time to do that fit. To do this you will have to:
- Calculate the posterior for the parameters of an AR(3) process (a rough sketch of one possible structure follows this list).
- Calculate the conditional distribution for an AR(3) process given its parameters.
- Plot samples and the conditional distribution for our AR(3) process.
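Before filling in the class below, here is a hedged sketch of one way the pieces might fit together. The Gaussian priors on the phis and the uniform prior on sigma_w are assumptions read off the constructor description, and the likelihood is deliberately left unimplemented since it should follow the AR(3) expression from lecture.

# Hedged sketch (not the required implementation): one possible organization of
# the prior / likelihood / posterior methods.
import numpy as np
from scipy.stats import norm, uniform

class ARThreeModelSketch:
    def __init__(self, sigma_phi: float, sigma_max: float):
        self.sigma_phi = sigma_phi
        self.sigma_max = sigma_max

    def log_prior(self, params) -> float:
        # Assumed priors: independent N(0, sigma_phi^2) on phi_1..phi_3 and
        # Uniform(0, sigma_max) on sigma_w.
        phis, sigma_w = np.asarray(params[:3]), params[-1]
        log_p = np.sum(norm.logpdf(phis, loc=0.0, scale=self.sigma_phi))
        log_p += uniform.logpdf(sigma_w, loc=0.0, scale=self.sigma_max)
        return log_p

    def log_likelihood(self, x_t, params) -> float:
        raise NotImplementedError('Implement the AR(3) likelihood from lecture.')

    def log_posterior(self, params, x_t) -> float:
        # Up to a normalizing constant, log posterior = log prior + log likelihood.
        return self.log_prior(params) + self.log_likelihood(x_t, params)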
class ARThreeModel:
"""Class implementing prior, likelihood, posterior, and predictions for an
AR(3) model.
Args:
sigma_phi: Sigma for the prior on parameters phi_1, phi_2, and phi_3.
sigma_max: Maximum value for uniform prior on sigma_w.
"""
# TODO

Let's run some tests to make sure the implementations are correct.
a_model = ARThreeModel(2.0, 5.0)
phi_one, phi_two, phi_three = 0.5, 0.2, 0.1
sigma_w = 1.0
params_test = np.array([phi_one, phi_two, phi_three, sigma_w])
x_t = np.array([3.52810469, 3.2699877, 4.9520883, 8.60224575, 10.74710566,
7.28886735, 9.15180511, 7.56132263, 6.91684916, 7.17525595])
np.testing.assert_almost_equal(a_model.log_prior([0.3, 0.4, 0.5, 2.0]),
-6.508195053727955)
np.testing.assert_almost_equal(a_model.log_prior([0.3, 0.4, 0.5, 6.0]),
-np.inf)
np.testing.assert_almost_equal(a_model.log_likelihood(x_t, params_test),
-50.47684010239573)
np.testing.assert_almost_equal(a_model.log_posterior(params_test, x_t),
-56.96003515612368)

Now, let's fit our model to the data using the priors.
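As a reminder of the general pattern (a hedged sketch on a toy objective, not the lab solution), MAP fitting with scipy's minimize works by minimizing the negative log posterior:

# Hedged sketch of the MAP-fitting pattern: minimize the *negative* log posterior.
# The objective below is a stand-in; in the lab you would wrap
# a_model_wide.log_posterior instead.
import numpy as np
from scipy.optimize import minimize

def toy_neg_log_posterior(params):
    return np.sum((np.asarray(params) - np.array([0.5, 0.2, 0.1, 1.0])) ** 2)

toy_fit = minimize(toy_neg_log_posterior, x0=np.array([0.0, 0.0, 0.0, 0.5]))
print(toy_fit.x)  # the optimizer's best parameters live in the .x attribute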
sigma_phi = 2.0
sigma_max = 5.0
a_model_wide = ARThreeModel(sigma_phi, sigma_max)
# TODO run the fit with minimize from scipy.
# Results will be [phi_1, phi_2, phi_3, sigma_w].
results = # TODO: Run the fit. See last week's lab for an example.
for i in range(3):
print(f'phi_{i + 1} has MAP value of {results[i]}')
print(f'sigma_w has MAP value of {results[-1]}')

We want to calculate the conditional distribution of our AR(3) process. As we discussed in class, we can take advantage of the fact that all of the structure of our process can be captured by a multivariate Gaussian distribution. To use our conditional Gaussian equation from lecture we need a mean and a covariance matrix. The mean is easy, but we will need a function for the covariance matrix.
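The docstring of the function below describes the input convention for arma_acovf. As a hedged illustration of that convention, here is a call for a made-up AR(1), where the zero-lag autocovariance has the familiar closed form sigma_w^2 / (1 - phi_1^2):

# Hedged illustration of the arma_acovf convention on a made-up AR(1):
# the AR polynomial is [1, -phi_1], the MA polynomial is [1], and the call
# returns nobs autocovariances, gamma_0 through gamma_{nobs - 1}.
import numpy as np
from statsmodels.tsa.arima_process import arma_acovf

demo_acov = arma_acovf(np.array([1.0, -0.5]), np.array([1.0]), 4, 1.5 ** 2)
print(demo_acov)  # demo_acov[0] should equal 1.5 ** 2 / (1 - 0.5 ** 2) = 3.0

A symmetric Toeplitz matrix built from these autocovariances (for example with scipy.linalg.toeplitz) is one way to assemble the covariance matrix, but check that it matches the construction from class.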
def ar_three_cov_matrix(phis: np.ndarray, max_h: int,
sigma_w: float) -> np.ndarray:
"""Calculate the covariance matrix for an AR(3) process.
Args:
phis: Phi parameters of the AR(3) process.
max_h: Maximum time index separation to calculate the covariance matrix for.
sigma_w: Standard deviation of the white noise driving the process.
Returns:
Covariance matrix up to lag of max_h.
Notes:
Use the function arma_acovf to calculate the autocovariance.
It takes as input ([phi_0, -phi_1, ...], [theta_0, theta_1, ...],
max_h + 1, sigma_w ** 2). Note that it requires a value of phi_0 and
theta_0 (which we always take to be 1.0).
"""
# TODO: Implement.

Let's quickly test you got the right answer.
cov_test = ar_three_cov_matrix(np.array([0.5, 0.2, 0.1]), 1, 1.5)
cov_correct = np.array([[4.795507, 3.369816], [3.369816, 4.795507]])
np.testing.assert_array_almost_equal(cov_test, cov_correct)

Now, plot the covariance matrix for your fitted parameters and a max_h of 20.
cov_twenty = # TODO: Calculate and plot.
# We recommend plt.imshow(). Labels are less important here,
# but a colorbar is good.

Does the structure make sense? How would it be different if it was an MA(3) process?
Yes, it's symmetric and the covariance falls off exponentially fast as you move away from the diagonal. For an MA(3) process, only the diagonal and the first three off-diagonals would be non-zero; every entry corresponding to a lag greater than 3 would be exactly zero.
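You can check this claim numerically with arma_acovf using made-up theta values (a hedged illustration, not required for the worksheet):

# Hedged check with made-up MA(3) coefficients: the autocovariance is exactly
# zero beyond lag 3, so only the diagonal and first three off-diagonals survive.
import numpy as np
from statsmodels.tsa.arima_process import arma_acovf

ma_demo_acov = arma_acovf(np.array([1.0]), np.array([1.0, 0.4, 0.3, 0.2]), 8, 1.0)
print(np.round(ma_demo_acov, 3))  # non-zero for lags 0-3, zero for lags 4 and up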
Now we want the conditional distribution for the next m time steps given the past n. The equation can be found in the lecture notes.
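For reference, the standard Gaussian conditioning identity takes the following form; double-check that the block ordering matches the convention used in the lecture notes (here the first block x_1 holds the n conditioned observations and the second block x_2 the next m values):

$$
\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \sim \mathcal{N}\!\left( \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} \right)
\quad\Longrightarrow\quad
x_2 \mid x_1 \sim \mathcal{N}\!\left( \mu_2 + \Sigma_{21}\Sigma_{11}^{-1}(x_1 - \mu_1),\; \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12} \right)
$$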
def conditioned_mean_covariance(mu_vec: np.ndarray,
covariance_matrix: np.ndarray,
conditioned_observations: np.ndarray) ->
Tuple[np.ndarray, np.ndarray]:
"""
Calculates the conditioned mean and covariance matrix of a multivariate
Gaussian.
Args:
mu_vec: Mean of distribution with length (n+m).
covariance_matrix: Covariance matrix of the distribution with shape
((n + m), (n + m)).
conditioned_observations: First n observations that are being
conditioned on.
Returns:
Mean and covariance of the next m observations.
Notes:
The code from Lab 2 may be useful, but notice that the calculation here is
slightly different.
"""
# TODO: Calculate the conditioned mean and covariance.

With the conditioned mean and covariance we can now better understand our model's fit. We want to do two things here:
- Do one-step-ahead prediction using the previous 40 time steps to confirm that our model gives a reasonable fit to the data.
- Predict the next 70 time steps given the last 40 time steps in our signal (i.e., predict the future).
Want to test the function you wrote above? The
sample_x_future function from last week might be
useful!
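For example, here is a quick, hedged check on a made-up 2x2 Gaussian where the conditional mean and variance are easy to work out by hand:

# Hedged toy check for conditioned_mean_covariance: condition a made-up 2x2
# Gaussian on its first component and compare to the hand-computed answer.
import numpy as np

mu_toy = np.array([0.0, 0.0])
cov_toy = np.array([[2.0, 1.0],
                    [1.0, 2.0]])
obs_toy = np.array([1.0])  # observe the first variable at 1.0
mu_c, cov_c = conditioned_mean_covariance(mu_toy, cov_toy, obs_toy)
# By hand: mean = 1.0 * (1.0 / 2.0) = 0.5 and variance = 2.0 - 1.0 / 2.0 = 1.5.
np.testing.assert_almost_equal(mu_c[0], 0.5)
np.testing.assert_almost_equal(cov_c[0, 0], 1.5)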
# Initialize our mean and covariances for the one ahead.
one_ahead_start = len(signal) - 100
mu_one_ahead = np.zeros(len(signal) - one_ahead_start)
cov_one_ahead = np.zeros(len(signal) - one_ahead_start) # The conditional covariance reduces to a single variance for one-step-ahead prediction.
n_lookback = 40
for time_step in range(one_ahead_start, len(signal)):
# TODO: Use your function to calculate the covariance
# matrix for the time step in question using the last
# 40 datapoints of signal.
mu_joint = # TODO
cov_joint = # TODO
mu_cond, cov_cond = # TODO
mu_one_ahead[time_step - one_ahead_start] = mu_cond[0]
cov_one_ahead[time_step - one_ahead_start] = cov_cond[0, 0]

# TODO: Predict the next 70 time steps given the last 40.
predict_next = 70
n_lookback = 40
mu_joint = # TODO
cov_joint = # TODO
mu_future, cov_future = # TODO
mu_total = np.concatenate([mu_one_ahead, mu_future])
var_total = np.concatenate([cov_one_ahead, np.diag(cov_future)])
# Plot the result
colors = ['#7fcdbb', '#2c7fb8']
fontsize = 15
t_data = np.arange(len(signal))
t_cond = np.arange(one_ahead_start, len(signal))
t_total = np.arange(one_ahead_start, len(signal) + predict_next)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5), dpi=200)
ax1.plot(t_data, signal, color=colors[1], label='Data')
ax1.plot(t_total, mu_total, color=colors[0])
ax1.fill_between(t_total, mu_total - 2 * np.sqrt(var_total), mu_total + 2 * np.sqrt(var_total),
color=colors[0], label=r'Conditional Distribution', alpha=0.4)
ax1.set_xlim([one_ahead_start, len(signal) + predict_next + 2])
ax1.set_ylabel(r'$X_t$', fontsize=fontsize)
ax1.set_xlabel('Time Index', fontsize=fontsize)
ax1.set_title('Posterior Predictions', fontsize=fontsize)
ax1.legend(fontsize=fontsize)
# Plot the residual.
ax2.plot(t_cond, signal[one_ahead_start:] - mu_one_ahead, color=colors[1], label='Data')
ax2.plot(t_cond, np.zeros(len(t_cond)), color=colors[0])
ax2.fill_between(t_cond, -2 * np.sqrt(cov_one_ahead), 2 * np.sqrt(cov_one_ahead),
color=colors[0], label=r'Conditional Distribution', alpha=0.4)
ax2.set_xlim([one_ahead_start, len(signal) + 2])
ax2.set_ylabel(r'Residual', fontsize=fontsize)
ax2.set_xlabel('Time Index', fontsize=fontsize)
ax2.set_title('Posterior Residuals', fontsize=fontsize)
plt.show()

If everything worked, almost all of your residuals should fall within the conditional distribution contours. Does your plot match this expectation? What does this tell you about the quality of your AR(3) model fit?
Yes, if the model is a good fit, almost all residuals should fall within the 2-sigma bounds of the conditional distribution. This indicates that the AR(3) model captures the underlying time series structure well, and the residuals behave like white noise with the expected variance. If many residuals fall outside these bounds, it would suggest model misspecification or inadequate parameter estimation.