Estimation and prediction in ARMA models

Lecture 3

Course: DS-GA 1018: Probabilistic Time Series Analysis
Date: September 17, 2025 (Wednesday)
Name:
NYU NetID:

Estimating statistics of an ARMA process

Auto- and cross-correlation functions (ACF and CCF)

We now have a solid definition of our ARMA process, but we’ll also want to quantify its statistics and parameters given data. First, let’s talk about how to estimate our two main statistics of interest: the mean and the covariance function. Throughout we will assume the process is stationary. The sample mean is the easiest; it can be estimated by:

\begin{equation*} \hat{\mu}=\frac{1}{T} \sum_{t=1}^{T} x_{t} \tag{3.1} \end{equation*}

where T is the length of our sample. The sample covariance is a bit more involved:

\begin{equation*} \hat{\gamma}(h)=\frac{1}{T} \sum_{t=1}^{T-h}\left(x_{t+h}-\hat{\mu}\right)\left(x_{t}-\hat{\mu}\right) . \tag{3.2} \end{equation*}

Note that the effective size of our sample is smaller when we calculate the sample covariance because we require an offset of h. Regardless, we still divide by T because this estimator has a few nicer properties than the one we get if we divide by T-h (see Shumway and Stoffer). We can also calculate the sample correlation from the sample covariance:

\begin{equation*} \hat{\rho}(h)=\frac{\hat{\gamma}(h)}{\hat{\gamma}(0)} \tag{3.3} \end{equation*}

These estimators are great, but unless T \rightarrow \infty they are not going to give us the true values. It’s useful to understand the expected error of these estimators to have a handle on the significance of the measurement. For our correlation function, the best metric is a comparison to white noise (which we know has a correlation coefficient of 0 for all h>0 ). For a large sample and finite h, the sample ACF of white noise is normally distributed with mean 0 and variance:

\begin{equation*} \sigma_{\hat{\rho}(h)}^{2}=\frac{1}{T} \tag{3.4} \end{equation*}

It is common to declare a measurement of \hat{\rho}(h) significant if:

\begin{equation*} |\hat{\rho}(h)|>\frac{2}{\sqrt{T}} \tag{3.5} \end{equation*}

This threshold is roughly two standard deviations, which corresponds to an approximately 95\% confidence level under the white-noise null.
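As a quick illustration, here is a minimal numpy sketch of these estimators (the function name and the white-noise example are my own) that computes the sample mean, sample autocovariance, sample ACF, and the 2/\sqrt{T} significance band:

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocovariance and autocorrelation following eqs. (3.1)-(3.3)."""
    x = np.asarray(x, dtype=float)
    T = len(x)
    mu_hat = x.mean()                       # eq. (3.1)
    gamma_hat = np.array([
        np.sum((x[h:] - mu_hat) * (x[:T - h] - mu_hat)) / T   # eq. (3.2): divide by T, not T - h
        for h in range(max_lag + 1)
    ])
    rho_hat = gamma_hat / gamma_hat[0]      # eq. (3.3)
    return mu_hat, gamma_hat, rho_hat

# Example on white noise: essentially no lag should exceed the 2 / sqrt(T) band.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
_, _, rho_hat = sample_acf(x, max_lag=20)
threshold = 2.0 / np.sqrt(len(x))           # eq. (3.5)
significant = np.nonzero(np.abs(rho_hat[1:]) > threshold)[0] + 1
print("lags flagged as significant:", significant)
```

Note that even on pure white noise roughly 1 in 20 lags will cross the band by chance, so equation (3.5) is a rough guide rather than a hard test.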


Difference equations

Let’s combine two concepts and see how the backshift operator allows us to calculate the correlation coefficients for an AR(2) process. Let’s start with our equation:

\begin{equation*} X_{t+h}=\phi_{1} X_{t+h-1}+\phi_{2} X_{t+h-2}+W_{t+h} \tag{3.6} \end{equation*}

Multiplying through by X_{t} and taking an expectation:

\begin{align*} \mathbb{E}\left[X_{t} X_{t+h}\right] & =\phi_{1} \mathbb{E}\left[X_{t+h-1} X_{t}\right]+\phi_{2} \mathbb{E}\left[X_{t+h-2} X_{t}\right]+\mathbb{E}\left[W_{t+h} X_{t}\right] \tag{3.7}\\ \end{align*}

For a stationary process, we can substitute the covariance terms, recalling that \mathbb{E}\left[X_{t+h} X_t\right]=\gamma(h). The final term, \mathbb{E}\left[W_{t+h} X_t\right], is zero for h>0 because the white noise term W_{t+h} is uncorrelated with all past values of X_t. This gives us the key result:

\begin{align*} \gamma(h) & =\phi_{1} \gamma(h-1)+\phi_{2} \gamma(h-2) \tag{3.8}\\ \end{align*}

This is known as a difference equation. We can rewrite this in terms of our backshift operator:

\begin{align*} P(B) \gamma(h) & =0 \tag{3.9} \end{align*}

The roots of the backshift polynomial are the same as the roots of our difference equation. The solutions of a difference equation of order p,

\begin{equation*} u_{n}-\alpha_{1} u_{n-1}-\ldots-\alpha_{p} u_{n-p}=0 \tag{3.10} \end{equation*}

are given by:

\begin{equation*} u_{n}=z_{1}^{-n} P_{1}(n)+z_{2}^{-n} P_{2}(n)+\ldots+z_{r}^{-n} P_{r}(n), \tag{3.11} \end{equation*}

where r is the number of distinct roots and P_{j}(n) is a polynomial in n of degree m_{j}-1 where m_{j} is the multiplicity of the root z_{j}. For \operatorname{AR}(2) we have a difference equation of order 2. If there are two distinct roots, z_{1} \neq z_{2}, then the solution has the form:

\begin{equation*} \rho(h)=c_{1} z_{1}^{-|h|}+c_{2} z_{2}^{-|h|} \tag{3.12} \end{equation*}

If instead the two roots are equal then the solution has the form:

\begin{equation*} \rho(h)=z_{1}^{-|h|}\left(c_{1}+c_{2}|h|\right) . \tag{3.13} \end{equation*}

For higher orders there is a wider variety of solutions.
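To make this concrete, here is a small sketch for an AR(2) with arbitrary coefficients: it checks the roots of P(z) and iterates the difference equation (3.8) for \rho(h), using the initial conditions \rho(0)=1 and \rho(1)=\phi_{1}/(1-\phi_{2}) (which follow from applying equation (3.8) at h=1 together with \rho(-1)=\rho(1)).

```python
import numpy as np

phi1, phi2 = 0.75, -0.5          # arbitrary AR(2) coefficients (chosen to be causal)

# Roots of P(z) = 1 - phi1 * z - phi2 * z^2; both should lie outside the unit circle.
roots = np.roots([-phi2, -phi1, 1.0])
print("roots of P(z):", roots, "| magnitudes:", np.abs(roots))

# Iterate the difference equation (3.8) for rho(h), h >= 2.
# Initial conditions: rho(0) = 1 and rho(1) = phi1 / (1 - phi2)
# (from eq. (3.8) at h = 1 with rho(-1) = rho(1)).
max_lag = 20
rho = np.zeros(max_lag + 1)
rho[0] = 1.0
rho[1] = phi1 / (1.0 - phi2)
for h in range(2, max_lag + 1):
    rho[h] = phi1 * rho[h - 1] + phi2 * rho[h - 2]
print("rho(0..5):", rho[:6])
```

For these coefficients the roots are complex, so the resulting ACF oscillates while decaying, exactly the behavior the closed-form solution (3.12) predicts.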


Partial auto-correlation function (PACF)

Let’s imagine we have a process we suspect is an MA process. If we have a reliable estimate of the ACF, we can immediately read off its order from the largest lag with a non-zero value. For an AR process that’s a lot less clear. We have already shown that for an AR(2) process the ACF is non-zero for all separations. More generally, even if X_{t+h} does not directly depend on X_{t} in the equation that defines the process, it can still be correlated with it through some intermediary X_{t+k} with k<h. What we would like to know is the correlation between X_{t+h} and X_{t} conditioned on X_{t+1: t+h-1} :

\begin{equation*} \gamma_{X_{t+h}, X_{t} \mid X_{t+1: t+h-1}} \tag{3.14} \end{equation*}

If we assume that our process is Gaussian, this is equivalent to the correlation between X_{t+h} and X_{t} after having subtracted the optimal linear predictors built from X_{t+1: t+h-1}. We will call these linear predictors \hat{X}_{t+h} and \hat{X}_{t}. This is known as the partial auto-correlation function:

Partial Autocorrelation Function: The partial auto-correlation function (PACF) for a stationary process, X_{t}, is defined as:

\begin{align*} \phi_{11} & =\rho(t+1, t)=\rho(1) \tag{3.15}\\ \phi_{hh} &= \rho((X_{t+h} - \hat{X}_{t+h}), (X_t - \hat{X}_t)) \tag{3.16} \end{align*}

where \hat{X}_{t+h} and \hat{X}_{t} are the regressions of the two variables on X_{t+1: t+h-1} that minimize mean squared error.

For now we will assert that the PACF of an AR(p) process is 0 for h>p.
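A direct, if inefficient, way to estimate the PACF is to implement the definition above with two least-squares regressions per lag. The sketch below (the helper name and the simulated AR(2) example are my own) does exactly that with numpy:

```python
import numpy as np

def sample_pacf(x, max_lag):
    """Estimate phi_hh by regressing X_{t+h} and X_t on X_{t+1:t+h-1} (eq. 3.16)."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    T = len(x)
    pacf = [1.0]                                   # lag 0, by convention
    for h in range(1, max_lag + 1):
        rows = T - h
        x_lead, x_lag = x[h:], x[:rows]            # X_{t+h} and X_t for each valid t
        if h == 1:
            resid_lead, resid_lag = x_lead, x_lag  # nothing to condition on: phi_11 = rho(1)
        else:
            # Regressors X_{t+1}, ..., X_{t+h-1}, one column per intermediate lag.
            Z = np.column_stack([x[k:k + rows] for k in range(1, h)])
            beta_lead, *_ = np.linalg.lstsq(Z, x_lead, rcond=None)
            beta_lag, *_ = np.linalg.lstsq(Z, x_lag, rcond=None)
            resid_lead, resid_lag = x_lead - Z @ beta_lead, x_lag - Z @ beta_lag
        pacf.append(np.corrcoef(resid_lead, resid_lag)[0, 1])
    return np.array(pacf)

# Example: for a simulated AR(2), the sample PACF should be near zero for h > 2.
rng = np.random.default_rng(1)
phi1, phi2, T = 0.75, -0.5, 2000
x = np.zeros(T)
for t in range(2, T):
    x[t] = phi1 * x[t - 1] + phi2 * x[t - 2] + rng.normal()
print(np.round(sample_pacf(x, 5), 3))
```

In practice, libraries compute the PACF with more efficient recursions (e.g. Durbin-Levinson), but the brute-force version makes the definition transparent.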


PACF of AR Process and ACF of MA Process

Let’s quickly show mathematically that we can guarantee that the PACF drops to 0 for lags greater than p. If we remember the equation defining an AR process, the best regression for X_{t+h} given X_{t+1: t+h-1} for h>p is simply:

\begin{equation*} \hat{X}_{t+h}=\sum_{j=1}^{p} \phi_{j} X_{t+h-j} \tag{3.17} \end{equation*}

where the remainder, X_{t+h}-\hat{X}_{t+h}, is simply W_{t+h}. We don’t even need the explicit form of the other estimator, \hat{X}_{t}, since we can then write:

\begin{align*} \phi_{h h} & =\mathbb{E}\left[\left(X_{t+h}-\hat{X}_{t+h}\right)\left(X_{t}-\hat{X}_{t}\right)\right] \tag{3.18}\\ & =\mathbb{E}\left[\left(W_{t+h}\right)\left(X_{t}-\hat{X}_{t}\right)\right] \tag{3.19}\\ & =0 \tag{3.20} \end{align*}

where we are taking advantage of causality in the final line. While it is a little more involved, it’s also possible to show that the PACF coefficient at lag p for an \operatorname{AR}(p) process satisfies:

\begin{equation*} \phi_{p p}=\phi_{p}, \tag{3.21} \end{equation*}

that is, the lag-p PACF is a direct measurement of the final coefficient of that AR process.

Similarly, we have asserted that for an MA process the ACF is a good measurement of its order. Let us circle back to that claim quickly by considering the auto-covariance function of an MA process of order q :

\begin{align*} \gamma(h) & =\mathbb{E}\left[\left(X_{t+h}\right)\left(X_{t}\right)\right] \tag{3.22}\\ & =\mathbb{E}\left[\left(\sum_{j=0}^{q} \theta_{j} W_{t+h-j}\right)\left(\sum_{k=0}^{q} \theta_{k} W_{t-k}\right)\right] \tag{3.23}\\ & = \begin{cases}\sigma_{w}^{2} \sum_{j=0}^{q-|h|} \theta_{j} \theta_{j+|h|} & |h| \leq q \\ 0 & |h|>q\end{cases} \tag{3.24} \end{align*}

and to get the ACF we only need to divide by \gamma(0)=\sigma_{w}^{2}\left(1+\sum_{j=1}^{q} \theta_{j}^{2}\right), which cancels the \sigma_{w}^{2} :

\begin{equation*} \rho(h)= \begin{cases}\frac{\sum_{j=0}^{q-|h|} \theta_{j} \theta_{j+|h|}}{1+\sum_{j=1}^{q} \theta_{j}^{2}} & |h| \leq q \\ 0 & |h|>q\end{cases} \tag{3.25} \end{equation*}

where I have simplified our notation by setting \theta_{0}=1. An AR(p) process will have an ACF that tails off and a PACF with a sharp cut at lag p. An MA(q) process will have an ACF with a sharp cut at lag q and a PACF that tails off. An \operatorname{ARMA}(p, q) process that is invertible and causal can be expressed as an infinite AR or MA process. Therefore, it will have both an ACF and a PACF that tail off, although if one of the two components dominates a sharp cut may still be observed.
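Equation (3.25) is straightforward to evaluate numerically; here is a short sketch (with arbitrary example coefficients) for the theoretical MA(q) ACF, which shows the exact cutoff beyond lag q:

```python
import numpy as np

def ma_acf(theta, max_lag):
    """Theoretical ACF of an MA(q) process per eq. (3.25), with theta_0 = 1."""
    theta = np.concatenate(([1.0], np.asarray(theta, dtype=float)))  # prepend theta_0 = 1
    q = len(theta) - 1
    denom = np.sum(theta ** 2)                     # 1 + sum_j theta_j^2
    rho = []
    for h in range(max_lag + 1):
        if h <= q:
            rho.append(np.sum(theta[: q - h + 1] * theta[h:]) / denom)
        else:
            rho.append(0.0)                        # exact cutoff beyond lag q
    return np.array(rho)

print(np.round(ma_acf([0.8, -0.4], max_lag=5), 3))  # MA(2): zero for h > 2
```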


Estimating parameters of an ARMA process

Once we’ve selected what model to use on our data, we will want to estimate its parameters. For now, let’s introduce the simplest choice we can make in Bayesian inference: the maximum a posteriori (MAP) estimate. Assume we have some data X and some model parameters \theta. The MAP estimate is given by:

\begin{align*} \hat{\theta}_{\mathrm{MAP}} & =\underset{\theta}{\operatorname{argmax}} p(\theta \mid X) \tag{3.26}\\ & =\underset{\theta}{\operatorname{argmax}} \frac{p(X \mid \theta) p(\theta)}{p(X)} \tag{3.27} \end{align*}

We already have all the tools we need to evaluate this on our ARMA processes. We will see an example of prior choices in lab, and we can write the likelihood from the equations that define the process. For an AR(1) process we have two parameters, \phi_{1} and \sigma_{w}. If we set the boundary condition X_{1}=W_{1}, we can write:

\begin{align*} & p\left(X_{1: t} \mid \phi_{1}, \sigma_{w}\right)=p\left(X_{1} \mid \phi_{1}, \sigma_{w}\right) p\left(X_{2} \mid X_{1}, \phi_{1}, \sigma_{w}\right) \times \\ & p\left(X_{3} \mid X_{1: 2}, \phi_{1}, \sigma_{w}\right) \ldots p\left(X_{t} \mid X_{1: t-1}, \phi_{1}, \sigma_{w}\right) . \tag{3.28} \end{align*}

But we know that for an \mathrm{AR}(1) process X_{t} is conditionally independent of X_{1: t-2} given X_{t-1} so we can simplify this equation to:

\begin{align*} & p\left(X_{1: t} \mid \phi_{1}, \sigma_{w}\right)=p\left(X_{1} \mid \phi_{1}, \sigma_{w}\right) p\left(X_{2} \mid X_{1}, \phi_{1}, \sigma_{w}\right) \times \\ & p\left(X_{3} \mid X_{2}, \phi_{1}, \sigma_{w}\right) \ldots p\left(X_{t} \mid X_{t-1}, \phi_{1}, \sigma_{w}\right) . \tag{3.29} \end{align*}

The likelihood of X_{t} given X_{t-1} and the parameters is just a Gaussian distribution with mean \phi_{1} X_{t-1} and variance \sigma_{w}^{2} :

\begin{align*} p\left(X_{t} \mid X_{t-1}, \phi_{1}, \sigma_{w}\right) & =\mathcal{N}\left(\phi_{1} X_{t-1}, \sigma_{w}^{2}\right) \tag{3.30}\\ & =\frac{1}{\sqrt{2 \pi} \sigma_{w}} \exp \left[\frac{-\left(X_{t}-\phi_{1} X_{t-1}\right)^{2}}{2 \sigma_{w}^{2}}\right] \tag{3.31} \end{align*}

You will need to extend this to \operatorname{AR}(2) for lab this week, but I will spoil the answer and tell you that it’s just a matter of adding \phi_{2} X_{t-2} to the mean.

One final note: when we take the product of many probability densities we quickly get vanishingly small numbers. From an implementation standpoint, this runs up against the numerical precision of our computer. To avoid this, we often work with log pdfs instead of the pdfs themselves and take advantage of the property that:

\begin{equation*} \log [p(x) p(y)]=\log [p(x)]+\log [p(y)] \tag{3.32} \end{equation*}
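Putting the last two pieces together, here is a minimal sketch of the AR(1) conditional log-likelihood of equations (3.29)-(3.31), computed as a sum of log pdfs rather than a product of pdfs. The parameterization (optimizing \log \sigma_{w}) and the optimizer call are my own choices rather than the prescribed lab implementation; under a flat prior, the maximizer is the MAP estimate.

```python
import numpy as np
from scipy.optimize import minimize

def ar1_log_likelihood(params, x):
    """log p(X_{1:T} | phi_1, sigma_w) for an AR(1) with X_1 = W_1 (eqs. 3.29-3.31)."""
    phi1, log_sigma = params
    sigma2 = np.exp(2.0 * log_sigma)            # work with log sigma_w so it stays positive
    resid = np.concatenate(([x[0]], x[1:] - phi1 * x[:-1]))  # X_1 and X_t - phi_1 X_{t-1}
    # Sum of Gaussian log pdfs instead of a product of pdfs (eq. 3.32).
    return np.sum(-0.5 * np.log(2.0 * np.pi * sigma2) - resid ** 2 / (2.0 * sigma2))

# Maximizing the log likelihood is the MAP estimate under a flat prior (an assumption here).
rng = np.random.default_rng(2)
phi_true, sigma_true, T = 0.6, 1.5, 1000
x = np.zeros(T)
x[0] = sigma_true * rng.normal()                # boundary condition X_1 = W_1
for t in range(1, T):
    x[t] = phi_true * x[t - 1] + sigma_true * rng.normal()
result = minimize(lambda p: -ar1_log_likelihood(p, x), x0=[0.0, 0.0])
print("phi_1 estimate:", result.x[0], "| sigma_w estimate:", np.exp(result.x[1]))
```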


Conditional prediction

Imagine that we have some time series \left\{X_{1}, \ldots, X_{t}\right\} that is ARMA and we want to predict \left\{X_{k+1}, \ldots, X_{t}\right\} given \left\{X_{1}, \ldots, X_{k}\right\}. How would we do that? Well, this is where our Gaussian machinery from Lecture 2 comes into play. First, since every random variable in our ARMA process is a linear sum of Gaussian random variables, we know that the full time series can be described as a multivariate Gaussian. If we think of our time series as a vector, we can write our joint likelihood as:

\begin{equation*} \mathcal{N}(\boldsymbol{X}=\boldsymbol{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})=\frac{1}{(2 \pi)^{d / 2}|\boldsymbol{\Sigma}|^{1 / 2}} \exp \left[-\frac{1}{2}(\boldsymbol{x}-\boldsymbol{\mu})^{T} \boldsymbol{\Sigma}^{-1}(\boldsymbol{x}-\boldsymbol{\mu})\right] \tag{3.33} \end{equation*}

Now conditioning on the first k observations doesn’t seem so hard: that’s just the conditional distribution for a multivariate Gaussian. As a reminder from Lecture 2, that was:

\begin{align*} p\left(\boldsymbol{x}_{a} \mid \boldsymbol{x}_{b}\right) & =\mathcal{N}\left(\boldsymbol{x}_{a} \mid \boldsymbol{\mu}_{a \mid b}, \boldsymbol{\Sigma}_{a \mid b}\right) \tag{3.34}\\ \boldsymbol{\mu}_{a \mid b} & =\boldsymbol{\mu}_{a}+\boldsymbol{\Sigma}_{a b} \boldsymbol{\Sigma}_{b b}^{-1}\left(\boldsymbol{x}_{b}-\boldsymbol{\mu}_{b}\right) \tag{3.35}\\ \boldsymbol{\Sigma}_{a \mid b} & =\boldsymbol{\Sigma}_{a a}-\boldsymbol{\Sigma}_{a b} \boldsymbol{\Sigma}_{b b}^{-1} \boldsymbol{\Sigma}_{b a} \tag{3.36} \end{align*}

with:

\begin{equation*} \mathcal{N}(\boldsymbol{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})=\mathcal{N}\left(\left[\begin{array}{l} \boldsymbol{x}_{a} \\ \boldsymbol{x}_{b} \end{array}\right] \left\lvert\,\left[\begin{array}{l} \boldsymbol{\mu}_{a} \\ \boldsymbol{\mu}_{b} \end{array}\right]\right.,\left[\begin{array}{cc} \boldsymbol{\Sigma}_{a a} & \boldsymbol{\Sigma}_{a b} \\ \boldsymbol{\Sigma}_{b a} & \boldsymbol{\Sigma}_{b b} \end{array}\right]\right) \tag{3.37} \end{equation*}
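Given the mean and covariance of the full series, equations (3.34)-(3.36) translate directly into a few lines of linear algebra. In this sketch the block ordering and names are my own, and the example covariance uses the standard AR(1) autocovariance \gamma(h)=\sigma_{w}^{2} \phi^{|h|} /\left(1-\phi^{2}\right) (which follows from the infinite MA representation discussed below).

```python
import numpy as np

def condition_gaussian(mu, Sigma, x_obs, k):
    """Predict x_{k+1:T} given x_{1:k} via eqs. (3.34)-(3.36).

    mu, Sigma describe the joint Gaussian of the full length-T series;
    the first k entries are observed ("b"), the rest are predicted ("a").
    """
    mu_b, mu_a = mu[:k], mu[k:]
    S_bb, S_ab = Sigma[:k, :k], Sigma[k:, :k]
    S_aa, S_ba = Sigma[k:, k:], Sigma[:k, k:]
    solve_b = np.linalg.solve(S_bb, x_obs - mu_b)            # Sigma_bb^{-1} (x_b - mu_b)
    mu_cond = mu_a + S_ab @ solve_b                          # eq. (3.35)
    Sigma_cond = S_aa - S_ab @ np.linalg.solve(S_bb, S_ba)   # eq. (3.36)
    return mu_cond, Sigma_cond

# Example with a zero-mean AR(1)-style covariance gamma(h) = sigma2 * phi^|h| / (1 - phi^2).
phi, sigma2, T, k = 0.8, 1.0, 8, 5
lags = np.abs(np.arange(T)[:, None] - np.arange(T)[None, :])
Sigma = sigma2 * phi ** lags / (1.0 - phi ** 2)
x_obs = np.array([0.3, -0.1, 0.5, 0.9, 0.4])                # hypothetical observed values
mu_cond, Sigma_cond = condition_gaussian(np.zeros(T), Sigma, x_obs, k)
print("predicted mean of the last three entries:", np.round(mu_cond, 3))
```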

The mean of our ARMA process is 0 as we have defined it, so all we need to do is calculate the covariance matrix entries \gamma(t, t+h)=\gamma(h). This may seem daunting, but any causal ARMA process can be written as an infinite MA process:

\begin{equation*} X_{t}=\sum_{j=0}^{\infty} \psi_{j} W_{t-j} \tag{3.38} \end{equation*}

To solve for the coefficients we can take advantage of the fact that P(z) \psi(z)= \theta(z) :

\begin{equation*} \left(1-\phi_{1} z-\ldots\right)\left(\psi_{0}+\psi_{1} z+\ldots\right)=\left(1+\theta_{1} z+\ldots\right) \tag{3.39} \end{equation*}

For this to be true for all z, the coefficients on the left- and right-hand sides for every power of z must be equal. That is:

\begin{align*} \psi_{0} & =1 \tag{3.40}\\ \psi_{1}-\phi_{1} \psi_{0} & =\theta_{1} \tag{3.41}\\ \vdots & \tag{3.42} \end{align*}

Getting a generic closed-form solution from these equations is highly non-trivial, but we can iteratively solve for each of the \psi_{j} up to some arbitrary order computationally. Once we have our \psi_{j}, we can use them in place of the \theta_{j} in equation (3.25), truncating the sum at our chosen order, to approximate the auto-correlation function.
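Matching coefficients as in equations (3.40)-(3.41) gives the recursion \psi_{j}=\theta_{j}+\sum_{k=1}^{\min (j, p)} \phi_{k} \psi_{j-k} (with \theta_{j}=0 for j>q), which is easy to iterate. The sketch below (function names and the ARMA(1,1) example are my own) does this and then approximates the ACF by truncating the infinite MA representation:

```python
import numpy as np

def psi_coefficients(phi, theta, order):
    """Iteratively solve P(z) psi(z) = theta(z) for psi_0, ..., psi_order (eqs. 3.40-3.41)."""
    phi, theta = np.asarray(phi, dtype=float), np.asarray(theta, dtype=float)
    psi = np.zeros(order + 1)
    psi[0] = 1.0
    for j in range(1, order + 1):
        theta_j = theta[j - 1] if j <= len(theta) else 0.0   # theta_j = 0 beyond lag q
        ar_part = sum(phi[k - 1] * psi[j - k] for k in range(1, min(j, len(phi)) + 1))
        psi[j] = theta_j + ar_part
    return psi

def arma_acf(phi, theta, max_lag, order=200):
    """Approximate ACF of a causal ARMA via its truncated infinite-MA representation."""
    psi = psi_coefficients(phi, theta, order)
    gamma = np.array([np.sum(psi[: order + 1 - h] * psi[h:]) for h in range(max_lag + 1)])
    return gamma / gamma[0]                                  # sigma_w^2 cancels in the ratio

# Example: ARMA(1, 1) with phi_1 = 0.6, theta_1 = 0.4.
print(np.round(arma_acf(phi=[0.6], theta=[0.4], max_lag=5), 3))
```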