So far, we’ve derived our equations without assuming any specific form for the transition probability or the observation probability. One flexible and mathematically tractable choice is the Linear Dynamical System (LDS).
For an LDS we state that our initial state is a Gaussian random variable with some initial mean and covariance:
p\left(\boldsymbol{z}_{0}\right)=\mathcal{N}\left(\boldsymbol{\mu}_{0}, \boldsymbol{\Sigma}_{0}\right) \tag{5.1}
Our observation is defined by the equation:
\boldsymbol{x}_{t}=\boldsymbol{C} \boldsymbol{z}_{t}+\boldsymbol{v}_{t} \tag{5.2}
with \boldsymbol{C} being our observation matrix and p\left(\boldsymbol{v}_{t}\right)=\mathcal{N}(0, \boldsymbol{R}). The transition between states is defined by the equation:
\boldsymbol{z}_{t+1}=\boldsymbol{A} \boldsymbol{z}_{t}+\boldsymbol{w}_{t+1} \tag{5.3}
with \boldsymbol{A} being our transition matrix and p\left(\boldsymbol{w}_{t}\right)=\mathcal{N}(0, \boldsymbol{Q}). The free parameters of our system are \left\{\boldsymbol{A}, \boldsymbol{Q}, \boldsymbol{C}, \boldsymbol{R}, \boldsymbol{\mu}_{0}, \boldsymbol{\Sigma}_{0}\right\}. Because every variable in our system is a linear combination of Gaussian random variables, we know that all of our conditional, joint, and marginal distributions will themselves be multivariate Gaussians. In fact, we can write:
\begin{align} p\left(\boldsymbol{z}_{t+1} \mid \boldsymbol{z}_{t}\right) & =\mathcal{N}\left(\boldsymbol{A} \boldsymbol{z}_{t}, \boldsymbol{Q}\right) \tag{5.4}\\ p\left(\boldsymbol{x}_{t} \mid \boldsymbol{z}_{t}\right) & =\mathcal{N}\left(\boldsymbol{C} \boldsymbol{z}_{t}, \boldsymbol{R}\right) \tag{5.5} \end{align}
If we call d the dimensionality of \boldsymbol{z}_{t} and n the dimensionality of \boldsymbol{x}_{t}, notice that our system allows d \neq n.
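To make the generative model concrete, here is a minimal numpy sketch that samples a trajectory from an LDS using Equations 5.1-5.3. The specific matrices, dimensions, and seed below are illustrative assumptions, not values from the text.

```python
# A sketch of sampling one trajectory from the LDS in Eqs. 5.1-5.3.
# All matrix values here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

d, n, T = 2, 3, 100                              # latent dim, observed dim, series length
A = np.array([[0.99, 0.10], [-0.10, 0.99]])      # transition matrix (stable, rotation-like)
C = rng.standard_normal((n, d))                  # observation matrix
Q = 0.01 * np.eye(d)                             # transition noise covariance
R = 0.10 * np.eye(n)                             # observation noise covariance
mu0, Sigma0 = np.zeros(d), np.eye(d)             # initial state distribution

z = np.zeros((T + 1, d))                         # z_0, ..., z_T
x = np.zeros((T + 1, n))                         # x_1, ..., x_T (index 0 unused)
z[0] = rng.multivariate_normal(mu0, Sigma0)      # z_0 ~ N(mu0, Sigma0)
for t in range(1, T + 1):
    z[t] = A @ z[t - 1] + rng.multivariate_normal(np.zeros(d), Q)   # z_t = A z_{t-1} + w_t
    x[t] = C @ z[t] + rng.multivariate_normal(np.zeros(n), R)       # x_t = C z_t + v_t
```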
A note on notation
The notation for estimates like \boldsymbol{\mu}_{a \mid b} and \boldsymbol{\Sigma}_{a \mid b} follows a single rule: an estimate of the latent state \boldsymbol{z} at time a, given observations \boldsymbol{x} up to time b, where the first subscript (a) is the state’s time and the second subscript (b) is the data’s time.
For example, we will observe the following three types of estimates
in this lecture:
- \boldsymbol{\mu}_{t \mid t-1} is the
forecast for state \boldsymbol{z}_t
using data up to the previous time step, \boldsymbol{x}_{1: t-1}.
- \boldsymbol{\mu}_{t \mid t} is the
updated estimate for state \boldsymbol{z}_t using data up to the current
time step, \boldsymbol{x}_{1: t}.
- \boldsymbol{\mu}_{t \mid T} is the
revised, most accurate estimate for state \boldsymbol{z}_t using the entire dataset,
\boldsymbol{x}_{1: T}.
We’ll use similar notation for the covariance, i.e., \boldsymbol{\Sigma}_{t \mid t-1}, \boldsymbol{\Sigma}_{t \mid t}, and \boldsymbol{\Sigma}_{t \mid T}.
Kalman filtering and smoothing
For this lecture, we will assume we know our free parameters \left\{\boldsymbol{A}, \boldsymbol{Q}, \boldsymbol{C}, \boldsymbol{R}, \boldsymbol{\mu}_{0}, \boldsymbol{\Sigma}_{0}\right\} and we are interested in calculating p\left(\boldsymbol{z}_{t} \mid \boldsymbol{x}_{1: T}\right) so that we can understand our latent space and use it for prediction. To do this, we will want to calculate our forward pass \alpha and backwards pass \beta. In the LDS case, calculating \alpha will give rise to our Kalman filtering equations, and calculating \beta will give rise to our Kalman smoothing equations.
Let’s start with our filtering equations. The forward pass recursively computes the joint probability of the latent state and the observations up to that point, \alpha\left(\boldsymbol{z}_{t}\right)=p\left(\boldsymbol{x}_{1: t}, \boldsymbol{z}_{t}\right). This quantity, often called the forward message, is proportional to the filtered posterior we want, p\left(\boldsymbol{z}_{t} \mid \boldsymbol{x}_{1: t}\right). The recursion is defined as:
\alpha\left(\boldsymbol{z}_{t}\right)=p\left(\boldsymbol{x}_{t} \mid \boldsymbol{z}_{t}\right) \int \alpha\left(\boldsymbol{z}_{t-1}\right) p\left(\boldsymbol{z}_{t} \mid \boldsymbol{z}_{t-1}\right) d \boldsymbol{z}_{t-1} \tag{5.6}
Let’s assume we have already calculated \alpha\left(\boldsymbol{z}_{t-1}\right) since we are doing things recursively. Let’s define:
\alpha\left(\boldsymbol{z}_{t-1}\right) \propto \mathcal{N}\left(\boldsymbol{\mu}_{t-1 \mid t-1}, \boldsymbol{\Sigma}_{t-1 \mid t-1}\right) \tag{5.7}
Then we can write:
\begin{align} \alpha\left(\boldsymbol{z}_{t}\right) &\propto \mathcal{N}\left(\boldsymbol{x}_{t} \mid \boldsymbol{C} \boldsymbol{z}_{t}, \boldsymbol{R}\right) \\ &\quad \int \mathcal{N}\left(\boldsymbol{z}_{t-1} \mid \boldsymbol{\mu}_{t-1 \mid t-1}, \boldsymbol{\Sigma}_{t-1 \mid t-1}\right) \\ &\quad\quad \mathcal{N}\left(\boldsymbol{z}_{t} \mid \boldsymbol{A} \boldsymbol{z}_{t-1}, \boldsymbol{Q}\right) d \boldsymbol{z}_{t-1} \tag{5.8} \end{align}
The integral may seem daunting, but we can take advantage of a useful property of multivariate Gaussians:
\begin{align} \int \mathcal{N}\left(\boldsymbol{x} \mid \boldsymbol{\mu}_{x}, \boldsymbol{\Sigma}_{x}\right) \mathcal{N}\left(\boldsymbol{y} \mid \boldsymbol{A} \boldsymbol{x}, \boldsymbol{\Sigma}_{y}\right) d \boldsymbol{x} \\ \quad = \mathcal{N}\left(\boldsymbol{y} \mid \boldsymbol{A} \boldsymbol{\mu}_{x}, \boldsymbol{\Sigma}_{y}+\boldsymbol{A} \boldsymbol{\Sigma}_{x} \boldsymbol{A}^{T}\right) \tag{5.9} \end{align}
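As a quick sanity check (not part of the derivation), we can verify Equation 5.9 numerically by Monte Carlo; all test matrices below are arbitrary assumptions chosen only for illustration.

```python
# A Monte Carlo sanity check of the marginalization identity in Eq. 5.9,
# using arbitrary (assumed) test matrices.
import numpy as np

rng = np.random.default_rng(1)
mu_x = np.array([1.0, -2.0])
Sigma_x = np.array([[1.0, 0.3], [0.3, 0.5]])
A = np.array([[2.0, 0.0], [1.0, -1.0], [0.5, 0.5]])   # maps R^2 to R^3
Sigma_y = 0.2 * np.eye(3)

# Sample x ~ N(mu_x, Sigma_x), then y | x ~ N(A x, Sigma_y).
xs = rng.multivariate_normal(mu_x, Sigma_x, size=200_000)
ys = xs @ A.T + rng.multivariate_normal(np.zeros(3), Sigma_y, size=200_000)

# Empirical moments of y should match N(A mu_x, Sigma_y + A Sigma_x A^T).
print(ys.mean(axis=0), A @ mu_x)
print(np.cov(ys, rowvar=False), Sigma_y + A @ Sigma_x @ A.T)
```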
Plugging Equation 5.9 into our expression for \alpha\left(\boldsymbol{z}_{t}\right) above, we get:
\alpha\left(\boldsymbol{z}_{t}\right) \propto \mathcal{N}\left(\boldsymbol{x}_{t} \mid \boldsymbol{C} \boldsymbol{z}_{t}, \boldsymbol{R}\right) \mathcal{N}\left(\boldsymbol{z}_{t} \mid \boldsymbol{A} \boldsymbol{\mu}_{t-1 \mid t-1}, \boldsymbol{Q}+\boldsymbol{A} \boldsymbol{\Sigma}_{t-1 \mid t-1} \boldsymbol{A}^{T}\right) \tag{5.10}
where we have defined:
\begin{align} \boldsymbol{\mu}_{t \mid t-1} & =\boldsymbol{A} \boldsymbol{\mu}_{t-1 \mid t-1} \tag{5.11}\\ \boldsymbol{\Sigma}_{t \mid t-1} & =\boldsymbol{Q}+\boldsymbol{A} \boldsymbol{\Sigma}_{t-1 \mid t-1} \boldsymbol{A}^{T} \tag{5.12} \end{align}
We can think of these as our knowledge about the latent state carried forward from the previous time step, before we have incorporated the new observation. This leaves us with:
\alpha\left(\boldsymbol{z}_{t}\right) \propto \mathcal{N}\left(\boldsymbol{x}_{t} \mid \boldsymbol{C} \boldsymbol{z}_{t}, \boldsymbol{R}\right) \mathcal{N}\left(\boldsymbol{z}_{t} \mid \boldsymbol{\mu}_{t \mid t-1}, \boldsymbol{\Sigma}_{t \mid t-1}\right) \tag{5.13}
Kalman filtering equations
While a bit more challenging to derive (see Appendix A), we can get:
\begin{align} \alpha\left(\boldsymbol{z}_t\right) & \propto \mathcal{N}\left(\boldsymbol{z}_t \mid \boldsymbol{\mu}_{t \mid t}, \boldsymbol{\Sigma}_{t \mid t}\right) \tag{5.14}\\ \boldsymbol{\mu}_{t \mid t} & =\boldsymbol{\mu}_{t \mid t-1}+\boldsymbol{K}_t\left(\boldsymbol{x}_t-\boldsymbol{C} \boldsymbol{\mu}_{t \mid t-1}\right) \tag{5.15}\\ \boldsymbol{\Sigma}_{t \mid t} & =\boldsymbol{\Sigma}_{t \mid t-1}-\boldsymbol{K}_t \boldsymbol{C} \boldsymbol{\Sigma}_{t \mid t-1} \tag{5.16}\\ \boldsymbol{K}_t & =\boldsymbol{\Sigma}_{t \mid t-1} \boldsymbol{C}^T\left(\boldsymbol{C} \boldsymbol{\Sigma}_{t \mid t-1} \boldsymbol{C}^T+\boldsymbol{R}\right)^{-1} \tag{5.17} \end{align}
The matrix \boldsymbol{K}_t is called the Kalman gain. If you think of the updated mean and covariance as our new belief after seeing the data, you can convince yourself that the Kalman gain controls how strongly the data updates that belief. We call \boldsymbol{\mu}_{t \mid t} and \boldsymbol{\Sigma}_{t \mid t} our Kalman filtering mean and covariance. They encode the mean and covariance of the latent state given only the data up until \boldsymbol{x}_{t}; therefore, they are the final (full-data) mean and covariance only at the last time step, for \boldsymbol{z}_{T}.
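A minimal numpy sketch of one predict-update cycle, following Equations 5.11-5.17, might look like the following; the function name and interface are illustrative choices, not from the text.

```python
# A sketch of one Kalman filter predict-update cycle (Eqs. 5.11-5.17).
import numpy as np

def kalman_filter_step(mu_prev, Sigma_prev, x_t, A, C, Q, R):
    # Predict: knowledge about z_t before seeing x_t (Eqs. 5.11-5.12).
    mu_pred = A @ mu_prev                          # mu_{t|t-1}
    Sigma_pred = A @ Sigma_prev @ A.T + Q          # Sigma_{t|t-1}

    # Update: incorporate x_t (Eqs. 5.15-5.17).
    S = C @ Sigma_pred @ C.T + R                   # innovation covariance
    K = Sigma_pred @ C.T @ np.linalg.inv(S)        # Kalman gain K_t
    mu_filt = mu_pred + K @ (x_t - C @ mu_pred)    # mu_{t|t}
    Sigma_filt = Sigma_pred - K @ C @ Sigma_pred   # Sigma_{t|t}
    return mu_filt, Sigma_filt, mu_pred, Sigma_pred
```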
Kalman smoothing equations
Now we can conduct our backward pass to calculate the \alpha \beta quantity that defines our target probability. As before, we assume that we already have the solution:
\alpha\left(\boldsymbol{z}_{t+1}\right) \beta\left(\boldsymbol{z}_{t+1}\right) \propto \mathcal{N}\left(\boldsymbol{\mu}_{t+1 \mid T}, \boldsymbol{\Sigma}_{t+1 \mid T}\right) \tag{5.18}
Now we can take our equation:
\beta\left(\boldsymbol{z}_{t}\right) \alpha\left(\boldsymbol{z}_{t}\right) \propto \int \alpha\left(\boldsymbol{z}_{t}\right) \beta\left(\boldsymbol{z}_{t+1}\right) p\left(\boldsymbol{x}_{t+1} \mid \boldsymbol{z}_{t+1}\right) p\left(\boldsymbol{z}_{t+1} \mid \boldsymbol{z}_{t}\right) d \boldsymbol{z}_{t+1} \tag{5.19}
which after some manipulation (see Appendix A) gives us the equations:
\begin{align} \alpha\left(\boldsymbol{z}_{t}\right) \beta\left(\boldsymbol{z}_{t}\right) & \propto \mathcal{N}\left(\boldsymbol{\mu}_{t \mid T}, \boldsymbol{\Sigma}_{t \mid T}\right) \tag{5.20}\\ \boldsymbol{\mu}_{t \mid T} & =\boldsymbol{\mu}_{t \mid t}+\boldsymbol{F}_{t}\left(\boldsymbol{\mu}_{t+1 \mid T}-\boldsymbol{\mu}_{t+1 \mid t}\right) \tag{5.21}\\ \boldsymbol{\Sigma}_{t \mid T} & =\boldsymbol{F}_{t}\left(\boldsymbol{\Sigma}_{t+1 \mid T}-\boldsymbol{\Sigma}_{t+1 \mid t}\right) \boldsymbol{F}_{t}^{T}+\boldsymbol{\Sigma}_{t \mid t} \tag{5.22}\\ \boldsymbol{F}_{t} & =\boldsymbol{\Sigma}_{t \mid t} \boldsymbol{A}^{T} \boldsymbol{\Sigma}_{t+1 \mid t}^{-1} \tag{5.23} \end{align}
All of the quantities in this equation are either terms from our forward pass, or terms that come from \alpha\left(\boldsymbol{z}_{t+1}\right) \beta\left(\boldsymbol{z}_{t+1}\right). We can call \boldsymbol{\mu}_{t \mid T} and \boldsymbol{\Sigma}_{t \mid T} our Kalman smoothing mean and covariance. So we can finally write:
p\left(\boldsymbol{z}_{t} \mid \boldsymbol{x}_{1: T}\right)=\mathcal{N}\left(\boldsymbol{\mu}_{t \mid T}, \boldsymbol{\Sigma}_{t \mid T}\right) \tag{5.24}
Note that for t=T the smoothing mean and covariance are equal to the filtering mean and covariance.
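A corresponding sketch of one backward smoothing step, following Equations 5.21-5.23, is shown below; again, the function name and interface are illustrative rather than prescribed by the text.

```python
# A sketch of one backward Kalman smoothing step (Eqs. 5.21-5.23).
import numpy as np

def kalman_smoother_step(mu_filt, Sigma_filt,                # mu_{t|t},   Sigma_{t|t}
                         mu_pred_next, Sigma_pred_next,      # mu_{t+1|t}, Sigma_{t+1|t}
                         mu_smooth_next, Sigma_smooth_next,  # mu_{t+1|T}, Sigma_{t+1|T}
                         A):
    F = Sigma_filt @ A.T @ np.linalg.inv(Sigma_pred_next)                         # F_t (Eq. 5.23)
    mu_smooth = mu_filt + F @ (mu_smooth_next - mu_pred_next)                     # Eq. 5.21
    Sigma_smooth = Sigma_filt + F @ (Sigma_smooth_next - Sigma_pred_next) @ F.T   # Eq. 5.22
    return mu_smooth, Sigma_smooth
```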
Algorithm summary
We now have all the pieces we need to conduct Kalman filtering and smoothing, which corresponds to calculating p\left(\boldsymbol{z}_{t} \mid \boldsymbol{x}_{1: T}\right) for all t, i.e., inference in our LDS. The full algorithm is:
- Set the initial state at t=0 using the prior.
\boldsymbol{\mu}_{0 \mid 0}=\boldsymbol{\mu}_0, \quad \boldsymbol{\Sigma}_{0 \mid 0}=\boldsymbol{\Sigma}_0
- Predict the state for t=1 before any data is seen.
\begin{aligned} \boldsymbol{\mu}_{1 \mid 0} & =\boldsymbol{A} \boldsymbol{\mu}_{0 \mid 0} \\ \boldsymbol{\Sigma}_{1 \mid 0} & =\boldsymbol{A} \boldsymbol{\Sigma}_{0 \mid 0} \boldsymbol{A}^T+\boldsymbol{Q} \end{aligned}
- For t=1 to T, iterate through the predict and update steps using the Kalman filtering equations (5.11-5.17) to compute and store all \left\{\boldsymbol{\mu}_{t \mid t}, \boldsymbol{\Sigma}_{t \mid t}\right\} and \left\{\boldsymbol{\mu}_{t \mid t-1}, \boldsymbol{\Sigma}_{t \mid t-1}\right\}.
- For t=T-1 down to 0, use the stored values from the forward pass to compute the smoothed estimates \left\{\boldsymbol{\mu}_{t \mid T}, \boldsymbol{\Sigma}_{t \mid T}\right\} using the Kalman smoothing equations (5.21-5.23), as sketched in the code below.
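Putting the pieces together, a self-contained numpy sketch of the full filter-then-smooth procedure might look like this. The indexing convention, with x[0] unused so that the data is x[1], ..., x[T] and index 0 holds the prior-only state, is an assumption made to mirror the text's notation.

```python
# A sketch of the full Kalman filtering and smoothing pass for an LDS.
import numpy as np

def kalman_filter_smoother(x, A, C, Q, R, mu0, Sigma0):
    T = x.shape[0] - 1
    d = mu0.shape[0]
    mu_pred = np.zeros((T + 1, d)); Sigma_pred = np.zeros((T + 1, d, d))
    mu_filt = np.zeros((T + 1, d)); Sigma_filt = np.zeros((T + 1, d, d))

    # Initialization at t = 0 from the prior.
    mu_filt[0], Sigma_filt[0] = mu0, Sigma0

    # Forward pass: predict and update for t = 1, ..., T (Eqs. 5.11-5.17).
    for t in range(1, T + 1):
        mu_pred[t] = A @ mu_filt[t - 1]
        Sigma_pred[t] = A @ Sigma_filt[t - 1] @ A.T + Q
        S = C @ Sigma_pred[t] @ C.T + R
        K = Sigma_pred[t] @ C.T @ np.linalg.inv(S)
        mu_filt[t] = mu_pred[t] + K @ (x[t] - C @ mu_pred[t])
        Sigma_filt[t] = Sigma_pred[t] - K @ C @ Sigma_pred[t]

    # Backward pass: smoothed estimates for t = T-1 down to 0 (Eqs. 5.21-5.23).
    mu_smooth = mu_filt.copy(); Sigma_smooth = Sigma_filt.copy()
    for t in range(T - 1, -1, -1):
        F = Sigma_filt[t] @ A.T @ np.linalg.inv(Sigma_pred[t + 1])
        mu_smooth[t] = mu_filt[t] + F @ (mu_smooth[t + 1] - mu_pred[t + 1])
        Sigma_smooth[t] = Sigma_filt[t] + F @ (Sigma_smooth[t + 1] - Sigma_pred[t + 1]) @ F.T
    return mu_filt, Sigma_filt, mu_smooth, Sigma_smooth
```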
Prediction under LDS
For some LDS applications, we will only be interested in reconstructing the latent space. However, for most applications we will also be interested in making predictions. First we have already shown that:
p\left(\boldsymbol{z}_{T+1} \mid \boldsymbol{x}_{1: T}\right)=\frac{1}{p\left(\boldsymbol{x}_{1: T}\right)} \int p\left(\boldsymbol{z}_{T+1} \mid \boldsymbol{z}_{T}\right) \alpha\left(\boldsymbol{z}_{T}\right) d \boldsymbol{z}_{T} \tag{5.25}
So far we have done nothing specific to the LDS case, but this equation serves as a reminder that we only need the forward pass in order to do predictions. Of course, as we will see in Section 5.4, we may still need to do the backward pass if we want to estimate the model parameters. Returning to our equation we have:
p\left(\boldsymbol{z}_{T+1} \mid \boldsymbol{x}_{1: T}\right)=\int p\left(\boldsymbol{z}_{T+1} \mid \boldsymbol{z}_{T}\right) p\left(\boldsymbol{z}_{T} \mid \boldsymbol{x}_{1: T}\right) d \boldsymbol{z}_{T} \tag{5.26}
We know the transition probability for our LDS system, and we have solved for p\left(\boldsymbol{z}_{T} \mid \boldsymbol{x}_{1: T}\right), so we have that:
\begin{align} p\left(\boldsymbol{z}_{T+1} \mid \boldsymbol{x}_{1: T}\right) & =\mathcal{N}\left(\boldsymbol{z}_{T+1} \mid \boldsymbol{\mu}_{T+1 \mid T}, \boldsymbol{\Sigma}_{T+1 \mid T}\right) \tag{5.27}\\ \boldsymbol{\mu}_{T+1 \mid T} & =\boldsymbol{A} \boldsymbol{\mu}_{T \mid T} \tag{5.28}\\ \boldsymbol{\Sigma}_{T+1 \mid T} & =\boldsymbol{A} \boldsymbol{\Sigma}_{T \mid T} \boldsymbol{A}^{T}+\boldsymbol{Q} \tag{5.29} \end{align}
In fact, for some arbitrary latent state \boldsymbol{z}_{T+k} we have:
p\left(\boldsymbol{z}_{T+k} \mid \boldsymbol{x}_{1: T}\right)=\int p\left(\boldsymbol{z}_{T+k} \mid \boldsymbol{z}_{T+k-1}\right) p\left(\boldsymbol{z}_{T+k-1} \mid \boldsymbol{x}_{1: T}\right) d \boldsymbol{z}_{T+k-1} \tag{5.30}
and therefore:
\begin{align} p\left(\boldsymbol{z}_{T+k} \mid \boldsymbol{x}_{1: T}\right) & =\mathcal{N}\left(\boldsymbol{z}_{T+k} \mid \boldsymbol{\mu}_{T+k \mid T}, \boldsymbol{\Sigma}_{T+k \mid T}\right) \tag{5.31}\\ \boldsymbol{\mu}_{T+k \mid T} & =\boldsymbol{A} \boldsymbol{\mu}_{T+k-1 \mid T} \tag{5.32}\\ \boldsymbol{\Sigma}_{T+k \mid T} & =\boldsymbol{A} \boldsymbol{\Sigma}_{T+k-1 \mid T} \boldsymbol{A}^{T}+\boldsymbol{Q} \tag{5.33} \end{align}
So we can calculate future predictions with a recursive forward pass up until the time index of interest. The forward-pass equations become simpler once we pass the time index of our last observed data point, since there is no longer an update step. What if we want to predict the data itself? That follows a similar set of equations:
p\left(\boldsymbol{x}_{T+k} \mid \boldsymbol{x}_{1: T}\right)=\int p\left(\boldsymbol{x}_{T+k} \mid \boldsymbol{z}_{T+k}\right) p\left(\boldsymbol{z}_{T+k} \mid \boldsymbol{x}_{1: T}\right) d \boldsymbol{z}_{T+k} \tag{5.34}
which, given the observation probability of our LDS system, gives us:
p\left(\boldsymbol{x}_{T+k} \mid \boldsymbol{x}_{1: T}\right)=\mathcal{N}\left(\boldsymbol{x}_{T+k} \mid \boldsymbol{C} \boldsymbol{\mu}_{T+k \mid T}, \boldsymbol{C} \boldsymbol{\Sigma}_{T+k \mid T} \boldsymbol{C}^{T}+\boldsymbol{R}\right) \tag{5.35}
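A short sketch of k-step-ahead prediction, implementing Equations 5.31-5.35 starting from the final filtered moments, could look like this; the function name and interface are illustrative.

```python
# A sketch of k-step-ahead prediction for latents and observations (Eqs. 5.31-5.35),
# starting from the final filtered moments mu_{T|T} and Sigma_{T|T}.
import numpy as np

def lds_forecast(mu_T, Sigma_T, A, C, Q, R, k):
    mu, Sigma = mu_T, Sigma_T
    for _ in range(k):
        mu = A @ mu                      # mu_{T+j|T}    (Eq. 5.32)
        Sigma = A @ Sigma @ A.T + Q      # Sigma_{T+j|T} (Eq. 5.33)
    x_mean = C @ mu                      # mean of x_{T+k}       (Eq. 5.35)
    x_cov = C @ Sigma @ C.T + R          # covariance of x_{T+k} (Eq. 5.35)
    return mu, Sigma, x_mean, x_cov
```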
With these equations for inference (filtering and smoothing) and prediction, we have a complete framework for analyzing a time series, provided we know the model parameters. The next step is to learn these parameters—\left\{\boldsymbol{A}, \boldsymbol{Q}, \boldsymbol{C}, \boldsymbol{R}, \boldsymbol{\mu}_{0}, \boldsymbol{\Sigma}_{0}\right\}—directly from the data, which is the subject of our next lecture.
Appendix A: Kalman Filter and Smoothing Proofs
The integral calculations presented in the main text are a useful framing for the smoothing and filtering steps, but they are somewhat difficult to carry out. A simpler proof of the equations takes advantage of the Gaussian conditioning formula we’ve introduced on multiple occasions:
\begin{align} p\left(\boldsymbol{x}_{a} \mid \boldsymbol{x}_{b}\right) & =\mathcal{N}\left(\boldsymbol{x}_{a} \mid \boldsymbol{\mu}_{a \mid b}, \boldsymbol{\Sigma}_{a \mid b}\right) \tag{A.5.1}\\ \boldsymbol{\mu}_{a \mid b} & =\boldsymbol{\mu}_{a}+\boldsymbol{\Sigma}_{a b} \boldsymbol{\Sigma}_{b b}^{-1}\left(\boldsymbol{x}_{b}-\boldsymbol{\mu}_{b}\right) \tag{A.5.2}\\ \boldsymbol{\Sigma}_{a \mid b} & =\boldsymbol{\Sigma}_{a a}-\boldsymbol{\Sigma}_{a b} \boldsymbol{\Sigma}_{b b}^{-1} \boldsymbol{\Sigma}_{b a} \tag{A.5.3} \end{align}
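For reference, a small helper that implements this conditioning formula from the partitioned joint moments might look like the following; the function name is an assumption.

```python
# A helper implementing the Gaussian conditioning formula in Eqs. A.5.1-A.5.3,
# given the partitioned joint mean and covariance blocks.
import numpy as np

def condition_gaussian(mu_a, mu_b, Sigma_aa, Sigma_ab, Sigma_bb, x_b):
    Sigma_bb_inv = np.linalg.inv(Sigma_bb)
    mu_cond = mu_a + Sigma_ab @ Sigma_bb_inv @ (x_b - mu_b)        # Eq. A.5.2
    Sigma_cond = Sigma_aa - Sigma_ab @ Sigma_bb_inv @ Sigma_ab.T   # Eq. A.5.3 (Sigma_ba = Sigma_ab^T)
    return mu_cond, Sigma_cond
```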
Let’s start with our Kalman filtering equations. Our objective is to calculate the distribution:
\begin{align} \alpha\left(\boldsymbol{z}_{t}\right) & =p\left(\boldsymbol{x}_{1: t}, \boldsymbol{z}_{t}\right) \tag{A.5.4}\\ & =p\left(\boldsymbol{z}_{t} \mid \boldsymbol{x}_{1: t}\right) p\left(\boldsymbol{x}_{1: t}\right) \tag{A.5.5}\\ & \propto p\left(\boldsymbol{z}_{t} \mid \boldsymbol{x}_{1: t}\right) \tag{A.5.6}\\ & \propto \mathcal{N}\left(\boldsymbol{z}_{t} \mid \boldsymbol{\mu}_{t \mid t}, \boldsymbol{\Sigma}_{t \mid t}\right) \tag{A.5.7} \end{align}
where in the second to last line we’ve dropped the normalization constant provided by the probability of the data. We assume we have the previous solution already:
\begin{align} \alpha\left(\boldsymbol{z}_{t-1}\right) & \propto p\left(\boldsymbol{z}_{t-1} \mid \boldsymbol{x}_{1: t-1}\right) \tag{A.5.8}\\ & \propto \mathcal{N}\left(\boldsymbol{z}_{t-1} \mid \boldsymbol{\mu}_{t-1 \mid t-1}, \boldsymbol{\Sigma}_{t-1 \mid t-1}\right) \tag{A.5.9} \end{align}
Let’s select the joint distribution to which we will apply Equation A.5.1:
p\left(\boldsymbol{z}_{t}, \boldsymbol{x}_{t} \mid \boldsymbol{x}_{1: t-1}\right) \tag{A.5.10}
In the formulation of Equation A.5.1, a=\boldsymbol{z}_{t} and b=\boldsymbol{x}_{t}, and we will just condition all of our terms on \boldsymbol{x}_{1: t-1}. Now let’s calculate the terms we need one-by-one starting with our equivalent of \mu_{a}:
\mu_{\boldsymbol{z}_{t} \mid \boldsymbol{x}_{1: t-1}}=\boldsymbol{A} \boldsymbol{\mu}_{t-1 \mid t-1}=\boldsymbol{\mu}_{t \mid t-1} \tag{A.5.11}
which we get from \boldsymbol{z}_{t}=\boldsymbol{A} \boldsymbol{z}_{t-1}+\boldsymbol{w}_{t} and having access to the previous solution in Equation A.5.8. Using the same facts, we can derive our equivalent of \Sigma_{a a}:
\Sigma_{\boldsymbol{z}_{t} \boldsymbol{z}_{t} \mid \boldsymbol{x}_{1: t-1}}=\boldsymbol{A} \boldsymbol{\Sigma}_{t-1 \mid t-1} \boldsymbol{A}^{T}+\boldsymbol{Q}=\boldsymbol{\Sigma}_{t \mid t-1} \tag{A.5.12}
The linear relationship between \boldsymbol{z}_{t} and \boldsymbol{x}_{t} gets us the mean and covariance for our equivalent of the b terms in Equation A.5.1:
\begin{align} \mu_{\boldsymbol{x}_{t} \mid \boldsymbol{x}_{1: t-1}} &= \boldsymbol{C} \boldsymbol{\mu}_{t \mid t-1} \tag{A.5.13}\\ \boldsymbol{\Sigma}_{\boldsymbol{x}_{t} \boldsymbol{x}_{t} \mid \boldsymbol{x}_{1: t-1}} &= \boldsymbol{C} \boldsymbol{\Sigma}_{t \mid t-1} \boldsymbol{C}^{T}+\boldsymbol{R} \tag{A.5.14} \end{align}
Now we just need the cross term for our covariance matrix:
\begin{align} \boldsymbol{\Sigma}_{\boldsymbol{z}_{t} \boldsymbol{x}_{t} \mid \boldsymbol{x}_{1: t-1}} & =\mathbb{E}\left[\left(\boldsymbol{z}_{t}-\boldsymbol{\mu}_{t \mid t-1}\right)\left(\boldsymbol{C} \boldsymbol{z}_{t}+\boldsymbol{v}_{t}-\boldsymbol{C} \boldsymbol{\mu}_{t \mid t-1}\right)^T\right] \tag{A.5.15}\\ & =\mathbb{E}\left[\left(\boldsymbol{z}_{t}-\boldsymbol{\mu}_{t \mid t-1}\right)\left(\boldsymbol{z}_{t}-\boldsymbol{\mu}_{t \mid t-1}\right)^{T}\right] \boldsymbol{C}^{T} \tag{A.5.16}\\ & =\boldsymbol{\Sigma}_{t \mid t-1} \boldsymbol{C}^{T} \tag{A.5.17} \end{align}
where we took advantage of the independence between \boldsymbol{v}_{t} and \boldsymbol{z}_{t} to remove any terms with \boldsymbol{v}_{t} from our calculation. Now we can finally apply Equation A.5.1 to get:
\begin{align} \boldsymbol{\mu}_{\boldsymbol{z}_{t} \mid \boldsymbol{x}_{1: t}}=\boldsymbol{\mu}_{t \mid t} & =\boldsymbol{\mu}_{t \mid t-1}+\boldsymbol{\Sigma}_{t \mid t-1} \boldsymbol{C}^{T} \\ &\quad \left(\boldsymbol{C} \boldsymbol{\Sigma}_{t \mid t-1} \boldsymbol{C}^{T}+\boldsymbol{R}\right)^{-1} \\ &\quad\quad \left(\boldsymbol{x}_{t}-\boldsymbol{C} \boldsymbol{\mu}_{t \mid t-1}\right) \tag{A.5.18}\\ & =\boldsymbol{\mu}_{t \mid t-1}+\boldsymbol{K}_{t}\left(\boldsymbol{x}_{t}-\boldsymbol{C} \boldsymbol{\mu}_{t \mid t-1}\right) \tag{A.5.19}\\ \boldsymbol{\Sigma}_{\boldsymbol{z}_{t} \boldsymbol{z}_{t} \mid \boldsymbol{x}_{1: t}} &=\boldsymbol{\Sigma}_{t \mid t} \\ & =\boldsymbol{\Sigma}_{t \mid t-1}-\boldsymbol{\Sigma}_{t \mid t-1} \boldsymbol{C}^{T}\left(\boldsymbol{C} \boldsymbol{\Sigma}_{t \mid t-1} \boldsymbol{C}^{T}+\boldsymbol{R}\right)^{-1} \boldsymbol{C} \boldsymbol{\Sigma}_{t \mid t-1} \tag{A.5.20}\\ & =\boldsymbol{\Sigma}_{t \mid t-1}-\boldsymbol{K}_{t} \boldsymbol{C} \boldsymbol{\Sigma}_{t \mid t-1} \tag{A.5.21} \end{align}
with \boldsymbol{K}_{t}=\boldsymbol{\Sigma}_{t \mid t-1} \boldsymbol{C}^{T}\left(\boldsymbol{C} \boldsymbol{\Sigma}_{t \mid t-1} \boldsymbol{C}^{T}+\boldsymbol{R}\right)^{-1}.
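As a numerical check of this derivation, we can build the blocks of the joint p\left(\boldsymbol{z}_{t}, \boldsymbol{x}_{t} \mid \boldsymbol{x}_{1: t-1}\right) from arbitrary test matrices, condition with Equations A.5.2-A.5.3, and confirm that the result matches the Kalman update of Equations 5.15-5.16; all numeric values below are illustrative assumptions.

```python
# Check that conditioning the joint of (z_t, x_t) given x_{1:t-1}
# reproduces the Kalman update (Eqs. 5.15-5.16), with assumed test matrices.
import numpy as np

rng = np.random.default_rng(2)
d, n = 2, 3
A = np.array([[0.9, 0.1], [0.0, 0.8]])
C = rng.standard_normal((n, d))
Q, R = 0.05 * np.eye(d), 0.2 * np.eye(n)
mu_prev, Sigma_prev = np.zeros(d), np.eye(d)     # previous filtered moments
x_t = rng.standard_normal(n)                     # a dummy observation

# Blocks of the joint p(z_t, x_t | x_{1:t-1}) from Eqs. A.5.11-A.5.17.
mu_pred = A @ mu_prev
Sigma_pred = A @ Sigma_prev @ A.T + Q
mu_b = C @ mu_pred
Sigma_bb = C @ Sigma_pred @ C.T + R
Sigma_ab = Sigma_pred @ C.T

# Conditioning (Eqs. A.5.2-A.5.3).
Sigma_bb_inv = np.linalg.inv(Sigma_bb)
mu_cond = mu_pred + Sigma_ab @ Sigma_bb_inv @ (x_t - mu_b)
Sigma_cond = Sigma_pred - Sigma_ab @ Sigma_bb_inv @ Sigma_ab.T

# Kalman update form (Eqs. 5.15-5.17).
K = Sigma_pred @ C.T @ Sigma_bb_inv
print(np.allclose(mu_cond, mu_pred + K @ (x_t - C @ mu_pred)))    # True
print(np.allclose(Sigma_cond, Sigma_pred - K @ C @ Sigma_pred))   # True
```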
The Kalman smoothing equations can also be derived with conditional distributions, although it requires a slightly trickier formulation of the problem. This proof strategy follows Shumway and Stoffer. What we want to calculate is:
\alpha\left(\boldsymbol{z}_{t}\right) \beta\left(\boldsymbol{z}_{t}\right) \propto p\left(\boldsymbol{z}_{t} \mid \boldsymbol{x}_{1: T}\right) \tag{A.5.22}
As before, we assume that we have already solved:
\alpha\left(\boldsymbol{z}_{t+1}\right) \beta\left(\boldsymbol{z}_{t+1}\right) \propto \mathcal{N}\left(\boldsymbol{z}_{t+1} \mid \boldsymbol{\mu}_{t+1 \mid T}, \boldsymbol{\Sigma}_{t+1 \mid T}\right) \tag{A.5.23}
We’ll start by introducing a proxy statistic:
\boldsymbol{m}_{t}=\mathbb{E}\left[\boldsymbol{z}_{t} \mid \boldsymbol{x}_{1: t}, \boldsymbol{z}_{t+1}-\boldsymbol{\mu}_{t+1 \mid t}\right] \tag{A.5.24}
There’s no obvious reason to choose this statistic yet, but as we’ll see, it gives us the conditional distributions we want. Note that \boldsymbol{x}_{1: t} and \boldsymbol{z}_{t+1}-\boldsymbol{\mu}_{t+1 \mid t} are independent of one another. To calculate the mean and covariance matrix we’ll once again reframe our calculations in the form of Equation A.5.1, with a=\boldsymbol{z}_{t} and b=\boldsymbol{z}_{t+1}-\boldsymbol{\mu}_{t+1 \mid t}. We already know two of our pieces:
\begin{align} \boldsymbol{\mu}_{\boldsymbol{z}_{t} \mid \boldsymbol{x}_{1: t}} & =\boldsymbol{\mu}_{t \mid t} \tag{A.5.25}\\ \boldsymbol{\Sigma}_{\boldsymbol{z}_{t} \boldsymbol{z}_{t} \mid \boldsymbol{x}_{1: t}} & =\boldsymbol{\Sigma}_{t \mid t} \tag{A.5.26} \end{align}
Now we need our mean and covariance for our second term:
\begin{align} \boldsymbol{\mu}_{\boldsymbol{z}_{t+1}-\boldsymbol{\mu}_{t+1 \mid t} \mid \boldsymbol{x}_{1: t}} & =\boldsymbol{\mu}_{t+1 \mid t}-\boldsymbol{\mu}_{t+1 \mid t} \tag{A.5.27}\\ & =0 \tag{A.5.28}\\ \boldsymbol{\Sigma}_{\left(\boldsymbol{z}_{t+1}-\boldsymbol{\mu}_{t+1 \mid t}\right)\left(\boldsymbol{z}_{t+1}-\boldsymbol{\mu}_{t+1 \mid t}\right) \mid \boldsymbol{x}_{1: t}} & =\mathbb{E}\left[\left(\boldsymbol{z}_{t+1}-\boldsymbol{\mu}_{t+1 \mid t}\right)\left(\boldsymbol{z}_{t+1}-\boldsymbol{\mu}_{t+1 \mid t}\right)^{T} \mid \boldsymbol{x}_{1: t}\right] \tag{A.5.29}\\ & =\boldsymbol{\Sigma}_{t+1 \mid t} \tag{A.5.30} \end{align}
Finally, we just need the covariance term:
\begin{align} \boldsymbol{\Sigma}_{\left(\boldsymbol{z}_{t}\right)\left(\boldsymbol{z}_{t+1}-\boldsymbol{\mu}_{t+1 \mid t}\right) \mid \boldsymbol{x}_{1: t}} & =\mathbb{E}\left[\left(\boldsymbol{z}_{t}-\boldsymbol{\mu}_{t \mid t}\right)\left(\boldsymbol{z}_{t+1}-\boldsymbol{\mu}_{t+1 \mid t}\right)^{T} \mid \boldsymbol{x}_{1: t}\right] \tag{A.5.31}\\ & =\mathbb{E}\left[\left(\boldsymbol{z}_{t}-\boldsymbol{\mu}_{t \mid t}\right) \right. \\ &\quad\quad \left. \left(\boldsymbol{A} \boldsymbol{z}_{t}+\boldsymbol{w}_{t+1}-\boldsymbol{\mu}_{t+1 \mid t}\right)^{T} \mid \boldsymbol{x}_{1: t}\right] \tag{A.5.32}\\ & =\boldsymbol{\Sigma}_{t \mid t} \boldsymbol{A}^{T} \tag{A.5.33} \end{align}
where we get the last line from the independence of the noise \boldsymbol{w}_{t+1} from the latent state at time t, together with the fact that \boldsymbol{\mu}_{t+1 \mid t}=\boldsymbol{A} \boldsymbol{\mu}_{t \mid t}. Now we can plug right into Equation A.5.1:
\begin{align} \boldsymbol{m}_{t} & =\boldsymbol{\mu}_{t \mid t}+\boldsymbol{\Sigma}_{t \mid t} \boldsymbol{A}^{T} \boldsymbol{\Sigma}_{t+1 \mid t}^{-1}\left(\boldsymbol{z}_{t+1}-\boldsymbol{\mu}_{t+1 \mid t}\right) \tag{A.5.34}\\ & =\boldsymbol{\mu}_{t \mid t}+\boldsymbol{F}_{t}\left(\boldsymbol{z}_{t+1}-\boldsymbol{\mu}_{t+1 \mid t}\right) \tag{A.5.35} \end{align}
with \boldsymbol{F}_{t}=\boldsymbol{\Sigma}_{t \mid t} \boldsymbol{A}^{T} \boldsymbol{\Sigma}_{t+1 \mid t}^{-1}. Now we can note that:
\begin{align} \mathbb{E}\left[\boldsymbol{m}_{t} \mid \boldsymbol{x}_{1: T}\right] & =\mathbb{E}\left[\mathbb{E}\left[\boldsymbol{z}_{t} \mid \boldsymbol{x}_{1: t}, \boldsymbol{z}_{t+1}-\boldsymbol{\mu}_{t+1 \mid t}\right] \mid \boldsymbol{x}_{1: T}\right] \tag{A.5.36}\\ & =\mathbb{E}\left[\boldsymbol{z}_{t} \mid \boldsymbol{x}_{1: T}\right] \tag{A.5.37}\\ & =\boldsymbol{\mu}_{t \mid T} \tag{A.5.38} \end{align}
So we can calculate:
\begin{align} \boldsymbol{\mu}_{t \mid T} & =\mathbb{E}\left[\boldsymbol{\mu}_{t \mid t}+\boldsymbol{F}_{t}\left(\boldsymbol{z}_{t+1}-\boldsymbol{\mu}_{t+1 \mid t}\right) \mid \boldsymbol{x}_{1: T}\right] \tag{A.5.39}\\ & =\boldsymbol{\mu}_{t \mid t}+\boldsymbol{F}_{t}\left(\mathbb{E}\left[\boldsymbol{z}_{t+1} \mid \boldsymbol{x}_{1: T}\right]-\boldsymbol{\mu}_{t+1 \mid t}\right) \tag{A.5.40}\\ & =\boldsymbol{\mu}_{t \mid t}+\boldsymbol{F}_{t}\left(\boldsymbol{\mu}_{t+1 \mid T}-\boldsymbol{\mu}_{t+1 \mid t}\right) \tag{A.5.41} \end{align}
That’s half the battle; now we just need to solve for the covariance. The full derivation involves some lengthy algebra, but the final result is:
\boldsymbol{\Sigma}_{t \mid T}=\boldsymbol{\Sigma}_{t \mid t}+\boldsymbol{F}_{t}\left(\boldsymbol{\Sigma}_{t+1 \mid T}-\boldsymbol{\Sigma}_{t+1 \mid t}\right) \boldsymbol{F}_{t}^{T} \tag{A.5.42}
which is our desired equation.