Extended Kalman Filtering and Smoothing
At this point, we can do most of what we would want with our LDS systems: we can infer the posteriors on individual latent states, we can predict future latent states and observations, and we have a good algorithm for parameter estimation. There are still more questions we could ask in the LDS model, but there are also some big simplifying assumptions we make in using it. The most concerning is linearity: there are many processes where it is not appropriate to model the latent state transitions or the observation process as a simple linear matrix multiplication. We made these choices because they kept our math tractable (i.e., everything remained Gaussian). We can start by relaxing our observation assumption to:
\boldsymbol{x}_{t}=f\left(\boldsymbol{z}_{t}\right)+\boldsymbol{v}_{t} \tag{8.1}
where f(\boldsymbol{z}) is a function that maps \mathbb{R}^{d} \rightarrow \mathbb{R}^{n}. The noise term behaves as before: p\left(\boldsymbol{v}_{t}\right) \sim \mathcal{N}(0, \boldsymbol{R}). Our function no longer has to be linear. Clearly, we can no longer just express p\left(\boldsymbol{x}_{t} \mid \boldsymbol{z}_{t}\right) as a Gaussian, so we need to get a little more creative.
In Extended Kalman Filtering and Smoothing, we try to reframe this problem in the familiar language we’ve been using so far. That is, we want to find some approximation to the \boldsymbol{C} matrix we had before so that we can continue using our normal equations. To accomplish this, we Taylor expand our vector function f:
f\left(\boldsymbol{z}_{t}\right)=f\left(\boldsymbol{z}_{\mathrm{ref}}\right)+\nabla f\left(\boldsymbol{z}_{\mathrm{ref}}\right)\left(\boldsymbol{z}_{t}-\boldsymbol{z}_{\mathrm{ref}}\right)+\cdots \tag{8.2}
where \nabla f is known as the Jacobian matrix of f and \boldsymbol{z}_{\text{ref}} is the value of \boldsymbol{z} around which we are Taylor expanding. It has the form:
\nabla f=\left[\begin{array}{cccc} \frac{\partial f_{1}}{\partial \boldsymbol{z}_{1}} & \frac{\partial f_{1}}{\partial \boldsymbol{z}_{2}} & \cdots & \frac{\partial f_{1}}{\partial \boldsymbol{z}_{d}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_{n}}{\partial \boldsymbol{z}_{1}} & \frac{\partial f_{n}}{\partial \boldsymbol{z}_{2}} & \cdots & \frac{\partial f_{n}}{\partial \boldsymbol{z}_{d}} \end{array}\right] \tag{8.3}
If we keep only the first two terms of our Taylor series, then we can write:
f\left(\boldsymbol{z}_{t}\right) \approx f\left(\boldsymbol{z}_{\mathrm{ref}}\right)+\nabla f\left(\boldsymbol{z}_{\mathrm{ref}}\right)\left(\boldsymbol{z}_{t}-\boldsymbol{z}_{\mathrm{ref}}\right) \tag{8.4}
Essentially, we are making a linear approximation to our function. In a moment we will see why this approximation is useful, but it is worth discussing whether it is valid. In order to sanely cut our Taylor series off at the second term (first order in \left(\boldsymbol{z}_{t}-\boldsymbol{z}_{\text{ref}}\right)), we need the product of our deviations and our higher-order derivatives to be small. At a high level, we want one of two things to be true:
- The deviations from our point of expansion, \left(\boldsymbol{z}_{t}-\boldsymbol{z}_{\text{ref}}\right), are small, such that contributions that depend on higher powers of \left(\boldsymbol{z}_{t}-\boldsymbol{z}_{\text{ref}}\right) can be ignored. For this to work, we need to only be considering \boldsymbol{z}_{t} that are very close to our reference value \boldsymbol{z}_{\text{ref}}.
- Our higher-order derivatives are small compared to the deviations from our point of expansion, \left(\boldsymbol{z}_{t}-\boldsymbol{z}_{\text{ref}}\right). For this to work, the function needs to be smooth (nearly linear over the range of deviations we care about).
We haven’t assumed anything about our function, so all we can do to maximize our chances of one of these conditions being met is to pick a good value for \boldsymbol{z}_{\text{ref}}. If we want our deviations to be small on average, we should probably pick \boldsymbol{z}_{\text{ref}}=\boldsymbol{\mu}_{t \mid t-1}. It will be impossible to judge whether this approximation is appropriate without knowing more about our function, so for the sake of this lecture we will simply note that you should check this assumption when applying it to your own analysis.
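To see what checking this assumption might look like in practice, here is a small numerical sketch (the function f, its dimensions, and the test points are all hypothetical): we compare an analytic Jacobian of the form in Equation 8.3 against finite differences, and check how well the first-order expansion in Equation 8.2 tracks f near a reference point.

```python
import numpy as np

def f(z):
    # Hypothetical nonlinear observation function mapping R^2 -> R^3.
    return np.array([np.sin(z[0]), z[0] * z[1], np.exp(-z[1])])

def jacobian_f(z):
    # Analytic Jacobian of f, laid out as in Equation 8.3:
    # row i holds the partials of f_i with respect to each component of z.
    return np.array([
        [np.cos(z[0]), 0.0],
        [z[1],         z[0]],
        [0.0,          -np.exp(-z[1])],
    ])

def numerical_jacobian(func, z, eps=1e-6):
    # Central finite differences, one column per latent dimension.
    cols = []
    for j in range(z.size):
        dz = np.zeros(z.size)
        dz[j] = eps
        cols.append((func(z + dz) - func(z - dz)) / (2 * eps))
    return np.stack(cols, axis=1)

z_ref = np.array([0.3, 1.2])
# The analytic and numerical Jacobians should agree closely.
assert np.allclose(jacobian_f(z_ref), numerical_jacobian(f, z_ref), atol=1e-6)

# For a small deviation, the first-order expansion should track f well;
# how fast this degrades as the deviation grows is the check to run on
# your own model.
dz = np.array([0.01, -0.02])
lin = f(z_ref) + jacobian_f(z_ref) @ dz
assert np.allclose(f(z_ref + dz), lin, atol=1e-3)
```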
If we accept Equation 8.4 and set \boldsymbol{z}_{\text{ref}}=\boldsymbol{\mu}_{t \mid t-1}, we now know the expected mean of \boldsymbol{x}_{t} given all our observations thus far \left\{\boldsymbol{x}_{1}, \ldots, \boldsymbol{x}_{t-1}\right\}:
\begin{align} \mathbb{E}\left[f\left(\boldsymbol{z}_{t}\right)+\boldsymbol{v}_{t}\right] & =\mathbb{E}\left[f\left(\boldsymbol{\mu}_{t \mid t-1}\right)+\nabla f\left(\boldsymbol{\mu}_{t \mid t-1}\right)\left(\boldsymbol{z}_{t}-\boldsymbol{\mu}_{t \mid t-1}\right)+\boldsymbol{v}_{t}\right] \tag{8.5}\\ & =f\left(\boldsymbol{\mu}_{t \mid t-1}\right)+\nabla f\left(\boldsymbol{\mu}_{t \mid t-1}\right) \mathbb{E}\left[\boldsymbol{z}_{t}-\boldsymbol{\mu}_{t \mid t-1}\right] \tag{8.6}\\ & =f\left(\boldsymbol{\mu}_{t \mid t-1}\right) \tag{8.7} \end{align}
The covariance is just as easy to calculate:
\begin{align} & \mathbb{E}\left[\left(f\left(\boldsymbol{z}_{t}\right)+\boldsymbol{v}_{t}-f\left(\boldsymbol{\mu}_{t \mid t-1}\right)\right)\left(f\left(\boldsymbol{z}_{t}\right)+\boldsymbol{v}_{t}-f\left(\boldsymbol{\mu}_{t \mid t-1}\right)\right)^{T}\right]= \\ & \nabla f\left(\boldsymbol{\mu}_{t \mid t-1}\right) \mathbb{E}\left[\left(\boldsymbol{z}_{t}-\boldsymbol{\mu}_{t \mid t-1}\right)\left(\boldsymbol{z}_{t}-\boldsymbol{\mu}_{t \mid t-1}\right)^{T}\right] \nabla f\left(\boldsymbol{\mu}_{t \mid t-1}\right)^{T}+\boldsymbol{R} \tag{8.8}\\ & =\nabla f\left(\boldsymbol{\mu}_{t \mid t-1}\right) \boldsymbol{\Sigma}_{t \mid t-1} \nabla f\left(\boldsymbol{\mu}_{t \mid t-1}\right)^{T}+\boldsymbol{R} \tag{8.9} \end{align}
which brings us to the realization that \nabla f\left(\boldsymbol{\mu}_{t \mid t-1}\right) behaves like the matrix \boldsymbol{C}! In fact, with very little work, you can show that the Kalman filtering equations are exactly the same as before, except that our mean prediction for the data is now f\left(\boldsymbol{\mu}_{t \mid t-1}\right) instead of \boldsymbol{C} \boldsymbol{\mu}_{t \mid t-1}, and our Kalman gain matrix \boldsymbol{K}_{t} replaces \boldsymbol{C} with \nabla f\left(\boldsymbol{\mu}_{t \mid t-1}\right). What about smoothing? Once we have calculated the filtering means and covariances, the smoothing solutions no longer require the data, so the smoothing step is exactly the same. You actually showed a version of this on your homework.
To summarize, after we make the linear approximation, our extended Kalman filtering approach boils down to:
- Conduct Kalman filtering, but replace the observation mean with f\left(\boldsymbol{\mu}_{t \mid t-1}\right) and the matrix \boldsymbol{C} in the Kalman gain with \nabla f\left(\boldsymbol{\mu}_{t \mid t-1}\right).
- Conduct Kalman smoothing as you would before.
- On \boldsymbol{\mu}_{0}, \boldsymbol{\Sigma}_{0}, \boldsymbol{A}, and \boldsymbol{Q}, you can also conduct EM as you would before. The matrix \boldsymbol{R} and the parameters of your function f will require a different approach, and the best approach will depend on f.
- Conduct prediction on future latent states as you would before.
- Generate samples of the observation \boldsymbol{x}_{T+k} by sampling from p\left(\boldsymbol{z}_{T+k} \mid \boldsymbol{x}_{1: T}\right) and then running those samples through the function f.
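The filtering step in the summary above can be sketched as a single EKF update; this is a minimal sketch assuming linear dynamics (\boldsymbol{A}, \boldsymbol{Q}) as in this section, with f and its Jacobian supplied by the user. The 1-D example model at the bottom is hypothetical.

```python
import numpy as np

def ekf_step(mu_prev, Sigma_prev, x_t, A, Q, R, f, jac_f):
    # Predict step: the dynamics are still linear in this section.
    mu_pred = A @ mu_prev
    Sigma_pred = A @ Sigma_prev @ A.T + Q
    # Linearize the observation function around the predicted mean.
    C_t = jac_f(mu_pred)
    # Kalman gain, with C replaced by the Jacobian at mu_pred.
    S = C_t @ Sigma_pred @ C_t.T + R
    K = Sigma_pred @ C_t.T @ np.linalg.inv(S)
    # Measurement update: the predicted observation mean is f(mu_pred).
    mu_filt = mu_pred + K @ (x_t - f(mu_pred))
    Sigma_filt = (np.eye(mu_pred.size) - K @ C_t) @ Sigma_pred
    return mu_filt, Sigma_filt

# Hypothetical 1-D example with f(z) = sin(z).
f = lambda z: np.sin(z)
jac_f = lambda z: np.array([[np.cos(z[0])]])
mu, Sig = ekf_step(np.array([0.5]), np.array([[0.2]]), np.array([0.6]),
                   np.array([[0.9]]), np.array([[0.1]]), np.array([[0.05]]),
                   f, jac_f)
```

Running the filter over a sequence is then just a loop over `ekf_step`, carrying the filtered mean and covariance forward.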
Particle Filtering
We showed how we can still use the frameworks we’ve built if we relax our assumptions about the observation process. But what if our assumptions about the latent space process are also too strict? Let’s relax our assumptions further and assume:
\begin{align} p\left(\boldsymbol{z}_{0}\right) & =\mathcal{N}\left(\boldsymbol{\mu}_{0}, \boldsymbol{\Sigma}_{0}\right) \tag{8.10}\\ \boldsymbol{z}_{t+1} & =g\left(\boldsymbol{z}_{t}, \boldsymbol{w}_{t}\right) \tag{8.11}\\ \boldsymbol{x}_{t} & =f\left(\boldsymbol{z}_{t}, \boldsymbol{v}_{t}\right) \tag{8.12} \end{align}
where g is a function that maps from \mathbb{R}^{d} \rightarrow \mathbb{R}^{d} using the current latent state and some noise vector \boldsymbol{w}_{t}, and f is a function that maps from \mathbb{R}^{d} \rightarrow \mathbb{R}^{n} using the current latent state and some noise vector \boldsymbol{v}_{t}. More concretely, f and g are functions that sample from the distributions p\left(\boldsymbol{x}_{t} \mid \boldsymbol{z}_{t}\right) and p\left(\boldsymbol{z}_{t+1} \mid \boldsymbol{z}_{t}\right). We do not assume that we have access to p\left(\boldsymbol{z}_{t+1} \mid \boldsymbol{z}_{t}\right) itself. For particle filtering we will assume that we can evaluate p\left(\boldsymbol{x}_{t} \mid \boldsymbol{z}_{t}\right), but we make no assumptions about its form.
Clearly, the linear approximation we were using before will not work. We will need a new tool to work with this model. Thankfully, being able to sample from a distribution is actually very powerful. Imagine we want to calculate the expectation value of some function h(\boldsymbol{x}) over the distribution p(\boldsymbol{x}). We can do this by:
\begin{align} \mathbb{E}[h(\boldsymbol{x})] & =\int h(\boldsymbol{x}) p(\boldsymbol{x}) d \boldsymbol{x} \tag{8.13}\\ & \approx \frac{1}{N} \sum_{i=1}^{N} h\left(\boldsymbol{x}_{i}\right), \quad \boldsymbol{x}_{i} \sim p(\boldsymbol{x}) \tag{8.14} \end{align}
with the approximation becoming exact as N \rightarrow \infty. This will be central to our ability to conduct inference in this more general setting. Let’s now make it specific to our model and imagine we want to take an expectation value over our posterior p\left(\boldsymbol{z}_{t} \mid \boldsymbol{x}_{1: t}\right):
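As a quick sanity check of Equation 8.14, here is a minimal Monte Carlo estimate. The choice h(x) = x^2 under a standard normal is just an illustrative example whose exact answer, \mathbb{E}[x^2] = 1, is known.

```python
import numpy as np

rng = np.random.default_rng(0)

# Estimate E[h(x)] for h(x) = x**2 under p(x) = N(0, 1);
# the exact answer is the variance, 1.
samples = rng.standard_normal(100_000)
estimate = np.mean(samples ** 2)

# The Monte Carlo estimate should be close to 1 for large N.
assert abs(estimate - 1.0) < 0.05
```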
\begin{align} \mathbb{E}\left[h\left(\boldsymbol{z}_{t}\right)\right] & =\int h\left(\boldsymbol{z}_{t}\right) p\left(\boldsymbol{z}_{t} \mid \boldsymbol{x}_{1: t}\right) d \boldsymbol{z}_{t} \tag{8.15}\\ & =\int h\left(\boldsymbol{z}_{t}\right) p\left(\boldsymbol{z}_{t} \mid \boldsymbol{x}_{t}, \boldsymbol{x}_{1: t-1}\right) d \boldsymbol{z}_{t} \tag{8.16}\\ & =\int h\left(\boldsymbol{z}_{t}\right) \frac{p\left(\boldsymbol{x}_{t} \mid \boldsymbol{z}_{t}, \boldsymbol{x}_{1: t-1}\right) p\left(\boldsymbol{z}_{t} \mid \boldsymbol{x}_{1: t-1}\right)}{p\left(\boldsymbol{x}_{t} \mid \boldsymbol{x}_{1: t-1}\right)} d \boldsymbol{z}_{t} \tag{8.17}\\ & =\frac{\int h\left(\boldsymbol{z}_{t}\right) p\left(\boldsymbol{x}_{t} \mid \boldsymbol{z}_{t}\right) p\left(\boldsymbol{z}_{t} \mid \boldsymbol{x}_{1: t-1}\right) d \boldsymbol{z}_{t}}{p\left(\boldsymbol{x}_{t} \mid \boldsymbol{x}_{1: t-1}\right)} \tag{8.18}\\ & =\frac{\int h\left(\boldsymbol{z}_{t}\right) p\left(\boldsymbol{x}_{t} \mid \boldsymbol{z}_{t}\right) p\left(\boldsymbol{z}_{t} \mid \boldsymbol{x}_{1: t-1}\right) d \boldsymbol{z}_{t}}{\int p\left(\boldsymbol{x}_{t} \mid \boldsymbol{z}_{t}\right) p\left(\boldsymbol{z}_{t} \mid \boldsymbol{x}_{1: t-1}\right) d \boldsymbol{z}_{t}} \tag{8.19} \end{align}
If we now define:
\boldsymbol{w}_{t}\left(\boldsymbol{z}_{t}\right)=\frac{p\left(\boldsymbol{x}_{t} \mid \boldsymbol{z}_{t}\right)}{\int p\left(\boldsymbol{x}_{t} \mid \boldsymbol{z}_{t}\right) p\left(\boldsymbol{z}_{t} \mid \boldsymbol{x}_{1: t-1}\right) d \boldsymbol{z}_{t}} \tag{8.20}
we can write:
\mathbb{E}\left[h\left(\boldsymbol{z}_{t}\right)\right]=\int h\left(\boldsymbol{z}_{t}\right) \boldsymbol{w}_{t}\left(\boldsymbol{z}_{t}\right) p\left(\boldsymbol{z}_{t} \mid \boldsymbol{x}_{1: t-1}\right) d \boldsymbol{z}_{t} \tag{8.21}
The intuition behind our sampling weights \boldsymbol{w}_{t} is that they define how important a value of \boldsymbol{z}_{t} is given the data \boldsymbol{x}_{t}. So far everything has been written with integrals, but let’s instead assume I have some set of samples \boldsymbol{z}_{t, 1}, \ldots, \boldsymbol{z}_{t, N} drawn from p\left(\boldsymbol{z}_{t} \mid \boldsymbol{x}_{1: t-1}\right). Then I can write:
\boldsymbol{w}_{t, i}=\frac{p\left(\boldsymbol{x}_{t} \mid \boldsymbol{z}_{t, i}\right)}{\sum_{j=1}^{N} p\left(\boldsymbol{x}_{t} \mid \boldsymbol{z}_{t, j}\right)} \tag{8.22}
and:
\mathbb{E}\left[h\left(\boldsymbol{z}_{t}\right)\right]=\sum_{i=1}^{N} h\left(\boldsymbol{z}_{t, i}\right) \boldsymbol{w}_{t, i} \tag{8.23}
From this I can see that if I resample \boldsymbol{z}_{t, i} with probability proportional to the weights \boldsymbol{w}_{t, i}, then I have samples from my distribution p\left(\boldsymbol{z}_{t} \mid \boldsymbol{x}_{1: t}\right).
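The reweighting and resampling steps (Equations 8.22 and 8.23) can be sketched numerically. In this toy setup, both the predictive distribution and the likelihood are chosen to be Gaussian purely so that the exact posterior mean is known; nothing in the method requires this.

```python
import numpy as np

rng = np.random.default_rng(1)

# Samples standing in for p(z_t | x_{1:t-1}), here N(0, 1).
z = rng.standard_normal(200_000)

# Evaluable likelihood p(x_t | z_t) = N(x_t; z_t, 0.5^2), observed x_t = 1.
x_t, sigma = 1.0, 0.5
lik = np.exp(-0.5 * ((x_t - z) / sigma) ** 2)

# Normalized importance weights (Equation 8.22).
w = lik / lik.sum()

# Posterior mean via Equation 8.23 with h(z) = z; the exact
# conjugate-Gaussian answer is x_t / (1 + sigma^2) = 0.8.
post_mean = np.sum(w * z)
assert abs(post_mean - 0.8) < 0.02

# Resampling proportional to the weights yields (approximate) draws
# from the posterior p(z_t | x_{1:t}).
z_post = rng.choice(z, size=100_000, p=w)
assert abs(z_post.mean() - 0.8) < 0.02
```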
All this math gives us the process of particle filtering. Much like our Kalman filtering derivation, particle filtering is a recursive, forward-pass process. We start from the assumption that we have access to samples \boldsymbol{z}_{t-1, i} from the distribution p\left(\boldsymbol{z}_{t-1} \mid \boldsymbol{x}_{1: t-2}\right) and the weights \boldsymbol{w}_{t-1, i}. The steps of particle filtering are then described in Algorithm 6.1.
Drawing samples of future latent states follows the same steps, but only the latent state \boldsymbol{z}_{T+1} requires a resampling of \boldsymbol{z}_{T}. After that, the weights are all equal, since there is no data to reweight against.
Algorithm 6.1 Particle Filtering
Input: N initial latent samples \boldsymbol{z}_{0, i} from the prior p\left(\boldsymbol{z}_{0}\right) and uniform weights w_{0, i}.
- for t=1 to T+k do
- \quad for i=1 to N do
- \quad\quad \boldsymbol{z}_{t-1, i}^{\prime}=\boldsymbol{z}_{t-1, j} with probability w_{t-1, j} \quad \triangleright \boldsymbol{z}_{t-1, i}^{\prime} follow p\left(\boldsymbol{z}_{t-1} \mid \boldsymbol{x}_{1: t-1}\right)
- \quad\quad \boldsymbol{z}_{t, i}=g\left(\boldsymbol{z}_{t-1, i}^{\prime}\right) \quad \triangleright \boldsymbol{z}_{t, i} follow p\left(\boldsymbol{z}_{t} \mid \boldsymbol{x}_{1: t-1}\right)
- \quad end for
- \quad for i=1 to N do
- \quad\quad if t \leq T then
- \quad\quad\quad w_{t, i}=\frac{p\left(\boldsymbol{x}_{t} \mid \boldsymbol{z}_{t, i}\right)}{\sum_{j=1}^{N} p\left(\boldsymbol{x}_{t} \mid \boldsymbol{z}_{t, j}\right)}
- \quad\quad else
- \quad\quad\quad w_{t, i}=\frac{1}{N}
- \quad\quad end if
- \quad end for
- end for
Output: Samples \boldsymbol{z}_{t, i} and weights w_{t, i}.
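Algorithm 6.1 can be sketched in a few lines of Python. The transition sampler g and the evaluable likelihood below are hypothetical stand-ins; any sampler for p\left(\boldsymbol{z}_{t} \mid \boldsymbol{z}_{t-1}\right) and any evaluable p\left(\boldsymbol{x}_{t} \mid \boldsymbol{z}_{t}\right) would do.

```python
import numpy as np

rng = np.random.default_rng(2)

def g(z):
    # Hypothetical sampler for p(z_t | z_{t-1}): nonlinear drift
    # plus Gaussian noise drawn inside the function.
    return 0.9 * z + 0.5 * np.sin(z) + 0.3 * rng.standard_normal(z.shape)

def likelihood(x_t, z):
    # Hypothetical evaluable p(x_t | z_t): x_t = z_t + N(0, 0.4^2) noise.
    return np.exp(-0.5 * ((x_t - z) / 0.4) ** 2)

def particle_filter(xs, n_particles=5000, k_future=0):
    # Initialize from the prior p(z_0) = N(0, 1) with uniform weights.
    z = rng.standard_normal(n_particles)
    w = np.full(n_particles, 1.0 / n_particles)
    means = []
    T = len(xs)
    for t in range(1, T + k_future + 1):
        # Resample proportional to the weights, then propagate through g.
        z = g(rng.choice(z, size=n_particles, p=w))
        if t <= T:
            # Reweight by the likelihood of the new observation.
            lik = likelihood(xs[t - 1], z)
            w = lik / lik.sum()
        else:
            # Beyond the data, all particles are equally weighted.
            w = np.full(n_particles, 1.0 / n_particles)
        means.append(np.sum(w * z))
    return np.array(means)

# Repeated observations near 2 should pull the filtered mean toward 2;
# the last k_future entries are forecasts of future latent states.
filtered_means = particle_filter([2.0] * 10, n_particles=5000, k_future=3)
```

Note that this sketch resamples at every step; practical implementations often resample only when the effective sample size drops, to reduce the variance added by resampling.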