Prediction and parameter estimation in state space models

Lecture 6

Course: DS-GA 1018: Probabilistic Time Series Analysis
Date: October 8, 2025 (Wednesday)
Name:
NYU NetID:

Up until now, we have been operating under the assumption that we know the parameters of our system. In reality, we will often want to learn those parameters, which we can do via Bayesian inference:

p(\boldsymbol{\theta} \mid \boldsymbol{X}) \propto p(\boldsymbol{X} \mid \boldsymbol{\theta}) p(\boldsymbol{\theta}) \tag{6.1}

where \boldsymbol{\theta} is a vector of all the parameters in our system, and \boldsymbol{X} is a vector of our full time series of data. For models like the LDS, we do not have the tools to evaluate p(\boldsymbol{X} \mid \boldsymbol{\theta}) directly. Instead, we can calculate p(\boldsymbol{X}, \boldsymbol{Z} \mid \boldsymbol{\theta}) for some vector of latent states \boldsymbol{Z}. We can write:

p(\boldsymbol{X} \mid \boldsymbol{\theta})=\int p(\boldsymbol{X}, \boldsymbol{Z} \mid \boldsymbol{\theta}) d \boldsymbol{Z} \tag{6.2}

While the individual p(\boldsymbol{X}, \boldsymbol{Z} \mid \boldsymbol{\theta}) terms may be possible to evaluate, the full integral is not. Another way of saying this is that the likelihood is intractable (we do not have the tools to calculate it). There are many strategies that still allow us to conduct inference in this setting. One of them is known as expectation-maximization.
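To make this concrete for the LDS, and using the same factorization that will appear in Equation 6.16 below, the latent vector is the full trajectory \boldsymbol{Z} = \boldsymbol{z}_{0:T}, so the marginal likelihood requires integrating over every latent state at once:

p(\boldsymbol{X} \mid \boldsymbol{\theta})=\int p\left(\boldsymbol{z}_{0} \mid \boldsymbol{\mu}_{0}, \boldsymbol{\Sigma}_{0}\right) \prod_{t=1}^{T} p\left(\boldsymbol{z}_{t} \mid \boldsymbol{z}_{t-1}, \boldsymbol{A}, \boldsymbol{Q}\right) p\left(\boldsymbol{x}_{t} \mid \boldsymbol{z}_{t}, \boldsymbol{C}, \boldsymbol{R}\right) d \boldsymbol{z}_{0: T}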

The crux of expectation-maximization is to introduce a new distribution over the latent states, q(\boldsymbol{Z}). Because q(\boldsymbol{Z}) is a valid probability distribution, it integrates to 1, which lets us write:

\begin{align} \log p(\boldsymbol{X} \mid \boldsymbol{\theta}) & =\log p(\boldsymbol{X} \mid \boldsymbol{\theta}) \int q(\boldsymbol{Z}) d \boldsymbol{Z} \tag{6.3}\\ & =\int \log p(\boldsymbol{X} \mid \boldsymbol{\theta}) q(\boldsymbol{Z}) d \boldsymbol{Z} \tag{6.4}\\ & =\int q(\boldsymbol{Z})(\log p(\boldsymbol{X}, \boldsymbol{Z} \mid \boldsymbol{\theta})-\log p(\boldsymbol{Z} \mid \boldsymbol{X}, \boldsymbol{\theta})) d \boldsymbol{Z} \tag{6.5}\\ & =\int q(\boldsymbol{Z})(\log p(\boldsymbol{X}, \boldsymbol{Z} \mid \boldsymbol{\theta})-\log q(\boldsymbol{Z}) \notag \\ & \qquad +\log q(\boldsymbol{Z})-\log p(\boldsymbol{Z} \mid \boldsymbol{X}, \boldsymbol{\theta})) d \boldsymbol{Z} \tag{6.6}\\ & =\int q(\boldsymbol{Z})\left(\log \frac{p(\boldsymbol{X}, \boldsymbol{Z} \mid \boldsymbol{\theta})}{q(\boldsymbol{Z})}-\log \frac{p(\boldsymbol{Z} \mid \boldsymbol{X}, \boldsymbol{\theta})}{q(\boldsymbol{Z})}\right) d \boldsymbol{Z} \tag{6.7}\\ & =\mathcal{L}(q, \boldsymbol{\theta})+\mathrm{KL}(q(\boldsymbol{Z}) \| p(\boldsymbol{Z} \mid \boldsymbol{X}, \boldsymbol{\theta})) \tag{6.8} \end{align}

where KL is the Kullback-Leibler (KL) divergence. The KL divergence is always greater than or equal to 0, with equality only when q(\boldsymbol{Z})=p(\boldsymbol{Z} \mid \boldsymbol{X}, \boldsymbol{\theta}), so we can say that:

\log p(\boldsymbol{X} \mid \boldsymbol{\theta}) \geq \mathcal{L}(q, \boldsymbol{\theta}) \tag{6.9}
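As a quick numerical sanity check on Equations 6.8 and 6.9, the following sketch verifies the decomposition for a toy model with a single discrete latent variable, so the integrals become sums; the joint probabilities and the choice of q below are arbitrary values made up purely for illustration.

import numpy as np

# Toy joint p(x, z | theta) over a discrete latent z in {0, 1, 2},
# for one fixed observation x. The numbers are arbitrary.
p_joint = np.array([0.10, 0.25, 0.15])   # p(x, z | theta) for each value of z
p_x = p_joint.sum()                      # p(x | theta), marginalizing over z
p_post = p_joint / p_x                   # p(z | x, theta)

q = np.array([0.2, 0.5, 0.3])            # any valid distribution over z

elbo = np.sum(q * np.log(p_joint / q))   # L(q, theta)
kl = np.sum(q * np.log(q / p_post))      # KL(q || p(z | x, theta))

print(np.log(p_x), elbo + kl)            # identical, as in Equation 6.8
print(np.log(p_x) >= elbo)               # the bound of Equation 6.9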

This is the core of the EM algorithm: we make a choice of q(\boldsymbol{Z}) that makes the integral tractable, and then we maximize \mathcal{L}(q, \boldsymbol{\theta}) to get a maximum-likelihood estimate of the model parameters \boldsymbol{\theta}. Given what we know about the KL divergence, one natural choice is to set q_{1}(\boldsymbol{Z})=p\left(\boldsymbol{Z} \mid \boldsymbol{X}, \boldsymbol{\theta}_{0}\right), where \boldsymbol{\theta}_{0} is the initial value of our parameters. This means that for \boldsymbol{\theta}_{0} we have:

\log p\left(\boldsymbol{X} \mid \boldsymbol{\theta}_{0}\right)=\mathcal{L}\left(q_{1}, \boldsymbol{\theta}_{0}\right) \tag{6.10}

But this equality does not hold for other values of \boldsymbol{\theta}, so although this is a good choice of q_{1}, \mathcal{L}\left(q_{1}, \boldsymbol{\theta}\right) still only provides a lower bound on the likelihood for arbitrary \boldsymbol{\theta}. We can maximize this bound over \boldsymbol{\theta} to get:

\boldsymbol{\theta}_{1}=\underset{\boldsymbol{\theta}}{\operatorname{argmax}} \mathcal{L}\left(q_{1}, \boldsymbol{\theta}\right) \tag{6.11}

Of course, we now have a better estimate of \boldsymbol{\theta} that can give us an even tighter bound, q_{2}(\boldsymbol{Z})=p\left(\boldsymbol{Z} \mid \boldsymbol{X}, \boldsymbol{\theta}_{1}\right). (We may not always be able to evaluate p(\boldsymbol{Z} \mid \boldsymbol{X}, \boldsymbol{\theta}); in those cases, we introduce some variational distribution q(\boldsymbol{Z}) that approximates p(\boldsymbol{Z} \mid \boldsymbol{X}, \boldsymbol{\theta}) and minimize \operatorname{KL}(q(\boldsymbol{Z}) \| p(\boldsymbol{Z} \mid \boldsymbol{X}, \boldsymbol{\theta})).) From this two-step process is born the expectation-maximization algorithm: in the E step we set q_{n}(\boldsymbol{Z})=p\left(\boldsymbol{Z} \mid \boldsymbol{X}, \boldsymbol{\theta}_{n-1}\right) (or compute the expectations under it that we will need), and in the M step we set \boldsymbol{\theta}_{n}=\operatorname{argmax}_{\boldsymbol{\theta}} \mathcal{L}\left(q_{n}, \boldsymbol{\theta}\right). We then alternate these two steps until convergence.

As we will explore in the homework, EM is not guaranteed to converge to the global optimum. It is only guaranteed to produce parameters that iteratively increase the likelihood. The estimate you derive from EM will therefore depend on your initial conditions, and picking good initial conditions can often be as much work as running the EM algorithm itself.
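To summarize the general recipe, here is a minimal sketch of the EM loop in Python. The e_step and m_step functions are hypothetical placeholders for whatever model is being fit, and the convergence test on the bound is just one common stopping rule.

import numpy as np

def fit_em(X, theta0, e_step, m_step, max_iter=100, tol=1e-6):
    # Generic EM loop (a sketch, not tied to any particular model).
    # e_step(X, theta) -> (q, bound): the posterior over latents (or its
    #   expected sufficient statistics) and the current value of L(q, theta).
    # m_step(X, q)     -> theta: argmax_theta L(q, theta).
    theta, old_bound = theta0, -np.inf
    for _ in range(max_iter):
        q, bound = e_step(X, theta)   # E step: q_n(Z) = p(Z | X, theta_{n-1})
        theta = m_step(X, q)          # M step: theta_n = argmax_theta L(q_n, theta)
        if bound - old_bound < tol:   # the bound never decreases across iterations
            break
        old_bound = bound
    return theta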


Expectation-Maximization for LDS

Now that we understand the EM algorithm, we can apply it to our LDS system. We already know the form of the marginal posterior distribution of each latent state from the Kalman smoother:

q_{n}\left(\boldsymbol{z}_{t}\right)=p\left(\boldsymbol{z}_{t} \mid \boldsymbol{x}_{1: T}, \boldsymbol{\theta}_{n-1}\right)=\mathcal{N}\left(\boldsymbol{\mu}_{t \mid T}, \boldsymbol{\Sigma}_{t \mid T}\right) \tag{6.12}

This also means we know a few expected values that will show up in our M step:

\begin{align} \mathbb{E}_{q_{n}}\left[\boldsymbol{z}_{t} \mid \boldsymbol{x}_{1: T}\right] & =\boldsymbol{\mu}_{t \mid T} \tag{6.13}\\ \mathbb{E}_{q_{n}}\left[\boldsymbol{z}_{t} \boldsymbol{z}_{t}^{T} \mid \boldsymbol{x}_{1: T}\right] & =\boldsymbol{\Sigma}_{t \mid T}+\boldsymbol{\mu}_{t \mid T} \boldsymbol{\mu}_{t \mid T}^{T} \tag{6.14}\\ \mathbb{E}_{q_{n}}\left[\boldsymbol{z}_{t} \boldsymbol{z}_{t-1}^{T} \mid \boldsymbol{x}_{1: T}\right] & =\boldsymbol{F}_{t-1} \boldsymbol{\Sigma}_{t \mid T}+\boldsymbol{\mu}_{t \mid T} \boldsymbol{\mu}_{t-1 \mid T}^{T} \tag{6.15} \end{align}
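As an illustration, the following sketch assembles these expected sufficient statistics from smoother output. The array names mu_s, Sigma_s, and Sigma_cross are my own assumptions: mu_s[t] and Sigma_s[t] hold \boldsymbol{\mu}_{t \mid T} and \boldsymbol{\Sigma}_{t \mid T} for t = 0, ..., T, and Sigma_cross[t] holds the smoothed cross-covariance term appearing in Equation 6.15.

import numpy as np

def expected_statistics(mu_s, Sigma_s, Sigma_cross):
    # mu_s: (T+1, d), Sigma_s: (T+1, d, d), Sigma_cross: (T+1, d, d)
    # (Sigma_cross[0] is unused; cross terms only exist for t = 1..T).
    T = mu_s.shape[0] - 1
    Ez = mu_s                                              # Eq. 6.13
    Ezz = Sigma_s + np.einsum('ti,tj->tij', mu_s, mu_s)    # Eq. 6.14
    Ezz_prev = np.zeros_like(Sigma_s)
    for t in range(1, T + 1):                              # Eq. 6.15
        Ezz_prev[t] = Sigma_cross[t] + np.outer(mu_s[t], mu_s[t - 1])
    return Ez, Ezz, Ezz_prev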

where the means and covariances are those computed with \boldsymbol{\theta}_{n-1}. Because the entropy term -\int q_{n}(\boldsymbol{Z}) \log q_{n}(\boldsymbol{Z}) d \boldsymbol{Z} in \mathcal{L}\left(q_{n}, \boldsymbol{\theta}\right) does not depend on \boldsymbol{\theta}, maximizing the bound over \boldsymbol{\theta} is equivalent to maximizing the expected complete-data log-likelihood:

\begin{align} \mathbb{E}_{q_{n}}[\log p(\boldsymbol{X}, \boldsymbol{Z} \mid \boldsymbol{\theta})] & =\mathbb{E}_{q_{n}}\left[\log p\left(\boldsymbol{z}_{\mathbf{0}} \mid \boldsymbol{\mu}_{0}, \boldsymbol{\Sigma}_{0}\right)\right. \notag \\ & \qquad \left. +\sum_{t=1}^{T} \log p\left(\boldsymbol{z}_{t} \mid \boldsymbol{z}_{t-1}, \boldsymbol{A}, \boldsymbol{Q}\right)\right. \notag \\ & \qquad \left. +\sum_{t=1}^{T} \log p\left(\boldsymbol{x}_{t} \mid \boldsymbol{z}_{t}, \boldsymbol{C}, \boldsymbol{R}\right)\right]+\mathrm{const.} \tag{6.16} \end{align}

These are all Gaussian distributions, so we can take advantage of the maximum-likelihood solutions for datapoints drawn from a Gaussian distribution (see Section 2.3.4 of Bishop for a nice explanation). The update rules below are found by taking the derivative of the expected log-likelihood (Equation 6.16) with respect to each parameter and setting it to zero. First, let's start with the solution for the initial state mean and covariance:

\begin{align} \boldsymbol{\mu}_{0, n} & =\mathbb{E}\left[\boldsymbol{z}_{0}\right]=\boldsymbol{\mu}_{0 \mid T} \tag{6.17}\\ \boldsymbol{\Sigma}_{0, n} & =\mathbb{E}\left[\boldsymbol{z}_{0} \boldsymbol{z}_{0}^{T}\right]-\boldsymbol{\mu}_{0, n} \boldsymbol{\mu}_{0, n}^{T}=\boldsymbol{\Sigma}_{0 \mid T} \tag{6.18} \end{align}
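In code, these two updates are immediate; this minimal sketch assumes the Ez and Ezz arrays from the earlier sketch, indexed t = 0, ..., T.

import numpy as np

def initial_state_update(Ez, Ezz):
    # Eqs. 6.17-6.18: the new initial mean is the smoothed mean at t = 0,
    # and the new initial covariance is the smoothed covariance at t = 0.
    mu0 = Ez[0]
    Sigma0 = Ezz[0] - np.outer(mu0, mu0)
    return mu0, Sigma0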

Note that I have dropped the conditioning on \boldsymbol{x}_{1: T} for conciseness, but these expectation values are the ones from Equations 6.13-6.15. The transition matrix and covariance are given by:

\begin{align} \boldsymbol{A}_{n} & =\left(\sum_{t=1}^{T} \mathbb{E}\left[\boldsymbol{z}_{t} \boldsymbol{z}_{t-1}^{T}\right]\right)\left(\sum_{t=1}^{T} \mathbb{E}\left[\boldsymbol{z}_{t-1} \boldsymbol{z}_{t-1}^{T}\right]\right)^{-1} \tag{6.19}\\ \boldsymbol{Q}_{n} & =\frac{1}{T} \sum_{t=1}^{T}\left(\mathbb{E}\left[\boldsymbol{z}_{t} \boldsymbol{z}_{t}^{T}\right]-\boldsymbol{A}_{n} \mathbb{E}\left[\boldsymbol{z}_{t-1} \boldsymbol{z}_{t}^{T}\right]\right. \notag \\ & \qquad \left. -\mathbb{E}\left[\boldsymbol{z}_{t} \boldsymbol{z}_{t-1}^{T}\right] \boldsymbol{A}_{n}^{T}+\boldsymbol{A}_{n} \mathbb{E}\left[\boldsymbol{z}_{t-1} \boldsymbol{z}_{t-1}^{T}\right] \boldsymbol{A}_{n}^{T}\right) \tag{6.20} \end{align}
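A sketch of the corresponding update in code, again using the expected statistics indexed t = 0, ..., T from the earlier sketch (the function and variable names are mine, not part of the lecture):

import numpy as np

def transition_update(Ezz, Ezz_prev):
    # Eqs. 6.19-6.20. Ezz[t] = E[z_t z_t^T], Ezz_prev[t] = E[z_t z_{t-1}^T].
    T = Ezz.shape[0] - 1
    S_cross = Ezz_prev[1:].sum(axis=0)          # sum_t E[z_t z_{t-1}^T]
    S_prev = Ezz[:-1].sum(axis=0)               # sum_t E[z_{t-1} z_{t-1}^T]
    S_curr = Ezz[1:].sum(axis=0)                # sum_t E[z_t z_t^T]
    A_new = S_cross @ np.linalg.inv(S_prev)     # Eq. 6.19
    Q_new = (S_curr - A_new @ S_cross.T - S_cross @ A_new.T
             + A_new @ S_prev @ A_new.T) / T    # Eq. 6.20
    return A_new, Q_new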

Do not confuse the transpose superscript T with the upper limit T of the sums in the equations above. Finally, we have for our observation matrix and covariance:

\begin{align} \boldsymbol{C}_{n} & =\left(\sum_{t=1}^{T} \boldsymbol{x}_{t} \mathbb{E}\left[\boldsymbol{z}_{t}^{T}\right]\right)\left(\sum_{t=1}^{T} \mathbb{E}\left[\boldsymbol{z}_{t} \boldsymbol{z}_{t}^{T}\right]\right)^{-1} \tag{6.21}\\ \boldsymbol{R}_{n} & =\frac{1}{T} \sum_{t=1}^{T}\left(\boldsymbol{x}_{t} \boldsymbol{x}_{t}^{T}-\boldsymbol{C}_{n} \mathbb{E}\left[\boldsymbol{z}_{t}\right] \boldsymbol{x}_{t}^{T}\right. \notag \\ & \qquad \left. -\boldsymbol{x}_{t} \mathbb{E}\left[\boldsymbol{z}_{t}^{T}\right] \boldsymbol{C}_{n}^{T}+\boldsymbol{C}_{n} \mathbb{E}\left[\boldsymbol{z}_{t} \boldsymbol{z}_{t}^{T}\right] \boldsymbol{C}_{n}^{T}\right) \tag{6.22} \end{align}
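And a matching sketch for the observation updates, assuming the observations are stacked in an array X of shape (T, obs_dim) holding \boldsymbol{x}_{1} through \boldsymbol{x}_{T}, with the expected statistics indexed t = 0, ..., T as before:

import numpy as np

def observation_update(X, Ez, Ezz):
    # Eqs. 6.21-6.22. X[t - 1] holds x_t; Ez[t] = E[z_t], Ezz[t] = E[z_t z_t^T].
    T = X.shape[0]
    S_xz = X.T @ Ez[1:]                          # sum_t x_t E[z_t]^T
    S_zz = Ezz[1:].sum(axis=0)                   # sum_t E[z_t z_t^T]
    S_xx = X.T @ X                               # sum_t x_t x_t^T
    C_new = S_xz @ np.linalg.inv(S_zz)           # Eq. 6.21
    R_new = (S_xx - C_new @ S_xz.T - S_xz @ C_new.T
             + C_new @ S_zz @ C_new.T) / T       # Eq. 6.22
    return C_new, R_new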