Expectation-Maximization Review
In Lecture 6, we discussed our desire to maximize the posterior on the parameters given the data:
\begin{equation*} p(\boldsymbol{\theta} \mid \boldsymbol{X}) \propto p(\boldsymbol{X} \mid \boldsymbol{\theta}) p(\boldsymbol{\theta}), \tag{7.1} \end{equation*}
where \boldsymbol{\theta} is a vector of all the parameters in our system, and \boldsymbol{X} is a vector of our full time series of data. We realized that we could reduce this to the problem of maximizing the log likelihood of the data given the parameters, but that doing so required marginalizing over latent states \boldsymbol{Z} :
\begin{equation*} p(\boldsymbol{X} \mid \boldsymbol{\theta})=\int p(\boldsymbol{X}, \boldsymbol{Z} \mid \boldsymbol{\theta}) d \boldsymbol{Z} \tag{7.2} \end{equation*}
Because this integral was not tractable, we instead introduced an additional distribution q(\boldsymbol{Z}) with overlapping support and got the equality:
\begin{equation*} \log p(\boldsymbol{X} \mid \boldsymbol{\theta})=\mathcal{L}(q, \boldsymbol{\theta})+\operatorname{KL}(q(\boldsymbol{Z}) \| p(\boldsymbol{Z} \mid \boldsymbol{X}, \boldsymbol{\theta})), \tag{7.3} \end{equation*}
where KL is the Kullback-Leibler (KL) divergence and:
\begin{equation*} \mathcal{L}(q, \boldsymbol{\theta})=\int q(\boldsymbol{Z}) \log \left(\frac{p(\boldsymbol{X}, \boldsymbol{Z} \mid \boldsymbol{\theta})}{q(\boldsymbol{Z})}\right) d \boldsymbol{Z} \tag{7.4} \end{equation*}
We noted that the KL divergence is always greater than 0 except when q(\boldsymbol{Z})=p(\boldsymbol{Z} \mid \boldsymbol{X}, \boldsymbol{\theta}), so:
\begin{equation*} \log p(\boldsymbol{X} \mid \boldsymbol{\theta}) \geq \mathcal{L}(q, \boldsymbol{\theta}) \tag{7.5} \end{equation*}
We then imagined a two-step process: the expectation-maximization algorithm. The first step was to choose an ideal q(\boldsymbol{Z}) to minimize the divergence term. This brought \mathcal{L}(q, \boldsymbol{\theta}) as close as possible to our desired objective. As our second step, we maximized \mathcal{L}(q, \boldsymbol{\theta}) with respect to \boldsymbol{\theta} to maximize our objective. We then noted that we could choose a new, better distribution for q(\boldsymbol{Z}) given our new setting for the parameters \boldsymbol{\theta}, and so our EM algorithm (described in Algorithm 1) was born.
Algorithm 1 Expectation-Maximization
Input: Initial estimate for the parameters \boldsymbol{\theta}_{0}
- for n=1 to N_{\text{iter}} do
- \quad q_{n} \gets \underset{q}{\operatorname{argmin}} \; \mathrm{KL}\left(q(\boldsymbol{Z}) \| p\left(\boldsymbol{Z} \mid \boldsymbol{X}, \boldsymbol{\theta}_{n-1}\right)\right) \quad \triangleright E step
- \quad \boldsymbol{\theta}_{n} \gets \underset{\boldsymbol{\theta}}{\operatorname{argmax}} \; \mathcal{L}\left(q_{n}, \boldsymbol{\theta}\right) \quad \triangleright M step
- end for
Output: Parameter estimate \boldsymbol{\theta}_{N_{\text{iter}}}
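In code, the outer loop of Algorithm 1 is just an alternation between the two steps. Below is a minimal Python sketch; e_step and m_step are hypothetical callables standing in for the model-specific computations (for an LDS, the E step is the Kalman smoother from the previous lecture):

```python
def em(x, e_step, m_step, theta0, n_iter=100):
    """Skeleton of Algorithm 1 (Expectation-Maximization).

    e_step(x, theta) should return the statistics of
    q_n = p(Z | X, theta_{n-1}); m_step(x, q) should return the
    theta that maximizes L(q_n, theta). Both names are placeholders.
    """
    theta = theta0
    for _ in range(n_iter):
        q = e_step(x, theta)    # E step: minimize the KL term
        theta = m_step(x, q)    # M step: maximize L(q_n, theta)
    return theta
```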
One detail we skipped over last week is that we can’t actually calculate p\left(\boldsymbol{Z} \mid \boldsymbol{X}, \boldsymbol{\theta}_{n-1}\right) for our LDS system. This would be the full joint distribution of our latent states given the data, but we have only learned how to calculate p\left(\boldsymbol{z}_{t} \mid \boldsymbol{X}, \boldsymbol{\theta}_{n-1}\right). That should be alarming, since I provided you with these equations in the last lecture:
\begin{align*} \boldsymbol{\mu}_{0, n} & =\mathbb{E}\left[\boldsymbol{z}_{0}\right]=\boldsymbol{\mu}_{0 \mid T} \tag{7.6}\\ \boldsymbol{\Sigma}_{0, n} & =\mathbb{E}\left[\boldsymbol{z}_{0} \boldsymbol{z}_{0}^{T}\right]-\boldsymbol{\mu}_{0, n} \boldsymbol{\mu}_{0, n}^{T}=\boldsymbol{\Sigma}_{0 \mid T} \tag{7.7}\\ \boldsymbol{A}_{n} & =\left(\sum_{t=1}^{T} \mathbb{E}\left[\boldsymbol{z}_{t} \boldsymbol{z}_{t-1}^{T}\right]\right)\left(\sum_{t=1}^{T} \mathbb{E}\left[\boldsymbol{z}_{t-1} \boldsymbol{z}_{t-1}^{T}\right]\right)^{-1} \tag{7.8}\\ \boldsymbol{Q}_{n} & =\frac{1}{T} \sum_{t=1}^{T}\left(\mathbb{E}\left[\boldsymbol{z}_{t} \boldsymbol{z}_{t}^{T}\right]-\boldsymbol{A}_{n} \mathbb{E}\left[\boldsymbol{z}_{t-1} \boldsymbol{z}_{t}^{T}\right]\right. \\ & \quad \left.-\mathbb{E}\left[\boldsymbol{z}_{t} \boldsymbol{z}_{t-1}^{T}\right] \boldsymbol{A}_{n}^{T}+\boldsymbol{A}_{n} \mathbb{E}\left[\boldsymbol{z}_{t-1} \boldsymbol{z}_{t-1}^{T}\right] \boldsymbol{A}_{n}^{T}\right) \tag{7.9}\\ \boldsymbol{C}_{n} & =\left(\sum_{t=1}^{T} \boldsymbol{x}_{t} \mathbb{E}\left[\boldsymbol{z}_{t}^{T}\right]\right)\left(\sum_{t=1}^{T} \mathbb{E}\left[\boldsymbol{z}_{t} \boldsymbol{z}_{t}^{T}\right]\right)^{-1} \tag{7.10}\\ \boldsymbol{R}_{n} & =\frac{1}{T} \sum_{t=1}^{T}\left(\boldsymbol{x}_{t} \boldsymbol{x}_{t}^{T}-\boldsymbol{C}_{n} \mathbb{E}\left[\boldsymbol{z}_{t}\right] \boldsymbol{x}_{t}^{T}-\boldsymbol{x}_{t} \mathbb{E}\left[\boldsymbol{z}_{t}^{T}\right] \boldsymbol{C}_{n}^{T}+\boldsymbol{C}_{n} \mathbb{E}\left[\boldsymbol{z}_{t} \boldsymbol{z}_{t}^{T}\right] \boldsymbol{C}_{n}^{T}\right) \tag{7.11} \end{align*}
How was this possible? Well, as we discussed in the last lecture, the term we want to maximize can be written as:
\begin{align*} \mathbb{E}_{q_{n}}[\log p(\boldsymbol{X}, \boldsymbol{Z} \mid \boldsymbol{\theta})] &= \mathbb{E}_{q_{n}}\left[\log p\left(\boldsymbol{z}_{\mathbf{0}} \mid \boldsymbol{\mu}_{0}, \boldsymbol{\Sigma}_{0}\right)+\sum_{t=1}^{T} \log p\left(\boldsymbol{z}_{t} \mid \boldsymbol{z}_{t-1}, \boldsymbol{A}, \boldsymbol{Q}\right)\right. \\ & \quad \left.+\sum_{t=1}^{T} \log p\left(\boldsymbol{x}_{t} \mid \boldsymbol{z}_{t}, \boldsymbol{C}, \boldsymbol{R}\right)\right]+\mathrm{const.} \tag{7.12} \end{align*}
It’s this ability to factorize our model that saves us. In the end, maximizing our objective with respect to the parameters only requires us to be able to calculate three expectations:
\begin{align*} \mathbb{E}_{q_{n}}\left[\boldsymbol{z}_{t} \mid \boldsymbol{x}_{1: T}\right] & =\boldsymbol{\mu}_{t \mid T} \tag{7.13}\\ \mathbb{E}_{q_{n}}\left[\boldsymbol{z}_{t} \boldsymbol{z}_{t}^{T} \mid \boldsymbol{x}_{1: T}\right] & =\boldsymbol{\Sigma}_{t \mid T}+\boldsymbol{\mu}_{t \mid T} \boldsymbol{\mu}_{t \mid T}^{T} \tag{7.14}\\ \mathbb{E}_{q_{n}}\left[\boldsymbol{z}_{t+1} \boldsymbol{z}_{t}^{T} \mid \boldsymbol{x}_{1: T}\right] & =\boldsymbol{F}_{t} \boldsymbol{\Sigma}_{t+1 \mid T}+\boldsymbol{\mu}_{t+1 \mid T} \boldsymbol{\mu}_{t \mid T}^{T} \tag{7.15} \end{align*}
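These expectations can be assembled directly from the smoother outputs. Below is a small numpy sketch; the names mu, Sigma, and F for the smoothed means, smoothed covariances, and smoother gain matrices are assumptions about how the previous lecture's code stores them:

```python
import numpy as np

def smoothed_moments(mu, Sigma, F):
    """Second moments (7.14)-(7.15) from RTS smoother outputs.

    mu:    (T+1, d)    smoothed means mu_{t|T} for t = 0..T
    Sigma: (T+1, d, d) smoothed covariances Sigma_{t|T}
    F:     (T, d, d)   smoother gain matrices F_t
    (Names and shapes are assumptions, not fixed by the lecture.)
    """
    # E[z_t z_t^T | x_{1:T}] = Sigma_{t|T} + mu_{t|T} mu_{t|T}^T        (7.14)
    Ezz = Sigma + np.einsum('ti,tj->tij', mu, mu)
    # E[z_{t+1} z_t^T | x_{1:T}] = F_t Sigma_{t+1|T} + mu_{t+1|T} mu_{t|T}^T  (7.15)
    Ezz_lag = (np.einsum('tij,tjk->tik', F, Sigma[1:])
               + np.einsum('ti,tj->tij', mu[1:], mu[:-1]))
    return Ezz, Ezz_lag
```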
Two of these come directly from our smoothing calculation, but the third does not. Thankfully, it only involves adjacent latent states, and, as I have already told you, its form depends only on matrices and vectors we have already calculated as part of our smoothing operation.
Let’s take a moment to derive this expectation. Consider:
\begin{align*} p\left(\boldsymbol{z}_{t}, \boldsymbol{z}_{t+1} \mid \boldsymbol{x}_{1: T}\right) & =p\left(\boldsymbol{z}_{t} \mid \boldsymbol{z}_{t+1}, \boldsymbol{x}_{1: T}\right) p\left(\boldsymbol{z}_{t+1} \mid \boldsymbol{x}_{1: T}\right) \tag{7.16}\\ & =p\left(\boldsymbol{z}_{t} \mid \boldsymbol{z}_{t+1}, \boldsymbol{x}_{1: t}\right) p\left(\boldsymbol{z}_{t+1} \mid \boldsymbol{x}_{1: T}\right) \tag{7.17}\\ & =\frac{p\left(\boldsymbol{z}_{t}, \boldsymbol{z}_{t+1} \mid \boldsymbol{x}_{1: t}\right)}{p\left(\boldsymbol{z}_{t+1} \mid \boldsymbol{x}_{1: t}\right)} p\left(\boldsymbol{z}_{t+1} \mid \boldsymbol{x}_{1: T}\right) \tag{7.18}\\ & =\frac{p\left(\boldsymbol{z}_{t+1} \mid \boldsymbol{z}_{t}\right) p\left(\boldsymbol{z}_{t} \mid \boldsymbol{x}_{1: t}\right)}{p\left(\boldsymbol{z}_{t+1} \mid \boldsymbol{x}_{1: t}\right)} p\left(\boldsymbol{z}_{t+1} \mid \boldsymbol{x}_{1: T}\right) \tag{7.19} \end{align*}
where we’ve taken advantage of some of the conditional independence properties of our HMM. We can now explicitly write out each of the terms in our likelihood, since all of them are described by normal distributions we understand:
\begin{align*} p\left(\boldsymbol{z}_{t}, \boldsymbol{z}_{t+1} \mid \boldsymbol{x}_{1: T}\right) &= \frac{\mathcal{N}\left(\boldsymbol{z}_{t+1} \mid \boldsymbol{A} \boldsymbol{z}_{t}, \boldsymbol{Q}\right) \mathcal{N}\left(\boldsymbol{z}_{t} \mid \boldsymbol{\mu}_{t \mid t}, \boldsymbol{\Sigma}_{t \mid t}\right)}{\mathcal{N}\left(\boldsymbol{z}_{t+1} \mid \boldsymbol{\mu}_{t+1 \mid t}, \boldsymbol{\Sigma}_{t+1 \mid t}\right)} \\ & \quad \times \mathcal{N}\left(\boldsymbol{z}_{t+1} \mid \boldsymbol{\mu}_{t+1 \mid T}, \boldsymbol{\Sigma}_{t+1 \mid T}\right) \tag{7.20} \end{align*}
All we need to do is a bit of pattern matching. Multiplying these terms together will give us a complicated exponential with some term of the form \boldsymbol{z}_{t}^{T} \boldsymbol{\Gamma} \boldsymbol{z}_{t+1}. Since we know that the final result is a multivariate Gaussian, this \boldsymbol{\Gamma} term is just \boldsymbol{\Sigma}_{z_{t}, z_{t+1}}^{-1}. I leave the bookkeeping as an exercise to the reader, but the result is:
\begin{equation*} \mathbb{E}_{q_{n}}\left[\boldsymbol{z}_{t+1} \boldsymbol{z}_{t}^{T}\right]=\boldsymbol{F}_{t} \boldsymbol{\Sigma}_{t+1 \mid T}+\boldsymbol{\mu}_{t+1 \mid T} \boldsymbol{\mu}_{t \mid T}^{T} \tag{7.21} \end{equation*}
So it turns out that even though we have set q\left(\boldsymbol{z}_{1: T}\right)=p\left(\boldsymbol{z}_{1: T} \mid \boldsymbol{x}_{1: T}, \boldsymbol{\theta}\right), the only expectation values we actually need come from calculating p\left(\boldsymbol{z}_{t} \mid \boldsymbol{x}_{1: T}, \boldsymbol{\theta}\right). This is not an accident; it is a feature of the statistical structure of the HMM underlying our analysis.
Discrete hidden Markov models
So far, we’ve considered state space models with a continuous latent space \boldsymbol{z}. However, we may often want to run our HMM models in situations where we are moving between discrete classes in the latent space. For example, we may be observing the fMRI scans of a patient who is waking up and falling back asleep. In that case our latent space consists of two discrete states - awake and asleep - that determine the fMRI patterns we will observe. In this class, we will call these discrete HMM models. In the discrete case, we will still treat \boldsymbol{z}_{t} as a vector, using a one-hot encoding. That is:
\begin{equation*} z_{t, k}= \begin{cases}1 & \text { if in class } k \text { at time } t \\ 0 & \text { otherwise }\end{cases} \tag{7.22} \end{equation*}
Below is a visual depiction of a discrete HMM model when there are two discrete latent states and three observations. The transition probabilities are explicitly labeled as A_{k,k'}.
If we assume that we have K classes, then our index k runs from 1, \ldots, K, and our transition probability is fully described by specifying, for all k, k^{\prime}, the probability of transitioning from state k to k^{\prime}:
p\left(z_{t+1, k^{\prime}}=1 \mid z_{t, k}=1\right)=A_{k, k^{\prime}} \tag{7.23}
Since A_{k, k^{\prime}} is a probability we can state that:
\begin{align} & 0 \leq A_{k, k^{\prime}} \leq 1 \tag{7.24}\\ & \sum_{k^{\prime}=1}^{K} A_{k, k^{\prime}}=1 \tag{7.25} \end{align}
From Equation 7.25 we can say that the matrix \boldsymbol{A} has K(K-1) degrees of freedom. We can write our full transition probability function as:
p\left(\boldsymbol{z}_{t+1} \mid \boldsymbol{z}_{t}\right)=\prod_{k=1}^{K} \prod_{k^{\prime}=1}^{K}\left(A_{k, k^{\prime}}\right)^{z_{t, k} z_{t+1, k^{\prime}}} \tag{7.26}
The initial latent node \boldsymbol{z}_{0} has a prior that is fully determined by a vector of probabilities \boldsymbol{\pi}:
p\left(\boldsymbol{z}_{0}\right)=\prod_{k=1}^{K}\left(\pi_{k}\right)^{z_{0, k}} \tag{7.27}
Again, we have the restriction that:
\begin{align} & 0 \leq \pi_{k} \leq 1 \tag{7.28}\\ & \sum_{k=1}^{K} \pi_{k}=1 \tag{7.29} \end{align}
meaning that there are only K-1 degrees of freedom to our vector \boldsymbol{\pi}.
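Because \boldsymbol{z}_{t} is one-hot, the product forms in Equation 7.26 and Equation 7.27 simply index into \boldsymbol{A} and \boldsymbol{\pi}. A quick sanity check in Python (the numbers here are illustrative, not from any example in these notes):

```python
import numpy as np

A = np.array([[0.9, 0.1],     # A[k, k'] = p(transition k -> k')
              [0.3, 0.7]])
pi = np.array([0.2, 0.8])

z_t  = np.array([1, 0])       # in class k = 1 at time t
z_t1 = np.array([0, 1])       # in class k' = 2 at time t + 1

# prod over k, k' of A_{k,k'}^{z_{t,k} z_{t+1,k'}}   (7.26)
p_trans = np.prod(A ** np.outer(z_t, z_t1))
assert p_trans == A[0, 1]     # the product just picks out A_{1,2}

# prod over k of pi_k^{z_{0,k}}                      (7.27)
p_init = np.prod(pi ** z_t)
assert p_init == pi[0]
```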
The only remaining distribution we need to specify in our model is p\left(\boldsymbol{x}_{t} \mid \boldsymbol{z}_{t}\right). Until we want to optimize the parameters of p\left(\boldsymbol{x}_{t} \mid \boldsymbol{z}_{t}\right), we will not need to assume any form for the distribution. We will only require that we are able to evaluate it (the same assumption we made for particle filtering).
Inference with discrete HMMs
Our first goal is to infer the latent states of our discrete HMM model. The full posterior, p\left(\boldsymbol{z}_{1: T} \mid \boldsymbol{x}_{1: T}\right), requires evaluating K^{T} possible configurations. For large latent spaces or long sequences, this quickly becomes computationally intractable. As with the LDS system, we’ll start instead by focusing on the exact marginal, p\left(\boldsymbol{z}_{t} \mid \boldsymbol{x}_{1: T}\right), and joint, p\left(\boldsymbol{z}_{t}, \boldsymbol{z}_{t+1} \mid \boldsymbol{x}_{1: T}\right). These will be the quantities we need for future prediction and for optimizing the parameters of our model.
When we introduced our HMM models, we started by dividing our marginal distribution into two terms:
\begin{align} p\left(\boldsymbol{z}_{t} \mid \boldsymbol{x}_{1: T}\right) & =\frac{\alpha\left(\boldsymbol{z}_{t}\right) \beta\left(\boldsymbol{z}_{t}\right)}{p\left(\boldsymbol{x}_{1: T}\right)} \tag{7.30}\\ \alpha\left(\boldsymbol{z}_{t}\right) & =p\left(\boldsymbol{x}_{1: t}, \boldsymbol{z}_{t}\right) \tag{7.31}\\ \beta\left(\boldsymbol{z}_{t}\right) & =p\left(\boldsymbol{x}_{t+1: T} \mid \boldsymbol{z}_{t}\right) \tag{7.32} \end{align}
We utilize this factorization because both \alpha and \beta can be calculated recursively through a forward and backward pass. For \alpha we have:
\alpha\left(\boldsymbol{z}_{t}\right)=p\left(\boldsymbol{x}_{t} \mid \boldsymbol{z}_{t}\right) \sum_{\boldsymbol{z}_{t-1}} \alpha\left(\boldsymbol{z}_{t-1}\right) p\left(\boldsymbol{z}_{t} \mid \boldsymbol{z}_{t-1}\right) \tag{7.33}
For \beta we have:
\beta\left(\boldsymbol{z}_{t}\right)=\sum_{\boldsymbol{z}_{t+1}} \beta\left(\boldsymbol{z}_{t+1}\right) p\left(\boldsymbol{x}_{t+1} \mid \boldsymbol{z}_{t+1}\right) p\left(\boldsymbol{z}_{t+1} \mid \boldsymbol{z}_{t}\right) \tag{7.34}
Note that we can write our normalizing constant from Equation 7.30 as:
p\left(\boldsymbol{x}_{1: T}\right)=\sum_{\boldsymbol{z}_{t}} \alpha\left(\boldsymbol{z}_{t}\right) \beta\left(\boldsymbol{z}_{t}\right) \tag{7.35}
and we can write our joint distribution as:
\begin{align} p\left(\boldsymbol{z}_{t}, \boldsymbol{z}_{t+1} \mid \boldsymbol{x}_{1: T}\right) & =\frac{p\left(\boldsymbol{x}_{1: T} \mid \boldsymbol{z}_{t}, \boldsymbol{z}_{t+1}\right) p\left(\boldsymbol{z}_{t}, \boldsymbol{z}_{t+1}\right)}{p\left(\boldsymbol{x}_{1: T}\right)} \tag{7.36}\\ & =\frac{p\left(\boldsymbol{x}_{1: t} \mid \boldsymbol{z}_{t}\right) p\left(\boldsymbol{x}_{t+1} \mid \boldsymbol{z}_{t+1}\right) p\left(\boldsymbol{x}_{t+2: T} \mid \boldsymbol{z}_{t+1}\right)}{p\left(\boldsymbol{x}_{1: T}\right)} \nonumber \\ & \quad \times p\left(\boldsymbol{z}_{t+1} \mid \boldsymbol{z}_{t}\right) p\left(\boldsymbol{z}_{t}\right) \tag{7.37}\\ & =\frac{\alpha\left(\boldsymbol{z}_{t}\right) p\left(\boldsymbol{x}_{t+1} \mid \boldsymbol{z}_{t+1}\right) \beta\left(\boldsymbol{z}_{t+1}\right) p\left(\boldsymbol{z}_{t+1} \mid \boldsymbol{z}_{t}\right)}{p\left(\boldsymbol{x}_{1: T}\right)} \tag{7.38} \end{align}
If we calculate the \alpha and \beta terms we have everything we need to evaluate the marginal and joint distributions we’re interested in. Equation 7.33 and Equation 7.34 are both recursive, so we need a starting point for each recursion. For \beta, this is easy since:
\begin{align} p\left(\boldsymbol{z}_{T} \mid \boldsymbol{x}_{1: T}\right) & =\frac{p\left(\boldsymbol{x}_{1: T}, \boldsymbol{z}_{T}\right) \times 1}{p\left(\boldsymbol{x}_{1: T}\right)} \tag{7.39}\\ \beta\left(\boldsymbol{z}_{T}\right) & =1 \tag{7.40} \end{align}
because there is no data beyond T. For \alpha we can take advantage of our prior:
\begin{align} p\left(\boldsymbol{z}_{0} \mid \boldsymbol{x}_{1: T}\right) & =\frac{p\left(\boldsymbol{z}_{0}\right) \times p\left(\boldsymbol{x}_{1: T} \mid \boldsymbol{z}_{0}\right)}{p\left(\boldsymbol{x}_{1: T}\right)} \tag{7.41}\\ \alpha\left(\boldsymbol{z}_{0}\right) & =p\left(\boldsymbol{z}_{0}\right) \tag{7.42} \end{align}
Numerical Stability
Before we work through an example, there’s an additional implementation caveat for the forward and backward pass of the \alpha and \beta calculation.
Specifically, as we go deeper into our forward pass, the values of \alpha are being multiplied by more and more factors of the transition and observation probability. Since these values are less than 1, as our sequence gets longer we expect \alpha to tend towards zero. This will eventually run into the floating point limit of our computer, and so it will be important to work with rescaled variables. The standard choice is:
\hat{\alpha}\left(\boldsymbol{z}_{t}\right)=p\left(\boldsymbol{z}_{t} \mid \boldsymbol{x}_{1: t}\right)=\frac{\alpha\left(\boldsymbol{z}_{t}\right)}{p\left(\boldsymbol{x}_{1: t}\right)} \tag{7.43}
It’s useful to define a scaling factor c_{t} as:
c_{t}=p\left(\boldsymbol{x}_{t} \mid \boldsymbol{x}_{1: t-1}\right) \tag{7.44}
since then we can write:
\begin{align} p\left(\boldsymbol{x}_{1: t}\right) & =p\left(\boldsymbol{x}_{1}\right) p\left(\boldsymbol{x}_{2} \mid \boldsymbol{x}_{1}\right) p\left(\boldsymbol{x}_{3} \mid \boldsymbol{x}_{1: 2}\right) \cdots p\left(\boldsymbol{x}_{t} \mid \boldsymbol{x}_{1: t-1}\right) \tag{7.45}\\ & =\prod_{i=1}^{t} c_{i} \tag{7.46} \end{align}
These scaling factors let us write the original \alpha in terms of the rescaled \hat{\alpha}:
\alpha\left(\boldsymbol{z}_{t}\right)=\left(\prod_{i=1}^{t} c_{i}\right) \hat{\alpha}\left(\boldsymbol{z}_{t}\right) \tag{7.47}
This rescaled variable is normalized to one by construction, so it is guaranteed to be numerically stable. However, we now need a recursion relation for \hat{\alpha} :
\begin{align} \alpha\left(\boldsymbol{z}_{t}\right) & =p\left(\boldsymbol{x}_{t} \mid \boldsymbol{z}_{t}\right) \sum_{\boldsymbol{z}_{t-1}} \alpha\left(\boldsymbol{z}_{t-1}\right) p\left(\boldsymbol{z}_{t} \mid \boldsymbol{z}_{t-1}\right) \tag{7.48}\\ \hat{\alpha}\left(\boldsymbol{z}_{t}\right)\left(\prod_{i=1}^{t} c_{i}\right) & =p\left(\boldsymbol{x}_{t} \mid \boldsymbol{z}_{t}\right) \sum_{\boldsymbol{z}_{t-1}}\left(\prod_{i=1}^{t-1} c_{i}\right) \hat{\alpha}\left(\boldsymbol{z}_{t-1}\right) p\left(\boldsymbol{z}_{t} \mid \boldsymbol{z}_{t-1}\right) \tag{7.49}\\ c_{t} \hat{\alpha}\left(\boldsymbol{z}_{t}\right) & =p\left(\boldsymbol{x}_{t} \mid \boldsymbol{z}_{t}\right) \sum_{\boldsymbol{z}_{t-1}} \hat{\alpha}\left(\boldsymbol{z}_{t-1}\right) p\left(\boldsymbol{z}_{t} \mid \boldsymbol{z}_{t-1}\right) \tag{7.50} \end{align}
In practice, c_{t} is just the normalization of the right-hand side of Equation 7.50.
\begin{align} c_{t} \sum_{\boldsymbol{z}_{t}} \hat{\alpha}\left(\boldsymbol{z}_{t}\right) & =\sum_{\boldsymbol{z}_{t}} p\left(\boldsymbol{x}_{t} \mid \boldsymbol{z}_{t}\right) \sum_{\boldsymbol{z}_{t-1}} \hat{\alpha}\left(\boldsymbol{z}_{t-1}\right) p\left(\boldsymbol{z}_{t} \mid \boldsymbol{z}_{t-1}\right) \tag{7.51}\\ c_{t} & =\sum_{\boldsymbol{z}_{t}} p\left(\boldsymbol{x}_{t} \mid \boldsymbol{z}_{t}\right) \sum_{\boldsymbol{z}_{t-1}} \hat{\alpha}\left(\boldsymbol{z}_{t-1}\right) p\left(\boldsymbol{z}_{t} \mid \boldsymbol{z}_{t-1}\right) \tag{7.52} \end{align}
Notice that \hat{\alpha} is the analogue of the Kalman filtering distribution we calculated for the LDS model.
Given that we have a rescaled \alpha we will also want a rescaled \beta. As we will see, a good choice is:
\begin{align} & \beta\left(\boldsymbol{z}_{t}\right)=\left(\prod_{i=t+1}^{T} c_{i}\right) \hat{\beta}\left(\boldsymbol{z}_{t}\right) \tag{7.53}\\ & \hat{\beta}\left(\boldsymbol{z}_{t}\right)=\frac{p\left(\boldsymbol{x}_{t+1: T} \mid \boldsymbol{z}_{t}\right)}{p\left(\boldsymbol{x}_{t+1: T} \mid \boldsymbol{x}_{1: t}\right)} \tag{7.54} \end{align}
This gives us the recursion relation:
c_{t+1} \hat{\beta}\left(\boldsymbol{z}_{t}\right)=\sum_{\boldsymbol{z}_{t+1}} \hat{\beta}\left(\boldsymbol{z}_{t+1}\right) p\left(\boldsymbol{x}_{t+1} \mid \boldsymbol{z}_{t+1}\right) p\left(\boldsymbol{z}_{t+1} \mid \boldsymbol{z}_{t}\right) \tag{7.55}
Finally, we’ll leave it as an exercise for the reader to use Equation 7.30 and Equation 7.38 to verify that:
\begin{align} p\left(\boldsymbol{z}_{t} \mid \boldsymbol{x}_{1: T}\right) & =\hat{\alpha}\left(\boldsymbol{z}_{t}\right) \hat{\beta}\left(\boldsymbol{z}_{t}\right) \tag{7.56}\\ p\left(\boldsymbol{z}_{t}, \boldsymbol{z}_{t+1} \mid \boldsymbol{x}_{1: T}\right) & =\frac{1}{c_{t+1}} \hat{\alpha}\left(\boldsymbol{z}_{t}\right) p\left(\boldsymbol{x}_{t+1} \mid \boldsymbol{z}_{t+1}\right) \hat{\beta}\left(\boldsymbol{z}_{t+1}\right) p\left(\boldsymbol{z}_{t+1} \mid \boldsymbol{z}_{t}\right) \tag{7.57} \end{align}
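To make the recursions concrete, here is a minimal numpy sketch of the scaled forward and backward passes. It assumes the emission probabilities have been pre-evaluated into a T \times K array, which is a design convenience rather than a requirement:

```python
import numpy as np

def forward_backward(pi, A, px):
    """Scaled forward-backward pass for a discrete HMM.

    pi: (K,)   prior p(z_0)
    A:  (K, K) A[k, k'] = p(z_{t+1, k'} = 1 | z_{t, k} = 1)
    px: (T, K) px[t - 1, k] = p(x_t | z_{t, k} = 1), pre-evaluated

    Returns alpha_hat, beta_hat ((T+1, K)) and c ((T,)), so that
    alpha_hat * beta_hat gives the marginals p(z_t | x_{1:T})  (7.56).
    """
    T, K = px.shape
    alpha_hat = np.zeros((T + 1, K))
    beta_hat = np.ones((T + 1, K))       # beta_hat(z_T) = 1      (7.40)
    c = np.zeros(T)

    alpha_hat[0] = pi                    # alpha(z_0) = p(z_0)    (7.42)
    for t in range(1, T + 1):
        a = px[t - 1] * (alpha_hat[t - 1] @ A)   # rhs of (7.50)
        c[t - 1] = a.sum()                       # (7.52)
        alpha_hat[t] = a / c[t - 1]

    for t in range(T - 1, -1, -1):               # (7.55)
        beta_hat[t] = (A @ (px[t] * beta_hat[t + 1])) / c[t]

    return alpha_hat, beta_hat, c
```

From the returned arrays, np.log(c).sum() gives \log p\left(\boldsymbol{x}_{1: T}\right) via Equation 7.46, and Equation 7.57 gives the pairwise joints.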
Example inference calculation
Let’s work through an example that uses these rescaled values. Imagine we have a latent space with K=2 and we have two observations such that T=2. Let’s start by specifying our prior and transition matrix:
\begin{align} \boldsymbol{\pi} & =\left[\begin{array}{l} 0.2 \\ 0.8 \end{array}\right] \tag{7.58}\\ \boldsymbol{A} & =\left[\begin{array}{ll} 0.5 & 0.5 \\ 0.5 & 0.5 \end{array}\right] \tag{7.59} \end{align}
Our prior favors the k=2 state, but our transition matrix says that we are equally likely to move to either state no matter what state we are in. The last piece we need is p\left(\boldsymbol{x}_{t} \mid \boldsymbol{z}_{t}\right). Let’s keep this example simple by asserting that:
\begin{align} & p\left(\boldsymbol{x}_{1} \left\lvert\, \boldsymbol{z}_{1}=\left[\begin{array}{l} 1.0 \\ 0.0 \end{array}\right]\right.\right)=0.2 \tag{7.60}\\ & p\left(\boldsymbol{x}_{1} \left\lvert\, \boldsymbol{z}_{1}=\left[\begin{array}{l} 0.0 \\ 1.0 \end{array}\right]\right.\right)=0.8 \tag{7.61}\\ & p\left(\boldsymbol{x}_{2} \left\lvert\, \boldsymbol{z}_{2}=\left[\begin{array}{l} 1.0 \\ 0.0 \end{array}\right]\right.\right)=0.9 \tag{7.62}\\ & p\left(\boldsymbol{x}_{2} \left\lvert\, \boldsymbol{z}_{2}=\left[\begin{array}{l} 0.0 \\ 1.0 \end{array}\right]\right.\right)=0.1 \tag{7.63} \end{align}
That’s all we need to conduct our calculations. Let’s adopt the notation:
p\left(\boldsymbol{z}_{t}=\left[\begin{array}{l} 0.0 \\ 1.0 \end{array}\right] \left\lvert\, \boldsymbol{z}_{t-1}=\left[\begin{array}{l} 1.0 \\ 0.0 \end{array}\right]\right.\right)=p\left(z_{t, 2} \mid z_{t-1,1}\right) \tag{7.64}
and calculate \hat{\alpha} :
\begin{align} \hat{\alpha}\left(z_{0,1}\right) & =\pi_{1}^{z_{0,1}} \tag{7.65}\\ & =0.2 \tag{7.66}\\ \hat{\alpha}\left(z_{0,2}\right) & =0.8 \tag{7.67}\\ \hat{\alpha}\left(z_{1,1}\right) & =\frac{1}{c_{1}} p\left(\boldsymbol{x}_{1} \mid z_{1,1}\right) \sum_{\boldsymbol{z}_{0}} \hat{\alpha}\left(\boldsymbol{z}_{0}\right) p\left(z_{1,1} \mid \boldsymbol{z}_{0}\right) \tag{7.68}\\ & =\frac{1}{c_{1}} p\left(\boldsymbol{x}_{1} \mid z_{1,1}\right)\left[\hat{\alpha}\left(z_{0,1}\right) p\left(z_{1,1} \mid z_{0,1}\right)+\hat{\alpha}\left(z_{0,2}\right) p\left(z_{1,1} \mid z_{0,2}\right)\right] \tag{7.69}\\ & =\frac{1}{c_{1}} \times 0.2 \times[0.2 \times 0.5+0.8 \times 0.5] \tag{7.70}\\ & =\frac{1}{c_{1}} \times 0.1 \tag{7.71}\\ \hat{\alpha}\left(z_{1,2}\right) & =\frac{1}{c_{1}} p\left(\boldsymbol{x}_{1} \mid z_{1,2}\right) \sum_{\boldsymbol{z}_{0}} \hat{\alpha}\left(\boldsymbol{z}_{0}\right) p\left(z_{1,2} \mid \boldsymbol{z}_{0}\right) \tag{7.72}\\ & =\frac{1}{c_{1}} p\left(\boldsymbol{x}_{1} \mid z_{1,2}\right)\left[\hat{\alpha}\left(z_{0,1}\right) p\left(z_{1,2} \mid z_{0,1}\right)+\hat{\alpha}\left(z_{0,2}\right) p\left(z_{1,2} \mid z_{0,2}\right)\right] \tag{7.73}\\ & =\frac{1}{c_{1}} \times 0.8 \times[0.2 \times 0.5+0.8 \times 0.5] \tag{7.74}\\ & =\frac{1}{c_{1}} \times 0.4 \tag{7.75}\\ c_{1} & =0.5 \tag{7.76} \end{align}
Notice how we only get our value for c_{1} after we’ve calculated both of our \hat{\alpha} terms. The final two terms are:
\begin{align} \hat{\alpha}\left(z_{2,1}\right) & =\frac{1}{c_{2}} p\left(\boldsymbol{x}_{2} \mid z_{2,1}\right) \sum_{z_{1}} \hat{\alpha}\left(\boldsymbol{z}_{1}\right) p\left(z_{2,1} \mid \boldsymbol{z}_{1}\right) \tag{7.77}\\ & =\frac{1}{c_{2}} \times 0.45 \tag{7.78}\\ \hat{\alpha}\left(z_{2,2}\right) & =\frac{1}{c_{2}} p\left(\boldsymbol{x}_{2} \mid z_{2,2}\right) \sum_{\boldsymbol{z}_{1}} \hat{\alpha}\left(\boldsymbol{z}_{1}\right) p\left(z_{2,2} \mid \boldsymbol{z}_{1}\right) \tag{7.79}\\ & =\frac{1}{c_{2}} \times 0.05 \tag{7.80}\\ c_{2} & =0.5 \tag{7.81} \end{align}
We leave it as an exercise to the reader to calculate the \hat{\beta} values, but notice that you already have all the constants c_{t} you will need for that calculation.
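The \hat{\alpha} pass above is also easy to check numerically. A few lines of Python reproduce the hand calculation (the \hat{\beta} pass is left as the exercise):

```python
import numpy as np

pi = np.array([0.2, 0.8])            # (7.58)
A = np.array([[0.5, 0.5],            # (7.59)
              [0.5, 0.5]])
px = np.array([[0.2, 0.8],           # p(x_1 | z_1), (7.60)-(7.61)
               [0.9, 0.1]])          # p(x_2 | z_2), (7.62)-(7.63)

alpha_hat, c = [pi], []              # alpha_hat(z_0) = pi
for t in range(2):
    a = px[t] * (alpha_hat[-1] @ A)  # rhs of (7.50)
    c.append(a.sum())                # (7.52)
    alpha_hat.append(a / c[-1])

print(alpha_hat[1], c[0])            # [0.2 0.8] 0.5
print(alpha_hat[2], c[1])            # [0.9 0.1] 0.5
```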
We did not need to know the explicit form of p\left(\boldsymbol{x}_{t} \mid \boldsymbol{z}_{t}\right) to calculate the exact values of \hat{\alpha}; we only needed to be able to evaluate it. As we will see in next week’s lecture, we will only be interested in the parameterization of p\left(\boldsymbol{x}_{t} \mid \boldsymbol{z}_{t}\right) when we want to update those parameters.
Expectation-maximization with Discrete HMMs
As with our continuous latent space, we want to build the machinery to infer the parameters of our model:
\begin{equation*} p(\boldsymbol{\theta} \mid \boldsymbol{X}) \propto p(\boldsymbol{X} \mid \boldsymbol{\theta}) p(\boldsymbol{\theta}), \tag{7.82} \end{equation*}
where \boldsymbol{\theta}=\{\boldsymbol{\pi}, \boldsymbol{A}, \ldots\}. We have not yet specified the parameters of our observation probability function. The EM algorithm we derived for a continuous latent space (see Algorithm 1) translates directly to our discrete case. We already know the choice we want to make for q_{n}:
\begin{equation*} q_{n}\left(\boldsymbol{z}_{1: T}\right)=p\left(\boldsymbol{z}_{1: T} \mid \boldsymbol{x}_{1: T}, \boldsymbol{\theta}_{n-1}\right) . \tag{7.83} \end{equation*}
As with the continuous case, we do not have access to this full joint posterior. We only have the marginal and the joint over two neighboring latent states. However, as with our continuous case, the structure of our model will come to our rescue. Once we have specified q_{n} we need to maximize:
\begin{align*} \mathbb{E}_{q_{n}}\left[\log p\left(\boldsymbol{x}_{1: T}, \boldsymbol{z}_{1: T} \mid \boldsymbol{\theta}\right)\right] &= \mathbb{E}_{q_{n}}\left[\log p\left(\boldsymbol{z}_{0}\right)+\sum_{t=1}^{T} \log p\left(\boldsymbol{z}_{t} \mid \boldsymbol{z}_{t-1}\right)\right. \\ & \quad \left.+\sum_{t=1}^{T} \log p\left(\boldsymbol{x}_{t} \mid \boldsymbol{z}_{t}\right)\right] \tag{7.84} \end{align*}
with respect to \boldsymbol{\theta}. All of the terms in our equation above either depend on a single latent state or a pair of latent states, so taking an expectation of the full joint boils down to summing over the probabilities for the marginals and joints of neighboring variables:
\begin{align*} & \mathbb{E}_{q_{n}}\left[\log p\left(\boldsymbol{x}_{1: T}, \boldsymbol{z}_{1: T} \mid \boldsymbol{\theta}\right)\right]=\sum_{k=1}^{K} p\left(z_{0, k} \mid \boldsymbol{x}_{1: T}\right) \log \pi_{k} \\ & +\sum_{t=1}^{T} \sum_{k^{\prime}=1}^{K} \sum_{k=1}^{K} p\left(z_{t, k^{\prime}}, z_{t-1, k} \mid \boldsymbol{x}_{1: T}\right) \log A_{k k^{\prime}} \\ & \quad+\sum_{k=1}^{K} \sum_{t=1}^{T} p\left(z_{t, k} \mid \boldsymbol{x}_{1: T}\right) \log p\left(\boldsymbol{x}_{t} \mid z_{t, k}\right) \tag{7.85} \end{align*}
We have some constraints on \boldsymbol{\pi} and \boldsymbol{A} that must be enforced with Lagrange multipliers, but otherwise we can maximize these equations with respect to \boldsymbol{\pi} and \boldsymbol{A} to find:
\begin{align*} \pi_{k, n} & =\frac{p\left(z_{0, k} \mid \boldsymbol{x}_{1: T}, \boldsymbol{\pi}_{n-1}, \boldsymbol{A}_{n-1}\right)}{\sum_{k=1}^{K} p\left(z_{0, k} \mid \boldsymbol{x}_{1: T}, \boldsymbol{\pi}_{n-1}, \boldsymbol{A}_{n-1}\right)} \tag{7.86}\\ A_{k k^{\prime}, n} & =\frac{\sum_{t=1}^{T} p\left(z_{t, k^{\prime}}, z_{t-1, k} \mid \boldsymbol{x}_{1: T}, \boldsymbol{\pi}_{n-1}, \boldsymbol{A}_{n-1}\right)}{\sum_{j=1}^{K} \sum_{t=1}^{T} p\left(z_{t, j}, z_{t-1, k} \mid \boldsymbol{x}_{1: T}, \boldsymbol{\pi}_{n-1}, \boldsymbol{A}_{n-1}\right)}, \tag{7.87} \end{align*}
where the n subscript encodes that these are the parameters corresponding to EM step n where we used distribution q_{n}.
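Once the E step has produced the marginals and pairwise joints (via the scaled forward-backward pass), these updates take only a few lines of code. In this sketch, gamma and xi are assumed array layouts, with gamma[t, k] holding p\left(z_{t, k} \mid \boldsymbol{x}_{1: T}\right) and xi[t - 1, k, k'] holding p\left(z_{t-1, k}, z_{t, k^{\prime}} \mid \boldsymbol{x}_{1: T}\right):

```python
import numpy as np

def m_step_pi_A(gamma, xi):
    """M-step updates (7.86)-(7.87) for the prior and transitions.

    gamma: (T+1, K)  marginals p(z_{t,k} | x_{1:T}), t = 0..T
    xi:    (T, K, K) pairwise joints p(z_{t-1,k}, z_{t,k'} | x_{1:T})
    """
    pi_new = gamma[0] / gamma[0].sum()                   # (7.86)
    counts = xi.sum(axis=0)                              # sum over t of the joints
    A_new = counts / counts.sum(axis=1, keepdims=True)   # (7.87)
    return pi_new, A_new
```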
We can also solve for the parameters of our observation probability function so long as we specify its form. One choice that will often come up (including in lab) is the form:
\begin{equation*} p\left(\boldsymbol{x}_{t} \mid z_{t, k}\right)=\mathcal{N}\left(\boldsymbol{x}_{t} \mid \boldsymbol{\mu}_{k}, \boldsymbol{\Sigma}_{k}\right) \tag{7.88} \end{equation*}
This is a model where each discrete latent state k represents a multivariate normal observational probability function with its own mean, \boldsymbol{\mu}_{k}, and covariance, \boldsymbol{\Sigma}_{k}. If we plug this choice into Equation 7.85 we get:
\begin{align*} \boldsymbol{\mu}_{k, n} & =\frac{\sum_{t=1}^{T} p\left(z_{t, k} \mid \boldsymbol{x}_{1: T}\right) \boldsymbol{x}_{t}}{\sum_{t=1}^{T} p\left(z_{t, k} \mid \boldsymbol{x}_{1: T}\right)} \tag{7.89}\\ \boldsymbol{\Sigma}_{k, n} & =\frac{\sum_{t=1}^{T} p\left(z_{t, k} \mid \boldsymbol{x}_{1: T}\right)\left(\boldsymbol{x}_{t}-\boldsymbol{\mu}_{k, n}\right)\left(\boldsymbol{x}_{t}-\boldsymbol{\mu}_{k, n}\right)^{T}}{\sum_{t=1}^{T} p\left(z_{t, k} \mid \boldsymbol{x}_{1: T}\right)} \tag{7.90} \end{align*}
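A sketch of these updates, again assuming gamma[t - 1, k] holds p\left(z_{t, k} \mid \boldsymbol{x}_{1: T}\right) for t=1, \ldots, T:

```python
import numpy as np

def m_step_gaussian(x, gamma):
    """Gaussian emission updates (7.89)-(7.90).

    x:     (T, d) observations x_1..x_T
    gamma: (T, K) marginals p(z_{t,k} | x_{1:T}) for t = 1..T
    """
    w = gamma / gamma.sum(axis=0)                # weights, normalized per class
    mu = w.T @ x                                 # (K, d), Eq. (7.89)
    K, d = mu.shape
    Sigma = np.zeros((K, d, d))
    for k in range(K):
        diff = x - mu[k]                         # (T, d)
        Sigma[k] = (w[:, k, None] * diff).T @ diff   # Eq. (7.90)
    return mu, Sigma
```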
Viterbi algorithm with discrete HMMs
Alternative reference: See Section 9.2.6 “The Viterbi algorithm” in Murphy, Probabilistic Machine Learning: Advanced Topics.
The discrete nature of our latent space makes it possible for us to ask what the most likely sequence of latent states is. This is not the same as the MAP estimate for \boldsymbol{z}_{t} at each t because we are now considering the full joint p\left(\boldsymbol{z}_{1: T} \mid \boldsymbol{x}_{1: T}\right). We do not want to calculate the full joint posterior since this would require K^{T} terms. However, if we are only interested in the MAP sequence and not the full joint, we can do a much more efficient dynamic programming calculation known as the Viterbi algorithm.
To start, imagine that at time step t you have the probability of the most likely path where z_{t, k}=1. We will call this w_{t, k} :
\begin{equation*} w_{t, k}=\max _{\boldsymbol{z}_{0: t-1}} \log p\left(\boldsymbol{z}_{0: t-1}, z_{t, k} \mid \boldsymbol{x}_{1: t}\right) \tag{7.91} \end{equation*}
Imagine that we also have the index m_{t-1, k} of the class for \boldsymbol{z}_{t-1} that maximized the expression in Equation 7.91. First, notice the dimensionality of these two objects. The weights, w_{t, k}, must be calculated for our K classes for each latent state, and therefore have dimensionality (T+1) \times K. Since each weight except \boldsymbol{w}_{0} has a corresponding m_{t-1, k}, the indices will have dimensionality T \times K. This is much more manageable than K^{T}.
If we already had all of these weights, then:
\begin{align*} k_{T}^{\max } & =\underset{k}{\operatorname{argmax}} w_{T, k} \tag{7.92}\\ & =\underset{k}{\operatorname{argmax}} \max _{\boldsymbol{z}_{0: T-1}} \log p\left(\boldsymbol{z}_{0: T-1}, z_{T, k} \mid \boldsymbol{x}_{1: T}\right), \tag{7.93} \end{align*}
which is the MAP class for the latent state \boldsymbol{z}_{T}. The MAP class for the latent state \boldsymbol{z}_{T-1}, k_{T-1}^{\max }, would then be:
\begin{equation*} k_{T-1}^{\max }=m_{T-1, k_{T}^{\max }} \tag{7.94} \end{equation*}
and more generally:
\begin{equation*} k_{t-1}^{\max }=m_{t-1, k_{t}^{\max }} \tag{7.95} \end{equation*}
So now we just need to show that we can calculate w_{t, k} and m_{t-1, k} recursively. If we have the log probability of the most likely path for each class k at timestep t-1, then:
\begin{align*} w_{t, k} & =\log p\left(\boldsymbol{x}_{t} \mid z_{t, k}\right)+\max _{k^{\prime}}\left[\log p\left(z_{t, k} \mid z_{t-1, k^{\prime}}\right)+w_{t-1, k^{\prime}}\right] \tag{7.96}\\ m_{t-1, k} & =\underset{k^{\prime}}{\operatorname{argmax}}\left[\log p\left(z_{t, k} \mid z_{t-1, k^{\prime}}\right)+w_{t-1, k^{\prime}}\right] . \tag{7.97} \end{align*}
And so the Viterbi algorithm allows you to make one forward pass through the data, followed by a backtrace, to calculate the MAP sequence for our model and observations. In our framing, the initial weight distribution of the Viterbi algorithm is defined by:
\begin{equation*} w_{0, k}=\log p\left(z_{0, k}\right) \tag{7.98} \end{equation*}
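Putting Equations 7.92 through 7.98 together gives the following sketch, once more assuming the emission probabilities have been pre-evaluated into a T \times K array:

```python
import numpy as np

def viterbi(pi, A, px):
    """Log-space Viterbi pass (7.96)-(7.98) plus backtrace (7.92), (7.95).

    pi: (K,)   prior p(z_0)
    A:  (K, K) A[k, k'] = p(z_{t+1, k'} = 1 | z_{t, k} = 1)
    px: (T, K) px[t - 1, k] = p(x_t | z_{t, k} = 1), pre-evaluated
    Returns the MAP class indices for z_0..z_T.
    """
    T, K = px.shape
    w = np.zeros((T + 1, K))             # weights w_{t,k}, (T+1) x K
    m = np.zeros((T, K), dtype=int)      # backpointers m_{t-1,k}, T x K
    w[0] = np.log(pi)                    # (7.98)
    for t in range(1, T + 1):
        # scores[k', k] = log p(z_{t,k} | z_{t-1,k'}) + w_{t-1,k'}
        scores = np.log(A) + w[t - 1][:, None]
        m[t - 1] = scores.argmax(axis=0)                 # (7.97)
        w[t] = np.log(px[t - 1]) + scores.max(axis=0)    # (7.96)
    path = np.zeros(T + 1, dtype=int)
    path[T] = w[T].argmax()              # (7.92)
    for t in range(T, 0, -1):            # backtrace via (7.95)
        path[t - 1] = m[t - 1, path[t]]
    return path
```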