Course logistics
All of the relevant course logistics can be found on the syllabus posted on the course website.
Introduction to time series
A time series is a collection of random variables indexed by time:
\{X_{1}, X_{2}, X_{3}, \ldots, X_{t}, \ldots\} \tag{1.1}
where the subscript denotes the time index.
Time series data examples
Examples will be incorporated into the lecture 1 slides.
Key properties
The previous time series notation is common for discrete time intervals, but it is also possible for our time series to be continuous in time. For most of this class, we will be considering discrete time intervals. The process that generates the time series is called a stochastic process. The data we observe are one or more realizations of this stochastic process.
There are two properties of time series that deviate from the traditional statistical setting:
- The data in a time series have a natural ordering established by the time index.
- The data in a time series are not independent and identically distributed (i.i.d.).
While our definition doesn’t require that the data be dependent, we are interested in time series precisely because there are time dependencies in our data that we want to use for analysis or prediction.
Statistical representation
We can fully specify our time series through the joint cumulative distribution function (cdf):
P(X_{1} \leq x_{1}, X_{2} \leq x_{2}, \ldots X_{t} \leq x_{t}, \ldots) \tag{1.4}
This measures the probability that the values of our series are less than or equal to a set of constants x_{1}, x_{2}, \ldots x_{t}, \ldots. Another way to quantify the statistics of our time series is through the joint probability density function (pdf):
p(X_{1}=x_{1}, X_{2}=x_{2}, \ldots X_{t}=x_{t}, \ldots) \tag{1.5}
This measures the probability density that the values of our series take on specific values. In general, it may be intractable to write out the full cdf / pdf. We will exploit knowledge of (or make assumptions about) the structure of the problem in order to make the math tractable.
The curse of dimensionality
To make this challenge clear, let’s consider a sequence of random variables with values 0 or 1. Imagine I wanted to understand the distribution of a sequence of length 2: \{X_{1}, X_{2}\}. If I make no assumptions about the structure of my data, there are 4 possibilities, and 3 unique parameters for the distribution. Now imagine I want to understand the distribution of a sequence of length 3: \{X_{1}, X_{2}, X_{3}\}. There are now 8 possibilities and 7 parameters. The number of parameters grows exponentially with the length of the sequence! For a series of length T, I would need to collect at least 2^{T} datapoints just to have seen every option. We will often be considering time series where the number of realizations is \mathcal{O}(1).
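To see the growth concretely, here is a small Python sketch (illustrative only) that counts the free parameters of an unstructured distribution over binary sequences:

```python
# Illustrative sketch: parameter count for an unstructured joint
# distribution over T binary random variables. Each joint outcome
# needs its own probability, minus one for normalization.
for T in [2, 3, 5, 10, 20]:
    n_outcomes = 2 ** T          # all binary sequences of length T
    n_params = n_outcomes - 1    # probabilities must sum to one
    print(f"T={T:2d}: {n_outcomes:>8} outcomes, {n_params:>8} free parameters")
```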
Basic statistics of time series
Statistical summaries
Mean: \mu_{X}(t)=\mathbb{E}(X_{t})
(Auto)-covariance: \gamma_{X}(s, t)=\mathbb{E}[(X_{s}-\mu_{X}(s))(X_{t}-\mu_{X}(t))]
Auto-correlation: \rho_{X}(s, t)=\frac{\gamma_{X}(s, t)}{\sqrt{\gamma_{X}(s, s) \gamma_{X}(t, t)}}
Because the distribution functions will often be unwieldy objects to work with for a time series, we will take advantage of a number of statistical summaries of the process. These statistics will be useful for quantifying our assumptions, correcting trends in our time series, and building our models. The first is the mean of the time series:
\mu_{X}(t)=\mathbb{E}(X_{t}) \tag{1.6}
where \mathbb{E} denotes the expectation value. Another useful statistic is the (auto)covariance function:
\gamma_{X}(s, t)=\mathbb{E}[(X_{s}-\mu_{X}(s))(X_{t}-\mu_{X}(t))] \tag{1.7}
Sometimes, it will be useful to understand how correlated two points in time are independent of each variable’s intrinsic variance. In those cases, the auto-correlation function can be useful:
\rho_{X}(s, t)=\frac{\gamma_{X}(s, t)}{\sqrt{\gamma_{X}(s, s) \gamma_{X}(t, t)}} \tag{1.8}
The auto-correlation function is a measure of how predictive each time point is of the other. Finally, we may want to understand how predictive one time series is of another. In these cases, it is useful to employ the cross-covariance and cross-correlation functions:
\begin{align} \gamma_{X, Y}(s, t) &= \mathbb{E}[(X_{s}-\mu_{X}(s))(Y_{t}-\mu_{Y}(t))] \tag{1.9} \\ \rho_{X, Y}(s, t) &= \frac{\gamma_{X, Y}(s, t)}{\sqrt{\gamma_{X}(s, s) \gamma_{Y}(t, t)}} \tag{1.10} \end{align}
Example: White noise
White noise has the following properties:
X_{t} \sim \mathcal{N}(0, \sigma_{w}^2)
Mean: \mu_{X}(t)=0
Example 3.1 The statistical properties of white noise.
A stochastic process that will show up repeatedly in this course is white noise. Within a white noise process, the random variables at distinct times are independent, and each is defined by a Gaussian distribution with mean zero and variance \sigma_{w}^{2}:
p(X_{t})=\mathcal{N}(\mu=0, \sigma=\sigma_{w}) \tag{1.11}
By construction, the mean of the white noise process is zero:
\mu_{X}(t)=0 \tag{1.12}
The covariance is also fairly quick to calculate:
\begin{align} \gamma_{X}(s, t) &= \mathbb{E}[(X_{s}-\mu_{X}(s))(X_{t}-\mu_{X}(t))] \tag{1.13}\\ &= \int(X_{s} X_{t}-X_{s} \mu_{X}(t)-X_{t} \mu_{X}(s)+\mu_{X}(s) \mu_{X}(t)) p(X_{s}, X_{t}) dX_{s} dX_{t} \tag{1.14}\\ &= \int X_{s} X_{t} p(X_{s}, X_{t}) dX_{s} dX_{t}-\mu_{X}(s) \mu_{X}(t) \tag{1.15}\\ &= \begin{cases}\sigma_{w}^{2} & s=t \\ 0 & s \neq t\end{cases} \tag{1.16} \end{align}
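As a sanity check, we can simulate a white noise process and verify these statistics empirically. A minimal sketch, assuming numpy; \sigma_{w}, the series length, and the number of realizations are arbitrary choices:

```python
import numpy as np

# Sketch: simulate many realizations of a white noise process and
# check the autocovariance result empirically.
rng = np.random.default_rng(0)
sigma_w, T, n_real = 1.5, 4, 100_000

# Each row is one realization of {X_1, ..., X_T}.
X = rng.normal(0.0, sigma_w, size=(n_real, T))

# Sample autocovariance matrix across realizations: gamma(s, t).
gamma = np.cov(X, rowvar=False)
print(np.round(gamma, 2))  # ~ sigma_w^2 on the diagonal, ~0 elsewhere
```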
Structure and regularities
Conditional distributions
Let’s start by breaking our joint cumulative distribution function down into a number of conditional distributions:
P(X_{1}, X_{2}, \ldots X_{T})=P(X_{1}) P(X_{2} \mid X_{1}) P(X_{3} \mid X_{1: 2}) \ldots P(X_{T} \mid X_{1: T-1}) \tag{1.18}
Markov assumption
The Markov assumption asserts that only the previous state influences the current state:
P(X_{1}, X_{2}, \ldots X_{T})=P(X_{1}) P(X_{2} \mid X_{1}) P(X_{3} \mid X_{2}) \ldots P(X_{T} \mid X_{T-1}) \tag{1.19}
One simplifying structure we can impose is to assume that only the previous random variable (or the current state) will influence the next random variable. This is often called the Markov assumption:
P(X_{1}, X_{2}, \ldots X_{T})=P(X_{1}) P(X_{2} \mid X_{1}) P(X_{3} \mid X_{2}) \ldots P(X_{T} \mid X_{T-1}) \tag{1.20}
We can generalize this assumption to include the k previous random variables (a k-th order Markov assumption).
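To make the factorization concrete, here is a minimal sketch of sampling from a first-order Markov chain with two states; the initial and transition probabilities are invented for illustration:

```python
import numpy as np

# Sketch of the first-order Markov factorization as a sampler:
# each X_t is drawn using only X_{t-1}.
rng = np.random.default_rng(1)
P_init = np.array([0.5, 0.5])          # P(X_1)
P_trans = np.array([[0.9, 0.1],        # P(X_t | X_{t-1} = 0)
                    [0.2, 0.8]])       # P(X_t | X_{t-1} = 1)

T = 10
x = [rng.choice(2, p=P_init)]
for _ in range(T - 1):
    x.append(rng.choice(2, p=P_trans[x[-1]]))
print(x)
```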
Another common set of assumptions involves the stationarity of the system. The most restrictive is strong (or strict) stationarity.
Stationarity
Strong stationarity
Definition 2 The stochastic process is strongly stationary if \{X_{t}, \ldots X_{t+k}\} and \{X_{t+h}, \ldots X_{t+h+k}\} have the same joint distribution for all t, h, k.
Weak stationarity
Definition 3 The stochastic process is weakly stationary if:
- Mean is time-independent: \mu_{X}(t)= constant
- Auto-correlation depends only on lag: \rho_{X}(t, s)=\rho_{X}(|t-s|)
- Variance is finite
While this is a powerful assumption, we will rarely want to be so restrictive. Instead, we will often only assume weak stationarity.
Definition 3 Consider two time series drawn from our stochastic process, \{X_{t}, \ldots X_{t+k}\} and \{X_{t+h}, \ldots X_{t+h+k}\}. The stochastic process is weakly stationary if:
- The mean value of the process does not depend on time: \mu_{X}(t)= constant
- The auto-correlation function only depends on the absolute difference between the time indices: \rho_{X}(t, s)=\rho_{X}(|t-s|)
- The variance is finite.
Weak stationarity significantly reduces the number of parameters we need to describe our system. So long as the variance is finite, strong stationarity implies weak stationarity. In cases where we assume that the underlying stochastic process is Gaussian, weak stationarity is equivalent to strong stationarity. Throughout this course, if we say a process is ‘stationary’ we are referring to weak stationarity.
Trend stationarity
Definition 4 Trend stationary process has:
- Time-dependent mean: \mu_{X}(t)=f(t)
- Auto-correlation depends only on lag: \rho_{X}(t, s)=\rho_{X}(|t-s|)
- Variance is finite
Definition 4 Consider two time series drawn from our stochastic process, \{X_{t}, \ldots X_{t+k}\} and \{X_{t+h}, \ldots X_{t+h+k}\}. The stochastic process is trend stationary if:
- The mean value of the process has a time dependence: \mu_{X}(t)=f(t)
- The auto-correlation function only depends on the absolute difference between the time indices: \rho_{X}(t, s)=\rho_{X}(|t-s|)
- The variance is finite.
With trend stationary processes, we will often attempt to partition the data into a time-dependent mean and a weakly stationary process.
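A minimal sketch of this partitioning, assuming a linear trend f(t); both the trend coefficients and the noise level below are invented:

```python
import numpy as np

# Sketch: split a trend-stationary series into a deterministic trend
# f(t) and a (weakly) stationary residual.
rng = np.random.default_rng(2)
t = np.arange(200)
x = 0.05 * t + rng.normal(0.0, 1.0, size=t.size)  # trend + white noise

coeffs = np.polyfit(t, x, deg=1)       # fit mu_X(t) = a*t + b
trend = np.polyval(coeffs, t)
residual = x - trend                   # candidate stationary component
print(residual.mean(), residual.std())
```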
Bayesian inference and graphical models
Before we introduce our Bayesian inference and graphical modeling concepts, it’s worth highlighting why these models are still valuable in the deep learning era.
- Simpler, probabilistic models can excel in the small data regime.
- They offer interpretable results, which can serve as a useful check even when deep learning models are appropriate.
The Bayesian framework
In the context of time series, we can break our use of the Bayesian framework into three buckets:
- Representation: How we represent our prior knowledge of the problem using probabilistic language.
- Learning: How we update our knowledge of the structure of the problem given our data (often in the context of model parameters).
- Inference: How we use our model to reason about unobserved quantities, such as future values of the time series.
The second and third buckets can both be considered applications of Bayesian inference.
Graphical models
Factorization example: P(A, B, C, D, E)=P(A) P(B) P(C \mid A, B) P(D \mid C, B) P(E \mid C) \tag{1.21}
One of the best ways to represent our knowledge about the structure of the problem is through the use of graphical models. Imagine that we now have a time series \{A, B, C, D, E\} whose dependencies are captured by the graph in Figure 1. This particular graph is a directed acyclic graph because all the relationships between random variables are directional and there are no cycles in the graph. This graphical representation of the structure of our time series allows us to factorize our joint distribution function. If we take the example in Figure 1 we can write:
P(A, B, C, D, E)=P(A) P(B) P(C \mid A, B) P(D \mid C, B) P(E \mid C) \tag{1.22}
This factorization stems from a series of structural assumptions about the conditional dependencies in the problem. If we think back to the example of binary variables, we can see how the number of parameters required to describe the joint distribution has been significantly reduced. Note that this is true for both the cdf and pdf.
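As a rough illustration (assuming binary variables and the factorization above), we can count the free parameters in each representation:

```python
# Sketch: count free parameters for binary variables under the Figure 1
# factorization versus the unstructured joint. Each factor P(X | parents)
# needs one free parameter per configuration of its parents.
factors = {          # variable -> number of parents in the graph
    "A": 0, "B": 0, "C": 2, "D": 2, "E": 1,
}
factored = sum(2 ** n_parents for n_parents in factors.values())
full_joint = 2 ** len(factors) - 1
print(factored, "vs", full_joint)  # 12 vs 31
```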
d-separation and the Markov boundary
Definition 5 Variables \mathcal{V} d-separate \mathcal{X} from \mathcal{Y} if every path is blocked by \mathcal{V}.
Definition 6 The Markov boundary is the union of parents, children, and parents of children. Conditioning on its Markov boundary makes a variable independent of all others.
With our graphical model in hand, we can introduce two relevant concepts: d-separability and the Markov boundary:
Definition 5 A set of random variables \mathcal{V} is said to d-separate the set of variables \mathcal{X} from \mathcal{Y} if every path between \mathcal{X} and \mathcal{Y} is blocked by \mathcal{V}. A path is blocked by \mathcal{V} if there exists a node W along the path for which one of the two following conditions are met:
- W has converging arrows along the path and neither W nor its descendants are contained in \mathcal{V}.
- W does not have converging arrows and it is contained in \mathcal{V}.
Definition 6 The Markov boundary for X is the union of its parents, its children, and the parents of its children. If a variable is conditioned on its Markov boundary, it becomes independent of all other random variables.
We will not prove the definitions here, but we will highlight their importance with three examples.
Conditioning in three cases
The first example is conditioning on the parent of two nodes. Before we condition on B, the nodes C and D are dependent on each other because they share a parent B. However, when we condition on B we break that dependence. If we know the value of B, then the value of D is no longer informative about the value of C. Written out mathematically:
\begin{align} P(C, D \mid B) &= \frac{p(B, C, D)}{p(B)} \tag{1.23}\\ &= \frac{p(B) p(C \mid B) p(D \mid B)}{p(B)} \tag{1.24}\\ &= p(C \mid B) p(D \mid B) \tag{1.25} \end{align}
Another example is conditioning on the child of one node and the parent of the next. Before conditioning on C, E is dependent on B through C. If the value of B changes, it will affect the value of C, which will affect the value of E. Once we condition on C, that dependency is broken and E becomes independent of B.
The last example is the child of two parents. Before conditioning on C, A and B are independent of each other. However, once we have conditioned on C, B and A become dependent on each other. To help build intuition for this property, imagine that C is positively correlated with A and B. That is, as A or B goes up, so does C. Now imagine I have observed that C is higher than average (i.e. I have conditioned on a high C). Knowing something about A now tells me something about B, and vice versa. For example, if A is low then B must be particularly high so that the resulting C is high. This is a fairly qualitative example, but we will work through much more quantitative examples later in the course.
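A minimal simulation of this “explaining away” effect, with invented distributions for A, B, and C:

```python
import numpy as np

# Sketch: "explaining away" at a collider. A and B are independent,
# and C = A + B + noise. Conditioning on a high C induces a negative
# correlation between A and B.
rng = np.random.default_rng(3)
n = 200_000
A = rng.normal(size=n)
B = rng.normal(size=n)
C = A + B + 0.1 * rng.normal(size=n)

print(np.corrcoef(A, B)[0, 1])                   # ~0: independent a priori
high_C = C > 2.0                                 # condition on a high C
print(np.corrcoef(A[high_C], B[high_C])[0, 1])   # clearly negative
```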
Bayesian inference rules
Sum and product rules
Sum rule: p(x)=\int p(x, y) dy
Product rule: p(x, y)=p(x) p(y \mid x)
In statistical inference, there are two useful properties we often exploit. We will write these rules in terms of the probability density functions p, but we can also express them in terms of the cumulative distribution functions P (see last week’s notes). The first is the sum rule:
p(x)=\int p(x, y) dy \tag{1.26}
where p(x) is called the marginal distribution of x. The process of applying the sum rule to get the marginal distribution is often referred to as marginalizing over the variable y. The second is the product rule:
p(x, y)=p(x) p(y \mid x) \tag{1.27}
These concepts will show up in time series analysis repeatedly. For example, given a distribution p(X_{1: t+1}), we will often want to make predictions of the next time step, t+1, given an observation of the time series up to time index t:
p(X_{t+1} \mid X_{1: t})=\frac{p(X_{1: t+1})}{p(X_{1: t})} \tag{1.28}
In this case, to get the denominator, we will have to take advantage of the sum rule.
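As a small illustration of both rules, consider a made-up joint table over two binary variables:

```python
import numpy as np

# Sketch: the sum and product rules on an invented joint table
# p(x, y) for two binary variables (rows: x, columns: y).
p_xy = np.array([[0.30, 0.10],
                 [0.20, 0.40]])

p_x = p_xy.sum(axis=1)              # sum rule: marginalize over y
p_y_given_x = p_xy / p_x[:, None]   # product rule rearranged: p(y|x) = p(x,y)/p(x)

print(p_x)            # [0.4, 0.6]
print(p_y_given_x)    # each row sums to one
```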
Bayes’ theorem
Bayes’ Theorem: p(A \mid B) = \frac{p(B \mid A) p(A)}{p(B)}
For parameter learning: p(\theta \mid \mathcal{D})=\frac{p(\mathcal{D} \mid \theta) p(\theta)}{p(\mathcal{D})}
p(\theta): prior
p(\mathcal{D} \mid \theta): likelihood
We can use the product rule to prove Bayes’ theorem:
\begin{align} p(A, B) &= p(B \mid A) p(A) \tag{1.29}\\ p(A, B) &= p(A \mid B) p(B) \tag{1.30}\\ p(A \mid B) p(B) &= p(B \mid A) p(A) \tag{1.31}\\ p(A \mid B) &= \frac{p(B \mid A) p(A)}{p(B)} \tag{1.32} \end{align}
In most Bayesian inference, we are interested in learning the parameters of some model, \theta, given some dataset \mathcal{D}. Our model will often provide us with the probability distribution p(\mathcal{D} \mid \theta), and we will use Bayes’ theorem to calculate our desired distribution:
p(\theta \mid \mathcal{D})=\frac{p(\mathcal{D} \mid \theta) p(\theta)}{p(\mathcal{D})} \tag{1.33}
In Bayesian inference, we update our prior beliefs based on the data that we collect. For this reason, p(\theta) is called our prior. It represents our prior beliefs about the parameters of our model. The distribution p(\mathcal{D} \mid \theta) is called the likelihood, and it represents how likely our data is given a specific set of parameters for our model. The distribution p(\theta \mid \mathcal{D}) is called the posterior. It represents our updated beliefs about the value of the parameters in our model. The denominator, p(\mathcal{D}), is the marginal distribution of the data. As you can see, it does not depend on the value of the parameters \theta, so we can think of it as a normalizing factor. It is possible to conduct Bayesian inference even when p(\mathcal{D}) cannot be calculated. When a distribution either has no analytical form or is too complex to calculate, we will refer to it as intractable.
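A minimal worked example of this update, using the conjugate Beta-binomial model for a coin’s bias; the prior pseudo-counts and the observed counts below are invented:

```python
# Sketch: conjugate Bayesian updating for a coin's heads probability
# theta. With a Beta(a, b) prior and a binomial likelihood, the
# posterior is Beta(a + heads, b + tails); the marginal p(D) never
# needs to be computed explicitly.
a, b = 2.0, 2.0          # prior pseudo-counts (prior mean 0.5)
heads, tails = 7, 3      # observed data D

a_post, b_post = a + heads, b + tails
posterior_mean = a_post / (a_post + b_post)
print(posterior_mean)    # 0.642..., pulled from the MLE 0.7 toward the prior 0.5
```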
Multivariate Gaussian review
Basic properties
Multivariate Gaussian: For vector \boldsymbol{X}=[X_{1}, X_{2}, \ldots X_{n}]:
\mathcal{N}(\boldsymbol{X}=\boldsymbol{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})=\frac{1}{(2 \pi)^{d / 2}|\boldsymbol{\Sigma}|^{1 / 2}} \exp \left[-\frac{1}{2}(\boldsymbol{x}-\boldsymbol{\mu})^{T} \boldsymbol{\Sigma}^{-1}(\boldsymbol{x}-\boldsymbol{\mu})\right] \tag{1.34}
Before we introduce our first major time series model, it is important for us to review the basics of multivariate Gaussian distributions. As we will see, they will show up repeatedly in our inference. The multivariate Gaussian distribution is an extension of the Gaussian distribution to multiple dimensions. As with the univariate Gaussian, it is fully described by a vector of means, \boldsymbol{\mu}, and a matrix of (co)variances, \boldsymbol{\Sigma}. Written in the language of last week’s lecture, if we consider the vector \boldsymbol{X}=[X_{1}, X_{2}, \ldots X_{n}] then:
\begin{align} \boldsymbol{\mu} &= [\mu_{X_{1}} \quad \mu_{X_{2}} \quad \cdots \quad \mu_{X_{n}}] \tag{1.35}\\ \boldsymbol{\Sigma} &= \begin{bmatrix} \gamma_{X_{1}, X_{1}} & \gamma_{X_{1}, X_{2}} & \cdots & \gamma_{X_{1}, X_{n}} \\ \vdots & \vdots & \ddots & \vdots \\ \gamma_{X_{n}, X_{1}} & \gamma_{X_{n}, X_{2}} & \cdots & \gamma_{X_{n}, X_{n}} \end{bmatrix} \tag{1.36} \end{align}
I will try to bold variables when they represent a vector / matrix instead of a single value. The probability density function of the multivariate Gaussian distribution is:
\mathcal{N}(\boldsymbol{X}=\boldsymbol{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})=\frac{1}{(2 \pi)^{d / 2}|\boldsymbol{\Sigma}|^{1 / 2}} \exp \left[-\frac{1}{2}(\boldsymbol{x}-\boldsymbol{\mu})^{T} \boldsymbol{\Sigma}^{-1}(\boldsymbol{x}-\boldsymbol{\mu})\right] \tag{1.37}
In Equation 1.37, d represents the number of dimensions, |\boldsymbol{\Sigma}| represents the determinant of the covariance matrix, and \boldsymbol{x} is the value at which we are evaluating the probability density function of our vector of random variables \boldsymbol{X}. When we are working with vectors and matrices, we will often use the shorthand p(\boldsymbol{X}=\boldsymbol{x})=p(\boldsymbol{x}) so that we can keep our vectors lowercased and our matrices uppercased.
Marginal and conditional distributions
For partitioned vector \mathbf{X} = [\mathbf{X}_a, \mathbf{X}_b]:
Marginal: p(\mathbf{x}_a) = \mathcal{N}(\mathbf{x}_a \mid \boldsymbol{\mu}_a, \boldsymbol{\Sigma}_{aa})
Conditional: p(\mathbf{x}_a \mid \mathbf{x}_b) = \mathcal{N}(\mathbf{x}_a \mid \boldsymbol{\mu}_{a|b}, \boldsymbol{\Sigma}_{a|b})
where:
\boldsymbol{\mu}_{a|b} = \boldsymbol{\mu}_a + \boldsymbol{\Sigma}_{ab} \boldsymbol{\Sigma}_{bb}^{-1}(\mathbf{x}_b - \boldsymbol{\mu}_b)
\boldsymbol{\Sigma}_{a|b} = \boldsymbol{\Sigma}_{aa} - \boldsymbol{\Sigma}_{ab} \boldsymbol{\Sigma}_{bb}^{-1} \boldsymbol{\Sigma}_{ba}
We will frequently take advantage of the marginal and conditional distributions of a multivariate Gaussian. In these cases, we will want to divide our larger vector of random variables \boldsymbol{X} into two subsets \boldsymbol{X}_{a} and \boldsymbol{X}_{b}. Note that we have made no assumptions about the sizes of the subsets |a| and |b| beyond that |a|,|b| \geq 1 and |a|+|b|=d, where d is the dimensionality of our original vector of random variables \boldsymbol{X}. We can partition our pdf from Equation 1.37 accordingly:
\mathcal{N}(\boldsymbol{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})=\mathcal{N}\left(\begin{bmatrix} \boldsymbol{x}_{a} \\ \boldsymbol{x}_{b} \end{bmatrix} \left\lvert\,\begin{bmatrix} \boldsymbol{\mu}_{a} \\ \boldsymbol{\mu}_{b} \end{bmatrix}\right.,\begin{bmatrix} \boldsymbol{\Sigma}_{a a} & \boldsymbol{\Sigma}_{a b} \\ \boldsymbol{\Sigma}_{b a} & \boldsymbol{\Sigma}_{b b} \end{bmatrix}\right) \tag{1.38}
Given the partition in Equation 1.38, we can now easily define the marginal and conditional probability density functions. The marginal probability density function is given by:
\begin{align} p(\boldsymbol{x}_{a}) &= \mathcal{N}(\boldsymbol{x}_{a} \mid \boldsymbol{\mu}_{a}, \boldsymbol{\Sigma}_{a a}) \tag{1.39}\\ p(\boldsymbol{x}_{b}) &= \mathcal{N}(\boldsymbol{x}_{b} \mid \boldsymbol{\mu}_{b}, \boldsymbol{\Sigma}_{b b}) \tag{1.40} \end{align}
The conditional probability density function is a bit more complicated:
\begin{align} p(\boldsymbol{x}_{a} \mid \boldsymbol{x}_{b}) &= \mathcal{N}(\boldsymbol{x}_{a} \mid \boldsymbol{\mu}_{a \mid b}, \boldsymbol{\Sigma}_{a \mid b}) \tag{1.41}\\ \boldsymbol{\mu}_{a \mid b} &= \boldsymbol{\mu}_{a}+\boldsymbol{\Sigma}_{a b} \boldsymbol{\Sigma}_{b b}^{-1}(\boldsymbol{x}_{b}-\boldsymbol{\mu}_{b}) \tag{1.42}\\ \boldsymbol{\Sigma}_{a \mid b} &= \boldsymbol{\Sigma}_{a a}-\boldsymbol{\Sigma}_{a b} \boldsymbol{\Sigma}_{b b}^{-1} \boldsymbol{\Sigma}_{b a} \tag{1.43} \end{align}
We can swap the roles of a and b in the equations above to obtain the corresponding expressions for p(\boldsymbol{x}_{b} \mid \boldsymbol{x}_{a}). Note that the conditional covariance matrix does not depend on the value of the conditioned variable \boldsymbol{x}_{b}. In lab we will play around with the conditional and marginal distributions to gain some intuition.
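To make Equations 1.41-1.43 concrete, here is a minimal numpy sketch for a two-dimensional Gaussian in which each block is a scalar; the mean and covariance values are arbitrary:

```python
import numpy as np

# Sketch: the conditional mean and variance of X_a given X_b = x_b
# for a 2D Gaussian, following Equations 1.42 and 1.43.
mu = np.array([0.0, 1.0])                 # [mu_a, mu_b]
Sigma = np.array([[2.0, 1.2],
                  [1.2, 1.5]])            # [[S_aa, S_ab], [S_ba, S_bb]]

x_b = 2.0                                 # observed value of X_b
S_aa, S_ab, S_ba, S_bb = Sigma[0, 0], Sigma[0, 1], Sigma[1, 0], Sigma[1, 1]

mu_a_given_b = mu[0] + S_ab / S_bb * (x_b - mu[1])
Sigma_a_given_b = S_aa - S_ab * S_ba / S_bb
print(mu_a_given_b, Sigma_a_given_b)      # 0.8, 1.04
```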
Linear operations
Gaussians under linear operations:
For \mathbf{Z} = \mathbf{A}\mathbf{X} + \mathbf{y}:
p(\mathbf{z}) = \mathcal{N}(\mathbf{z} \mid \mathbf{A}\boldsymbol{\mu}_X + \mathbf{y}, \mathbf{A}\boldsymbol{\Sigma}_X \mathbf{A}^T) \tag{1.44}
Gaussians are closed under linear operations.
For any vector of random variables \boldsymbol{X}, matrix \boldsymbol{A}, and vector \boldsymbol{y}, the mean and covariance of \boldsymbol{A}\boldsymbol{X}+\boldsymbol{y} are:
\begin{align} \boldsymbol{\mu}_{(\boldsymbol{A}\boldsymbol{X}+\boldsymbol{y})} &= \boldsymbol{A} \boldsymbol{\mu}_{\boldsymbol{X}}+\boldsymbol{y} \tag{1.45}\\ \boldsymbol{\Sigma}_{(\boldsymbol{A}\boldsymbol{X}+\boldsymbol{y})} &= \boldsymbol{A} \boldsymbol{\Sigma}_{\boldsymbol{X}} \boldsymbol{A}^{T} \tag{1.46} \end{align}
Since our multivariate Gaussian is defined entirely by a mean and covariance, for the vector of random variables \mathbf{Z}=\boldsymbol{A} \boldsymbol{X}+\boldsymbol{y} we can write:
p(\mathbf{z})=\mathcal{N}(\mathbf{z} \mid \boldsymbol{A} \mu_{\boldsymbol{X}}+\boldsymbol{y}, \boldsymbol{A} \Sigma_{\boldsymbol{X}} \boldsymbol{A}^{T}) \tag{1.47}
A consequence of this is that Gaussians are closed under linear operations: a linear operation maps a Gaussian to another Gaussian. This is important because we will often start from white noise assumptions and then use linear operations to build more complex models.
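A quick empirical check of this closure property, with arbitrary choices of \boldsymbol{\mu}, \boldsymbol{\Sigma}, \boldsymbol{A}, and \boldsymbol{y}:

```python
import numpy as np

# Sketch: verify that Z = A X + y has mean A mu + y and covariance
# A Sigma A^T by comparing sample statistics to the formulas.
rng = np.random.default_rng(4)
mu = np.array([1.0, -1.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
A = np.array([[2.0, 0.5],
              [0.0, 1.0]])
y = np.array([0.5, 0.5])

X = rng.multivariate_normal(mu, Sigma, size=200_000)
Z = X @ A.T + y

print(Z.mean(axis=0), "vs", A @ mu + y)
print(np.cov(Z, rowvar=False), "vs", A @ Sigma @ A.T)
```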