Advanced Statistical Inference
EURECOM
\[ \require{physics} \definecolor{input}{rgb}{0.42, 0.55, 0.74} \definecolor{params}{rgb}{0.51,0.70,0.40} \definecolor{output}{rgb}{0.843, 0.608, 0} \definecolor{vparams}{rgb}{0.58, 0, 0.83} \definecolor{noise}{rgb}{0.0, 0.48, 0.65} \definecolor{latent}{rgb}{0.8, 0.0, 0.8} \definecolor{function}{rgb}{0.75, 0.75, 0.12} \]
Given training data from an unknown distribution \(p_\text{data}({\textcolor{input}{\boldsymbol{x}}})\), we want to learn a model \(p_\text{model}({\textcolor{input}{\boldsymbol{x}}})\) that approximates the true distribution.

Train from \({\textcolor{input}{\boldsymbol{x}}}\sim p_\text{data}({\textcolor{input}{\boldsymbol{x}}})\).

Generate from \({\textcolor{input}{\boldsymbol{x}}}\sim p_\text{model}({\textcolor{input}{\boldsymbol{x}}})\).
Our objective is to learn the data distribution \(p({\textcolor{input}{\boldsymbol{x}}})\), but we suppose that each data point \({\textcolor{input}{\boldsymbol{x}}}\) is associated with a latent variable \({\textcolor{latent}{\boldsymbol{z}}}\).
For example, image we want to learn the distribution of images of objects, like the one below:

Each image is a huge vector of pixels \({\textcolor{input}{\boldsymbol{x}}}\in \mathbb{R}^{H \times W \times C}\), where \(H\) is the height, \(W\) is the width and \(C\) is the number of channels (in this case, \(64 \times 64 \times 3\)).
Each image can be described by a set of 6 latent variables: floor, wall and object color, shape, orientation and scale.
Latent variables are unobserved variables, we cannot measure them directly, but we can infer them from the observed data.
In math terms, we model our data distribution as a marginal of the joint distribution of the observed data and the latent variables:
\[ p({\textcolor{input}{\boldsymbol{x}}}) = \int p({\textcolor{input}{\boldsymbol{x}}}, {\textcolor{latent}{\boldsymbol{z}}}) \dd{\textcolor{latent}{\boldsymbol{z}}}\quad \text{or} \quad p({\textcolor{input}{\boldsymbol{x}}}) = \sum_{i} p({\textcolor{input}{\boldsymbol{x}}}, {\textcolor{latent}{\boldsymbol{z}}}_i) \]
where \(p({\textcolor{input}{\boldsymbol{x}}}, {\textcolor{latent}{\boldsymbol{z}}})\) is the joint distribution of the observed data and the latent variables.
Gaussian mixture model (GMM)
Conditional distribution of the data given the latent variable: \[ p({\textcolor{input}{\boldsymbol{x}}}\mid {\textcolor{latent}{\boldsymbol{z}}}=k) = {\mathcal{N}}({\textcolor{input}{\boldsymbol{x}}}\mid {\boldsymbol{\mu}}_k, {\boldsymbol{\Sigma}}_k) \]
where \({\boldsymbol{\mu}}_k\) and \({\boldsymbol{\Sigma}}_k\) are the mean and covariance of the \(k\)-th Gaussian component.
Prior distribution of the latent variable:
\[ p({\textcolor{latent}{\boldsymbol{z}}}) = \text{Categorical}({\textcolor{latent}{\boldsymbol{z}}}\mid {\boldsymbol{\pi}}) \]
where \(\pi_k\) are the mixing coefficients, \({\boldsymbol{\mu}}_k\) is the mean of the \(k\)-th Gaussian component, and \({\boldsymbol{\Sigma}}_k\) is the covariance of the \(k\)-th Gaussian component.
Probabilistic PCA (PPCA)
Conditional distribution of the data given the latent variable: \[ p({\textcolor{input}{\boldsymbol{x}}}\mid {\textcolor{latent}{\boldsymbol{z}}}) = {\mathcal{N}}({\textcolor{input}{\boldsymbol{x}}}\mid {\textcolor{params}{\boldsymbol{W}}}{\textcolor{latent}{\boldsymbol{z}}}+ {\textcolor{params}{\boldsymbol{b}}}, \sigma^2 {\boldsymbol{I}}) \]
where \({\textcolor{params}{\boldsymbol{W}}}\) is a linear transformation matrix, \({\textcolor{params}{\boldsymbol{b}}}\) is a bias vector, and \(\sigma^2 {\boldsymbol{I}}\) is the noise covariance.
Prior distribution of the latent variable: \[ p({\textcolor{latent}{\boldsymbol{z}}}) = {\mathcal{N}}({\textcolor{latent}{\boldsymbol{z}}}\mid {\boldsymbol{0}}, {\boldsymbol{I}}) \]
Idea:
Disclaimer:
Variational autoencoders (VAEs) are often misrepresented as a simple autoencoder with a probabilistic twist.
This is wrong (sort of)! It hides the real nature of the encoder and decoder!
Consider a maximum likelihood training objective for a generative model with latent variables:
\[ p({\textcolor{input}{\boldsymbol{x}}}) = \int p({\textcolor{input}{\boldsymbol{x}}}\mid {\textcolor{latent}{\boldsymbol{z}}}) p({\textcolor{latent}{\boldsymbol{z}}}) \dd{\textcolor{latent}{\boldsymbol{z}}} \]
Objective: increase the flexibility of \(p({\textcolor{input}{\boldsymbol{x}}}\mid {\textcolor{latent}{\boldsymbol{z}}})\) to approximate the true data distribution \(p_\text{data}({\textcolor{input}{\boldsymbol{x}}})\).
Solution: define a noisy observation model
\[ p({\textcolor{input}{\boldsymbol{x}}}\mid {\textcolor{latent}{\boldsymbol{z}}}) = {\mathcal{N}}({\textcolor{input}{\boldsymbol{x}}}\mid {\textcolor{function}{\boldsymbol{f}}}({\textcolor{params}{\boldsymbol{\theta}}}, {\textcolor{latent}{\boldsymbol{z}}}), \sigma^2 {\boldsymbol{I}}) \]
where \({\textcolor{function}{\boldsymbol{f}}}({\textcolor{params}{\boldsymbol{\theta}}}, {\textcolor{latent}{\boldsymbol{z}}})\) is a function of the latent variable \({\textcolor{latent}{\boldsymbol{z}}}\) and the model parameters \({\textcolor{params}{\boldsymbol{\theta}}}\).
Note: \(p({\textcolor{input}{\boldsymbol{x}}}\mid {\textcolor{latent}{\boldsymbol{z}}})\) is not a deterministic function, it is a probabilistic model that outputs a distribution over the data given the latent variable. If it were deterministic, then \(p({\textcolor{input}{\boldsymbol{x}}}) = 0\) almost everywhere.
Notation: we will use \(p({\textcolor{input}{\boldsymbol{x}}}; {\textcolor{params}{\boldsymbol{\theta}}})\) and \(p({\textcolor{input}{\boldsymbol{x}}}\mid {\textcolor{latent}{\boldsymbol{z}}}; {\textcolor{params}{\boldsymbol{\theta}}})\) to remember that these two distributions depend on the model parameters \({\textcolor{params}{\boldsymbol{\theta}}}\).
\(p({\textcolor{input}{\boldsymbol{x}}};{\textcolor{params}{\boldsymbol{\theta}}}) = \int p({\textcolor{input}{\boldsymbol{x}}}\mid {\textcolor{latent}{\boldsymbol{z}}};{\textcolor{params}{\boldsymbol{\theta}}}) p({\textcolor{latent}{\boldsymbol{z}}}) \dd{\textcolor{latent}{\boldsymbol{z}}}\) is well-defined
Can we compute it?
Can we maximize it w.r.t. \({\textcolor{params}{\boldsymbol{\theta}}}\)?
Solution: try to maximize a lower bound on the likelihood \(\log p({\textcolor{input}{\boldsymbol{x}}}; {\textcolor{params}{\boldsymbol{\theta}}})\)
\[ \log {\mathbb{E}}[X] \geq {\mathbb{E}}[\log X] \]
We can use Jensen’s inequality to obtain a lower bound on the log-likelihood.
Suppose we have a distribution \(q({\textcolor{latent}{\boldsymbol{z}}})\) (we will see later how to choose it):
\[ \begin{aligned} \log p({\textcolor{input}{\boldsymbol{x}}};{\textcolor{params}{\boldsymbol{\theta}}}) &= \log \int p({\textcolor{input}{\boldsymbol{x}}}\mid {\textcolor{latent}{\boldsymbol{z}}};{\textcolor{params}{\boldsymbol{\theta}}}) p({\textcolor{latent}{\boldsymbol{z}}}) \dd{\textcolor{latent}{\boldsymbol{z}}}\\ &= \log \int q({\textcolor{latent}{\boldsymbol{z}}}) \frac{p({\textcolor{latent}{\boldsymbol{z}}})}{q({\textcolor{latent}{\boldsymbol{z}}})} p({\textcolor{input}{\boldsymbol{x}}}\mid {\textcolor{latent}{\boldsymbol{z}}};{\textcolor{params}{\boldsymbol{\theta}}}) \dd{\textcolor{latent}{\boldsymbol{z}}}\\ &\geq \int q({\textcolor{latent}{\boldsymbol{z}}}) \log \left( \frac{p({\textcolor{latent}{\boldsymbol{z}}})}{q({\textcolor{latent}{\boldsymbol{z}}})} p({\textcolor{input}{\boldsymbol{x}}}\mid {\textcolor{latent}{\boldsymbol{z}}};{\textcolor{params}{\boldsymbol{\theta}}}) \right) \dd{\textcolor{latent}{\boldsymbol{z}}}\qquad \text{(Jensen's inequality)} \\ &= \mathbb{E}_{q({\textcolor{latent}{\boldsymbol{z}}})} \frac{p({\textcolor{latent}{\boldsymbol{z}}})}{q({\textcolor{latent}{\boldsymbol{z}}})} + \mathbb{E}_{q({\textcolor{latent}{\boldsymbol{z}}})} \log p({\textcolor{input}{\boldsymbol{x}}}\mid {\textcolor{latent}{\boldsymbol{z}}};{\textcolor{params}{\boldsymbol{\theta}}}) \end{aligned} \]
Remember the Variational Inference lecture?
We can rewrite the lower bound as: \[ \log p({\textcolor{input}{\boldsymbol{x}}};{\textcolor{params}{\boldsymbol{\theta}}}) \geq \mathbb{E}_{q({\textcolor{latent}{\boldsymbol{z}}})} \log p({\textcolor{input}{\boldsymbol{x}}}\mid {\textcolor{latent}{\boldsymbol{z}}};{\textcolor{params}{\boldsymbol{\theta}}}) - \mathrm{KL}\left( q({\textcolor{latent}{\boldsymbol{z}}}) \| p({\textcolor{latent}{\boldsymbol{z}}}) \right) \]
Meaning that we want to minimize the expected reconstruction error between the data \({\textcolor{input}{\boldsymbol{x}}}\) and the model output \({\textcolor{function}{\boldsymbol{f}}}({\textcolor{params}{\boldsymbol{\theta}}}, {\textcolor{latent}{\boldsymbol{z}}})\), when \({\textcolor{latent}{\boldsymbol{z}}}\) is sampled from \(q({\textcolor{latent}{\boldsymbol{z}}})\).
\[ \log p({\textcolor{input}{\boldsymbol{x}}};{\textcolor{params}{\boldsymbol{\theta}}}) \geq \mathcal{L}_\text{ELBO}({\textcolor{params}{\boldsymbol{\theta}}}) = \mathbb{E}_{q({\textcolor{latent}{\boldsymbol{z}}})} \log p({\textcolor{input}{\boldsymbol{x}}}\mid {\textcolor{latent}{\boldsymbol{z}}};{\textcolor{params}{\boldsymbol{\theta}}}) - \mathrm{KL}\left( q({\textcolor{latent}{\boldsymbol{z}}}) \| p({\textcolor{latent}{\boldsymbol{z}}}) \right) \]
We would like to maximize the lower bound on the log-likelihood, but we want the bound to be tight
Remember that the KL divergence is always non-negative, so we can rewrite the objective as:
\[ \log p({\textcolor{input}{\boldsymbol{x}}};{\textcolor{params}{\boldsymbol{\theta}}}) - \mathcal{L}_\text{ELBO}({\textcolor{params}{\boldsymbol{\theta}}}) = \mathrm{KL}\left( q({\textcolor{latent}{\boldsymbol{z}}}) \| p({\textcolor{latent}{\boldsymbol{z}}}\mid {\textcolor{input}{\boldsymbol{x}}}) \right) \]
This means that the closer the variational distribution \(q({\textcolor{latent}{\boldsymbol{z}}})\) is to the true posterior distribution \(p({\textcolor{latent}{\boldsymbol{z}}}\mid {\textcolor{input}{\boldsymbol{x}}})\), the tighter the bound will be.
To fit \(q\), we need to define a variational distribution \(q({\textcolor{latent}{\boldsymbol{z}}})\) that approximates the true posterior distribution \(p({\textcolor{latent}{\boldsymbol{z}}}\mid {\textcolor{input}{\boldsymbol{x}}})\).
The variational distribution is often chosen to be Gaussian, i.e., \(q({\textcolor{latent}{\boldsymbol{z}}}; {\textcolor{vparams}{\boldsymbol{\nu}}}) = {\mathcal{N}}({\textcolor{latent}{\boldsymbol{z}}}\mid {\boldsymbol{\mu}}, {\boldsymbol{\sigma}}^2{\boldsymbol{I}})\), where \({\textcolor{vparams}{\boldsymbol{\nu}}}= \{ {\boldsymbol{\mu}}, {\boldsymbol{\sigma}}^2\}\) are the variational parameters.
Gaussian is good because it is easy to sample from, it has a closed-form KL divergence, and it is possible to use the reparameterization trick to compute gradients of the ELBO w.r.t. the variational parameters \({\textcolor{vparams}{\boldsymbol{\nu}}}\).
Important
The reparameterization trick allows us to sample from the variational distribution \(q({\textcolor{latent}{\boldsymbol{z}}}; {\textcolor{vparams}{\boldsymbol{\nu}}})\) in a way that is differentiable w.r.t. the variational parameters \({\textcolor{vparams}{\boldsymbol{\nu}}}\).
\[ {\textcolor{latent}{\boldsymbol{z}}}= {\boldsymbol{\mu}}+ {\boldsymbol{\sigma}}\odot {\textcolor{noise}{\boldsymbol{\varepsilon}}}, \quad {\textcolor{noise}{\boldsymbol{\varepsilon}}}\sim {\mathcal{N}}({\boldsymbol{0}}, {\boldsymbol{I}}) \]
At the end, we are left with the following optimization problem:
\[ \max_{{\textcolor{params}{\boldsymbol{\theta}}}, {\textcolor{vparams}{\boldsymbol{\nu}}}} \mathcal{L}_\text{ELBO}({\textcolor{params}{\boldsymbol{\theta}}}, {\textcolor{vparams}{\boldsymbol{\nu}}}) = \mathbb{E}_{q({\textcolor{latent}{\boldsymbol{z}}}; {\textcolor{vparams}{\boldsymbol{\nu}}})} \log p({\textcolor{input}{\boldsymbol{x}}}\mid {\textcolor{latent}{\boldsymbol{z}}};{\textcolor{params}{\boldsymbol{\theta}}}) - \mathrm{KL}\left( q({\textcolor{latent}{\boldsymbol{z}}}; {\textcolor{vparams}{\boldsymbol{\nu}}}) \| p({\textcolor{latent}{\boldsymbol{z}}}) \right) \]
Note:
The ELBO is a function of both the likelihood parameters \({\textcolor{params}{\boldsymbol{\theta}}}\) and the variational parameters \({\textcolor{vparams}{\boldsymbol{\nu}}}\).
The likelihood are shared across all data points, while the variational parameters are per-sample.
In classic VI (e.g., Bayesian logistic regression):
In our case, for variational autoencoders:
We optimize both:
But these two are coupled:
This may lead to unstable or inefficient training, especially when \({\textcolor{function}{\boldsymbol{f}}}(\cdot)\) is complex (e.g., deep neural networks).
The joint optimization of \({\textcolor{params}{\boldsymbol{\theta}}}\) and \({\textcolor{vparams}{\boldsymbol{\nu}}}\) in VAEs is similar to the Expectation-Maximization (EM) algorithm, but with some key differences:
In the previous approach we learn a separate set of variational parameters \({\textcolor{vparams}{\boldsymbol{\nu}}}\) for each data point \({\textcolor{input}{\boldsymbol{x}}}\).
This requires an expensive iterative optimization for each data point, which will take a long time to process large datasets.
Amortization: instead of learning a separate set of variational parameters for each data point, we learn a function \({\boldsymbol{g}}({\boldsymbol{\phi}}, {\textcolor{input}{\boldsymbol{x}}})\) that maps the data point \({\textcolor{input}{\boldsymbol{x}}}\) to the variational parameters \({\textcolor{vparams}{\boldsymbol{\nu}}}\).
This function is often implemented as a neural network, which allows us to learn a complex mapping from the data to the variational parameters.
Idea: “amortize” the cost of computing the variational parameters \({\textcolor{vparams}{\boldsymbol{\nu}}}\) by learning a function that predicts \(\{ {\boldsymbol{\mu}}, {\boldsymbol{\sigma}}^2 \}\) as a function of the data \({\textcolor{input}{\boldsymbol{x}}}\).
The output of this function is generally \(\textcolor{vparams}{{\boldsymbol{\mu}}}\) and \(\log\textcolor{vparams}{{\boldsymbol{\sigma}}}\) (the \(\log\) is needed to ensure that the variance is positive).
The reparametrization trick is still used to sample from the variational distribution, the only difference is that the variational parameters are not free variables anymore, but are predicted by a neural network \({\boldsymbol{g}}({\boldsymbol{\phi}}, {\textcolor{input}{\boldsymbol{x}}})\).
Sometimes, this is noted as \(q({\textcolor{latent}{\boldsymbol{z}}}\mid {\textcolor{input}{\boldsymbol{x}}})\) to emphasize that the variational distribution depends on the data \({\textcolor{input}{\boldsymbol{x}}}\), even though it’s not actually a conditional distribution.

We have a neural network \(g({\textcolor{input}{\boldsymbol{x}}}; {\boldsymbol{\phi}})\) that predicts the variational parameters \(\{ \textcolor{vparams}{{\boldsymbol{\mu}}}, \textcolor{vparams}{{\boldsymbol{\sigma}}}^2 \}\) from the data \({\textcolor{input}{\boldsymbol{x}}}\).
We can sample from the variational distribution \(q({\textcolor{latent}{\boldsymbol{z}}}\mid {\textcolor{input}{\boldsymbol{x}}}; {\boldsymbol{\phi}})\) using the reparameterization trick
We compute the likelihood \(\log p({\textcolor{input}{\boldsymbol{x}}}\mid {\textcolor{latent}{\boldsymbol{z}}}; {\textcolor{params}{\boldsymbol{\theta}}})\) using using a neural network \({\textcolor{function}{\boldsymbol{f}}}({\textcolor{latent}{\boldsymbol{z}}}; {\textcolor{params}{\boldsymbol{\theta}}})\) that maps the latent variable \({\textcolor{latent}{\boldsymbol{z}}}\) to the data space.
We can now optimize the ELBO with respect to both the likelihood parameters \({\textcolor{params}{\boldsymbol{\theta}}}\) and the amortization parameters \({\boldsymbol{\phi}}\): \[ \mathcal{L}_\text{ELBO}({\textcolor{params}{\boldsymbol{\theta}}}, {\boldsymbol{\phi}}) = \mathbb{E}_{q({\textcolor{latent}{\boldsymbol{z}}}\mid {\textcolor{input}{\boldsymbol{x}}}; {\boldsymbol{\phi}})} \log p({\textcolor{input}{\boldsymbol{x}}}\mid {\textcolor{latent}{\boldsymbol{z}}}; {\textcolor{params}{\boldsymbol{\theta}}}) - \mathrm{KL}\left( q({\textcolor{latent}{\boldsymbol{z}}}\mid {\textcolor{input}{\boldsymbol{x}}}; {\boldsymbol{\phi}}) \| p({\textcolor{latent}{\boldsymbol{z}}}) \right) \]
Follow the generation process:
Note
When using amortized inference, the encoder network is not used during generation, since we are not conditioning on any observed data
It does not make sense to compute \(q({\textcolor{latent}{\boldsymbol{z}}}\mid {\textcolor{input}{\boldsymbol{x}}})\) when we don’t have any \({\textcolor{input}{\boldsymbol{x}}}\)
Sometimes we want to generate samples conditioned on some additional information \(p({\textcolor{input}{\boldsymbol{x}}}\mid {\textcolor{output}{\boldsymbol{y}}})\), e.g., we want to generate images of a specific class.
Since the latent code \({\textcolor{latent}{\boldsymbol{z}}}\) no longer has to model the image category, it can focus on modeling the stylistic features.
If we’re lucky, this lets us disentangle style and content. (Note: disentanglement is still a dark art.)
Conditional VAE can be achieve by adding an additional input to the encoder and decoder networks, which is the additional information \({\textcolor{output}{\boldsymbol{y}}}\):

You can often get interesting results by interpolating between two vectors in the latent space:

You can also perform “arithmetic” in the latent space, e.g. by adding or subtracting vectors:

By varying two latent dimensions (i.e. dimensions of \({\textcolor{latent}{\boldsymbol{z}}}\)) while holding \({\textcolor{output}{\boldsymbol{y}}}\) fixed, we can visualize the latent space.
By varying the label \({\textcolor{output}{\boldsymbol{y}}}\) while holding \({\textcolor{latent}{\boldsymbol{z}}}\) fixed, we can solve image analogies.
The samples may not be very good:
VAEs suffer from mode collapse, where the amortization network learns to predict the prior. This is due to the strong effect of the KL divergence term in the ELBO.
The spherical prior \(p({\textcolor{latent}{\boldsymbol{z}}})\) is too restrictive, it can lead to a “blurry” output distribution (or very noisy samples, if we sample from the likelihood).