Deep Generative Models

Advanced Statistical Inference

Simone Rossi

EURECOM

\[ \require{physics} \definecolor{input}{rgb}{0.42, 0.55, 0.74} \definecolor{params}{rgb}{0.51,0.70,0.40} \definecolor{output}{rgb}{0.843, 0.608, 0} \definecolor{vparams}{rgb}{0.58, 0, 0.83} \definecolor{noise}{rgb}{0.0, 0.48, 0.65} \definecolor{latent}{rgb}{0.8, 0.0, 0.8} \definecolor{function}{rgb}{0.75, 0.75, 0.12} \]

Generative models

Given training data from an unknown distribution \(p_\text{data}({\textcolor{input}{\boldsymbol{x}}})\), we want to learn a model \(p_\text{model}({\textcolor{input}{\boldsymbol{x}}})\) that approximates the true distribution.

Train from \({\textcolor{input}{\boldsymbol{x}}}\sim p_\text{data}({\textcolor{input}{\boldsymbol{x}}})\).

Generate from \({\textcolor{input}{\boldsymbol{x}}}\sim p_\text{model}({\textcolor{input}{\boldsymbol{x}}})\).

Latent variable models

Our objective is to learn the data distribution \(p({\textcolor{input}{\boldsymbol{x}}})\), but we suppose that each data point \({\textcolor{input}{\boldsymbol{x}}}\) is associated with a latent variable \({\textcolor{latent}{\boldsymbol{z}}}\).

For example, image we want to learn the distribution of images of objects, like the one below:

Each image is a huge vector of pixels \({\textcolor{input}{\boldsymbol{x}}}\in \mathbb{R}^{H \times W \times C}\), where \(H\) is the height, \(W\) is the width and \(C\) is the number of channels (in this case, \(64 \times 64 \times 3\)).

Each image can be described by a set of 6 latent variables: floor, wall and object color, shape, orientation and scale.

Latent variable models

Latent variables are unobserved variables, we cannot measure them directly, but we can infer them from the observed data.

In math terms, we model our data distribution as a marginal of the joint distribution of the observed data and the latent variables:

\[ p({\textcolor{input}{\boldsymbol{x}}}) = \int p({\textcolor{input}{\boldsymbol{x}}}, {\textcolor{latent}{\boldsymbol{z}}}) \dd{\textcolor{latent}{\boldsymbol{z}}}\quad \text{or} \quad p({\textcolor{input}{\boldsymbol{x}}}) = \sum_{i} p({\textcolor{input}{\boldsymbol{x}}}, {\textcolor{latent}{\boldsymbol{z}}}_i) \]

where \(p({\textcolor{input}{\boldsymbol{x}}}, {\textcolor{latent}{\boldsymbol{z}}})\) is the joint distribution of the observed data and the latent variables.

Linear latent variable models

Gaussian mixture model (GMM)

Conditional distribution of the data given the latent variable: \[ p({\textcolor{input}{\boldsymbol{x}}}\mid {\textcolor{latent}{\boldsymbol{z}}}=k) = {\mathcal{N}}({\textcolor{input}{\boldsymbol{x}}}\mid {\boldsymbol{\mu}}_k, {\boldsymbol{\Sigma}}_k) \]

where \({\boldsymbol{\mu}}_k\) and \({\boldsymbol{\Sigma}}_k\) are the mean and covariance of the \(k\)-th Gaussian component.

Prior distribution of the latent variable:

\[ p({\textcolor{latent}{\boldsymbol{z}}}) = \text{Categorical}({\textcolor{latent}{\boldsymbol{z}}}\mid {\boldsymbol{\pi}}) \]

where \(\pi_k\) are the mixing coefficients, \({\boldsymbol{\mu}}_k\) is the mean of the \(k\)-th Gaussian component, and \({\boldsymbol{\Sigma}}_k\) is the covariance of the \(k\)-th Gaussian component.

Probabilistic PCA (PPCA)

Conditional distribution of the data given the latent variable: \[ p({\textcolor{input}{\boldsymbol{x}}}\mid {\textcolor{latent}{\boldsymbol{z}}}) = {\mathcal{N}}({\textcolor{input}{\boldsymbol{x}}}\mid {\textcolor{params}{\boldsymbol{W}}}{\textcolor{latent}{\boldsymbol{z}}}+ {\textcolor{params}{\boldsymbol{b}}}, \sigma^2 {\boldsymbol{I}}) \]

where \({\textcolor{params}{\boldsymbol{W}}}\) is a linear transformation matrix, \({\textcolor{params}{\boldsymbol{b}}}\) is a bias vector, and \(\sigma^2 {\boldsymbol{I}}\) is the noise covariance.

Prior distribution of the latent variable: \[ p({\textcolor{latent}{\boldsymbol{z}}}) = {\mathcal{N}}({\textcolor{latent}{\boldsymbol{z}}}\mid {\boldsymbol{0}}, {\boldsymbol{I}}) \]

Variational autoencoders (VAEs)

Idea:

We want to learn a generative model \(p_\text{model}({\textcolor{input}{\boldsymbol{x}}})\) that approximates the true data distribution \(p_\text{data}({\textcolor{input}{\boldsymbol{x}}})\).
We introduce a latent variable \({\textcolor{latent}{\boldsymbol{z}}}\) to explain the observed data \({\textcolor{input}{\boldsymbol{x}}}\).
We model the relationship between the data and the latent variable with a more complex function.
We use a variational inference approach to learn the model parameters.

Variational autoencoders (VAEs)

Disclaimer:

Variational autoencoders (VAEs) are often misrepresented as a simple autoencoder with a probabilistic twist.
This is wrong (sort of)! It hides the real nature of the encoder and decoder!

Neural networks as flexible likelihood models

Consider a maximum likelihood training objective for a generative model with latent variables:

\[ p({\textcolor{input}{\boldsymbol{x}}}) = \int p({\textcolor{input}{\boldsymbol{x}}}\mid {\textcolor{latent}{\boldsymbol{z}}}) p({\textcolor{latent}{\boldsymbol{z}}}) \dd{\textcolor{latent}{\boldsymbol{z}}} \]

Objective: increase the flexibility of \(p({\textcolor{input}{\boldsymbol{x}}}\mid {\textcolor{latent}{\boldsymbol{z}}})\) to approximate the true data distribution \(p_\text{data}({\textcolor{input}{\boldsymbol{x}}})\).

Solution: define a noisy observation model

\[ p({\textcolor{input}{\boldsymbol{x}}}\mid {\textcolor{latent}{\boldsymbol{z}}}) = {\mathcal{N}}({\textcolor{input}{\boldsymbol{x}}}\mid {\textcolor{function}{\boldsymbol{f}}}({\textcolor{params}{\boldsymbol{\theta}}}, {\textcolor{latent}{\boldsymbol{z}}}), \sigma^2 {\boldsymbol{I}}) \]

where \({\textcolor{function}{\boldsymbol{f}}}({\textcolor{params}{\boldsymbol{\theta}}}, {\textcolor{latent}{\boldsymbol{z}}})\) is a function of the latent variable \({\textcolor{latent}{\boldsymbol{z}}}\) and the model parameters \({\textcolor{params}{\boldsymbol{\theta}}}\).

Note: \(p({\textcolor{input}{\boldsymbol{x}}}\mid {\textcolor{latent}{\boldsymbol{z}}})\) is not a deterministic function, it is a probabilistic model that outputs a distribution over the data given the latent variable. If it were deterministic, then \(p({\textcolor{input}{\boldsymbol{x}}}) = 0\) almost everywhere.

Notation: we will use \(p({\textcolor{input}{\boldsymbol{x}}}; {\textcolor{params}{\boldsymbol{\theta}}})\) and \(p({\textcolor{input}{\boldsymbol{x}}}\mid {\textcolor{latent}{\boldsymbol{z}}}; {\textcolor{params}{\boldsymbol{\theta}}})\) to remember that these two distributions depend on the model parameters \({\textcolor{params}{\boldsymbol{\theta}}}\).

Neural networks as flexible likelihood models

\(p({\textcolor{input}{\boldsymbol{x}}};{\textcolor{params}{\boldsymbol{\theta}}}) = \int p({\textcolor{input}{\boldsymbol{x}}}\mid {\textcolor{latent}{\boldsymbol{z}}};{\textcolor{params}{\boldsymbol{\theta}}}) p({\textcolor{latent}{\boldsymbol{z}}}) \dd{\textcolor{latent}{\boldsymbol{z}}}\) is well-defined

Can we compute it?
- No, \({\textcolor{function}{\boldsymbol{f}}}({\textcolor{params}{\boldsymbol{\theta}}}, {\textcolor{latent}{\boldsymbol{z}}})\) is very complex, so there is no hope of finding a closed-form solution.

Can we maximize it w.r.t. \({\textcolor{params}{\boldsymbol{\theta}}}\)?
- No, we cannot compute the integral, so we cannot maximize the likelihood directly.

Solution: try to maximize a lower bound on the likelihood \(\log p({\textcolor{input}{\boldsymbol{x}}}; {\textcolor{params}{\boldsymbol{\theta}}})\)

\[ \log {\mathbb{E}}[X] \geq {\mathbb{E}}[\log X] \]

Variational inference to the rescue

We can use Jensen’s inequality to obtain a lower bound on the log-likelihood.

Suppose we have a distribution \(q({\textcolor{latent}{\boldsymbol{z}}})\) (we will see later how to choose it):

\[ \begin{aligned} \log p({\textcolor{input}{\boldsymbol{x}}};{\textcolor{params}{\boldsymbol{\theta}}}) &= \log \int p({\textcolor{input}{\boldsymbol{x}}}\mid {\textcolor{latent}{\boldsymbol{z}}};{\textcolor{params}{\boldsymbol{\theta}}}) p({\textcolor{latent}{\boldsymbol{z}}}) \dd{\textcolor{latent}{\boldsymbol{z}}}\\ &= \log \int q({\textcolor{latent}{\boldsymbol{z}}}) \frac{p({\textcolor{latent}{\boldsymbol{z}}})}{q({\textcolor{latent}{\boldsymbol{z}}})} p({\textcolor{input}{\boldsymbol{x}}}\mid {\textcolor{latent}{\boldsymbol{z}}};{\textcolor{params}{\boldsymbol{\theta}}}) \dd{\textcolor{latent}{\boldsymbol{z}}}\\ &\geq \int q({\textcolor{latent}{\boldsymbol{z}}}) \log \left( \frac{p({\textcolor{latent}{\boldsymbol{z}}})}{q({\textcolor{latent}{\boldsymbol{z}}})} p({\textcolor{input}{\boldsymbol{x}}}\mid {\textcolor{latent}{\boldsymbol{z}}};{\textcolor{params}{\boldsymbol{\theta}}}) \right) \dd{\textcolor{latent}{\boldsymbol{z}}}\qquad \text{(Jensen's inequality)} \\ &= \mathbb{E}_{q({\textcolor{latent}{\boldsymbol{z}}})} \frac{p({\textcolor{latent}{\boldsymbol{z}}})}{q({\textcolor{latent}{\boldsymbol{z}}})} + \mathbb{E}_{q({\textcolor{latent}{\boldsymbol{z}}})} \log p({\textcolor{input}{\boldsymbol{x}}}\mid {\textcolor{latent}{\boldsymbol{z}}};{\textcolor{params}{\boldsymbol{\theta}}}) \end{aligned} \]

Variational inference to the rescue

Remember the Variational Inference lecture?

We can rewrite the lower bound as: \[ \log p({\textcolor{input}{\boldsymbol{x}}};{\textcolor{params}{\boldsymbol{\theta}}}) \geq \mathbb{E}_{q({\textcolor{latent}{\boldsymbol{z}}})} \log p({\textcolor{input}{\boldsymbol{x}}}\mid {\textcolor{latent}{\boldsymbol{z}}};{\textcolor{params}{\boldsymbol{\theta}}}) - \mathrm{KL}\left( q({\textcolor{latent}{\boldsymbol{z}}}) \| p({\textcolor{latent}{\boldsymbol{z}}}) \right) \]

The first term is the expected log-likelihood of the data given the latent variable. Since we assume a Gaussian model: \[ \begin{aligned} \log p({\textcolor{input}{\boldsymbol{x}}}\mid {\textcolor{latent}{\boldsymbol{z}}};{\textcolor{params}{\boldsymbol{\theta}}}) &= \log {\mathcal{N}}({\textcolor{input}{\boldsymbol{x}}}\mid {\textcolor{function}{\boldsymbol{f}}}({\textcolor{params}{\boldsymbol{\theta}}}, {\textcolor{latent}{\boldsymbol{z}}}), \sigma^2 {\boldsymbol{I}}) \\ &= \log \left( \frac{1}{(2\pi\sigma^2)^{D/2}} \exp\left(-\frac{1}{2\sigma^2} \|{\textcolor{input}{\boldsymbol{x}}}- {\textcolor{function}{\boldsymbol{f}}}({\textcolor{params}{\boldsymbol{\theta}}}, {\textcolor{latent}{\boldsymbol{z}}})\|^2\right) \right) \\ &= -\frac{1}{2\sigma^2} \|{\textcolor{input}{\boldsymbol{x}}}- {\textcolor{function}{\boldsymbol{f}}}({\textcolor{params}{\boldsymbol{\theta}}}, {\textcolor{latent}{\boldsymbol{z}}})\|^2 + \text{const} \end{aligned} \]

Meaning that we want to minimize the expected reconstruction error between the data \({\textcolor{input}{\boldsymbol{x}}}\) and the model output \({\textcolor{function}{\boldsymbol{f}}}({\textcolor{params}{\boldsymbol{\theta}}}, {\textcolor{latent}{\boldsymbol{z}}})\), when \({\textcolor{latent}{\boldsymbol{z}}}\) is sampled from \(q({\textcolor{latent}{\boldsymbol{z}}})\).

Variational inference to the rescue

The second term is the KL divergence between the variational distribution \(q({\textcolor{latent}{\boldsymbol{z}}})\) and the prior distribution \(p({\textcolor{latent}{\boldsymbol{z}}})\): \[ \mathrm{KL}\left( q({\textcolor{latent}{\boldsymbol{z}}}) \| p({\textcolor{latent}{\boldsymbol{z}}}) \right) = \int q({\textcolor{latent}{\boldsymbol{z}}}) \log \frac{q({\textcolor{latent}{\boldsymbol{z}}})}{p({\textcolor{latent}{\boldsymbol{z}}})} \dd{\textcolor{latent}{\boldsymbol{z}}} \] This term acts as a regularizer, encouraging the variational distribution \(q({\textcolor{latent}{\boldsymbol{z}}})\) to be close to the prior distribution \(p({\textcolor{latent}{\boldsymbol{z}}})\).

Optimization objective

\[ \log p({\textcolor{input}{\boldsymbol{x}}};{\textcolor{params}{\boldsymbol{\theta}}}) \geq \mathcal{L}_\text{ELBO}({\textcolor{params}{\boldsymbol{\theta}}}) = \mathbb{E}_{q({\textcolor{latent}{\boldsymbol{z}}})} \log p({\textcolor{input}{\boldsymbol{x}}}\mid {\textcolor{latent}{\boldsymbol{z}}};{\textcolor{params}{\boldsymbol{\theta}}}) - \mathrm{KL}\left( q({\textcolor{latent}{\boldsymbol{z}}}) \| p({\textcolor{latent}{\boldsymbol{z}}}) \right) \]

We would like to maximize the lower bound on the log-likelihood, but we want the bound to be tight

Remember that the KL divergence is always non-negative, so we can rewrite the objective as:

\[ \log p({\textcolor{input}{\boldsymbol{x}}};{\textcolor{params}{\boldsymbol{\theta}}}) - \mathcal{L}_\text{ELBO}({\textcolor{params}{\boldsymbol{\theta}}}) = \mathrm{KL}\left( q({\textcolor{latent}{\boldsymbol{z}}}) \| p({\textcolor{latent}{\boldsymbol{z}}}\mid {\textcolor{input}{\boldsymbol{x}}}) \right) \]

This means that the closer the variational distribution \(q({\textcolor{latent}{\boldsymbol{z}}})\) is to the true posterior distribution \(p({\textcolor{latent}{\boldsymbol{z}}}\mid {\textcolor{input}{\boldsymbol{x}}})\), the tighter the bound will be.

Approximate posterior inference

To fit \(q\), we need to define a variational distribution \(q({\textcolor{latent}{\boldsymbol{z}}})\) that approximates the true posterior distribution \(p({\textcolor{latent}{\boldsymbol{z}}}\mid {\textcolor{input}{\boldsymbol{x}}})\).
The variational distribution is often chosen to be Gaussian, i.e., \(q({\textcolor{latent}{\boldsymbol{z}}}; {\textcolor{vparams}{\boldsymbol{\nu}}}) = {\mathcal{N}}({\textcolor{latent}{\boldsymbol{z}}}\mid {\boldsymbol{\mu}}, {\boldsymbol{\sigma}}^2{\boldsymbol{I}})\), where \({\textcolor{vparams}{\boldsymbol{\nu}}}= \{ {\boldsymbol{\mu}}, {\boldsymbol{\sigma}}^2\}\) are the variational parameters.
Gaussian is good because it is easy to sample from, it has a closed-form KL divergence, and it is possible to use the reparameterization trick to compute gradients of the ELBO w.r.t. the variational parameters \({\textcolor{vparams}{\boldsymbol{\nu}}}\).

Important

The reparameterization trick allows us to sample from the variational distribution \(q({\textcolor{latent}{\boldsymbol{z}}}; {\textcolor{vparams}{\boldsymbol{\nu}}})\) in a way that is differentiable w.r.t. the variational parameters \({\textcolor{vparams}{\boldsymbol{\nu}}}\).

\[ {\textcolor{latent}{\boldsymbol{z}}}= {\boldsymbol{\mu}}+ {\boldsymbol{\sigma}}\odot {\textcolor{noise}{\boldsymbol{\varepsilon}}}, \quad {\textcolor{noise}{\boldsymbol{\varepsilon}}}\sim {\mathcal{N}}({\boldsymbol{0}}, {\boldsymbol{I}}) \]

Training a VAE

We need to maximize the likelihood \(\log p({\textcolor{input}{\boldsymbol{x}}};{\textcolor{params}{\boldsymbol{\theta}}})\). But we cannot do it directly.
We derive a lower bound on the likelihood, the evidence lower bound (ELBO) \(\mathcal{L}_\text{ELBO}({\textcolor{params}{\boldsymbol{\theta}}})\), which we can maximize instead. But we need to define a variational distribution \(q({\textcolor{latent}{\boldsymbol{z}}})\) that approximates the true posterior distribution \(p({\textcolor{latent}{\boldsymbol{z}}}\mid {\textcolor{input}{\boldsymbol{x}}})\).
We parametrize the variational distribution \(q({\textcolor{latent}{\boldsymbol{z}}}; {\textcolor{vparams}{\boldsymbol{\nu}}})\) with a Gaussian distribution, where \({\textcolor{vparams}{\boldsymbol{\nu}}}= \{ {\boldsymbol{\mu}}, {\boldsymbol{\Sigma}}\}\) are the variational parameters.

At the end, we are left with the following optimization problem:

\[ \max_{{\textcolor{params}{\boldsymbol{\theta}}}, {\textcolor{vparams}{\boldsymbol{\nu}}}} \mathcal{L}_\text{ELBO}({\textcolor{params}{\boldsymbol{\theta}}}, {\textcolor{vparams}{\boldsymbol{\nu}}}) = \mathbb{E}_{q({\textcolor{latent}{\boldsymbol{z}}}; {\textcolor{vparams}{\boldsymbol{\nu}}})} \log p({\textcolor{input}{\boldsymbol{x}}}\mid {\textcolor{latent}{\boldsymbol{z}}};{\textcolor{params}{\boldsymbol{\theta}}}) - \mathrm{KL}\left( q({\textcolor{latent}{\boldsymbol{z}}}; {\textcolor{vparams}{\boldsymbol{\nu}}}) \| p({\textcolor{latent}{\boldsymbol{z}}}) \right) \]

Note:

The ELBO is a function of both the likelihood parameters \({\textcolor{params}{\boldsymbol{\theta}}}\) and the variational parameters \({\textcolor{vparams}{\boldsymbol{\nu}}}\).
The likelihood are shared across all data points, while the variational parameters are per-sample.

Training a VAE via joint optimization is hard

In classic VI (e.g., Bayesian logistic regression):

ELBO depends only on \(q({\textcolor{params}{\boldsymbol{\theta}}})\)
Optimization is stable — just improve the approximation to the true posterior

In our case, for variational autoencoders:

We optimize both:
- \({\textcolor{params}{\boldsymbol{\theta}}}\): parameters of the generative model
- \({\textcolor{vparams}{\boldsymbol{\nu}}}\): variational parameters
But these two are coupled:
- ELBO depends on both sets of parameters
- Changing \({\textcolor{params}{\boldsymbol{\theta}}}\) shifts the optimal \({\textcolor{vparams}{\boldsymbol{\nu}}}\), changing \({\textcolor{vparams}{\boldsymbol{\nu}}}\) affects how we learn \({\textcolor{params}{\boldsymbol{\theta}}}\).
- No close form solution for \({\textcolor{vparams}{\boldsymbol{\nu}}}\) given \({\textcolor{params}{\boldsymbol{\theta}}}\) exists (no EM-like solution).

This may lead to unstable or inefficient training, especially when \({\textcolor{function}{\boldsymbol{f}}}(\cdot)\) is complex (e.g., deep neural networks).

Expectation-Maximization vs Variational Inference

The joint optimization of \({\textcolor{params}{\boldsymbol{\theta}}}\) and \({\textcolor{vparams}{\boldsymbol{\nu}}}\) in VAEs is similar to the Expectation-Maximization (EM) algorithm, but with some key differences:

In EM, the E-step computes the exact posterior distribution \(p({\textcolor{latent}{\boldsymbol{z}}}\mid {\textcolor{input}{\boldsymbol{x}}}; {\textcolor{params}{\boldsymbol{\theta}}})\), while in VAEs, we approximate it with a variational distribution \(q({\textcolor{latent}{\boldsymbol{z}}}; {\textcolor{vparams}{\boldsymbol{\nu}}})\).
In EM, the M-step maximizes the exact expected log-likelihood, while in VAEs, we maximize the ELBO, which is a lower bound on the log-likelihood.

Amortization

In the previous approach we learn a separate set of variational parameters \({\textcolor{vparams}{\boldsymbol{\nu}}}\) for each data point \({\textcolor{input}{\boldsymbol{x}}}\).
This requires an expensive iterative optimization for each data point, which will take a long time to process large datasets.
Amortization: instead of learning a separate set of variational parameters for each data point, we learn a function \({\boldsymbol{g}}({\boldsymbol{\phi}}, {\textcolor{input}{\boldsymbol{x}}})\) that maps the data point \({\textcolor{input}{\boldsymbol{x}}}\) to the variational parameters \({\textcolor{vparams}{\boldsymbol{\nu}}}\).
This function is often implemented as a neural network, which allows us to learn a complex mapping from the data to the variational parameters.

Amortization

Idea: “amortize” the cost of computing the variational parameters \({\textcolor{vparams}{\boldsymbol{\nu}}}\) by learning a function that predicts \(\{ {\boldsymbol{\mu}}, {\boldsymbol{\sigma}}^2 \}\) as a function of the data \({\textcolor{input}{\boldsymbol{x}}}\).
The output of this function is generally \(\textcolor{vparams}{{\boldsymbol{\mu}}}\) and \(\log\textcolor{vparams}{{\boldsymbol{\sigma}}}\) (the \(\log\) is needed to ensure that the variance is positive).
The reparametrization trick is still used to sample from the variational distribution, the only difference is that the variational parameters are not free variables anymore, but are predicted by a neural network \({\boldsymbol{g}}({\boldsymbol{\phi}}, {\textcolor{input}{\boldsymbol{x}}})\).
Sometimes, this is noted as \(q({\textcolor{latent}{\boldsymbol{z}}}\mid {\textcolor{input}{\boldsymbol{x}}})\) to emphasize that the variational distribution depends on the data \({\textcolor{input}{\boldsymbol{x}}}\), even though it’s not actually a conditional distribution.

Putting it all together

We have a neural network \(g({\textcolor{input}{\boldsymbol{x}}}; {\boldsymbol{\phi}})\) that predicts the variational parameters \(\{ \textcolor{vparams}{{\boldsymbol{\mu}}}, \textcolor{vparams}{{\boldsymbol{\sigma}}}^2 \}\) from the data \({\textcolor{input}{\boldsymbol{x}}}\).
We can sample from the variational distribution \(q({\textcolor{latent}{\boldsymbol{z}}}\mid {\textcolor{input}{\boldsymbol{x}}}; {\boldsymbol{\phi}})\) using the reparameterization trick
We compute the likelihood \(\log p({\textcolor{input}{\boldsymbol{x}}}\mid {\textcolor{latent}{\boldsymbol{z}}}; {\textcolor{params}{\boldsymbol{\theta}}})\) using using a neural network \({\textcolor{function}{\boldsymbol{f}}}({\textcolor{latent}{\boldsymbol{z}}}; {\textcolor{params}{\boldsymbol{\theta}}})\) that maps the latent variable \({\textcolor{latent}{\boldsymbol{z}}}\) to the data space.
We can now optimize the ELBO with respect to both the likelihood parameters \({\textcolor{params}{\boldsymbol{\theta}}}\) and the amortization parameters \({\boldsymbol{\phi}}\): \[ \mathcal{L}_\text{ELBO}({\textcolor{params}{\boldsymbol{\theta}}}, {\boldsymbol{\phi}}) = \mathbb{E}_{q({\textcolor{latent}{\boldsymbol{z}}}\mid {\textcolor{input}{\boldsymbol{x}}}; {\boldsymbol{\phi}})} \log p({\textcolor{input}{\boldsymbol{x}}}\mid {\textcolor{latent}{\boldsymbol{z}}}; {\textcolor{params}{\boldsymbol{\theta}}}) - \mathrm{KL}\left( q({\textcolor{latent}{\boldsymbol{z}}}\mid {\textcolor{input}{\boldsymbol{x}}}; {\boldsymbol{\phi}}) \| p({\textcolor{latent}{\boldsymbol{z}}}) \right) \]

Putting it all together

Generating samples from the VAE

Follow the generation process:

Sample from the latent space \(z \sim p({\textcolor{latent}{\boldsymbol{z}}})\), where \(p({\textcolor{latent}{\boldsymbol{z}}})\) is the prior distribution.
Decode the latent variable \(z\) to generate a sample \(x \sim p({\textcolor{input}{\boldsymbol{x}}}\mid {\textcolor{latent}{\boldsymbol{z}}})\) using the decoder network.

Note

When using amortized inference, the encoder network is not used during generation, since we are not conditioning on any observed data

It does not make sense to compute \(q({\textcolor{latent}{\boldsymbol{z}}}\mid {\textcolor{input}{\boldsymbol{x}}})\) when we don’t have any \({\textcolor{input}{\boldsymbol{x}}}\)

Conditional VAE (CVAE)

Sometimes we want to generate samples conditioned on some additional information \(p({\textcolor{input}{\boldsymbol{x}}}\mid {\textcolor{output}{\boldsymbol{y}}})\), e.g., we want to generate images of a specific class.
Since the latent code \({\textcolor{latent}{\boldsymbol{z}}}\) no longer has to model the image category, it can focus on modeling the stylistic features.
If we’re lucky, this lets us disentangle style and content. (Note: disentanglement is still a dark art.)
Conditional VAE can be achieve by adding an additional input to the encoder and decoder networks, which is the additional information \({\textcolor{output}{\boldsymbol{y}}}\):

Latent space analysis

You can often get interesting results by interpolating between two vectors in the latent space:

You can also perform “arithmetic” in the latent space, e.g. by adding or subtracting vectors:

Latent space visualization

By varying two latent dimensions (i.e. dimensions of \({\textcolor{latent}{\boldsymbol{z}}}\)) while holding \({\textcolor{output}{\boldsymbol{y}}}\) fixed, we can visualize the latent space.

Latent space interpolation

By varying the label \({\textcolor{output}{\boldsymbol{y}}}\) while holding \({\textcolor{latent}{\boldsymbol{z}}}\) fixed, we can solve image analogies.

Problems with VAEs

The samples may not be very good:
- VAEs suffer from mode collapse, where the amortization network learns to predict the prior. This is due to the strong effect of the KL divergence term in the ELBO.
- The spherical prior \(p({\textcolor{latent}{\boldsymbol{z}}})\) is too restrictive, it can lead to a “blurry” output distribution (or very noisy samples, if we sample from the likelihood).