Advanced Statistical Inference
EURECOM
\[ \require{physics} \definecolor{input}{rgb}{0.42, 0.55, 0.74} \definecolor{params}{rgb}{0.51,0.70,0.40} \definecolor{output}{rgb}{0.843, 0.608, 0} \definecolor{vparams}{rgb}{0.58, 0, 0.83} \definecolor{noise}{rgb}{0.0, 0.48, 0.65} \definecolor{latent}{rgb}{0.8, 0.0, 0.8} \]
Given training data from an unknown distribution \(p_\text{data}({\textcolor{input}{\boldsymbol{x}}})\), we want to learn a model \(p_\text{model}({\textcolor{input}{\boldsymbol{x}}})\) that approximates the true distribution.

Train from \({\textcolor{input}{\boldsymbol{x}}}\sim p_\text{data}({\textcolor{input}{\boldsymbol{x}}})\).

Generate from \({\textcolor{input}{\boldsymbol{x}}}\sim p_\text{model}({\textcolor{input}{\boldsymbol{x}}})\).
Our objective is to learn the data distribution \(p({\textcolor{input}{\boldsymbol{x}}})\), but we suppose that each data point \({\textcolor{input}{\boldsymbol{x}}}\) is associated with a latent variable \({\textcolor{latent}{\boldsymbol{z}}}\).
For example, imagine we want to learn the distribution of images of objects, like the one below:

Each image is a huge vector of pixels \({\textcolor{input}{\boldsymbol{x}}}\in \mathbb{R}^{H \times W \times C}\), where \(H\) is the height, \(W\) is the width and \(C\) is the number of channels (in this case, \(64 \times 64 \times 3\)).
Each image can be described by a set of 6 latent variables: floor color, wall color, object color, shape, orientation and scale.
Latent variables are unobserved variables, we cannot measure them directly, but we can infer them from the observed data.
In math terms, we model our data distribution as a marginal of the joint distribution of the observed data and the latent variables:
\[ p({\textcolor{input}{\boldsymbol{x}}}) = \int p({\textcolor{input}{\boldsymbol{x}}}, {\textcolor{latent}{\boldsymbol{z}}}) \dd{\textcolor{latent}{\boldsymbol{z}}}\quad \text{or} \quad p({\textcolor{input}{\boldsymbol{x}}}) = \sum_{i} p({\textcolor{input}{\boldsymbol{x}}}, {\textcolor{latent}{\boldsymbol{z}}}_i) \]
where \(p({\textcolor{input}{\boldsymbol{x}}}, {\textcolor{latent}{\boldsymbol{z}}})\) is the joint distribution of the observed data and the latent variables.
Gaussian mixture model (GMM)
Conditional distribution of the data given the latent variable: \[ p({\textcolor{input}{\boldsymbol{x}}}\mid {\textcolor{latent}{\boldsymbol{z}}}=k) = {\mathcal{N}}({\textcolor{input}{\boldsymbol{x}}}\mid {\boldsymbol{\mu}}_k, {\boldsymbol{\Sigma}}_k) \]
where \({\boldsymbol{\mu}}_k\) and \({\boldsymbol{\Sigma}}_k\) are the mean and covariance of the \(k\)-th Gaussian component.
Prior distribution of the latent variable:
\[ p({\textcolor{latent}{\boldsymbol{z}}}) = \text{Categorical}({\textcolor{latent}{\boldsymbol{z}}}\mid {\boldsymbol{\pi}}) \]
where \({\boldsymbol{\pi}} = (\pi_1, \ldots, \pi_K)\) are the mixing coefficients, with \(\pi_k \geq 0\) and \(\sum_k \pi_k = 1\).
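A minimal sketch of ancestral sampling from this model (first draw the component \({\textcolor{latent}{\boldsymbol{z}}}\) from the categorical prior, then \({\textcolor{input}{\boldsymbol{x}}}\) from the corresponding Gaussian); the mixture parameters below are illustrative placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative GMM parameters (K = 3 components in 2D)
pi = np.array([0.5, 0.3, 0.2])                                  # mixing coefficients, sum to 1
mus = np.array([[0.0, 0.0], [3.0, 3.0], [-3.0, 2.0]])           # component means
Sigmas = np.stack([np.eye(2), 0.5 * np.eye(2), np.diag([1.0, 0.2])])  # component covariances

def sample_gmm(n):
    # p(z) = Categorical(pi): pick a component index for each sample
    z = rng.choice(len(pi), size=n, p=pi)
    # p(x | z = k) = N(mu_k, Sigma_k): sample from the chosen component
    x = np.stack([rng.multivariate_normal(mus[k], Sigmas[k]) for k in z])
    return x, z

x, z = sample_gmm(1000)
```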
Probabilistic PCA (PPCA)
Conditional distribution of the data given the latent variable: \[ p({\textcolor{input}{\boldsymbol{x}}}\mid {\textcolor{latent}{\boldsymbol{z}}}) = {\mathcal{N}}({\textcolor{input}{\boldsymbol{x}}}\mid {\textcolor{params}{\boldsymbol{W}}}{\textcolor{latent}{\boldsymbol{z}}}+ {\textcolor{params}{\boldsymbol{b}}}, \sigma^2 {\boldsymbol{I}}) \]
where \({\textcolor{params}{\boldsymbol{W}}}\) is a linear transformation matrix, \({\textcolor{params}{\boldsymbol{b}}}\) is a bias vector, and \(\sigma^2 {\boldsymbol{I}}\) is the noise covariance.
Prior distribution of the latent variable: \[ p({\textcolor{latent}{\boldsymbol{z}}}) = {\mathcal{N}}({\textcolor{latent}{\boldsymbol{z}}}\mid {\boldsymbol{0}}, {\boldsymbol{I}}) \]
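Similarly, a sketch of ancestral sampling from PPCA (draw \({\textcolor{latent}{\boldsymbol{z}}}\) from the standard normal prior, then add isotropic noise to \({\textcolor{params}{\boldsymbol{W}}}{\textcolor{latent}{\boldsymbol{z}}} + {\textcolor{params}{\boldsymbol{b}}}\)); the dimensions and parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

D, K = 10, 2                     # data and latent dimensions (illustrative)
W = rng.normal(size=(D, K))      # linear map from latent to data space
b = rng.normal(size=D)           # bias
sigma = 0.1                      # observation noise standard deviation

def sample_ppca(n):
    z = rng.normal(size=(n, K))                      # p(z) = N(0, I)
    mean = z @ W.T + b                               # W z + b
    x = mean + sigma * rng.normal(size=(n, D))       # add N(0, sigma^2 I) noise
    return x, z

x, z = sample_ppca(1000)
```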
Idea:
Disclaimer:
Variational autoencoders (VAEs) are often misrepresented as a simple autoencoder with a probabilistic twist.
This is wrong (sort of)! It hides the real nature of the encoder and decoder!
Consider a maximum likelihood training objective for a generative model with latent variables:
\[ p({\textcolor{input}{\boldsymbol{x}}}) = \int p({\textcolor{input}{\boldsymbol{x}}}\mid {\textcolor{latent}{\boldsymbol{z}}}) p({\textcolor{latent}{\boldsymbol{z}}}) \dd{\textcolor{latent}{\boldsymbol{z}}} \]
Objective: increase the flexibility of \(p({\textcolor{input}{\boldsymbol{x}}}\mid {\textcolor{latent}{\boldsymbol{z}}})\) to approximate the true data distribution \(p_\text{data}({\textcolor{input}{\boldsymbol{x}}})\).
Solution: define a noisy observation model
\[ p({\textcolor{input}{\boldsymbol{x}}}\mid {\textcolor{latent}{\boldsymbol{z}}}) = {\mathcal{N}}({\textcolor{input}{\boldsymbol{x}}}\mid {\textcolor{output}{\boldsymbol{f}}}({\textcolor{params}{\boldsymbol{\theta}}}, {\textcolor{latent}{\boldsymbol{z}}}), \sigma^2 {\boldsymbol{I}}) \]
where \({\textcolor{output}{\boldsymbol{f}}}({\textcolor{params}{\boldsymbol{\theta}}}, {\textcolor{latent}{\boldsymbol{z}}})\) is a function of the latent variable \({\textcolor{latent}{\boldsymbol{z}}}\) and the model parameters \({\textcolor{params}{\boldsymbol{\theta}}}\).
Note: \(p({\textcolor{input}{\boldsymbol{x}}}\mid {\textcolor{latent}{\boldsymbol{z}}})\) is not a deterministic mapping; it is a probabilistic model that outputs a distribution over the data given the latent variable. If it were deterministic, the model density \(p({\textcolor{input}{\boldsymbol{x}}})\) would be zero almost everywhere.
Notation: we will use \(p({\textcolor{input}{\boldsymbol{x}}}; {\textcolor{params}{\boldsymbol{\theta}}})\) and \(p({\textcolor{input}{\boldsymbol{x}}}\mid {\textcolor{latent}{\boldsymbol{z}}}; {\textcolor{params}{\boldsymbol{\theta}}})\) to remember that these two distributions depend on the model parameters \({\textcolor{params}{\boldsymbol{\theta}}}\).
\(p({\textcolor{input}{\boldsymbol{x}}};{\textcolor{params}{\boldsymbol{\theta}}}) = \int p({\textcolor{input}{\boldsymbol{x}}}\mid {\textcolor{latent}{\boldsymbol{z}}};{\textcolor{params}{\boldsymbol{\theta}}}) p({\textcolor{latent}{\boldsymbol{z}}}) \dd{\textcolor{latent}{\boldsymbol{z}}}\) is well-defined
Can we compute it?
Can we maximize it w.r.t. \({\textcolor{params}{\boldsymbol{\theta}}}\)?
Solution: try to maximize a lower bound on the likelihood \(\log p({\textcolor{input}{\boldsymbol{x}}}; {\textcolor{params}{\boldsymbol{\theta}}})\)
We obtain the lower bound by applying Jensen’s inequality: for any convex function \(h\) and random variable \(X\),
\[ h(\mathbb{E}[X]) \leq \mathbb{E}[h(X)] \]
Therefore, if \(h\) is concave, the inequality is reversed:
\[ h(\mathbb{E}[X]) \geq \mathbb{E}[h(X)] \]
The function \(\log x\) is concave, so for any positive random variable \(X\):
\[ \mathbb{E} \log X \leq \log \mathbb{E} X \]
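As a quick numerical sanity check of this inequality (an illustrative snippet; the lognormal choice is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.exp(rng.normal(size=100_000))   # a positive random variable

lhs = np.mean(np.log(X))               # E[log X]
rhs = np.log(np.mean(X))               # log E[X]
print(lhs, rhs, lhs <= rhs)            # E[log X] <= log E[X]
```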
We can use Jensen’s inequality to obtain a lower bound on the log-likelihood.
Suppose we have a distribution \(q({\textcolor{latent}{\boldsymbol{z}}})\) (we will see later how to choose it):
\[ \begin{aligned} \log p({\textcolor{input}{\boldsymbol{x}}};{\textcolor{params}{\boldsymbol{\theta}}}) &= \log \int p({\textcolor{input}{\boldsymbol{x}}}\mid {\textcolor{latent}{\boldsymbol{z}}};{\textcolor{params}{\boldsymbol{\theta}}}) p({\textcolor{latent}{\boldsymbol{z}}}) \dd{\textcolor{latent}{\boldsymbol{z}}}\\ &= \class{fragment}{\log \int q({\textcolor{latent}{\boldsymbol{z}}}) \frac{p({\textcolor{latent}{\boldsymbol{z}}})}{q({\textcolor{latent}{\boldsymbol{z}}})} p({\textcolor{input}{\boldsymbol{x}}}\mid {\textcolor{latent}{\boldsymbol{z}}};{\textcolor{params}{\boldsymbol{\theta}}}) \dd{\textcolor{latent}{\boldsymbol{z}}}} \\ &\geq \class{fragment}{\int q({\textcolor{latent}{\boldsymbol{z}}}) \log \left( \frac{p({\textcolor{latent}{\boldsymbol{z}}})}{q({\textcolor{latent}{\boldsymbol{z}}})} p({\textcolor{input}{\boldsymbol{x}}}\mid {\textcolor{latent}{\boldsymbol{z}}};{\textcolor{params}{\boldsymbol{\theta}}}) \right) \dd{\textcolor{latent}{\boldsymbol{z}}}\qquad \text{(Jensen's inequality)} }\\ &= \class{fragment}{\mathbb{E}_{q({\textcolor{latent}{\boldsymbol{z}}})} \log \frac{p({\textcolor{latent}{\boldsymbol{z}}})}{q({\textcolor{latent}{\boldsymbol{z}}})} + \mathbb{E}_{q({\textcolor{latent}{\boldsymbol{z}}})} \log p({\textcolor{input}{\boldsymbol{x}}}\mid {\textcolor{latent}{\boldsymbol{z}}};{\textcolor{params}{\boldsymbol{\theta}}})} \end{aligned} \]
Remember the Variational Inference lecture?
We can rewrite the lower bound as: \[ \log p({\textcolor{input}{\boldsymbol{x}}};{\textcolor{params}{\boldsymbol{\theta}}}) \geq \mathbb{E}_{q({\textcolor{latent}{\boldsymbol{z}}})} \log p({\textcolor{input}{\boldsymbol{x}}}\mid {\textcolor{latent}{\boldsymbol{z}}};{\textcolor{params}{\boldsymbol{\theta}}}) - \text{KL}\left(q({\textcolor{latent}{\boldsymbol{z}}}) \parallel p({\textcolor{latent}{\boldsymbol{z}}})\right) \]
The first term says that we want to minimize the expected reconstruction error between the data \({\textcolor{input}{\boldsymbol{x}}}\) and the model output \({\textcolor{output}{\boldsymbol{f}}}({\textcolor{params}{\boldsymbol{\theta}}}, {\textcolor{latent}{\boldsymbol{z}}})\), when \({\textcolor{latent}{\boldsymbol{z}}}\) is sampled from \(q({\textcolor{latent}{\boldsymbol{z}}})\).
\[ \log p({\textcolor{input}{\boldsymbol{x}}};{\textcolor{params}{\boldsymbol{\theta}}}) \geq \mathcal{L}_\text{ELBO}({\textcolor{params}{\boldsymbol{\theta}}}) = \mathbb{E}_{q({\textcolor{latent}{\boldsymbol{z}}})} \log p({\textcolor{input}{\boldsymbol{x}}}\mid {\textcolor{latent}{\boldsymbol{z}}};{\textcolor{params}{\boldsymbol{\theta}}}) - \text{KL}\left(q({\textcolor{latent}{\boldsymbol{z}}}) \parallel p({\textcolor{latent}{\boldsymbol{z}}})\right) \]
We would like to maximize this lower bound on the log-likelihood, but we also want the bound to be tight.
Recall that the KL divergence is always non-negative; the gap between the log-likelihood and the ELBO is exactly a KL divergence:
\[ \log p({\textcolor{input}{\boldsymbol{x}}};{\textcolor{params}{\boldsymbol{\theta}}}) - \mathcal{L}_\text{ELBO}({\textcolor{params}{\boldsymbol{\theta}}}) = \text{KL}\left(q({\textcolor{latent}{\boldsymbol{z}}}) \parallel p({\textcolor{latent}{\boldsymbol{z}}}\mid {\textcolor{input}{\boldsymbol{x}}})\right) \]
This means that the closer the variational distribution \(q({\textcolor{latent}{\boldsymbol{z}}})\) is to the true posterior distribution \(p({\textcolor{latent}{\boldsymbol{z}}}\mid {\textcolor{input}{\boldsymbol{x}}})\), the tighter the bound will be.
To make the bound tight, we therefore choose a variational distribution \(q({\textcolor{latent}{\boldsymbol{z}}})\) that approximates the true posterior distribution \(p({\textcolor{latent}{\boldsymbol{z}}}\mid {\textcolor{input}{\boldsymbol{x}}})\).
The variational distribution is often chosen to be Gaussian, i.e., \(q({\textcolor{latent}{\boldsymbol{z}}}; {\textcolor{vparams}{\boldsymbol{\nu}}}) = {\mathcal{N}}({\textcolor{latent}{\boldsymbol{z}}}\mid {\boldsymbol{\mu}}, {\boldsymbol{\sigma}}^2{\boldsymbol{I}})\), where \({\textcolor{vparams}{\boldsymbol{\nu}}}= \{ {\boldsymbol{\mu}}, {\boldsymbol{\sigma}}^2\}\) are the variational parameters.
A Gaussian is a convenient choice because it is easy to sample from, its KL divergence to a Gaussian prior is available in closed form, and the reparameterization trick can be used to compute gradients of the ELBO w.r.t. the variational parameters \({\textcolor{vparams}{\boldsymbol{\nu}}}\).
Important
The reparameterization trick allows us to sample from the variational distribution \(q({\textcolor{latent}{\boldsymbol{z}}}; {\textcolor{vparams}{\boldsymbol{\nu}}})\) in a way that is differentiable w.r.t. the variational parameters \({\textcolor{vparams}{\boldsymbol{\nu}}}\).
\[ {\textcolor{latent}{\boldsymbol{z}}}= {\boldsymbol{\mu}}+ {\boldsymbol{\sigma}}\odot {\textcolor{noise}{\boldsymbol{\varepsilon}}}, \quad {\textcolor{noise}{\boldsymbol{\varepsilon}}}\sim {\mathcal{N}}({\boldsymbol{0}}, {\boldsymbol{I}}) \]
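A short sketch of the reparameterization trick together with the closed-form KL between the diagonal Gaussian \(q({\textcolor{latent}{\boldsymbol{z}}}; {\textcolor{vparams}{\boldsymbol{\nu}}})\) and the standard normal prior mentioned above:

```python
import torch

def reparameterize(mu, log_sigma):
    # z = mu + sigma * eps, eps ~ N(0, I): differentiable w.r.t. mu and log_sigma
    eps = torch.randn_like(mu)
    return mu + torch.exp(log_sigma) * eps

def kl_to_standard_normal(mu, log_sigma):
    # KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions
    return 0.5 * torch.sum(torch.exp(2 * log_sigma) + mu**2 - 1 - 2 * log_sigma, dim=-1)
```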
At the end, we are left with the following optimization problem:
\[ \max_{{\textcolor{params}{\boldsymbol{\theta}}}, {\textcolor{vparams}{\boldsymbol{\nu}}}} \mathcal{L}_\text{ELBO}({\textcolor{params}{\boldsymbol{\theta}}}, {\textcolor{vparams}{\boldsymbol{\nu}}}) = \mathbb{E}_{q({\textcolor{latent}{\boldsymbol{z}}}; {\textcolor{vparams}{\boldsymbol{\nu}}})} \log p({\textcolor{input}{\boldsymbol{x}}}\mid {\textcolor{latent}{\boldsymbol{z}}};{\textcolor{params}{\boldsymbol{\theta}}}) - \text{KL}\left(q({\textcolor{latent}{\boldsymbol{z}}}; {\textcolor{vparams}{\boldsymbol{\nu}}}) \parallel p({\textcolor{latent}{\boldsymbol{z}}})\right) \]
Note:
The ELBO is a function of both the likelihood parameters \({\textcolor{params}{\boldsymbol{\theta}}}\) and the variational parameters \({\textcolor{vparams}{\boldsymbol{\nu}}}\).
The likelihood parameters \({\textcolor{params}{\boldsymbol{\theta}}}\) are shared across all data points, while the variational parameters \({\textcolor{vparams}{\boldsymbol{\nu}}}\) are per-sample.
In classic VI (e.g., Bayesian logistic regression):
In our case, for variational autoencoders:
We optimize both:
But these two are coupled:
This may lead to unstable or inefficient training, especially when \({\textcolor{output}{\boldsymbol{f}}}(\cdot)\) is complex (e.g., deep neural networks).
In the previous approach we learn a separate set of variational parameters \({\textcolor{vparams}{\boldsymbol{\nu}}}\) for each data point \({\textcolor{input}{\boldsymbol{x}}}\).
This requires an expensive iterative optimization for each data point, which does not scale to large datasets.
Amortization: instead of learning a separate set of variational parameters for each data point, we learn a function \({\boldsymbol{g}}({\textcolor{params}{\boldsymbol{\phi}}}, {\textcolor{input}{\boldsymbol{x}}})\) that maps the data point \({\textcolor{input}{\boldsymbol{x}}}\) to the variational parameters \({\textcolor{vparams}{\boldsymbol{\nu}}}\).
This function is often implemented as a neural network, which allows us to learn a complex mapping from the data to the variational parameters.
Idea: “amortize” the cost of computing the variational parameters \({\textcolor{vparams}{\boldsymbol{\nu}}}\) by learning a function that predicts \(\{ {\boldsymbol{\mu}}, {\boldsymbol{\sigma}}^2 \}\) as a function of the data \({\textcolor{input}{\boldsymbol{x}}}\).
The output of this function is generally \(\textcolor{vparams}{{\boldsymbol{\mu}}}\) and \(\log\textcolor{vparams}{{\boldsymbol{\sigma}}}\) (predicting the \(\log\) ensures that the variance is positive after exponentiation).
The reparameterization trick is still used to sample from the variational distribution; the only difference is that the variational parameters are no longer free variables but are predicted by a neural network \({\boldsymbol{g}}({\textcolor{params}{\boldsymbol{\phi}}}, {\textcolor{input}{\boldsymbol{x}}})\).
Sometimes, this is noted as \(q({\textcolor{latent}{\boldsymbol{z}}}\mid {\textcolor{input}{\boldsymbol{x}}})\) to emphasize that the variational distribution depends on the data \({\textcolor{input}{\boldsymbol{x}}}\), even though it’s not actually a conditional distribution.

We have a neural network \(g({\textcolor{input}{\boldsymbol{x}}}; {\textcolor{params}{\boldsymbol{\phi}}})\) that predicts the variational parameters \(\{ \textcolor{vparams}{{\boldsymbol{\mu}}}, \textcolor{vparams}{{\boldsymbol{\sigma}}}^2 \}\) from the data \({\textcolor{input}{\boldsymbol{x}}}\).
We can sample from the variational distribution \(q({\textcolor{latent}{\boldsymbol{z}}}\mid {\textcolor{input}{\boldsymbol{x}}}; {\textcolor{params}{\boldsymbol{\phi}}})\) using the reparameterization trick
We compute the likelihood \(\log p({\textcolor{input}{\boldsymbol{x}}}\mid {\textcolor{latent}{\boldsymbol{z}}}; {\textcolor{params}{\boldsymbol{\theta}}})\) using a neural network \({\textcolor{output}{\boldsymbol{f}}}({\textcolor{latent}{\boldsymbol{z}}}; {\textcolor{params}{\boldsymbol{\theta}}})\) that maps the latent variable \({\textcolor{latent}{\boldsymbol{z}}}\) to the data space.
We can now optimize the ELBO with respect to both the likelihood parameters \({\textcolor{params}{\boldsymbol{\theta}}}\) and the amortization parameters \({\textcolor{params}{\boldsymbol{\phi}}}\): \[ \mathcal{L}_\text{ELBO}({\textcolor{params}{\boldsymbol{\theta}}}, {\textcolor{params}{\boldsymbol{\phi}}}) = \mathbb{E}_{q({\textcolor{latent}{\boldsymbol{z}}}\mid {\textcolor{input}{\boldsymbol{x}}}; {\textcolor{params}{\boldsymbol{\phi}}})} \log p({\textcolor{input}{\boldsymbol{x}}}\mid {\textcolor{latent}{\boldsymbol{z}}}; {\textcolor{params}{\boldsymbol{\theta}}}) - \text{KL}\left(q({\textcolor{latent}{\boldsymbol{z}}}\mid {\textcolor{input}{\boldsymbol{x}}}; {\textcolor{params}{\boldsymbol{\phi}}}) \parallel p({\textcolor{latent}{\boldsymbol{z}}})\right) \]
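A compact PyTorch sketch of this amortized setup (encoder predicting \(\textcolor{vparams}{{\boldsymbol{\mu}}}\) and \(\log\textcolor{vparams}{{\boldsymbol{\sigma}}}\), reparameterized sampling, Gaussian decoder, single-sample Monte Carlo ELBO); the architecture sizes and the fixed observation noise `sigma_x` are illustrative assumptions, not the exact setup of the lecture.

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=16, h_dim=256, sigma_x=0.1):
        super().__init__()
        # encoder g(phi, x): predicts the variational parameters (mu, log_sigma)
        self.encoder = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU(),
                                     nn.Linear(h_dim, 2 * z_dim))
        # decoder f(z; theta): mean of the Gaussian likelihood p(x | z)
        self.decoder = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                     nn.Linear(h_dim, x_dim))
        self.sigma_x = sigma_x

    def elbo(self, x):
        mu, log_sigma = self.encoder(x).chunk(2, dim=-1)
        # reparameterization trick: z = mu + sigma * eps
        z = mu + torch.exp(log_sigma) * torch.randn_like(mu)
        # expected log-likelihood (single-sample Monte Carlo estimate)
        log_lik = torch.distributions.Normal(self.decoder(z), self.sigma_x).log_prob(x).sum(-1)
        # closed-form KL( q(z | x) || N(0, I) )
        kl = 0.5 * torch.sum(torch.exp(2 * log_sigma) + mu**2 - 1 - 2 * log_sigma, dim=-1)
        return (log_lik - kl).mean()

# one training step: maximize the ELBO, i.e. minimize its negative
model = VAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(32, 784)                 # placeholder batch of data
loss = -model.elbo(x)
opt.zero_grad(); loss.backward(); opt.step()
```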
Follow the generation process:
Sometimes we want to generate samples conditioned on some additional information \(p({\textcolor{input}{\boldsymbol{x}}}\mid {\textcolor{output}{\boldsymbol{y}}})\), e.g., we want to generate images of a specific class.
Since the latent code \({\textcolor{latent}{\boldsymbol{z}}}\) no longer has to model the image category, it can focus on modeling the stylistic features.
If we’re lucky, this lets us disentangle style and content. (Note: disentanglement is still a dark art.)
A conditional VAE can be obtained by adding the conditioning information \({\textcolor{output}{\boldsymbol{y}}}\) as an additional input to both the encoder and the decoder networks:

You can often get interesting results by interpolating between two vectors in the latent space:

You can also perform “arithmetic” in the latent space, e.g. by adding or subtracting vectors:

By varying two latent dimensions (i.e. dimensions of \({\textcolor{latent}{\boldsymbol{z}}}\)) while holding \({\textcolor{output}{\boldsymbol{y}}}\) fixed, we can visualize the latent space.
By varying the label \({\textcolor{output}{\boldsymbol{y}}}\) while holding \({\textcolor{latent}{\boldsymbol{z}}}\) fixed, we can solve image analogies.
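A tiny sketch of these latent-space manipulations, assuming a hypothetical trained decoder `decode` and latent codes obtained from the encoder means:

```python
import torch

def interpolate(decode, z_a, z_b, steps=8):
    # linear interpolation between two latent codes, decoded back to data space
    alphas = torch.linspace(0.0, 1.0, steps).unsqueeze(-1)
    return decode((1.0 - alphas) * z_a + alphas * z_b)

def latent_arithmetic(decode, z, z_with_attr, z_without_attr):
    # "arithmetic": add the direction associated with an attribute to another code
    return decode(z + (z_with_attr - z_without_attr))
```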
The samples may not be very good:
VAEs can suffer from posterior collapse, where the amortization network learns to simply predict the prior. This is due to the strong effect of the KL divergence term in the ELBO.
The spherical prior \(p({\textcolor{latent}{\boldsymbol{z}}})\) is too restrictive; it can lead to a “blurry” output distribution (or very noisy samples, if we sample from the likelihood).
So far we have focused on likelihood-based generative models, where we maximize the likelihood of the data given the model parameters.
Likelihood-based models are powerful, but they have some limitations:
Likelihood-free learning considers alternative training objectives that do not depend directly on the likelihood.

\(S_1 = \{ {\textcolor{input}{\boldsymbol{x}}}_1, \ldots, {\textcolor{input}{\boldsymbol{x}}}_{N_1} \mid {\textcolor{input}{\boldsymbol{x}}}_i \sim p({\textcolor{input}{\boldsymbol{x}}}) \}\)

\(S_2 = \{ {\textcolor{input}{\boldsymbol{x}}}_1, \ldots, {\textcolor{input}{\boldsymbol{x}}}_{N_2} \mid {\textcolor{input}{\boldsymbol{x}}}_i \sim q({\textcolor{input}{\boldsymbol{x}}}) \}\)
Given a finite set of samples from two distributions (\(S_1\) and \(S_2\)), how can we tell if these samples are from the same distribution or not?
Set up a statistical test to compare the two sets of samples \(S_1\) and \(S_2\):
Null hypothesis: the two sets of samples are drawn from the same distribution, i.e., \(p = q\).
The test statistic \(T\) is computed from \(S_1\) and \(S_2\); for example, the difference in the empirical means:
\[ T(S_1, S_2) = \frac{1}{N_1} \sum_{{\textcolor{input}{\boldsymbol{x}}}\in S_1} {\textcolor{input}{\boldsymbol{x}}}- \frac{1}{N_2} \sum_{{\textcolor{input}{\boldsymbol{x}}}\in S_2} {\textcolor{input}{\boldsymbol{x}}} \]
If \(T\) is larger than some threshold \(\alpha\), we reject \(H_0\)
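A minimal illustration of this two-sample test with the difference-of-means statistic (the data, dimensions and threshold below are arbitrary placeholders):

```python
import numpy as np

def mean_diff_statistic(S1, S2):
    # T(S1, S2): difference of the empirical means (a vector in data space)
    return S1.mean(axis=0) - S2.mean(axis=0)

rng = np.random.default_rng(0)
S1 = rng.normal(loc=0.0, size=(500, 2))     # samples "from p"
S2 = rng.normal(loc=0.3, size=(500, 2))     # samples "from q" (shifted)

T = mean_diff_statistic(S1, S2)
reject_H0 = np.linalg.norm(T) > 0.1         # compare magnitude to some threshold alpha
```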
Key observation: Test statistic is likelihood-free since it does not depend on \(p\) and \(q\) directly, but only on the samples from these distributions.

Assume \(p_\text{data}({\textcolor{input}{\boldsymbol{x}}})\) is the true data distribution and we have training samples \(S_1\)

Assume we have a model \(p({\textcolor{input}{\boldsymbol{x}}}; {\textcolor{params}{\boldsymbol{\theta}}})\) that permits us to generate samples for \(S_2\)
Idea: formulate the training of the generative model to minimize a two-sample test statistic between \(S_1\) and \(S_2\)
Finding a good test objective in high dimensions is hard, especially when the data is complex (e.g., images).
In the generative model setup, we know that \(S_1\) and \(S_2\) come from different distributions, \(p_\text{data}({\textcolor{input}{\boldsymbol{x}}})\) and \(p({\textcolor{input}{\boldsymbol{x}}}; {\textcolor{params}{\boldsymbol{\theta}}})\)
Idea: Learn a statistic to automatically identify in what way the two sets of samples differ
How? Train a classifier (which we will call discriminator)
Build binary classifier \({\textcolor{output}{D}}({\textcolor{params}{\boldsymbol{\phi}}}, {\textcolor{input}{\boldsymbol{x}}})\) (e.g., neural network with parameters \({\textcolor{params}{\boldsymbol{\phi}}}\)) that tries to distinguish “real” (\(\textcolor{output}{y}= 1\)) samples from the dataset and “fake” (\(\textcolor{output}{y}= 0\)) samples generated from the model
Test statistic: negative loss of the classifier.
Goal: Maximize the two-sample test statistic or equivalently minimize the classification loss.
Do you remember what likelihood we used for a binary classification task? The Bernoulli likelihood:
\[ \begin{aligned} \arg\max_{\textcolor{params}{\boldsymbol{\phi}}}\mathcal{L}({\textcolor{params}{\boldsymbol{\phi}}}) :&= \mathbb{E}_{p_{\text{data}} ({\textcolor{input}{\boldsymbol{x}}})} \log {\textcolor{output}{D}}({\textcolor{params}{\boldsymbol{\phi}}}, {\textcolor{input}{\boldsymbol{x}}}) + \mathbb{E}_{p({\textcolor{input}{\boldsymbol{x}}}; {\textcolor{params}{\boldsymbol{\theta}}})} \log (1 - {\textcolor{output}{D}}({\textcolor{params}{\boldsymbol{\phi}}}, {\textcolor{input}{\boldsymbol{x}}})) \\ &\approx \frac{1}{N_1} \sum_{{\textcolor{input}{\boldsymbol{x}}}\in S_1} \log {\textcolor{output}{D}}({\textcolor{params}{\boldsymbol{\phi}}}, {\textcolor{input}{\boldsymbol{x}}}) + \frac{1}{N_2} \sum_{{\textcolor{input}{\boldsymbol{x}}}\in S_2} \log (1 - {\textcolor{output}{D}}({\textcolor{params}{\boldsymbol{\phi}}}, {\textcolor{input}{\boldsymbol{x}}})) \\ \end{aligned} \]
For a fixed generative model \(p({\textcolor{input}{\boldsymbol{x}}}; {\textcolor{params}{\boldsymbol{\theta}}})\), we can train the discriminator \({\textcolor{output}{D}}({\textcolor{params}{\boldsymbol{\phi}}}, {\textcolor{input}{\boldsymbol{x}}})\) to distinguish between real and fake samples:
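A minimal sketch of a Monte Carlo estimate of this objective \(\mathcal{L}({\textcolor{params}{\boldsymbol{\phi}}})\), assuming the discriminator `D` returns logits (names and shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def discriminator_objective(D, S1, S2):
    # Monte Carlo estimate of L(phi): average log D(x) over real samples (S1)
    # plus average log(1 - D(x)) over generated samples (S2); D outputs logits.
    log_D_real = F.logsigmoid(D(S1))      # log D(x) for x in S1
    log_1mD_fake = F.logsigmoid(-D(S2))   # log(1 - D(x)) for x in S2
    return log_D_real.mean() + log_1mD_fake.mean()
```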
Still missing: how do we generate the “fake” samples \(S_2\) from the model \(p({\textcolor{input}{\boldsymbol{x}}}; {\textcolor{params}{\boldsymbol{\theta}}})\)?
Generator:
Latent variable model with a deterministic mapping from a latent variable \({\textcolor{latent}{\boldsymbol{z}}}\) to the data space \({\textcolor{input}{\boldsymbol{x}}}\):
Previously we trained the discriminator to maximize the two-sample test statistic (i.e., correctly classify real and fake samples).
But for the generator, we want to minimize the two-sample test statistic (i.e., make it hard for the discriminator to distinguish between real and fake samples). \[ \min_{\textcolor{params}{\boldsymbol{\theta}}}\max_{\textcolor{params}{\boldsymbol{\phi}}}\mathcal{L}({\textcolor{params}{\boldsymbol{\theta}}}, {\textcolor{params}{\boldsymbol{\phi}}}) := \mathbb{E}_{p_{\text{data}} ({\textcolor{input}{\boldsymbol{x}}})} \log {\textcolor{output}{D}}({\textcolor{params}{\boldsymbol{\phi}}}, {\textcolor{input}{\boldsymbol{x}}}) + \mathbb{E}_{p({\textcolor{latent}{\boldsymbol{z}}})} \log (1 - {\textcolor{output}{D}}({\textcolor{params}{\boldsymbol{\phi}}}, {\textcolor{output}{G}}({\textcolor{params}{\boldsymbol{\theta}}}, {\textcolor{latent}{\boldsymbol{z}}}))) \] This is a minimax optimization problem
This model configuration and training procedure is known as Generative Adversarial Networks (GANs)
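A compact sketch of the resulting alternating training (one ascent step on \({\textcolor{params}{\boldsymbol{\phi}}}\), one descent step on \({\textcolor{params}{\boldsymbol{\theta}}}\)); the network architectures, dimensions and learning rates are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

z_dim, x_dim = 16, 2
G = nn.Sequential(nn.Linear(z_dim, 64), nn.ReLU(), nn.Linear(64, x_dim))  # generator G(theta, z)
D = nn.Sequential(nn.Linear(x_dim, 64), nn.ReLU(), nn.Linear(64, 1))      # discriminator (logits)
opt_G = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

def gan_step(x_real):
    n = len(x_real)
    # discriminator: maximize log D(x) + log(1 - D(G(z)))  (minimize the BCE loss)
    x_fake = G(torch.randn(n, z_dim)).detach()            # generator held fixed
    loss_D = bce(D(x_real), torch.ones(n, 1)) + bce(D(x_fake), torch.zeros(n, 1))
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()
    # generator: minimize log(1 - D(G(z))), the minimax objective above
    # (in practice the non-saturating loss -log D(G(z)) is often used instead)
    x_fake = G(torch.randn(n, z_dim))
    loss_G = F.logsigmoid(-D(x_fake)).mean()
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return loss_D.item(), loss_G.item()
```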
GANs have been successfully applied to several domains and tasks
Pros:
Cons:
Imagine \(f(x, y) = xy\) and we want to compute \(\min_x \max_y f(x, y)\). A gradient descent-ascent algorithm would look like this:
\[ \begin{aligned} x &\gets x - \eta \nabla_x f(x, y) = x - \eta y\\ y &\gets y + \eta \nabla_y f(x, y) = y + \eta x \end{aligned} \]
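Simulating these updates for a few hundred iterations shows the problem: the iterates orbit the equilibrium \((0, 0)\) and, with this discretization, slowly spiral away from it instead of converging (a small illustrative snippet):

```python
import numpy as np

eta = 0.1
x, y = 1.0, 1.0
trajectory = []
for _ in range(200):
    # simultaneous gradient descent-ascent on f(x, y) = x * y
    x, y = x - eta * y, y + eta * x
    trajectory.append((x, y))

radii = [np.hypot(px, py) for px, py in trajectory]
print(radii[0], radii[-1])   # the distance from the equilibrium (0, 0) keeps growing
```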
The generator and discriminator loss keep oscillating during GAN training
Difficult to assess if training is converging or not 
Diffusion models are currently the state-of-the-art generative models, especially for image/video generation tasks.
Key idea: corrupt the data \({\textcolor{input}{\boldsymbol{x}}}\) into noise by adding Gaussian noise in a series of steps, and then learn to reverse this process to generate new samples.
For example:
\[ q({\textcolor{input}{\boldsymbol{x}}}_t \mid {\textcolor{input}{\boldsymbol{x}}}_{t-1}) = {\mathcal{N}}({\textcolor{input}{\boldsymbol{x}}}_t; \sqrt{1 - \beta_t} {\textcolor{input}{\boldsymbol{x}}}_{t-1}, \beta_t {\boldsymbol{I}}) \]
where \(\beta_t\) is a small positive constant that controls the amount of noise added at each step.
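A short sketch of this forward noising process with a linear schedule for \(\beta_t\) (the schedule values and number of steps are illustrative):

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # illustrative linear noise schedule

def forward_diffusion(x0, num_steps=T):
    # iteratively apply q(x_t | x_{t-1}) = N(sqrt(1 - beta_t) x_{t-1}, beta_t I)
    x = x0
    for t in range(num_steps):
        noise = torch.randn_like(x)
        x = torch.sqrt(1 - betas[t]) * x + torch.sqrt(betas[t]) * noise
    return x                                  # for large t, x is approximately N(0, I)
```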
The diffusion process can be viewed as a continuous-time stochastic process, described by a stochastic differential equation (SDE) of the form:
\[ \dd{\textcolor{input}{\boldsymbol{x}}}_t = {\textcolor{output}{\boldsymbol{f}}}({\textcolor{input}{\boldsymbol{x}}}_t, t) \dd t + \textcolor{output}{g}(t) \dd {\boldsymbol{B}}_t \]
where \({\textcolor{output}{\boldsymbol{f}}}(\cdot, t)\) is the drift coefficient, \(\textcolor{output}{g}(t)\) is the diffusion coefficient, and \({\boldsymbol{B}}_t\) is a standard Brownian motion.
The reverse diffusion process is also described by an SDE:
\[ \dd{\textcolor{input}{\boldsymbol{x}}}_t = \left({\textcolor{output}{\boldsymbol{f}}}({\textcolor{input}{\boldsymbol{x}}}_t, t) - \textcolor{output}{g}(t)^2 \nabla_{{\textcolor{input}{\boldsymbol{x}}}_t} \log p({\textcolor{input}{\boldsymbol{x}}}_t)\right) \dd t + \textcolor{output}{g}(t) \dd {\boldsymbol{B}}_t \]
To train the diffusion model, we learn a neural network \({\textcolor{output}{\boldsymbol{s}}}({\textcolor{params}{\boldsymbol{\theta}}}, {\textcolor{input}{\boldsymbol{x}}}, t)\) to approximate the score function \(\nabla_{{\textcolor{input}{\boldsymbol{x}}}_t} \log p({\textcolor{input}{\boldsymbol{x}}}_t)\):
\[ \text{KL}\left(p_0({\textcolor{input}{\boldsymbol{x}}}) \parallel p({\textcolor{input}{\boldsymbol{x}}}; {\textcolor{params}{\boldsymbol{\theta}}})\right) \leq \frac{T}{2}\mathbb{E}_{t \in \mathcal{U}(0, T)}\mathbb{E}_{p_t({\textcolor{input}{\boldsymbol{x}}})}[\lambda(t) \| \nabla_{\textcolor{input}{\boldsymbol{x}}}\log p_t({\textcolor{input}{\boldsymbol{x}}}) - {\textcolor{output}{\boldsymbol{s}}}({\textcolor{params}{\boldsymbol{\theta}}}, {\textcolor{input}{\boldsymbol{x}}}, t) \|_2^2] + \text{KL}\left(p_T({\textcolor{input}{\boldsymbol{x}}}) \parallel p_{\text{prior}}({\textcolor{input}{\boldsymbol{x}}})\right) \]
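Once a score network is trained, samples can be drawn by discretizing the reverse SDE above, e.g. with Euler-Maruyama; in this sketch `score`, `f` and `g` are placeholders for the learned score model and the chosen drift/diffusion coefficients.

```python
import torch

def reverse_sde_sample(score, f, g, shape, T=1.0, num_steps=1000):
    # Euler-Maruyama discretization of the reverse SDE, integrating from t = T down to 0.
    # `score(x, t)` approximates the score; `f(x, t)` and `g(t)` are the drift and
    # diffusion coefficients of the forward SDE (all placeholders here).
    dt = T / num_steps
    x = torch.randn(shape)                     # start from the prior p_T, roughly N(0, I)
    for i in range(num_steps):
        t = T - i * dt
        drift = f(x, t) - g(t) ** 2 * score(x, t)
        x = x - drift * dt + g(t) * (dt ** 0.5) * torch.randn_like(x)
    return x
```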
High-resolution samples starting from low-resolution noise:
Prompt: Produce a stunning, award-winning close-up of a chameleon blending into a background of vibrant, textured leaves, its eye swivelled to look directly at the camera. The intricate texture of its skin changing colour is the focus (visceral adaptation). Abstract dappled light filters through the leaves. Inspired by wildlife macro photography and camouflage patterns.
Prompt: Cinematic shot using a stabilized drone flying dynamically alongside a pod of immense baleen whales as they breach spectacularly in deep offshore waters. The camera maintains a close, dramatic perspective as these colossal creatures launch themselves skyward from the dark blue ocean, creating enormous splashes and showering cascades of water droplets that catch the sunlight. In the background, misty, fjord-like coastlines with dense coniferous forests provide context. The focus expertly tracks the whales, capturing their surprising agility, immense power, and inherent grace. The color palette features the deep blues and greens of the ocean, the brilliant white spray, the dark grey skin of the whales, and the muted tones of the distant wild coastline, conveying the thrilling magnificence of marine megafauna.

Simone Rossi - Advanced Statistical Inference - EURECOM