Advanced Statistical Inference
EURECOM
\[ \require{physics} \definecolor{input}{rgb}{0.42, 0.55, 0.74} \definecolor{params}{rgb}{0.51,0.70,0.40} \definecolor{output}{rgb}{0.843, 0.608, 0} \definecolor{vparams}{rgb}{0.58, 0, 0.83} \definecolor{noise}{rgb}{0.0, 0.48, 0.65} \definecolor{latent}{rgb}{0.8, 0.0, 0.8} \definecolor{function}{rgb}{0.75, 0.75, 0.12} \]
So far we have focused on likelihood-based generative models, where we maximize the likelihood of the data given the model parameters.
Likelihood-based models are powerful, but they have some limitations:
Likelihood-free learning considers alternative training objectives that do not depend directly on a likelihood.

\(S_1 = \{ {\textcolor{input}{\boldsymbol{x}}}_1, \ldots, {\textcolor{input}{\boldsymbol{x}}}_N \mid {\textcolor{input}{\boldsymbol{x}}}_i \sim p({\textcolor{input}{\boldsymbol{x}}}) \}\)

\(S_2 = \{ {\textcolor{input}{\boldsymbol{x}}}_1, \ldots, {\textcolor{input}{\boldsymbol{x}}}_N \mid {\textcolor{input}{\boldsymbol{x}}}_i \sim q({\textcolor{input}{\boldsymbol{x}}}) \}\)
Given a finite set of samples from two distributions (\(S_1\) and \(S_2\)), how can we tell if these samples are from the same distribution or not?
Set up a statistical test to compare the two sets of samples \(S_1\) and \(S_2\):
Null hypothesis \(H_0\): the two sets of samples are drawn from the same distribution, i.e., \(p = q\).
A test statistic \(T\) measures the difference between \(S_1\) and \(S_2\), for example the difference in the empirical means:
\[ T(S_1, S_2) = \frac{1}{N_1} \sum_{{\textcolor{input}{\boldsymbol{x}}}\in S_1} {\textcolor{input}{\boldsymbol{x}}}- \frac{1}{N_2} \sum_{{\textcolor{input}{\boldsymbol{x}}}\in S_2} {\textcolor{input}{\boldsymbol{x}}} \]
If the magnitude of \(T\) is larger than some threshold \(\alpha\), we reject the null hypothesis \(H_0\); otherwise the samples are consistent with \(H_0\).
Key observation: the test statistic is likelihood-free, since it depends only on samples from \(p\) and \(q\) and never on the densities themselves.
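The test above can be sketched in a few lines of NumPy. This is a minimal illustration, not a prescribed procedure: the function names are hypothetical, and the threshold \(\alpha\) is set by a permutation test (an assumption, since the slides do not say how \(\alpha\) is chosen). Note that only samples are used, never a density.

```python
import numpy as np

def mean_difference(s1, s2):
    """T(S1, S2): difference of the empirical means, one entry per dimension."""
    return s1.mean(axis=0) - s2.mean(axis=0)

def permutation_test(s1, s2, n_perm=500, level=0.05, seed=0):
    """Reject H0 when ||T|| exceeds the (1 - level) quantile of ||T|| under random relabelling."""
    rng = np.random.default_rng(seed)
    observed = np.linalg.norm(mean_difference(s1, s2))
    pooled = np.concatenate([s1, s2])
    n1 = len(s1)
    null_stats = np.empty(n_perm)
    for i in range(n_perm):
        # Under H0 the labels are exchangeable, so shuffle them and recompute T.
        perm = rng.permutation(len(pooled))
        null_stats[i] = np.linalg.norm(
            mean_difference(pooled[perm[:n1]], pooled[perm[n1:]])
        )
    threshold = np.quantile(null_stats, 1.0 - level)
    return observed, threshold, bool(observed > threshold)

rng = np.random.default_rng(1)
s1 = rng.normal(0.0, 1.0, size=(500, 2))        # samples from p
s2_shift = rng.normal(1.0, 1.0, size=(500, 2))  # samples from a shifted q
observed, threshold, reject = permutation_test(s1, s2_shift)
print(f"||T|| = {observed:.3f}, threshold = {threshold:.3f}, reject H0: {reject}")
```

With a mean shift of 1 in each coordinate, the observed statistic sits far above the permutation threshold, so \(H_0\) is rejected.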

Assume \(p_\text{data}({\textcolor{input}{\boldsymbol{x}}})\) is the true data distribution and we have training samples \(S_1\)

Assume we have a model \(p({\textcolor{input}{\boldsymbol{x}}}; {\textcolor{params}{\boldsymbol{\theta}}})\) that permits us to generate samples for \(S_2\)
Idea: formulate the training of the generative model to minimize a two-sample test statistic between \(S_1\) and \(S_2\)
In the generative model setup, we know that \(S_1\) and \(S_2\) come from different complex distributions, \(p_\text{data}({\textcolor{input}{\boldsymbol{x}}})\) and \(p({\textcolor{input}{\boldsymbol{x}}}; {\textcolor{params}{\boldsymbol{\theta}}})\). Simple test statistics based on moments (e.g., mean, variance) may not be able to distinguish between the two distributions.
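A small numerical illustration of why moment-based statistics can fail. The two distributions below (a standard Gaussian and a uniform on \([-\sqrt{3}, \sqrt{3}]\)) are chosen for this sketch only; they share the same mean and variance, so a test on those two moments sees almost no difference even though the distributions are clearly not the same.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
s1 = rng.normal(0.0, 1.0, size=n)                      # Gaussian: mean 0, variance 1
s2 = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=n)  # uniform: mean 0, variance 1

def kurtosis(x):
    """Fourth standardized moment (3 for a Gaussian, 1.8 for a uniform)."""
    centered = x - x.mean()
    return np.mean(centered**4) / np.mean(centered**2) ** 2

mean_gap = abs(s1.mean() - s2.mean())          # close to 0: means agree
var_gap = abs(s1.var() - s2.var())             # close to 0: variances agree
kurt_gap = abs(kurtosis(s1) - kurtosis(s2))    # clearly non-zero: shapes differ
print(mean_gap, var_gap, kurt_gap)
```

Picking the right statistic by hand (here, the kurtosis) requires knowing in advance how the distributions differ, which is what motivates learning the statistic.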
Idea: Learn a statistic to automatically identify in what way the two sets of samples differ
How? Train a classifier (which we will call discriminator)
Build binary classifier \(\textcolor{function}{D}({\textcolor{params}{\boldsymbol{\phi}}}, {\textcolor{input}{\boldsymbol{x}}})\) (e.g., neural network with parameters \({\textcolor{params}{\boldsymbol{\phi}}}\)) that tries to distinguish “real” (\(\textcolor{output}{y}= 1\)) samples from the dataset and “fake” (\(\textcolor{output}{y}= 0\)) samples generated from the model
Test statistic: negative loss of the classifier.
Goal: Maximize the two-sample test statistic or equivalently minimize the classification loss.
Do you remember what we used for a binary classification task? A Bernoulli likelihood.
\[ \begin{aligned} \arg\max_{\textcolor{params}{\boldsymbol{\phi}}}\mathcal{L}({\textcolor{params}{\boldsymbol{\phi}}}) :&= \mathbb{E}_{p_{\text{data}} ({\textcolor{input}{\boldsymbol{x}}})} \log \textcolor{function}{D}({\textcolor{params}{\boldsymbol{\phi}}}, {\textcolor{input}{\boldsymbol{x}}}) + \mathbb{E}_{p({\textcolor{input}{\boldsymbol{x}}}; {\textcolor{params}{\boldsymbol{\theta}}})} \log (1 - \textcolor{function}{D}({\textcolor{params}{\boldsymbol{\phi}}}, {\textcolor{input}{\boldsymbol{x}}})) \\ &\approx \frac{1}{N_1} \sum_{{\textcolor{input}{\boldsymbol{x}}}\in S_1} \log \textcolor{function}{D}({\textcolor{params}{\boldsymbol{\phi}}}, {\textcolor{input}{\boldsymbol{x}}}) + \frac{1}{N_2} \sum_{{\textcolor{input}{\boldsymbol{x}}}\in S_2} \log (1 - \textcolor{function}{D}({\textcolor{params}{\boldsymbol{\phi}}}, {\textcolor{input}{\boldsymbol{x}}})) \\ \end{aligned} \]
For a fixed generative model \(p({\textcolor{input}{\boldsymbol{x}}}; {\textcolor{params}{\boldsymbol{\theta}}})\), we can train the discriminator \(\textcolor{function}{D}({\textcolor{params}{\boldsymbol{\phi}}}, {\textcolor{input}{\boldsymbol{x}}})\) to distinguish between real and fake samples:
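As a minimal sketch of this step, the code below maximizes the Monte Carlo estimate of \(\mathcal{L}(\boldsymbol{\phi})\) by gradient ascent. The discriminator is assumed to be a logistic regression rather than a neural network, and the two Gaussian sample sets are stand-ins for \(S_1\) and \(S_2\); all names are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def discriminator(phi, x):
    """D(phi, x): probability that x is real, here a logistic regression."""
    w, b = phi
    return sigmoid(x @ w + b)

def objective(phi, s1, s2, eps=1e-12):
    """Monte Carlo estimate of L(phi) from real samples s1 and fake samples s2."""
    return (np.mean(np.log(discriminator(phi, s1) + eps))
            + np.mean(np.log(1.0 - discriminator(phi, s2) + eps)))

def train_discriminator(s1, s2, lr=0.1, n_steps=500):
    """Gradient ascent on L(phi) with the generator (and hence s2) held fixed."""
    w, b = np.zeros(s1.shape[1]), 0.0
    for _ in range(n_steps):
        d_real = discriminator((w, b), s1)
        d_fake = discriminator((w, b), s2)
        # d/da log(sigmoid(a)) = 1 - sigmoid(a); d/da log(1 - sigmoid(a)) = -sigmoid(a)
        grad_w = (((1.0 - d_real)[:, None] * s1).mean(axis=0)
                  - (d_fake[:, None] * s2).mean(axis=0))
        grad_b = (1.0 - d_real).mean() - d_fake.mean()
        w, b = w + lr * grad_w, b + lr * grad_b
    return w, b

rng = np.random.default_rng(0)
s1 = rng.normal(+1.0, 1.0, size=(1000, 2))   # stand-in for real data (y = 1)
s2 = rng.normal(-1.0, 1.0, size=(1000, 2))   # stand-in for generated data (y = 0)
phi = train_discriminator(s1, s2)
print("objective before:", objective((np.zeros(2), 0.0), s1, s2))
print("objective after: ", objective(phi, s1, s2))
```

An untrained discriminator that outputs 0.5 everywhere scores \(2 \log 0.5\); training raises the objective, i.e. the two-sample statistic grows as the classifier gets better at separating the sets.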
Still missing: how do we generate the “fake” samples \(S_2\) from the model \(p({\textcolor{input}{\boldsymbol{x}}}; {\textcolor{params}{\boldsymbol{\theta}}})\)?
Generator:
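Latent variable model with a deterministic mapping \(\textcolor{function}{G}\) from a latent variable \({\textcolor{latent}{\boldsymbol{z}}} \sim p({\textcolor{latent}{\boldsymbol{z}}})\) to the data space: \({\textcolor{input}{\boldsymbol{x}}} = \textcolor{function}{G}({\textcolor{params}{\boldsymbol{\theta}}}, {\textcolor{latent}{\boldsymbol{z}}})\).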
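A minimal sketch of such a generator, assuming a standard Gaussian latent and an affine map in place of a neural network (both are illustrative choices, and the names are hypothetical). Sampling only requires drawing \(\boldsymbol{z}\) and applying the map; no density of \(\boldsymbol{x}\) is ever evaluated.

```python
import numpy as np

def generator(theta, z):
    """G(theta, z): deterministic map from latent z to data space, here affine."""
    weights, bias = theta
    return z @ weights + bias

def sample_fake(theta, n_samples, latent_dim, rng):
    """Draw z ~ N(0, I) and push it through the generator to obtain S_2."""
    z = rng.standard_normal((n_samples, latent_dim))
    return generator(theta, z)

rng = np.random.default_rng(0)
theta = (np.array([[2.0, 0.0], [0.0, 0.5]]), np.array([1.0, -1.0]))
s2 = sample_fake(theta, n_samples=5000, latent_dim=2, rng=rng)
print(s2.mean(axis=0), s2.std(axis=0))
```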
Previously we trained the discriminator to maximize the two-sample test statistic (i.e., correctly classify real and fake samples).
But for the generator, we want to minimize the two-sample test statistic (i.e., make it hard for the discriminator to distinguish between real and fake samples).
\[ \min_{\textcolor{params}{\boldsymbol{\theta}}}\max_{\textcolor{params}{\boldsymbol{\phi}}}\mathcal{L}({\textcolor{params}{\boldsymbol{\theta}}}, {\textcolor{params}{\boldsymbol{\phi}}}) := \mathbb{E}_{p_{\text{data}} ({\textcolor{input}{\boldsymbol{x}}})} \log \textcolor{function}{D}({\textcolor{params}{\boldsymbol{\phi}}}, {\textcolor{input}{\boldsymbol{x}}}) + \mathbb{E}_{p({\textcolor{latent}{\boldsymbol{z}}})} \log (1 - \textcolor{function}{D}({\textcolor{params}{\boldsymbol{\phi}}}, \textcolor{function}{G}({\textcolor{params}{\boldsymbol{\theta}}}, {\textcolor{latent}{\boldsymbol{z}}}))) \]
This is a minimax optimization problem.
This combination of model and training procedure is known as Generative Adversarial Networks (GANs).
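The alternating structure of the minimax problem can be sketched on a one-dimensional toy. Everything here is an assumption made for illustration: the generator is a pure shift \(G(\theta, z) = \theta + z\), the discriminator is a logistic regression with \(\boldsymbol{\phi} = (\phi_0, \phi_1)\), gradients are derived by hand, and the step sizes and step counts are arbitrary. The sketch makes no claim about convergence.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def objective(theta, phi, x_real, z):
    """Monte Carlo estimate of L(theta, phi) for the toy G and D described above."""
    d_real = sigmoid(phi[0] + phi[1] * x_real)
    d_fake = sigmoid(phi[0] + phi[1] * (theta + z))
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

def gradients(theta, phi, x_real, z):
    """Analytic gradients of the estimate above with respect to theta and phi."""
    x_fake = theta + z
    d_real = sigmoid(phi[0] + phi[1] * x_real)
    d_fake = sigmoid(phi[0] + phi[1] * x_fake)
    grad_phi = np.array([
        np.mean(1.0 - d_real) - np.mean(d_fake),
        np.mean((1.0 - d_real) * x_real) - np.mean(d_fake * x_fake),
    ])
    grad_theta = -phi[1] * np.mean(d_fake)
    return grad_theta, grad_phi

def train(x_real, rng, n_iters=200, n_disc_steps=5, lr=0.05, batch=256):
    theta, phi = 0.0, np.zeros(2)
    for _ in range(n_iters):
        for _ in range(n_disc_steps):       # inner max over phi: gradient ascent
            z = rng.standard_normal(batch)
            _, grad_phi = gradients(theta, phi, x_real, z)
            phi = phi + lr * grad_phi
        z = rng.standard_normal(batch)      # outer min over theta: gradient descent
        grad_theta, _ = gradients(theta, phi, x_real, z)
        theta = theta - lr * grad_theta
    return theta, phi

rng = np.random.default_rng(0)
x_real = rng.normal(2.0, 1.0, size=1000)    # toy data distribution
theta, phi = train(x_real, rng)
print("theta:", theta, "phi:", phi)
```

The only difference between the two players is the sign of the update: the discriminator ascends the same objective that the generator descends.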
GANs have been successfully applied to several domains and tasks
Pros:
Cons:
Imagine \(f(x, y) = xy\) and we want to compute \(\min_x \max_y f(x, y)\). A gradient descent-ascent algorithm would look like this:
\[ \begin{aligned} x &\gets x - \eta \nabla_x f(x, y) = x - \eta y\\ y &\gets y + \eta \nabla_y f(x, y) = y + \eta x \end{aligned} \]
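Running these updates numerically shows the problem. The sketch below applies both updates simultaneously, starting from an arbitrary point \((1, 1)\) with an arbitrary step size \(\eta = 0.1\). A short calculation gives \((x - \eta y)^2 + (y + \eta x)^2 = (1 + \eta^2)(x^2 + y^2)\), so each step multiplies the distance to the saddle point \((0, 0)\) by \(\sqrt{1 + \eta^2}\): the iterates spiral outward instead of converging.

```python
import numpy as np

def gradient_descent_ascent(x0, y0, lr=0.1, n_steps=100):
    """Simultaneous updates x <- x - lr * y, y <- y + lr * x for f(x, y) = x * y."""
    x, y = x0, y0
    trajectory = [(x, y)]
    for _ in range(n_steps):
        x, y = x - lr * y, y + lr * x   # both updates use the old (x, y)
        trajectory.append((x, y))
    return np.array(trajectory)

trajectory = gradient_descent_ascent(1.0, 1.0)
radius = np.linalg.norm(trajectory, axis=1)   # distance from the saddle point (0, 0)
print("start radius:", radius[0], "end radius:", radius[-1])
```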
The generator and discriminator losses keep oscillating during GAN training
Difficult to assess if training is converging or not
Diffusion models are currently the state-of-the-art generative models, especially for image and video generation tasks.
Key idea: corrupt the data \({\textcolor{input}{\boldsymbol{x}}}\) into noise by adding Gaussian noise in a series of steps, and then learn to reverse this process to generate new samples.
For example:
\[ q({\textcolor{input}{\boldsymbol{x}}}_t \mid {\textcolor{input}{\boldsymbol{x}}}_{t-1}) = {\mathcal{N}}({\textcolor{input}{\boldsymbol{x}}}_t; \sqrt{1 - \beta_t} {\textcolor{input}{\boldsymbol{x}}}_{t-1}, \beta_t {\boldsymbol{I}}) \]
where \(\beta_t\) is a small positive constant that controls the amount of noise added at each step.
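A minimal sketch of this forward (noising) chain. The linear schedule for \(\beta_t\), the number of steps, and the toy data are assumptions chosen for illustration. After enough steps the samples lose almost all memory of the data and look like draws from a standard Gaussian.

```python
import numpy as np

def forward_step(x_prev, beta_t, rng):
    """Sample x_t ~ N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I)."""
    noise = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * noise

def forward_chain(x0, betas, rng):
    """Apply the noising step once per entry of betas and return the final x_T."""
    x = x0
    for beta_t in betas:
        x = forward_step(x, beta_t, rng)
    return x

rng = np.random.default_rng(0)
x0 = rng.normal(5.0, 0.1, size=20_000)    # toy "data": tightly clustered around 5
betas = np.linspace(1e-4, 0.02, 1000)     # assumed linear noise schedule
x_T = forward_chain(x0, betas, rng)
print("x_0 mean/std:", x0.mean(), x0.std())
print("x_T mean/std:", x_T.mean(), x_T.std())
```

The factor \(\sqrt{1 - \beta_t}\) shrinks the signal while \(\beta_t\) of fresh variance is added, which keeps the total variance bounded as the chain runs.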
The diffusion process can be viewed as a continuous-time stochastic process, described by a stochastic differential equation (SDE) of the form:
\[ \dd{\textcolor{input}{\boldsymbol{x}}}_t = {\textcolor{function}{\boldsymbol{f}}}({\textcolor{input}{\boldsymbol{x}}}_t, t) \dd t + g(t) \dd {\boldsymbol{B}}_t \]
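An SDE of this form can be simulated with the Euler-Maruyama scheme. The sketch below assumes a specific choice, \(\boldsymbol{f}(\boldsymbol{x}, t) = -\tfrac{1}{2}\beta \boldsymbol{x}\) and \(g(t) = \sqrt{\beta}\) with constant \(\beta\), which is a continuous-time analogue of the discrete noising step; the slides leave \(\boldsymbol{f}\) and \(g\) generic.

```python
import numpy as np

def euler_maruyama(x0, drift, diffusion, t_end, n_steps, rng):
    """Simulate dx = drift(x, t) dt + diffusion(t) dB from t = 0 to t_end."""
    dt = t_end / n_steps
    x = np.array(x0, dtype=float)
    for k in range(n_steps):
        t = k * dt
        dB = np.sqrt(dt) * rng.standard_normal(x.shape)   # Brownian increment ~ N(0, dt)
        x = x + drift(x, t) * dt + diffusion(t) * dB
    return x

beta = 1.0                                  # assumed constant noise rate

def drift(x, t):
    return -0.5 * beta * x                  # f(x, t)

def diffusion(t):
    return np.sqrt(beta)                    # g(t)

rng = np.random.default_rng(0)
x0 = rng.normal(3.0, 0.2, size=20_000)      # toy data
x_T = euler_maruyama(x0, drift, diffusion, t_end=10.0, n_steps=1000, rng=rng)
print("x_T mean/std:", x_T.mean(), x_T.std())
```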
The reverse diffusion process is also described by an SDE:
\[ \dd{\textcolor{input}{\boldsymbol{x}}}_t = \left({\textcolor{function}{\boldsymbol{f}}}({\textcolor{input}{\boldsymbol{x}}}_t, t) - g(t)^2 \nabla_{{\textcolor{input}{\boldsymbol{x}}}_t} \log p({\textcolor{input}{\boldsymbol{x}}}_t)\right) \dd t + g(t) \dd {\boldsymbol{B}}_t \]
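To see the reverse SDE generate samples, the sketch below uses a toy case where the score is known exactly. The assumptions are: one-dimensional Gaussian data \(\mathcal{N}(2, 0.5^2)\), and a forward SDE with \(\boldsymbol{f} = -\tfrac{1}{2}\beta \boldsymbol{x}\), \(g = \sqrt{\beta}\), so that \(p_t\) stays Gaussian with a closed-form mean and variance. The SDE is integrated backward in time, from \(t = T\) to \(t = 0\), starting from standard Gaussian noise. In an actual diffusion model the exact score would be replaced by a learned network.

```python
import numpy as np

beta, t_end = 1.0, 10.0          # assumed forward SDE: f = -beta * x / 2, g = sqrt(beta)
data_mean, data_std = 2.0, 0.5   # toy data distribution p_0 = N(2, 0.5^2)

def score(x, t):
    """Exact score of p_t, available here because p_t stays Gaussian."""
    decay = np.exp(-beta * t)
    mean_t = data_mean * np.sqrt(decay)
    var_t = data_std**2 * decay + (1.0 - decay)
    return -(x - mean_t) / var_t

def reverse_sde_sample(n_samples, n_steps, rng):
    """Integrate the reverse SDE from t = t_end down to t = 0 with Euler-Maruyama."""
    dt = t_end / n_steps
    x = rng.standard_normal(n_samples)                  # start from the prior N(0, 1)
    for k in range(n_steps):
        t = t_end - k * dt
        reverse_drift = -0.5 * beta * x - beta * score(x, t)   # f - g^2 * score
        noise = rng.standard_normal(n_samples)
        # Stepping backward in time flips the sign of the drift term.
        x = x - reverse_drift * dt + np.sqrt(beta * dt) * noise
    return x

rng = np.random.default_rng(0)
samples = reverse_sde_sample(n_samples=20_000, n_steps=1000, rng=rng)
print("sample mean/std:", samples.mean(), samples.std())
```

The generated samples should have mean and standard deviation close to those of the toy data, up to discretization and Monte Carlo error.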
To train the diffusion model, we learn a neural network \({\boldsymbol{s}}({\textcolor{params}{\boldsymbol{\theta}}}, \cdots)\) to approximate the score function \(\nabla_{{\textcolor{input}{\boldsymbol{x}}}_t} \log p({\textcolor{input}{\boldsymbol{x}}}_t)\):
\[ \mathrm{KL}\left( p_0({\textcolor{input}{\boldsymbol{x}}}) \| p({\textcolor{input}{\boldsymbol{x}}}; {\textcolor{params}{\boldsymbol{\theta}}}) \right) \leq \frac{T}{2}\mathbb{E}_{t \sim \mathcal{U}(0, T)}\mathbb{E}_{p_t({\textcolor{input}{\boldsymbol{x}}})}[\lambda(t) \| \nabla_{\textcolor{input}{\boldsymbol{x}}}\log p_t({\textcolor{input}{\boldsymbol{x}}}) - {\boldsymbol{s}}({\textcolor{params}{\boldsymbol{\theta}}}, {\textcolor{input}{\boldsymbol{x}}}, t) \|_2^2] + \mathrm{KL}\left( p_T({\textcolor{input}{\boldsymbol{x}}}) \| p_{\text{prior}}({\textcolor{input}{\boldsymbol{x}}}) \right) \]
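The first term of the bound can be estimated by Monte Carlo: draw \(t\) uniformly, draw \(\boldsymbol{x}\) from \(p_t\), and average the weighted squared error between the true and modelled scores. The sketch below does this on a toy problem where the true score is known in closed form (Gaussian data with \(\boldsymbol{f} = -\tfrac{1}{2}\beta \boldsymbol{x}\), \(g = \sqrt{\beta}\)). The one-parameter score model and the choice \(\lambda(t) = 1\) are hypothetical, picked only to keep the example short. For real data the true score is not available, so this direct form cannot be computed as written.

```python
import numpy as np

beta, t_end = 1.0, 10.0          # assumed forward SDE: f = -beta * x / 2, g = sqrt(beta)
data_mean, data_std = 2.0, 0.5   # toy data distribution p_0 = N(2, 0.5^2)

def marginal(t):
    """Mean and variance of the Gaussian marginal p_t."""
    decay = np.exp(-beta * t)
    return data_mean * np.sqrt(decay), data_std**2 * decay + (1.0 - decay)

def true_score(x, t):
    mean_t, var_t = marginal(t)
    return -(x - mean_t) / var_t

def model_score(theta, x, t):
    """Hypothetical one-parameter score model: theta is its guess of the data mean."""
    decay = np.exp(-beta * t)
    var_t = data_std**2 * decay + (1.0 - decay)
    return -(x - theta * np.sqrt(decay)) / var_t

def score_matching_loss(theta, n_samples, rng, weight=np.ones_like):
    """Estimate (T / 2) * E_t E_{p_t}[ lambda(t) * ||true score - model score||^2 ]."""
    t = rng.uniform(0.0, t_end, size=n_samples)                    # t ~ U(0, T)
    mean_t, var_t = marginal(t)
    x = mean_t + np.sqrt(var_t) * rng.standard_normal(n_samples)   # x ~ p_t
    diff = true_score(x, t) - model_score(theta, x, t)
    return 0.5 * t_end * np.mean(weight(t) * diff**2)

for theta in (0.0, 1.0, 2.0):
    loss = score_matching_loss(theta, 50_000, np.random.default_rng(0))
    print(f"theta = {theta}: loss = {loss:.4f}")
```

The loss vanishes when the model score matches the true score (here at \(\theta = 2\)) and grows as \(\theta\) moves away from it.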
High-resolution samples starting from low-resolution noise:
Prompt: Produce a stunning, award-winning close-up of a chameleon blending into a background of vibrant, textured leaves, its eye swivelled to look directly at the camera. The intricate texture of its skin changing colour is the focus (visceral adaptation). Abstract dappled light filters through the leaves. Inspired by wildlife macro photography and camouflage patterns.
Prompt: Cinematic shot using a stabilized drone flying dynamically alongside a pod of immense baleen whales as they breach spectacularly in deep offshore waters. The camera maintains a close, dramatic perspective as these colossal creatures launch themselves skyward from the dark blue ocean, creating enormous splashes and showering cascades of water droplets that catch the sunlight. In the background, misty, fjord-like coastlines with dense coniferous forests provide context. The focus expertly tracks the whales, capturing their surprising agility, immense power, and inherent grace. The color palette features the deep blues and greens of the ocean, the brilliant white spray, the dark grey skin of the whales, and the muted tones of the distant wild coastline, conveying the thrilling magnificence of marine megafauna.