Advanced Statistical Inference
EURECOM
\[ \require{physics} \definecolor{input}{rgb}{0.42, 0.55, 0.74} \definecolor{params}{rgb}{0.51,0.70,0.40} \definecolor{output}{rgb}{0.843, 0.608, 0} \definecolor{vparams}{rgb}{0.58, 0, 0.83} \definecolor{noise}{rgb}{0.0, 0.48, 0.65} \definecolor{latent}{rgb}{0.8, 0.0, 0.8} \]
Markov Chain Monte Carlo (MCMC): use randomness to generate samples from a distribution
History
The Metropolis algorithm originated at Los Alamos (Metropolis et al., 1953) and was later generalized by Hastings (1970) to asymmetric proposal distributions, giving the Metropolis-Hastings (MH) algorithm.
MH algorithm produces a sequence of samples \({\textcolor{params}{\boldsymbol{\theta}}}^{(1)}, {\textcolor{params}{\boldsymbol{\theta}}}^{(2)}, \ldots, {\textcolor{params}{\boldsymbol{\theta}}}^{(T)}\) that converges to the target distribution \(p({\textcolor{params}{\boldsymbol{\theta}}}\mid {\textcolor{output}{\boldsymbol{y}}})\).
Properties: the samples form a Markov chain, so consecutive samples are correlated; the stationary distribution of the chain is the target posterior \(p({\textcolor{params}{\boldsymbol{\theta}}}\mid {\textcolor{output}{\boldsymbol{y}}})\).
How to generate the sequence of samples \({\textcolor{params}{\boldsymbol{\theta}}}^{(1)}, {\textcolor{params}{\boldsymbol{\theta}}}^{(2)}, \ldots, {\textcolor{params}{\boldsymbol{\theta}}}^{(T)}\)?
Initialization: Start with an initial value \({\textcolor{params}{\boldsymbol{\theta}}}^{(1)}\).
Proposal Distribution: Generate a candidate sample \({\textcolor{params}{\boldsymbol{\theta}}}'\) from a proposal distribution \(q({\textcolor{params}{\boldsymbol{\theta}}}' \mid {\textcolor{params}{\boldsymbol{\theta}}}^{(t)})\).
Accept/Reject: Accept the candidate sample with probability given by the acceptance criterion below; otherwise keep the current sample.
Repeat: Repeat steps 2-3 until we have \(T\) samples.
\(q({\textcolor{params}{\boldsymbol{\theta}}}' \mid {\textcolor{params}{\boldsymbol{\theta}}}^{(t)})\) is a distribution that generates candidate samples \({\textcolor{params}{\boldsymbol{\theta}}}'\) given the current sample \({\textcolor{params}{\boldsymbol{\theta}}}^{(t)}\).
We are free to choose any distribution as the proposal distribution, as long as its support includes that of the target distribution.
Symmetric Proposal Distribution: \(q({\textcolor{params}{\boldsymbol{\theta}}}' \mid {\textcolor{params}{\boldsymbol{\theta}}}^{(t)}) = q({\textcolor{params}{\boldsymbol{\theta}}}^{(t)} \mid {\textcolor{params}{\boldsymbol{\theta}}}')\).
Random Walk Proposal: \(q({\textcolor{params}{\boldsymbol{\theta}}}' \mid {\textcolor{params}{\boldsymbol{\theta}}}^{(t)}) = {\mathcal{N}}({\textcolor{params}{\boldsymbol{\theta}}}^{(t)}, {\boldsymbol{\Sigma}})\).
Example: a wide random-walk proposal,
\[ {\boldsymbol{\Sigma}}= \mqty(1 & 0 \\ 0 & 1), \]
versus a narrow one,
\[ {\boldsymbol{\Sigma}}= \mqty(0.1 & 0 \\ 0 & 0.1). \]
How to decide whether to accept or reject the candidate sample \({\textcolor{params}{\boldsymbol{\theta}}}'\)?
In the general case, acceptance is based on the ratio:
\[ \alpha = \frac{p({\textcolor{params}{\boldsymbol{\theta}}}' \mid {\textcolor{output}{\boldsymbol{y}}})\,q({\textcolor{params}{\boldsymbol{\theta}}}^{(t)} \mid {\textcolor{params}{\boldsymbol{\theta}}}')}{p({\textcolor{params}{\boldsymbol{\theta}}}^{(t)} \mid {\textcolor{output}{\boldsymbol{y}}})\,q({\textcolor{params}{\boldsymbol{\theta}}}' \mid {\textcolor{params}{\boldsymbol{\theta}}}^{(t)})} \]
If the proposal distribution is symmetric, the proposal terms cancel:
\[ \alpha = \frac{p({\textcolor{params}{\boldsymbol{\theta}}}' \mid {\textcolor{output}{\boldsymbol{y}}})}{p({\textcolor{params}{\boldsymbol{\theta}}}^{(t)} \mid {\textcolor{output}{\boldsymbol{y}}})} = \frac{p({\textcolor{output}{\boldsymbol{y}}}\mid {\textcolor{params}{\boldsymbol{\theta}}}')p({\textcolor{params}{\boldsymbol{\theta}}}')/Z}{p({\textcolor{output}{\boldsymbol{y}}}\mid {\textcolor{params}{\boldsymbol{\theta}}}^{(t)})p({\textcolor{params}{\boldsymbol{\theta}}}^{(t)})/Z} = \frac{p({\textcolor{output}{\boldsymbol{y}}}\mid {\textcolor{params}{\boldsymbol{\theta}}}')p({\textcolor{params}{\boldsymbol{\theta}}}')}{p({\textcolor{output}{\boldsymbol{y}}}\mid {\textcolor{params}{\boldsymbol{\theta}}}^{(t)})p({\textcolor{params}{\boldsymbol{\theta}}}^{(t)})} \]
Tip
Compute the acceptance ratio in log-space for numerical stability.
Intuition: moves towards higher-density regions (\(\alpha \geq 1\)) are always accepted; moves towards lower-density regions are accepted with probability \(\alpha\).
Steps: draw \(u \sim \text{Uniform}(0, 1)\); if \(u \leq \min(1, \alpha)\), set \({\textcolor{params}{\boldsymbol{\theta}}}^{(t+1)} = {\textcolor{params}{\boldsymbol{\theta}}}'\); otherwise set \({\textcolor{params}{\boldsymbol{\theta}}}^{(t+1)} = {\textcolor{params}{\boldsymbol{\theta}}}^{(t)}\) (see the sketch below).
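A minimal random-walk Metropolis sketch in Python (NumPy), with the acceptance ratio computed in log-space as suggested above; `log_target`, `theta0`, and `step` are illustrative placeholders, and `log_target` stands for any unnormalized log-density \(\log p({\textcolor{output}{\boldsymbol{y}}}\mid{\textcolor{params}{\boldsymbol{\theta}}}) + \log p({\textcolor{params}{\boldsymbol{\theta}}})\).

```python
import numpy as np

def metropolis(log_target, theta0, n_samples, step=0.5, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    theta = np.atleast_1d(np.asarray(theta0, dtype=float))
    samples = np.empty((n_samples, theta.size))
    log_p = log_target(theta)
    for t in range(n_samples):
        # Symmetric random-walk proposal: theta' ~ N(theta, step^2 I)
        proposal = theta + step * rng.standard_normal(theta.size)
        log_p_prop = log_target(proposal)
        # Log acceptance ratio; the normalization constant Z cancels
        log_alpha = log_p_prop - log_p
        if np.log(rng.uniform()) < log_alpha:
            theta, log_p = proposal, log_p_prop  # accept
        samples[t] = theta  # on rejection, the current sample is repeated
    return samples

# Example: sample from a standard 2D Gaussian target
chain = metropolis(lambda th: -0.5 * np.sum(th**2), np.zeros(2), 5000)
```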
MCMC algorithms converge to the target distribution as \(T \to \infty\).
In practice, we need to decide when to stop the algorithm, which requires assessing its convergence.
Burn-in Period: Discard the initial samples to allow the chain to converge to the target distribution.
Visual Inspection: Plot the trace of the samples and check for convergence.
Convergence diagnostics: compute metrics (e.g., \(\hat{R}\)) to assess convergence.
Discard (burn) some initial samples to allow the chain to mix and converge to the target distribution.
Good practice: Run multiple chains from different starting points and plot the trace of the samples.
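A minimal trace-plot sketch for such a visual inspection, assuming the chains are stacked in an array `chains` of shape `(n_chains, n_samples)` and matplotlib is available:

```python
import matplotlib.pyplot as plt

def plot_traces(chains):
    # One line per chain; well-mixed chains overlap and look stationary
    for j, chain in enumerate(chains):
        plt.plot(chain, alpha=0.7, label=f"chain {j}")
    plt.xlabel("iteration")
    plt.ylabel(r"$\theta$")
    plt.legend()
    plt.show()
```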
Intuition: If multiple chains have converged, in-chain variance should be similar to between-chain variance.
Potential Scale Reduction Factor (\(\hat{R}\)):
\[ \hat{R} = \sqrt{\frac{\text{between-chain variance}}{\text{in-chain variance}}} \]
In practice: run several chains, compute \(\hat{R}\), and treat values close to 1 (e.g., \(\hat{R} < 1.1\)) as evidence of convergence.
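A sketch of this simplified diagnostic, again assuming `chains` has shape `(n_chains, n_samples)` for one parameter dimension; production implementations (e.g., ArviZ) use a more refined split-\(\hat{R}\).

```python
import numpy as np

def r_hat(chains):
    m, n = chains.shape
    # Between-chain variance: n times the variance of the per-chain means
    between = n * chains.mean(axis=1).var(ddof=1)
    # Within-chain variance: average of the per-chain sample variances
    within = chains.var(axis=1, ddof=1).mean()
    return np.sqrt(between / within)
```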
Hamiltonian Monte Carlo (HMC): A more sophisticated MCMC algorithm that uses the concept of Hamiltonian dynamics to propose better samples.
Idea:
Use gradient information to guide local exploration of the target distribution.
Propose samples by simulating the dynamics of a particle moving in a potential energy landscape.
A particle moving in a potential energy landscape \(\mathcal{U}({\textcolor{params}{\boldsymbol{\theta}}})\) with momentum \(\textcolor{vparams}{{\boldsymbol{\rho}}}\).
Hamiltonian:
\[ \mathcal{H}({\textcolor{params}{\boldsymbol{\theta}}}, \textcolor{vparams}{{\boldsymbol{\rho}}}) = \mathcal{U}({\textcolor{params}{\boldsymbol{\theta}}}) + \mathcal{K}(\textcolor{vparams}{{\boldsymbol{\rho}}}) \]
We consider: the potential energy \(\mathcal{U}({\textcolor{params}{\boldsymbol{\theta}}}) = -\log p({\textcolor{params}{\boldsymbol{\theta}}}\mid {\textcolor{output}{\boldsymbol{y}}})\) and the kinetic energy \(\mathcal{K}(\textcolor{vparams}{{\boldsymbol{\rho}}}) = \frac{1}{2}\textcolor{vparams}{{\boldsymbol{\rho}}}^\top{\boldsymbol{M}}^{-1}\textcolor{vparams}{{\boldsymbol{\rho}}}\), with mass matrix \({\boldsymbol{M}}\).
In the absence of friction, the Hamiltonian is conserved.
The trajectory of the particle in the potential energy landscape is governed by the Hamiltonian equations:
\[ \begin{aligned} \frac{\dd {\textcolor{params}{\boldsymbol{\theta}}}}{\dd t} &= \grad_{\textcolor{vparams}{{\boldsymbol{\rho}}}}\mathcal{H}( {\textcolor{params}{\boldsymbol{\theta}}}, \textcolor{vparams}{{\boldsymbol{\rho}}}) = \grad_{\textcolor{vparams}{{\boldsymbol{\rho}}}}\mathcal{K}(\textcolor{vparams}{{\boldsymbol{\rho}}}) \\ \frac{\dd {\color{vparams}{{\boldsymbol{\rho}}}}}{\dd t} &= -\grad_{{\textcolor{params}{\boldsymbol{\theta}}}}\mathcal{H}({\textcolor{params}{\boldsymbol{\theta}}}, \textcolor{vparams}{{\boldsymbol{\rho}}}) = -\grad_{{\textcolor{params}{\boldsymbol{\theta}}}}\mathcal{U}({\textcolor{params}{\boldsymbol{\theta}}}) \end{aligned} \]
The energy is conserved: \(\frac{\dd {{\mathcal{H}}}}{\dd t} = 0\).
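Plugging the Hamiltonian equations into the chain rule makes this explicit:
\[ \frac{\dd {\mathcal{H}}}{\dd t} = \grad_{{\textcolor{params}{\boldsymbol{\theta}}}}\mathcal{H} \cdot \frac{\dd {\textcolor{params}{\boldsymbol{\theta}}}}{\dd t} + \grad_{\textcolor{vparams}{{\boldsymbol{\rho}}}}\mathcal{H} \cdot \frac{\dd {\color{vparams}{{\boldsymbol{\rho}}}}}{\dd t} = \grad_{{\textcolor{params}{\boldsymbol{\theta}}}}\mathcal{H} \cdot \grad_{\textcolor{vparams}{{\boldsymbol{\rho}}}}\mathcal{H} - \grad_{\textcolor{vparams}{{\boldsymbol{\rho}}}}\mathcal{H} \cdot \grad_{{\textcolor{params}{\boldsymbol{\theta}}}}\mathcal{H} = 0 \]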
Important
To simulate the dynamics of the particle, we don’t need to know the normalization constant \(Z\) for \({\mathcal{U}}({\textcolor{params}{\boldsymbol{\theta}}})= -\log p({\textcolor{params}{\boldsymbol{\theta}}}\mid {\textcolor{output}{\boldsymbol{y}}})\).
Why? Because the dynamics only depend on the gradient of the log-density \(\grad_{{\textcolor{params}{\boldsymbol{\theta}}}}\log p({\textcolor{params}{\boldsymbol{\theta}}}\mid {\textcolor{output}{\boldsymbol{y}}}) = \grad_{{\textcolor{params}{\boldsymbol{\theta}}}}[ \log p({\textcolor{output}{\boldsymbol{y}}}\mid{\textcolor{params}{\boldsymbol{\theta}}}) + \log p({\textcolor{params}{\boldsymbol{\theta}}}) - \log Z]\).
\[ \begin{aligned} \frac{\dd {\textcolor{params}{\boldsymbol{\theta}}}}{\dd t} &= \grad_{\textcolor{vparams}{{\boldsymbol{\rho}}}}\mathcal{K}(\textcolor{vparams}{{\boldsymbol{\rho}}}) \\ \frac{\dd {\color{vparams}{{\boldsymbol{\rho}}}}}{\dd t} &= -\grad_{{\textcolor{params}{\boldsymbol{\theta}}}}\mathcal{U}({\textcolor{params}{\boldsymbol{\theta}}}) \end{aligned} \]
Solving these equations exactly is generally infeasible, but we can use numerical methods to approximate the solution.
Leapfrog Integrator:
Start with \({\textcolor{params}{\boldsymbol{\theta}}}, \textcolor{vparams}{{\boldsymbol{\rho}}}\).
Repeat for \(L\) steps (with step size \(\epsilon\)):
\[ \begin{aligned} \textcolor{vparams}{{\boldsymbol{\rho}}}_{1/2} &= \textcolor{vparams}{{\boldsymbol{\rho}}}- \frac{\epsilon}{2} \grad_{{\textcolor{params}{\boldsymbol{\theta}}}}\mathcal{U}({\textcolor{params}{\boldsymbol{\theta}}}) \\ {\textcolor{params}{\boldsymbol{\theta}}}' &= {\textcolor{params}{\boldsymbol{\theta}}}+ \epsilon \, {\boldsymbol{M}}^{-1} \textcolor{vparams}{{\boldsymbol{\rho}}}_{1/2} \\ \textcolor{vparams}{{\boldsymbol{\rho}}}' &= \textcolor{vparams}{{\boldsymbol{\rho}}}_{1/2} - \frac{\epsilon}{2} \grad_{{\textcolor{params}{\boldsymbol{\theta}}}}\mathcal{U}({\textcolor{params}{\boldsymbol{\theta}}}') \end{aligned} \]
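A minimal leapfrog sketch, assuming an identity mass matrix \({\boldsymbol{M}} = {\boldsymbol{I}}\) and NumPy float arrays; `grad_U` is a placeholder for \(\grad_{{\textcolor{params}{\boldsymbol{\theta}}}}\mathcal{U}\).

```python
import numpy as np

def leapfrog(theta, rho, grad_U, eps, n_steps):
    theta, rho = theta.copy(), rho.copy()
    rho -= 0.5 * eps * grad_U(theta)      # initial momentum half-step
    for step in range(n_steps):
        theta += eps * rho                # position full step (M = I)
        if step < n_steps - 1:
            rho -= eps * grad_U(theta)    # two consecutive half-steps merged
    rho -= 0.5 * eps * grad_U(theta)      # final momentum half-step
    return theta, rho
```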
\[ p({\textcolor{params}{\boldsymbol{\theta}}}, {\color{vparams}{{\boldsymbol{\rho}}}}) = \frac 1 {Z_H} \exp(-\mathcal{H}({\textcolor{params}{\boldsymbol{\theta}}}, {\color{vparams}{{\boldsymbol{\rho}}}})) = \frac 1{Z_H} \exp\left(-\mathcal{U}({\textcolor{params}{\boldsymbol{\theta}}}) - \frac{1}{2}{\color{vparams}{{\boldsymbol{\rho}}}}^\top{\boldsymbol{M}}^{-1}{\color{vparams}{{\boldsymbol{\rho}}}}\right) \]
The marginal distribution of \({\textcolor{params}{\boldsymbol{\theta}}}\) is the target distribution \(p({\textcolor{params}{\boldsymbol{\theta}}}\mid {\textcolor{output}{\boldsymbol{y}}})\).
\[ p({\textcolor{params}{\boldsymbol{\theta}}}\mid {\textcolor{output}{\boldsymbol{y}}}) = \int p({\textcolor{params}{\boldsymbol{\theta}}}, {\color{vparams}{{\boldsymbol{\rho}}}}) \dd {\color{vparams}{{\boldsymbol{\rho}}}} = \frac{1}{Z} \exp(-\mathcal{U}({\textcolor{params}{\boldsymbol{\theta}}}))\frac{1}{Z_K}\int \exp\left(-\frac{1}{2}{\color{vparams}{{\boldsymbol{\rho}}}}^\top{\boldsymbol{M}}^{-1}{\color{vparams}{{\boldsymbol{\rho}}}}\right)\dd {\color{vparams}{{\boldsymbol{\rho}}}} = \frac{1}{Z} \exp(-\mathcal{U}({\textcolor{params}{\boldsymbol{\theta}}})) \]
Suppose the last sample in the sequence is \({\textcolor{params}{\boldsymbol{\theta}}}^{(t-1)}\). Sample a fresh momentum \(\textcolor{vparams}{{\boldsymbol{\rho}}}^{(t-1)} \sim {\mathcal{N}}(\boldsymbol{0}, {\boldsymbol{M}})\), simulate the dynamics for \(L\) leapfrog steps to obtain \(({\textcolor{params}{\boldsymbol{\theta}}}^\prime_L, \textcolor{vparams}{{\boldsymbol{\rho}}}^\prime_L)\), and accept the proposal with probability:
\[ \alpha = \min\left(1, \frac{p({\textcolor{params}{\boldsymbol{\theta}}}^\prime_L, \textcolor{vparams}{{\boldsymbol{\rho}}}^\prime_L)}{p({\textcolor{params}{\boldsymbol{\theta}}}^{(t-1)}, \textcolor{vparams}{{\boldsymbol{\rho}}}^{(t-1)})}\right) = \min\left(1, \exp\left(-\mathcal{H}({\textcolor{params}{\boldsymbol{\theta}}}^\prime_L, \textcolor{vparams}{{\boldsymbol{\rho}}}^\prime_L) + \mathcal{H}({\textcolor{params}{\boldsymbol{\theta}}}^{(t-1)}, \textcolor{vparams}{{\boldsymbol{\rho}}}^{(t-1)})\right)\right) \]
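A minimal HMC transition built on the `leapfrog` sketch above (again assuming \({\boldsymbol{M}} = {\boldsymbol{I}}\)); `log_target` and `grad_log_target` are placeholders for the unnormalized log-posterior and its gradient.

```python
import numpy as np

def hmc_step(theta, log_target, grad_log_target, eps=0.1, n_steps=20, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    grad_U = lambda th: -grad_log_target(th)   # U(theta) = -log p(theta | y) + const
    rho = rng.standard_normal(theta.size)      # fresh momentum ~ N(0, I)
    theta_new, rho_new = leapfrog(theta, rho, grad_U, eps, n_steps)
    # Hamiltonians before and after the simulated trajectory
    h_old = -log_target(theta) + 0.5 * rho @ rho
    h_new = -log_target(theta_new) + 0.5 * rho_new @ rho_new
    # Metropolis correction for the numerical integration error
    if np.log(rng.uniform()) < h_old - h_new:
        return theta_new                       # accept
    return theta                               # reject: keep the current sample
```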
Considerations: performance is sensitive to the step size \(\epsilon\) and the number of leapfrog steps \(L\), and each leapfrog step requires a gradient evaluation over the full dataset.
No-U-Turn Sampler (NUTS): Automatically adapt the number of steps \(L\).
Stochastic Gradient Hamiltonian Monte Carlo: Use stochastic gradients to scale to large datasets
\[ \begin{aligned} {\mathcal{U}}({\textcolor{params}{\boldsymbol{\theta}}}) &= -\log p({\textcolor{params}{\boldsymbol{\theta}}}\mid {\textcolor{output}{\boldsymbol{y}}}) = -\sum_{i=1}^n \log p({\textcolor{output}{\boldsymbol{y}}}_i \mid {\textcolor{params}{\boldsymbol{\theta}}}) - \log p({\textcolor{params}{\boldsymbol{\theta}}}) + \text{const} \end{aligned} \]
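A sketch of the minibatch gradient estimator this relies on: the likelihood term is rescaled by \(n / |\mathcal{B}|\) so that the estimate of \(\grad_{{\textcolor{params}{\boldsymbol{\theta}}}}\mathcal{U}\) stays unbiased; `grad_log_lik` and `grad_log_prior` are model-specific placeholders.

```python
import numpy as np

def stochastic_grad_U(theta, y, grad_log_lik, grad_log_prior, batch_size, rng):
    n = len(y)
    idx = rng.choice(n, size=batch_size, replace=False)  # random minibatch
    # Rescale the minibatch sum by n / batch_size to keep the estimate unbiased
    data_term = (n / batch_size) * sum(grad_log_lik(y[i], theta) for i in idx)
    return -(data_term + grad_log_prior(theta))  # gradient of the potential energy
```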
Source: Score-Based Generative Modeling with Critically-Damped Langevin Diffusion

Simone Rossi - Advanced Statistical Inference - EURECOM