Advanced Statistical Inference
EURECOM
\[ \require{physics} \definecolor{input}{rgb}{0.42, 0.55, 0.74} \definecolor{params}{rgb}{0.51,0.70,0.40} \definecolor{output}{rgb}{0.843, 0.608, 0} \definecolor{vparams}{rgb}{0.58, 0, 0.83} \definecolor{noise}{rgb}{0.0, 0.48, 0.65} \definecolor{latent}{rgb}{0.8, 0.0, 0.8} \]
Given a likelihood \(p({\textcolor{output}{\boldsymbol{y}}}\mid {\textcolor{params}{\boldsymbol{\theta}}})\) and a prior \(p({\textcolor{params}{\boldsymbol{\theta}}})\), the goal is to compute the posterior distribution \(p({\textcolor{params}{\boldsymbol{\theta}}}\mid {\textcolor{output}{\boldsymbol{y}}})\)
\[ p({\textcolor{params}{\boldsymbol{\theta}}}\mid {\textcolor{output}{\boldsymbol{y}}}) = \frac{p({\textcolor{output}{\boldsymbol{y}}}\mid {\textcolor{params}{\boldsymbol{\theta}}}) p({\textcolor{params}{\boldsymbol{\theta}}})}{p({\textcolor{output}{\boldsymbol{y}}})} \]
We have seen that exact inference is possible when the prior is conjugate to the likelihood (e.g. Bayesian linear regression with a Gaussian likelihood and a Gaussian prior).
But in most cases, exact Bayesian inference is intractable.
We have seen that we can use Monte Carlo methods to approximate the posterior distribution \(p({\textcolor{params}{\boldsymbol{\theta}}}\mid {\textcolor{output}{\boldsymbol{y}}})\).
This week, we will see another way to approximate the posterior distribution \(p({\textcolor{params}{\boldsymbol{\theta}}}\mid {\textcolor{output}{\boldsymbol{y}}})\).
Can we use a simpler distribution to approximate the posterior?
And if so, how?
Idea: approximate the posterior locally with a Gaussian centered at its mode. This is the Laplace approximation.
Given a likelihood \(p({\textcolor{output}{\boldsymbol{y}}}\mid{\textcolor{params}{\boldsymbol{\theta}}})\) and a prior \(p({\textcolor{params}{\boldsymbol{\theta}}})\), the posterior distribution is given by
\[ p({\textcolor{params}{\boldsymbol{\theta}}}\mid{\textcolor{output}{\boldsymbol{y}}}) = \frac 1 Z \exp\left(-{\mathcal{U}}({\textcolor{params}{\boldsymbol{\theta}}})\right) \]
where \(Z\) is the normalizing constant and \({\mathcal{U}}({\textcolor{params}{\boldsymbol{\theta}}})= -\log p({\textcolor{output}{\boldsymbol{y}}}\mid{\textcolor{params}{\boldsymbol{\theta}}}) - \log p({\textcolor{params}{\boldsymbol{\theta}}})\) is the (un-normalized) negative log posterior.
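As a concrete instance of this decomposition (an illustrative choice, not part of the derivation), a Gaussian prior \(p({\textcolor{params}{\boldsymbol{\theta}}})={\mathcal{N}}({\textcolor{params}{\boldsymbol{\theta}}}\mid\boldsymbol{0},\lambda^{-1}{\boldsymbol{I}})\) contributes a quadratic term to \({\mathcal{U}}\):
\[ -\log p({\textcolor{params}{\boldsymbol{\theta}}}) = \frac{\lambda}{2}\norm{{\textcolor{params}{\boldsymbol{\theta}}}}^2 + \text{const} \]
so minimizing \({\mathcal{U}}({\textcolor{params}{\boldsymbol{\theta}}})\) amounts to MAP estimation with an \(\ell_2\) penalty on the parameters.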
Taylor expand \({\mathcal{U}}({\textcolor{params}{\boldsymbol{\theta}}})\) around the mode \(\widehat{{\textcolor{params}{\boldsymbol{\theta}}}}\) to second order:
\[ {\mathcal{U}}({\textcolor{params}{\boldsymbol{\theta}}}) \approx {\mathcal{U}}(\widehat{{\textcolor{params}{\boldsymbol{\theta}}}}) + ({\textcolor{params}{\boldsymbol{\theta}}}-\widehat{{\textcolor{params}{\boldsymbol{\theta}}}})^\top {\boldsymbol{g}}+ \frac 1 2 ({\textcolor{params}{\boldsymbol{\theta}}}-\widehat{{\textcolor{params}{\boldsymbol{\theta}}}})^\top {\boldsymbol{H}}({\textcolor{params}{\boldsymbol{\theta}}}-\widehat{{\textcolor{params}{\boldsymbol{\theta}}}}) \]
where:
The Hessian matrix \({\boldsymbol{H}}\) is the square \(D\times D\) matrix of second-order partial derivatives of \({\mathcal{U}}\), evaluated at \({\textcolor{params}{\boldsymbol{\theta}}}\) (in the expansion above, at the mode \(\widehat{{\textcolor{params}{\boldsymbol{\theta}}}}\)); a numerical sketch is given after this list.
\[ {\boldsymbol{H}}= \grad^2 {\mathcal{U}}({\textcolor{params}{\boldsymbol{\theta}}}) = \begin{bmatrix} \frac{\partial^2 {\mathcal{U}}}{\partial {\color{params}\theta}_1^2}({\textcolor{params}{\boldsymbol{\theta}}}) & \frac{\partial^2 {\mathcal{U}}}{\partial {\color{params}\theta}_1\partial {\color{params}\theta}_2}({\textcolor{params}{\boldsymbol{\theta}}}) & \cdots & \frac{\partial^2 {\mathcal{U}}}{\partial {\color{params}\theta}_1\partial {\color{params}\theta}_D}({\textcolor{params}{\boldsymbol{\theta}}}) \\ \frac{\partial^2 {\mathcal{U}}}{\partial {\color{params}\theta}_2\partial {\color{params}\theta}_1}({\textcolor{params}{\boldsymbol{\theta}}}) & \frac{\partial^2 {\mathcal{U}}}{\partial {\color{params}\theta}_2^2}({\textcolor{params}{\boldsymbol{\theta}}}) & \cdots & \frac{\partial^2 {\mathcal{U}}}{\partial {\color{params}\theta}_2\partial {\color{params}\theta}_D}({\textcolor{params}{\boldsymbol{\theta}}}) \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 {\mathcal{U}}}{\partial {\color{params}\theta}_D\partial {\color{params}\theta}_1}({\textcolor{params}{\boldsymbol{\theta}}}) & \frac{\partial^2 {\mathcal{U}}}{\partial {\color{params}\theta}_D\partial {\color{params}\theta}_2}({\textcolor{params}{\boldsymbol{\theta}}}) & \cdots & \frac{\partial^2 {\mathcal{U}}}{\partial {\color{params}\theta}_D^2}({\textcolor{params}{\boldsymbol{\theta}}}) \end{bmatrix} \]

The mode \(\widehat{{\textcolor{params}{\boldsymbol{\theta}}}}\) is the value of \({\textcolor{params}{\boldsymbol{\theta}}}\) that maximizes \(p({\textcolor{params}{\boldsymbol{\theta}}}\mid{\textcolor{output}{\boldsymbol{y}}})\), or equivalently minimizes \({\mathcal{U}}({\textcolor{params}{\boldsymbol{\theta}}})\) (i.e., the MAP estimate).
Because we are expanding around the mode, \({\boldsymbol{g}}=\grad{\mathcal{U}}(\widehat{{\textcolor{params}{\boldsymbol{\theta}}}})=0\).
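To make the Hessian concrete, here is a minimal numerical sketch (assuming NumPy; the energy U is a hypothetical stand-in for the negative log posterior) that builds the \(D\times D\) Hessian by central finite differences:

import numpy as np

def hessian_fd(U, theta, eps=1e-5):
    """D x D Hessian of a scalar function U at theta, by central finite differences."""
    D = theta.size
    H = np.zeros((D, D))
    I = np.eye(D)
    for i in range(D):
        for j in range(D):
            ei, ej = eps * I[i], eps * I[j]
            # central difference estimate of d^2 U / (d theta_i d theta_j)
            H[i, j] = (U(theta + ei + ej) - U(theta + ei - ej)
                       - U(theta - ei + ej) + U(theta - ei - ej)) / (4 * eps**2)
    return H

# Sanity check: for a quadratic U(theta) = 0.5 theta^T A theta, the Hessian is A.
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])
print(hessian_fd(lambda th: 0.5 * th @ A @ th, np.zeros(2)))  # close to A

In practice one would prefer automatic differentiation or an analytic Hessian when available; finite differences are shown here only because they match the definition entry by entry.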
Substituting this expansion (with \({\boldsymbol{g}}=\boldsymbol{0}\)) into the posterior:
\[ \begin{aligned} p({\textcolor{params}{\boldsymbol{\theta}}}\mid{\textcolor{output}{\boldsymbol{y}}}) &\approx \frac 1 Z \exp\left(-{\mathcal{U}}(\widehat{{\textcolor{params}{\boldsymbol{\theta}}}}) - \frac 1 2 ({\textcolor{params}{\boldsymbol{\theta}}}-\widehat{{\textcolor{params}{\boldsymbol{\theta}}}})^\top {\boldsymbol{H}}({\textcolor{params}{\boldsymbol{\theta}}}-\widehat{{\textcolor{params}{\boldsymbol{\theta}}}})\right)\\ &= \frac 1 Z \exp\left(-{\mathcal{U}}(\widehat{{\textcolor{params}{\boldsymbol{\theta}}}})\right) \exp\left(-\frac 1 2 ({\textcolor{params}{\boldsymbol{\theta}}}-\widehat{{\textcolor{params}{\boldsymbol{\theta}}}})^\top {\boldsymbol{H}}({\textcolor{params}{\boldsymbol{\theta}}}-\widehat{{\textcolor{params}{\boldsymbol{\theta}}}})\right) \end{aligned} \]
Compare this to the form of a Gaussian distribution:
\[ {\mathcal{N}}({\textcolor{params}{\boldsymbol{\theta}}}\mid{\boldsymbol{\mu}}, {\boldsymbol{\Sigma}}) = (2\pi)^{-D/2}\det({\boldsymbol{\Sigma}})^{-1/2} \exp\left(-\frac 1 2 ({\textcolor{params}{\boldsymbol{\theta}}}-{\boldsymbol{\mu}})^\top {\boldsymbol{\Sigma}}^{-1}({\textcolor{params}{\boldsymbol{\theta}}}-{\boldsymbol{\mu}})\right) \]
A second-order Taylor expansion of the log posterior therefore yields a Gaussian approximation of the posterior. Matching terms:
The mean of the Gaussian is the mode of the posterior \(\widehat{{\textcolor{params}{\boldsymbol{\theta}}}}\).
The covariance of the Gaussian is the inverse of the Hessian \({\boldsymbol{H}}^{-1}\).
The normalizing constant is \(Z=(2\pi)^{D/2}\det({\boldsymbol{H}})^{-1/2}\exp(-{\mathcal{U}}(\widehat{{\textcolor{params}{\boldsymbol{\theta}}}}))\).
In practice, the mode \(\widehat{{\textcolor{params}{\boldsymbol{\theta}}}}\) is found by numerical optimization, e.g. gradient descent with step size \(\eta\):
\[ {\textcolor{params}{\boldsymbol{\theta}}}\leftarrow {\textcolor{params}{\boldsymbol{\theta}}}- \eta \grad{\mathcal{U}}({\textcolor{params}{\boldsymbol{\theta}}}) \quad \text{until convergence} \]
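Putting the pieces together, here is a minimal end-to-end sketch (assuming NumPy, a toy stand-in for \({\mathcal{U}}\), and the hessian_fd helper from the sketch above; all names are illustrative): run the gradient-descent update to find the mode, evaluate the Hessian there, and read off the Gaussian approximation and the evidence estimate.

import numpy as np

# Toy, slightly non-Gaussian stand-in for U(theta) = -log p(y|theta) - log p(theta),
# so that the Laplace approximation is not exact.
def U(theta):
    return 0.5 * theta[0]**2 + theta[1]**2 + 0.1 * theta[0]**4

def grad_fd(theta, eps=1e-6):
    # central finite-difference gradient of U
    return np.array([(U(theta + eps * e) - U(theta - eps * e)) / (2 * eps)
                     for e in np.eye(theta.size)])

# 1. Find the mode with the update theta <- theta - eta * grad U(theta).
theta = np.ones(2)
for _ in range(5000):               # "until convergence" (fixed budget here)
    theta -= 1e-2 * grad_fd(theta)  # eta = 1e-2
theta_hat = theta

# 2. Hessian at the mode (hessian_fd as defined in the earlier sketch).
H = hessian_fd(U, theta_hat)

# 3. Laplace approximation: N(theta | theta_hat, H^{-1}), with
#    log Z = (D/2) log(2 pi) - (1/2) log det(H) - U(theta_hat).
Sigma = np.linalg.inv(H)
logZ = (theta_hat.size / 2) * np.log(2 * np.pi) \
       - 0.5 * np.linalg.slogdet(H)[1] - U(theta_hat)
print(theta_hat, Sigma, logZ)

Here log Z approximates the log marginal likelihood \(\log p({\textcolor{output}{\boldsymbol{y}}})\), which is one of the main practical uses of the Laplace approximation (e.g. for model comparison).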
The exact Hessian \({\boldsymbol{H}}\) can be expensive to compute and invert for large \(D\).
The Laplace approximation is only valid in a neighborhood of the mode.
Consider a Gamma distribution with shape parameter \(\alpha>1\) (so that the mode exists) and rate parameter \(\beta\):
\[ p(x\mid\alpha,\beta) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha-1}\exp(-\beta x) \]
1. Write the negative log posterior \({\mathcal{U}}(x)\), up to additive constants: \[ {\mathcal{U}}(x) = -(\alpha-1)\log x + \beta x \]
2. Compute the mode \(\widehat{x}\) by setting \(\frac{\dd{\mathcal{U}}(x)}{\dd x}=0\): \[ \frac{\dd{\mathcal{U}}(x)}{\dd x} = -\frac{\alpha-1}{x} + \beta = 0 \quad \Rightarrow \quad \widehat{x} = \frac{\alpha-1}{\beta} \]
3. Compute the second derivative of \({\mathcal{U}}(x)\) at the mode: \[ \left.\frac{\dd^2 {\mathcal{U}}(x)}{\dd x^2}\right\vert_{x=\widehat{x}} = \left.\frac{\alpha-1}{x^2}\right\vert_{x=\widehat{x}} = \frac{\alpha-1}{(\alpha-1)^2/\beta^2} = \frac{\beta^2}{\alpha-1} \]
4. Set up the Gaussian approximation of the posterior, with mean \(\widehat{x}\) and variance equal to the inverse of the second derivative:
\[ \begin{aligned} p(x\mid\alpha,\beta) &\approx {\mathcal{N}}\left(x\mid\frac{\alpha-1}{\beta}, \frac{\alpha-1}{\beta^2}\right) \end{aligned} \]
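As a quick numerical sanity check (a sketch assuming SciPy; the values \(\alpha=10\), \(\beta=2\) are arbitrary), one can compare the Gamma density with its Laplace approximation near the mode:

import numpy as np
from scipy import stats

alpha, beta = 10.0, 2.0                       # arbitrary example values (alpha > 1)
x_hat = (alpha - 1) / beta                    # mode of the Gamma
var = (alpha - 1) / beta**2                   # inverse of the second derivative at the mode

gamma = stats.gamma(a=alpha, scale=1 / beta)  # SciPy parameterizes by scale = 1/rate
laplace = stats.norm(loc=x_hat, scale=np.sqrt(var))

xs = np.linspace(1.0, 9.0, 5)
print(gamma.pdf(xs))                          # true density
print(laplace.pdf(xs))                        # Gaussian approximation

The two densities agree well near the mode \(\widehat{x}=4.5\) and diverge in the tails, illustrating the local nature of the approximation (note that the Gaussian also puts mass on \(x<0\), where the Gamma has none).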
