Advanced Statistical Inference
EURECOM
\[ \require{physics} \definecolor{input}{rgb}{0.42, 0.55, 0.74} \definecolor{params}{rgb}{0.51,0.70,0.40} \definecolor{output}{rgb}{0.843, 0.608, 0} \definecolor{vparams}{rgb}{0.58, 0, 0.83} \definecolor{noise}{rgb}{0.0, 0.48, 0.65} \definecolor{latent}{rgb}{0.8, 0.0, 0.8} \]
We need to compute
\[ {\mathbb{E}}_{p({\textcolor{input}{\boldsymbol{x}}})}[f({\textcolor{input}{\boldsymbol{x}}})] = \int f({\textcolor{input}{\boldsymbol{x}}}) p({\textcolor{input}{\boldsymbol{x}}}) \dd {\textcolor{input}{\boldsymbol{x}}} \]
where \(p({\textcolor{input}{\boldsymbol{x}}})\) is the probability density function of \({\textcolor{input}{\boldsymbol{x}}}\).
Monte Carlo methods are a class of computational algorithms that rely on repeated random sampling to obtain numerical results.
\[ {\mathbb{E}}_{p({\textcolor{input}{\boldsymbol{x}}})}[f({\textcolor{input}{\boldsymbol{x}}})] \approx \frac{1}{N} \sum_{i=1}^N f({\textcolor{input}{\boldsymbol{x}}}_i) \]
where \({\textcolor{input}{\boldsymbol{x}}}_i \sim p({\textcolor{input}{\boldsymbol{x}}})\).
The area of a circle of radius \(r\) is \(\pi r^2\), but it can also be written as \[ I = \int_{-r}^{r} \int_{-r}^{r} \mathbb{I}(x^2 + y^2 \leq r^2) \dd x \dd y \]
where \(\mathbb{I}\) is 1 if the condition is true and 0 otherwise. Then, \(\pi = I / r^2\).
With Monte Carlo, \[ \begin{aligned} I &= (2r)^2 \int \int f(x, y) p(x) p(y) \dd x \dd y \\ &\approx (2r)^2 \frac{1}{N} \sum_{i=1}^N f(x_i, y_i) \end{aligned} \] where \(f(x, y) = \mathbb{I}(x^2 + y^2 \leq r^2)\), \(p(x) = p(y) = \mathcal{U}(-r, r)\), and \(x_i, y_i \sim \mathcal{U}(-r, r)\).
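As an illustration, here is a minimal NumPy sketch of this estimator (the hypothetical `estimate_pi` helper, the seed, and the sample size are arbitrary choices for the example):

```python
import numpy as np

rng = np.random.default_rng(0)

def estimate_pi(n_samples, r=1.0):
    # Draw (x, y) uniformly on the square [-r, r] x [-r, r]
    x = rng.uniform(-r, r, size=n_samples)
    y = rng.uniform(-r, r, size=n_samples)
    # Indicator f(x, y) = 1 if the point falls inside the circle of radius r
    inside = (x**2 + y**2 <= r**2)
    # I is approximately (2r)^2 times the mean of the indicator, and pi = I / r^2
    I = (2 * r) ** 2 * inside.mean()
    return I / r**2

print(estimate_pi(1_000_000))  # roughly 3.14
```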
Consistency: As the number of samples \(N\) goes to infinity, the estimate converges (by the law of large numbers) to the true value.
Central Limit Theorem: For large \(N\), the estimate is approximately normally distributed around the true value.
\[ \left( \frac{1}{N} \sum_{i=1}^N f({\textcolor{input}{\boldsymbol{x}}}_i) - {\mathbb{E}}_{p({\textcolor{input}{\boldsymbol{x}}})}[f({\textcolor{input}{\boldsymbol{x}}})] \right) \sim {\mathcal{N}}\left(0, \frac{\sigma^2}{N}\right) \]
where \(\sigma^2\) is the variance of \(f({\textcolor{input}{\boldsymbol{x}}})\) under \(p({\textcolor{input}{\boldsymbol{x}}})\).
Note: This \(\mathcal{O}(1/\sqrt{N})\) convergence rate is independent of the dimensionality of \({\textcolor{input}{\boldsymbol{x}}}\).
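A quick numerical check of this behaviour is sketched below, assuming the (arbitrary) test function \(f(x) = x^2\) with \(x \sim \mathcal{N}(0, 1)\), for which the true expectation is 1 and \(\sigma^2 = 2\):

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: x**2  # E[f(x)] = 1 and Var[f(x)] = 2 under x ~ N(0, 1)

for n in [100, 10_000, 1_000_000]:
    # Repeat the Monte Carlo estimate many times to measure its spread
    estimates = [f(rng.standard_normal(n)).mean() for _ in range(100)]
    # Empirical standard error vs. the CLT prediction sigma / sqrt(n)
    print(n, np.std(estimates), np.sqrt(2 / n))
```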
Important
How do we sample from \(p({\textcolor{input}{\boldsymbol{x}}})\) when \(p({\textcolor{input}{\boldsymbol{x}}})\) is not known? For example, when it’s a posterior distribution \(p({\textcolor{params}{\boldsymbol{\theta}}}\mid{\textcolor{output}{\boldsymbol{y}}})\).
Suppose we want to sample from a (univariate) Gaussian distribution \(p(x) = \mathcal{N}(x\mid\mu, \sigma^2)\).
Sampling from a Gaussian distribution
Each sample can be obtained by
\[ x_i = \mu + \sigma z_i \]
where \(z_i \sim \mathcal{N}(0, 1)\).
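A minimal NumPy sketch of this reparameterization (the values of \(\mu\) and \(\sigma\) are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

mu, sigma = 2.0, 0.5             # example parameters
z = rng.standard_normal(10_000)  # z_i ~ N(0, 1)
x = mu + sigma * z               # x_i ~ N(mu, sigma^2)
print(x.mean(), x.std())         # roughly 2.0 and 0.5
```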
Suppose we want to sample from a multivariate Gaussian distribution \(p({\textcolor{input}{\boldsymbol{x}}}) = \mathcal{N}({\textcolor{input}{\boldsymbol{x}}}\mid{\boldsymbol{\mu}}, {\boldsymbol{\Sigma}})\).
Each sample can be obtained by
\[ {\textcolor{input}{\boldsymbol{x}}}_i = {\boldsymbol{\mu}} + {\boldsymbol{L}} {\boldsymbol{z}}_i \]
where \({\boldsymbol{z}}_i \sim \mathcal{N}({\boldsymbol{0}}, {\boldsymbol{I}})\) and \({\boldsymbol{L}}\) is the Cholesky factor of the covariance, i.e. \({\boldsymbol{\Sigma}} = {\boldsymbol{L}}{\boldsymbol{L}}^\top\).
Note: The Cholesky decomposition is only defined for positive-definite matrices. If \({\boldsymbol{\Sigma}}\) is only positive semi-definite, this factorization cannot be used directly to generate samples.
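A minimal NumPy sketch of Cholesky-based sampling (the particular \({\boldsymbol{\mu}}\) and \({\boldsymbol{\Sigma}}\) are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

mu = np.array([1.0, -1.0])            # example mean
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])        # example positive-definite covariance
L = np.linalg.cholesky(Sigma)         # Sigma = L @ L.T

z = rng.standard_normal((10_000, 2))  # rows z_i ~ N(0, I)
x = mu + z @ L.T                      # rows x_i = mu + L z_i ~ N(mu, Sigma)
print(np.cov(x, rowvar=False))        # approximately Sigma
```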
Suppose we want to sample from a distribution \(p(x)\) that we only know up to a normalizing constant \(p(x) = \frac{p^*(x)}{Z}\), where \(Z = \int p^*(x) \dd x\).
In rejection sampling, we start from a proposal distribution \(q(x)\) that we can sample from, and a constant \(C > 0\) such that \(Cq(x) \geq p^*(x)\) for all \(x\). We draw \(x \sim q(x)\) and \(u \sim \mathcal{U}(0, Cq(x))\), and accept \(x\) as a sample from \(p(x)\) if \(u \leq p^*(x)\); otherwise we reject it.
Rejection sampling
It is possible to prove that
\[ \mathbb P(\text{accept}) = \frac{Z}{C} \propto \frac 1 C \]
so the smaller the envelope constant \(C\) (while still satisfying \(Cq(x) \geq p^*(x)\)), the higher the acceptance rate.
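A minimal NumPy sketch of rejection sampling (the unnormalized target, the Gaussian proposal, and the envelope constant \(C\) below are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Unnormalized target p*(x): a two-component Gaussian mixture, normalizer unknown
def p_star(x):
    return np.exp(-0.5 * (x - 2.0) ** 2) + 0.5 * np.exp(-0.5 * (x + 2.0) ** 2)

# Proposal q(x) = N(0, 3^2) and a constant C with C q(x) >= p*(x) for all x
def q_pdf(x):
    return np.exp(-0.5 * (x / 3.0) ** 2) / (3.0 * np.sqrt(2.0 * np.pi))

C = 12.0  # large enough for this particular target

n = 100_000
x = rng.normal(0.0, 3.0, size=n)    # candidates x ~ q(x)
u = rng.uniform(0.0, C * q_pdf(x))  # u ~ U(0, C q(x))
samples = x[u <= p_star(x)]         # accept x when u <= p*(x)

print(len(samples) / n)             # acceptance rate, roughly Z / C
```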
Important
Note: In high dimensions (\(>100\)), rejection sampling is not efficient. The volume of the region where \(p^*(x)\) is large becomes very small compared to the volume of the region where \(q(x)\) is large, so the envelope constant \(C\) must be very large and the acceptance probability \(\propto 1/C\) becomes vanishingly small.
We need something more efficient.

Simone Rossi - Advanced Statistical Inference - EURECOM