Monte Carlo Methods

Advanced Statistical Inference

Simone Rossi

EURECOM


\[ \require{physics} \definecolor{input}{rgb}{0.42, 0.55, 0.74} \definecolor{params}{rgb}{0.51,0.70,0.40} \definecolor{output}{rgb}{0.843, 0.608, 0} \definecolor{vparams}{rgb}{0.58, 0, 0.83} \definecolor{noise}{rgb}{0.0, 0.48, 0.65} \definecolor{latent}{rgb}{0.8, 0.0, 0.8} \]


Introduction

We need to compute

\[ {\mathbb{E}}_{p({\textcolor{input}{\boldsymbol{x}}})}[f({\textcolor{input}{\boldsymbol{x}}})] = \int f({\textcolor{input}{\boldsymbol{x}}}) p({\textcolor{input}{\boldsymbol{x}}}) \dd {\textcolor{input}{\boldsymbol{x}}} \]

where \(p({\textcolor{input}{\boldsymbol{x}}})\) is the probability density function of \({\textcolor{input}{\boldsymbol{x}}}\).

Monte Carlo methods are a class of computational algorithms that rely on repeated random sampling to obtain numerical results.

\[ {\mathbb{E}}_{p({\textcolor{input}{\boldsymbol{x}}})}[f({\textcolor{input}{\boldsymbol{x}}})] \approx \frac{1}{N} \sum_{i=1}^N f({\textcolor{input}{\boldsymbol{x}}}_i) \]

where \({\textcolor{input}{\boldsymbol{x}}}_i \sim p({\textcolor{input}{\boldsymbol{x}}})\).
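A minimal Python sketch of this estimator (NumPy assumed; the choice \(f(x) = x^2\) under a standard normal, with true value 1, is our illustrative example):

```python
import numpy as np

rng = np.random.default_rng(0)

# Monte Carlo estimate of E_p[f(x)]: draw N samples from p, average f over them.
def mc_estimate(f, sampler, n_samples):
    samples = sampler(n_samples)
    return np.mean(f(samples))

# Illustrative choice: f(x) = x^2 under p = N(0, 1), so E[f(x)] = Var(x) = 1.
estimate = mc_estimate(lambda x: x**2, rng.standard_normal, 100_000)
print(estimate)  # close to 1.0
```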

Compute \(\pi\) with Monte Carlo

The area of a circle of radius \(r\) is \(\pi r^2\), but it can also be written as \[ I = \int_{-r}^{r} \int_{-r}^{r} \mathbb{I}(x^2 + y^2 \leq r^2) \dd x \dd y \]

where \(\mathbb{I}\) is 1 if the condition is true and 0 otherwise. Then, \(\pi = I / r^2\).

With Monte Carlo, \[ \begin{aligned} I &= (2r)^2 \int \int f(x, y) p(x) p(y) \dd x \dd y \\ &\approx (2r)^2 \frac{1}{N} \sum_{i=1}^N f(x_i, y_i) \end{aligned} \] where \(f(x, y) = \mathbb{I}(x^2 + y^2 \leq r^2)\) and \(p(x) = p(y) = \mathcal{U}(-r, r)\).
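As a concrete sketch (NumPy assumed; we take \(r = 1\)):

```python
import numpy as np

rng = np.random.default_rng(0)

# Estimate pi: sample (x, y) uniformly on [-r, r]^2, average the disc indicator.
r, n = 1.0, 1_000_000
x = rng.uniform(-r, r, size=n)
y = rng.uniform(-r, r, size=n)
f = (x**2 + y**2 <= r**2)        # indicator I(x^2 + y^2 <= r^2)
I = (2 * r) ** 2 * f.mean()      # Monte Carlo estimate of the integral
print(I / r**2)                  # pi estimate, approx 3.14
```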

Theoretical properties of Monte Carlo methods

  • Consistency: By the law of large numbers, as the number of samples \(N\) goes to infinity, the estimate converges to the true value.

  • Central Limit Theorem: For large \(N\), the estimate is approximately normally distributed around the true value:

\[ \left( \frac{1}{N} \sum_{i=1}^N f({\textcolor{input}{\boldsymbol{x}}}_i) - {\mathbb{E}}_{p({\textcolor{input}{\boldsymbol{x}}})}[f({\textcolor{input}{\boldsymbol{x}}})] \right) \sim {\mathcal{N}}\left(0, \frac{\sigma^2}{N}\right) \]

where \(\sigma^2\) is the variance of \(f({\textcolor{input}{\boldsymbol{x}}})\).

Note: This \(\mathcal{O}(1/\sqrt{N})\) convergence rate is independent of the dimensionality of \({\textcolor{input}{\boldsymbol{x}}}\).
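A quick empirical check of the \(\mathcal{O}(1/\sqrt{N})\) error (a sketch, reusing our illustrative target \(\mathbb{E}[x^2] = 1\) under a standard normal): the root-mean-square error should shrink by roughly \(\sqrt{10}\) each time \(N\) grows tenfold.

```python
import numpy as np

rng = np.random.default_rng(0)

# RMSE of the MC estimate of E[x^2] = 1 under N(0, 1), over 500 repetitions.
for n in [100, 1_000, 10_000]:
    estimates = np.array([np.mean(rng.standard_normal(n) ** 2) for _ in range(500)])
    rmse = np.sqrt(np.mean((estimates - 1.0) ** 2))
    print(n, rmse)  # rmse drops roughly as 1/sqrt(n)
```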

Important

How do we sample from \(p({\textcolor{input}{\boldsymbol{x}}})\) when we cannot sample from it directly? For example, when it is a posterior distribution \(p({\textcolor{params}{\boldsymbol{\theta}}}\mid{\textcolor{output}{\boldsymbol{y}}})\) known only up to its normalizing constant.

Sampling from simple distributions

Sampling from a Gaussian distribution

Suppose we want to sample from a (univariate) Gaussian distribution \(p(x) = \mathcal{N}(x\mid\mu, \sigma^2)\).


Each sample can be obtained by

\[ x_i = \mu + \sigma z_i \]

where \(z_i \sim \mathcal{N}(0, 1)\).
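In code (a minimal sketch, NumPy assumed; the values of \(\mu\) and \(\sigma\) are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# x = mu + sigma * z with z ~ N(0, 1) gives x ~ N(mu, sigma^2).
mu, sigma = 2.0, 0.5
z = rng.standard_normal(10_000)
x = mu + sigma * z
print(x.mean(), x.std())  # approx 2.0 and 0.5
```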

Sampling from a multivariate Gaussian distribution

Suppose we want to sample from a multivariate Gaussian distribution \(p({\textcolor{input}{\boldsymbol{x}}}) = \mathcal{N}({\textcolor{input}{\boldsymbol{x}}}\mid{\boldsymbol{\mu}}, {\boldsymbol{\Sigma}})\).

  1. Compute the Cholesky decomposition of \({\boldsymbol{\Sigma}}= {\boldsymbol{L}}{\boldsymbol{L}}^T\).
  2. Sample \({\boldsymbol{z}}\sim \mathcal{N}(0, {\boldsymbol{I}})\).
  3. Compute \({\textcolor{input}{\boldsymbol{x}}} = {\boldsymbol{\mu}} + {\boldsymbol{L}}{\boldsymbol{z}}\).

Note: The Cholesky decomposition is only defined for positive-definite matrices. If \({\boldsymbol{\Sigma}}\) is not positive-definite, the factorization fails and this procedure cannot be applied directly.
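The three steps, as a sketch (NumPy assumed; \({\boldsymbol{\mu}}\) and \({\boldsymbol{\Sigma}}\) below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

L = np.linalg.cholesky(Sigma)         # step 1: Sigma = L L^T (raises if not PD)
z = rng.standard_normal((10_000, 2))  # step 2: z ~ N(0, I), one row per sample
x = mu + z @ L.T                      # step 3: x = mu + L z, applied row-wise
print(np.cov(x, rowvar=False))        # sample covariance, approx Sigma
```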

Rejection sampling

Suppose we want to sample from a distribution \(p(x)\) that we only know up to a normalizing constant \(p(x) = \frac{p^*(x)}{Z}\), where \(Z = \int p^*(x) \dd x\).

In rejection sampling, we start from a proposal distribution \(q(x)\) that is easy to sample from and a constant \(C > 0\) such that \(Cq(x) \geq p^*(x)\) for all \(x\).


Rejection sampling algorithm

  1. Sample \(x \sim q(x)\).
  2. Sample \(u \sim \mathcal{U}(0, Cq(x))\).
  3. If \(u \leq p^*(x)\), accept \(x\); otherwise, reject it and go back to step 1.
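A sketch of the algorithm in Python (NumPy assumed): the target \(p^*(x) = e^{-x^2/2}\) (an unnormalized standard normal) and the proposal \(q = \mathcal{N}(0, 2^2)\) are our illustrative choices; \(C = 6\) satisfies \(Cq(x) \geq p^*(x)\) because \(\max_x p^*(x)/q(x) = 2\sqrt{2\pi} \approx 5.01\).

```python
import numpy as np

rng = np.random.default_rng(0)

def p_star(x):    # unnormalized target: N(0, 1) up to Z = sqrt(2*pi)
    return np.exp(-0.5 * x**2)

def q_pdf(x):     # proposal density: N(0, 2^2)
    return np.exp(-x**2 / 8.0) / np.sqrt(8.0 * np.pi)

C = 6.0           # C q(x) >= p*(x) for all x

samples, n_proposed = [], 0
while len(samples) < 10_000:
    x = rng.normal(0.0, 2.0)              # 1. sample x ~ q
    u = rng.uniform(0.0, C * q_pdf(x))    # 2. sample u ~ U(0, C q(x))
    if u <= p_star(x):                    # 3. accept if u <= p*(x)
        samples.append(x)
    n_proposed += 1

print(np.mean(samples), np.std(samples))  # approx 0 and 1
print(len(samples) / n_proposed)          # acceptance rate, approx Z/C = 0.42
```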


It is possible to prove that the acceptance probability is

\[ \mathbb P(\text{accept}) = \frac{Z}{C} \propto \frac 1 C \]

  • We want to have high acceptance rate.
  • We want to have \(C\) as small as possible.
  • … but large enough to satisfy \(Cq(x) \geq p^*(x)\).

Important

Note: In high dimensions (\(>100\)), rejection sampling is not efficient: the acceptance rate typically decays exponentially with the dimension, because the region where \(p^*(x)\) is large occupies a vanishingly small fraction of the region where \(q(x)\) places its mass.
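A back-of-the-envelope illustration (our assumed setup, not from the lecture): with unnormalized target \(p^*({\boldsymbol{x}}) = e^{-\|{\boldsymbol{x}}\|^2/2}\) and proposal \(q = \mathcal{N}({\boldsymbol{0}}, \sigma^2 {\boldsymbol{I}})\), \(\sigma > 1\), the tightest valid constant is \(C = (2\pi\sigma^2)^{d/2}\), giving an acceptance rate of \(Z/C = \sigma^{-d}\) in dimension \(d\).

```python
# Acceptance rate sigma^(-d) for a Gaussian target with an (even slightly)
# mismatched Gaussian proposal: a 10% scale mismatch is hopeless at d = 1000.
sigma = 1.1
for d in [1, 10, 100, 1000]:
    print(d, sigma ** (-d))  # 0.91, 0.39, ~7e-5, ~4e-42
```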

We need something more efficient.

⏩ Next: Markov Chain Monte Carlo