Bayesian Coin Toss Inference

Advanced Statistical Inference

Simone Rossi

EURECOM

\[ \require{physics} \definecolor{input}{rgb}{0.42, 0.55, 0.74} \definecolor{params}{rgb}{0.51,0.70,0.40} \definecolor{output}{rgb}{0.843, 0.608, 0} \definecolor{vparams}{rgb}{0.58, 0, 0.83} \definecolor{noise}{rgb}{0.0, 0.48, 0.65} \definecolor{latent}{rgb}{0.8, 0.0, 0.8} \definecolor{function}{rgb}{0.75, 0.75, 0.12} \]

Coin toss: A class exercise

Setup: I toss the coin \(n\) times and observe \(\textcolor{output}{y}\) heads

Question: Is the coin fair? What is the probability of heads?

Steps:

  1. Define a model for the number of heads
  2. Choose a prior distribution
  3. Compute the posterior distribution
  4. Make predictions

Model

Assumptions:

  1. The probability of heads is the same for all tosses.
  2. Each coin toss is independent of the others.

We model the number of heads \(\textcolor{output}{y}\) with a binomial distribution and probability \(\textcolor{params}{\theta}\):

\[ p(\textcolor{output}{y}\mid \textcolor{params}{\theta}) = \binom{n}{\textcolor{output}{y}} \textcolor{params}{\theta}^{\textcolor{output}{y}} (1-\textcolor{params}{\theta})^{n-\textcolor{output}{y}} \]

where \(\textcolor{params}{\theta}\) is the probability of heads, \(n\) is the number of tosses, \(\textcolor{output}{y}\) is the number of heads, and

\[ \binom{n}{\textcolor{output}{y}}= \frac{n!}{\textcolor{output}{y}!(n-\textcolor{output}{y})!} \]

is the binomial coefficient.
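The likelihood above can be evaluated directly with a few lines of Python (a minimal sketch; the function name is illustrative):

```python
from math import comb

def binomial_pmf(y, n, theta):
    """Probability of observing y heads in n tosses with heads probability theta."""
    return comb(n, y) * theta**y * (1 - theta)**(n - y)

# Example: probability of exactly 7 heads in 10 tosses of a fair coin
print(binomial_pmf(7, 10, 0.5))  # 0.1171875
```

Summing the pmf over all possible counts \(y = 0, \dots, n\) returns 1, which is a quick sanity check of the model.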

Prior

We need a prior distribution for \(\textcolor{params}{\theta}\).

How to choose it? Recall what \(\textcolor{params}{\theta}\) represents: the probability of heads.

  1. It must be between 0 and 1.
  2. It must be continuous.
  3. We want a posterior we can compute analytically (conjugacy with binomial).

Beta distribution:

\[ p(\textcolor{params}{\theta}) = \frac{1}{B(\alpha, \beta)} \textcolor{params}{\theta}^{\alpha-1} (1-\textcolor{params}{\theta})^{\beta-1} \]

where

\[ B(\alpha, \beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)} \]

  • Parameters: \(\alpha\) and \(\beta\)
  • \(\alpha=\beta=1\): uniform prior (no preference)
  • \(\alpha=\beta>1\): preference for fairness, with increasing certainty as values grow
  • \(\alpha>\beta\): prior preference for heads
  • \(\alpha<\beta\): prior preference for tails
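The beta density can be computed from the gamma function alone; the sketch below (function name illustrative) checks the two regimes listed above:

```python
from math import gamma

def beta_pdf(theta, a, b):
    """Beta density at theta with parameters a (alpha) and b (beta)."""
    B = gamma(a) * gamma(b) / gamma(a + b)  # normalizing constant B(alpha, beta)
    return theta**(a - 1) * (1 - theta)**(b - 1) / B

# alpha = beta = 1: uniform prior, density 1 everywhere on (0, 1)
print(beta_pdf(0.3, 1, 1))  # 1.0

# alpha = beta = 10: mass concentrated near 0.5 (preference for fairness)
print(beta_pdf(0.5, 10, 10) > beta_pdf(0.1, 10, 10))  # True
```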

Prior intuition: pseudo-counts

The beta prior can be interpreted as prior observations:

  • \(\alpha-1\): prior pseudo-heads
  • \(\beta-1\): prior pseudo-tails
  • \(\alpha+\beta\): prior strength

Interpretation:

  • Small \(\alpha+\beta\): weak prior influence
  • Large \(\alpha+\beta\): strong prior influence
  • Posterior update behaves like adding data to pseudo-counts

Posterior

By Bayes’ rule:

\[ p(\textcolor{params}{\theta}\mid \textcolor{output}{y}) = \frac{p(\textcolor{output}{y}\mid \textcolor{params}{\theta}) p(\textcolor{params}{\theta})}{p(\textcolor{output}{y})} \]

Because beta is conjugate to binomial, the posterior is also beta:

\[ p(\textcolor{params}{\theta}\mid \textcolor{output}{y}) = \frac{1}{B(\alpha', \beta')} \textcolor{params}{\theta}^{\alpha'-1} (1-\textcolor{params}{\theta})^{\beta'-1} \]

So we only need \(\alpha'\) and \(\beta'\).

Posterior: parameter update

From conjugacy:

\[ p(\textcolor{params}{\theta}\mid \textcolor{output}{y}) = \frac{1}{B(\alpha', \beta')} \textcolor{params}{\theta}^{\alpha'-1} (1-\textcolor{params}{\theta})^{\beta'-1} \]

From Bayes’ rule:

\[ p(\textcolor{params}{\theta}\mid \textcolor{output}{y}) \propto \textcolor{params}{\theta}^{\textcolor{output}{y}} (1-\textcolor{params}{\theta})^{n-\textcolor{output}{y}} \textcolor{params}{\theta}^{\alpha-1} (1-\textcolor{params}{\theta})^{\beta-1} \]

Matching powers gives:

\[ \alpha' = \alpha + \textcolor{output}{y}, \qquad \beta' = \beta + n - \textcolor{output}{y} \]
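The update is just two additions, which is the whole computational payoff of conjugacy (a minimal sketch; function name illustrative):

```python
def posterior_params(alpha, beta, y, n):
    """Conjugate beta-binomial update: return the posterior (alpha', beta')."""
    return alpha + y, beta + n - y

# Uniform prior (alpha = beta = 1), 7 heads observed in 10 tosses
a_post, b_post = posterior_params(1, 1, 7, 10)
print(a_post, b_post)  # 8 4
```

Note the update can be applied sequentially: updating on two batches of tosses gives the same posterior as updating once on their union.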

Posterior summaries

If \(\textcolor{params}{\theta}\mid \textcolor{output}{y}\sim \mathrm{Beta}(\alpha',\beta')\), then

\[ {\mathbb{E}}[\textcolor{params}{\theta}\mid \textcolor{output}{y}] = \frac{\alpha'}{\alpha'+\beta'} \]

\[ \mathrm{Var}(\textcolor{params}{\theta}\mid \textcolor{output}{y}) = \frac{\alpha'\beta'}{(\alpha'+\beta')^2(\alpha'+\beta'+1)} \]

If \(\alpha',\beta' > 1\), the MAP is

\[ \textcolor{params}{\theta}_{\mathrm{MAP}} = \frac{\alpha'-1}{\alpha'+\beta'-2} \]

As \(n\) grows, posterior mean and MAP become close.
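The three summaries follow directly from the formulas above (a minimal sketch; the MAP is only returned when it is defined, i.e. \(\alpha', \beta' > 1\)):

```python
def posterior_summaries(a, b):
    """Mean, variance, and MAP of a Beta(a, b) posterior (MAP is None if undefined)."""
    mean = a / (a + b)
    var = a * b / ((a + b)**2 * (a + b + 1))
    mode = (a - 1) / (a + b - 2) if a > 1 and b > 1 else None
    return mean, var, mode

# Posterior Beta(8, 4), e.g. after y = 7 heads in n = 10 tosses with a uniform prior
mean, var, mode = posterior_summaries(8, 4)
print(mean)  # 0.666...
print(mode)  # 0.7
```

With these numbers the mean (0.667) and MAP (0.7) already nearly agree after only ten tosses.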

Concentrating the posterior

Assume the coin is truly fair, i.e. the true probability of heads is \(0.5\).

  1. As we toss more times, the posterior concentrates around \(0.5\), regardless of prior.
  2. The prior still controls how fast concentration happens.
  3. The posterior becomes increasingly close to a normal distribution.

Prediction for the next toss

Use the same likelihood as before, but for one trial (\(n=1\)):

\[ p(y_{\mathrm{new}} \mid \textcolor{params}{\theta}, n=1)=\binom{1}{y_{\mathrm{new}}}\textcolor{params}{\theta}^{y_{\mathrm{new}}}(1-\textcolor{params}{\theta})^{1-y_{\mathrm{new}}} \]

So for a head (\(y_{\mathrm{new}}=1\)):

\[ p(y_{\mathrm{new}}=1 \mid \textcolor{params}{\theta}, n=1)=\binom{1}{1}\textcolor{params}{\theta}^1(1-\textcolor{params}{\theta})^0=\textcolor{params}{\theta} \]

Since \(\textcolor{params}{\theta}\) is unknown, we integrate over its posterior:

\[ p(y_{\mathrm{new}}=1 \mid \textcolor{output}{y}) = \int p(y_{\mathrm{new}}=1 \mid \textcolor{params}{\theta}, n=1) p(\textcolor{params}{\theta}\mid \textcolor{output}{y}) \, \mathrm{d}\textcolor{params}{\theta}= \int \textcolor{params}{\theta}\, p(\textcolor{params}{\theta}\mid \textcolor{output}{y}) \, \mathrm{d}\textcolor{params}{\theta}= {\mathbb{E}}[\textcolor{params}{\theta}\mid \textcolor{output}{y}] \]

With beta-binomial conjugacy:

\[ {\mathbb{E}}[\textcolor{params}{\theta}\mid \textcolor{output}{y}] = \frac{\alpha+\textcolor{output}{y}}{\alpha+\beta+n} \]
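The predictive probability is a one-line computation (a minimal sketch; with \(\alpha=\beta=1\) this reproduces Laplace's rule of succession, \((y+1)/(n+2)\)):

```python
def prob_next_head(alpha, beta, y, n):
    """Posterior predictive P(y_new = 1 | y): the posterior mean of theta."""
    return (alpha + y) / (alpha + beta + n)

# Uniform prior, 7 heads in 10 tosses
print(prob_next_head(1, 1, 7, 10))  # 8/12 ≈ 0.667
```

Note this is slightly shrunk toward \(0.5\) compared with the raw frequency \(7/10 = 0.7\); the prior pseudo-counts account for the difference.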

Is the coin fair? Bayesian decision view

Useful diagnostics:

  • 95% credible interval for \(\textcolor{params}{\theta}\)
  • Posterior probability \(p(\textcolor{params}{\theta}> 0.5 \mid \textcolor{output}{y})\)
  • Sensitivity to prior choices \((\alpha,\beta)\)

If the 95% credible interval is tight around \(0.5\), the data support the hypothesis that the coin is fair.
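These diagnostics are standard quantile and tail computations on the beta posterior; a sketch assuming SciPy is available (the posterior parameters are the running example, not fixed by the slides):

```python
from scipy.stats import beta as beta_dist  # assumes scipy is installed

a_post, b_post = 8, 4  # e.g. posterior after y = 7 heads in n = 10 tosses, uniform prior

# 95% equal-tailed credible interval for theta
lo, hi = beta_dist.interval(0.95, a_post, b_post)

# Posterior probability that the coin favours heads, P(theta > 0.5 | y)
p_heads_biased = beta_dist.sf(0.5, a_post, b_post)

print((lo, hi), p_heads_biased)
```

Rerunning this with different \((\alpha, \beta)\) is exactly the prior-sensitivity check listed above.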