Bayesian classification

Advanced Statistical Inference

Simone Rossi

EURECOM

Bayesian Logistic Regression

\[ \require{physics} \definecolor{input}{rgb}{0.42, 0.55, 0.74} \definecolor{params}{rgb}{0.51,0.70,0.40} \definecolor{output}{rgb}{0.843, 0.608, 0} \definecolor{vparams}{rgb}{0.58, 0, 0.83} \definecolor{noise}{rgb}{0.0, 0.48, 0.65} \definecolor{latent}{rgb}{0.8, 0.0, 0.8} \]


Where are we?

  1. We have seen linear regression and how to do inference with it.

  2. We have seen that exact inference is possible only in a few cases.

  3. We have seen that we can use approximate inference to do Bayesian inference in more complex models.

    • Laplace approximation, variational inference, MCMC methods.

Now we will apply what we have learned to a more complex problem: classification.

Classification

A classification problem is a problem where we want to assign a label to an input \({\textcolor{input}{\boldsymbol{x}}}\in{\mathbb{R}}^D\).

  • Binary classification: \(\textcolor{output}{y} \in \{0, 1\}\).
  • Multiclass classification: \(\textcolor{output}{y} \in \{0, 1, \ldots, K-1\}\) (one of \(K\) classes).

Probabilistic vs Non-probabilistic classifiers

Given a training set \(\{({\textcolor{input}{\boldsymbol{x}}}_n, \textcolor{output}{y_n})\}_{n=1}^N\), we want to predict the label \(\textcolor{output}{y}_\star\) of a new input \({\textcolor{input}{\boldsymbol{x}}}_\star\).

  • Probabilistic classifiers: output a probability distribution over the labels \(P(\textcolor{output}{y}_\star = k \mid {\textcolor{input}{\boldsymbol{x}}}_\star)\).
    • For binary classification, \(P(\textcolor{output}{y}_\star = 1 \mid {\textcolor{input}{\boldsymbol{x}}}_\star)\) and \(P(\textcolor{output}{y}_\star = 0 \mid {\textcolor{input}{\boldsymbol{x}}}_\star)\).
  • Non-probabilistic classifiers: produce a hard assignment \(\textcolor{output}{y}_\star = k\).

Examples: Probabilistic classifier

  • Logistic regression
  • Naive Bayes

Examples: Non-probabilistic classifier

  • Support Vector Machines
  • Decision Trees
  • K-Nearest Neighbors

Probabilistic classifiers

Probabilistic classifiers are more informative than non-probabilistic classifiers.

  • \(P(\textcolor{output}{y}_\star = 1 \mid {\textcolor{input}{\boldsymbol{x}}}_\star) = 0.7\) is more informative than \(\textcolor{output}{y}_\star = 1\).

  • \(P(\textcolor{output}{y}_\star = 1 \mid {\textcolor{input}{\boldsymbol{x}}}_\star)\) gives us a measure of confidence in the prediction.

  • Particularly useful in applications where the cost of misclassification is high.

Logistic Regression

  • Logistic regression is a probabilistic classifier that models the probability of the output \(\textcolor{output}{y}\) given the input \({\textcolor{input}{\boldsymbol{x}}}\).

  • We model \(P(\textcolor{output}{y} = 1 \mid {\textcolor{input}{\boldsymbol{x}}})\) through some function \(f({\textcolor{params}{\boldsymbol{w}}}, {\textcolor{input}{\boldsymbol{x}}})\).

  • In linear regression, \(f({\textcolor{params}{\boldsymbol{w}}}, {\textcolor{input}{\boldsymbol{x}}}) = {\textcolor{params}{\boldsymbol{w}}}^\top{\textcolor{input}{\boldsymbol{x}}}\). Can we use this for classification?

    • No: the output of linear regression is unbounded and it can’t be interpreted as a probability.

    • But, we can use a function \(h(\cdot)\) to map the output of linear regression to the interval \([0, 1]\).

Logistic function

For logistic regression, we use the sigmoid function

\[ P(\textcolor{output}{y} = 1 \mid {\textcolor{input}{\boldsymbol{x}}}) = \sigma({\textcolor{params}{\boldsymbol{w}}}^\top{\textcolor{input}{\boldsymbol{x}}}) = \frac{1}{1 + \exp(-{\textcolor{params}{\boldsymbol{w}}}^\top{\textcolor{input}{\boldsymbol{x}}})} \]

where \(\sigma(\cdot)\) is the sigmoid function.

Bayesian Logistic Regression

  • Same approach as for linear regression.

  • We place a prior distribution over the parameters \({\textcolor{params}{\boldsymbol{w}}}\) and we define a likelihood to obtain the posterior distribution.

\[ p({\textcolor{params}{\boldsymbol{w}}}\mid{\textcolor{output}{\boldsymbol{y}}}, {\textcolor{input}{\boldsymbol{X}}}) = \frac{p({\textcolor{output}{\boldsymbol{y}}}\mid{\textcolor{params}{\boldsymbol{w}}}, {\textcolor{input}{\boldsymbol{X}}})p({\textcolor{params}{\boldsymbol{w}}})}{p({\textcolor{output}{\boldsymbol{y}}}\mid{\textcolor{input}{\boldsymbol{X}}})} \]

  • We can make predictions by integrating over the posterior distribution.

\[ p(\textcolor{output}{y}_\star \mid {\textcolor{input}{\boldsymbol{x}}}_\star, {\textcolor{output}{\boldsymbol{y}}}, {\textcolor{input}{\boldsymbol{X}}}) = \int p(\textcolor{output}{y}_\star \mid {\textcolor{params}{\boldsymbol{w}}}, {\textcolor{input}{\boldsymbol{x}}}_\star)p({\textcolor{params}{\boldsymbol{w}}}\mid{\textcolor{output}{\boldsymbol{y}}}, {\textcolor{input}{\boldsymbol{X}}})\dd{\textcolor{params}{\boldsymbol{w}}} \]

Bayesian Logistic Regression: Likelihood

  • First, we assume that the data points are conditionally independent given the parameters.

\[ p({\textcolor{output}{\boldsymbol{y}}}\mid{\textcolor{params}{\boldsymbol{w}}}, {\textcolor{input}{\boldsymbol{X}}}) = \prod_{n=1}^N p(\textcolor{output}{y}_n\mid{\textcolor{params}{\boldsymbol{w}}}, {\textcolor{input}{\boldsymbol{x}}}_n) \]

We already know that:

  • \(P(\textcolor{output}{y}_n=1\mid{\textcolor{params}{\boldsymbol{w}}}, {\textcolor{input}{\boldsymbol{x}}}_n) = \sigma({\textcolor{params}{\boldsymbol{w}}}^\top{\textcolor{input}{\boldsymbol{x}}}_n)\)
  • \(P(\textcolor{output}{y}_n=0\mid{\textcolor{params}{\boldsymbol{w}}}, {\textcolor{input}{\boldsymbol{x}}}_n) = 1 - \sigma({\textcolor{params}{\boldsymbol{w}}}^\top{\textcolor{input}{\boldsymbol{x}}}_n)\).

What distribution does this correspond to?

Bayesian Logistic Regression: Likelihood

  • The likelihood of a single data point is a Bernoulli distribution.

\[ p(\textcolor{output}{y}_n\mid{\textcolor{params}{\boldsymbol{w}}}, {\textcolor{input}{\boldsymbol{x}}}_n) = \text{Bern}(\textcolor{output}y_n\mid\sigma({\textcolor{params}{\boldsymbol{w}}}^\top{\textcolor{input}{\boldsymbol{x}}}_n)) \]

where

\[ p(\textcolor{output}{y}_n\mid{\textcolor{params}{\boldsymbol{w}}}, {\textcolor{input}{\boldsymbol{x}}}_n) = \begin{cases} \sigma({\textcolor{params}{\boldsymbol{w}}}^\top{\textcolor{input}{\boldsymbol{x}}}_n) & \text{if } \textcolor{output}{y}_n = 1 \\ 1 - \sigma({\textcolor{params}{\boldsymbol{w}}}^\top{\textcolor{input}{\boldsymbol{x}}}_n) & \text{if } \textcolor{output}{y}_n = 0 \end{cases} \]

Sometimes, we write this as

\[ p(\textcolor{output}{y}_n\mid{\textcolor{params}{\boldsymbol{w}}}, {\textcolor{input}{\boldsymbol{x}}}_n) = \sigma({\textcolor{params}{\boldsymbol{w}}}^\top{\textcolor{input}{\boldsymbol{x}}}_n)^{\textcolor{output}{y}_n}(1 - \sigma({\textcolor{params}{\boldsymbol{w}}}^\top{\textcolor{input}{\boldsymbol{x}}}_n))^{1 - \textcolor{output}{y}_n} \]

Bayesian Logistic Regression: Prior

  • For logistic regression, we can use a Gaussian prior over the parameters.

\[ p({\textcolor{params}{\boldsymbol{w}}}) = {\mathcal{N}}({\textcolor{params}{\boldsymbol{w}}}\mid{\boldsymbol{0}}, \sigma_{\textcolor{params}{\boldsymbol{w}}}^2{\boldsymbol{I}}) \]

  • Previously, we used a Gaussian prior because it makes the math easier.

  • For logistic regression, no choice of prior makes the math easier: there is no conjugate prior for the Bernoulli likelihood with a sigmoid link.

  • We can use any prior that we want, but the Gaussian prior is a common choice.

Bayesian Logistic Regression: Posterior

  • The posterior distribution is given by

\[ p({\textcolor{params}{\boldsymbol{w}}}\mid{\textcolor{output}{\boldsymbol{y}}}, {\textcolor{input}{\boldsymbol{X}}}) = \frac{p({\textcolor{output}{\boldsymbol{y}}}\mid{\textcolor{params}{\boldsymbol{w}}}, {\textcolor{input}{\boldsymbol{X}}})p({\textcolor{params}{\boldsymbol{w}}})}{p({\textcolor{output}{\boldsymbol{y}}}\mid{\textcolor{input}{\boldsymbol{X}}})} \]

Now things get a bit more complicated.

  • We can’t compute the posterior distribution in closed form:

    • Prior is not conjugate to the likelihood. No prior is!
    • We don’t know the form of \(p({\textcolor{params}{\boldsymbol{w}}}\mid{\textcolor{output}{\boldsymbol{y}}}, {\textcolor{input}{\boldsymbol{X}}})\).
    • We can’t compute the normalization constant \(p({\textcolor{output}{\boldsymbol{y}}}\mid{\textcolor{input}{\boldsymbol{X}}})\).

    \[ p({\textcolor{output}{\boldsymbol{y}}}\mid{\textcolor{input}{\boldsymbol{X}}}) = \int p({\textcolor{output}{\boldsymbol{y}}}\mid{\textcolor{params}{\boldsymbol{w}}}, {\textcolor{input}{\boldsymbol{X}}})p({\textcolor{params}{\boldsymbol{w}}})\dd{\textcolor{params}{\boldsymbol{w}}} \]

Bayesian Logistic Regression: Approximate Inference

Bayesian logistic regression is the first example where we will use approximate inference.

  1. Maximum a posteriori (MAP) estimation: find the mode of the posterior distribution and use it as the point estimate of the parameters.

  2. Laplace approximation: approximate the posterior distribution with a Gaussian distribution centered at the mode of the posterior.

  3. Variational inference: approximate the posterior distribution with a simpler distribution that is easier to work with.

  4. MCMC methods: sample from the posterior distribution using Markov Chain Monte Carlo methods.

Approximate Inference for Bayesian Logistic Regression

Some data

Let’s generate some data to illustrate Bayesian logistic regression.

Maximum a Posteriori (MAP) Estimation

  • We can find the mode of the posterior distribution by maximizing the (log) posterior distribution.

\[ {\textcolor{params}{\boldsymbol{w}}}_{\text{MAP}} = \arg\max_{{\textcolor{params}{\boldsymbol{w}}}} \log p({\textcolor{params}{\boldsymbol{w}}}\mid{\textcolor{output}{\boldsymbol{y}}}, {\textcolor{input}{\boldsymbol{X}}}) = \arg\min_{{\textcolor{params}{\boldsymbol{w}}}} \underbrace{-\log p({\textcolor{output}{\boldsymbol{y}}}\mid{\textcolor{params}{\boldsymbol{w}}}, {\textcolor{input}{\boldsymbol{X}}}) - \log p({\textcolor{params}{\boldsymbol{w}}})}_{{\mathcal{U}}({\textcolor{params}{\boldsymbol{w}}})} \]

  • For linear regression, we could find \({\textcolor{params}{\boldsymbol{w}}}_{\text{MAP}}\) in closed form.
  • For logistic regression, we can’t do this in closed form.

We can use numerical optimization methods to find \({\textcolor{params}{\boldsymbol{w}}}_{\text{MAP}}\):

  • Gradient ascent (or descent)
  • Newton–Raphson method

Gradient descent

  • We can use gradient descent to find the mode of the posterior distribution.

  • We need to compute the gradient \(\grad_{{\textcolor{params}{\boldsymbol{w}}}} {\mathcal{U}}({\textcolor{params}{\boldsymbol{w}}})\) and iterate until convergence.

\[ {\textcolor{params}{\boldsymbol{w}}}^{(t+1)} \gets {\textcolor{params}{\boldsymbol{w}}}^{(t)} - \eta \grad_{{\textcolor{params}{\boldsymbol{w}}}} {\mathcal{U}}({\textcolor{params}{\boldsymbol{w}}}^{(t)}) \]

where \(\eta\) is the learning rate.

Note

In practice, we use automatic differentiation to compute the gradient.

  • JAX, PyTorch, TensorFlow, etc. have built-in functions to compute gradients.
  • In the labs, we will mostly use JAX, because it is fast and easy to use.

Maximum a Posteriori (MAP) Estimation

\[ {\textcolor{params}{\boldsymbol{w}}}_{\text{MAP}} = \arg\max_{{\textcolor{params}{\boldsymbol{w}}}} \log p({\textcolor{params}{\boldsymbol{w}}}\mid{\textcolor{output}{\boldsymbol{y}}}, {\textcolor{input}{\boldsymbol{X}}}) \]

Decision boundary

  • Once we have found \({\textcolor{params}{\boldsymbol{w}}}_{\text{MAP}}\), we can classify new examples.

  • Decision boundary: the set of inputs where the two classes are equally probable, i.e. \(P(\textcolor{output}{y}_\star = 1 \mid {\textcolor{input}{\boldsymbol{x}}}_\star) = 0.5\); for a linear model this is a line (hyperplane).

Predictive probabilities

  • Does the decision boundary look good?

Laplace Approximation

Recall Laplace approximation:

  • Approximate the posterior distribution with a Gaussian distribution \(q({\textcolor{params}{\boldsymbol{w}}}) = {\mathcal{N}}({\textcolor{params}{\boldsymbol{w}}}\mid{\boldsymbol{\mu}}, {\boldsymbol{\Sigma}})\).
  • Find the mode of the posterior distribution \({\textcolor{params}{\boldsymbol{w}}}_{\text{MAP}}\) (we have this already).
  • Compute the Hessian of the log posterior at \({\textcolor{params}{\boldsymbol{w}}}_{\text{MAP}}\).

\[ {\boldsymbol{H}}= -\grad^2 \log p({\textcolor{params}{\boldsymbol{w}}}\mid{\textcolor{output}{\boldsymbol{y}}}, {\textcolor{input}{\boldsymbol{X}}}) \big|_{{\textcolor{params}{\boldsymbol{w}}}= {\textcolor{params}{\boldsymbol{w}}}_{\text{MAP}}} \]

  • The covariance of the Gaussian approximation is given by the inverse of the Hessian.

Laplace Approximation

Making predictions

  • We can make predictions by integrating over the posterior distribution. \[ p(\textcolor{output}{y}_\star \mid {\textcolor{input}{\boldsymbol{x}}}_\star, {\textcolor{output}{\boldsymbol{y}}}, {\textcolor{input}{\boldsymbol{X}}}) = \int p(\textcolor{output}{y}_\star \mid {\textcolor{params}{\boldsymbol{w}}}, {\textcolor{input}{\boldsymbol{x}}}_\star)q({\textcolor{params}{\boldsymbol{w}}})\dd{\textcolor{params}{\boldsymbol{w}}} \]

  • Even though we have a Gaussian approximation of the posterior, this integral is intractable.

Two common approximations:

  1. Monte Carlo integration: sample from the posterior distribution and average the predictions (see the sketch after this list). \[ p(\textcolor{output}{y}_\star \mid {\textcolor{input}{\boldsymbol{x}}}_\star, {\textcolor{output}{\boldsymbol{y}}}, {\textcolor{input}{\boldsymbol{X}}}) \approx \frac{1}{S}\sum_{s=1}^S p(\textcolor{output}{y}_\star \mid {\textcolor{params}{\boldsymbol{w}}}^{(s)}, {\textcolor{input}{\boldsymbol{x}}}_\star) \quad \text{where } {\textcolor{params}{\boldsymbol{w}}}^{(s)} \sim q({\textcolor{params}{\boldsymbol{w}}}) \]

  2. Probit approximation: use a deterministic approximation of the sigmoid function.

Making predictions with probit approximation

The logistic function \(\sigma(t)\) is approximated by the cumulative distribution function of a standard Gaussian, \(\Phi(\lambda t)\), with \(\lambda = \sqrt{\pi/8}\).

Making predictions with probit approximation

  • We can compute the distribution of \(\textcolor{latent}{f}_\star = f({\textcolor{params}{\boldsymbol{w}}}, {\textcolor{input}{\boldsymbol{x}}}_\star)\) where \(f({\textcolor{params}{\boldsymbol{w}}}, {\textcolor{input}{\boldsymbol{x}}}) = {\textcolor{params}{\boldsymbol{w}}}^\top{\textcolor{input}{\boldsymbol{x}}}\) before applying the sigmoid function.

  • The distribution of \(\textcolor{latent}{f}_\star\) is a Gaussian distribution. \[ q(\textcolor{latent}{f}_\star \mid {\textcolor{input}{\boldsymbol{x}}}_\star, {\textcolor{output}{\boldsymbol{y}}}, {\textcolor{input}{\boldsymbol{X}}}) = \int p(\textcolor{latent}{f}_\star \mid {\textcolor{params}{\boldsymbol{w}}}, {\textcolor{input}{\boldsymbol{x}}}_\star){\mathcal{N}}({\textcolor{params}{\boldsymbol{w}}}\mid{\boldsymbol{\mu}}, {\boldsymbol{\Sigma}})\dd{\textcolor{params}{\boldsymbol{w}}}= {\mathcal{N}}(\textcolor{latent}{f}_\star\mid{\boldsymbol{\mu}}^\top{\textcolor{input}{\boldsymbol{x}}}_\star, {\textcolor{input}{\boldsymbol{x}}}_\star^\top{\boldsymbol{\Sigma}}{\textcolor{input}{\boldsymbol{x}}}_\star) \]

  • Apply the probit approximation \[ p(\textcolor{output}{y}_\star \mid {\textcolor{input}{\boldsymbol{x}}}_\star, {\textcolor{output}{\boldsymbol{y}}}, {\textcolor{input}{\boldsymbol{X}}}) \approx \int \Phi(\lambda \textcolor{latent}{f}_\star)q(\textcolor{latent}{f}_\star \mid {\textcolor{input}{\boldsymbol{x}}}_\star, {\textcolor{output}{\boldsymbol{y}}}, {\textcolor{input}{\boldsymbol{X}}})\dd{\textcolor{latent}{f}_\star} \approx \sigma\left(\frac{{\boldsymbol{\mu}}^\top{\textcolor{input}{\boldsymbol{x}}}_\star}{\sqrt{1 + \frac \pi 8 {\textcolor{input}{\boldsymbol{x}}}_\star^\top{\boldsymbol{\Sigma}}{\textcolor{input}{\boldsymbol{x}}}_\star}}\right) \]

Making predictions with the Laplace approximation

Curved decision boundary?

  • By marginalizing over the posterior distribution, we don’t consider a single decision boundary.

  • We consider all possible decision boundaries weighted by the uncertainty in the parameters.

Variational Inference

Recall variational inference:

  1. Define a family of distributions \(q({\textcolor{params}{\boldsymbol{w}}}; {\textcolor{vparams}{\boldsymbol{\nu}}})\), where \({\textcolor{vparams}{\boldsymbol{\nu}}}\) are the variational parameters.
  2. Minimize the KL divergence between the true posterior and the variational distribution

\[ {\textcolor{vparams}{\boldsymbol{\nu}}}^* = \arg\min_{{\textcolor{vparams}{\boldsymbol{\nu}}}} \text{KL}\left(q({\textcolor{params}{\boldsymbol{w}}}; {\textcolor{vparams}{\boldsymbol{\nu}}}) \parallel p({\textcolor{params}{\boldsymbol{w}}}\mid{\textcolor{output}{\boldsymbol{y}}}, {\textcolor{input}{\boldsymbol{X}}})\right) \]

Remember: this is intractable!

  3. Use the evidence lower bound (ELBO) instead to optimize the variational parameters.

\[ {\mathcal{L}}_{\text{ELBO}}({\textcolor{vparams}{\boldsymbol{\nu}}}) = {\mathbb{E}}_{q({\textcolor{params}{\boldsymbol{w}}}; {\textcolor{vparams}{\boldsymbol{\nu}}})}[\log p({\textcolor{output}{\boldsymbol{y}}}\mid{\textcolor{params}{\boldsymbol{w}}}, {\textcolor{input}{\boldsymbol{X}}})] - \text{KL}\left(q({\textcolor{params}{\boldsymbol{w}}}; {\textcolor{vparams}{\boldsymbol{\nu}}}) \parallel p({\textcolor{params}{\boldsymbol{w}}})\right) \]

Variational Inference with diagonal Gaussian

Assume \(q({\textcolor{params}{\boldsymbol{w}}}; {\textcolor{vparams}{\boldsymbol{\nu}}}) = \prod_{i=1}^D {\mathcal{N}}(w_i\mid\textcolor{vparams}{\mu_i}, \textcolor{vparams}{\sigma_i^2}) = {\mathcal{N}}({\textcolor{params}{\boldsymbol{w}}}\mid\textcolor{vparams}{\boldsymbol{\mu}}, \operatorname{diag}(\textcolor{vparams}{\boldsymbol{\sigma}^2}))\).

Breaking down the ELBO

Variational Inference with full Gaussian

Assume \(q({\textcolor{params}{\boldsymbol{w}}}; {\textcolor{vparams}{\boldsymbol{\nu}}}) = {\mathcal{N}}({\textcolor{params}{\boldsymbol{w}}}\mid\textcolor{vparams}{\boldsymbol{\mu}}, \textcolor{vparams}{\boldsymbol{L}}\textcolor{vparams}{\boldsymbol{L}}^\top)\), where \(\textcolor{vparams}{\boldsymbol{L}}\) is a lower-triangular (Cholesky) factor, so the covariance is a full matrix.

Breaking down the ELBO

ELBO: Lower bound on the log marginal likelihood

  • \({\mathcal{L}}_{\text{ELBO}}({\textcolor{vparams}{\boldsymbol{\nu}}}) \le \log p({\textcolor{output}{\boldsymbol{y}}}\mid{\textcolor{input}{\boldsymbol{X}}})\), with equality if \(q({\textcolor{params}{\boldsymbol{w}}}; {\textcolor{vparams}{\boldsymbol{\nu}}}) = p({\textcolor{params}{\boldsymbol{w}}}\mid{\textcolor{output}{\boldsymbol{y}}}, {\textcolor{input}{\boldsymbol{X}}})\).

Predictions with variational inference

\(p(\textcolor{output}{y}_\star \mid {\textcolor{input}{\boldsymbol{x}}}_\star, {\textcolor{output}{\boldsymbol{y}}}, {\textcolor{input}{\boldsymbol{X}}}) \approx \int p(\textcolor{output}{y}_\star \mid {\textcolor{params}{\boldsymbol{w}}}, {\textcolor{input}{\boldsymbol{x}}}_\star)q({\textcolor{params}{\boldsymbol{w}}}; {\textcolor{vparams}{\boldsymbol{\nu}}})\dd{\textcolor{params}{\boldsymbol{w}}}\)

Sampling from the posterior with MCMC

Recall MCMC methods:

  • Even though we can’t compute the posterior distribution, we can sample from it.

  • Markov Chain Monte Carlo (MCMC) methods are a class of algorithms that allow us to sample from complex distributions.

  • MCMC methods are based on the idea of proposing new samples and accepting them with a certain probability:

    1. Random walk: propose a new sample by adding Gaussian noise to the current sample (see the sketch after this list).
    2. Hamiltonian Monte Carlo: propose a new sample by simulating the dynamics of a physical system.

Sampling from the posterior with MCMC

Has it converged?

  • Run multiple chains with different initializations.
  • Check for convergence using trace plots and density estimates.

Predictions with MCMC

Which method to use?

No free lunch theorem: no method is the best for all problems.

  • MAP estimation: fast, but doesn’t provide uncertainty estimates.

  • Laplace approximation: needs to compute the MAP estimate and the Hessian inverse (can be very expensive for large models).

  • Variational inference: flexible, scalable to large models and datasets, but rough approximation (unless you use more complex distributions).

  • MCMC: most accurate, but computationally expensive:

    • Random walk Metropolis: simple, but slow convergence.
    • Hamiltonian Monte Carlo: faster convergence, but requires gradient computation.

Uncertainty estimates

Uncertainty on class predictions

One advantage of Bayesian methods is that they provide uncertainty estimates.

  • We have seen how to average all the decision boundaries for \(P(\textcolor{output}{y}_\star = 1 \mid {\textcolor{input}{\boldsymbol{x}}}_\star, {\textcolor{output}{\boldsymbol{y}}}, {\textcolor{input}{\boldsymbol{X}}})\), by weighting them by the posterior distribution.

  • But we can also compute the variance of the predictions, which gives us an idea of the uncertainty on the class probabilities, e.g. \(P(\textcolor{output}{y}_\star = 1 \mid {\textcolor{input}{\boldsymbol{x}}}_\star, {\textcolor{output}{\boldsymbol{y}}}, {\textcolor{input}{\boldsymbol{X}}}) = 0.7 \pm 0.1\).

Uncertainty on class predictions (variance)

  • We can compute the variance of the class probabilities.

Performance evaluation

Evaluating the performance of a classifier

Accuracy: proportion of correctly classified examples.

Error rate (or 0/1 loss): proportion of misclassified examples.

  • Consider a set of predictions \(\widehat{\textcolor{output}{y}}_i\) and true labels \(\textcolor{output}{y}_i\) for \(i=1,\ldots,N\).

\[ \ell(\widehat{{\textcolor{output}{\boldsymbol{y}}}}, {\textcolor{output}{\boldsymbol{y}}}) = \frac{1}{N}\sum_{i=1}^N \mathbb{I}(\widehat{\textcolor{output}{y}}_i \neq \textcolor{output}{y}_i) \]

where \(\mathbb{I}(\cdot)\) is the indicator function (1 if the condition is true, 0 otherwise).

The hard predictions are obtained from the predictive probabilities:

\[\widehat{\textcolor{output}{y}}_i = \arg\max_{k}\, p(\textcolor{output}{y}_i = k \mid {\textcolor{input}{\boldsymbol{x}}}_i)\]

Limitations of accuracy

Accuracy has some advantages:

  • Easy to interpret.
  • Can be used for binary and multiclass classification.
  • Single number that we can use to compare different models.

But it has some limitations:

  • Not suitable for imbalanced datasets.

Example

  • We are building a classifier to detect fraudulent transactions; only 1% of the transactions are fraudulent.

  • Class labels: \(\textcolor{output}{y} = 0\) (not fraudulent); \(\textcolor{output}{y} = 1\) (fraudulent).

  • Build a classifier that always predicts \(\textcolor{output}{y} = 0\).

  • What is the accuracy of the classifier? Is it a good classifier?

Precision and recall

Need to define 4 quantities:

  • True positive (TP): number of positive examples correctly classified (\(\textcolor{output}{y} = 1\), \(\widehat{\textcolor{output}{y}} = 1\)).
  • True negative (TN): number of negative examples correctly classified (\(\textcolor{output}{y} = 0\), \(\widehat{\textcolor{output}{y}} = 0\)).
  • False positive (FP): number of negative examples incorrectly classified (\(\textcolor{output}{y} = 0\), \(\widehat{\textcolor{output}{y}} = 1\)).
  • False negative (FN): number of positive examples incorrectly classified (\(\textcolor{output}{y} = 1\), \(\widehat{\textcolor{output}{y}} = 0\)).

Precision and recall

  • Precision: proportion of correctly classified positive examples among all examples classified as positive.

\[ \text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}} = 1 - \frac{\text{FP}}{\text{TP} + \text{FP}} \]

  • Recall (or sensitivity): proportion of correctly classified positive examples among all positive examples.

\[ \text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}} = 1 - \frac{\text{FN}}{\text{TP} + \text{FN}} \]

Combining precision and recall: the F1 score

Single metric that combines precision and recall: F1 score.

\[ \text{F1 score} = 2\frac{\text{Precision}\times\text{Recall}}{\text{Precision} + \text{Recall}} \]

In our example of fraudulent transactions:

  • \(\text{Precision} = 0\): no transaction is ever flagged as fraudulent, so \(\text{TP} = \text{FP} = 0\) (the ratio \(0/0\) is taken as 0 by convention).
  • \(\text{Recall} = 0\): none of the fraudulent transactions is detected (\(\text{TP} = 0\), \(\text{FN} > 0\)).
  • \(\text{F1 score} = 0\).

Predictive test log-likelihood

  • So far, we have evaluated the performance of the classifier based on the hard class predictions, not on the predicted probabilities.

  • Predictive log-likelihood: test the ability of the model to correctly predict the class probabilities.

Recall the predictive distribution \(p(\textcolor{output}{y}_\star \mid {\textcolor{input}{\boldsymbol{x}}}_\star, {\textcolor{output}{\boldsymbol{y}}}, {\textcolor{input}{\boldsymbol{X}}})\) after marginalizing over the parameters with the posterior distribution (or any approximation).

\[ p(\textcolor{output}{y}_\star \mid {\textcolor{input}{\boldsymbol{x}}}_\star, {\textcolor{output}{\boldsymbol{y}}}, {\textcolor{input}{\boldsymbol{X}}}) = \int p(\textcolor{output}{y}_\star \mid {\textcolor{params}{\boldsymbol{w}}}, {\textcolor{input}{\boldsymbol{x}}}_\star)p({\textcolor{params}{\boldsymbol{w}}}\mid{\textcolor{output}{\boldsymbol{y}}}, {\textcolor{input}{\boldsymbol{X}}})\dd{\textcolor{params}{\boldsymbol{w}}} \]

Predictive test log-likelihood

If we have samples from the posterior distribution:

\[ \begin{aligned} \log p(\textcolor{output}{y}_\star \mid {\textcolor{input}{\boldsymbol{x}}}_\star, {\textcolor{output}{\boldsymbol{y}}}, {\textcolor{input}{\boldsymbol{X}}}) &\approx \log\frac{1}{S}\sum_{s=1}^S p(\textcolor{output}{y}_\star \mid {\textcolor{params}{\boldsymbol{w}}}^{(s)}, {\textcolor{input}{\boldsymbol{x}}}_\star)\\ &= \log\sum_{s=1}^S p(\textcolor{output}{y}_\star \mid {\textcolor{params}{\boldsymbol{w}}}^{(s)}, {\textcolor{input}{\boldsymbol{x}}}_\star) - \log S \end{aligned} \]

where \({\textcolor{params}{\boldsymbol{w}}}^{(s)} \sim p({\textcolor{params}{\boldsymbol{w}}}\mid{\textcolor{output}{\boldsymbol{y}}}, {\textcolor{input}{\boldsymbol{X}}})\).

Problem: we cannot swap the logarithm and the sum, and we can run into numerical issues when computing the likelihood for each sample \(p(\textcolor{output}{y}_\star \mid {\textcolor{params}{\boldsymbol{w}}}^{(s)}, {\textcolor{input}{\boldsymbol{x}}}_\star)\).

Log-sum-exp trick

  • We can use the log-sum-exp trick to avoid numerical issues. \[ \begin{aligned} \log\sum_{s=1}^S p(\textcolor{output}{y}_\star \mid {\textcolor{params}{\boldsymbol{w}}}^{(s)}, {\textcolor{input}{\boldsymbol{x}}}_\star) &= \log\sum_{s=1}^S \exp\left(\log p(\textcolor{output}{y}_\star \mid {\textcolor{params}{\boldsymbol{w}}}^{(s)}, {\textcolor{input}{\boldsymbol{x}}}_\star)\right)\\ &= \text{logsumexp}\left(\log p(\textcolor{output}{y}_\star \mid {\textcolor{params}{\boldsymbol{w}}}^{(1)}, {\textcolor{input}{\boldsymbol{x}}}_\star), \ldots, \log p(\textcolor{output}{y}_\star \mid {\textcolor{params}{\boldsymbol{w}}}^{(S)}, {\textcolor{input}{\boldsymbol{x}}}_\star)\right) \end{aligned} \]

  • Many libraries (NumPy, PyTorch, JAX, etc.) provide a numerically stable implementation of the log-sum-exp function.

  • The trick is based on the following identity: \[ \log(\exp(a) + \exp(b)) = a + \log(1 + \exp(b - a)) \]

with \(a \ge b\), so that the argument of the exponential is never positive and cannot overflow.

Calibration

Calibration: the predicted probabilities should match the true probabilities.

  • When the model predicts a class with probability 0.9, we expect that 90% of the examples receiving that prediction actually belong to that class.

Tools to evaluate calibration:

  • Expected calibration error (ECE)
  • Reliability diagram

Accuracy and confidence

To compute calibration metrics, we need to group the examples based on the predicted probabilities.

Define:

  • \(\widehat{\textcolor{output}{y}}_i = \arg\max_k p(\textcolor{output}{y}_i = k \mid {\textcolor{input}{\boldsymbol{x}}}_i)\): the predicted class.
  • \(\widehat{p}_i = \max_k p(\textcolor{output}{y}_i = k \mid {\textcolor{input}{\boldsymbol{x}}}_i)\): the associated confidence.

Now:

  1. Divide the range of predicted probabilities into \(B\) buckets or bins.
  2. Let \(\mathcal B_b\) be the set of examples with predicted probabilities in \(\left(\frac{b-1}{B}, \frac{b}{B}\right]\).
  3. For each bin \(\mathcal B_b\), compute:
    • Accuracy: \(\text{acc}(\mathcal B_b) = \frac{1}{|\mathcal B_b|}\sum_{i\in\mathcal B_b} \mathbb{I}(\textcolor{output}{y}_i = \widehat{\textcolor{output}{y}}_i)\)
    • Confidence: \(\text{conf}(\mathcal B_b) = \frac{1}{|\mathcal B_b|}\sum_{i\in\mathcal B_b} \widehat{p}_i\)

Reliability diagram

Plot the accuracy as a function of the confidence.

Expected calibration error (ECE)

Expected calibration error (ECE) measures “how much” the predicted probabilities deviate from the true probabilities.

\[ \text{ECE} = \sum_{b=1}^B \frac{|\mathcal B_b|}{N}\left|\text{acc}(\mathcal B_b) - \text{conf}(\mathcal B_b)\right| \]

where \(N\) is the total number of examples.

  • ECE = 0: perfect calibration.
  • ECE = 1: worst calibration (maximum error).

Improving calibration

  • Temperature scaling: scale the logits before applying the sigmoid function. \[ p(\textcolor{output}{y}_i \mid {\textcolor{params}{\boldsymbol{w}}}, {\textcolor{input}{\boldsymbol{x}}}_i) = \text{Bern}(\textcolor{output}{y}_i\mid\sigma({\textcolor{params}{\boldsymbol{w}}}^\top{\textcolor{input}{\boldsymbol{x}}}_i/T)) \]

Improving calibration: a recipe for temperature scaling

  1. Train the model.
  2. Compute the ECE on the validation set.
  3. Choose the temperature that minimizes the ECE on the validation set.
  4. Re-calibrate the model using the chosen temperature.

Note: temperature scaling does not affect the hard predictions or the resulting performance metrics (accuracy, F1 score, etc.); it only changes the calibration.

Another example of temperature scaling

Temperature scaling in LLMs