Advanced Statistical Inference
EURECOM
\[ \require{physics} \definecolor{input}{rgb}{0.42, 0.55, 0.74} \definecolor{params}{rgb}{0.51,0.70,0.40} \definecolor{output}{rgb}{0.843, 0.608, 0} \definecolor{vparams}{rgb}{0.58, 0, 0.83} \definecolor{noise}{rgb}{0.0, 0.48, 0.65} \definecolor{latent}{rgb}{0.8, 0.0, 0.8} \]
We have seen linear regression and how to do inference with it.
We have seen that exact inference is only possible for few cases.
We have seen that we can use approximate inference to do Bayesian inference in more complex models.
Now we will apply what we have learned to a more complex problem: classification.
A classification problem is a problem where we want to assign a label to an input \({\textcolor{input}{\boldsymbol{x}}}\in{\mathbb{R}}^D\).
Given a training set \(\{({\textcolor{input}{\boldsymbol{x}}}_n, \textcolor{output}{y_n})\}_{n=1}^N\), we want to predict the label \(\textcolor{output}{y}_\star\) of a new input \({\textcolor{input}{\boldsymbol{x}}}_\star\).
Examples: Probabilistic classifier
Examples: Non-probabilistic classifier
Probabilistic classifiers are more informative than non-probabilistic classifiers.
\(P(\textcolor{output}{y}_\star = 1 \mid {\textcolor{input}{\boldsymbol{x}}}_\star) = 0.7\) is more informative than \(\textcolor{output}{y}_\star = 1\).
\(P(\textcolor{output}{y}_\star = 1 \mid {\textcolor{input}{\boldsymbol{x}}}_\star)\) gives us a measure of confidence in the prediction.
Particularly useful in applications where the cost of misclassification is high.
Logistic regression is a probabilistic classifier that models the probability of the output \(\textcolor{output}{y}\) given the input \({\textcolor{input}{\boldsymbol{x}}}\).
We model \(P(\textcolor{output}{y} = 1 \mid {\textcolor{input}{\boldsymbol{x}}})\) through some function \(f({\textcolor{params}{\boldsymbol{w}}}, {\textcolor{input}{\boldsymbol{x}}})\).
In linear regression, \(f({\textcolor{params}{\boldsymbol{w}}}, {\textcolor{input}{\boldsymbol{x}}}) = {\textcolor{params}{\boldsymbol{w}}}^\top{\textcolor{input}{\boldsymbol{x}}}\). Can we use this for classification?
No: the output of linear regression is unbounded and it can’t be interpreted as a probability.
But, we can use a function \(h(\cdot)\) to map the output of linear regression to the interval \([0, 1]\).
For logistic regression, we use the sigmoid function
\[ P(\textcolor{output}{y} = 1 \mid {\textcolor{input}{\boldsymbol{x}}}) = \sigma({\textcolor{params}{\boldsymbol{w}}}^\top{\textcolor{input}{\boldsymbol{x}}}) = \frac{1}{1 + \exp(-{\textcolor{params}{\boldsymbol{w}}}^\top{\textcolor{input}{\boldsymbol{x}}})} \]
where \(\sigma(\cdot)\) is the sigmoid function.
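As a minimal sketch, the sigmoid can be implemented in a numerically stable way (the function name `sigmoid` is our own; SciPy provides the equivalent `scipy.special.expit`):

```python
import numpy as np

def sigmoid(t):
    """Numerically stable logistic sigmoid sigma(t) = 1 / (1 + exp(-t))."""
    t = np.asarray(t, dtype=float)
    # exp(-|t|) never overflows; both branches are algebraically equal to sigma(t).
    e = np.exp(-np.abs(t))
    return np.where(t >= 0, 1.0 / (1.0 + e), e / (1.0 + e))
```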
Same approach as for linear regression.
We place a prior distribution over the parameters \({\textcolor{params}{\boldsymbol{w}}}\) and we define a likelihood to obtain the posterior distribution.
\[ p({\textcolor{params}{\boldsymbol{w}}}\mid{\textcolor{output}{\boldsymbol{y}}}, {\textcolor{input}{\boldsymbol{X}}}) = \frac{p({\textcolor{output}{\boldsymbol{y}}}\mid{\textcolor{params}{\boldsymbol{w}}}, {\textcolor{input}{\boldsymbol{X}}})p({\textcolor{params}{\boldsymbol{w}}})}{p({\textcolor{output}{\boldsymbol{y}}}\mid{\textcolor{input}{\boldsymbol{X}}})} \]
\[ p(\textcolor{output}{y}_\star \mid {\textcolor{input}{\boldsymbol{x}}}_\star, {\textcolor{output}{\boldsymbol{y}}}, {\textcolor{input}{\boldsymbol{X}}}) = \int p(\textcolor{output}{y}_\star \mid {\textcolor{params}{\boldsymbol{w}}}, {\textcolor{input}{\boldsymbol{x}}}_\star)p({\textcolor{params}{\boldsymbol{w}}}\mid{\textcolor{output}{\boldsymbol{y}}}, {\textcolor{input}{\boldsymbol{X}}})\dd{\textcolor{params}{\boldsymbol{w}}} \]
\[ p({\textcolor{output}{\boldsymbol{y}}}\mid{\textcolor{params}{\boldsymbol{w}}}, {\textcolor{input}{\boldsymbol{X}}}) = \prod_{n=1}^N p(\textcolor{output}{y}_n\mid{\textcolor{params}{\boldsymbol{w}}}, {\textcolor{input}{\boldsymbol{x}}}_n) \]
We already know that \(P(\textcolor{output}{y} = 1 \mid {\textcolor{input}{\boldsymbol{x}}}) = \sigma({\textcolor{params}{\boldsymbol{w}}}^\top{\textcolor{input}{\boldsymbol{x}}})\) and, consequently, \(P(\textcolor{output}{y} = 0 \mid {\textcolor{input}{\boldsymbol{x}}}) = 1 - \sigma({\textcolor{params}{\boldsymbol{w}}}^\top{\textcolor{input}{\boldsymbol{x}}})\).
What distribution does this correspond to?
\[ p(\textcolor{output}{y}_n\mid{\textcolor{params}{\boldsymbol{w}}}, {\textcolor{input}{\boldsymbol{x}}}_n) = \text{Bern}(\textcolor{output}y_n\mid\sigma({\textcolor{params}{\boldsymbol{w}}}^\top{\textcolor{input}{\boldsymbol{x}}}_n)) \]
where
\[ p(\textcolor{output}{y}_n\mid{\textcolor{params}{\boldsymbol{w}}}, {\textcolor{input}{\boldsymbol{x}}}_n) = \begin{cases} \sigma({\textcolor{params}{\boldsymbol{w}}}^\top{\textcolor{input}{\boldsymbol{x}}}_n) & \text{if } \textcolor{output}{y}_n = 1 \\ 1 - \sigma({\textcolor{params}{\boldsymbol{w}}}^\top{\textcolor{input}{\boldsymbol{x}}}_n) & \text{if } \textcolor{output}{y}_n = 0 \end{cases} \]
Sometimes, we write this as
\[ p(\textcolor{output}{y}_n\mid{\textcolor{params}{\boldsymbol{w}}}, {\textcolor{input}{\boldsymbol{x}}}_n) = \sigma({\textcolor{params}{\boldsymbol{w}}}^\top{\textcolor{input}{\boldsymbol{x}}}_n)^{\textcolor{output}{y}_n}(1 - \sigma({\textcolor{params}{\boldsymbol{w}}}^\top{\textcolor{input}{\boldsymbol{x}}}_n))^{1 - \textcolor{output}{y}_n} \]
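A minimal NumPy sketch of this log-likelihood (the helper name `log_likelihood` is ours); it uses the identity \(\textcolor{output}{y}\log\sigma(f) + (1-\textcolor{output}{y})\log(1-\sigma(f)) = \textcolor{output}{y} f - \log(1+e^{f})\), with \(f = {\textcolor{params}{\boldsymbol{w}}}^\top{\textcolor{input}{\boldsymbol{x}}}\), for numerical stability:

```python
import numpy as np

def log_likelihood(w, X, y):
    """Bernoulli log-likelihood sum_n log p(y_n | w, x_n), for labels y in {0, 1}.

    np.logaddexp(0, f) computes log(1 + exp(f)) without overflowing.
    """
    f = X @ w                                    # logits, shape (N,)
    return np.sum(y * f - np.logaddexp(0.0, f))
```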
\[ p({\textcolor{params}{\boldsymbol{w}}}) = {\mathcal{N}}({\textcolor{params}{\boldsymbol{w}}}\mid{\boldsymbol{0}}, \sigma_{\textcolor{params}{\boldsymbol{w}}}^2{\boldsymbol{I}}) \]
Previously, we used a Gaussian prior because it makes the math easier.
For logistic regression, no choice of prior makes the math easy: the sigmoid likelihood has no conjugate prior over \({\textcolor{params}{\boldsymbol{w}}}\).
We can use any prior that we want, but the Gaussian prior is a common choice.
\[ p({\textcolor{params}{\boldsymbol{w}}}\mid{\textcolor{output}{\boldsymbol{y}}}, {\textcolor{input}{\boldsymbol{X}}}) = \frac{p({\textcolor{output}{\boldsymbol{y}}}\mid{\textcolor{params}{\boldsymbol{w}}}, {\textcolor{input}{\boldsymbol{X}}})p({\textcolor{params}{\boldsymbol{w}}})}{p({\textcolor{output}{\boldsymbol{y}}}\mid{\textcolor{input}{\boldsymbol{X}}})} \]
Now things get a bit more complicated.
We can’t compute the posterior distribution in closed form, because the normalizing constant (the marginal likelihood) is an intractable integral:
\[ p({\textcolor{output}{\boldsymbol{y}}}\mid{\textcolor{input}{\boldsymbol{X}}}) = \int p({\textcolor{output}{\boldsymbol{y}}}\mid{\textcolor{params}{\boldsymbol{w}}}, {\textcolor{input}{\boldsymbol{X}}})p({\textcolor{params}{\boldsymbol{w}}})\dd{\textcolor{params}{\boldsymbol{w}}} \]
Bayesian logistic regression is the first example where we will use approximate inference.
Maximum a posteriori (MAP) estimation: find the mode of the posterior distribution and use it as the point estimate of the parameters.
Laplace approximation: approximate the posterior distribution with a Gaussian distribution centered at the mode of the posterior.
Variational inference: approximate the posterior distribution with a simpler distribution that is easier to work with.
MCMC methods: sample from the posterior distribution using Markov Chain Monte Carlo methods.
Let’s generate some data to illustrate Bayesian logistic regression.
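For instance, a toy two-class dataset could be generated along these lines (a sketch; the two Gaussian clusters, their means and sizes are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100                                           # points per class (arbitrary)

# Two Gaussian clusters in 2D, one per class.
X0 = rng.normal(loc=[-1.5, -1.5], scale=1.0, size=(N, 2))   # class 0
X1 = rng.normal(loc=[+1.5, +1.5], scale=1.0, size=(N, 2))   # class 1

X = np.vstack([X0, X1])                           # inputs, shape (2N, 2)
y = np.concatenate([np.zeros(N), np.ones(N)])     # labels in {0, 1}

# Append a constant feature so that w also includes a bias term.
X = np.hstack([X, np.ones((2 * N, 1))])
```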
\[ {\textcolor{params}{\boldsymbol{w}}}_{\text{MAP}} = \arg\max_{{\textcolor{params}{\boldsymbol{w}}}} \log p({\textcolor{params}{\boldsymbol{w}}}\mid{\textcolor{output}{\boldsymbol{y}}}, {\textcolor{input}{\boldsymbol{X}}}) = \arg\min_{{\textcolor{params}{\boldsymbol{w}}}} \underbrace{-\log p({\textcolor{output}{\boldsymbol{y}}}\mid{\textcolor{params}{\boldsymbol{w}}}, {\textcolor{input}{\boldsymbol{X}}}) - \log p({\textcolor{params}{\boldsymbol{w}}})}_{{\mathcal{U}}({\textcolor{params}{\boldsymbol{w}}})} \]
We can use numerical optimization methods to find \({\textcolor{params}{\boldsymbol{w}}}_{\text{MAP}}\):
We can use gradient descent to find the mode of the posterior distribution.
We need to compute the gradient \(\grad_{{\textcolor{params}{\boldsymbol{w}}}} {\mathcal{U}}({\textcolor{params}{\boldsymbol{w}}})\) and iterate until convergence.
\[ {\textcolor{params}{\boldsymbol{w}}}^{(t+1)} \gets {\textcolor{params}{\boldsymbol{w}}}^{(t)} - \eta \grad_{{\textcolor{params}{\boldsymbol{w}}}} {\mathcal{U}}({\textcolor{params}{\boldsymbol{w}}}^{(t)}) \]
where \(\eta\) is the learning rate.
Note
In practice, we use automatic differentiation to compute the gradient.
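A minimal sketch of MAP estimation by gradient descent, reusing `sigmoid`, `X` and `y` from the snippets above. Here the gradient of \({\mathcal{U}}({\textcolor{params}{\boldsymbol{w}}})\) is written by hand (for logistic regression with a Gaussian prior it is \({\textcolor{input}{\boldsymbol{X}}}^\top(\sigma({\textcolor{input}{\boldsymbol{X}}}{\textcolor{params}{\boldsymbol{w}}}) - {\textcolor{output}{\boldsymbol{y}}}) + {\textcolor{params}{\boldsymbol{w}}}/\sigma_{\textcolor{params}{\boldsymbol{w}}}^2\)); the step size, prior variance and number of iterations are arbitrary choices:

```python
def neg_log_posterior_grad(w, X, y, sigma_w2=10.0):
    """Gradient of U(w) = -log p(y | w, X) - log p(w), with p(w) = N(0, sigma_w2 * I)."""
    return X.T @ (sigmoid(X @ w) - y) + w / sigma_w2

def map_estimate(X, y, eta=1e-3, n_iters=5_000, sigma_w2=10.0):
    """Plain gradient descent on U(w); returns an approximation of w_MAP."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        w = w - eta * neg_log_posterior_grad(w, X, y, sigma_w2)  # minus sign: we minimize U
    return w

w_map = map_estimate(X, y)
```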
\[ {\textcolor{params}{\boldsymbol{w}}}_{\text{MAP}} = \arg\max_{{\textcolor{params}{\boldsymbol{w}}}} \log p({\textcolor{params}{\boldsymbol{w}}}\mid{\textcolor{output}{\boldsymbol{y}}}, {\textcolor{input}{\boldsymbol{X}}}) \]
Once we have found \({\textcolor{params}{\boldsymbol{w}}}_{\text{MAP}}\), we can classify new examples.
Decision boundary: the set of inputs where the predicted probability of each class equals 0.5, i.e. where \({\textcolor{params}{\boldsymbol{w}}}_{\text{MAP}}^\top{\textcolor{input}{\boldsymbol{x}}} = 0\).
Recall the Laplace approximation: we approximate the posterior with a Gaussian centered at the mode,
\[ q({\textcolor{params}{\boldsymbol{w}}}) = {\mathcal{N}}({\textcolor{params}{\boldsymbol{w}}}\mid{\textcolor{params}{\boldsymbol{w}}}_{\text{MAP}}, {\boldsymbol{H}}^{-1}), \qquad {\boldsymbol{H}}= -\grad^2 \log p({\textcolor{params}{\boldsymbol{w}}}\mid{\textcolor{output}{\boldsymbol{y}}}, {\textcolor{input}{\boldsymbol{X}}}) \big|_{{\textcolor{params}{\boldsymbol{w}}}= {\textcolor{params}{\boldsymbol{w}}}_{\text{MAP}}} \]
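A sketch of the resulting Gaussian approximation, reusing `sigmoid`, `X`, `y` and `w_map` from above. For logistic regression the Hessian has the closed form \({\boldsymbol{H}} = {\textcolor{input}{\boldsymbol{X}}}^\top {\boldsymbol{S}} {\textcolor{input}{\boldsymbol{X}}} + {\boldsymbol{I}}/\sigma_{\textcolor{params}{\boldsymbol{w}}}^2\) with \({\boldsymbol{S}} = \operatorname{diag}(\sigma_n(1-\sigma_n))\):

```python
def laplace_approximation(w_map, X, y, sigma_w2=10.0):
    """Gaussian approximation q(w) = N(w | w_map, H^{-1}) of the posterior."""
    s = sigmoid(X @ w_map)                          # sigma_n evaluated at the MAP
    S = np.diag(s * (1.0 - s))
    H = X.T @ S @ X + np.eye(X.shape[1]) / sigma_w2
    return w_map, np.linalg.inv(H)                  # mean and covariance of q(w)

mu, Sigma = laplace_approximation(w_map, X, y)
```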
We can make predictions by integrating over the posterior distribution. \[ p(\textcolor{output}{y}_\star \mid {\textcolor{input}{\boldsymbol{x}}}_\star, {\textcolor{output}{\boldsymbol{y}}}, {\textcolor{input}{\boldsymbol{X}}}) = \int p(\textcolor{output}{y}_\star \mid {\textcolor{params}{\boldsymbol{w}}}, {\textcolor{input}{\boldsymbol{x}}}_\star)q({\textcolor{params}{\boldsymbol{w}}})\dd{\textcolor{params}{\boldsymbol{w}}} \]
Even though we have a Gaussian approximation of the posterior, this integral is intractable.
Two common approximations:
Monte Carlo integration: sample from the posterior distribution and average the predictions. \[ p(\textcolor{output}{y}_\star \mid {\textcolor{input}{\boldsymbol{x}}}_\star, {\textcolor{output}{\boldsymbol{y}}}, {\textcolor{input}{\boldsymbol{X}}}) \approx \frac{1}{S}\sum_{s=1}^S p(\textcolor{output}{y}_\star \mid {\textcolor{params}{\boldsymbol{w}}}^{(s)}, {\textcolor{input}{\boldsymbol{x}}}_\star) \quad \text{where } {\textcolor{params}{\boldsymbol{w}}}^{(s)} \sim q({\textcolor{params}{\boldsymbol{w}}}) \]
Probit approximation: use deterministic approximation of the sigmoid function.
Logistic function \(\sigma(t)\) is approximated by the cumulative distribution function of a Gaussian \(\Phi(\lambda t)\), where \(\lambda = \sqrt{\pi/8}\).
We can compute the distribution of \(\textcolor{latent}{f}_\star = f({\textcolor{params}{\boldsymbol{w}}}, {\textcolor{input}{\boldsymbol{x}}}_\star)\) where \(f({\textcolor{params}{\boldsymbol{w}}}, {\textcolor{input}{\boldsymbol{x}}}) = {\textcolor{params}{\boldsymbol{w}}}^\top{\textcolor{input}{\boldsymbol{x}}}\) before applying the sigmoid function.
The distribution of \(\textcolor{latent}{f}_\star\) is a Gaussian distribution. \[ q(\textcolor{latent}{f}_\star \mid {\textcolor{input}{\boldsymbol{x}}}_\star, {\textcolor{output}{\boldsymbol{y}}}, {\textcolor{input}{\boldsymbol{X}}}) = \int p(\textcolor{latent}{f}_\star \mid {\textcolor{params}{\boldsymbol{w}}}, {\textcolor{input}{\boldsymbol{x}}}_\star){\mathcal{N}}({\textcolor{params}{\boldsymbol{w}}}\mid{\boldsymbol{\mu}}, {\boldsymbol{\Sigma}})\dd{\textcolor{params}{\boldsymbol{w}}}= {\mathcal{N}}(\textcolor{latent}{f}_\star\mid{\boldsymbol{\mu}}^\top{\textcolor{input}{\boldsymbol{x}}}_\star, {\textcolor{input}{\boldsymbol{x}}}_\star^\top{\boldsymbol{\Sigma}}{\textcolor{input}{\boldsymbol{x}}}_\star) \]
Apply the probit approximation \[ p(\textcolor{output}{y}_\star \mid {\textcolor{input}{\boldsymbol{x}}}_\star, {\textcolor{output}{\boldsymbol{y}}}, {\textcolor{input}{\boldsymbol{X}}}) \approx \int \Phi(\textcolor{latent}{f}_\star)q(\textcolor{latent}{f}_\star \mid {\textcolor{input}{\boldsymbol{x}}}_\star, {\textcolor{output}{\boldsymbol{y}}}, {\textcolor{input}{\boldsymbol{X}}})\dd{\textcolor{latent}{f}_\star} = \sigma\left(\frac{{\boldsymbol{\mu}}^\top{\textcolor{input}{\boldsymbol{x}}}_\star}{\sqrt{1 + \frac \pi 8 {\textcolor{input}{\boldsymbol{x}}}_\star^\top{\boldsymbol{\Sigma}}{\textcolor{input}{\boldsymbol{x}}}_\star}}\right) \]
By marginalizing over the posterior distribution, we don’t consider a single decision boundary.
We consider all possible decision boundaries weighted by the uncertainty in the parameters.
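Both predictive approximations above are easy to sketch on top of the Laplace approximation, reusing `sigmoid`, `mu` and `Sigma` from the earlier snippets (the helper names, the test point and the number of samples `S` are arbitrary):

```python
def predict_mc(x_star, mu, Sigma, S=1000, seed=1):
    """Monte Carlo estimate of p(y* = 1 | x*) under q(w) = N(mu, Sigma)."""
    rng = np.random.default_rng(seed)
    W = rng.multivariate_normal(mu, Sigma, size=S)   # (S, D) samples from q(w)
    return np.mean(sigmoid(W @ x_star))

def predict_probit(x_star, mu, Sigma):
    """Probit (deterministic) approximation of p(y* = 1 | x*)."""
    m = mu @ x_star                                  # mean of f* = w^T x*
    v = x_star @ Sigma @ x_star                      # variance of f*
    return sigmoid(m / np.sqrt(1.0 + np.pi / 8.0 * v))

x_star = np.array([0.5, -0.2, 1.0])                  # a test point (last entry = bias feature)
print(predict_mc(x_star, mu, Sigma), predict_probit(x_star, mu, Sigma))
```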
Recall variational inference:
\[ {\textcolor{vparams}{\boldsymbol{\nu}}}^* = \arg\min_{{\textcolor{vparams}{\boldsymbol{\nu}}}} \text{KL}\left(q({\textcolor{params}{\boldsymbol{w}}}; {\textcolor{vparams}{\boldsymbol{\nu}}}) \parallel p({\textcolor{params}{\boldsymbol{w}}}\mid{\textcolor{output}{\boldsymbol{y}}}, {\textcolor{input}{\boldsymbol{X}}})\right) \]
Remember: this KL is intractable, because it involves the unknown posterior! Instead, we maximize the evidence lower bound (ELBO):
\[ {\mathcal{L}}_{\text{ELBO}}({\textcolor{vparams}{\boldsymbol{\nu}}}) = {\mathbb{E}}_{q({\textcolor{params}{\boldsymbol{w}}}; {\textcolor{vparams}{\boldsymbol{\nu}}})}[\log p({\textcolor{output}{\boldsymbol{y}}}\mid{\textcolor{params}{\boldsymbol{w}}}, {\textcolor{input}{\boldsymbol{X}}})] - \text{KL}\left(q({\textcolor{params}{\boldsymbol{w}}}; {\textcolor{vparams}{\boldsymbol{\nu}}}) \parallel p({\textcolor{params}{\boldsymbol{w}}})\right) \]
Mean-field assumption: \(q({\textcolor{params}{\boldsymbol{w}}}; {\textcolor{vparams}{\boldsymbol{\nu}}}) = \prod_{i=1}^D {\mathcal{N}}(w_i\mid\textcolor{vparams}{\mu_i}, \textcolor{vparams}{\sigma_i^2}) = {\mathcal{N}}({\textcolor{params}{\boldsymbol{w}}}\mid\textcolor{vparams}{\boldsymbol{\mu}}, \operatorname{diag}(\textcolor{vparams}{\boldsymbol{\sigma}}^2))\), i.e. a fully factorized Gaussian.
Full-covariance alternative: \(q({\textcolor{params}{\boldsymbol{w}}}; {\textcolor{vparams}{\boldsymbol{\nu}}}) = {\mathcal{N}}({\textcolor{params}{\boldsymbol{w}}}\mid\textcolor{vparams}{\boldsymbol{\mu}}, \textcolor{vparams}{\boldsymbol{L}}^\top\textcolor{vparams}{\boldsymbol{L}})\), where \(\textcolor{vparams}{\boldsymbol{L}}\) is a Cholesky-like factor parameterizing the covariance.
\(p(\textcolor{output}{y}_\star \mid {\textcolor{input}{\boldsymbol{x}}}_\star, {\textcolor{output}{\boldsymbol{y}}}, {\textcolor{input}{\boldsymbol{X}}}) \approx \int p(\textcolor{output}{y}_\star \mid {\textcolor{params}{\boldsymbol{w}}}, {\textcolor{input}{\boldsymbol{x}}}_\star)q({\textcolor{params}{\boldsymbol{w}}}; {\textcolor{vparams}{\boldsymbol{\nu}}})\dd{\textcolor{params}{\boldsymbol{w}}}\)
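A sketch of a Monte Carlo estimate of the ELBO for the mean-field Gaussian, reusing `log_likelihood` from above. The expected log-likelihood is estimated with the reparameterization \({\textcolor{params}{\boldsymbol{w}}} = \textcolor{vparams}{\boldsymbol{\mu}} + \textcolor{vparams}{\boldsymbol{\sigma}} \odot \boldsymbol{\epsilon}\); in practice one would maximize this objective with automatic differentiation (e.g. in PyTorch or JAX) rather than by hand:

```python
def elbo_estimate(mu_q, log_sigma_q, X, y, sigma_w2=10.0, n_samples=8, seed=0):
    """Monte Carlo estimate of the ELBO for q(w) = N(mu_q, diag(sigma_q^2))."""
    rng = np.random.default_rng(seed)
    sigma_q = np.exp(log_sigma_q)

    # Monte Carlo estimate of E_q[log p(y | w, X)] via the reparameterization trick.
    exp_ll = 0.0
    for _ in range(n_samples):
        eps = rng.standard_normal(mu_q.shape)
        w = mu_q + sigma_q * eps
        exp_ll += log_likelihood(w, X, y) / n_samples

    # KL( N(mu_q, diag(sigma_q^2)) || N(0, sigma_w2 * I) ) in closed form.
    kl = 0.5 * np.sum((sigma_q**2 + mu_q**2) / sigma_w2 - 1.0
                      - np.log(sigma_q**2 / sigma_w2))
    return exp_ll - kl
```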
Recall MCMC methods:
Even though we can’t compute the posterior distribution, we can sample from it.
Markov Chain Monte Carlo (MCMC) methods are a class of algorithms that allow us to sample from complex distributions.
MCMC methods are based on the idea of proposing new samples and accepting them with a certain probability, as in the Metropolis–Hastings algorithm sketched below.
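For example, a random-walk Metropolis sampler only needs the unnormalized log-posterior. This sketch reuses `log_likelihood`, `X` and `y` from above; the proposal scale and chain length are arbitrary, and burn-in/thinning are omitted:

```python
def log_joint(w, X, y, sigma_w2=10.0):
    """Unnormalized log-posterior: log p(y | w, X) + log p(w), up to a constant."""
    return log_likelihood(w, X, y) - 0.5 * np.sum(w**2) / sigma_w2

def metropolis(X, y, n_samples=5000, step=0.1, seed=2):
    """Random-walk Metropolis sampler targeting p(w | y, X)."""
    rng = np.random.default_rng(seed)
    D = X.shape[1]
    w = np.zeros(D)
    samples = np.empty((n_samples, D))
    for s in range(n_samples):
        w_prop = w + step * rng.standard_normal(D)     # symmetric Gaussian proposal
        log_alpha = log_joint(w_prop, X, y) - log_joint(w, X, y)
        if np.log(rng.uniform()) < log_alpha:          # accept with probability min(1, alpha)
            w = w_prop
        samples[s] = w
    return samples

posterior_samples = metropolis(X, y)
```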
No free lunch theorem: no method is the best for all problems.
MAP estimation: fast, but doesn’t provide uncertainty estimates.
Laplace approximation: needs to compute the MAP estimate and the Hessian inverse (can be very expensive for large models).
Variational inference: flexible, scalable to large models and datasets, but rough approximation (unless you use more complex distributions).
MCMC: most accurate, but computationally expensive.
One advantage of Bayesian methods is that they provide uncertainty estimates.
We have seen how to average all the decision boundaries for \(P(\textcolor{output}{y}_\star = 1 \mid {\textcolor{input}{\boldsymbol{x}}}_\star, {\textcolor{output}{\boldsymbol{y}}}, {\textcolor{input}{\boldsymbol{X}}})\), by weighting them by the posterior distribution.
But we can also compute the variance of the predictions, which gives us an idea of the uncertainty on the class probabilities, e.g. \(P(\textcolor{output}{y}_\star = 1 \mid {\textcolor{input}{\boldsymbol{x}}}_\star, {\textcolor{output}{\boldsymbol{y}}}, {\textcolor{input}{\boldsymbol{X}}}) = 0.7 \pm 0.1\).
Accuracy: proportion of correctly classified examples.
Error rate (or 0/1 loss): proportion of misclassified examples.
\[ \ell(\widehat{{\textcolor{output}{\boldsymbol{y}}}}, {\textcolor{output}{\boldsymbol{y}}}) = \frac{1}{N}\sum_{i=1}^N \mathbb{I}(\widehat{\textcolor{output}{y}}_i \neq \textcolor{output}{y}_i) \]
where \(\mathbb{I}(\cdot)\) is the indicator function (1 if the condition is true, 0 otherwise).
and the hard prediction for each example is
\[\widehat{\textcolor{output}{y}}_i = \arg\max_{k} p(\textcolor{output}{y}_i = k \mid {\textcolor{input}{\boldsymbol{x}}}_i).\]
Accuracy has some advantages: it is simple to compute and easy to interpret.
But it has some limitations: it can be very misleading when the classes are imbalanced, as in the following example.
Example
We are building a classifier to detect fraudulent transactions; only 1% of the transactions are fraudulent.
Class labels: \(\textcolor{output}{y} = 0\) (not fraudulent); \(\textcolor{output}{y} = 1\) (fraudulent).
Build a classifier that always predicts \(\textcolor{output}{y} = 0\).
What is the accuracy of the classifier? 99%, since only 1% of the transactions are fraudulent. Is it a good classifier? No: it never detects a single fraudulent transaction.
Need to define 4 quantities: true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN).
\[ \text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}} = 1 - \frac{\text{FP}}{\text{TP} + \text{FP}} \]
\[ \text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}} = 1 - \frac{\text{FN}}{\text{TP} + \text{FN}} \]
Single metric that combines precision and recall: F1 score.
\[ \text{F1 score} = 2\frac{\text{Precision}\times\text{Recall}}{\text{Precision} + \text{Recall}} \]
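A compact sketch of these metrics for binary labels in \(\{0, 1\}\) (the function name is ours):

```python
import numpy as np

def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 score for binary labels in {0, 1}."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0
    return accuracy, precision, recall, f1
```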
In our example of fraudulent transactions: the always-negative classifier has TP = 0, so its recall (and hence its F1 score) is 0, even though its accuracy is 99%.
So far, we have evaluated the performance of the classifier based on the class predictions, not on the predicted probabilities.
Predictive log-likelihood: test the ability of the model to correctly predict the class probabilities.
Recall the predictive distribution \(p(\textcolor{output}{y}_\star \mid {\textcolor{input}{\boldsymbol{x}}}_\star, {\textcolor{output}{\boldsymbol{y}}}, {\textcolor{input}{\boldsymbol{X}}})\) after marginalizing over the parameters with the posterior distribution (or any approximation).
\[ p(\textcolor{output}{y}_\star \mid {\textcolor{input}{\boldsymbol{x}}}_\star, {\textcolor{output}{\boldsymbol{y}}}, {\textcolor{input}{\boldsymbol{X}}}) = \int p(\textcolor{output}{y}_\star \mid {\textcolor{params}{\boldsymbol{w}}}, {\textcolor{input}{\boldsymbol{x}}}_\star)p({\textcolor{params}{\boldsymbol{w}}}\mid{\textcolor{output}{\boldsymbol{y}}}, {\textcolor{input}{\boldsymbol{X}}})\dd{\textcolor{params}{\boldsymbol{w}}} \]
If we have samples from the posterior distribution:
\[ \begin{aligned} \log p(\textcolor{output}{y}_\star \mid {\textcolor{input}{\boldsymbol{x}}}_\star, {\textcolor{output}{\boldsymbol{y}}}, {\textcolor{input}{\boldsymbol{X}}}) &\approx \log\frac{1}{S}\sum_{s=1}^S p(\textcolor{output}{y}_\star \mid {\textcolor{params}{\boldsymbol{w}}}^{(s)}, {\textcolor{input}{\boldsymbol{x}}}_\star)\\ &= \log\sum_{s=1}^S p(\textcolor{output}{y}_\star \mid {\textcolor{params}{\boldsymbol{w}}}^{(s)}, {\textcolor{input}{\boldsymbol{x}}}_\star) - \log S \end{aligned} \]
where \({\textcolor{params}{\boldsymbol{w}}}^{(s)} \sim p({\textcolor{params}{\boldsymbol{w}}}\mid{\textcolor{output}{\boldsymbol{y}}}, {\textcolor{input}{\boldsymbol{X}}})\).
Problem: we cannot swap the log and the sum, and we can run into numerical issues when computing the likelihood for each sample \(p(\textcolor{output}{y}_\star \mid {\textcolor{params}{\boldsymbol{w}}}^{(s)}, {\textcolor{input}{\boldsymbol{x}}}_\star)\).
We can use the log-sum-exp trick to avoid numerical issues. \[ \begin{aligned} \log\sum_{s=1}^S p(\textcolor{output}{y}_\star \mid {\textcolor{params}{\boldsymbol{w}}}^{(s)}, {\textcolor{input}{\boldsymbol{x}}}_\star) &= \log\sum_{s=1}^S \exp\left(\log p(\textcolor{output}{y}_\star \mid {\textcolor{params}{\boldsymbol{w}}}^{(s)}, {\textcolor{input}{\boldsymbol{x}}}_\star)\right)\\ &= \text{logsumexp}\left(\log p(\textcolor{output}{y}_\star \mid {\textcolor{params}{\boldsymbol{w}}}^{(1)}, {\textcolor{input}{\boldsymbol{x}}}_\star), \ldots, \log p(\textcolor{output}{y}_\star \mid {\textcolor{params}{\boldsymbol{w}}}^{(S)}, {\textcolor{input}{\boldsymbol{x}}}_\star)\right) \end{aligned} \]
Many libraries (e.g. SciPy, PyTorch, JAX) provide a numerically stable implementation of the log-sum-exp function.
The trick is based on the following identity: \[ \log(\exp(a) + \exp(b)) = a + \log(1 + \exp(b - a)) \]
with \(a \ge b\).
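Putting this together, here is a sketch of the Monte Carlo log predictive probability using SciPy's `logsumexp`, reusing `posterior_samples` and `x_star` from the earlier snippets (the helper names are ours):

```python
import numpy as np
from scipy.special import logsumexp

def log_bernoulli(y_star, f_star):
    """log p(y* | f*) for a Bernoulli observation with logit f* = w^T x*."""
    return y_star * f_star - np.logaddexp(0.0, f_star)

def log_predictive(x_star, y_star, posterior_samples):
    """Monte Carlo estimate of log p(y* | x*, y, X) from S posterior samples of w."""
    f = posterior_samples @ x_star                 # logits f*^(s), shape (S,)
    log_p = log_bernoulli(y_star, f)               # log p(y* | w^(s), x*), shape (S,)
    return logsumexp(log_p) - np.log(len(log_p))   # log( (1/S) * sum_s exp(log_p_s) )

print(log_predictive(x_star, 1, posterior_samples))
```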
Calibration: the predicted probabilities should match the true probabilities.
Tools to evaluate calibration: reliability diagrams and the expected calibration error (ECE).
To compute calibration metrics, we need to group the examples based on the predicted probabilities.
Define, for each bin \(\mathcal B_b\): the confidence \(\text{conf}(\mathcal B_b)\), the average predicted probability of the examples in the bin, and the accuracy \(\text{acc}(\mathcal B_b)\), the proportion of correctly classified examples in the bin.
Now: plot the accuracy as a function of the confidence (a reliability diagram); a perfectly calibrated classifier lies on the diagonal.
Expected calibration error (ECE) measures “how much” the predicted probabilities deviate from the true probabilities.
\[ \text{ECE} = \sum_{b=1}^B \frac{|\mathcal B_b|}{N}\left|\text{acc}(\mathcal B_b) - \text{conf}(\mathcal B_b)\right| \]
where \(N\) is the total number of examples and \(B\) the number of bins.
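A sketch of the ECE computation with equal-width bins (the helper name and the number of bins are our own choices; for binary classification the confidence is taken as the probability of the predicted class, \(\max(p, 1-p)\)):

```python
import numpy as np

def expected_calibration_error(probs, y_true, n_bins=10):
    """ECE with equal-width bins for binary classification.

    probs : predicted probabilities p(y = 1 | x).
    """
    y_pred = (probs >= 0.5).astype(int)
    confidence = np.maximum(probs, 1.0 - probs)      # probability of the predicted class
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(confidence, edges[1:-1])   # bin index in {0, ..., n_bins - 1}
    ece = 0.0
    for b in range(n_bins):
        in_bin = bin_ids == b
        if not np.any(in_bin):
            continue
        acc = np.mean(y_pred[in_bin] == y_true[in_bin])   # acc(B_b)
        conf = np.mean(confidence[in_bin])                # conf(B_b)
        ece += np.mean(in_bin) * abs(acc - conf)          # weighted by |B_b| / N
    return ece
```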

Note: temperature scaling (rescaling the logits by a learned scalar \(T > 0\) before applying the sigmoid/softmax) does not change the predicted classes, and hence does not affect metrics such as accuracy or the F1 score; it only affects the calibration of the predicted probabilities.


Simone Rossi - Advanced Statistical Inference - EURECOM