Bayesian linear regression

Advanced Statistical Inference

Simone Rossi

EURECOM

\[ \require{physics} \definecolor{input}{rgb}{0.42, 0.55, 0.74} \definecolor{params}{rgb}{0.51,0.70,0.40} \definecolor{output}{rgb}{0.843, 0.608, 0} \definecolor{vparams}{rgb}{0.58, 0, 0.83} \definecolor{noise}{rgb}{0.0, 0.48, 0.65} \definecolor{latent}{rgb}{0.8, 0.0, 0.8} \]

Linear regression

Objectives for today

  1. Review of linear regression
  2. Understand the probabilistic interpretation of (regularized) loss minimization

Break

  1. Introduction of Bayesian linear regression
  2. Compute the posterior distribution and make predictions
  3. Model selection and other properties of Bayesian linear regression

Break

  1. Class exercise on Bayesian inference for coin toss

A quick recap on probability

Consider two continuous random variables \(x\) and \(y\)

  • Sum rule (marginalization):

\[ p(x) = \int p(x, y) \dd y \]

  • Product rule (conditioning):

\[ p(x, y) = p(x \mid y) p(y) = p(y \mid x) p(x) \]

  • Bayes’ rule:

\[ p(x \mid y) = \frac{p(y \mid x) p(x)}{p(y)} \]

Consider a random vector \({\boldsymbol{z}}\) with \(D\) components (\({\boldsymbol{z}}\in{\mathbb{R}}^D\))

  • Chain rule:

\[ p({\boldsymbol{z}}) = p(z_1, z_2, \ldots, z_D) = p(z_1) p(z_2 \mid z_1) p(z_3 \mid z_1, z_2) \cdots p(z_D \mid z_1, z_2, \ldots, z_{D-1}) \]

If the \(z_d\) are independent, then each conditional reduces to its marginal

\[ p(z_d \mid z_1, \ldots, z_{d-1}) = p(z_d) \] and the chain rule becomes

\[ p({\boldsymbol{z}}) = p(z_1) p(z_2) \cdots p(z_D) = \prod_{d=1}^D p(z_d) \]

Introduction

Objective: Learn functions from data (and quantify uncertainty)

Figure 1 (panels a–f)

Definitions

  • Input, features, covariates: \({\textcolor{input}{\boldsymbol{x}}}\in {\mathbb{R}}^D\)

\[ {\textcolor{input}{\boldsymbol{X}}}= \begin{bmatrix} {\textcolor{input}{\boldsymbol{x}}}_1 \\ \vdots \\ {\textcolor{input}{\boldsymbol{x}}}_N \end{bmatrix} \in {\mathbb{R}}^{N \times D} \quad \text{or with a bias term} \quad {\textcolor{input}{\boldsymbol{X}}}= \begin{bmatrix} {\boldsymbol{1}}& {\textcolor{input}{\boldsymbol{x}}}_1 \\ \vdots & \vdots \\ {\boldsymbol{1}}& {\textcolor{input}{\boldsymbol{x}}}_N \end{bmatrix} \in {\mathbb{R}}^{N \times (D+1)} \]

  • Output, target, response: \(y \in {\mathbb{R}}\)

\[ {\textcolor{output}{\boldsymbol{y}}}= \begin{bmatrix} \textcolor{output}{y}_1 \\ \vdots \\ \textcolor{output}{y}_N \end{bmatrix} \in {\mathbb{R}}^N \]

  • Dataset: \({\mathcal{D}}= \{{\textcolor{input}{\boldsymbol{X}}}, {\textcolor{output}{\boldsymbol{y}}}\}\)

Regression

  • Objective: Learn a function \(f: {\mathbb{R}}^D \to {\mathbb{R}}\)

Linear models implement a linear combination of (basis) functions

\[ f({\textcolor{input}{\boldsymbol{x}}}) = \sum_{d=1}^D \textcolor{params}{w}_d \varphi_d({\textcolor{input}{\boldsymbol{x}}}) = {\textcolor{params}{\boldsymbol{w}}}^\top {\boldsymbol{\varphi}}({\textcolor{input}{\boldsymbol{x}}}) \]

  • Parameters: \({\textcolor{params}{\boldsymbol{w}}}= [\textcolor{params}{w}_1, \ldots, \textcolor{params}{w}_D]^\top\)

  • Basis functions: \({\boldsymbol{\varphi}}({\textcolor{input}{\boldsymbol{x}}}) = [\varphi_1({\textcolor{input}{\boldsymbol{x}}}), \ldots, \varphi_D({\textcolor{input}{\boldsymbol{x}}})]^\top\)

Important

Any model that can be written as a linear combination of its parameters (even with non-linear basis functions of the input) is a linear model

Linear regression

For simplicity, let’s consider linear functions

\[ f({\textcolor{params}{\boldsymbol{w}}}, {\textcolor{input}{\boldsymbol{x}}}) = \sum_{d=1}^D \textcolor{params}{w}_d \textcolor{input}{x}_d = {\textcolor{params}{\boldsymbol{w}}}^\top {\textcolor{input}{\boldsymbol{x}}} \]

  • Objective: Find \({\textcolor{params}{\boldsymbol{w}}}\) that minimizes the error

\[ {\mathcal{L}}({\textcolor{params}{\boldsymbol{w}}}) = \frac{1}{2} \sum_{n=1}^N (\textcolor{output}{y}_n - f({\textcolor{params}{\boldsymbol{w}}}, {\textcolor{input}{\boldsymbol{x}}}_n))^2 = \frac{1}{2} \norm{{\textcolor{output}{\boldsymbol{y}}}- {\textcolor{input}{\boldsymbol{X}}}{\textcolor{params}{\boldsymbol{w}}}}_2^2 \]

Least squares solution

\[ {\mathcal{L}}({\textcolor{params}{\boldsymbol{w}}}) = \frac{1}{2} \norm{{\textcolor{output}{\boldsymbol{y}}}- {\textcolor{input}{\boldsymbol{X}}}{\textcolor{params}{\boldsymbol{w}}}}_2^2 \]

  • Setting the gradient to zero: \(\nabla_{{\textcolor{params}{\boldsymbol{w}}}} {\mathcal{L}}({\textcolor{params}{\boldsymbol{w}}}) = -{\textcolor{input}{\boldsymbol{X}}}^\top ({\textcolor{output}{\boldsymbol{y}}}- {\textcolor{input}{\boldsymbol{X}}}{\textcolor{params}{\boldsymbol{w}}}) = 0\)

  • Solution: \({\textcolor{params}{\boldsymbol{w}}}^* = ({\textcolor{input}{\boldsymbol{X}}}^\top {\textcolor{input}{\boldsymbol{X}}})^{-1} {\textcolor{input}{\boldsymbol{X}}}^\top {\textcolor{output}{\boldsymbol{y}}}\)

Exercise

Implement the least squares solution for linear regression using the Cholesky decomposition and back-substitution (see the linear algebra revision)

Example

Figure 2: An example of generated dataset

Use the solution to fit a linear model to the data \({\textcolor{params}{\boldsymbol{w}}}^* = ({\textcolor{input}{\boldsymbol{X}}}^\top {\textcolor{input}{\boldsymbol{X}}})^{-1} {\textcolor{input}{\boldsymbol{X}}}^\top {\textcolor{output}{\boldsymbol{y}}}\)

Figure 3: Linear regression fit to the data

Implementation tips

Remember the Cholesky decomposition and the forward and backward substitution

\[ {\textcolor{params}{\boldsymbol{w}}}= {\underbrace{({\textcolor{input}{\boldsymbol{X}}}^\top {\textcolor{input}{\boldsymbol{X}}})}_{{\boldsymbol{A}}}}^{-1} \overbrace{{\textcolor{input}{\boldsymbol{X}}}^\top {\textcolor{output}{\boldsymbol{y}}}}^{{\boldsymbol{c}}} \]

Solution of a linear system \({\boldsymbol{A}}{\textcolor{params}{\boldsymbol{w}}}= {\boldsymbol{c}}\)

  1. Compute the Cholesky decomposition \({\boldsymbol{A}}= {\boldsymbol{L}}{\boldsymbol{L}}^\top\)

  2. Solve \({\boldsymbol{L}}{\boldsymbol{z}}= {\boldsymbol{c}}\) for \({\boldsymbol{z}}\) using forward substitution

  3. Solve \({\boldsymbol{L}}^\top {\textcolor{params}{\boldsymbol{w}}}= {\boldsymbol{z}}\) for \({\textcolor{params}{\boldsymbol{w}}}\) using backward substitution

Why?

  • \({\boldsymbol{A}}\) is symmetric and positive (semi)definite
  • Numerically stable
  • Complexity \(\mathcal O(D^3)\) for Cholesky decomposition and \(\mathcal O(D^2)\) for forward and backward substitution
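
A minimal sketch of these three steps in Python with NumPy/SciPy (assumed here as the working language; the function name and the synthetic test data are illustrative):

```python
import numpy as np
from scipy.linalg import solve_triangular

def least_squares_cholesky(X, y):
    """Solve (X^T X) w = X^T y with a Cholesky factorization."""
    A = X.T @ X                  # symmetric positive (semi)definite
    c = X.T @ y
    L = np.linalg.cholesky(A)    # A = L L^T, O(D^3)
    z = solve_triangular(L, c, lower=True)      # forward substitution, O(D^2)
    w = solve_triangular(L.T, z, lower=False)   # backward substitution, O(D^2)
    return w

# Example on synthetic data (values are illustrative)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)
print(least_squares_cholesky(X, y))   # close to [1.0, -2.0, 0.5]
```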

Probabilistic interpretation of loss minimization

Consider a simple transformation of the loss function

\[ \ell_i = \frac 1 2 (\textcolor{output}{y}_i - {\textcolor{params}{\boldsymbol{w}}}^\top {\textcolor{input}{\boldsymbol{x}}}_i)^2 \]

Figure 4

\[ \exp(-\gamma \ell_i) = \exp\left(-\frac \gamma 2 (\textcolor{output}{y}_i - {\textcolor{params}{\boldsymbol{w}}}^\top {\textcolor{input}{\boldsymbol{x}}}_i)^2\right) \]

Figure 5

Minimizing the loss is equivalent to maximizing the likelihood of the data under a Gaussian noise model

\[ \exp(-\gamma {\mathcal{L}}) \propto {\mathcal{N}}\left({\textcolor{output}{\boldsymbol{y}}}\mid {\textcolor{input}{\boldsymbol{X}}}{\textcolor{params}{\boldsymbol{w}}}, \gamma^{-1} {\boldsymbol{I}}\right) \]

Assumption: Data is generated by a linear model with Gaussian noise \(\epsilon \sim {\mathcal{N}}(0, \sigma^2)\) independent across samples (with \(\sigma^2=\gamma^{-1}\)).

Maximum likelihood estimation is solving

\[ \arg\max_{{\textcolor{params}{\boldsymbol{w}}}} \prod_{i=1}^N \underbrace{{\mathcal{N}}\left(\textcolor{output}{y}_i \mid {\textcolor{params}{\boldsymbol{w}}}^\top {\textcolor{input}{\boldsymbol{x}}}_i, \sigma^2 \right)}_{p(\textcolor{output}{y}_i \mid {\textcolor{params}{\boldsymbol{w}}}, {\textcolor{input}{\boldsymbol{x}}}_i)} \]

\(p({\textcolor{output}{\boldsymbol{y}}}\mid {\textcolor{params}{\boldsymbol{w}}}, {\textcolor{input}{\boldsymbol{X}}}) = \prod_{i=1}^N p(\textcolor{output}{y}_i \mid {\textcolor{params}{\boldsymbol{w}}}, {\textcolor{input}{\boldsymbol{x}}}_i)\) is the likelihood of the data given the model

Tip

We never maximize the likelihood directly, but the log-likelihood

\[ \arg\max_{{\textcolor{params}{\boldsymbol{w}}}} \sum_{i=1}^N \log {\mathcal{N}}\left(\textcolor{output}{y}_i \mid {\textcolor{params}{\boldsymbol{w}}}^\top {\textcolor{input}{\boldsymbol{x}}}_i, \sigma^2 \right) \]

Because the log is strictly increasing, the maximizer is unchanged, and working with sums of log-densities is numerically more stable than products of densities
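
A quick numerical illustration of why the log-likelihood is preferred (a Python sketch with synthetic standard-normal data; all names and values are illustrative):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
y = rng.normal(size=2000)   # synthetic observations

likelihood = np.prod(norm.pdf(y, loc=0.0, scale=1.0))        # product underflows to 0.0
log_likelihood = np.sum(norm.logpdf(y, loc=0.0, scale=1.0))  # sum of logs stays finite
print(likelihood, log_likelihood)
```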

Likelihood is not a probability

  • The likelihood function is not a probability distribution.
  • The likelihood can take any non-negative value, not just values between 0 and 1.
  • It represents the density of the data (\({\textcolor{output}{\boldsymbol{y}}}\)) given the model (\({\textcolor{params}{\boldsymbol{w}}}\)).

Figure 6 (panels a–c)

Properties of maximum likelihood estimation

  • Consistency: As \(N \to \infty\), the MLE converges to the true parameter value

Note

The consistency property makes sense if the model is correct (e.g. the data is generated by a linear model with Gaussian noise). But this is an assumption that is often not met in practice.

  • Unbiasedness: The MLE is unbiased, i.e. its expected value equals the true parameter

\[ {\mathbb{E}}_{p({\textcolor{output}{\boldsymbol{y}}}\mid {\textcolor{params}{\boldsymbol{w}}}, {\textcolor{input}{\boldsymbol{X}}})}[\hat{{\textcolor{params}{\boldsymbol{w}}}}] = {\textcolor{params}{\boldsymbol{w}}} \]

The proof is quite easy

\[ \begin{aligned} {\mathbb{E}}_{p({\textcolor{output}{\boldsymbol{y}}}\mid {\textcolor{params}{\boldsymbol{w}}}, {\textcolor{input}{\boldsymbol{X}}})}[\hat{{\textcolor{params}{\boldsymbol{w}}}}] &= \int \hat{{\textcolor{params}{\boldsymbol{w}}}} p({\textcolor{output}{\boldsymbol{y}}}\mid {\textcolor{params}{\boldsymbol{w}}}, {\textcolor{input}{\boldsymbol{X}}}) \dd {\textcolor{output}{\boldsymbol{y}}}\\ &= \int ({\textcolor{input}{\boldsymbol{X}}}^\top {\textcolor{input}{\boldsymbol{X}}})^{-1} {\textcolor{input}{\boldsymbol{X}}}^\top {\textcolor{output}{\boldsymbol{y}}}{\mathcal{N}}({\textcolor{output}{\boldsymbol{y}}}\mid {\textcolor{input}{\boldsymbol{X}}}{\textcolor{params}{\boldsymbol{w}}}, \sigma^2 {\boldsymbol{I}}) \dd {\textcolor{output}{\boldsymbol{y}}}\\ &= ({\textcolor{input}{\boldsymbol{X}}}^\top {\textcolor{input}{\boldsymbol{X}}})^{-1} {\textcolor{input}{\boldsymbol{X}}}^\top \int {\textcolor{output}{\boldsymbol{y}}}{\mathcal{N}}({\textcolor{output}{\boldsymbol{y}}}\mid {\textcolor{input}{\boldsymbol{X}}}{\textcolor{params}{\boldsymbol{w}}}, \sigma^2 {\boldsymbol{I}}) \dd {\textcolor{output}{\boldsymbol{y}}}\\ &= ({\textcolor{input}{\boldsymbol{X}}}^\top {\textcolor{input}{\boldsymbol{X}}})^{-1} {\textcolor{input}{\boldsymbol{X}}}^\top {\textcolor{input}{\boldsymbol{X}}}{\textcolor{params}{\boldsymbol{w}}}= {\textcolor{params}{\boldsymbol{w}}} \end{aligned} \]

Regularization

Ridge regression adds a penalty term to the loss function

\[ {\mathcal{L}}({\textcolor{params}{\boldsymbol{w}}}) = \frac{1}{2} \norm{{\textcolor{output}{\boldsymbol{y}}}- {\textcolor{input}{\boldsymbol{X}}}{\textcolor{params}{\boldsymbol{w}}}}_2^2 + \frac{\lambda}{2} \norm{{\textcolor{params}{\boldsymbol{w}}}}_2^2 \]

  • Objective: Find \({\textcolor{params}{\boldsymbol{w}}}\) that minimizes the error while keeping the parameters small

  • Solution: \({\textcolor{params}{\boldsymbol{w}}}^* = ({\textcolor{input}{\boldsymbol{X}}}^\top {\textcolor{input}{\boldsymbol{X}}}+ \lambda {\boldsymbol{I}})^{-1} {\textcolor{input}{\boldsymbol{X}}}^\top {\textcolor{output}{\boldsymbol{y}}}\)
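
The same Cholesky-based recipe applies here, since \({\textcolor{input}{\boldsymbol{X}}}^\top {\textcolor{input}{\boldsymbol{X}}}+ \lambda {\boldsymbol{I}}\) is symmetric and strictly positive definite for \(\lambda > 0\). A minimal Python sketch (the function name is illustrative):

```python
import numpy as np
from scipy.linalg import solve_triangular

def ridge_cholesky(X, y, lam):
    """Solve (X^T X + lam I) w = X^T y with a Cholesky factorization."""
    A = X.T @ X + lam * np.eye(X.shape[1])   # strictly positive definite for lam > 0
    L = np.linalg.cholesky(A)
    z = solve_triangular(L, X.T @ y, lower=True)     # forward substitution
    return solve_triangular(L.T, z, lower=False)     # backward substitution
```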

Probabilistic interpretation of regularization

  • Assumption: Data is generated by a linear model with Gaussian noise independent across samples

Same trick as before (exponential of the negative loss)

\[ \begin{aligned} \exp(-\gamma {\mathcal{L}}) &= \exp\left(-\frac \gamma 2 \norm{{\textcolor{output}{\boldsymbol{y}}}- {\textcolor{input}{\boldsymbol{X}}}{\textcolor{params}{\boldsymbol{w}}}}_2^2 - \frac \gamma 2 \lambda \norm{{\textcolor{params}{\boldsymbol{w}}}}_2^2\right) \\ &= \exp\left(-\frac \gamma 2 \norm{{\textcolor{output}{\boldsymbol{y}}}- {\textcolor{input}{\boldsymbol{X}}}{\textcolor{params}{\boldsymbol{w}}}}_2^2\right) \exp\left(-\frac \gamma 2 \lambda \norm{{\textcolor{params}{\boldsymbol{w}}}}_2^2\right) \\ &\propto {\mathcal{N}}\left({\textcolor{output}{\boldsymbol{y}}}\mid {\textcolor{input}{\boldsymbol{X}}}{\textcolor{params}{\boldsymbol{w}}}, \gamma^{-1} {\boldsymbol{I}}\right) {\mathcal{N}}\left({\textcolor{params}{\boldsymbol{w}}}\mid {\boldsymbol{0}}, (\gamma \lambda)^{-1} {\boldsymbol{I}}\right) \end{aligned} \]

Minimizing the loss is equivalent to maximizing the product of two Gaussian distributions (likelihood and prior).

We are getting closer to Bayesian inference!

Bayesian linear regression

Bayesian inference

Bayesian inference allows us to “transform” a prior distribution over the parameters into a posterior after observing the data

Prior distribution \(p({\textcolor{params}{\boldsymbol{w}}})\)

Figure 7

Posterior distribution \(p({\textcolor{params}{\boldsymbol{w}}}\mid {\textcolor{output}{\boldsymbol{y}}}, {\textcolor{input}{\boldsymbol{X}}})\)

Figure 8

Bayesian linear regression

Bayes’ rule:

\[ p({\textcolor{params}{\boldsymbol{w}}}\mid {\textcolor{output}{\boldsymbol{y}}}, {\textcolor{input}{\boldsymbol{X}}}) = \frac{p({\textcolor{output}{\boldsymbol{y}}}\mid {\textcolor{params}{\boldsymbol{w}}}, {\textcolor{input}{\boldsymbol{X}}}) p({\textcolor{params}{\boldsymbol{w}}})}{p({\textcolor{output}{\boldsymbol{y}}}\mid {\textcolor{input}{\boldsymbol{X}}})} \]

  • Prior: \(p({\textcolor{params}{\boldsymbol{w}}})\)
    • Encodes our beliefs about the parameters before observing the data
  • Likelihood: \(p({\textcolor{output}{\boldsymbol{y}}}\mid {\textcolor{params}{\boldsymbol{w}}}, {\textcolor{input}{\boldsymbol{X}}})\)
    • Encodes our model of the data
  • Posterior: \(p({\textcolor{params}{\boldsymbol{w}}}\mid {\textcolor{output}{\boldsymbol{y}}}, {\textcolor{input}{\boldsymbol{X}}})\)
    • Encodes our beliefs about the parameters after observing the data (i.e. conditioned on the data)
  • Evidence: \(p({\textcolor{output}{\boldsymbol{y}}}\mid {\textcolor{input}{\boldsymbol{X}}})\)
    • Normalizing constant, ensures that \(\int p({\textcolor{params}{\boldsymbol{w}}}\mid {\textcolor{output}{\boldsymbol{y}}}, {\textcolor{input}{\boldsymbol{X}}}) \dd {\textcolor{params}{\boldsymbol{w}}}= 1\)

Bayesian linear regression - Likelihood and prior

We model each observation as a noisy realization of a linear combination of the features. As before, we assume a Gaussian likelihood

\[ p({\textcolor{output}{\boldsymbol{y}}}\mid {\textcolor{params}{\boldsymbol{w}}}, {\textcolor{input}{\boldsymbol{X}}}) = {\mathcal{N}}({\textcolor{output}{\boldsymbol{y}}}\mid {\textcolor{input}{\boldsymbol{X}}}{\textcolor{params}{\boldsymbol{w}}}, \sigma^2 {\boldsymbol{I}}) \]

For the prior, we use a Gaussian distribution over the model parameters

\[ p({\textcolor{params}{\boldsymbol{w}}}) = {\mathcal{N}}({\textcolor{params}{\boldsymbol{w}}}\mid {\boldsymbol{0}}, {\boldsymbol{S}}) \]

In practice, we often use a diagonal covariance matrix \({\boldsymbol{S}}= \sigma_{\textcolor{params}{\boldsymbol{w}}}^2 {\boldsymbol{I}}\)

When can we compute the posterior?

Definition

A prior is conjugate to a likelihood if the posterior is in the same family as the prior.

Only a few conjugate priors exist, but they are very useful.

Examples:

  • Gaussian likelihood and Gaussian prior \(\Rightarrow\) Gaussian posterior
  • Binomial likelihood and Beta prior \(\Rightarrow\) Beta posterior

Full table available on wikipedia

Why is this useful?

\[ p({\textcolor{params}{\boldsymbol{w}}}\mid {\textcolor{output}{\boldsymbol{y}}}, {\textcolor{input}{\boldsymbol{X}}}) = \frac{p({\textcolor{output}{\boldsymbol{y}}}\mid {\textcolor{params}{\boldsymbol{w}}}, {\textcolor{input}{\boldsymbol{X}}}) p({\textcolor{params}{\boldsymbol{w}}})}{p({\textcolor{output}{\boldsymbol{y}}}\mid {\textcolor{input}{\boldsymbol{X}}})} \]

  • In general, the posterior is intractable to compute
    • We don’t know the form of the posterior \(p({\textcolor{params}{\boldsymbol{w}}}\mid {\textcolor{output}{\boldsymbol{y}}}, {\textcolor{input}{\boldsymbol{X}}})\)
    • The evidence \(p({\textcolor{output}{\boldsymbol{y}}}\mid {\textcolor{input}{\boldsymbol{X}}})\) is an integral
      • with no closed-form solution
      • high-dimensional, so it is also intractable to compute numerically
  • Analytical solution thanks to conjugacy:
    • We know the form of the posterior
    • We know the form of the normalization constant
    • We don’t need to compute the evidence, just some algebra to get the posterior

By conjugacy, the posterior for our model must be Gaussian, \({\mathcal{N}}({\textcolor{params}{\boldsymbol{w}}}\mid {\boldsymbol{\mu}}, {\boldsymbol{\Sigma}})\)

Ignoring terms that do not depend on \({\textcolor{params}{\boldsymbol{w}}}\):

\[ \begin{aligned} p({\textcolor{params}{\boldsymbol{w}}}\mid {\textcolor{output}{\boldsymbol{y}}}, {\textcolor{input}{\boldsymbol{X}}}) &\propto \exp\left(-\frac 1 2({\textcolor{params}{\boldsymbol{w}}}- {\boldsymbol{\mu}})^\top {\boldsymbol{\Sigma}}^{-1} ({\textcolor{params}{\boldsymbol{w}}}- {\boldsymbol{\mu}}) \right) \\ &= \exp\left(-\frac 1 2 \left( {\textcolor{params}{\boldsymbol{w}}}^\top {\boldsymbol{\Sigma}}^{-1} {\textcolor{params}{\boldsymbol{w}}}- 2 {\textcolor{params}{\boldsymbol{w}}}^\top {\boldsymbol{\Sigma}}^{-1} {\boldsymbol{\mu}}+ {\boldsymbol{\mu}}^\top {\boldsymbol{\Sigma}}^{-1} {\boldsymbol{\mu}}\right)\right)\\ &\propto \exp\left(-\frac 1 2 \left( {\textcolor{params}{\boldsymbol{w}}}^\top {\boldsymbol{\Sigma}}^{-1} {\textcolor{params}{\boldsymbol{w}}}- 2 {\textcolor{params}{\boldsymbol{w}}}^\top {\boldsymbol{\Sigma}}^{-1} {\boldsymbol{\mu}}\right)\right) \end{aligned} \]

From the likelihood and prior, we can write the posterior as

\[ \begin{aligned} p({\textcolor{output}{\boldsymbol{y}}}\mid {\textcolor{params}{\boldsymbol{w}}}, {\textcolor{input}{\boldsymbol{X}}})p({\textcolor{params}{\boldsymbol{w}}}) &\propto \exp\left(-\frac 1 {2\sigma^2} \norm{{\textcolor{output}{\boldsymbol{y}}}- {\textcolor{input}{\boldsymbol{X}}}{\textcolor{params}{\boldsymbol{w}}}}_2^2 - \frac 1 2 {\textcolor{params}{\boldsymbol{w}}}^\top {\boldsymbol{S}}^{-1} {\textcolor{params}{\boldsymbol{w}}}\right) \\ &\propto \exp\left(-\frac 1 2 \left( {\textcolor{params}{\boldsymbol{w}}}^\top \left(\frac 1 {\sigma^2} {\textcolor{input}{\boldsymbol{X}}}^\top {\textcolor{input}{\boldsymbol{X}}}+ {\boldsymbol{S}}^{-1}\right) {\textcolor{params}{\boldsymbol{w}}}- \frac 2 {\sigma^2}{\textcolor{params}{\boldsymbol{w}}}^\top {\textcolor{input}{\boldsymbol{X}}}^\top {\textcolor{output}{\boldsymbol{y}}}\right)\right) \end{aligned} \]

From previous slide,

\[ p({\textcolor{params}{\boldsymbol{w}}}\mid {\textcolor{output}{\boldsymbol{y}}}, {\textcolor{input}{\boldsymbol{X}}}) \propto \exp\left(-\frac 1 2 \left( {\textcolor{params}{\boldsymbol{w}}}^\top {\boldsymbol{\Sigma}}^{-1} {\textcolor{params}{\boldsymbol{w}}}- 2 {\textcolor{params}{\boldsymbol{w}}}^\top {\boldsymbol{\Sigma}}^{-1} {\boldsymbol{\mu}}\right)\right) \]

We can identify the posterior mean and covariance

Posterior covariance \[ {\boldsymbol{\Sigma}}= \left(\frac 1 {\sigma^2} {\textcolor{input}{\boldsymbol{X}}}^\top {\textcolor{input}{\boldsymbol{X}}}+ {\boldsymbol{S}}^{-1}\right)^{-1} \]

Posterior mean \[ {\boldsymbol{\mu}}= \frac 1 {\sigma^2} {\boldsymbol{\Sigma}}{\textcolor{input}{\boldsymbol{X}}}^\top {\textcolor{output}{\boldsymbol{y}}} \]
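
A minimal Python sketch of these two formulas (assuming a noise variance `sigma2` and prior covariance `S`; names are illustrative, and in practice one would use a Cholesky-based solve rather than explicit inverses):

```python
import numpy as np

def posterior(X, y, sigma2, S):
    """Posterior N(w | mu, Sigma) of Bayesian linear regression."""
    Sigma = np.linalg.inv(X.T @ X / sigma2 + np.linalg.inv(S))  # posterior covariance
    mu = Sigma @ X.T @ y / sigma2                               # posterior mean
    return mu, Sigma
```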

Bayesian linear regression - Example

We can now compute the posterior for the linear regression model

\[ p({\textcolor{params}{\boldsymbol{w}}}\mid {\textcolor{output}{\boldsymbol{y}}}, {\textcolor{input}{\boldsymbol{X}}}) = {\mathcal{N}}({\textcolor{params}{\boldsymbol{w}}}\mid {\boldsymbol{\mu}}, {\boldsymbol{\Sigma}})\quad \text{with} \quad {\boldsymbol{\mu}}= \frac 1 {\sigma^2} {\boldsymbol{\Sigma}}{\textcolor{input}{\boldsymbol{X}}}^\top {\textcolor{output}{\boldsymbol{y}}}\quad \text{and} \quad {\boldsymbol{\Sigma}}= \left(\frac 1 {\sigma^2} {\textcolor{input}{\boldsymbol{X}}}^\top {\textcolor{input}{\boldsymbol{X}}}+ {\boldsymbol{S}}^{-1}\right)^{-1} \]

Figure 9

How to make predictions?

The posterior distribution \(p({\textcolor{params}{\boldsymbol{w}}}\mid {\textcolor{output}{\boldsymbol{y}}}, {\textcolor{input}{\boldsymbol{X}}})\) gives us the uncertainty about the parameters

We can visualize the uncertainty by sampling from the posterior and plotting the predictions

Figure 10: Predictions from the Bayesian linear regression model

Predictive distribution

The predictive distribution is the distribution of the target variable \(y_\star\) given the input \({\textcolor{input}{\boldsymbol{x}}}_\star\)

Obtained by marginalizing the parameters using the posterior

\[ p(y_\star \mid {\textcolor{input}{\boldsymbol{x}}}_\star, {\textcolor{output}{\boldsymbol{y}}}, {\textcolor{input}{\boldsymbol{X}}}) = \int p(y_\star \mid {\textcolor{params}{\boldsymbol{w}}}, {\textcolor{input}{\boldsymbol{x}}}_\star ) p({\textcolor{params}{\boldsymbol{w}}}\mid {\textcolor{output}{\boldsymbol{y}}}, {\textcolor{input}{\boldsymbol{X}}}) \dd {\textcolor{params}{\boldsymbol{w}}} \]

For linear regression, the predictive distribution is Gaussian

\[ p(y_\star \mid {\textcolor{input}{\boldsymbol{x}}}_\star, {\textcolor{output}{\boldsymbol{y}}}, {\textcolor{input}{\boldsymbol{X}}}) = {\mathcal{N}}(y_\star \mid {\boldsymbol{\mu}}^\top {\textcolor{input}{\boldsymbol{x}}}_\star, {\textcolor{input}{\boldsymbol{x}}}_\star^\top {\boldsymbol{\Sigma}}{\textcolor{input}{\boldsymbol{x}}}_\star + \sigma^2) \]
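
A sketch of the predictive mean and variance for a new input, reusing the posterior `mu` and `Sigma` computed above (names are illustrative):

```python
import numpy as np

def predictive(x_star, mu, Sigma, sigma2):
    """Predictive p(y_star | x_star, D) = N(mu^T x_star, x_star^T Sigma x_star + sigma2)."""
    mean = mu @ x_star                        # posterior-mean prediction
    var = x_star @ Sigma @ x_star + sigma2    # parameter uncertainty + observation noise
    return mean, var
```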

Predictive distribution - Example

Figure 11: Predictive distribution from the Bayesian linear regression model

Bayesian linear regression with basis functions

The same approach can be used with non-linear basis functions

Transform the input \({\textcolor{input}{\boldsymbol{x}}}\) using a non-linear function \(\varphi({\textcolor{input}{\boldsymbol{x}}})\)

\[ {\textcolor{input}{\boldsymbol{x}}}\to \varphi({\textcolor{input}{\boldsymbol{x}}}) = [\varphi_1({\textcolor{input}{\boldsymbol{x}}}), \ldots, \varphi_D({\textcolor{input}{\boldsymbol{x}}})]^\top \]

For convenience, define \({\textcolor{input}{\boldsymbol{\Phi}}}= [\varphi({\textcolor{input}{\boldsymbol{x}}}_1), \ldots, \varphi({\textcolor{input}{\boldsymbol{x}}}_N)]^\top\)

Posterior

\[ p({\textcolor{params}{\boldsymbol{w}}}\mid {\textcolor{output}{\boldsymbol{y}}}, {\textcolor{input}{\boldsymbol{\Phi}}}) = {\mathcal{N}}({\textcolor{params}{\boldsymbol{w}}}\mid {\boldsymbol{\mu}}, {\boldsymbol{\Sigma}}) \quad \text{with} \quad {\boldsymbol{\mu}}= \frac 1 {\sigma^2} {\boldsymbol{\Sigma}}{\textcolor{input}{\boldsymbol{\Phi}}}^\top {\textcolor{output}{\boldsymbol{y}}}\quad \text{and} \quad {\boldsymbol{\Sigma}}= \left(\frac 1 {\sigma^2} {\textcolor{input}{\boldsymbol{\Phi}}}^\top {\textcolor{input}{\boldsymbol{\Phi}}}+ {\boldsymbol{S}}^{-1}\right)^{-1} \]

Predictive distribution

\[ p(y_\star \mid {\textcolor{input}{\boldsymbol{x}}}_\star, {\textcolor{output}{\boldsymbol{y}}}, {\textcolor{input}{\boldsymbol{\Phi}}}) = {\mathcal{N}}(y_\star \mid {\boldsymbol{\mu}}^\top \varphi({\textcolor{input}{\boldsymbol{x}}}_\star), \varphi({\textcolor{input}{\boldsymbol{x}}}_\star)^\top {\boldsymbol{\Sigma}}\varphi({\textcolor{input}{\boldsymbol{x}}}_\star) + \sigma^2) \]
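
A sketch of a polynomial feature map; the design matrix \({\textcolor{input}{\boldsymbol{\Phi}}}\) simply replaces \({\textcolor{input}{\boldsymbol{X}}}\) in the posterior and predictive formulas (the helper names, including the `posterior` and `predictive` sketches above, are illustrative):

```python
import numpy as np

def polynomial_features(x, K):
    """Map scalar inputs x (shape (N,)) to Phi = [x^0, x^1, ..., x^K], shape (N, K+1)."""
    return np.vander(x, N=K + 1, increasing=True)

# Phi replaces X in the earlier formulas, e.g.:
#   Phi = polynomial_features(x_train, K)
#   mu, Sigma = posterior(Phi, y_train, sigma2, S)
#   phi_star = polynomial_features(np.atleast_1d(x_star), K)[0]
#   mean, var = predictive(phi_star, mu, Sigma, sigma2)
```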

Bayesian linear regression with basis functions - Example

Use polynomial basis functions \(\varphi_i({\textcolor{input}{\boldsymbol{x}}}) = \textcolor{input}{x}^i\): \(f({\textcolor{params}{\boldsymbol{w}}}, {\textcolor{input}{\boldsymbol{x}}}) = \sum_{i=0}^K \textcolor{params}{w}_i \textcolor{input}{x}^i\)

Figure 12 (panels a–h)

Analysis of Bayesian linear regression

Connection with ridge regression and maximum a posteriori estimation

Maximum a posteriori estimation (MAP) computes the mode of the posterior distribution \(\arg\max p({\textcolor{params}{\boldsymbol{w}}}\mid {\textcolor{output}{\boldsymbol{y}}}, {\textcolor{input}{\boldsymbol{X}}})\). For Gaussians, it is the same as the mean

\[ \text{Ridge regression:} \quad \arg\min_{{\textcolor{params}{\boldsymbol{w}}}} \; \frac{1}{2} \norm{{\textcolor{output}{\boldsymbol{y}}}- {\textcolor{input}{\boldsymbol{X}}}{\textcolor{params}{\boldsymbol{w}}}}_2^2 + \frac{\lambda}{2} \norm{{\textcolor{params}{\boldsymbol{w}}}}_2^2 \]

\[ \text{MAP:} \quad \arg\max_{{\textcolor{params}{\boldsymbol{w}}}} \; p({\textcolor{params}{\boldsymbol{w}}}\mid {\textcolor{output}{\boldsymbol{y}}}, {\textcolor{input}{\boldsymbol{X}}}) = {\boldsymbol{\mu}}\quad \text{with} \quad p({\textcolor{params}{\boldsymbol{w}}}\mid {\textcolor{output}{\boldsymbol{y}}}, {\textcolor{input}{\boldsymbol{X}}}) = {\mathcal{N}}({\textcolor{params}{\boldsymbol{w}}}\mid {\boldsymbol{\mu}}, {\boldsymbol{\Sigma}}) \]

If \(\lambda = \sigma_{\textcolor{output}{\boldsymbol{y}}}^2 / \sigma_{\textcolor{params}{\boldsymbol{w}}}^2\), the ridge regression solution is equivalent to the MAP solution with a Gaussian prior
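
A quick numerical check of this equivalence, assuming the `ridge_cholesky` and `posterior` helpers sketched earlier (synthetic data, illustrative values):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = X @ np.array([0.5, -1.0, 2.0]) + 0.3 * rng.normal(size=50)

sigma2_y, sigma2_w = 0.3**2, 1.0
lam = sigma2_y / sigma2_w                                  # lambda = sigma_y^2 / sigma_w^2

w_ridge = ridge_cholesky(X, y, lam)                        # ridge solution
mu, _ = posterior(X, y, sigma2_y, sigma2_w * np.eye(3))    # posterior mean (= MAP for a Gaussian)
print(np.allclose(w_ridge, mu))                            # True
```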

Why not use MAP estimation?

  1. No uncertainty quantification: MAP only gives the mode of the posterior, not the full distribution
  2. Overfitting: MAP does not model the uncertainty in the parameters, so it can overfit the data, especially with complex models. This leads to overconfident predictions
  3. Mode in untypical regions: The mode of the posterior can lie in a region that carries little probability mass, leading to poor generalization

Figure 13: The mode of the posterior is not representative of the full distribution. Here, the mean would be a better estimate

Effect of the prior

  • Prior encodes our beliefs about the parameters before observing the data.
  • Prior effect diminishes with more data
  • When we don’t have much data, the prior can have a strong effect on the posterior

Question: How do we choose the prior?

  1. Data type:
    • Real-valued data: Gaussian prior
    • Positive data: log-normal prior, Gamma prior, etc.
    • Data in \([0, 1]\): Beta prior
    • Data summing to 1: Dirichlet prior
  2. Expert knowledge
  3. Computational convenience

Model selection

We can now solve the linear regression problem (Bayesian or not), but we still have to choose the model

Questions:

  • What is the best model for the data?
  • How many basis functions should we use?
  • How to avoid overfitting?

We could try to choose the model based on the likelihood of the data. Is this a good idea?

NO!

  • Higher complexity models will always have higher likelihood …
  • … but they will also overfit the data and generalize poorly

Model selection - Increasing complexity

Consider polynomial functions of increasing degree \(f({\textcolor{params}{\boldsymbol{w}}}, {\textcolor{input}{\boldsymbol{x}}}) = \sum_{d=0}^D \textcolor{params}{w}_d \textcolor{input}{x}^d\)

Figure 14 (panels a–g)

Model selection - Likelihood as a function of complexity

Figure 15: Train and test log-likelihood as a function of polynomial degree

Model selection - Error as a function of complexity

Figure 16: Train and test error as a function of polynomial degree

Model selection - Cross-validation

  • Cross-validation: Split the data into training and validation sets

  • Solve the model with the training set and evaluate the performance on the validation set

  • (Optional) Repeat the process for different splits of the data

  • Choose the model that performs best on the validation set
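
A minimal hold-out validation sketch of the procedure above, here for choosing a polynomial degree with the `ridge_cholesky` helper sketched earlier (illustrative; a full K-fold scheme would repeat this over several splits):

```python
import numpy as np

def holdout_select_degree(x, y, degrees, lam=1e-3, train_frac=0.8, seed=0):
    """Fit each polynomial degree on a training split, score MSE on the held-out split."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    n_train = int(train_frac * len(x))
    train, val = idx[:n_train], idx[n_train:]
    scores = {}
    for K in degrees:
        Phi = np.vander(x, N=K + 1, increasing=True)          # polynomial design matrix
        w = ridge_cholesky(Phi[train], y[train], lam)         # fit on the training split
        scores[K] = np.mean((y[val] - Phi[val] @ w) ** 2)     # validation MSE
    return min(scores, key=scores.get), scores                # best degree and all scores
```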

Pros

  • Simple
  • Works well in practice

Cons

  • Computationally expensive
  • Requires multiple runs
  • Works poorly with small datasets
  • Violates the likelihood principle

Likelihood principle

  • Likelihood principle: All the information from the data is contained in the likelihood function

  • Consequence: You should not base your inference on data that you could have observed but did not

  • Cross-validation (and other frequentist methods) violate the likelihood principle

Note

  • This is more of a philosophical point than a practical one.
  • It’s not a rule of nature but an argument that Bayesian methods are more principled.

Marginal likelihood

\[ p({\textcolor{params}{\boldsymbol{w}}}\mid {\textcolor{output}{\boldsymbol{y}}}, {\textcolor{input}{\boldsymbol{X}}}) = \frac{p({\textcolor{output}{\boldsymbol{y}}}\mid {\textcolor{params}{\boldsymbol{w}}}, {\textcolor{input}{\boldsymbol{X}}}) p({\textcolor{params}{\boldsymbol{w}}})}{p({\textcolor{output}{\boldsymbol{y}}}\mid {\textcolor{input}{\boldsymbol{X}}})} \]

The marginal likelihood is the normalization constant of the posterior distribution

\[ p({\textcolor{output}{\boldsymbol{y}}}\mid {\textcolor{input}{\boldsymbol{X}}}) = \int p({\textcolor{output}{\boldsymbol{y}}}\mid {\textcolor{params}{\boldsymbol{w}}}, {\textcolor{input}{\boldsymbol{X}}}) p({\textcolor{params}{\boldsymbol{w}}}) \dd {\textcolor{params}{\boldsymbol{w}}} \]

  • We average the likelihood over all possible values of the parameters, weighted by the prior

  • It tells us how likely the data is under the model

Let’s be explicit about the model in the marginal likelihood

\[ p({\textcolor{output}{\boldsymbol{y}}}\mid {\textcolor{input}{\boldsymbol{X}}}, m) = \int p({\textcolor{output}{\boldsymbol{y}}}\mid {\textcolor{params}{\boldsymbol{w}}}, {\textcolor{input}{\boldsymbol{X}}}, m) p({\textcolor{params}{\boldsymbol{w}}}\mid m) \dd {\textcolor{params}{\boldsymbol{w}}} \]

where \(m\) is the model (e.g. polynomial degree) and \(p({\textcolor{params}{\boldsymbol{w}}}\mid m)\) is the prior over the parameters for the model \(m\).

  • Recipe for Bayesian model selection:

\[ \widehat{m} = \arg\max_m p({\textcolor{output}{\boldsymbol{y}}}\mid {\textcolor{input}{\boldsymbol{X}}}, m) \]

This is also known as Type II maximum likelihood or evidence maximization

Model selection with Bayesian linear regression

Given:

  • Likelihood: \(p({\textcolor{output}{\boldsymbol{y}}}\mid {\textcolor{params}{\boldsymbol{w}}}, {\textcolor{input}{\boldsymbol{X}}}) = {\mathcal{N}}({\textcolor{output}{\boldsymbol{y}}}\mid {\textcolor{input}{\boldsymbol{X}}}{\textcolor{params}{\boldsymbol{w}}}, \sigma_{\textcolor{output}{\boldsymbol{y}}}^2 {\boldsymbol{I}})\)
  • Prior: \(p({\textcolor{params}{\boldsymbol{w}}}\mid m) = {\mathcal{N}}({\textcolor{params}{\boldsymbol{w}}}\mid {\boldsymbol{\mu}}_p, {\boldsymbol{\Sigma}}_p)\)

The marginal likelihood is a Gaussian

\[ p({\textcolor{output}{\boldsymbol{y}}}\mid {\textcolor{input}{\boldsymbol{X}}}) = {\mathcal{N}}({\textcolor{output}{\boldsymbol{y}}}\mid {\textcolor{input}{\boldsymbol{X}}}{\boldsymbol{\mu}}_p, {\textcolor{input}{\boldsymbol{X}}}{\boldsymbol{\Sigma}}_p {\textcolor{input}{\boldsymbol{X}}}^\top + \sigma_{\textcolor{output}{\boldsymbol{y}}}^2 {\boldsymbol{I}}) \]

It does NOT depend on the parameters \({\textcolor{params}{\boldsymbol{w}}}\) but only on the model!

Exercise

Write the expression if \({\boldsymbol{\Sigma}}_p = \sigma_{\textcolor{params}{\boldsymbol{w}}}^2 {\boldsymbol{I}}\) and \({\boldsymbol{\mu}}_p = {\boldsymbol{0}}\)

Tip

Remember that when we compute PDFs, in practice we work with their logarithms to avoid numerical issues.
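
As a sketch, the log marginal likelihood of this model with \({\boldsymbol{\mu}}_p = {\boldsymbol{0}}\) and \({\boldsymbol{\Sigma}}_p = \sigma_{\textcolor{params}{\boldsymbol{w}}}^2 {\boldsymbol{I}}\) can be evaluated directly with SciPy (function name illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_marginal_likelihood(Phi, y, sigma2_y, sigma2_w):
    """log p(y | X, m) = log N(y | 0, sigma2_w * Phi Phi^T + sigma2_y * I)."""
    N = Phi.shape[0]
    cov = sigma2_w * Phi @ Phi.T + sigma2_y * np.eye(N)
    return multivariate_normal.logpdf(y, mean=np.zeros(N), cov=cov)

# Model selection: evaluate this for each candidate degree (with Phi built from
# that degree's basis functions) and keep the argmax.
```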

Model selection with Bayesian linear regression - Example

Using the marginal likelihood to select the best polynomial degree

Figure 17: Marginal likelihood as a function of polynomial degree

Why does it work? The Bayesian Occam’s razor

Does \(p({\textcolor{output}{\boldsymbol{y}}}\mid {\textcolor{input}{\boldsymbol{X}}}, m)\) favor complex models? No!

  • We marginalize over the parameters, not maximize them
  • The marginal likelihood penalizes complex models that don’t fit the data well

This is known as the Bayesian Occam’s razor: the simplest model that explains the data is the best

Intuition

Apply chain rule to the marginal likelihood (drop all dependencies): \[ p({\textcolor{output}{\boldsymbol{y}}}) = p(\textcolor{output}{y}_1) p(\textcolor{output}{y}_2 \mid \textcolor{output}{y}_1) p(\textcolor{output}{y}_3 \mid \textcolor{output}{y}_1, \textcolor{output}{y}_2) \ldots p(\textcolor{output}{y}_N \mid \textcolor{output}{y}_1, \ldots, \textcolor{output}{y}_{N-1}) \]

  • If the model is too complex, its predictive distributions \(p(\textcolor{output}{y}_n \mid \textcolor{output}{y}_1, \ldots, \textcolor{output}{y}_{n-1})\) are spread over many possible outcomes (especially for the first data points), so the observed data receives low density and the marginal likelihood is penalized

⏩ Next: Coin toss exercise