Advanced Statistical Inference
EURECOM
\[ \require{physics} \definecolor{input}{rgb}{0.42, 0.55, 0.74} \definecolor{params}{rgb}{0.51,0.70,0.40} \definecolor{output}{rgb}{0.843, 0.608, 0} \definecolor{vparams}{rgb}{0.58, 0, 0.83} \definecolor{noise}{rgb}{0.0, 0.48, 0.65} \definecolor{latent}{rgb}{0.8, 0.0, 0.8} \]
Consider two continuous random variables \(x\) and \(y\)
\[ p(x) = \int p(x, y) \dd y \]
\[ p(x, y) = p(x \mid y) p(y) = p(y \mid x) p(x) \]
\[ p(x \mid y) = \frac{p(y \mid x) p(x)}{p(y)} \]
Consider a random vector \({\boldsymbol{z}}\) with \(D\) components (\({\boldsymbol{z}}\in{\mathbb{R}}^D\))
\[ p({\boldsymbol{z}}) = p(z_1, z_2, \ldots, z_D) = p(z_1) p(z_2 \mid z_1) p(z_3 \mid z_1, z_2) \cdots p(z_D \mid z_1, z_2, \ldots, z_{D-1}) \]
If \(z_i\) are independent, then
\[ p(z_d \mid z_1, \ldots, z_{d-1}) = p(z_d) \] and the chain rule becomes
\[ p({\boldsymbol{z}}) = p(z_1) p(z_2) \cdots p(z_D) = \prod_{d=1}^D p(z_d) \]
Objective: Learn functions from data (and quantify uncertainty)
\[ {\textcolor{input}{\boldsymbol{X}}}= \begin{bmatrix} {\textcolor{input}{\boldsymbol{x}}}_1^\top \\ \vdots \\ {\textcolor{input}{\boldsymbol{x}}}_N^\top \end{bmatrix} \in {\mathbb{R}}^{N \times D} \quad \text{or with a bias term} \quad {\textcolor{input}{\boldsymbol{X}}}= \begin{bmatrix} 1 & {\textcolor{input}{\boldsymbol{x}}}_1^\top \\ \vdots & \vdots \\ 1 & {\textcolor{input}{\boldsymbol{x}}}_N^\top \end{bmatrix} \in {\mathbb{R}}^{N \times (D+1)} \]
\[ {\textcolor{output}{\boldsymbol{y}}}= \begin{bmatrix} \textcolor{output}{y}_1 \\ \vdots \\ \textcolor{output}{y}_N \end{bmatrix} \in {\mathbb{R}}^N \]
Linear models implement a linear combination of (basis) functions
\[ f({\textcolor{input}{\boldsymbol{x}}}) = \sum_{d=1}^D \textcolor{params}{w}_d \varphi_d({\textcolor{input}{\boldsymbol{x}}}) = {\textcolor{params}{\boldsymbol{w}}}^\top {\boldsymbol{\varphi}}({\textcolor{input}{\boldsymbol{x}}}) \]
Parameters: \({\textcolor{params}{\boldsymbol{w}}}= [\textcolor{params}{w}_1, \ldots, \textcolor{params}{w}_D]^\top\)
Basis functions: \({\boldsymbol{\varphi}}({\textcolor{input}{\boldsymbol{x}}}) = [\varphi_1({\textcolor{input}{\boldsymbol{x}}}), \ldots, \varphi_D({\textcolor{input}{\boldsymbol{x}}})]^\top\)
Important
Any model that is a linear function of its parameters (not necessarily of the input) is a linear model
For simplicity, let’s consider linear functions
\[ f({\textcolor{params}{\boldsymbol{w}}}, {\textcolor{input}{\boldsymbol{x}}}) = \sum_{d=1}^D \textcolor{params}{w}_d \textcolor{input}{x}_d = {\textcolor{params}{\boldsymbol{w}}}^\top {\textcolor{input}{\boldsymbol{x}}} \]
\[ {\mathcal{L}}({\textcolor{params}{\boldsymbol{w}}}) = \frac{1}{2} \sum_{n=1}^N (\textcolor{output}{y}_n - f({\textcolor{params}{\boldsymbol{w}}}, {\textcolor{input}{\boldsymbol{x}}}_n))^2 = \frac{1}{2} \norm{{\textcolor{output}{\boldsymbol{y}}}- {\textcolor{input}{\boldsymbol{X}}}{\textcolor{params}{\boldsymbol{w}}}}_2^2 \]
\[ {\mathcal{L}}({\textcolor{params}{\boldsymbol{w}}}) = \frac{1}{2} \norm{{\textcolor{output}{\boldsymbol{y}}}- {\textcolor{input}{\boldsymbol{X}}}{\textcolor{params}{\boldsymbol{w}}}}_2^2 \]
Setting the gradient to zero: \(\nabla_{{\textcolor{params}{\boldsymbol{w}}}} {\mathcal{L}}({\textcolor{params}{\boldsymbol{w}}}) = -{\textcolor{input}{\boldsymbol{X}}}^\top ({\textcolor{output}{\boldsymbol{y}}}- {\textcolor{input}{\boldsymbol{X}}}{\textcolor{params}{\boldsymbol{w}}}) = 0\)
Solution: \({\textcolor{params}{\boldsymbol{w}}}^* = ({\textcolor{input}{\boldsymbol{X}}}^\top {\textcolor{input}{\boldsymbol{X}}})^{-1} {\textcolor{input}{\boldsymbol{X}}}^\top {\textcolor{output}{\boldsymbol{y}}}\)
Exercise
Implement the least squares solution for linear regression using the Cholesky decomposition and back-substitution (see the linear algebra revision)
Figure 2: An example of a generated dataset
Use the solution to fit a linear model to the data \({\textcolor{params}{\boldsymbol{w}}}^* = ({\textcolor{input}{\boldsymbol{X}}}^\top {\textcolor{input}{\boldsymbol{X}}})^{-1} {\textcolor{input}{\boldsymbol{X}}}^\top {\textcolor{output}{\boldsymbol{y}}}\)
Figure 3: Linear regression fit to the data
Remember the Cholesky decomposition and the forward and backward substitution
\[ {\textcolor{params}{\boldsymbol{w}}}= {\underbrace{({\textcolor{input}{\boldsymbol{X}}}^\top {\textcolor{input}{\boldsymbol{X}}})}_{{\boldsymbol{A}}}}^{-1} \overbrace{{\textcolor{input}{\boldsymbol{X}}}^\top {\textcolor{output}{\boldsymbol{y}}}}^{{\boldsymbol{c}}} \]
Solution of a linear system \({\boldsymbol{A}}{\textcolor{params}{\boldsymbol{w}}}= {\boldsymbol{c}}\)
Compute the Cholesky decomposition \({\boldsymbol{A}}= {\boldsymbol{L}}{\boldsymbol{L}}^\top\)
Solve \({\boldsymbol{L}}{\boldsymbol{z}}= {\boldsymbol{c}}\) for \({\boldsymbol{z}}\) using forward substitution
Solve \({\boldsymbol{L}}^\top {\textcolor{params}{\boldsymbol{w}}}= {\boldsymbol{z}}\) for \({\textcolor{params}{\boldsymbol{w}}}\) using backward substitution
Why? Solving two triangular systems is cheaper and numerically more stable than explicitly inverting \({\boldsymbol{A}}\)
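The three steps above can be sketched directly in NumPy/SciPy. This is a minimal, hedged sketch of the exercise on an illustrative synthetic dataset (the function name, data, and values are assumptions, not the course's reference solution):

```python
import numpy as np
from scipy.linalg import solve_triangular

def lstsq_cholesky(X, y):
    """Solve (X^T X) w = X^T y via Cholesky and forward/backward substitution."""
    A = X.T @ X                                   # A = X^T X (symmetric positive definite)
    c = X.T @ y                                   # c = X^T y
    L = np.linalg.cholesky(A)                     # A = L L^T, with L lower triangular
    z = solve_triangular(L, c, lower=True)        # forward substitution:  L z = c
    w = solve_triangular(L.T, z, lower=False)     # backward substitution: L^T w = z
    return w

# Illustrative synthetic data (not the dataset shown in the figures)
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=50)
X = np.column_stack([np.ones_like(x), x])         # bias column + input
y = 0.5 + 2.0 * x + rng.normal(0.0, 0.1, size=50)

print(lstsq_cholesky(X, y))                       # close to [0.5, 2.0]
```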
Consider a simple transformation of the loss function
\[ \ell_i = \frac 1 2 (\textcolor{output}{y}_i - {\textcolor{params}{\boldsymbol{w}}}^\top {\textcolor{input}{\boldsymbol{x}}}_i)^2 \]
\[ \exp(-\gamma \ell_i) = \exp\left(-\frac \gamma 2 (\textcolor{output}{y}_i - {\textcolor{params}{\boldsymbol{w}}}^\top {\textcolor{input}{\boldsymbol{x}}}_i)^2\right) \]
Minimizing the loss is equivalent to maximizing the likelihood of the data under a Gaussian noise model
\[ \exp(-\gamma {\mathcal{L}}) = \prod_{i=1}^N \exp(-\gamma \ell_i) \propto {\mathcal{N}}\left({\textcolor{output}{\boldsymbol{y}}}\mid {\textcolor{input}{\boldsymbol{X}}}{\textcolor{params}{\boldsymbol{w}}}, \gamma^{-1} {\boldsymbol{I}}\right) \quad \text{with } {\mathcal{L}}= \sum_{i=1}^N \ell_i \]
Assumption: Data is generated by a linear model with Gaussian noise \(\epsilon \sim {\mathcal{N}}(0, \sigma^2)\) independent across samples (with \(\sigma^2=\gamma^{-1}\)).
Maximum likelihood estimation is solving
\[ \arg\max_{{\textcolor{params}{\boldsymbol{w}}}} \prod_{i=1}^N \underbrace{{\mathcal{N}}\left(\textcolor{output}{y}_i \mid {\textcolor{params}{\boldsymbol{w}}}^\top {\textcolor{input}{\boldsymbol{x}}}_i, \sigma^2 \right)}_{p(\textcolor{output}{y}_i \mid {\textcolor{params}{\boldsymbol{w}}}, {\textcolor{input}{\boldsymbol{x}}}_i)} \]
\(p({\textcolor{output}{\boldsymbol{y}}}\mid {\textcolor{params}{\boldsymbol{w}}}, {\textcolor{input}{\boldsymbol{X}}}) = \prod_{i=1}^N p(\textcolor{output}{y}_i \mid {\textcolor{params}{\boldsymbol{w}}}, {\textcolor{input}{\boldsymbol{x}}}_i)\) is the likelihood of the data given the model
Tip
In practice, we do not maximize the likelihood directly, but the log-likelihood
\[ \arg\max_{{\textcolor{params}{\boldsymbol{w}}}} \sum_{i=1}^N \log {\mathcal{N}}\left(\textcolor{output}{y}_i \mid {\textcolor{params}{\boldsymbol{w}}}^\top {\textcolor{input}{\boldsymbol{x}}}_i, \sigma^2 \right) \]
Because the logarithm is strictly monotonic, the maximizer is unchanged, and summing log-probabilities is numerically more stable than multiplying many small numbers
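As a quick numerical check of this equivalence, maximizing the Gaussian log-likelihood with a generic optimizer recovers the least-squares solution. A hedged sketch; the synthetic data and the use of `scipy.optimize.minimize` are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
sigma = 0.3
x = rng.uniform(-1.0, 1.0, size=100)
X = np.column_stack([np.ones_like(x), x])
y = X @ np.array([0.5, -1.5]) + rng.normal(0.0, sigma, size=100)

def neg_log_likelihood(w):
    # -sum_i log N(y_i | w^T x_i, sigma^2), dropping terms that do not depend on w
    r = y - X @ w
    return 0.5 * np.sum(r ** 2) / sigma ** 2

w_mle = minimize(neg_log_likelihood, x0=np.zeros(2)).x
w_ls = np.linalg.solve(X.T @ X, X.T @ y)          # closed-form least squares
print(np.allclose(w_mle, w_ls, atol=1e-4))        # same maximizer
```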
Note
The consistency property holds if the model is well specified (i.e. the data are actually generated by a linear model with Gaussian noise). This assumption is often not met in practice.
\[ {\mathbb{E}}_{p({\textcolor{output}{\boldsymbol{y}}}\mid {\textcolor{params}{\boldsymbol{w}}}, {\textcolor{input}{\boldsymbol{X}}})}[\hat{{\textcolor{params}{\boldsymbol{w}}}}] = {\textcolor{params}{\boldsymbol{w}}} \]
The proof is quite easy
\[ \begin{aligned} {\mathbb{E}}_{p({\textcolor{output}{\boldsymbol{y}}}\mid {\textcolor{params}{\boldsymbol{w}}}, {\textcolor{input}{\boldsymbol{X}}})}[\hat{{\textcolor{params}{\boldsymbol{w}}}}] &= \int \hat{{\textcolor{params}{\boldsymbol{w}}}} p({\textcolor{output}{\boldsymbol{y}}}\mid {\textcolor{params}{\boldsymbol{w}}}, {\textcolor{input}{\boldsymbol{X}}}) \dd {\textcolor{output}{\boldsymbol{y}}}\\ &= \int ({\textcolor{input}{\boldsymbol{X}}}^\top {\textcolor{input}{\boldsymbol{X}}})^{-1} {\textcolor{input}{\boldsymbol{X}}}^\top {\textcolor{output}{\boldsymbol{y}}}{\mathcal{N}}({\textcolor{output}{\boldsymbol{y}}}\mid {\textcolor{input}{\boldsymbol{X}}}{\textcolor{params}{\boldsymbol{w}}}, \sigma^2 {\boldsymbol{I}}) \dd {\textcolor{output}{\boldsymbol{y}}}\\ &= ({\textcolor{input}{\boldsymbol{X}}}^\top {\textcolor{input}{\boldsymbol{X}}})^{-1} {\textcolor{input}{\boldsymbol{X}}}^\top \int {\textcolor{output}{\boldsymbol{y}}}{\mathcal{N}}({\textcolor{output}{\boldsymbol{y}}}\mid {\textcolor{input}{\boldsymbol{X}}}{\textcolor{params}{\boldsymbol{w}}}, \sigma^2 {\boldsymbol{I}}) \dd {\textcolor{output}{\boldsymbol{y}}}\\ &= ({\textcolor{input}{\boldsymbol{X}}}^\top {\textcolor{input}{\boldsymbol{X}}})^{-1} {\textcolor{input}{\boldsymbol{X}}}^\top {\textcolor{input}{\boldsymbol{X}}}{\textcolor{params}{\boldsymbol{w}}}= {\textcolor{params}{\boldsymbol{w}}} \end{aligned} \]
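The same statement can be checked by simulation: average \(\hat{{\textcolor{params}{\boldsymbol{w}}}}\) over many datasets drawn from \(p({\textcolor{output}{\boldsymbol{y}}}\mid {\textcolor{params}{\boldsymbol{w}}}, {\textcolor{input}{\boldsymbol{X}}})\). A rough sketch with made-up values:

```python
import numpy as np

rng = np.random.default_rng(2)
sigma = 0.5
X = np.column_stack([np.ones(30), rng.uniform(-1.0, 1.0, size=30)])
w_true = np.array([1.0, 2.0])

# Draw many datasets y ~ N(X w_true, sigma^2 I) and average the ML estimates
estimates = []
for _ in range(5000):
    y = X @ w_true + rng.normal(0.0, sigma, size=30)
    estimates.append(np.linalg.solve(X.T @ X, X.T @ y))

print(np.mean(estimates, axis=0))   # approximately w_true = [1.0, 2.0]
```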
Ridge regression adds a penalty term to the loss function
\[ {\mathcal{L}}({\textcolor{params}{\boldsymbol{w}}}) = \frac{1}{2} \norm{{\textcolor{output}{\boldsymbol{y}}}- {\textcolor{input}{\boldsymbol{X}}}{\textcolor{params}{\boldsymbol{w}}}}_2^2 + \frac{\lambda}{2} \norm{{\textcolor{params}{\boldsymbol{w}}}}_2^2 \]
Objective: Find \({\textcolor{params}{\boldsymbol{w}}}\) that minimizes the error while keeping the parameters small
Solution: \({\textcolor{params}{\boldsymbol{w}}}^* = ({\textcolor{input}{\boldsymbol{X}}}^\top {\textcolor{input}{\boldsymbol{X}}}+ \lambda {\boldsymbol{I}})^{-1} {\textcolor{input}{\boldsymbol{X}}}^\top {\textcolor{output}{\boldsymbol{y}}}\)
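A one-function sketch of this closed-form solution (the example value of the regularization strength is purely illustrative):

```python
import numpy as np

def ridge_solution(X, y, lam):
    """Closed-form ridge estimate w* = (X^T X + lam I)^{-1} X^T y."""
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

# e.g. w_ridge = ridge_solution(X, y, lam=0.1)
```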
Same trick as before (exponential of the negative loss)
\[ \begin{aligned} \exp(-\gamma {\mathcal{L}}) &= \exp\left(-\frac \gamma 2 \norm{{\textcolor{output}{\boldsymbol{y}}}- {\textcolor{input}{\boldsymbol{X}}}{\textcolor{params}{\boldsymbol{w}}}}_2^2 - \frac \gamma 2 \lambda \norm{{\textcolor{params}{\boldsymbol{w}}}}_2^2\right) \\ &= \exp\left(-\frac \gamma 2 \norm{{\textcolor{output}{\boldsymbol{y}}}- {\textcolor{input}{\boldsymbol{X}}}{\textcolor{params}{\boldsymbol{w}}}}_2^2\right) \exp\left(-\frac \gamma 2 \lambda \norm{{\textcolor{params}{\boldsymbol{w}}}}_2^2\right) \\ &\propto {\mathcal{N}}\left({\textcolor{output}{\boldsymbol{y}}}\mid {\textcolor{input}{\boldsymbol{X}}}{\textcolor{params}{\boldsymbol{w}}}, \gamma^{-1} {\boldsymbol{I}}\right) {\mathcal{N}}\left({\textcolor{params}{\boldsymbol{w}}}\mid {\boldsymbol{0}}, (\gamma \lambda)^{-1} {\boldsymbol{I}}\right) \end{aligned} \]
Minimizing the loss is equivalent to maximizing the product of two Gaussian distributions (likelihood and prior).
We are getting closer to Bayesian inference!
Bayesian inference allows us to “transform” a prior distribution over the parameters into a posterior after observing the data
Prior distribution \(p({\textcolor{params}{\boldsymbol{w}}})\)
Posterior distribution \(p({\textcolor{params}{\boldsymbol{w}}}\mid {\textcolor{output}{\boldsymbol{y}}}, {\textcolor{input}{\boldsymbol{X}}})\)
Bayes’ rule :
\[ p({\textcolor{params}{\boldsymbol{w}}}\mid {\textcolor{output}{\boldsymbol{y}}}, {\textcolor{input}{\boldsymbol{X}}}) = \frac{p({\textcolor{output}{\boldsymbol{y}}}\mid {\textcolor{params}{\boldsymbol{w}}}, {\textcolor{input}{\boldsymbol{X}}}) p({\textcolor{params}{\boldsymbol{w}}})}{p({\textcolor{output}{\boldsymbol{y}}}\mid {\textcolor{input}{\boldsymbol{X}}})} \]
We model each observation as a noisy realization of a linear combination of the features. As before, we assume a Gaussian likelihood
\[ p({\textcolor{output}{\boldsymbol{y}}}\mid {\textcolor{params}{\boldsymbol{w}}}, {\textcolor{input}{\boldsymbol{X}}}) = {\mathcal{N}}({\textcolor{output}{\boldsymbol{y}}}\mid {\textcolor{input}{\boldsymbol{X}}}{\textcolor{params}{\boldsymbol{w}}}, \sigma^2 {\boldsymbol{I}}) \]
For the prior, we use a Gaussian distribution over the model parameters
\[ p({\textcolor{params}{\boldsymbol{w}}}) = {\mathcal{N}}({\textcolor{params}{\boldsymbol{w}}}\mid {\boldsymbol{0}}, {\boldsymbol{S}}) \]
In practice, we often use a diagonal covariance matrix \({\boldsymbol{S}}= \sigma_{\textcolor{params}{\boldsymbol{w}}}^2 {\boldsymbol{I}}\)
Definition
A prior is conjugate to a likelihood if the posterior is in the same family as the prior.
Only a few conjugate priors exist, but they are very useful.
Examples: a Gaussian likelihood with a Gaussian prior on the mean (our case), a Bernoulli likelihood with a Beta prior, a Poisson likelihood with a Gamma prior
The full table is available on Wikipedia
\[ p({\textcolor{params}{\boldsymbol{w}}}\mid {\textcolor{output}{\boldsymbol{y}}}, {\textcolor{input}{\boldsymbol{X}}}) = \frac{p({\textcolor{output}{\boldsymbol{y}}}\mid {\textcolor{params}{\boldsymbol{w}}}, {\textcolor{input}{\boldsymbol{X}}}) p({\textcolor{params}{\boldsymbol{w}}})}{p({\textcolor{output}{\boldsymbol{y}}}\mid {\textcolor{input}{\boldsymbol{X}}})} \]
Back to our model: by conjugacy, the posterior must be Gaussian, \({\mathcal{N}}({\textcolor{params}{\boldsymbol{w}}}\mid {\boldsymbol{\mu}}, {\boldsymbol{\Sigma}})\)
Ignoring terms that are constant in \({\textcolor{params}{\boldsymbol{w}}}\):
\[ \begin{aligned} p({\textcolor{params}{\boldsymbol{w}}}\mid {\textcolor{output}{\boldsymbol{y}}}, {\textcolor{input}{\boldsymbol{X}}}) &\propto \exp\left(-\frac 1 2({\textcolor{params}{\boldsymbol{w}}}- {\boldsymbol{\mu}})^\top {\boldsymbol{\Sigma}}^{-1} ({\textcolor{params}{\boldsymbol{w}}}- {\boldsymbol{\mu}}) \right) \\ &= \exp\left(-\frac 1 2 \left( {\textcolor{params}{\boldsymbol{w}}}^\top {\boldsymbol{\Sigma}}^{-1} {\textcolor{params}{\boldsymbol{w}}}- 2 {\textcolor{params}{\boldsymbol{w}}}^\top {\boldsymbol{\Sigma}}^{-1} {\boldsymbol{\mu}}+ {\boldsymbol{\mu}}^\top {\boldsymbol{\Sigma}}^{-1} {\boldsymbol{\mu}}\right)\right)\\ &\propto \exp\left(-\frac 1 2 \left( {\textcolor{params}{\boldsymbol{w}}}^\top {\boldsymbol{\Sigma}}^{-1} {\textcolor{params}{\boldsymbol{w}}}- 2 {\textcolor{params}{\boldsymbol{w}}}^\top {\boldsymbol{\Sigma}}^{-1} {\boldsymbol{\mu}}\right)\right) \end{aligned} \]
From the likelihood and prior, we can write the posterior as
\[ \begin{aligned} p({\textcolor{output}{\boldsymbol{y}}}\mid {\textcolor{params}{\boldsymbol{w}}}, {\textcolor{input}{\boldsymbol{X}}})p({\textcolor{params}{\boldsymbol{w}}}) &\propto \exp\left(-\frac 1 {2\sigma^2} \norm{{\textcolor{output}{\boldsymbol{y}}}- {\textcolor{input}{\boldsymbol{X}}}{\textcolor{params}{\boldsymbol{w}}}}_2^2 - \frac 1 2 {\textcolor{params}{\boldsymbol{w}}}^\top {\boldsymbol{S}}^{-1} {\textcolor{params}{\boldsymbol{w}}}\right) \\ &\propto \exp\left(-\frac 1 2 \left( {\textcolor{params}{\boldsymbol{w}}}^\top \left(\frac 1 {\sigma^2} {\textcolor{input}{\boldsymbol{X}}}^\top {\textcolor{input}{\boldsymbol{X}}}+ {\boldsymbol{S}}^{-1}\right) {\textcolor{params}{\boldsymbol{w}}}- \frac 2 {\sigma^2}{\textcolor{params}{\boldsymbol{w}}}^\top {\textcolor{input}{\boldsymbol{X}}}^\top {\textcolor{output}{\boldsymbol{y}}}\right)\right) \end{aligned} \]
From the previous slide,
\[ p({\textcolor{params}{\boldsymbol{w}}}\mid {\textcolor{output}{\boldsymbol{y}}}, {\textcolor{input}{\boldsymbol{X}}}) \propto \exp\left(-\frac 1 2 \left( {\textcolor{params}{\boldsymbol{w}}}^\top {\boldsymbol{\Sigma}}^{-1} {\textcolor{params}{\boldsymbol{w}}}- 2 {\textcolor{params}{\boldsymbol{w}}}^\top {\boldsymbol{\Sigma}}^{-1} {\boldsymbol{\mu}}\right)\right) \]
We can identify the posterior mean and covariance
Posterior covariance \[ {\boldsymbol{\Sigma}}= \left(\frac 1 {\sigma^2} {\textcolor{input}{\boldsymbol{X}}}^\top {\textcolor{input}{\boldsymbol{X}}}+ {\boldsymbol{S}}^{-1}\right)^{-1} \]
Posterior mean \[ {\boldsymbol{\mu}}= \frac 1 {\sigma^2} {\boldsymbol{\Sigma}}{\textcolor{input}{\boldsymbol{X}}}^\top {\textcolor{output}{\boldsymbol{y}}} \]
We can now compute the posterior for the linear regression model
\[ p({\textcolor{params}{\boldsymbol{w}}}\mid {\textcolor{output}{\boldsymbol{y}}}, {\textcolor{input}{\boldsymbol{X}}}) = {\mathcal{N}}({\textcolor{params}{\boldsymbol{w}}}\mid {\boldsymbol{\mu}}, {\boldsymbol{\Sigma}})\quad \text{with} \quad {\boldsymbol{\mu}}= \frac 1 {\sigma^2} {\boldsymbol{\Sigma}}{\textcolor{input}{\boldsymbol{X}}}^\top {\textcolor{output}{\boldsymbol{y}}}\quad \text{and} \quad {\boldsymbol{\Sigma}}= \left(\frac 1 {\sigma^2} {\textcolor{input}{\boldsymbol{X}}}^\top {\textcolor{input}{\boldsymbol{X}}}+ {\boldsymbol{S}}^{-1}\right)^{-1} \]
Figure 9
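These formulas translate directly into NumPy. A minimal sketch, assuming the noise variance \(\sigma^2\) and the prior covariance \({\boldsymbol{S}}\) are given; for larger problems one would factorize (e.g. Cholesky) rather than invert:

```python
import numpy as np

def posterior(X, y, sigma2, S):
    """Posterior N(w | mu, Sigma) for Bayesian linear regression."""
    Sigma = np.linalg.inv(X.T @ X / sigma2 + np.linalg.inv(S))   # posterior covariance
    mu = Sigma @ X.T @ y / sigma2                                # posterior mean
    return mu, Sigma

# e.g. mu, Sigma = posterior(X, y, sigma2=0.1**2, S=np.eye(X.shape[1]))
```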
The posterior distribution \(p({\textcolor{params}{\boldsymbol{w}}}\mid {\textcolor{output}{\boldsymbol{y}}}, {\textcolor{input}{\boldsymbol{X}}})\) gives us the uncertainty about the parameters
We can visualize the uncertainty by sampling from the posterior and plotting the predictions
Figure 10: Predictions from the Bayesian linear regression model
The predictive distribution is the distribution of the target variable \(y_\star\) given the input \({\textcolor{input}{\boldsymbol{x}}}_\star\)
Obtained by marginalizing the parameters using the posterior
\[ p(y_\star \mid {\textcolor{input}{\boldsymbol{x}}}_\star, {\textcolor{output}{\boldsymbol{y}}}, {\textcolor{input}{\boldsymbol{X}}}) = \int p(y_\star \mid {\textcolor{params}{\boldsymbol{w}}}, {\textcolor{input}{\boldsymbol{x}}}_\star ) p({\textcolor{params}{\boldsymbol{w}}}\mid {\textcolor{output}{\boldsymbol{y}}}, {\textcolor{input}{\boldsymbol{X}}}) \dd {\textcolor{params}{\boldsymbol{w}}} \]
For linear regression, the predictive distribution is Gaussian
\[ p(y_\star \mid {\textcolor{input}{\boldsymbol{x}}}_\star, {\textcolor{output}{\boldsymbol{y}}}, {\textcolor{input}{\boldsymbol{X}}}) = {\mathcal{N}}(y_\star \mid {\boldsymbol{\mu}}^\top {\textcolor{input}{\boldsymbol{x}}}_\star, {\textcolor{input}{\boldsymbol{x}}}_\star^\top {\boldsymbol{\Sigma}}{\textcolor{input}{\boldsymbol{x}}}_\star + \sigma^2) \]
Figure 11: Predictive distribution from the Bayesian linear regression model
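A sketch of the predictive mean and variance at a test input, given a posterior mean and covariance; the numbers below are illustrative, not taken from the figures:

```python
import numpy as np

def predictive(x_star, mu, Sigma, sigma2):
    """Predictive N(y* | mu^T x*, x*^T Sigma x* + sigma^2)."""
    mean = mu @ x_star
    var = x_star @ Sigma @ x_star + sigma2     # parameter uncertainty + observation noise
    return mean, var

# Illustrative 2-D posterior (bias + slope)
mu = np.array([0.2, 1.1])
Sigma = np.array([[0.05, 0.00], [0.00, 0.02]])
print(predictive(np.array([1.0, 0.5]), mu, Sigma, sigma2=0.1))
```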
The same approach can be used with non-linear basis functions
Transform the input \({\textcolor{input}{\boldsymbol{x}}}\) using a non-linear function \(\varphi({\textcolor{input}{\boldsymbol{x}}})\)
\[ {\textcolor{input}{\boldsymbol{x}}}\to \varphi({\textcolor{input}{\boldsymbol{x}}}) = [\varphi_1({\textcolor{input}{\boldsymbol{x}}}), \ldots, \varphi_D({\textcolor{input}{\boldsymbol{x}}})]^\top \]
For convenience, define \({\textcolor{input}{\boldsymbol{\Phi}}}= [\varphi({\textcolor{input}{\boldsymbol{x}}}_1), \ldots, \varphi({\textcolor{input}{\boldsymbol{x}}}_N)]^\top\)
Posterior
\[ p({\textcolor{params}{\boldsymbol{w}}}\mid {\textcolor{output}{\boldsymbol{y}}}, {\textcolor{input}{\boldsymbol{\Phi}}}) = {\mathcal{N}}({\textcolor{params}{\boldsymbol{w}}}\mid {\boldsymbol{\mu}}, {\boldsymbol{\Sigma}}) \quad \text{with} \quad {\boldsymbol{\mu}}= \frac 1 {\sigma^2} {\boldsymbol{\Sigma}}{\textcolor{input}{\boldsymbol{\Phi}}}^\top {\textcolor{output}{\boldsymbol{y}}}\quad \text{and} \quad {\boldsymbol{\Sigma}}= \left(\frac 1 {\sigma^2} {\textcolor{input}{\boldsymbol{\Phi}}}^\top {\textcolor{input}{\boldsymbol{\Phi}}}+ {\boldsymbol{S}}^{-1}\right)^{-1} \]
Predictive distribution
\[ p(y_\star \mid {\textcolor{input}{\boldsymbol{x}}}_\star, {\textcolor{output}{\boldsymbol{y}}}, {\textcolor{input}{\boldsymbol{\Phi}}}) = {\mathcal{N}}(y_\star \mid {\boldsymbol{\mu}}^\top \varphi({\textcolor{input}{\boldsymbol{x}}}_\star), \varphi({\textcolor{input}{\boldsymbol{x}}}_\star)^\top {\boldsymbol{\Sigma}}\varphi({\textcolor{input}{\boldsymbol{x}}}_\star) + \sigma^2) \]
Use polynomial basis functions \(\varphi_i({\textcolor{input}{\boldsymbol{x}}}) = \textcolor{input}{x}^i\): \(f({\textcolor{params}{\boldsymbol{w}}}, {\textcolor{input}{\boldsymbol{x}}}) = \sum_{i=0}^K \textcolor{params}{w}_i \textcolor{input}{x}^i\)
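A sketch of Bayesian polynomial regression: build the feature matrix \({\textcolor{input}{\boldsymbol{\Phi}}}\) and reuse the posterior formulas with \({\textcolor{input}{\boldsymbol{\Phi}}}\) in place of \({\textcolor{input}{\boldsymbol{X}}}\). The degree, noise level, and data are illustrative assumptions:

```python
import numpy as np

def poly_features(x, K):
    """Map scalar inputs to [1, x, x^2, ..., x^K]."""
    return np.vander(x, N=K + 1, increasing=True)

rng = np.random.default_rng(3)
x = rng.uniform(-1.0, 1.0, size=40)
y = np.sin(3.0 * x) + rng.normal(0.0, 0.1, size=40)

K, sigma2, sigma_w2 = 5, 0.1 ** 2, 1.0
Phi = poly_features(x, K)                                        # N x (K+1) design matrix
Sigma = np.linalg.inv(Phi.T @ Phi / sigma2 + np.eye(K + 1) / sigma_w2)
mu = Sigma @ Phi.T @ y / sigma2                                  # posterior mean of the weights
```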
Maximum a posteriori estimation (MAP) computes the mode of the posterior distribution \(\arg\max p({\textcolor{params}{\boldsymbol{w}}}\mid {\textcolor{output}{\boldsymbol{y}}}, {\textcolor{input}{\boldsymbol{X}}})\). For Gaussians, it is the same as the mean
\[ \arg\min {\mathcal{L}}({\textcolor{params}{\boldsymbol{w}}}) = \frac{1}{2} \norm{{\textcolor{output}{\boldsymbol{y}}}- {\textcolor{input}{\boldsymbol{X}}}{\textcolor{params}{\boldsymbol{w}}}}_2^2 + \frac{\lambda}{2} \norm{{\textcolor{params}{\boldsymbol{w}}}}_2^2 \]
\[ \arg\max p({\textcolor{params}{\boldsymbol{w}}}\mid {\textcolor{output}{\boldsymbol{y}}}, {\textcolor{input}{\boldsymbol{X}}}) = {\mathcal{N}}({\textcolor{params}{\boldsymbol{w}}}\mid {\boldsymbol{\mu}}, {\boldsymbol{\Sigma}}) \]
If \(\lambda = \sigma_{\textcolor{output}{\boldsymbol{y}}}^2 / \sigma_{\textcolor{params}{\boldsymbol{w}}}^2\), the ridge regression solution is equivalent to the MAP solution with a Gaussian prior
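This equivalence is easy to verify numerically; the synthetic data and variances below are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(-1.0, 1.0, size=20)
X = np.column_stack([np.ones_like(x), x])
y = X @ np.array([0.3, 1.2]) + rng.normal(0.0, 0.2, size=20)

sigma_y2, sigma_w2 = 0.2 ** 2, 0.5
lam = sigma_y2 / sigma_w2

w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)    # ridge solution
Sigma = np.linalg.inv(X.T @ X / sigma_y2 + np.eye(2) / sigma_w2)
w_map = Sigma @ X.T @ y / sigma_y2                               # posterior mean = MAP
print(np.allclose(w_ridge, w_map))                               # True
```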
Question: How do we choose the prior?
We can now solve the linear regression problem (Bayesian or not), but we still have to choose the model
Questions:
Suppose we choose the model based on the likelihood of the (training) data. Is this a good idea?
NO!
Consider polynomial functions of increasing degree \(f({\textcolor{params}{\boldsymbol{w}}}, {\textcolor{input}{\boldsymbol{x}}}) = \sum_{d=0}^D \textcolor{params}{w}_d \textcolor{input}{x}^d\)
Figure 15: Train and test log-likelihood as a function of polynomial degree
Figure 16: Train and test error as a function of polynomial degree
Cross-validation: Split the data into training and validation sets
Solve the model with the training set and evaluate the performance on the validation set
(Optional) Repeat the process for different splits of the data
Choose the model that performs best on the validation set
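The procedure above can be sketched as follows for choosing the polynomial degree; the number of folds, the data, and the degree range are illustrative assumptions:

```python
import numpy as np

def cv_score(x, y, degree, n_folds=5, seed=0):
    """Mean validation MSE of a polynomial fit of the given degree."""
    idx = np.random.default_rng(seed).permutation(len(x))
    folds = np.array_split(idx, n_folds)
    errors = []
    for k in range(n_folds):
        val = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        Phi_tr = np.vander(x[train], degree + 1, increasing=True)
        Phi_va = np.vander(x[val], degree + 1, increasing=True)
        w = np.linalg.lstsq(Phi_tr, y[train], rcond=None)[0]     # fit on the training split
        errors.append(np.mean((y[val] - Phi_va @ w) ** 2))       # error on the validation split
    return np.mean(errors)

rng = np.random.default_rng(5)
x = rng.uniform(-1.0, 1.0, size=60)
y = np.sin(3.0 * x) + rng.normal(0.0, 0.1, size=60)
best_degree = min(range(1, 10), key=lambda d: cv_score(x, y, d))
print(best_degree)
```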
Pros
Cons
Likelihood principle: All the information from the data is contained in the likelihood function
Consequence: You should not base your inference on data that you could have observed but did not
Cross-validation (like other frequentist methods) violates the likelihood principle
Note
\[ p({\textcolor{params}{\boldsymbol{w}}}\mid {\textcolor{output}{\boldsymbol{y}}}, {\textcolor{input}{\boldsymbol{X}}}) = \frac{p({\textcolor{output}{\boldsymbol{y}}}\mid {\textcolor{params}{\boldsymbol{w}}}, {\textcolor{input}{\boldsymbol{X}}}) p({\textcolor{params}{\boldsymbol{w}}})}{p({\textcolor{output}{\boldsymbol{y}}}\mid {\textcolor{input}{\boldsymbol{X}}})} \]
The marginal likelihood is the normalization constant of the posterior distribution
\[ p({\textcolor{output}{\boldsymbol{y}}}\mid {\textcolor{input}{\boldsymbol{X}}}) = \int p({\textcolor{output}{\boldsymbol{y}}}\mid {\textcolor{params}{\boldsymbol{w}}}, {\textcolor{input}{\boldsymbol{X}}}) p({\textcolor{params}{\boldsymbol{w}}}) \dd {\textcolor{params}{\boldsymbol{w}}} \]
We average the likelihood over all possible values of the parameters, weighted by the prior
It tells us how likely the data is under the model
Let’s be explicit about the model in the marginal likelihood
\[ p({\textcolor{output}{\boldsymbol{y}}}\mid {\textcolor{input}{\boldsymbol{X}}}, m) = \int p({\textcolor{output}{\boldsymbol{y}}}\mid {\textcolor{params}{\boldsymbol{w}}}, {\textcolor{input}{\boldsymbol{X}}}, m) p({\textcolor{params}{\boldsymbol{w}}}\mid m) \dd {\textcolor{params}{\boldsymbol{w}}} \]
where \(m\) is the model (e.g. polynomial degree) and \(p({\textcolor{params}{\boldsymbol{w}}}\mid m)\) is the prior over the parameters for the model \(m\).
\[ \widehat{m} = \arg\max_m p({\textcolor{output}{\boldsymbol{y}}}\mid {\textcolor{input}{\boldsymbol{X}}}, m) \]
This is also known as Type II maximum likelihood or evidence maximization
Given: a Gaussian likelihood \(p({\textcolor{output}{\boldsymbol{y}}}\mid {\textcolor{params}{\boldsymbol{w}}}, {\textcolor{input}{\boldsymbol{X}}}) = {\mathcal{N}}({\textcolor{output}{\boldsymbol{y}}}\mid {\textcolor{input}{\boldsymbol{X}}}{\textcolor{params}{\boldsymbol{w}}}, \sigma_{\textcolor{output}{\boldsymbol{y}}}^2 {\boldsymbol{I}})\) and a Gaussian prior \(p({\textcolor{params}{\boldsymbol{w}}}) = {\mathcal{N}}({\textcolor{params}{\boldsymbol{w}}}\mid {\boldsymbol{\mu}}_p, {\boldsymbol{\Sigma}}_p)\)
The marginal likelihood is a Gaussian
\[ p({\textcolor{output}{\boldsymbol{y}}}\mid {\textcolor{input}{\boldsymbol{X}}}) = {\mathcal{N}}({\textcolor{output}{\boldsymbol{y}}}\mid {\textcolor{input}{\boldsymbol{X}}}{\boldsymbol{\mu}}_p, {\textcolor{input}{\boldsymbol{X}}}{\boldsymbol{\Sigma}}_p {\textcolor{input}{\boldsymbol{X}}}^\top + \sigma_{\textcolor{output}{\boldsymbol{y}}}^2 {\boldsymbol{I}}) \]
It does NOT depend on the parameters \({\textcolor{params}{\boldsymbol{w}}}\) but only on the model!
Exercise
Write the expression if \({\boldsymbol{\Sigma}}_p = \sigma_{\textcolor{params}{\boldsymbol{w}}}^2 {\boldsymbol{I}}\) and \({\boldsymbol{\mu}}_p = {\boldsymbol{0}}\)
Tip
Remember that when we compute PDFs, in practice we work with the log of the PDFs to avoid numerical issues.
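A sketch of the log marginal likelihood in the general \(({\boldsymbol{\mu}}_p, {\boldsymbol{\Sigma}}_p)\) case using SciPy; to answer the exercise, plug in \({\boldsymbol{\Sigma}}_p = \sigma_{\textcolor{params}{\boldsymbol{w}}}^2 {\boldsymbol{I}}\) and \({\boldsymbol{\mu}}_p = {\boldsymbol{0}}\). The function name is an assumption; for large \(N\), one would typically use a Cholesky factorization of the covariance instead of the generic `logpdf`:

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_marginal_likelihood(X, y, mu_p, Sigma_p, sigma_y2):
    """log N(y | X mu_p, X Sigma_p X^T + sigma_y^2 I), evaluated in log space."""
    mean = X @ mu_p
    cov = X @ Sigma_p @ X.T + sigma_y2 * np.eye(len(y))
    return multivariate_normal.logpdf(y, mean=mean, cov=cov)

# Model selection: evaluate this for each candidate model m (e.g. each polynomial
# degree, with X replaced by the corresponding feature matrix Phi) and pick the best.
```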
Using the marginal likelihood to select the best polynomial degree
Figure 17: Marginal likelihood as a function of polynomial degree
Does \(p({\textcolor{output}{\boldsymbol{y}}}\mid {\textcolor{input}{\boldsymbol{X}}}, m)\) favor complex models? No!
This is known as the Bayesian Occam’s razor: the simplest model that explains the data is the best
Intuition
Apply the chain rule to the marginal likelihood (dropping the conditioning on \({\textcolor{input}{\boldsymbol{X}}}\) and \(m\) for brevity): \[ p({\textcolor{output}{\boldsymbol{y}}}) = p(\textcolor{output}{y}_1) p(\textcolor{output}{y}_2 \mid \textcolor{output}{y}_1) p(\textcolor{output}{y}_3 \mid \textcolor{output}{y}_1, \textcolor{output}{y}_2) \ldots p(\textcolor{output}{y}_N \mid \textcolor{output}{y}_1, \ldots, \textcolor{output}{y}_{N-1}) \]

Simone Rossi - Advanced Statistical Inference - EURECOM