Advanced Statistical Inference
EURECOM
\[ \require{physics} \definecolor{input}{rgb}{0.42, 0.55, 0.74} \definecolor{params}{rgb}{0.51,0.70,0.40} \definecolor{output}{rgb}{0.843, 0.608, 0} \definecolor{vparams}{rgb}{0.58, 0, 0.83} \definecolor{noise}{rgb}{0.0, 0.48, 0.65} \definecolor{latent}{rgb}{0.8, 0.0, 0.8} \]
Discriminative models learn the conditional distribution of the labels given the data of the form \(p({\textcolor{output}{\boldsymbol{y}}}\mid{\textcolor{input}{\boldsymbol{x}}})\).
Generative models learn the joint distribution of the data and the labels of the form \(p({\textcolor{output}{\boldsymbol{y}}}, {\textcolor{input}{\boldsymbol{x}}})\).
So far, we have focused on discriminative models: they learn to predict the labels \({\textcolor{output}{\boldsymbol{y}}}\) from the inputs \({\textcolor{input}{\boldsymbol{x}}}\).
But they cannot model the distribution of the inputs \(p({\textcolor{input}{\boldsymbol{x}}})\) or generate new data.
Generative models can do both: discriminate and generate!
While generative models are often used in unsupervised learning (e.g. images without labels, text without labels, etc.), they can also be used in supervised learning.
Before we look at more classical generative models, let’s start with a simple example of a generative model that is also supervised: Naive Bayes classification.
Naive Bayes is an example of a generative classifier.
It is based on Bayes’ theorem and assumes that the features are conditionally independent given the class.
\[ p(\textcolor{output}{y}_\star = k \vert {\textcolor{input}{\boldsymbol{x}}}_\star, {\textcolor{input}{\boldsymbol{X}}}, {\textcolor{output}{\boldsymbol{y}}}) = \frac{p({\textcolor{input}{\boldsymbol{x}}}_\star \vert \textcolor{output}{y}_\star = k, {\textcolor{input}{\boldsymbol{X}}}, {\textcolor{output}{\boldsymbol{y}}}) p(\textcolor{output}{y}_\star = k)}{\sum_{j} p({\textcolor{input}{\boldsymbol{x}}}_\star \vert \textcolor{output}{y}_\star = j, {\textcolor{input}{\boldsymbol{X}}}, {\textcolor{output}{\boldsymbol{y}}}) p(\textcolor{output}{y}_\star = j)} \]
Questions: how do we model the class-conditional likelihood \(p({\textcolor{input}{\boldsymbol{x}}}_\star \vert \textcolor{output}{y}_\star = k, {\textcolor{input}{\boldsymbol{X}}}, {\textcolor{output}{\boldsymbol{y}}})\), and how do we choose the class prior \(p(\textcolor{output}{y}_\star = k)\)?
\[ p({\textcolor{input}{\boldsymbol{x}}}_\star \vert \textcolor{output}{y}_\star = k, {\textcolor{input}{\boldsymbol{X}}}, {\textcolor{output}{\boldsymbol{y}}}) = p({\textcolor{input}{\boldsymbol{x}}}_\star \vert \textcolor{output}{y}_\star = k, \textcolor{latent}{\textcolor{latent}{\boldsymbol{f}}}({\textcolor{input}{\boldsymbol{X}}}, {\textcolor{output}{\boldsymbol{y}}})) \]
How likely is the new data \({\textcolor{input}{\boldsymbol{x}}}_\star\) given the class \(k\) and the training data \({\textcolor{input}{\boldsymbol{X}}}, {\textcolor{output}{\boldsymbol{y}}}\)?
The function \(\textcolor{latent}{\textcolor{latent}{\boldsymbol{f}}}({\textcolor{input}{\boldsymbol{X}}}, {\textcolor{output}{\boldsymbol{y}}})\) encodes the likelihood parameters estimated from the training data.
We are free to choose the form of the likelihood, but it needs to be class-conditional.
Common choice: a Gaussian distribution for each class.
The training data of class \(k\) are used to estimate the parameters of that class’s Gaussian (mean and variance).
Naive Bayes makes an additional assumption on the likelihood:
\[ p({\textcolor{input}{\boldsymbol{x}}}_\star \vert \textcolor{output}{y}_\star = k, \textcolor{latent}{\textcolor{latent}{\boldsymbol{f}}}({\textcolor{input}{\boldsymbol{X}}}, {\textcolor{output}{\boldsymbol{y}}})) = \prod_{d=1}^D p(\textcolor{input}x_{\star d} \vert \textcolor{output}{y}_\star = k, \textcolor{latent}{\textcolor{latent}{\boldsymbol{f}}}_d({\textcolor{input}{\boldsymbol{X}}}, {\textcolor{output}{\boldsymbol{y}}})) \]
The likelihood parameters \(\textcolor{latent}{\textcolor{latent}{\boldsymbol{f}}}_d({\textcolor{input}{\boldsymbol{X}}}, {\textcolor{output}{\boldsymbol{y}}})\) are specific to each feature \(d\).
In the Gaussian case, each feature \(d\) is modeled with its own univariate Gaussian per class, with mean \(\textcolor{latent}\mu_{kd}\) and variance \(\textcolor{latent}\sigma_{kd}^2\).
The class prior \[ p(\textcolor{output}{y}_\star = k) \] encodes how probable each class is before observing \({\textcolor{input}{\boldsymbol{x}}}_\star\).
Examples: a uniform prior \(1/K\), or the empirical class frequencies \(N_k/N\).
\(\textcolor{latent}{\textcolor{latent}{\boldsymbol{f}}}({\textcolor{input}{\boldsymbol{X}}}, {\textcolor{output}{\boldsymbol{y}}})\) needs to estimate a mean and a variance for each class and each feature.
Empirical mean: \[ \textcolor{latent}\mu_{kd} = \frac{1}{N_k} \sum_{i=1}^N \textcolor{input}x_{id} \mathbb{I}(\textcolor{output}y_i = k) \]
Empirical variance: \[ \textcolor{latent}\sigma_{kd}^2 = \frac{1}{N_k} \sum_{i=1}^N (\textcolor{input}x_{id} - \textcolor{latent}\mu_{kd})^2 \mathbb{I}(\textcolor{output}y_i = k) \]
\[ p({\textcolor{input}{\boldsymbol{x}}}_\star \vert \textcolor{output}{y}_\star = k, {\textcolor{input}{\boldsymbol{X}}}, {\textcolor{output}{\boldsymbol{y}}}) = \prod_{d=1}^D {\mathcal{N}}(\textcolor{input}x_{\star d} \vert \textcolor{latent}\mu_{kd}, \textcolor{latent}\sigma_{kd}^2) \]
\[ p(\textcolor{output}{y}_\star = k \vert {\textcolor{input}{\boldsymbol{x}}}_\star, {\textcolor{input}{\boldsymbol{X}}}, {\textcolor{output}{\boldsymbol{y}}}) = \frac{p({\textcolor{input}{\boldsymbol{x}}}_\star \vert \textcolor{output}{y}_\star = k, {\textcolor{input}{\boldsymbol{X}}}, {\textcolor{output}{\boldsymbol{y}}}) p(\textcolor{output}{y}_\star = k)}{\sum_{j} p({\textcolor{input}{\boldsymbol{x}}}_\star \vert \textcolor{output}{y}_\star = j, {\textcolor{input}{\boldsymbol{X}}}, {\textcolor{output}{\boldsymbol{y}}}) p(\textcolor{output}{y}_\star = j)} \]
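As a sketch of the full recipe above (assuming a NumPy array `X` of shape \(N \times D\) and integer labels `y`; the function names and the small variance floor `eps` are implementation choices, not part of the lecture):

```python
import numpy as np

def fit_gaussian_nb(X, y, eps=1e-9):
    """Estimate the class priors and, per class, a mean and a variance for each feature."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == k) for k in classes])                  # p(y = k)
    means = np.array([X[y == k].mean(axis=0) for k in classes])            # mu_{kd}
    variances = np.array([X[y == k].var(axis=0) + eps for k in classes])   # sigma^2_{kd}
    return classes, priors, means, variances

def predict_proba(X_star, priors, means, variances):
    """Posterior p(y* = k | x*) via Bayes' theorem with the factorized Gaussian likelihood."""
    # log N(x_d | mu_kd, sigma^2_kd), summed over features d (the "naive" independence assumption)
    log_lik = -0.5 * (np.log(2 * np.pi * variances)[None, :, :]
                      + (X_star[:, None, :] - means[None, :, :]) ** 2
                      / variances[None, :, :]).sum(axis=-1)
    log_joint = log_lik + np.log(priors)[None, :]        # log p(x*, y* = k)
    log_joint -= log_joint.max(axis=1, keepdims=True)    # stabilize before exponentiating
    post = np.exp(log_joint)
    return post / post.sum(axis=1, keepdims=True)        # normalize over classes
```

Calling `predict_proba(x_star[None, :], priors, means, variances)` returns the posterior over classes for a single new point.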
Naive Bayes is a generative classifier based on Bayes’ theorem.
Advantages: simple and fast to train (closed-form parameter estimates), and it scales well to many features.
Disadvantages: the conditional-independence assumption rarely holds in practice, so the predicted probabilities can be poorly calibrated.
Generative models:
Given training data from an unknown distribution \(p_\text{data}({\textcolor{input}{\boldsymbol{x}}})\), we want to learn a model \(p_\text{model}({\textcolor{input}{\boldsymbol{x}}})\) that approximates the true distribution.

Train from \({\textcolor{input}{\boldsymbol{x}}}\sim p_\text{data}({\textcolor{input}{\boldsymbol{x}}})\).

Generate from \({\textcolor{input}{\boldsymbol{x}}}\sim p_\text{model}({\textcolor{input}{\boldsymbol{x}}})\).
Our objective is to learn the data distribution \(p({\textcolor{input}{\boldsymbol{x}}})\), but we suppose that each data point \({\textcolor{input}{\boldsymbol{x}}}\) is associated with a latent variable \({\textcolor{latent}{\boldsymbol{z}}}\).
For example, imagine we want to learn the distribution of images of objects, like the one below:

Each image is a huge vector of pixels \({\textcolor{input}{\boldsymbol{x}}}\in \mathbb{R}^{H \times W \times C}\), where \(H\) is the height, \(W\) is the width and \(C\) is the number of channels (in this case, \(64 \times 64 \times 3\)).
Each image can be described by a set of 6 latent variables: floor color, wall color, object color, shape, orientation and scale.
Latent variables are unobserved: we cannot measure them directly, but we can infer them from the observed data.
In math terms, we model our data distribution as a marginal of the joint distribution of the observed data and the latent variables:
\[ p({\textcolor{input}{\boldsymbol{x}}}) = \int p({\textcolor{input}{\boldsymbol{x}}}, {\textcolor{latent}{\boldsymbol{z}}}) \dd{\textcolor{latent}{\boldsymbol{z}}}\quad \text{or} \quad p({\textcolor{input}{\boldsymbol{x}}}) = \sum_{i} p({\textcolor{input}{\boldsymbol{x}}}, {\textcolor{latent}{\boldsymbol{z}}}_i) \]
where \(p({\textcolor{input}{\boldsymbol{x}}}, {\textcolor{latent}{\boldsymbol{z}}})\) is the joint distribution of the observed data and the latent variables.
In mixture models, we assume that the data is generated from a mixture of several distributions
\[ p({\textcolor{input}{\boldsymbol{x}}}) = \sum_i p({\textcolor{input}{\boldsymbol{x}}}, {\textcolor{latent}{\boldsymbol{z}}}_i) = \sum_i p({\textcolor{input}{\boldsymbol{x}}}\mid{\textcolor{latent}{\boldsymbol{z}}}_i)p({\textcolor{latent}{\boldsymbol{z}}}_i) \]
where \(p({\textcolor{latent}{\boldsymbol{z}}})\) is the mixture distribution and \(p({\textcolor{input}{\boldsymbol{x}}}\mid{\textcolor{latent}{\boldsymbol{z}}})\) is the conditional distribution of the data given the latent variables.
1-hot encoding of the discrete latent variable \(\textcolor{latent}{z}_k \in \{0,1\}\), with prior: \[ p(\textcolor{latent}{z}_k=1) = \pi_k \quad \text{and} \quad \sum_k \pi_k = 1 \]
The conditional distribution of the data given the latent variable is: \[ p({\textcolor{input}{\boldsymbol{x}}}\mid\textcolor{latent}{z}_k=1) = {\mathcal{N}}({\textcolor{input}{\boldsymbol{x}}}; {\boldsymbol{\mu}}_k, {\boldsymbol{\Sigma}}_k) \]
The joint distribution of the data and the latent variable is: \[ p({\textcolor{input}{\boldsymbol{x}}}, \textcolor{latent}{z}_k=1) = p({\textcolor{input}{\boldsymbol{x}}}\mid\textcolor{latent}{z}_k=1)p(\textcolor{latent}{z}_k=1) = \pi_k{\mathcal{N}}({\textcolor{input}{\boldsymbol{x}}}; {\boldsymbol{\mu}}_k, {\boldsymbol{\Sigma}}_k) \]
The marginal distribution of the data is: \[ p({\textcolor{input}{\boldsymbol{x}}}) = \sum_k p({\textcolor{input}{\boldsymbol{x}}}, \textcolor{latent}{z}_k=1) = \sum_k \pi_k{\mathcal{N}}({\textcolor{input}{\boldsymbol{x}}}; {\boldsymbol{\mu}}_k, {\boldsymbol{\Sigma}}_k) \]
We are interested in the posterior distribution of a cluster assignment given the data:
\[ \begin{aligned} p(\textcolor{latent}{z}_k=1\mid{\textcolor{input}{\boldsymbol{x}}}) &= \frac{p({\textcolor{input}{\boldsymbol{x}}}\mid\textcolor{latent}{z}_k=1)p(\textcolor{latent}{z}_k=1)}{p({\textcolor{input}{\boldsymbol{x}}})} \\ &= \frac{p({\textcolor{input}{\boldsymbol{x}}}\mid\textcolor{latent}{z}_k=1)p(\textcolor{latent}{z}_k=1)}{\sum_{j} p({\textcolor{input}{\boldsymbol{x}}}\mid\textcolor{latent}{z}_j=1)p(\textcolor{latent}{z}_j=1)} \\ &= \frac{\pi_k{\mathcal{N}}({\textcolor{input}{\boldsymbol{x}}}; {\boldsymbol{\mu}}_k, {\boldsymbol{\Sigma}}_k)}{\sum_{j} \pi_j{\mathcal{N}}({\textcolor{input}{\boldsymbol{x}}}; {\boldsymbol{\mu}}_j, {\boldsymbol{\Sigma}}_j)} := \gamma(\textcolor{latent}{z}_k) \end{aligned} \]
\(\gamma(\textcolor{latent}{z}_k)\) is the responsibility of cluster \(k\) for data point \({\textcolor{input}{\boldsymbol{x}}}\).
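As a small numerical illustration of the responsibilities (a minimal sketch; the parameter values and the function name `gmm_responsibilities` are made up for the example):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_responsibilities(x, pis, mus, Sigmas):
    """gamma(z_k) = pi_k N(x; mu_k, Sigma_k) / sum_j pi_j N(x; mu_j, Sigma_j)."""
    joint = np.array([pi * multivariate_normal.pdf(x, mean=mu, cov=Sigma)
                      for pi, mu, Sigma in zip(pis, mus, Sigmas)])
    return joint / joint.sum()   # the denominator is the marginal p(x)

# Two 2-D components with equal weights (illustrative values)
pis = [0.5, 0.5]
mus = [np.zeros(2), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), np.eye(2)]
print(gmm_responsibilities(np.array([2.0, 2.0]), pis, mus, Sigmas))   # mostly assigned to the second cluster
```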
Given a dataset \({\textcolor{input}{\boldsymbol{X}}}= \{{\textcolor{input}{\boldsymbol{x}}}_1, \ldots, {\textcolor{input}{\boldsymbol{x}}}_N\}\) of i.i.d. samples, the log-likelihood of the data is:
\[ \begin{aligned} \log p({\textcolor{input}{\boldsymbol{X}}}\mid {\boldsymbol{\pi}}, \{{\boldsymbol{\mu}}_k\}, \{{\boldsymbol{\Sigma}}_k\}) &= \log \prod_{i=1}^N p({\textcolor{input}{\boldsymbol{x}}}_i) \\ &= \sum_{i=1}^N \log p({\textcolor{input}{\boldsymbol{x}}}_i\mid {\boldsymbol{\pi}}, \{{\boldsymbol{\mu}}_k\}, \{{\boldsymbol{\Sigma}}_k\}) \\ &= \sum_{i=1}^N \log \sum_{k=1}^K p({\textcolor{input}{\boldsymbol{x}}}_i\mid\textcolor{latent}{z}_k=1)p(\textcolor{latent}{z}_k=1) \\ &= \sum_{i=1}^N \log \sum_{k=1}^K \pi_k{\mathcal{N}}({\textcolor{input}{\boldsymbol{x}}}_i; {\boldsymbol{\mu}}_k, {\boldsymbol{\Sigma}}_k) \\ \end{aligned} \]
Warning
We cannot simplify it further, because the log and the sum do not commute. How can we optimize it?
We need to maximize the log-likelihood of the data w.r.t. the parameters \({\boldsymbol{\pi}}\), \(\{{\boldsymbol{\mu}}_k\}\) and \(\{{\boldsymbol{\Sigma}}_k\}\), for \(k=1,\ldots,K\)
\[ \log p({\textcolor{input}{\boldsymbol{X}}}) = \sum_{i=1}^N \log \sum_{k=1}^K \pi_k{\mathcal{N}}({\textcolor{input}{\boldsymbol{x}}}_i; {\boldsymbol{\mu}}_k, {\boldsymbol{\Sigma}}_k) \]
The problem is not convex.
No closed-form solution! Stationary points depend on the posterior \(\gamma(\textcolor{latent}{z}_{nk})\) for each data point \({\textcolor{input}{\boldsymbol{x}}}_n\).
… but we will see that we can write \(\pi_k\), \({\boldsymbol{\mu}}_k\) and \({\boldsymbol{\Sigma}}_k\) in terms of \(\gamma(\textcolor{latent}{z}_{nk})\).
We can find a local maximum with an iterative algorithm: alternate between updating the expected posterior \(\gamma(\textcolor{latent}{z}_{nk})\) and maximizing over the parameters \(\pi_k\), \({\boldsymbol{\mu}}_k\) and \({\boldsymbol{\Sigma}}_k\).
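Such an iterative scheme needs to evaluate the log-likelihood above to monitor convergence. A numerically stable way to do it (a sketch, not part of the lecture material) is to work in log-space with the log-sum-exp trick:

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, pis, mus, Sigmas):
    """sum_n log sum_k pi_k N(x_n; mu_k, Sigma_k), computed in log-space for stability."""
    # log pi_k + log N(x_n; mu_k, Sigma_k), arranged as an (N, K) array
    log_terms = np.stack([np.log(pi) + multivariate_normal.logpdf(X, mean=mu, cov=Sigma)
                          for pi, mu, Sigma in zip(pis, mus, Sigmas)], axis=1)
    return logsumexp(log_terms, axis=1).sum()
```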
Some useful results for Gaussian distributions
Assume \(p({\textcolor{input}{\boldsymbol{x}}}) = {\mathcal{N}}({\textcolor{input}{\boldsymbol{x}}}; {\boldsymbol{\mu}}, {\boldsymbol{\Sigma}}) = (2\pi)^{-\frac{d}{2}}\det({\boldsymbol{\Sigma}})^{-\frac{1}{2}}\exp\left(-\frac{1}{2}({\textcolor{input}{\boldsymbol{x}}}- {\boldsymbol{\mu}})^\top{\boldsymbol{\Sigma}}^{-1}({\textcolor{input}{\boldsymbol{x}}}- {\boldsymbol{\mu}})\right)\).
Then,
\[ \frac{\partial p({\textcolor{input}{\boldsymbol{x}}})}{\partial {\boldsymbol{\mu}}} = {\mathcal{N}}({\textcolor{input}{\boldsymbol{x}}}; {\boldsymbol{\mu}}, {\boldsymbol{\Sigma}})({\textcolor{input}{\boldsymbol{x}}}- {\boldsymbol{\mu}}){\boldsymbol{\Sigma}}^{-1} \]
and
\[ \frac{\partial p({\textcolor{input}{\boldsymbol{x}}})}{\partial {\boldsymbol{\Sigma}}} = -\frac{1}{2}{\mathcal{N}}({\textcolor{input}{\boldsymbol{x}}}; {\boldsymbol{\mu}}, {\boldsymbol{\Sigma}})\left({\boldsymbol{\Sigma}}^{-1} - {\boldsymbol{\Sigma}}^{-1}({\textcolor{input}{\boldsymbol{x}}}- {\boldsymbol{\mu}})({\textcolor{input}{\boldsymbol{x}}}- {\boldsymbol{\mu}})^\top{\boldsymbol{\Sigma}}^{-1} \right) \]
\[ \begin{aligned} \arg\max_{{\boldsymbol{\mu}}_k} p({\textcolor{input}{\boldsymbol{X}}}\mid {\boldsymbol{\pi}}, \{{\boldsymbol{\mu}}_j\}, \{{\boldsymbol{\Sigma}}_j\}) &= \arg\max_{{\boldsymbol{\mu}}_k} \sum_{n=1}^N \log p({\textcolor{input}{\boldsymbol{x}}}_n\mid {\boldsymbol{\pi}}, \{{\boldsymbol{\mu}}_j\}, \{{\boldsymbol{\Sigma}}_j\}) \end{aligned} \]
Compute the partial derivative w.r.t. \({\boldsymbol{\mu}}_k\):
\[ \begin{aligned} \frac{\partial}{\partial {\boldsymbol{\mu}}_k} \log p({\textcolor{input}{\boldsymbol{X}}}\mid {\boldsymbol{\pi}}, \{{\boldsymbol{\mu}}_k\}, \{{\boldsymbol{\Sigma}}_k\}) &= \sum_{n=1}^N \frac{1}{p({\textcolor{input}{\boldsymbol{x}}}_n\mid {\boldsymbol{\pi}}, \{{\boldsymbol{\mu}}_k\}, \{{\boldsymbol{\Sigma}}_k\})}\frac{\partial p({\textcolor{input}{\boldsymbol{x}}}_n\mid {\boldsymbol{\pi}}, \{{\boldsymbol{\mu}}_j\}, \{{\boldsymbol{\Sigma}}_j\})}{\partial {\boldsymbol{\mu}}_k} \\ &= \sum_{n=1}^N \frac{1}{p({\textcolor{input}{\boldsymbol{x}}}_n\mid {\boldsymbol{\pi}}, \{{\boldsymbol{\mu}}_k\}, \{{\boldsymbol{\Sigma}}_k\})}\frac{\partial \pi_k{\mathcal{N}}({\textcolor{input}{\boldsymbol{x}}}_n; {\boldsymbol{\mu}}_k, {\boldsymbol{\Sigma}}_k)}{\partial {\boldsymbol{\mu}}_k} \\ &= \sum_{n=1}^N \frac{\pi_k{\mathcal{N}}({\textcolor{input}{\boldsymbol{x}}}_n; {\boldsymbol{\mu}}_k, {\boldsymbol{\Sigma}}_k)}{\sum_{j=1}^K \pi_j{\mathcal{N}}({\textcolor{input}{\boldsymbol{x}}}_n; {\boldsymbol{\mu}}_j, {\boldsymbol{\Sigma}}_j)}({\textcolor{input}{\boldsymbol{x}}}_n - {\boldsymbol{\mu}}_k){\boldsymbol{\Sigma}}_k^{-1} \\ &= \sum_{n=1}^N \gamma(\textcolor{latent}{z}_{nk})({\textcolor{input}{\boldsymbol{x}}}_n - {\boldsymbol{\mu}}_k){\boldsymbol{\Sigma}}_k^{-1} = 0 \quad \Rightarrow {\boldsymbol{\mu}}_k = \frac{\sum_{n=1}^N \gamma(\textcolor{latent}{z}_{nk}){\textcolor{input}{\boldsymbol{x}}}_n}{\sum_{n=1}^N \gamma(\textcolor{latent}{z}_{nk})} \end{aligned} \]
Same for the covariance: \[ {\boldsymbol{\Sigma}}_k = \frac{\sum_{n=1}^N \gamma(\textcolor{latent}{z}_{nk})({\textcolor{input}{\boldsymbol{x}}}_n - {\boldsymbol{\mu}}_k)({\textcolor{input}{\boldsymbol{x}}}_n - {\boldsymbol{\mu}}_k)^\top}{\sum_{n=1}^N \gamma(\textcolor{latent}{z}_{nk})} \]
Same for the weights \({\boldsymbol{\pi}}\), but we have to use the constraint \(\sum_{k=1}^K \pi_k = 1\) (via a Lagrange multiplier), which gives \(\pi_k = \frac{1}{N}\sum_{n=1}^N \gamma(\textcolor{latent}{z}_{nk})\).
Algorithm (EM for Gaussian mixtures):
1. Initialize \(\pi_k\), \({\boldsymbol{\mu}}_k\), \({\boldsymbol{\Sigma}}_k\).
2. E-step: compute the responsibilities \(\gamma(\textcolor{latent}{z}_{nk})\) with the current parameters.
3. M-step: re-estimate \(\pi_k\), \({\boldsymbol{\mu}}_k\), \({\boldsymbol{\Sigma}}_k\) from the responsibilities.
4. Evaluate the log-likelihood and go back to step 2 until convergence.
Important: the log-likelihood is not convex, so EM only converges to a local maximum and the result depends on the initialization; each iteration, however, is guaranteed not to decrease the log-likelihood.
After the EM algorithm converges, we can use the learned parameters to: assign data points to clusters via the responsibilities, evaluate the density \(p({\textcolor{input}{\boldsymbol{x}}})\) of new points, and sample new data from the mixture. A minimal EM loop is sketched below.
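A sketch under simple assumptions (random initialization from data points, a fixed number of iterations, and a small ridge added to the covariances for numerical stability; `em_gmm` and the array shapes are illustrative, not the lecture's notation):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=100, seed=0):
    """EM for a Gaussian mixture. X has shape (N, D)."""
    N, D = X.shape
    rng = np.random.default_rng(seed)
    pis = np.full(K, 1.0 / K)                          # uniform mixture weights
    mus = X[rng.choice(N, size=K, replace=False)]      # means picked among the data points
    Sigmas = np.stack([np.eye(D) for _ in range(K)])   # identity covariances

    for _ in range(n_iter):
        # E-step: responsibilities gamma[n, k] = pi_k N(x_n; mu_k, Sigma_k) / p(x_n)
        dens = np.stack([pis[k] * multivariate_normal.pdf(X, mean=mus[k], cov=Sigmas[k])
                         for k in range(K)], axis=1)
        gamma = dens / dens.sum(axis=1, keepdims=True)

        # M-step: re-estimate the parameters from the responsibilities
        Nk = gamma.sum(axis=0)                         # effective number of points per cluster
        mus = (gamma.T @ X) / Nk[:, None]              # weighted means
        for k in range(K):
            diff = X - mus[k]
            Sigmas[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
        pis = Nk / N

    return pis, mus, Sigmas, gamma
```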
Pros: soft (probabilistic) cluster assignments, flexible cluster shapes thanks to full covariances, and a proper density model of the data.
Cons: the number of clusters \(K\) must be chosen in advance, EM is sensitive to initialization, and components can collapse onto single points (singular covariances).
Question: Can we be Bayesian about the number of clusters \(K\)? Yes: this leads to the Dirichlet Process Mixture Model (DPMM).
Linear latent variable models are a class of generative models that assume that the data is generated from a linear combination of latent variables.
We will: first revisit classical PCA, then reinterpret it as a probabilistic (linear Gaussian) latent variable model.
Assume we have a dataset \({\textcolor{input}{\boldsymbol{X}}}= \{{\textcolor{input}{\boldsymbol{x}}}_1, \ldots, {\textcolor{input}{\boldsymbol{x}}}_N\}\), where \({\textcolor{input}{\boldsymbol{x}}}_n\in{\mathbb{R}}^{D}\).
We want to find a low-dimensional representation of the data, i.e., we want to find a mapping \({\textcolor{input}{\boldsymbol{x}}}_n \mapsto {\textcolor{latent}{\boldsymbol{z}}}_n\) such that \({\textcolor{latent}{\boldsymbol{z}}}_n\in{\mathbb{R}}^{K}\), where \(K < D\).
We choose the mapping so that the variance of the projected data is maximized.
In other words, we want to find the directions of maximum variance in the data.
Let’s start by projecting the data onto a 1-D subspace \(\textcolor{latent}{z}= {\textcolor{params}{\boldsymbol{w}}}^\top{\textcolor{input}{\boldsymbol{x}}}\), where \({\textcolor{params}{\boldsymbol{w}}}\in{\mathbb{R}}^{D}\).
We only want the direction, so we can assume that \(\norm{{\textcolor{params}{\boldsymbol{w}}}} = 1\).
Given data \({\textcolor{input}{\boldsymbol{X}}}= \{{\textcolor{input}{\boldsymbol{x}}}_1, \ldots, {\textcolor{input}{\boldsymbol{x}}}_N\}\), we maximize the variance of the projected data \({\textcolor{latent}{\boldsymbol{z}}}= {\textcolor{input}{\boldsymbol{X}}}{\textcolor{params}{\boldsymbol{w}}}\) constrained to \(\norm{{\textcolor{params}{\boldsymbol{w}}}} = 1\): \[ \arg\max_{{\textcolor{params}{\boldsymbol{w}}}} \norm{{\textcolor{latent}{\boldsymbol{z}}}}^2 \qquad \text{s.t.} \quad \norm{{\textcolor{params}{\boldsymbol{w}}}}^2 = 1 \]
This assumes that \({\textcolor{input}{\boldsymbol{X}}}\) is centered (i.e., each feature has zero mean).
\[ \ell({\textcolor{params}{\boldsymbol{w}}}, \lambda) = \norm{{\textcolor{latent}{\boldsymbol{z}}}}^2 - \lambda(\norm{{\textcolor{params}{\boldsymbol{w}}}}^2 - 1) = {\textcolor{params}{\boldsymbol{w}}}^\top{\textcolor{input}{\boldsymbol{X}}}^\top{\textcolor{input}{\boldsymbol{X}}}{\textcolor{params}{\boldsymbol{w}}}- \lambda({\textcolor{params}{\boldsymbol{w}}}^\top{\textcolor{params}{\boldsymbol{w}}}- 1) \]
\[ \begin{aligned} \frac{\partial \ell}{\partial {\textcolor{params}{\boldsymbol{w}}}} &= 2{\textcolor{input}{\boldsymbol{X}}}^\top{\textcolor{input}{\boldsymbol{X}}}{\textcolor{params}{\boldsymbol{w}}}- 2\lambda{\textcolor{params}{\boldsymbol{w}}}= 0 \quad \Rightarrow \quad {\textcolor{input}{\boldsymbol{X}}}^\top{\textcolor{input}{\boldsymbol{X}}}{\textcolor{params}{\boldsymbol{w}}}= \lambda{\textcolor{params}{\boldsymbol{w}}}\\ \frac{\partial \ell}{\partial \lambda} &= -({\textcolor{params}{\boldsymbol{w}}}^\top{\textcolor{params}{\boldsymbol{w}}}- 1) = 0 \end{aligned} \]
This is an eigenvalue problem, where \(\lambda\) is the eigenvalue and \({\textcolor{params}{\boldsymbol{w}}}\) is the eigenvector of the covariance matrix \({\textcolor{input}{\boldsymbol{X}}}^\top{\textcolor{input}{\boldsymbol{X}}}\).
But which eigenvalue?
The eigenvalue \(\lambda\) is the variance of the projected data \({\textcolor{latent}{\boldsymbol{z}}}\): \(\lambda = {\textcolor{params}{\boldsymbol{w}}}^\top{\textcolor{input}{\boldsymbol{X}}}^\top{\textcolor{input}{\boldsymbol{X}}}{\textcolor{params}{\boldsymbol{w}}}\)
If we want to maximize the variance, we need to find the largest eigenvalue of the covariance matrix \({\textcolor{input}{\boldsymbol{X}}}^\top{\textcolor{input}{\boldsymbol{X}}}\).
The corresponding eigenvector is the direction of maximum variance: \(\widehat{\textcolor{params}{\boldsymbol{w}}}\)
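In code, the direction of maximum variance is the eigenvector of \({\textcolor{input}{\boldsymbol{X}}}^\top{\textcolor{input}{\boldsymbol{X}}}\) associated with the largest eigenvalue. A minimal sketch, assuming `X` is an \(N \times D\) NumPy array (names are illustrative):

```python
import numpy as np

def first_principal_direction(X):
    """Return the unit vector w maximizing the variance of the projection z = X w."""
    Xc = X - X.mean(axis=0)                 # center the data (zero-mean assumption)
    C = Xc.T @ Xc                           # (unnormalized) covariance matrix, D x D
    eigvals, eigvecs = np.linalg.eigh(C)    # eigh: C is symmetric; eigenvalues in ascending order
    w = eigvecs[:, -1]                      # eigenvector of the largest eigenvalue
    return w / np.linalg.norm(w)            # already unit norm, made explicit here

# Projection onto the 1-D subspace: z = (X - X.mean(axis=0)) @ w
```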
PCA can be seen as a linear continuous latent variable model, where the latent variable \({\textcolor{latent}{\boldsymbol{z}}}\) is a linear combination of the observed data \({\textcolor{input}{\boldsymbol{x}}}\).
Question: Can we approach PCA from a probabilistic point of view? Answer: Yes!
Recall the continuous latent variable model:
\[ p({\textcolor{input}{\boldsymbol{x}}}) = \int p({\textcolor{input}{\boldsymbol{x}}}, {\textcolor{latent}{\boldsymbol{z}}}) \dd{\textcolor{latent}{\boldsymbol{z}}}= \int p({\textcolor{input}{\boldsymbol{x}}}\mid{\textcolor{latent}{\boldsymbol{z}}})p({\textcolor{latent}{\boldsymbol{z}}}) \dd{\textcolor{latent}{\boldsymbol{z}}} \]
\[ {\textcolor{input}{\boldsymbol{x}}}= {\textcolor{params}{\boldsymbol{W}}}{\textcolor{latent}{\boldsymbol{z}}}+ {\boldsymbol{\mu}}+ {\textcolor{noise}{\boldsymbol{\varepsilon}}} \]
where: \({\textcolor{latent}{\boldsymbol{z}}}\in{\mathbb{R}}^{K}\) is the latent variable, \({\textcolor{params}{\boldsymbol{W}}}\in{\mathbb{R}}^{D\times K}\) is a linear map, \({\boldsymbol{\mu}}\in{\mathbb{R}}^{D}\) is an offset and \({\textcolor{noise}{\boldsymbol{\varepsilon}}}\) is observation noise.
How can we treat these variables? The latent variable gets a Gaussian prior \(p({\textcolor{latent}{\boldsymbol{z}}}) = {\mathcal{N}}({\textcolor{latent}{\boldsymbol{z}}}; \boldsymbol{0}, {\boldsymbol{I}})\), the noise is Gaussian \({\textcolor{noise}{\boldsymbol{\varepsilon}}}\sim{\mathcal{N}}(\boldsymbol{0}, \sigma^2{\boldsymbol{I}})\), and \({\textcolor{params}{\boldsymbol{W}}}\), \({\boldsymbol{\mu}}\), \(\sigma^2\) are parameters to be learned.
The observed data \({\textcolor{input}{\boldsymbol{x}}}\) conditioned on the latent variable \({\textcolor{latent}{\boldsymbol{z}}}\) is a Gaussian distribution \[ p({\textcolor{input}{\boldsymbol{x}}}\mid {\textcolor{latent}{\boldsymbol{z}}}) = {\mathcal{N}}({\textcolor{input}{\boldsymbol{x}}}; {\textcolor{params}{\boldsymbol{W}}}{\textcolor{latent}{\boldsymbol{z}}}+ {\boldsymbol{\mu}}, \sigma^2{\boldsymbol{I}}) \]
The marginal distribution of the data \({\textcolor{input}{\boldsymbol{x}}}\) is obtained by integrating out the latent variable \({\textcolor{latent}{\boldsymbol{z}}}\):
\[ p({\textcolor{input}{\boldsymbol{x}}}) = \int p({\textcolor{input}{\boldsymbol{x}}}\mid{\textcolor{latent}{\boldsymbol{z}}})p({\textcolor{latent}{\boldsymbol{z}}}) \dd{\textcolor{latent}{\boldsymbol{z}}}= \int {\mathcal{N}}({\textcolor{input}{\boldsymbol{x}}}; {\textcolor{params}{\boldsymbol{W}}}{\textcolor{latent}{\boldsymbol{z}}}+ {\boldsymbol{\mu}}, \sigma^2{\boldsymbol{I}}){\mathcal{N}}({\textcolor{latent}{\boldsymbol{z}}}; 0, {\boldsymbol{I}}) \dd{\textcolor{latent}{\boldsymbol{z}}} \]
It’s a Gaussian distribution \({\mathcal{N}}(?, ?)\).
\[ \begin{aligned} \mathbb{E}{[{\textcolor{input}{\boldsymbol{x}}}]} &= \mathbb{E}{[{\textcolor{params}{\boldsymbol{W}}}{\textcolor{latent}{\boldsymbol{z}}}+ {\boldsymbol{\mu}}+ {\textcolor{noise}{\boldsymbol{\varepsilon}}}]} \\ &= {\textcolor{params}{\boldsymbol{W}}}\mathbb{E}{[{\textcolor{latent}{\boldsymbol{z}}}]} + {\boldsymbol{\mu}}+ \mathbb{E}{[{\textcolor{noise}{\boldsymbol{\varepsilon}}}]} \\ &= {\textcolor{params}{\boldsymbol{W}}}\cdot 0 + {\boldsymbol{\mu}}+ 0 = {\boldsymbol{\mu}} \end{aligned} \]
\[ \begin{aligned} \text{Cov}({\textcolor{input}{\boldsymbol{x}}}) &= \mathbb{E}[({\textcolor{input}{\boldsymbol{x}}}- {\boldsymbol{\mu}})({\textcolor{input}{\boldsymbol{x}}}- {\boldsymbol{\mu}})^\top] \\ &= \mathbb{E}[({\textcolor{params}{\boldsymbol{W}}}{\textcolor{latent}{\boldsymbol{z}}}+ {\textcolor{noise}{\boldsymbol{\varepsilon}}})({\textcolor{params}{\boldsymbol{W}}}{\textcolor{latent}{\boldsymbol{z}}}+ {\textcolor{noise}{\boldsymbol{\varepsilon}}})^\top] \\ &= \mathbb{E}[{\textcolor{params}{\boldsymbol{W}}}{\textcolor{latent}{\boldsymbol{z}}}{\textcolor{latent}{\boldsymbol{z}}}^\top{\textcolor{params}{\boldsymbol{W}}}^\top + {\textcolor{noise}{\boldsymbol{\varepsilon}}}{\textcolor{noise}{\boldsymbol{\varepsilon}}}^\top + {\textcolor{params}{\boldsymbol{W}}}{\textcolor{latent}{\boldsymbol{z}}}{\textcolor{noise}{\boldsymbol{\varepsilon}}}^\top + {\textcolor{noise}{\boldsymbol{\varepsilon}}}{\textcolor{latent}{\boldsymbol{z}}}^\top{\textcolor{params}{\boldsymbol{W}}}^\top] \\ &= {\textcolor{params}{\boldsymbol{W}}}\mathbb{E}[{\textcolor{latent}{\boldsymbol{z}}}{\textcolor{latent}{\boldsymbol{z}}}^\top]{\textcolor{params}{\boldsymbol{W}}}^\top + \mathbb{E}[{\textcolor{noise}{\boldsymbol{\varepsilon}}}{\textcolor{noise}{\boldsymbol{\varepsilon}}}^\top] + {\textcolor{params}{\boldsymbol{W}}}\mathbb{E}[{\textcolor{latent}{\boldsymbol{z}}}]\mathbb{E}[{\textcolor{noise}{\boldsymbol{\varepsilon}}}^\top] + \mathbb{E}[{\textcolor{noise}{\boldsymbol{\varepsilon}}}]\mathbb{E}[{\textcolor{latent}{\boldsymbol{z}}}^\top]{\textcolor{params}{\boldsymbol{W}}}^\top\\ &= {\textcolor{params}{\boldsymbol{W}}}\cdot {\boldsymbol{I}}\cdot{\textcolor{params}{\boldsymbol{W}}}^\top + \sigma^2{\boldsymbol{I}}+ 0 + 0 \\ &= {\textcolor{params}{\boldsymbol{W}}}{\textcolor{params}{\boldsymbol{W}}}^\top + \sigma^2{\boldsymbol{I}} \end{aligned} \]
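These two moments can be checked numerically by sampling from the generative model; a quick sanity-check sketch with arbitrary values for \({\textcolor{params}{\boldsymbol{W}}}\), \({\boldsymbol{\mu}}\) and \(\sigma\) (not part of the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
D, K, sigma = 5, 2, 0.3
W = rng.normal(size=(D, K))
mu = rng.normal(size=D)

# Sample x = W z + mu + eps with z ~ N(0, I) and eps ~ N(0, sigma^2 I)
n = 200_000
Z = rng.normal(size=(n, K))
eps = sigma * rng.normal(size=(n, D))
X = Z @ W.T + mu + eps

print(np.allclose(X.mean(axis=0), mu, atol=0.05))                            # E[x] = mu
print(np.allclose(np.cov(X.T), W @ W.T + sigma**2 * np.eye(D), atol=0.05))   # Cov(x) = W W^T + sigma^2 I
```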
Now, we need to find the parameters \({\textcolor{params}{\boldsymbol{W}}}\), \({\boldsymbol{\mu}}\) and \(\sigma^2\) that maximize the log-likelihood of the data: \[ \begin{aligned} \log p({\textcolor{input}{\boldsymbol{X}}}; {\textcolor{params}{\boldsymbol{W}}}, {\boldsymbol{\mu}}, \sigma^2) &= \sum_{n=1}^N \log p({\textcolor{input}{\boldsymbol{x}}}_n; {\textcolor{params}{\boldsymbol{W}}}, {\boldsymbol{\mu}}, \sigma^2) = \sum_{n=1}^N \log {\mathcal{N}}({\textcolor{input}{\boldsymbol{x}}}_n; {\boldsymbol{\mu}}, {\textcolor{params}{\boldsymbol{W}}}{\textcolor{params}{\boldsymbol{W}}}^\top + \sigma^2{\boldsymbol{I}})\\ &= -\frac{ND}{2}\log(2\pi) - \frac{N}{2}\log(\det({\textcolor{params}{\boldsymbol{W}}}{\textcolor{params}{\boldsymbol{W}}}^\top + \sigma^2{\boldsymbol{I}})) - \frac{1}{2}\sum_{n=1}^N ({\textcolor{input}{\boldsymbol{x}}}_n - {\boldsymbol{\mu}})^\top({\textcolor{params}{\boldsymbol{W}}}{\textcolor{params}{\boldsymbol{W}}}^\top + \sigma^2{\boldsymbol{I}})^{-1}({\textcolor{input}{\boldsymbol{x}}}_n - {\boldsymbol{\mu}}) \end{aligned} \]
Again, compute the partial derivative w.r.t. \({\textcolor{params}{\boldsymbol{W}}}\), \({\boldsymbol{\mu}}\) and \(\sigma^2\) and set them to zero.
Let’s define the sample covariance matrix \({\boldsymbol{S}}= \frac{1}{N}\sum_{n=1}^N ({\textcolor{input}{\boldsymbol{x}}}_n - {\boldsymbol{\mu}})({\textcolor{input}{\boldsymbol{x}}}_n - {\boldsymbol{\mu}})^\top\). Solutions for \({\textcolor{params}{\boldsymbol{W}}}\), \({\boldsymbol{\mu}}\) and \(\sigma^2\) can be found in closed form; there is no need to use the EM algorithm.
Because everything is Gaussian, we can also compute the posterior distribution of the latent variable \({\textcolor{latent}{\boldsymbol{z}}}\) given the data \({\textcolor{input}{\boldsymbol{x}}}\):
\[ p({\textcolor{latent}{\boldsymbol{z}}}_n\mid{\textcolor{input}{\boldsymbol{x}}}_n) = {\mathcal{N}}({\textcolor{latent}{\boldsymbol{z}}}_n; {\boldsymbol{C}}^{-1}{\textcolor{params}{\boldsymbol{W}}}^\top({\textcolor{input}{\boldsymbol{x}}}_n - {\boldsymbol{\mu}}), \sigma^2{\boldsymbol{C}}^{-1}) \]
with \({\boldsymbol{C}}= {\textcolor{params}{\boldsymbol{W}}}^\top{\textcolor{params}{\boldsymbol{W}}}+ \sigma^2{\boldsymbol{I}}\)
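A sketch of the closed-form maximum-likelihood fit (the classic Tipping & Bishop solution, with the rotation ambiguity fixed to the identity) together with the posterior mean projection; `fit_ppca` and `posterior_mean` are illustrative names, and `X` is assumed to be an \(N \times D\) array:

```python
import numpy as np

def fit_ppca(X, K):
    """Closed-form ML solution for PPCA; W is recovered up to a rotation."""
    N, D = X.shape
    mu = X.mean(axis=0)                                   # ML estimate of mu is the sample mean
    S = (X - mu).T @ (X - mu) / N                         # sample covariance, D x D
    eigvals, eigvecs = np.linalg.eigh(S)
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]    # sort in decreasing order
    sigma2 = eigvals[K:].mean()                           # average of the discarded eigenvalues
    W = eigvecs[:, :K] * np.sqrt(np.maximum(eigvals[:K] - sigma2, 0.0))
    return W, mu, sigma2

def posterior_mean(X, W, mu, sigma2):
    """Posterior mean of z given x: C^{-1} W^T (x - mu), with C = W^T W + sigma^2 I."""
    C = W.T @ W + sigma2 * np.eye(W.shape[1])
    return (X - mu) @ W @ np.linalg.inv(C).T
```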
The PPCA solution is more general than PCA: it defines a full probabilistic model of the data with explicit observation noise \(\sigma^2\), and it recovers standard PCA in the limit \(\sigma^2 \to 0\).
Start from a model for the data \({\textcolor{input}{\boldsymbol{x}}}\) given the latent variable \({\textcolor{latent}{\boldsymbol{z}}}\) and the linear weights \({\textcolor{params}{\boldsymbol{W}}}\): in PPCA we put a prior on \({\textcolor{latent}{\boldsymbol{z}}}\), marginalize it out and optimize \({\textcolor{params}{\boldsymbol{W}}}\).
But we could have also done the opposite: put a prior on \({\textcolor{params}{\boldsymbol{W}}}\), marginalize it out and optimize the latent variables \({\textcolor{latent}{\boldsymbol{z}}}\).
This is a step closer to probabilistic kernel PCA and to the Gaussian process latent variable model (GPLVM).

Simone Rossi - Advanced Statistical Inference - EURECOM