Latent Variable Models

Advanced Statistical Inference

Simone Rossi

EURECOM

\[ \require{physics} \definecolor{input}{rgb}{0.42, 0.55, 0.74} \definecolor{params}{rgb}{0.51,0.70,0.40} \definecolor{output}{rgb}{0.843, 0.608, 0} \definecolor{vparams}{rgb}{0.58, 0, 0.83} \definecolor{noise}{rgb}{0.0, 0.48, 0.65} \definecolor{latent}{rgb}{0.8, 0.0, 0.8} \definecolor{function}{rgb}{0.75, 0.75, 0.12} \]

Latent variable models

Our objective is to learn the data distribution \(p({\textcolor{input}{\boldsymbol{x}}})\), but we suppose that each data point \({\textcolor{input}{\boldsymbol{x}}}\) is associated with a latent variable \({\textcolor{latent}{\boldsymbol{z}}}\).

For example, image we want to learn the distribution of images of objects, like the one below:

Each image is a huge vector of pixels \({\textcolor{input}{\boldsymbol{x}}}\in \mathbb{R}^{H \times W \times C}\), where \(H\) is the height, \(W\) is the width and \(C\) is the number of channels (in this case, \(64 \times 64 \times 3\)).

Each image can be described by a set of 6 latent variables: floor, wall and object color, shape, orientation and scale.

Latent variable models

Latent variables are unobserved variables, we cannot measure them directly, but we can infer them from the observed data.

In math terms, we model our data distribution as a marginal of the joint distribution of the observed data and the latent variables:

\[ p({\textcolor{input}{\boldsymbol{x}}}) = \int p({\textcolor{input}{\boldsymbol{x}}}, {\textcolor{latent}{\boldsymbol{z}}}) \dd{\textcolor{latent}{\boldsymbol{z}}}\quad \text{or} \quad p({\textcolor{input}{\boldsymbol{x}}}) = \sum_{i} p({\textcolor{input}{\boldsymbol{x}}}, {\textcolor{latent}{\boldsymbol{z}}}_i) \]

where \(p({\textcolor{input}{\boldsymbol{x}}}, {\textcolor{latent}{\boldsymbol{z}}})\) is the joint distribution of the observed data and the latent variables.

Mixture models

In mixture models, we assume that the data is generated from a mixture of several distributions

\[ p({\textcolor{input}{\boldsymbol{x}}}) = \sum_i p({\textcolor{input}{\boldsymbol{x}}}, {\textcolor{latent}{\boldsymbol{z}}}_i) = \sum_i p({\textcolor{input}{\boldsymbol{x}}}\mid{\textcolor{latent}{\boldsymbol{z}}}_i)p({\textcolor{latent}{\boldsymbol{z}}}_i) \]

where \(p({\textcolor{latent}{\boldsymbol{z}}})\) is the mixture distribution and \(p({\textcolor{input}{\boldsymbol{x}}}\mid{\textcolor{latent}{\boldsymbol{z}}})\) is the conditional distribution of the data given the latent variables.

The latent variables \({\textcolor{latent}{\boldsymbol{z}}}\) are often called clusters.
The likelihood is often assumed to be Gaussian, i.e., \(p({\textcolor{input}{\boldsymbol{x}}}\mid{\textcolor{latent}{\boldsymbol{z}}}) = \mathcal{N}({\textcolor{input}{\boldsymbol{x}}}; {\textcolor{vparams}{\boldsymbol{\mu}}}_k, {\textcolor{vparams}{\boldsymbol{\Sigma}}}_k)\)
The prior is often assumed to be a categorical distribution, i.e., \(p({\textcolor{latent}{\boldsymbol{z}}}) = \text{Cat}({\textcolor{latent}{\boldsymbol{z}}};{\textcolor{vparams}{\boldsymbol{\pi}}})\)

Gaussian Mixture Models

Modeling the data distribution

1-hot encoding of the discrete latent variable \({\textcolor{latent}{z}}_k \in \{0,1\}\), with prior: \[ p({\textcolor{latent}{z}}_k=1) = \textcolor{vparams}{\pi}_k \quad \text{and} \quad \sum_k \textcolor{vparams}{\pi}_k = 1 \]
The conditional distribution of the data given the latent variable is: \[ p({\textcolor{input}{\boldsymbol{x}}}\mid{\textcolor{latent}{z}}_k=1) = {\mathcal{N}}({\textcolor{input}{\boldsymbol{x}}}; {\textcolor{vparams}{\boldsymbol{\mu}}}_k, {\textcolor{vparams}{\boldsymbol{\Sigma}}}_k) \]
The joint distribution of the data and the latent variable is: \[ p({\textcolor{input}{\boldsymbol{x}}}, {\textcolor{latent}{z}}_k=1) = p({\textcolor{input}{\boldsymbol{x}}}\mid{\textcolor{latent}{z}}_k=1)p({\textcolor{latent}{z}}_k=1) = \textcolor{vparams}{\pi}_k{\mathcal{N}}({\textcolor{input}{\boldsymbol{x}}}; {\textcolor{vparams}{\boldsymbol{\mu}}}_k, {\textcolor{vparams}{\boldsymbol{\Sigma}}}_k) \]
The marginal distribution of the data is: \[ p({\textcolor{input}{\boldsymbol{x}}}) = \sum_k p({\textcolor{input}{\boldsymbol{x}}}, {\textcolor{latent}{z}}_k=1) = \sum_k \textcolor{vparams}{\pi}_k{\mathcal{N}}({\textcolor{input}{\boldsymbol{x}}}; {\textcolor{vparams}{\boldsymbol{\mu}}}_k, {\textcolor{vparams}{\boldsymbol{\Sigma}}}_k) \]

Posterior distribution

We are interested in the posterior distribution of a cluster assignment given the data:

\[ \begin{aligned} p({\textcolor{latent}{z}}_k=1\mid{\textcolor{input}{\boldsymbol{x}}}) &= \frac{p({\textcolor{input}{\boldsymbol{x}}}\mid{\textcolor{latent}{z}}_k=1)p({\textcolor{latent}{z}}_k=1)}{p({\textcolor{input}{\boldsymbol{x}}})} \\ &= \frac{p({\textcolor{input}{\boldsymbol{x}}}\mid{\textcolor{latent}{z}}_k=1)p({\textcolor{latent}{z}}_k=1)}{\sum_{j} p({\textcolor{input}{\boldsymbol{x}}}\mid{\textcolor{latent}{z}}_j=1)p({\textcolor{latent}{z}}_j=1)} \\ &= \frac{\textcolor{vparams}{\pi}_k{\mathcal{N}}({\textcolor{input}{\boldsymbol{x}}}; {\textcolor{vparams}{\boldsymbol{\mu}}}_k, {\textcolor{vparams}{\boldsymbol{\Sigma}}}_k)}{\sum_{j} \textcolor{vparams}{\pi}_j{\mathcal{N}}({\textcolor{input}{\boldsymbol{x}}}; {\textcolor{vparams}{\boldsymbol{\mu}}}_j, {\textcolor{vparams}{\boldsymbol{\Sigma}}}_j)} := \gamma({\textcolor{latent}{z}}_k) \end{aligned} \]

\(\gamma({\textcolor{latent}{z}}_k)\) is the responsibility of cluster \(k\) for data point \({\textcolor{input}{\boldsymbol{x}}}\).

Log-likelihood

Given a dataset \({\textcolor{input}{\boldsymbol{X}}}= \{{\textcolor{input}{\boldsymbol{x}}}_1, \ldots, {\textcolor{input}{\boldsymbol{x}}}_N\}\) i.i.d, the log-likelihood of the data is:

\[ \begin{aligned} \log p({\textcolor{input}{\boldsymbol{X}}}\mid {\textcolor{vparams}{\boldsymbol{\pi}}}, \{{\textcolor{vparams}{\boldsymbol{\mu}}}_k\}, \{{\textcolor{vparams}{\boldsymbol{\Sigma}}}_k\}) &= \log \prod_{i=1}^N p({\textcolor{input}{\boldsymbol{x}}}_i) \\ &= \sum_{i=1}^N \log p({\textcolor{input}{\boldsymbol{x}}}_i\mid {\textcolor{vparams}{\boldsymbol{\pi}}}, \{{\textcolor{vparams}{\boldsymbol{\mu}}}_k\}, \{{\textcolor{vparams}{\boldsymbol{\Sigma}}}_k\}) \\ &= \sum_{i=1}^N \log \sum_{k=1}^K p({\textcolor{input}{\boldsymbol{x}}}_i\mid{\textcolor{latent}{z}}_k=1)p({\textcolor{latent}{z}}_k=1) \\ &= \sum_{i=1}^N \log \sum_{k=1}^K \textcolor{vparams}{\pi}_k{\mathcal{N}}({\textcolor{input}{\boldsymbol{x}}}_i; {\textcolor{vparams}{\boldsymbol{\mu}}}_k, {\textcolor{vparams}{\boldsymbol{\Sigma}}}_k) \\ \end{aligned} \]

Warning

We cannot simplify it further, because the log and the sum do not commute. How can we optimize it?

Expectation-Maximization algorithm

We need to maximize the log-likelihood of the data w.r.t. the parameters \({\textcolor{vparams}{\boldsymbol{\pi}}}\), \(\{{\textcolor{vparams}{\boldsymbol{\mu}}}_k\}\) and \(\{{\textcolor{vparams}{\boldsymbol{\Sigma}}}_k\}\), for \(k=1,\ldots,K\)

\[ \log p({\textcolor{input}{\boldsymbol{X}}}) = \sum_{i=1}^N \log \sum_{k=1}^K \textcolor{vparams}{\pi}_k{\mathcal{N}}({\textcolor{input}{\boldsymbol{x}}}_i; {\textcolor{vparams}{\boldsymbol{\mu}}}_k, {\textcolor{vparams}{\boldsymbol{\Sigma}}}_k) \]

The problem is not convex.
No closed-form solution! Stationary points depend on the posterior \(\gamma({\textcolor{latent}{z}}_{nk})\) for each data point \({\textcolor{input}{\boldsymbol{x}}}_n\).
… but we will see that we can write \(\textcolor{vparams}{\pi}_k\), \({\textcolor{vparams}{\boldsymbol{\mu}}}_k\) and \({\textcolor{vparams}{\boldsymbol{\Sigma}}}_k\) in terms of \(\gamma({\textcolor{latent}{z}}_{nk})\).
We can find the local minima by iterative algorithm: alternate update of the expected posterior \(\gamma({\textcolor{latent}{z}}_{nk})\) and the maximized parameters \(\textcolor{vparams}{\pi}_k\), \({\textcolor{vparams}{\boldsymbol{\mu}}}_k\) and \({\textcolor{vparams}{\boldsymbol{\Sigma}}}_k\).

Expectation-Maximization algorithm

E-step: compute the expected posterior \(\gamma({\textcolor{latent}{z}}_{nk})\) given the current parameters \({\textcolor{vparams}{\boldsymbol{\pi}}}\), \(\{{\textcolor{vparams}{\boldsymbol{\mu}}}_k\}\) and \(\{{\textcolor{vparams}{\boldsymbol{\Sigma}}}_k\}\): \[ \gamma({\textcolor{latent}{z}}_{nk}) = \frac{{\mathcal{N}}({\textcolor{input}{\boldsymbol{x}}}_n; {\textcolor{vparams}{\boldsymbol{\mu}}}_k, {\textcolor{vparams}{\boldsymbol{\Sigma}}}_k)\textcolor{vparams}{\pi}_k}{\sum_{j=1}^K {\mathcal{N}}({\textcolor{input}{\boldsymbol{x}}}_n; {\textcolor{vparams}{\boldsymbol{\mu}}}_j, {\textcolor{vparams}{\boldsymbol{\Sigma}}}_j)\textcolor{vparams}{\pi}_j} \]
M-step: update the parameters \({\textcolor{vparams}{\boldsymbol{\pi}}}\), \(\{{\textcolor{vparams}{\boldsymbol{\mu}}}_k\}\) and \(\{{\textcolor{vparams}{\boldsymbol{\Sigma}}}_k\}\) to maximize the expected log-likelihood, given the current expected posterior \(\gamma({\textcolor{latent}{z}}_{nk})\). This can be solved in closed form: \[ \begin{aligned} N_k &= \sum_{n=1}^N \gamma({\textcolor{latent}{z}}_{nk}) \qquad \textcolor{vparams}{\pi}_k = \frac{N_k}{N} \\ {\textcolor{vparams}{\boldsymbol{\mu}}}_k &= \frac{1}{N_k}\sum_{n=1}^N \gamma({\textcolor{latent}{z}}_{nk}){\textcolor{input}{\boldsymbol{x}}}_n \\ {\textcolor{vparams}{\boldsymbol{\Sigma}}}_k &= \frac{1}{N_k}\sum_{n=1}^N \gamma({\textcolor{latent}{z}}_{nk})({\textcolor{input}{\boldsymbol{x}}}_n - {\textcolor{vparams}{\boldsymbol{\mu}}}_k)({\textcolor{input}{\boldsymbol{x}}}_n - {\textcolor{vparams}{\boldsymbol{\mu}}}_k)^\top \end{aligned} \]

Derivation of the M-step updates

Results of results for Gaussian distributions:

Assume \(p({\textcolor{input}{\boldsymbol{x}}}) = {\mathcal{N}}({\textcolor{input}{\boldsymbol{x}}}; {\textcolor{vparams}{\boldsymbol{\mu}}}, {\textcolor{vparams}{\boldsymbol{\Sigma}}}) = (2\pi)^{-\frac{d}{2}}\det({\textcolor{vparams}{\boldsymbol{\Sigma}}})^{-\frac{1}{2}}\exp\left(-\frac{1}{2}({\textcolor{input}{\boldsymbol{x}}}- {\textcolor{vparams}{\boldsymbol{\mu}}})^\top{\textcolor{vparams}{\boldsymbol{\Sigma}}}^{-1}({\textcolor{input}{\boldsymbol{x}}}- {\textcolor{vparams}{\boldsymbol{\mu}}})\right)\).

Then,

\[ \frac{\partial p({\textcolor{input}{\boldsymbol{x}}})}{\partial {\textcolor{vparams}{\boldsymbol{\mu}}}} = {\mathcal{N}}({\textcolor{input}{\boldsymbol{x}}}; {\textcolor{vparams}{\boldsymbol{\mu}}}, {\textcolor{vparams}{\boldsymbol{\Sigma}}})({\textcolor{input}{\boldsymbol{x}}}- {\textcolor{vparams}{\boldsymbol{\mu}}}){\textcolor{vparams}{\boldsymbol{\Sigma}}}^{-1} \]

and

\[ \frac{\partial p({\textcolor{input}{\boldsymbol{x}}})}{\partial {\textcolor{vparams}{\boldsymbol{\Sigma}}}} = -\frac{1}{2}{\mathcal{N}}({\textcolor{input}{\boldsymbol{x}}}; {\textcolor{vparams}{\boldsymbol{\mu}}}, {\textcolor{vparams}{\boldsymbol{\Sigma}}})\left({\textcolor{vparams}{\boldsymbol{\Sigma}}}^{-1} - {\textcolor{vparams}{\boldsymbol{\Sigma}}}^{-1}({\textcolor{input}{\boldsymbol{x}}}- {\textcolor{vparams}{\boldsymbol{\mu}}})({\textcolor{input}{\boldsymbol{x}}}- {\textcolor{vparams}{\boldsymbol{\mu}}})^\top{\textcolor{vparams}{\boldsymbol{\Sigma}}}^{-1} \right) \]

Derivation of the M-step updates (\({\textcolor{vparams}{\boldsymbol{\mu}}}_k\))

\[ \begin{aligned} \arg\max_{{\textcolor{vparams}{\boldsymbol{\mu}}}_k} p({\textcolor{input}{\boldsymbol{X}}}\mid {\textcolor{vparams}{\boldsymbol{\pi}}}, \{{\textcolor{vparams}{\boldsymbol{\mu}}}_j\}, \{{\textcolor{vparams}{\boldsymbol{\Sigma}}}_j\}) &= \arg\max_{{\textcolor{vparams}{\boldsymbol{\mu}}}_k} \sum_{n=1}^N \log p({\textcolor{input}{\boldsymbol{x}}}_n\mid {\textcolor{vparams}{\boldsymbol{\pi}}}, \{{\textcolor{vparams}{\boldsymbol{\mu}}}_j\}, \{{\textcolor{vparams}{\boldsymbol{\Sigma}}}_j\}) \end{aligned} \]

Compute the partial derivative w.r.t. \({\textcolor{vparams}{\boldsymbol{\mu}}}_k\):

\[ \begin{aligned} \frac{\partial}{\partial {\textcolor{vparams}{\boldsymbol{\mu}}}_k} \log p({\textcolor{input}{\boldsymbol{X}}}\mid {\textcolor{vparams}{\boldsymbol{\pi}}}, \{{\textcolor{vparams}{\boldsymbol{\mu}}}_k\}, \{{\textcolor{vparams}{\boldsymbol{\Sigma}}}_k\}) &= \sum_{n=1}^N \frac{1}{p({\textcolor{input}{\boldsymbol{x}}}_n\mid {\textcolor{vparams}{\boldsymbol{\pi}}}, \{{\textcolor{vparams}{\boldsymbol{\mu}}}_k\}, \{{\textcolor{vparams}{\boldsymbol{\Sigma}}}_k\})}\frac{\partial p({\textcolor{input}{\boldsymbol{x}}}_n\mid {\textcolor{vparams}{\boldsymbol{\pi}}}, \{{\textcolor{vparams}{\boldsymbol{\mu}}}_j\}, \{{\textcolor{vparams}{\boldsymbol{\Sigma}}}_j\})}{\partial {\textcolor{vparams}{\boldsymbol{\mu}}}_k} \\ &= \sum_{n=1}^N \frac{1}{p({\textcolor{input}{\boldsymbol{x}}}_n\mid {\textcolor{vparams}{\boldsymbol{\pi}}}, \{{\textcolor{vparams}{\boldsymbol{\mu}}}_k\}, \{{\textcolor{vparams}{\boldsymbol{\Sigma}}}_k\})}\frac{\partial \textcolor{vparams}{\pi}_k{\mathcal{N}}({\textcolor{input}{\boldsymbol{x}}}_n; {\textcolor{vparams}{\boldsymbol{\mu}}}_k, {\textcolor{vparams}{\boldsymbol{\Sigma}}}_k)}{\partial {\textcolor{vparams}{\boldsymbol{\mu}}}_k} \\ &= \sum_{n=1}^N \frac{\textcolor{vparams}{\pi}_k{\mathcal{N}}({\textcolor{input}{\boldsymbol{x}}}_n; {\textcolor{vparams}{\boldsymbol{\mu}}}_k, {\textcolor{vparams}{\boldsymbol{\Sigma}}}_k)}{\sum_{j=1}^K \textcolor{vparams}{\pi}_j{\mathcal{N}}({\textcolor{input}{\boldsymbol{x}}}_n; {\textcolor{vparams}{\boldsymbol{\mu}}}_j, {\textcolor{vparams}{\boldsymbol{\Sigma}}}_j)}({\textcolor{input}{\boldsymbol{x}}}_n - {\textcolor{vparams}{\boldsymbol{\mu}}}_k){\textcolor{vparams}{\boldsymbol{\Sigma}}}_k^{-1} \\ &= \sum_{n=1}^N \gamma({\textcolor{latent}{z}}_{nk})({\textcolor{input}{\boldsymbol{x}}}_n - {\textcolor{vparams}{\boldsymbol{\mu}}}_k){\textcolor{vparams}{\boldsymbol{\Sigma}}}_k^{-1} = 0 \quad \Rightarrow {\textcolor{vparams}{\boldsymbol{\mu}}}_k = \frac{\sum_{n=1}^N \gamma({\textcolor{latent}{z}}_{nk}){\textcolor{input}{\boldsymbol{x}}}_n}{\sum_{n=1}^N \gamma({\textcolor{latent}{z}}_{nk})} \end{aligned} \]

Derivation of the M-step updates (\({\textcolor{vparams}{\boldsymbol{\Sigma}}}_k\), \({\textcolor{vparams}{\boldsymbol{\pi}}}\))

Same for the covariance:

Write the derivative of the log-likelihood w.r.t. \({\textcolor{vparams}{\boldsymbol{\Sigma}}}_k\) (use the properties of the Gaussian distribution).
Manipulate it to make \(\gamma({\textcolor{latent}{z}}_{nk})\) appear.
Set it to zero and solve for \({\textcolor{vparams}{\boldsymbol{\Sigma}}}_k\).

Same for the weights \({\textcolor{vparams}{\boldsymbol{\pi}}}\), but we have to use the constraint \(\sum_{k=1}^K \textcolor{vparams}{\pi}_k = 1\).

Use Lagrange multipliers to introduce a new variable \(\lambda\) and write the Lagrangian function: \[ \ell({\textcolor{vparams}{\boldsymbol{\pi}}}, \lambda) = \log p({\textcolor{input}{\boldsymbol{X}}}\mid {\textcolor{vparams}{\boldsymbol{\pi}}}, \{{\textcolor{vparams}{\boldsymbol{\mu}}}_k\}, \{{\textcolor{vparams}{\boldsymbol{\Sigma}}}_k\}) + \lambda\left(1 - \sum_{k=1}^K \textcolor{vparams}{\pi}_k\right) \]
Compute the partial derivative w.r.t. \({\textcolor{vparams}{\boldsymbol{\pi}}}\) and \(\lambda\)
Set them to zero and solve for \({\textcolor{vparams}{\boldsymbol{\pi}}}\) and \(\lambda\).

EM for Gaussian Mixture Models

Algorithm:

Initialize the parameters \({\textcolor{vparams}{\boldsymbol{\pi}}}\), \(\{{\textcolor{vparams}{\boldsymbol{\mu}}}_k\}\) and \(\{{\textcolor{vparams}{\boldsymbol{\Sigma}}}_k\}\)
Repeat until convergence (likelihood does not change significantly):
1. E-step: compute the expected posterior \(\gamma({\textcolor{latent}{z}}_{nk})\) given the current parameters \({\textcolor{vparams}{\boldsymbol{\pi}}}\), \(\{{\textcolor{vparams}{\boldsymbol{\mu}}}_k\}\) and \(\{{\textcolor{vparams}{\boldsymbol{\Sigma}}}_k\}\):
2. M-step: update the parameters \({\textcolor{vparams}{\boldsymbol{\pi}}}\), \(\{{\textcolor{vparams}{\boldsymbol{\mu}}}_k\}\) and \(\{{\textcolor{vparams}{\boldsymbol{\Sigma}}}_k\}\) to maximize the expected log-likelihood, given the current expected posterior \(\gamma({\textcolor{latent}{z}}_{nk})\):
3. Compute the log-likelihood of the data given the current parameters \({\textcolor{vparams}{\boldsymbol{\pi}}}\), \(\{{\textcolor{vparams}{\boldsymbol{\mu}}}_k\}\) and \(\{{\textcolor{vparams}{\boldsymbol{\Sigma}}}_k\}\)

Important:

EM is guaranteed to converge to a local maximum, not necessarily to the global maximum.
EM is sensitive to the initialization of the parameters (try different initializations and choose the one that gives the highest log-likelihood)

Gaussian Mixture Models

After the EM algorithm converges, we can the learned parameters to:

Generate new data points by sampling from the mixture model.
Compute the likelihood of new data points.
Compute the posterior distribution of the latent variables given new data points (e.g., clustering).

Gaussian Mixture Models

Pros:

Simple and easy to implement.
Can model any distribution as a mixture of Gaussians.
One model can be used for different tasks (e.g., clustering, density estimation, generation).

Cons:

Sensitive to the initialization of the parameters, can get stuck in local minima.
The number of clusters \(K\) must be specified in advance.
The number of clusters \(K\) must be small, otherwise the model will overfit the data.

Question: Can we be Bayesian about the number of clusters \(K\)? Yes, it’s called Dirichlet Process Mixture Model (DPMM)

Dimensionality reduction as a generative model

Linear latent variable models

Linear latent variable models are a class of generative models that assume that the data is generated from a linear combination of latent variables.

We will:

Review the Principal Component Analysis (PCA) algorithm.
Introduce the (probabilistic) PCA as a linear latent variable model.

Principal Component Analysis

Assume we have a dataset \({\textcolor{input}{\boldsymbol{X}}}= \{{\textcolor{input}{\boldsymbol{x}}}_1, \ldots, {\textcolor{input}{\boldsymbol{x}}}_N\}\), where \({\textcolor{input}{\boldsymbol{x}}}_n\in{\mathbb{R}}^{D}\).
We want to find a low-dimensional representation of the data, i.e., we want to find a mapping \({\textcolor{input}{\boldsymbol{x}}}_n \mapsto {\textcolor{latent}{\boldsymbol{z}}}_n\) such that \({\textcolor{latent}{\boldsymbol{z}}}_n\in{\mathbb{R}}^{K}\), where \(K < D\).
Choose the mapping such that the variance of the projected data is maximized, i.e., we want to find the mapping that maximizes the variance of the data in the low-dimensional space.

How to find the direction of maximum variance?

Let’s start by projecting the data on 1D subspace \({\textcolor{latent}{z}}= {\textcolor{params}{\boldsymbol{w}}}^\top{\textcolor{input}{\boldsymbol{x}}}\), where \({\textcolor{params}{\boldsymbol{w}}}\in{\mathbb{R}}^{D}\).
We only want the direction, so we can assume that \(\norm{{\textcolor{params}{\boldsymbol{w}}}} = 1\).
Given data \({\textcolor{input}{\boldsymbol{X}}}= \{{\textcolor{input}{\boldsymbol{x}}}_1, \ldots, {\textcolor{input}{\boldsymbol{x}}}_N\}\), we maximize the variance of the projected data \({\textcolor{latent}{\boldsymbol{z}}}= {\textcolor{input}{\boldsymbol{X}}}{\textcolor{params}{\boldsymbol{w}}}\) constrained to \(\norm{{\textcolor{params}{\boldsymbol{w}}}} = 1\): \[ \arg\max_{{\textcolor{params}{\boldsymbol{w}}}} \norm{{\textcolor{latent}{\boldsymbol{z}}}}^2 \qquad \text{s.t.} \quad \norm{{\textcolor{params}{\boldsymbol{w}}}}^2 = 1 \]

Assumes that \({\textcolor{input}{\boldsymbol{X}}}\) is centered (i.e., \({\textcolor{input}{\boldsymbol{X}}}\) has zero mean)

Analytic solution

Let’s use the Lagrange multipliers to solve the constrained optimization problem:

\[ \ell({\textcolor{params}{\boldsymbol{w}}}, \lambda) = \norm{{\textcolor{latent}{\boldsymbol{z}}}}^2 - \lambda(\norm{{\textcolor{params}{\boldsymbol{w}}}}^2 - 1) = {\textcolor{params}{\boldsymbol{w}}}^\top{\textcolor{input}{\boldsymbol{X}}}^\top{\textcolor{input}{\boldsymbol{X}}}{\textcolor{params}{\boldsymbol{w}}}- \lambda({\textcolor{params}{\boldsymbol{w}}}^\top{\textcolor{params}{\boldsymbol{w}}}- 1) \]

Compute the partial derivative w.r.t. \({\textcolor{params}{\boldsymbol{w}}}\) and \(\lambda\):

\[ \begin{aligned} \frac{\partial \ell}{\partial {\textcolor{params}{\boldsymbol{w}}}} &= 2{\textcolor{input}{\boldsymbol{X}}}^\top{\textcolor{input}{\boldsymbol{X}}}{\textcolor{params}{\boldsymbol{w}}}- 2\lambda{\textcolor{params}{\boldsymbol{w}}}= 0 \quad \Rightarrow \quad {\textcolor{input}{\boldsymbol{X}}}^\top{\textcolor{input}{\boldsymbol{X}}}{\textcolor{params}{\boldsymbol{w}}}= \lambda{\textcolor{params}{\boldsymbol{w}}}\\ \frac{\partial \ell}{\partial \lambda} &= -({\textcolor{params}{\boldsymbol{w}}}^\top{\textcolor{params}{\boldsymbol{w}}}- 1) = 0 \end{aligned} \]

This is an eigenvalue problem, where \(\lambda\) is the eigenvalue and \({\textcolor{params}{\boldsymbol{w}}}\) is the eigenvector of the covariance matrix \({\textcolor{input}{\boldsymbol{X}}}^\top{\textcolor{input}{\boldsymbol{X}}}\).

Analytic solution

But which eigenvalue?

The eigenvalue \(\lambda\) is the variance of the projected data \({\textcolor{latent}{\boldsymbol{z}}}\): \(\lambda = {\textcolor{params}{\boldsymbol{w}}}^\top{\textcolor{input}{\boldsymbol{X}}}^\top{\textcolor{input}{\boldsymbol{X}}}{\textcolor{params}{\boldsymbol{w}}}\)
If we want to maximize the variance, we need to find the largest eigenvalue of the covariance matrix \({\textcolor{input}{\boldsymbol{X}}}^\top{\textcolor{input}{\boldsymbol{X}}}\).
The corresponding eigenvector is the direction of maximum variance: \(\widehat{\textcolor{params}{\boldsymbol{w}}}\)

Probabilistic PCA

PCA can be seen as a linear continuous latent variable model, where the latent variable \({\textcolor{latent}{\boldsymbol{z}}}\) is a linear combination of the observed data \({\textcolor{input}{\boldsymbol{x}}}\).
Question: Can we approach PCA from a probabilistic point of view? Answer: Yes!

Recall the continuous latent variable model:

\[ p({\textcolor{input}{\boldsymbol{x}}}) = \int p({\textcolor{input}{\boldsymbol{x}}}, {\textcolor{latent}{\boldsymbol{z}}}) \dd{\textcolor{latent}{\boldsymbol{z}}}= \int p({\textcolor{input}{\boldsymbol{x}}}\mid{\textcolor{latent}{\boldsymbol{z}}})p({\textcolor{latent}{\boldsymbol{z}}}) \dd{\textcolor{latent}{\boldsymbol{z}}} \]

PPCA Modeling Assumptions

The generative model works as follows:

\[ {\textcolor{input}{\boldsymbol{x}}}= {\textcolor{params}{\boldsymbol{W}}}{\textcolor{latent}{\boldsymbol{z}}}+ {\textcolor{vparams}{\boldsymbol{\mu}}}+ {\textcolor{noise}{\boldsymbol{\varepsilon}}} \]

where:

\({\textcolor{params}{\boldsymbol{W}}}\in{\mathbb{R}}^{D\times K}\) is the weight matrix
\({\textcolor{latent}{\boldsymbol{z}}}\in{\mathbb{R}}^{K}\) is the latent variable
\({\textcolor{vparams}{\boldsymbol{\mu}}}\in{\mathbb{R}}^{D}\) is the mean of the data
\({\textcolor{noise}{\boldsymbol{\varepsilon}}}\in{\mathbb{R}}^{D}\) is the noise

How can we treat these variables?

PPCA Modeling Assumptions

The latent variable \({\textcolor{latent}{\boldsymbol{z}}}\) is assumed to be a standard Gaussian distribution: \[ p({\textcolor{latent}{\boldsymbol{z}}}) = {\mathcal{N}}({\textcolor{latent}{\boldsymbol{z}}}; 0, {\boldsymbol{I}}) \]
The noise \({\textcolor{noise}{\boldsymbol{\varepsilon}}}\) is assumed to be a Gaussian distribution with zero mean and covariance matrix \(\sigma^2{\boldsymbol{I}}\): \[ p({\textcolor{noise}{\boldsymbol{\varepsilon}}}) = {\mathcal{N}}({\textcolor{noise}{\boldsymbol{\varepsilon}}}; 0, \sigma^2{\boldsymbol{I}}) \]

The observed data \({\textcolor{input}{\boldsymbol{x}}}\) conditioned on the latent variable \({\textcolor{latent}{\boldsymbol{z}}}\) is a Gaussian distribution \[ p({\textcolor{input}{\boldsymbol{x}}}\mid {\textcolor{latent}{\boldsymbol{z}}}) = {\mathcal{N}}({\textcolor{input}{\boldsymbol{x}}}; {\textcolor{params}{\boldsymbol{W}}}{\textcolor{latent}{\boldsymbol{z}}}+ {\textcolor{vparams}{\boldsymbol{\mu}}}, \sigma^2{\boldsymbol{I}}) \]
The marginal distribution of the data \({\textcolor{input}{\boldsymbol{x}}}\) is obtained by integrating out the latent variable \({\textcolor{latent}{\boldsymbol{z}}}\):

\[ p({\textcolor{input}{\boldsymbol{x}}}) = \int p({\textcolor{input}{\boldsymbol{x}}}\mid{\textcolor{latent}{\boldsymbol{z}}})p({\textcolor{latent}{\boldsymbol{z}}}) \dd{\textcolor{latent}{\boldsymbol{z}}}= \int {\mathcal{N}}({\textcolor{input}{\boldsymbol{x}}}; {\textcolor{params}{\boldsymbol{W}}}{\textcolor{latent}{\boldsymbol{z}}}+ {\textcolor{vparams}{\boldsymbol{\mu}}}, \sigma^2{\boldsymbol{I}}){\mathcal{N}}({\textcolor{latent}{\boldsymbol{z}}}; 0, {\boldsymbol{I}}) \dd{\textcolor{latent}{\boldsymbol{z}}} \]

It’s a Gaussian distribution \({\mathcal{N}}(?, ?)\).

PPCA Modeling Assumptions

For the mean, we have:

\[ \begin{aligned} \mathbb{E}{[{\textcolor{input}{\boldsymbol{x}}}]} &= \mathbb{E}{[{\textcolor{params}{\boldsymbol{W}}}{\textcolor{latent}{\boldsymbol{z}}}+ {\textcolor{vparams}{\boldsymbol{\mu}}}+ {\textcolor{noise}{\boldsymbol{\varepsilon}}}]} \\ &= {\textcolor{params}{\boldsymbol{W}}}\mathbb{E}{[{\textcolor{latent}{\boldsymbol{z}}}]} + {\textcolor{vparams}{\boldsymbol{\mu}}}+ \mathbb{E}{[{\textcolor{noise}{\boldsymbol{\varepsilon}}}]} \\ &= {\textcolor{params}{\boldsymbol{W}}}\cdot 0 + {\textcolor{vparams}{\boldsymbol{\mu}}}+ 0 = {\textcolor{vparams}{\boldsymbol{\mu}}} \end{aligned} \]

For the covariance, we have:

\[ \begin{aligned} \text{Cov}({\textcolor{input}{\boldsymbol{x}}}) &= \mathbb{E}[({\textcolor{input}{\boldsymbol{x}}}- {\textcolor{vparams}{\boldsymbol{\mu}}})({\textcolor{input}{\boldsymbol{x}}}- {\textcolor{vparams}{\boldsymbol{\mu}}})^\top] \\ &= \mathbb{E}[({\textcolor{params}{\boldsymbol{W}}}{\textcolor{latent}{\boldsymbol{z}}}+ {\textcolor{noise}{\boldsymbol{\varepsilon}}})({\textcolor{params}{\boldsymbol{W}}}{\textcolor{latent}{\boldsymbol{z}}}+ {\textcolor{noise}{\boldsymbol{\varepsilon}}})^\top] \\ &= \mathbb{E}[{\textcolor{params}{\boldsymbol{W}}}{\textcolor{latent}{\boldsymbol{z}}}{\textcolor{latent}{\boldsymbol{z}}}^\top{\textcolor{params}{\boldsymbol{W}}}^\top + {\textcolor{noise}{\boldsymbol{\varepsilon}}}{\textcolor{noise}{\boldsymbol{\varepsilon}}}^\top + {\textcolor{params}{\boldsymbol{W}}}{\textcolor{latent}{\boldsymbol{z}}}{\textcolor{noise}{\boldsymbol{\varepsilon}}}^\top + {\textcolor{noise}{\boldsymbol{\varepsilon}}}{\textcolor{latent}{\boldsymbol{z}}}^\top{\textcolor{params}{\boldsymbol{W}}}] \\ &= {\textcolor{params}{\boldsymbol{W}}}\mathbb{E}[{\textcolor{latent}{\boldsymbol{z}}}{\textcolor{latent}{\boldsymbol{z}}}^\top]{\textcolor{params}{\boldsymbol{W}}}^\top + \mathbb{E}[{\textcolor{noise}{\boldsymbol{\varepsilon}}}{\textcolor{noise}{\boldsymbol{\varepsilon}}}^\top] + {\textcolor{params}{\boldsymbol{W}}}\mathbb{E}[{\textcolor{latent}{\boldsymbol{z}}}]\mathbb{E}[{\textcolor{noise}{\boldsymbol{\varepsilon}}}^\top] + \mathbb{E}[{\textcolor{noise}{\boldsymbol{\varepsilon}}}]\mathbb{E}[{\textcolor{latent}{\boldsymbol{z}}}^\top]{\textcolor{params}{\boldsymbol{W}}}\\ &= {\textcolor{params}{\boldsymbol{W}}}\cdot {\boldsymbol{I}}\cdot{\textcolor{params}{\boldsymbol{W}}}^\top + \sigma^2{\boldsymbol{I}}+ 0 + 0 \\ &= {\textcolor{params}{\boldsymbol{W}}}{\textcolor{params}{\boldsymbol{W}}}^\top + \sigma^2{\boldsymbol{I}} \end{aligned} \]

PPCA Solution

Now, we need to find the parameters \({\textcolor{params}{\boldsymbol{W}}}\), \({\textcolor{vparams}{\boldsymbol{\mu}}}\) and \(\sigma^2\) that maximize the log-likelihood of the data: \[ \begin{aligned} \log p({\textcolor{input}{\boldsymbol{X}}}; {\textcolor{params}{\boldsymbol{W}}}, {\textcolor{vparams}{\boldsymbol{\mu}}}, \sigma^2) &= \sum_{n=1}^N \log p({\textcolor{input}{\boldsymbol{x}}}_n; {\textcolor{params}{\boldsymbol{W}}}, {\textcolor{vparams}{\boldsymbol{\mu}}}, \sigma^2) = \sum_{n=1}^N \log {\mathcal{N}}({\textcolor{input}{\boldsymbol{x}}}_n; {\textcolor{vparams}{\boldsymbol{\mu}}}, {\textcolor{params}{\boldsymbol{W}}}{\textcolor{params}{\boldsymbol{W}}}^\top + \sigma^2{\boldsymbol{I}})\\ &= -\frac{NK}{2}\log(2\pi) - \frac{N}{2}\log(\det({\textcolor{params}{\boldsymbol{W}}}{\textcolor{params}{\boldsymbol{W}}}^\top + \sigma^2{\boldsymbol{I}})) - \\ &-\frac{1}{2}\sum_{n=1}^N ({\textcolor{input}{\boldsymbol{x}}}_n - {\textcolor{vparams}{\boldsymbol{\mu}}})^\top({\textcolor{params}{\boldsymbol{W}}}{\textcolor{params}{\boldsymbol{W}}}^\top + \sigma^2{\boldsymbol{I}})^{-1}({\textcolor{input}{\boldsymbol{x}}}_n - {\textcolor{vparams}{\boldsymbol{\mu}}}) \end{aligned} \]

Again, compute the partial derivative w.r.t. \({\textcolor{params}{\boldsymbol{W}}}\), \({\textcolor{vparams}{\boldsymbol{\mu}}}\) and \(\sigma^2\) and set them to zero.

PPCA Solution

Let’s define the sample covariance matrix \({\boldsymbol{S}}= {\textcolor{input}{\boldsymbol{X}}}{\textcolor{input}{\boldsymbol{X}}}^\top\). Solutions for \({\textcolor{params}{\boldsymbol{W}}}\), \({\textcolor{vparams}{\boldsymbol{\mu}}}\) and \(\sigma^2\) can be found in closed form, no need to use the EM algorithm.

For the mean, we have: \({\textcolor{vparams}{\boldsymbol{\mu}}}= \frac{1}{N}\sum_{n=1}^N {\textcolor{input}{\boldsymbol{x}}}_n\)

For the noise variance, we have: \(\sigma^2 = \frac{1}{D-K} \sum_{j=K+1}^D \lambda_j\), where \(\lambda_j\) are the eigenvalues of \({\boldsymbol{S}}\)

For the weight matrix, we have: \({\textcolor{params}{\boldsymbol{W}}}= {\boldsymbol{U}}_K({\boldsymbol{\Lambda}}_K - \sigma^2{\boldsymbol{I}})^{1/2}\), where \({\boldsymbol{U}}_K\) are the first \(K\) eigenvectors \({\boldsymbol{S}}\) and \({\boldsymbol{\Lambda}}_K\) is the diagonal matrix of the first \(K\) eigenvalues.

Because everything is Gaussian, we can also compute the posterior distribution of the latent variable \({\textcolor{latent}{\boldsymbol{z}}}\) given the data \({\textcolor{input}{\boldsymbol{x}}}\):

\[ p({\textcolor{latent}{\boldsymbol{z}}}_n\mid{\textcolor{input}{\boldsymbol{x}}}_n) = {\mathcal{N}}({\textcolor{latent}{\boldsymbol{z}}}_n; {\boldsymbol{C}}^{-1}{\textcolor{params}{\boldsymbol{W}}}^\top({\textcolor{input}{\boldsymbol{x}}}_n - {\textcolor{vparams}{\boldsymbol{\mu}}}), \sigma^2{\boldsymbol{C}}^{-1}) \]

with \({\boldsymbol{C}}= {\textcolor{params}{\boldsymbol{W}}}^\top{\textcolor{params}{\boldsymbol{W}}}+ \sigma^2{\boldsymbol{I}}\)

Is the PPCA solution the same as PCA?

The PPCA solution is more general than PCA, because it assumes that the data is generated from a Gaussian distribution with noise.
However, when the noise variance \(\sigma^2\) goes to zero, the PPCA solution converges to the PCA solution.

Note: The principal components found by PCA and PPCA span the same subspace, but the actual components (i.e., the directions) may differ by an (un-identifiable) rotation matrix \({\boldsymbol{R}}\)