Advanced Statistical Inference
EURECOM
\[ \require{physics} \definecolor{input}{rgb}{0.42, 0.55, 0.74} \definecolor{params}{rgb}{0.51,0.70,0.40} \definecolor{output}{rgb}{0.843, 0.608, 0} \definecolor{vparams}{rgb}{0.58, 0, 0.83} \definecolor{noise}{rgb}{0.0, 0.48, 0.65} \definecolor{latent}{rgb}{0.8, 0.0, 0.8} \definecolor{function}{rgb}{0.75, 0.75, 0.12} \]
Our objective is to learn the data distribution \(p({\textcolor{input}{\boldsymbol{x}}})\), but we suppose that each data point \({\textcolor{input}{\boldsymbol{x}}}\) is associated with a latent variable \({\textcolor{latent}{\boldsymbol{z}}}\).
For example, image we want to learn the distribution of images of objects, like the one below:

Each image is a huge vector of pixels \({\textcolor{input}{\boldsymbol{x}}}\in \mathbb{R}^{H \times W \times C}\), where \(H\) is the height, \(W\) is the width and \(C\) is the number of channels (in this case, \(64 \times 64 \times 3\)).
Each image can be described by a set of 6 latent variables: floor, wall and object color, shape, orientation and scale.
Latent variables are unobserved variables, we cannot measure them directly, but we can infer them from the observed data.
In math terms, we model our data distribution as a marginal of the joint distribution of the observed data and the latent variables:
\[ p({\textcolor{input}{\boldsymbol{x}}}) = \int p({\textcolor{input}{\boldsymbol{x}}}, {\textcolor{latent}{\boldsymbol{z}}}) \dd{\textcolor{latent}{\boldsymbol{z}}}\quad \text{or} \quad p({\textcolor{input}{\boldsymbol{x}}}) = \sum_{i} p({\textcolor{input}{\boldsymbol{x}}}, {\textcolor{latent}{\boldsymbol{z}}}_i) \]
where \(p({\textcolor{input}{\boldsymbol{x}}}, {\textcolor{latent}{\boldsymbol{z}}})\) is the joint distribution of the observed data and the latent variables.
In mixture models, we assume that the data is generated from a mixture of several distributions
\[ p({\textcolor{input}{\boldsymbol{x}}}) = \sum_i p({\textcolor{input}{\boldsymbol{x}}}, {\textcolor{latent}{\boldsymbol{z}}}_i) = \sum_i p({\textcolor{input}{\boldsymbol{x}}}\mid{\textcolor{latent}{\boldsymbol{z}}}_i)p({\textcolor{latent}{\boldsymbol{z}}}_i) \]
where \(p({\textcolor{latent}{\boldsymbol{z}}})\) is the mixture distribution and \(p({\textcolor{input}{\boldsymbol{x}}}\mid{\textcolor{latent}{\boldsymbol{z}}})\) is the conditional distribution of the data given the latent variables.
1-hot encoding of the discrete latent variable \({\textcolor{latent}{z}}_k \in \{0,1\}\), with prior: \[ p({\textcolor{latent}{z}}_k=1) = \textcolor{vparams}{\pi}_k \quad \text{and} \quad \sum_k \textcolor{vparams}{\pi}_k = 1 \]
The conditional distribution of the data given the latent variable is: \[ p({\textcolor{input}{\boldsymbol{x}}}\mid{\textcolor{latent}{z}}_k=1) = {\mathcal{N}}({\textcolor{input}{\boldsymbol{x}}}; {\textcolor{vparams}{\boldsymbol{\mu}}}_k, {\textcolor{vparams}{\boldsymbol{\Sigma}}}_k) \]
The joint distribution of the data and the latent variable is: \[ p({\textcolor{input}{\boldsymbol{x}}}, {\textcolor{latent}{z}}_k=1) = p({\textcolor{input}{\boldsymbol{x}}}\mid{\textcolor{latent}{z}}_k=1)p({\textcolor{latent}{z}}_k=1) = \textcolor{vparams}{\pi}_k{\mathcal{N}}({\textcolor{input}{\boldsymbol{x}}}; {\textcolor{vparams}{\boldsymbol{\mu}}}_k, {\textcolor{vparams}{\boldsymbol{\Sigma}}}_k) \]
The marginal distribution of the data is: \[ p({\textcolor{input}{\boldsymbol{x}}}) = \sum_k p({\textcolor{input}{\boldsymbol{x}}}, {\textcolor{latent}{z}}_k=1) = \sum_k \textcolor{vparams}{\pi}_k{\mathcal{N}}({\textcolor{input}{\boldsymbol{x}}}; {\textcolor{vparams}{\boldsymbol{\mu}}}_k, {\textcolor{vparams}{\boldsymbol{\Sigma}}}_k) \]
We are interested in the posterior distribution of a cluster assignment given the data:
\[ \begin{aligned} p({\textcolor{latent}{z}}_k=1\mid{\textcolor{input}{\boldsymbol{x}}}) &= \frac{p({\textcolor{input}{\boldsymbol{x}}}\mid{\textcolor{latent}{z}}_k=1)p({\textcolor{latent}{z}}_k=1)}{p({\textcolor{input}{\boldsymbol{x}}})} \\ &= \frac{p({\textcolor{input}{\boldsymbol{x}}}\mid{\textcolor{latent}{z}}_k=1)p({\textcolor{latent}{z}}_k=1)}{\sum_{j} p({\textcolor{input}{\boldsymbol{x}}}\mid{\textcolor{latent}{z}}_j=1)p({\textcolor{latent}{z}}_j=1)} \\ &= \frac{\textcolor{vparams}{\pi}_k{\mathcal{N}}({\textcolor{input}{\boldsymbol{x}}}; {\textcolor{vparams}{\boldsymbol{\mu}}}_k, {\textcolor{vparams}{\boldsymbol{\Sigma}}}_k)}{\sum_{j} \textcolor{vparams}{\pi}_j{\mathcal{N}}({\textcolor{input}{\boldsymbol{x}}}; {\textcolor{vparams}{\boldsymbol{\mu}}}_j, {\textcolor{vparams}{\boldsymbol{\Sigma}}}_j)} := \gamma({\textcolor{latent}{z}}_k) \end{aligned} \]
\(\gamma({\textcolor{latent}{z}}_k)\) is the responsibility of cluster \(k\) for data point \({\textcolor{input}{\boldsymbol{x}}}\).
Given a dataset \({\textcolor{input}{\boldsymbol{X}}}= \{{\textcolor{input}{\boldsymbol{x}}}_1, \ldots, {\textcolor{input}{\boldsymbol{x}}}_N\}\) i.i.d, the log-likelihood of the data is:
\[ \begin{aligned} \log p({\textcolor{input}{\boldsymbol{X}}}\mid {\textcolor{vparams}{\boldsymbol{\pi}}}, \{{\textcolor{vparams}{\boldsymbol{\mu}}}_k\}, \{{\textcolor{vparams}{\boldsymbol{\Sigma}}}_k\}) &= \log \prod_{i=1}^N p({\textcolor{input}{\boldsymbol{x}}}_i) \\ &= \sum_{i=1}^N \log p({\textcolor{input}{\boldsymbol{x}}}_i\mid {\textcolor{vparams}{\boldsymbol{\pi}}}, \{{\textcolor{vparams}{\boldsymbol{\mu}}}_k\}, \{{\textcolor{vparams}{\boldsymbol{\Sigma}}}_k\}) \\ &= \sum_{i=1}^N \log \sum_{k=1}^K p({\textcolor{input}{\boldsymbol{x}}}_i\mid{\textcolor{latent}{z}}_k=1)p({\textcolor{latent}{z}}_k=1) \\ &= \sum_{i=1}^N \log \sum_{k=1}^K \textcolor{vparams}{\pi}_k{\mathcal{N}}({\textcolor{input}{\boldsymbol{x}}}_i; {\textcolor{vparams}{\boldsymbol{\mu}}}_k, {\textcolor{vparams}{\boldsymbol{\Sigma}}}_k) \\ \end{aligned} \]
Warning
We cannot simplify it further, because the log and the sum do not commute. How can we optimize it?
We need to maximize the log-likelihood of the data w.r.t. the parameters \({\textcolor{vparams}{\boldsymbol{\pi}}}\), \(\{{\textcolor{vparams}{\boldsymbol{\mu}}}_k\}\) and \(\{{\textcolor{vparams}{\boldsymbol{\Sigma}}}_k\}\), for \(k=1,\ldots,K\)
\[ \log p({\textcolor{input}{\boldsymbol{X}}}) = \sum_{i=1}^N \log \sum_{k=1}^K \textcolor{vparams}{\pi}_k{\mathcal{N}}({\textcolor{input}{\boldsymbol{x}}}_i; {\textcolor{vparams}{\boldsymbol{\mu}}}_k, {\textcolor{vparams}{\boldsymbol{\Sigma}}}_k) \]
The problem is not convex.
No closed-form solution! Stationary points depend on the posterior \(\gamma({\textcolor{latent}{z}}_{nk})\) for each data point \({\textcolor{input}{\boldsymbol{x}}}_n\).
… but we will see that we can write \(\textcolor{vparams}{\pi}_k\), \({\textcolor{vparams}{\boldsymbol{\mu}}}_k\) and \({\textcolor{vparams}{\boldsymbol{\Sigma}}}_k\) in terms of \(\gamma({\textcolor{latent}{z}}_{nk})\).
We can find the local minima by iterative algorithm: alternate update of the expected posterior \(\gamma({\textcolor{latent}{z}}_{nk})\) and the maximized parameters \(\textcolor{vparams}{\pi}_k\), \({\textcolor{vparams}{\boldsymbol{\mu}}}_k\) and \({\textcolor{vparams}{\boldsymbol{\Sigma}}}_k\).
Results of results for Gaussian distributions:
Assume \(p({\textcolor{input}{\boldsymbol{x}}}) = {\mathcal{N}}({\textcolor{input}{\boldsymbol{x}}}; {\textcolor{vparams}{\boldsymbol{\mu}}}, {\textcolor{vparams}{\boldsymbol{\Sigma}}}) = (2\pi)^{-\frac{d}{2}}\det({\textcolor{vparams}{\boldsymbol{\Sigma}}})^{-\frac{1}{2}}\exp\left(-\frac{1}{2}({\textcolor{input}{\boldsymbol{x}}}- {\textcolor{vparams}{\boldsymbol{\mu}}})^\top{\textcolor{vparams}{\boldsymbol{\Sigma}}}^{-1}({\textcolor{input}{\boldsymbol{x}}}- {\textcolor{vparams}{\boldsymbol{\mu}}})\right)\).
Then,
\[ \frac{\partial p({\textcolor{input}{\boldsymbol{x}}})}{\partial {\textcolor{vparams}{\boldsymbol{\mu}}}} = {\mathcal{N}}({\textcolor{input}{\boldsymbol{x}}}; {\textcolor{vparams}{\boldsymbol{\mu}}}, {\textcolor{vparams}{\boldsymbol{\Sigma}}})({\textcolor{input}{\boldsymbol{x}}}- {\textcolor{vparams}{\boldsymbol{\mu}}}){\textcolor{vparams}{\boldsymbol{\Sigma}}}^{-1} \]
and
\[ \frac{\partial p({\textcolor{input}{\boldsymbol{x}}})}{\partial {\textcolor{vparams}{\boldsymbol{\Sigma}}}} = -\frac{1}{2}{\mathcal{N}}({\textcolor{input}{\boldsymbol{x}}}; {\textcolor{vparams}{\boldsymbol{\mu}}}, {\textcolor{vparams}{\boldsymbol{\Sigma}}})\left({\textcolor{vparams}{\boldsymbol{\Sigma}}}^{-1} - {\textcolor{vparams}{\boldsymbol{\Sigma}}}^{-1}({\textcolor{input}{\boldsymbol{x}}}- {\textcolor{vparams}{\boldsymbol{\mu}}})({\textcolor{input}{\boldsymbol{x}}}- {\textcolor{vparams}{\boldsymbol{\mu}}})^\top{\textcolor{vparams}{\boldsymbol{\Sigma}}}^{-1} \right) \]
\[ \begin{aligned} \arg\max_{{\textcolor{vparams}{\boldsymbol{\mu}}}_k} p({\textcolor{input}{\boldsymbol{X}}}\mid {\textcolor{vparams}{\boldsymbol{\pi}}}, \{{\textcolor{vparams}{\boldsymbol{\mu}}}_j\}, \{{\textcolor{vparams}{\boldsymbol{\Sigma}}}_j\}) &= \arg\max_{{\textcolor{vparams}{\boldsymbol{\mu}}}_k} \sum_{n=1}^N \log p({\textcolor{input}{\boldsymbol{x}}}_n\mid {\textcolor{vparams}{\boldsymbol{\pi}}}, \{{\textcolor{vparams}{\boldsymbol{\mu}}}_j\}, \{{\textcolor{vparams}{\boldsymbol{\Sigma}}}_j\}) \end{aligned} \]
Compute the partial derivative w.r.t. \({\textcolor{vparams}{\boldsymbol{\mu}}}_k\):
\[ \begin{aligned} \frac{\partial}{\partial {\textcolor{vparams}{\boldsymbol{\mu}}}_k} \log p({\textcolor{input}{\boldsymbol{X}}}\mid {\textcolor{vparams}{\boldsymbol{\pi}}}, \{{\textcolor{vparams}{\boldsymbol{\mu}}}_k\}, \{{\textcolor{vparams}{\boldsymbol{\Sigma}}}_k\}) &= \sum_{n=1}^N \frac{1}{p({\textcolor{input}{\boldsymbol{x}}}_n\mid {\textcolor{vparams}{\boldsymbol{\pi}}}, \{{\textcolor{vparams}{\boldsymbol{\mu}}}_k\}, \{{\textcolor{vparams}{\boldsymbol{\Sigma}}}_k\})}\frac{\partial p({\textcolor{input}{\boldsymbol{x}}}_n\mid {\textcolor{vparams}{\boldsymbol{\pi}}}, \{{\textcolor{vparams}{\boldsymbol{\mu}}}_j\}, \{{\textcolor{vparams}{\boldsymbol{\Sigma}}}_j\})}{\partial {\textcolor{vparams}{\boldsymbol{\mu}}}_k} \\ &= \sum_{n=1}^N \frac{1}{p({\textcolor{input}{\boldsymbol{x}}}_n\mid {\textcolor{vparams}{\boldsymbol{\pi}}}, \{{\textcolor{vparams}{\boldsymbol{\mu}}}_k\}, \{{\textcolor{vparams}{\boldsymbol{\Sigma}}}_k\})}\frac{\partial \textcolor{vparams}{\pi}_k{\mathcal{N}}({\textcolor{input}{\boldsymbol{x}}}_n; {\textcolor{vparams}{\boldsymbol{\mu}}}_k, {\textcolor{vparams}{\boldsymbol{\Sigma}}}_k)}{\partial {\textcolor{vparams}{\boldsymbol{\mu}}}_k} \\ &= \sum_{n=1}^N \frac{\textcolor{vparams}{\pi}_k{\mathcal{N}}({\textcolor{input}{\boldsymbol{x}}}_n; {\textcolor{vparams}{\boldsymbol{\mu}}}_k, {\textcolor{vparams}{\boldsymbol{\Sigma}}}_k)}{\sum_{j=1}^K \textcolor{vparams}{\pi}_j{\mathcal{N}}({\textcolor{input}{\boldsymbol{x}}}_n; {\textcolor{vparams}{\boldsymbol{\mu}}}_j, {\textcolor{vparams}{\boldsymbol{\Sigma}}}_j)}({\textcolor{input}{\boldsymbol{x}}}_n - {\textcolor{vparams}{\boldsymbol{\mu}}}_k){\textcolor{vparams}{\boldsymbol{\Sigma}}}_k^{-1} \\ &= \sum_{n=1}^N \gamma({\textcolor{latent}{z}}_{nk})({\textcolor{input}{\boldsymbol{x}}}_n - {\textcolor{vparams}{\boldsymbol{\mu}}}_k){\textcolor{vparams}{\boldsymbol{\Sigma}}}_k^{-1} = 0 \quad \Rightarrow {\textcolor{vparams}{\boldsymbol{\mu}}}_k = \frac{\sum_{n=1}^N \gamma({\textcolor{latent}{z}}_{nk}){\textcolor{input}{\boldsymbol{x}}}_n}{\sum_{n=1}^N \gamma({\textcolor{latent}{z}}_{nk})} \end{aligned} \]
Same for the covariance:
Same for the weights \({\textcolor{vparams}{\boldsymbol{\pi}}}\), but we have to use the constraint \(\sum_{k=1}^K \textcolor{vparams}{\pi}_k = 1\).
Algorithm:
Important:
After the EM algorithm converges, we can the learned parameters to:
Pros:
Cons:
Question: Can we be Bayesian about the number of clusters \(K\)? Yes, it’s called Dirichlet Process Mixture Model (DPMM)
Linear latent variable models are a class of generative models that assume that the data is generated from a linear combination of latent variables.
We will:
Assume we have a dataset \({\textcolor{input}{\boldsymbol{X}}}= \{{\textcolor{input}{\boldsymbol{x}}}_1, \ldots, {\textcolor{input}{\boldsymbol{x}}}_N\}\), where \({\textcolor{input}{\boldsymbol{x}}}_n\in{\mathbb{R}}^{D}\).
We want to find a low-dimensional representation of the data, i.e., we want to find a mapping \({\textcolor{input}{\boldsymbol{x}}}_n \mapsto {\textcolor{latent}{\boldsymbol{z}}}_n\) such that \({\textcolor{latent}{\boldsymbol{z}}}_n\in{\mathbb{R}}^{K}\), where \(K < D\).
Choose the mapping such that the variance of the projected data is maximized, i.e., we want to find the mapping that maximizes the variance of the data in the low-dimensional space.
Let’s start by projecting the data on 1D subspace \({\textcolor{latent}{z}}= {\textcolor{params}{\boldsymbol{w}}}^\top{\textcolor{input}{\boldsymbol{x}}}\), where \({\textcolor{params}{\boldsymbol{w}}}\in{\mathbb{R}}^{D}\).
We only want the direction, so we can assume that \(\norm{{\textcolor{params}{\boldsymbol{w}}}} = 1\).
Given data \({\textcolor{input}{\boldsymbol{X}}}= \{{\textcolor{input}{\boldsymbol{x}}}_1, \ldots, {\textcolor{input}{\boldsymbol{x}}}_N\}\), we maximize the variance of the projected data \({\textcolor{latent}{\boldsymbol{z}}}= {\textcolor{input}{\boldsymbol{X}}}{\textcolor{params}{\boldsymbol{w}}}\) constrained to \(\norm{{\textcolor{params}{\boldsymbol{w}}}} = 1\): \[ \arg\max_{{\textcolor{params}{\boldsymbol{w}}}} \norm{{\textcolor{latent}{\boldsymbol{z}}}}^2 \qquad \text{s.t.} \quad \norm{{\textcolor{params}{\boldsymbol{w}}}}^2 = 1 \]
Assumes that \({\textcolor{input}{\boldsymbol{X}}}\) is centered (i.e., \({\textcolor{input}{\boldsymbol{X}}}\) has zero mean)
\[ \ell({\textcolor{params}{\boldsymbol{w}}}, \lambda) = \norm{{\textcolor{latent}{\boldsymbol{z}}}}^2 - \lambda(\norm{{\textcolor{params}{\boldsymbol{w}}}}^2 - 1) = {\textcolor{params}{\boldsymbol{w}}}^\top{\textcolor{input}{\boldsymbol{X}}}^\top{\textcolor{input}{\boldsymbol{X}}}{\textcolor{params}{\boldsymbol{w}}}- \lambda({\textcolor{params}{\boldsymbol{w}}}^\top{\textcolor{params}{\boldsymbol{w}}}- 1) \]
\[ \begin{aligned} \frac{\partial \ell}{\partial {\textcolor{params}{\boldsymbol{w}}}} &= 2{\textcolor{input}{\boldsymbol{X}}}^\top{\textcolor{input}{\boldsymbol{X}}}{\textcolor{params}{\boldsymbol{w}}}- 2\lambda{\textcolor{params}{\boldsymbol{w}}}= 0 \quad \Rightarrow \quad {\textcolor{input}{\boldsymbol{X}}}^\top{\textcolor{input}{\boldsymbol{X}}}{\textcolor{params}{\boldsymbol{w}}}= \lambda{\textcolor{params}{\boldsymbol{w}}}\\ \frac{\partial \ell}{\partial \lambda} &= -({\textcolor{params}{\boldsymbol{w}}}^\top{\textcolor{params}{\boldsymbol{w}}}- 1) = 0 \end{aligned} \]
This is an eigenvalue problem, where \(\lambda\) is the eigenvalue and \({\textcolor{params}{\boldsymbol{w}}}\) is the eigenvector of the covariance matrix \({\textcolor{input}{\boldsymbol{X}}}^\top{\textcolor{input}{\boldsymbol{X}}}\).
But which eigenvalue?
The eigenvalue \(\lambda\) is the variance of the projected data \({\textcolor{latent}{\boldsymbol{z}}}\): \(\lambda = {\textcolor{params}{\boldsymbol{w}}}^\top{\textcolor{input}{\boldsymbol{X}}}^\top{\textcolor{input}{\boldsymbol{X}}}{\textcolor{params}{\boldsymbol{w}}}\)
If we want to maximize the variance, we need to find the largest eigenvalue of the covariance matrix \({\textcolor{input}{\boldsymbol{X}}}^\top{\textcolor{input}{\boldsymbol{X}}}\).
The corresponding eigenvector is the direction of maximum variance: \(\widehat{\textcolor{params}{\boldsymbol{w}}}\)
PCA can be seen as a linear continuous latent variable model, where the latent variable \({\textcolor{latent}{\boldsymbol{z}}}\) is a linear combination of the observed data \({\textcolor{input}{\boldsymbol{x}}}\).
Question: Can we approach PCA from a probabilistic point of view? Answer: Yes!
Recall the continuous latent variable model:
\[ p({\textcolor{input}{\boldsymbol{x}}}) = \int p({\textcolor{input}{\boldsymbol{x}}}, {\textcolor{latent}{\boldsymbol{z}}}) \dd{\textcolor{latent}{\boldsymbol{z}}}= \int p({\textcolor{input}{\boldsymbol{x}}}\mid{\textcolor{latent}{\boldsymbol{z}}})p({\textcolor{latent}{\boldsymbol{z}}}) \dd{\textcolor{latent}{\boldsymbol{z}}} \]
\[ {\textcolor{input}{\boldsymbol{x}}}= {\textcolor{params}{\boldsymbol{W}}}{\textcolor{latent}{\boldsymbol{z}}}+ {\textcolor{vparams}{\boldsymbol{\mu}}}+ {\textcolor{noise}{\boldsymbol{\varepsilon}}} \]
where:
How can we treat these variables?
The observed data \({\textcolor{input}{\boldsymbol{x}}}\) conditioned on the latent variable \({\textcolor{latent}{\boldsymbol{z}}}\) is a Gaussian distribution \[ p({\textcolor{input}{\boldsymbol{x}}}\mid {\textcolor{latent}{\boldsymbol{z}}}) = {\mathcal{N}}({\textcolor{input}{\boldsymbol{x}}}; {\textcolor{params}{\boldsymbol{W}}}{\textcolor{latent}{\boldsymbol{z}}}+ {\textcolor{vparams}{\boldsymbol{\mu}}}, \sigma^2{\boldsymbol{I}}) \]
The marginal distribution of the data \({\textcolor{input}{\boldsymbol{x}}}\) is obtained by integrating out the latent variable \({\textcolor{latent}{\boldsymbol{z}}}\):
\[ p({\textcolor{input}{\boldsymbol{x}}}) = \int p({\textcolor{input}{\boldsymbol{x}}}\mid{\textcolor{latent}{\boldsymbol{z}}})p({\textcolor{latent}{\boldsymbol{z}}}) \dd{\textcolor{latent}{\boldsymbol{z}}}= \int {\mathcal{N}}({\textcolor{input}{\boldsymbol{x}}}; {\textcolor{params}{\boldsymbol{W}}}{\textcolor{latent}{\boldsymbol{z}}}+ {\textcolor{vparams}{\boldsymbol{\mu}}}, \sigma^2{\boldsymbol{I}}){\mathcal{N}}({\textcolor{latent}{\boldsymbol{z}}}; 0, {\boldsymbol{I}}) \dd{\textcolor{latent}{\boldsymbol{z}}} \]
It’s a Gaussian distribution \({\mathcal{N}}(?, ?)\).
\[ \begin{aligned} \mathbb{E}{[{\textcolor{input}{\boldsymbol{x}}}]} &= \mathbb{E}{[{\textcolor{params}{\boldsymbol{W}}}{\textcolor{latent}{\boldsymbol{z}}}+ {\textcolor{vparams}{\boldsymbol{\mu}}}+ {\textcolor{noise}{\boldsymbol{\varepsilon}}}]} \\ &= {\textcolor{params}{\boldsymbol{W}}}\mathbb{E}{[{\textcolor{latent}{\boldsymbol{z}}}]} + {\textcolor{vparams}{\boldsymbol{\mu}}}+ \mathbb{E}{[{\textcolor{noise}{\boldsymbol{\varepsilon}}}]} \\ &= {\textcolor{params}{\boldsymbol{W}}}\cdot 0 + {\textcolor{vparams}{\boldsymbol{\mu}}}+ 0 = {\textcolor{vparams}{\boldsymbol{\mu}}} \end{aligned} \]
\[ \begin{aligned} \text{Cov}({\textcolor{input}{\boldsymbol{x}}}) &= \mathbb{E}[({\textcolor{input}{\boldsymbol{x}}}- {\textcolor{vparams}{\boldsymbol{\mu}}})({\textcolor{input}{\boldsymbol{x}}}- {\textcolor{vparams}{\boldsymbol{\mu}}})^\top] \\ &= \mathbb{E}[({\textcolor{params}{\boldsymbol{W}}}{\textcolor{latent}{\boldsymbol{z}}}+ {\textcolor{noise}{\boldsymbol{\varepsilon}}})({\textcolor{params}{\boldsymbol{W}}}{\textcolor{latent}{\boldsymbol{z}}}+ {\textcolor{noise}{\boldsymbol{\varepsilon}}})^\top] \\ &= \mathbb{E}[{\textcolor{params}{\boldsymbol{W}}}{\textcolor{latent}{\boldsymbol{z}}}{\textcolor{latent}{\boldsymbol{z}}}^\top{\textcolor{params}{\boldsymbol{W}}}^\top + {\textcolor{noise}{\boldsymbol{\varepsilon}}}{\textcolor{noise}{\boldsymbol{\varepsilon}}}^\top + {\textcolor{params}{\boldsymbol{W}}}{\textcolor{latent}{\boldsymbol{z}}}{\textcolor{noise}{\boldsymbol{\varepsilon}}}^\top + {\textcolor{noise}{\boldsymbol{\varepsilon}}}{\textcolor{latent}{\boldsymbol{z}}}^\top{\textcolor{params}{\boldsymbol{W}}}] \\ &= {\textcolor{params}{\boldsymbol{W}}}\mathbb{E}[{\textcolor{latent}{\boldsymbol{z}}}{\textcolor{latent}{\boldsymbol{z}}}^\top]{\textcolor{params}{\boldsymbol{W}}}^\top + \mathbb{E}[{\textcolor{noise}{\boldsymbol{\varepsilon}}}{\textcolor{noise}{\boldsymbol{\varepsilon}}}^\top] + {\textcolor{params}{\boldsymbol{W}}}\mathbb{E}[{\textcolor{latent}{\boldsymbol{z}}}]\mathbb{E}[{\textcolor{noise}{\boldsymbol{\varepsilon}}}^\top] + \mathbb{E}[{\textcolor{noise}{\boldsymbol{\varepsilon}}}]\mathbb{E}[{\textcolor{latent}{\boldsymbol{z}}}^\top]{\textcolor{params}{\boldsymbol{W}}}\\ &= {\textcolor{params}{\boldsymbol{W}}}\cdot {\boldsymbol{I}}\cdot{\textcolor{params}{\boldsymbol{W}}}^\top + \sigma^2{\boldsymbol{I}}+ 0 + 0 \\ &= {\textcolor{params}{\boldsymbol{W}}}{\textcolor{params}{\boldsymbol{W}}}^\top + \sigma^2{\boldsymbol{I}} \end{aligned} \]
Now, we need to find the parameters \({\textcolor{params}{\boldsymbol{W}}}\), \({\textcolor{vparams}{\boldsymbol{\mu}}}\) and \(\sigma^2\) that maximize the log-likelihood of the data: \[ \begin{aligned} \log p({\textcolor{input}{\boldsymbol{X}}}; {\textcolor{params}{\boldsymbol{W}}}, {\textcolor{vparams}{\boldsymbol{\mu}}}, \sigma^2) &= \sum_{n=1}^N \log p({\textcolor{input}{\boldsymbol{x}}}_n; {\textcolor{params}{\boldsymbol{W}}}, {\textcolor{vparams}{\boldsymbol{\mu}}}, \sigma^2) = \sum_{n=1}^N \log {\mathcal{N}}({\textcolor{input}{\boldsymbol{x}}}_n; {\textcolor{vparams}{\boldsymbol{\mu}}}, {\textcolor{params}{\boldsymbol{W}}}{\textcolor{params}{\boldsymbol{W}}}^\top + \sigma^2{\boldsymbol{I}})\\ &= -\frac{NK}{2}\log(2\pi) - \frac{N}{2}\log(\det({\textcolor{params}{\boldsymbol{W}}}{\textcolor{params}{\boldsymbol{W}}}^\top + \sigma^2{\boldsymbol{I}})) - \\ &-\frac{1}{2}\sum_{n=1}^N ({\textcolor{input}{\boldsymbol{x}}}_n - {\textcolor{vparams}{\boldsymbol{\mu}}})^\top({\textcolor{params}{\boldsymbol{W}}}{\textcolor{params}{\boldsymbol{W}}}^\top + \sigma^2{\boldsymbol{I}})^{-1}({\textcolor{input}{\boldsymbol{x}}}_n - {\textcolor{vparams}{\boldsymbol{\mu}}}) \end{aligned} \]
Again, compute the partial derivative w.r.t. \({\textcolor{params}{\boldsymbol{W}}}\), \({\textcolor{vparams}{\boldsymbol{\mu}}}\) and \(\sigma^2\) and set them to zero.
Let’s define the sample covariance matrix \({\boldsymbol{S}}= {\textcolor{input}{\boldsymbol{X}}}{\textcolor{input}{\boldsymbol{X}}}^\top\). Solutions for \({\textcolor{params}{\boldsymbol{W}}}\), \({\textcolor{vparams}{\boldsymbol{\mu}}}\) and \(\sigma^2\) can be found in closed form, no need to use the EM algorithm.
Because everything is Gaussian, we can also compute the posterior distribution of the latent variable \({\textcolor{latent}{\boldsymbol{z}}}\) given the data \({\textcolor{input}{\boldsymbol{x}}}\):
\[ p({\textcolor{latent}{\boldsymbol{z}}}_n\mid{\textcolor{input}{\boldsymbol{x}}}_n) = {\mathcal{N}}({\textcolor{latent}{\boldsymbol{z}}}_n; {\boldsymbol{C}}^{-1}{\textcolor{params}{\boldsymbol{W}}}^\top({\textcolor{input}{\boldsymbol{x}}}_n - {\textcolor{vparams}{\boldsymbol{\mu}}}), \sigma^2{\boldsymbol{C}}^{-1}) \]
with \({\boldsymbol{C}}= {\textcolor{params}{\boldsymbol{W}}}^\top{\textcolor{params}{\boldsymbol{W}}}+ \sigma^2{\boldsymbol{I}}\)
Note: The principal components found by PCA and PPCA span the same subspace, but the actual components (i.e., the directions) may differ by an (un-identifiable) rotation matrix \({\boldsymbol{R}}\)