Naive Bayes Classifier
EURECOM
\[ \require{physics} \definecolor{input}{rgb}{0.42, 0.55, 0.74} \definecolor{params}{rgb}{0.51,0.70,0.40} \definecolor{output}{rgb}{0.843, 0.608, 0} \definecolor{vparams}{rgb}{0.58, 0, 0.83} \definecolor{noise}{rgb}{0.0, 0.48, 0.65} \definecolor{latent}{rgb}{0.8, 0.0, 0.8} \]
Naive Bayes is an example of a generative classifier.
It is based on Bayes' theorem and assumes that the features are conditionally independent given the class.
\[ p(\textcolor{output}{y}_\star = k \vert {\textcolor{input}{\boldsymbol{x}}}_\star, {\textcolor{input}{\boldsymbol{X}}}, {\textcolor{output}{\boldsymbol{y}}}) = \frac{p({\textcolor{input}{\boldsymbol{x}}}_\star \vert \textcolor{output}{y}_\star = k, {\textcolor{input}{\boldsymbol{X}}}, {\textcolor{output}{\boldsymbol{y}}}) p(\textcolor{output}{y}_\star = k)}{\sum_{j} p({\textcolor{input}{\boldsymbol{x}}}_\star \vert \textcolor{output}{y}_\star = j, {\textcolor{input}{\boldsymbol{X}}}, {\textcolor{output}{\boldsymbol{y}}}) p(\textcolor{output}{y}_\star = j)} \]
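As a toy illustration (the numbers below are made up, not from the slides), the posterior is obtained by multiplying each class-conditional likelihood by its prior and normalizing over the classes:

```python
import numpy as np

# Hypothetical values of the two terms for a single test point x_star
# and three classes (illustrative numbers only).
likelihood = np.array([0.02, 0.10, 0.05])  # p(x_star | y_star = k, X, y), k = 0, 1, 2
prior      = np.array([0.50, 0.30, 0.20])  # p(y_star = k)

joint = likelihood * prior          # numerator of Bayes' theorem, per class
posterior = joint / joint.sum()     # divide by the sum over classes (denominator)
print(posterior)                    # [0.2 0.6 0.2]
```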
Questions:
How likely is the new data \({\textcolor{input}{\boldsymbol{x}}}_\star\) given the class \(k\) and the training data \({\textcolor{input}{\boldsymbol{X}}}, {\textcolor{output}{\boldsymbol{y}}}\)?
\[ p({\textcolor{input}{\boldsymbol{x}}}_\star \vert \textcolor{output}{y}_\star = k, {\textcolor{input}{\boldsymbol{X}}}, {\textcolor{output}{\boldsymbol{y}}}) = p({\textcolor{input}{\boldsymbol{x}}}_\star \vert \textcolor{output}{y}_\star = k, \textcolor{latent}{\boldsymbol{f}}({\textcolor{input}{\boldsymbol{X}}}, {\textcolor{output}{\boldsymbol{y}}})) \]
The function \(\textcolor{latent}{\boldsymbol{f}}({\textcolor{input}{\boldsymbol{X}}}, {\textcolor{output}{\boldsymbol{y}}})\) encodes the parameters of the likelihood, estimated from the training data.
We are free to choose the form of the likelihood, but it must be class-conditional.
A common choice is a Gaussian distribution for each class.
The training data for class \(k\) are used to estimate the parameters of that Gaussian (mean and variance).
Naive Bayes makes an additional assumption on the likelihood:
\[ p({\textcolor{input}{\boldsymbol{x}}}_\star \vert \textcolor{output}{y}_\star = k, \textcolor{latent}{\boldsymbol{f}}({\textcolor{input}{\boldsymbol{X}}}, {\textcolor{output}{\boldsymbol{y}}})) = \prod_{d=1}^D p(\textcolor{input}x_{\star d} \vert \textcolor{output}{y}_\star = k, \textcolor{latent}{\boldsymbol{f}}_d({\textcolor{input}{\boldsymbol{X}}}, {\textcolor{output}{\boldsymbol{y}}})) \]
The likelihood parameters \(\textcolor{latent}{\boldsymbol{f}}_d({\textcolor{input}{\boldsymbol{X}}}, {\textcolor{output}{\boldsymbol{y}}})\) are specific to each feature \(d\).
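In code, the naive assumption simply turns a product of per-feature densities into a sum of per-feature log-densities. A minimal sketch (the function name and the idea of passing one log-density per feature are assumptions for illustration):

```python
def naive_log_likelihood(x_star, feature_logpdfs_k):
    """log p(x_star | y_star = k, f(X, y)) under the naive assumption.

    x_star:            length-D sequence of feature values.
    feature_logpdfs_k: length-D list of callables; the d-th one returns
                       log p(x_star_d | y_star = k, f_d(X, y)).
    The product over features becomes a sum in log-space.
    """
    return sum(logpdf(x_d) for logpdf, x_d in zip(feature_logpdfs_k, x_star))
```

Working in log-space avoids numerical underflow when the number of features \(D\) is large.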
The second ingredient is the prior probability of class \(k\):
\[ p(\textcolor{output}{y}_\star = k) \]
Examples: a uniform prior over the classes, or the empirical class frequencies in the training data.
In the Gaussian case:
\(\textcolor{latent}{\boldsymbol{f}}({\textcolor{input}{\boldsymbol{X}}}, {\textcolor{output}{\boldsymbol{y}}})\) needs to estimate a mean and a variance for each feature and each class.
Empirical mean: \[ \textcolor{latent}\mu_{kd} = \frac{1}{N_k} \sum_{i=1}^N \textcolor{input}x_{id} \mathbb{I}(\textcolor{output}y_i = k) \]
Empirical variance: \[ \textcolor{latent}\sigma_{kd}^2 = \frac{1}{N_k} \sum_{i=1}^N (\textcolor{input}x_{id} - \textcolor{latent}\mu_{kd})^2 \mathbb{I}(\textcolor{output}y_i = k) \]
Here \(N_k = \sum_{i=1}^N \mathbb{I}(\textcolor{output}y_i = k)\) is the number of training points in class \(k\).
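A minimal NumPy sketch of these estimates (array names, shapes, and the 0..K-1 label encoding are assumptions for illustration); it also records the empirical class frequencies, one possible choice for the prior:

```python
import numpy as np

def fit_gaussian_naive_bayes(X, y, K):
    """Per-class, per-feature Gaussian parameters from training data.

    X: (N, D) inputs, y: (N,) integer labels in {0, ..., K-1}.
    Returns means (K, D), variances (K, D), priors (K,).
    """
    N, D = X.shape
    means, variances, priors = np.zeros((K, D)), np.zeros((K, D)), np.zeros(K)
    for k in range(K):
        X_k = X[y == k]                  # rows with indicator I(y_i = k)
        means[k] = X_k.mean(axis=0)      # empirical mean mu_kd
        variances[k] = X_k.var(axis=0)   # empirical variance sigma_kd^2 (1/N_k normalization)
        priors[k] = len(X_k) / N         # empirical class frequency, usable as p(y_star = k)
    return means, variances, priors
```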
The resulting class-conditional likelihood is
\[ p({\textcolor{input}{\boldsymbol{x}}}_\star \vert \textcolor{output}{y}_\star = k, {\textcolor{input}{\boldsymbol{X}}}, {\textcolor{output}{\boldsymbol{y}}}) = \prod_{d=1}^D {\mathcal{N}}(\textcolor{input}x_{\star d} \vert \textcolor{latent}\mu_{kd}, \textcolor{latent}\sigma_{kd}^2) \]
Plugging this likelihood and the prior back into Bayes' theorem gives the predictive posterior:
\[ p(\textcolor{output}{y}_\star = k \vert {\textcolor{input}{\boldsymbol{x}}}_\star, {\textcolor{input}{\boldsymbol{X}}}, {\textcolor{output}{\boldsymbol{y}}}) = \frac{p({\textcolor{input}{\boldsymbol{x}}}_\star \vert \textcolor{output}{y}_\star = k, {\textcolor{input}{\boldsymbol{X}}}, {\textcolor{output}{\boldsymbol{y}}}) p(\textcolor{output}{y}_\star = k)}{\sum_{j} p({\textcolor{input}{\boldsymbol{x}}}_\star \vert \textcolor{output}{y}_\star = j, {\textcolor{input}{\boldsymbol{X}}}, {\textcolor{output}{\boldsymbol{y}}}) p(\textcolor{output}{y}_\star = j)} \]
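Putting everything together, a sketch of the predictive step (it assumes parameter arrays shaped as in the fitting sketch above, and Gaussian per-feature likelihoods); the computation follows the formula above, done in log-space for numerical stability:

```python
import numpy as np
from scipy.stats import norm

def predict_proba(x_star, means, variances, priors):
    """Posterior p(y_star = k | x_star, X, y) for a single test point.

    x_star: (D,); means, variances: (K, D); priors: (K,).
    """
    K = len(priors)
    # log-likelihood (sum over features) plus log-prior, per class
    log_joint = np.array([
        norm.logpdf(x_star, loc=means[k], scale=np.sqrt(variances[k])).sum()
        + np.log(priors[k])
        for k in range(K)
    ])
    log_joint -= log_joint.max()          # shift for numerical stability
    joint = np.exp(log_joint)
    return joint / joint.sum()            # normalize over classes
```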
Naive Bayes is a generative classifier based on Bayes' theorem.
Advantages: simple and fast to train, with few parameters to estimate; works reasonably well with small training sets and high-dimensional inputs.
Disadvantages: the conditional independence assumption rarely holds in practice, so the estimated posterior probabilities can be poorly calibrated.
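For practical use, the same Gaussian model is available in scikit-learn as GaussianNB; a minimal usage sketch on synthetic data (the data and numbers below are purely illustrative):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),   # class 0 samples
               rng.normal(3.0, 1.0, size=(50, 2))])  # class 1 samples
y = np.array([0] * 50 + [1] * 50)

model = GaussianNB().fit(X, y)
print(model.predict_proba([[1.5, 1.5]]))  # posterior over the two classes for a new point
```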