Advanced Statistical Inference

Naive Bayes Classifier

Simone Rossi

EURECOM

\[ \require{physics} \definecolor{input}{rgb}{0.42, 0.55, 0.74} \definecolor{params}{rgb}{0.51,0.70,0.40} \definecolor{output}{rgb}{0.843, 0.608, 0} \definecolor{vparams}{rgb}{0.58, 0, 0.83} \definecolor{noise}{rgb}{0.0, 0.48, 0.65} \definecolor{latent}{rgb}{0.8, 0.0, 0.8} \]

Naive Bayes Classifier

Introduction

  • Naive Bayes is an example of a generative classifier.

  • It is based on Bayes' theorem and assumes that the features are conditionally independent given the class.

\[ p(\textcolor{output}{y}_\star = k \vert {\textcolor{input}{\boldsymbol{x}}}_\star, {\textcolor{input}{\boldsymbol{X}}}, {\textcolor{output}{\boldsymbol{y}}}) = \frac{p({\textcolor{input}{\boldsymbol{x}}}_\star \vert \textcolor{output}{y}_\star = k, {\textcolor{input}{\boldsymbol{X}}}, {\textcolor{output}{\boldsymbol{y}}}) p(\textcolor{output}{y}_\star = k)}{\sum_{j} p({\textcolor{input}{\boldsymbol{x}}}_\star \vert \textcolor{output}{y}_\star = j, {\textcolor{input}{\boldsymbol{X}}}, {\textcolor{output}{\boldsymbol{y}}}) p(\textcolor{output}{y}_\star = j)} \]

Questions:

  1. What is the likelihood \(p({\textcolor{input}{\boldsymbol{x}}}_\star \vert \textcolor{output}{y}_\star = k, {\textcolor{input}{\boldsymbol{X}}}, {\textcolor{output}{\boldsymbol{y}}})\)?
  2. What is the prior \(p(\textcolor{output}{y}_\star = k)\)?
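As a quick illustration of how the posterior combines these two ingredients, consider a hypothetical two-class problem with made-up values: likelihoods \(p({\textcolor{input}{\boldsymbol{x}}}_\star \vert \textcolor{output}{y}_\star = 1) = 0.5\) and \(p({\textcolor{input}{\boldsymbol{x}}}_\star \vert \textcolor{output}{y}_\star = 2) = 2.0\) (these are densities, so values above 1 are allowed), and priors \(p(\textcolor{output}{y}_\star = 1) = 0.3\), \(p(\textcolor{output}{y}_\star = 2) = 0.7\). Then

\[ p(\textcolor{output}{y}_\star = 1 \vert {\textcolor{input}{\boldsymbol{x}}}_\star) = \frac{0.5 \times 0.3}{0.5 \times 0.3 + 2.0 \times 0.7} = \frac{0.15}{1.55} \approx 0.10, \]

so the higher likelihood and larger prior of class 2 dominate the prediction.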

Naive Bayes: Likelihood

\[ p({\textcolor{input}{\boldsymbol{x}}}_\star \vert \textcolor{output}{y}_\star = k, {\textcolor{input}{\boldsymbol{X}}}, {\textcolor{output}{\boldsymbol{y}}}) = p({\textcolor{input}{\boldsymbol{x}}}_\star \vert \textcolor{output}{y}_\star = k, \textcolor{latent}{\boldsymbol{f}}({\textcolor{input}{\boldsymbol{X}}}, {\textcolor{output}{\boldsymbol{y}}})) \]

  • How likely is the new data \({\textcolor{input}{\boldsymbol{x}}}_\star\) given the class \(k\) and the training data \({\textcolor{input}{\boldsymbol{X}}}, {\textcolor{output}{\boldsymbol{y}}}\)?

  • The function \(\textcolor{latent}{\boldsymbol{f}}({\textcolor{input}{\boldsymbol{X}}}, {\textcolor{output}{\boldsymbol{y}}})\) encodes the parameters of the likelihood, estimated from the training data.

  • We are free to choose the form of the likelihood, but it must be class-conditional.

  • Common choice: Gaussian distribution (for each class).

  • The training data for class \(k\) are used to estimate the parameters of that Gaussian (mean and variance).

Naive Bayes: Likelihood

Naive Bayes makes an additional assumption on the likelihood:

  • The features are conditionally independent given the class.

\[ p({\textcolor{input}{\boldsymbol{x}}}_\star \vert \textcolor{output}{y}_\star = k, \textcolor{latent}{\boldsymbol{f}}({\textcolor{input}{\boldsymbol{X}}}, {\textcolor{output}{\boldsymbol{y}}})) = \prod_{d=1}^D p(\textcolor{input}x_{\star d} \vert \textcolor{output}{y}_\star = k, \textcolor{latent}{\boldsymbol{f}}_d({\textcolor{input}{\boldsymbol{X}}}, {\textcolor{output}{\boldsymbol{y}}})) \]

  • The likelihood parameters \(\textcolor{latent}{\boldsymbol{f}}_d({\textcolor{input}{\boldsymbol{X}}}, {\textcolor{output}{\boldsymbol{y}}})\) are specific to each feature \(d\).

  • In the Gaussian case:

    • \(\textcolor{latent}{\boldsymbol{f}}_d({\textcolor{input}{\boldsymbol{X}}}, {\textcolor{output}{\boldsymbol{y}}}) = \{\textcolor{latent}\mu_{kd}, \textcolor{latent}\sigma_{kd}^2\}\).
    • \(p(\textcolor{input}x_{\star d} \vert \textcolor{output}{y}_\star = k, \textcolor{latent}{\boldsymbol{f}}_d({\textcolor{input}{\boldsymbol{X}}}, {\textcolor{output}{\boldsymbol{y}}})) = {\mathcal{N}}(\textcolor{input}x_{\star d} \vert \textcolor{latent}\mu_{kd}, \textcolor{latent}\sigma_{kd}^2)\).
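A minimal NumPy sketch of this factorized Gaussian likelihood, evaluated in log space for numerical stability; the array layout (`mu`, `var` of shape `(K, D)`) is an assumption of this sketch, not something fixed by the slides:

```python
import numpy as np

def log_likelihood(x_star, mu, var):
    """log prod_d N(x_star[d] | mu[k, d], var[k, d]), for every class k.

    x_star: (D,) new input; mu, var: (K, D) per-class, per-feature parameters.
    Returns a (K,) array of class-conditional log-likelihoods.
    """
    # Gaussian log-density for each class/feature pair; the product over
    # features d becomes a sum in log space.
    log_pdf = -0.5 * (np.log(2 * np.pi * var) + (x_star - mu) ** 2 / var)
    return log_pdf.sum(axis=1)
```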

Naive Bayes: Prior

\[ p(\textcolor{output}{y}_\star = k) \]

  • How likely is the class \(k\)?

Examples:

  • Uniform prior: \(p(\textcolor{output}{y}_\star = k) = \frac{1}{K}\).
  • Class imbalance: \(p(\textcolor{output}{y}_\star = k) \ll p(\textcolor{output}{y}_\star = j)\) for \(j \neq k\) if class \(k\) is rare.
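A small sketch of these two choices of prior; estimating the imbalanced prior from training-label frequencies is a common convention assumed here, not something stated on the slide:

```python
import numpy as np

def uniform_prior(K):
    # p(y = k) = 1 / K for every class
    return np.full(K, 1.0 / K)

def empirical_prior(y, K):
    # p(y = k) = N_k / N: relative class frequencies in the training
    # labels, which naturally reflect any class imbalance.
    counts = np.bincount(y, minlength=K)
    return counts / counts.sum()
```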

Naive Bayes: Example

Step 1: fit the class-conditional distributions

\(\textcolor{latent}{\boldsymbol{f}}({\textcolor{input}{\boldsymbol{X}}}, {\textcolor{output}{\boldsymbol{y}}})\) needs to estimate the mean and variance of each feature, for each class.

  1. Empirical mean: \[ \textcolor{latent}\mu_{kd} = \frac{1}{N_k} \sum_{i=1}^N \textcolor{input}x_{id} \mathbb{I}(\textcolor{output}y_i = k) \]

  2. Empirical variance: \[ \textcolor{latent}\sigma_{kd}^2 = \frac{1}{N_k} \sum_{i=1}^N (\textcolor{input}x_{id} - \textcolor{latent}\mu_{kd})^2 \mathbb{I}(\textcolor{output}y_i = k) \]

Here \(N_k = \sum_{i=1}^N \mathbb{I}(\textcolor{output}y_i = k)\) is the number of training points in class \(k\).
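A NumPy sketch of Step 1 that follows the two formulas above, with boolean masks playing the role of the indicator \(\mathbb{I}(\cdot)\); the function and variable names are choices of this sketch:

```python
import numpy as np

def fit_gaussian_naive_bayes(X, y, K):
    """Estimate per-class, per-feature Gaussian parameters.

    X: (N, D) training inputs; y: (N,) integer class labels in {0, ..., K-1}.
    Returns mu, var of shape (K, D).
    """
    N, D = X.shape
    mu = np.zeros((K, D))
    var = np.zeros((K, D))
    for k in range(K):
        X_k = X[y == k]               # rows with I(y_i = k) = 1
        mu[k] = X_k.mean(axis=0)      # empirical mean, feature-wise
        var[k] = X_k.var(axis=0)      # empirical variance (1/N_k normalisation)
    return mu, var
```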

Step 1: fit the class-conditional distributions

\[ p({\textcolor{input}{\boldsymbol{x}}}_\star \vert \textcolor{output}{y}_\star = k, {\textcolor{input}{\boldsymbol{X}}}, {\textcolor{output}{\boldsymbol{y}}}) = \prod_{d=1}^D {\mathcal{N}}(\textcolor{input}x_{\star d} \vert \textcolor{latent}\mu_{kd}, \textcolor{latent}\sigma_{kd}^2) \]

Step 2: predict the class of new data

\[ p(\textcolor{output}{y}_\star = k \vert {\textcolor{input}{\boldsymbol{x}}}_\star, {\textcolor{input}{\boldsymbol{X}}}, {\textcolor{output}{\boldsymbol{y}}}) = \frac{p({\textcolor{input}{\boldsymbol{x}}}_\star \vert \textcolor{output}{y}_\star = k, {\textcolor{input}{\boldsymbol{X}}}, {\textcolor{output}{\boldsymbol{y}}}) p(\textcolor{output}{y}_\star = k)}{\sum_{j} p({\textcolor{input}{\boldsymbol{x}}}_\star \vert \textcolor{output}{y}_\star = j, {\textcolor{input}{\boldsymbol{X}}}, {\textcolor{output}{\boldsymbol{y}}}) p(\textcolor{output}{y}_\star = j)} \]
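Putting the two steps together, a sketch of the prediction rule; computing in log space and normalising with the log-sum-exp trick is a numerical-stability choice of this sketch, not a requirement of the method:

```python
import numpy as np

def predict_proba(x_star, mu, var, prior):
    """Posterior p(y_star = k | x_star, X, y) for all K classes.

    mu, var: (K, D) Gaussian parameters from Step 1; prior: (K,) class priors.
    """
    # class-conditional log-likelihood: sum of per-feature Gaussian log-densities
    log_lik = (-0.5 * (np.log(2 * np.pi * var) + (x_star - mu) ** 2 / var)).sum(axis=1)
    # log numerator of Bayes' rule, then normalise over classes
    log_joint = log_lik + np.log(prior)
    return np.exp(log_joint - np.logaddexp.reduce(log_joint))
```

The predicted class is then simply the \(k\) with the largest posterior probability, e.g. `np.argmax(predict_proba(x_star, mu, var, prior))`.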

Summary

Naive Bayes is a generative classifier based on Bayes' theorem.

Advantages:

  • Simple and fast.
  • Works well with high-dimensional data.

Disadvantages:

  • Strong assumption of conditional independence.
  • It recasts the classification problem as a density estimation problem, which can be a harder task than the original one.