Advanced Statistical Inference

Naive Bayes Classifier

Simone Rossi

EURECOM

\[ \require{physics} \definecolor{input}{rgb}{0.42, 0.55, 0.74} \definecolor{params}{rgb}{0.51,0.70,0.40} \definecolor{output}{rgb}{0.843, 0.608, 0} \definecolor{vparams}{rgb}{0.58, 0, 0.83} \definecolor{noise}{rgb}{0.0, 0.48, 0.65} \definecolor{latent}{rgb}{0.8, 0.0, 0.8} \]

Naive Bayes Classifier

Introduction

  • Naive Bayes is an example of a generative classifier.

  • It is based on Bayes' theorem and assumes that the features are conditionally independent given the class.

\[ p(\textcolor{output}{y}_\star = k \vert {\textcolor{input}{\boldsymbol{x}}}_\star, {\textcolor{input}{\boldsymbol{X}}}, {\textcolor{output}{\boldsymbol{y}}}) = \frac{p({\textcolor{input}{\boldsymbol{x}}}_\star \vert \textcolor{output}{y}_\star = k, {\textcolor{input}{\boldsymbol{X}}}, {\textcolor{output}{\boldsymbol{y}}}) p(\textcolor{output}{y}_\star = k)}{\sum_{j} p({\textcolor{input}{\boldsymbol{x}}}_\star \vert \textcolor{output}{y}_\star = j, {\textcolor{input}{\boldsymbol{X}}}, {\textcolor{output}{\boldsymbol{y}}}) p(\textcolor{output}{y}_\star = j)} \]

Questions:

  1. What is the likelihood \(p({\textcolor{input}{\boldsymbol{x}}}_\star \vert \textcolor{output}{y}_\star = k, {\textcolor{input}{\boldsymbol{X}}}, {\textcolor{output}{\boldsymbol{y}}})\)?
  2. What is the prior \(p(\textcolor{output}{y}_\star = k)\)?
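As a quick illustration of how the posterior combines these two ingredients, consider a hypothetical two-class problem with made-up values: likelihoods \(p({\textcolor{input}{\boldsymbol{x}}}_\star \vert \textcolor{output}{y}_\star = 1) = 0.5\) and \(p({\textcolor{input}{\boldsymbol{x}}}_\star \vert \textcolor{output}{y}_\star = 2) = 2.0\) (these are densities, so values above 1 are allowed), and priors \(p(\textcolor{output}{y}_\star = 1) = 0.3\), \(p(\textcolor{output}{y}_\star = 2) = 0.7\). Then

\[ p(\textcolor{output}{y}_\star = 1 \vert {\textcolor{input}{\boldsymbol{x}}}_\star) = \frac{0.5 \times 0.3}{0.5 \times 0.3 + 2.0 \times 0.7} = \frac{0.15}{1.55} \approx 0.10, \]

so the higher likelihood and larger prior of class 2 dominate the prediction.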

Naive Bayes: Likelihood

\[ p({\textcolor{input}{\boldsymbol{x}}}_\star \vert \textcolor{output}{y}_\star = k, {\textcolor{input}{\boldsymbol{X}}}, {\textcolor{output}{\boldsymbol{y}}}) = p({\textcolor{input}{\boldsymbol{x}}}_\star \vert \textcolor{output}{y}_\star = k, \textcolor{latent}{\boldsymbol{f}}({\textcolor{input}{\boldsymbol{X}}}, {\textcolor{output}{\boldsymbol{y}}})) \]

  • How likely is the new data \({\textcolor{input}{\boldsymbol{x}}}_\star\) given the class \(k\) and the training data \({\textcolor{input}{\boldsymbol{X}}}, {\textcolor{output}{\boldsymbol{y}}}\)?

  • The function \(\textcolor{latent}{\boldsymbol{f}}({\textcolor{input}{\boldsymbol{X}}}, {\textcolor{output}{\boldsymbol{y}}})\) encodes the parameters of the likelihood, estimated from the training data.

  • We are free to choose the form of the likelihood, but it must be class-conditional.

  • Common choice: Gaussian distribution (for each class).

  • The training data for class \(k\) are used to estimate the parameters of that Gaussian (mean and variance).

Naive Bayes: Likelihood

Naive Bayes makes an additional assumption on the likelihood:

  • The features are conditionally independent given the class.

\[ p({\textcolor{input}{\boldsymbol{x}}}_\star \vert \textcolor{output}{y}_\star = k, \textcolor{latent}{\boldsymbol{f}}({\textcolor{input}{\boldsymbol{X}}}, {\textcolor{output}{\boldsymbol{y}}})) = \prod_{d=1}^D p(\textcolor{input}x_{\star d} \vert \textcolor{output}{y}_\star = k, \textcolor{latent}{\boldsymbol{f}}_d({\textcolor{input}{\boldsymbol{X}}}, {\textcolor{output}{\boldsymbol{y}}})) \]

  • The likelihood parameters \(\textcolor{latent}{\boldsymbol{f}}_d({\textcolor{input}{\boldsymbol{X}}}, {\textcolor{output}{\boldsymbol{y}}})\) are specific to each feature \(d\).

  • In the Gaussian case:

    • \(\textcolor{latent}{\boldsymbol{f}}_d({\textcolor{input}{\boldsymbol{X}}}, {\textcolor{output}{\boldsymbol{y}}}) = \{\textcolor{latent}\mu_{kd}, \textcolor{latent}\sigma_{kd}^2\}\).
    • \(p(\textcolor{input}x_{\star d} \vert \textcolor{output}{y}_\star = k, \textcolor{latent}{\boldsymbol{f}}_d({\textcolor{input}{\boldsymbol{X}}}, {\textcolor{output}{\boldsymbol{y}}})) = {\mathcal{N}}(\textcolor{input}x_{\star d} \vert \textcolor{latent}\mu_{kd}, \textcolor{latent}\sigma_{kd}^2)\).
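A minimal NumPy sketch of this factorized Gaussian likelihood, evaluated in log space for numerical stability; the array layout (`mu`, `var` of shape `(K, D)`) is an assumption of this sketch, not something fixed by the slides:

```python
import numpy as np

def log_likelihood(x_star, mu, var):
    """log prod_d N(x_star[d] | mu[k, d], var[k, d]), for every class k.

    x_star: (D,) new input; mu, var: (K, D) per-class, per-feature parameters.
    Returns a (K,) array of class-conditional log-likelihoods.
    """
    # Gaussian log-density for each class/feature pair; the product over
    # features d becomes a sum in log space.
    log_pdf = -0.5 * (np.log(2 * np.pi * var) + (x_star - mu) ** 2 / var)
    return log_pdf.sum(axis=1)
```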

Naive Bayes: Prior

\[ p(\textcolor{output}{y}_\star = k) \]

  • How likely is the class \(k\)?

Examples:

  • Uniform prior: \(p(\textcolor{output}{y}_\star = k) = \frac{1}{K}\).
  • Class imbalance: \(p(\textcolor{output}{y}_\star = k) \ll p(\textcolor{output}{y}_\star = j)\) for \(j \neq k\) if class \(k\) is rare.
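A small sketch of these two choices of prior; estimating the imbalanced prior from training-label frequencies is a common convention assumed here, not something stated on the slide:

```python
import numpy as np

def uniform_prior(K):
    # p(y = k) = 1 / K for every class
    return np.full(K, 1.0 / K)

def empirical_prior(y, K):
    # p(y = k) = N_k / N: relative class frequencies in the training
    # labels, which naturally reflect any class imbalance.
    counts = np.bincount(y, minlength=K)
    return counts / counts.sum()
```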

Naive Bayes: Example

Step 1: fit the class-conditional distributions

\(\textcolor{latent}{\boldsymbol{f}}({\textcolor{input}{\boldsymbol{X}}}, {\textcolor{output}{\boldsymbol{y}}})\) needs to estimate the mean and variance of each feature, for each class.

  1. Empirical mean: \[ \textcolor{latent}\mu_{kd} = \frac{1}{N_k} \sum_{i=1}^N \textcolor{input}x_{id} \mathbb{I}(\textcolor{output}y_i = k) \]

  2. Empirical variance: \[ \textcolor{latent}\sigma_{kd}^2 = \frac{1}{N_k} \sum_{i=1}^N (\textcolor{input}x_{id} - \textcolor{latent}\mu_{kd})^2 \mathbb{I}(\textcolor{output}y_i = k) \]

Here \(N_k = \sum_{i=1}^N \mathbb{I}(\textcolor{output}y_i = k)\) is the number of training points in class \(k\).
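A NumPy sketch of Step 1 that follows the two formulas above, with boolean masks playing the role of the indicator \(\mathbb{I}(\cdot)\); the function and variable names are choices of this sketch:

```python
import numpy as np

def fit_gaussian_naive_bayes(X, y, K):
    """Estimate per-class, per-feature Gaussian parameters.

    X: (N, D) training inputs; y: (N,) integer class labels in {0, ..., K-1}.
    Returns mu, var of shape (K, D).
    """
    N, D = X.shape
    mu = np.zeros((K, D))
    var = np.zeros((K, D))
    for k in range(K):
        X_k = X[y == k]               # rows with I(y_i = k) = 1
        mu[k] = X_k.mean(axis=0)      # empirical mean, feature-wise
        var[k] = X_k.var(axis=0)      # empirical variance (1/N_k normalisation)
    return mu, var
```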

Step 1: fit the class-conditional distributions

\[ p({\textcolor{input}{\boldsymbol{x}}}_\star \vert \textcolor{output}{y}_\star = k, {\textcolor{input}{\boldsymbol{X}}}, {\textcolor{output}{\boldsymbol{y}}}) = \prod_{d=1}^D {\mathcal{N}}(\textcolor{input}x_{\star d} \vert \textcolor{latent}\mu_{kd}, \textcolor{latent}\sigma_{kd}^2) \]

Step 2: predict the class of new data

\[ p(\textcolor{output}{y}_\star = k \vert {\textcolor{input}{\boldsymbol{x}}}_\star, {\textcolor{input}{\boldsymbol{X}}}, {\textcolor{output}{\boldsymbol{y}}}) = \frac{p({\textcolor{input}{\boldsymbol{x}}}_\star \vert \textcolor{output}{y}_\star = k, {\textcolor{input}{\boldsymbol{X}}}, {\textcolor{output}{\boldsymbol{y}}}) p(\textcolor{output}{y}_\star = k)}{\sum_{j} p({\textcolor{input}{\boldsymbol{x}}}_\star \vert \textcolor{output}{y}_\star = j, {\textcolor{input}{\boldsymbol{X}}}, {\textcolor{output}{\boldsymbol{y}}}) p(\textcolor{output}{y}_\star = j)} \]
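Putting the two steps together, a sketch of the prediction rule; computing in log space and normalising with the log-sum-exp trick is a numerical-stability choice of this sketch, not a requirement of the method:

```python
import numpy as np

def predict_proba(x_star, mu, var, prior):
    """Posterior p(y_star = k | x_star, X, y) for all K classes.

    mu, var: (K, D) Gaussian parameters from Step 1; prior: (K,) class priors.
    """
    # class-conditional log-likelihood: sum of per-feature Gaussian log-densities
    log_lik = (-0.5 * (np.log(2 * np.pi * var) + (x_star - mu) ** 2 / var)).sum(axis=1)
    # log numerator of Bayes' rule, then normalise over classes
    log_joint = log_lik + np.log(prior)
    return np.exp(log_joint - np.logaddexp.reduce(log_joint))
```

The predicted class is then simply the \(k\) with the largest posterior probability, e.g. `np.argmax(predict_proba(x_star, mu, var, prior))`.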

Summary

Naive Bayes is a generative classifier based on Bayes' theorem.

Advantages:

  • Simple and fast.
  • Works well with high-dimensional data.

Disadvantages:

  • Strong assumption of conditional independence.
  • It recasts the classification problem as a density estimation problem, which can be a harder task than the original one.