Mastering Naive Bayes

Naive Bayes remains a cornerstone of machine learning. Despite the “naive” label, it is a principled probabilistic model built on solid statistical foundations. This article explores the mathematical core of Naive Bayes, all of its major variants, and how it ultimately functions as a Maximum A Posteriori (MAP) estimator.

1. The “Naive” Foundation & Mathematical Derivation

The objective of a Naive Bayes classifier is simple: Given a set of features $X$, what is the probability that this observation belongs to a specific class $C$?

The Mathematical Formulation

We start with Bayes’ Theorem:

\[P(C | X) = \frac{P(X | C) \cdot P(C)}{P(X)}\]

To find the most likely class, we want the class $C$ that maximizes this posterior. Since $P(X)$ (the evidence) is constant across classes, we can drop it and focus on the numerator:

\[P(C | X) \propto P(C) \cdot P(x_1, x_2, ..., x_n | C)\]

The term $P(x_1, x_2, \ldots, x_n \mid C)$ is the joint probability of all the features. Estimating it directly is intractable for large feature spaces, because we would need to observe every possible combination of feature values in our training data.

Why it’s “Naive”

We invoke the Conditional Independence Assumption: we assume that every feature $x_i$ is independent of every other feature $x_j$ given the class $C$. This allows us to decompose the joint probability into a simple product:

\[P(x_1, x_2, ..., x_n | C) = \prod_{i=1}^{n} P(x_i | C)\]

The final decision rule becomes:

\[\hat{y} = \arg\max_{C} \left[ P(C) \prod_{i=1}^{n} P(x_i | C) \right]\]
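
Below is a minimal sketch of this decision rule in Python. The priors and per-feature likelihoods are hypothetical numbers chosen for illustration, and the products are computed in log space, since multiplying many small probabilities quickly underflows floating-point arithmetic:

```python
import math

# Hypothetical pre-computed priors P(C) and per-feature likelihoods P(x_i | C)
priors = {"A": 0.4, "B": 0.6}
likelihoods = {
    "A": {"x1": 0.30, "x2": 0.20, "x3": 0.01},
    "B": {"x1": 0.02, "x2": 0.05, "x3": 0.15},
}

def predict(features):
    """Return the class maximizing log P(C) + sum_i log P(x_i | C)."""
    best_class, best_score = None, float("-inf")
    for c, prior in priors.items():
        score = math.log(prior)
        for x in features:
            score += math.log(likelihoods[c][x])
        if score > best_score:
            best_class, best_score = c, score
    return best_class

print(predict(["x1", "x2"]))  # -> "A"
print(predict(["x3"]))        # -> "B"
```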

2. A Practical Example: Spam Email Detection

Let’s use the classic Spam vs. Ham (Normal Email) example. We want to determine whether an email is spam ($S$) or ham ($H$, i.e., not spam) based on the words it contains.

\[P(S | \text{words}) = \frac{P(\text{words} | S) \cdot P(S)}{P(\text{words})}\]
  • $P(S \lvert \text{words})$: The Posterior Probability (What we want to find).
  • $P(\text{words} \lvert S)$: The Likelihood (How often these words appear in known spam).
  • $P(S)$: The Prior Probability (How common spam is in general).
  • $P(\text{words})$: The Evidence (The probability of these words appearing in any email).

Since $P(\text{words})$ is the same for both the Spam and Ham calculations, we ignore it and focus on maximizing the numerator:

\[P(S | \text{words}) \propto P(S) \cdot P(\text{words} | S)\]

The “Naive” Connection

In a real email, words are dependent on each other (e.g., “Free” followed by “Money” is very different from “Free” followed by “Time”). Calculating these complex dependencies requires massive datasets. Naive Bayes makes a “Naive” Assumption: It assumes that every feature (word) is completely independent of every other feature given the class. Mathematically, this decomposes the joint probability into a simple product:

\[P(\text{words} | S) = P(w_1 | S) \cdot P(w_2 | S) \cdot ... \cdot P(w_n | S)\]

By treating features as independent, the algorithm becomes incredibly fast and effective, even with smaller datasets.
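
To make this concrete, here is a toy calculation with made-up probabilities. Suppose $P(S) = 0.4$ and $P(H) = 0.6$, and that the training data gives $P(\text{free} \mid S) = 0.3$, $P(\text{money} \mid S) = 0.2$, $P(\text{free} \mid H) = 0.02$, and $P(\text{money} \mid H) = 0.05$. For an email containing “free money”:

\[P(S) \cdot P(\text{free} | S) \cdot P(\text{money} | S) = 0.4 \times 0.3 \times 0.2 = 0.024\]

\[P(H) \cdot P(\text{free} | H) \cdot P(\text{money} | H) = 0.6 \times 0.02 \times 0.05 = 0.0006\]

The spam score is far larger, so the email is classified as spam.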

3. Discrete Variants: Bernoulli vs. Multinomial

When dealing with text, we represent data in two primary ways.

Bernoulli Naive Bayes

Why is it called Bernoulli?

It is named after the Bernoulli Distribution, which models an experiment with exactly two outcomes (success/failure). In this context, it treats every word in the vocabulary as a binary feature: Is the word present (1) or absent (0)?

  • Logic: It ignores how many times a word appears. It only cares if it appears.
  • Example: In an email “Money, Money, Money”, the feature for “Money” is simply 1.
  • When to use: Short texts or when the mere presence of a keyword (like a specific error code) is the strongest signal.
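
A minimal scikit-learn sketch of this variant is shown below. The tiny corpus and labels are invented, and `CountVectorizer(binary=True)` is one convenient way to obtain presence/absence features:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

# Toy corpus: 1 = spam, 0 = ham
texts = ["free money now", "money money money", "meeting at noon", "lunch meeting today"]
labels = [1, 1, 0, 0]

# binary=True records only the presence/absence of each word
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(texts)

model = BernoulliNB()  # alpha=1.0 (Laplace smoothing) by default
model.fit(X, labels)

print(model.predict(vectorizer.transform(["free money meeting"])))  # predicts 1 (spam) on this toy data
```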

Multinomial Naive Bayes

Why is it called Multinomial?

It is named after the Multinomial Distribution, which generalizes the binomial distribution to allow for more than two outcomes (like rolling a die $n$ times). Here, it models the “count” of word occurrences.

  • Logic: It tracks the frequency of words.
  • Example: In “Money, Money, Money”, the feature for “Money” is 3.
  • When to use: Most standard text classification tasks where word repetition signifies importance.
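
The same sketch with raw word counts instead of binary indicators (again with a made-up corpus):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["money money money free", "win free money", "project meeting notes", "team meeting agenda"]
labels = ["spam", "spam", "ham", "ham"]

# The default CountVectorizer keeps raw word counts, so repetition matters
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

model = MultinomialNB()  # alpha=1.0 applies Laplace smoothing by default
model.fit(X, labels)

print(model.predict(vectorizer.transform(["free money"])))  # -> ['spam']
```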

Laplace Smoothing in Detail

A major issue in Naive Bayes is the Zero-Probability Problem. If a word appears in a test email but was never seen in the “Spam” training data, its likelihood becomes 0. Because we multiply probabilities, this single zero wipes out all other evidence.

To fix this, we use Laplace Smoothing (adding a small count $\alpha$, usually 1).

For Multinomial (Word Counts):

We add $\alpha$ to the word count and $\alpha \times |V|$ to the total word count (where $|V|$ is the vocabulary size) to keep probabilities summing to 1.

\[P(w_i | C) = \frac{\text{count}(w_i, C) + \alpha}{\text{Total Words in } C + \alpha|V|}\]

For Bernoulli (Presence/Absence):

Since there are only two outcomes (Present/Absent), we add $2\alpha$ to the denominator.

\[P(w_i | C) = \frac{\text{Docs with } w_i \text{ in } C + \alpha}{\text{Total Docs in } C + 2\alpha}\]

For Multi-Class Priors:

If you have $K$ classes (e.g., Sports, Politics, Tech), smoothing the prior probability $P(C_k)$ follows the same logic. We add $K\alpha$ to the denominator.

\[P(C_k) = \frac{\text{Docs in } C_k + \alpha}{\text{Total Docs} + K\alpha}\]
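
Here is a small sketch of the smoothed Multinomial likelihood, assuming already-tokenized documents. Note how a word never seen in the spam class still receives a non-zero probability:

```python
from collections import Counter

def smoothed_likelihood(word, class_docs, vocab, alpha=1.0):
    """P(word | C) with Laplace smoothing, Multinomial variant."""
    counts = Counter(w for doc in class_docs for w in doc)
    total_words = sum(counts.values())
    return (counts[word] + alpha) / (total_words + alpha * len(vocab))

spam_docs = [["free", "money"], ["money", "money", "win"]]
ham_docs  = [["meeting", "at", "noon"], ["project", "meeting"]]
vocab = {w for doc in spam_docs + ham_docs for w in doc}  # |V| = 7

# "meeting" never appears in spam, yet its smoothed likelihood is > 0
print(smoothed_likelihood("meeting", spam_docs, vocab))  # (0 + 1) / (5 + 7) ≈ 0.083
print(smoothed_likelihood("money", spam_docs, vocab))    # (3 + 1) / (5 + 7) ≈ 0.333
```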

4. Gaussian Naive Bayes: Handling Continuous Data

What if your features aren’t words, but numbers like height or temperature? You cannot “count” how many times someone is exactly 175.0034 cm tall. This is where Gaussian Naive Bayes comes in.

The Intuition

It assumes that continuous data for each class follows a Normal (Gaussian) Distribution, also known as a Bell Curve. For each class, the algorithm calculates:

  • The Mean ($\mu$): The “center” of the data.
  • The Variance ($\sigma^2$): How “spread out” the data is.

Mathematical Formulation

To find the probability of a specific value $x$, we use the Gaussian Probability Density Function (PDF):

\[P(x | C) = \frac{1}{\sqrt{2\pi\sigma_C^2}} \exp\left(-\frac{(x - \mu_C)^2}{2\sigma_C^2}\right)\]

Practical Example: Human Classification

Imagine classifying “Male” or “Female” based on Height.

  • Training: Calculate the mean and variance of height for each class ($\mu_M, \sigma_M^2$ for males and $\mu_F, \sigma_F^2$ for females).
  • Prediction: A new person is 172cm.
  • Calculation: The algorithm plugs 172 into the Male PDF and the Female PDF. Since 172 is closer to the Male mean, that likelihood will be higher, leading to a “Male” prediction.
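
A from-scratch sketch of this example is shown below. The class means, variances, and the 50/50 prior are hypothetical numbers chosen for illustration:

```python
import math

def gaussian_pdf(x, mu, var):
    """Gaussian probability density function used by Gaussian Naive Bayes."""
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

# Hypothetical class statistics estimated from training data
mu_male, var_male = 178.0, 49.0      # cm, cm^2
mu_female, var_female = 165.0, 49.0

x = 172.0
score_male   = 0.5 * gaussian_pdf(x, mu_male, var_male)      # P(Male)   * P(x | Male)
score_female = 0.5 * gaussian_pdf(x, mu_female, var_female)  # P(Female) * P(x | Female)

print("Male" if score_male > score_female else "Female")  # -> "Male"
```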

5. Naive Bayes and MAP Estimation

At its heart, Naive Bayes is solving a Maximum A Posteriori (MAP) estimation problem.

In statistics, we often choose between:

  • Maximum Likelihood Estimation (MLE): Choosing the class that makes the observed data most likely. This ignores prior knowledge.
\[\text{Maximize } P(X | C)\]
  • Maximum A Posteriori (MAP): Choosing the class that is most likely given the data, incorporating our prior beliefs ($P(C)$).
\[\text{Maximize } P(C | X) \propto P(C) \cdot P(X | C)\]

Naive Bayes explicitly calculates $P(C) \times P(X \lvert C)$.

The $P(X \lvert C)$ term represents the Likelihood (the evidence from the features).

The $P(C)$ term represents the Prior (our baseline belief). When we have very little data, the Prior ($P(C)$) dominates the decision. As we gather more data, the Likelihood ($P(X|C)$) takes over. Laplace Smoothing can even be viewed as introducing a “Bayesian Prior” that prevents us from being too certain (avoiding 0 probabilities) when data is scarce.
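
A tiny sketch of the difference, with invented numbers: the data alone (MLE) favours spam, but a strong prior that most mail is ham flips the MAP decision:

```python
import math

# Hypothetical likelihoods and a skewed prior (e.g., only 1% of email is spam)
likelihood = {"spam": 0.08, "ham": 0.03}   # P(X | C)
prior      = {"spam": 0.01, "ham": 0.99}   # P(C)

mle = max(likelihood, key=likelihood.get)                                       # ignores the prior
map_ = max(prior, key=lambda c: math.log(prior[c]) + math.log(likelihood[c]))   # uses it

print(mle)   # 'spam' -> the evidence alone favours spam
print(map_)  # 'ham'  -> the strong prior towards ham overrides the weak evidence
```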
