Post

A Short Primer on KL/JS Divergence

A Short Primer on KL/JS Divergence

Mastering Data Drift Detection: Why You Should Switch from KL to JS Divergence

In the high-stakes world of machine learning production systems, especially in domains like Fraud Detection, deploying a model is only the beginning. The real challenge begins on Day 2: dealing with the ever-changing nature of the real world.
Models are trained on historical data, but they operate on live data. When these two distributions diverge, your model’s performance can plummet silently. This phenomenon is known as Data Drift.
To catch this, MLOps engineers rely on statistical distance metrics. The most famous is Kullback-Leibler (KL) Divergence. But is it the best tool for the job? In this post, we’ll explore why a modern alternative, Jensen-Shannon (JS) Divergence, might be the robust upgrade your monitoring system needs.

1. The Tale of Two Metrics: KL vs. JS Divergence

Both metrics answer the same fundamental question: “How different is my live production data (Q) from my original training data (P)?”

Kullback-Leibler (KL) Divergence

Often called “relative entropy,” this is the classic measure of information loss. It quantifies how much information is lost when we use the live distribution $Q$ to approximate the reference distribution $P$.

\[KL(P || Q) = \sum_{i} P(i) \cdot \log\left(\frac{P(i)}{Q(i)}\right)\]

While mathematically elegant, it has a critical flaw for monitoring: It is asymmetric.

  • $KL(P \Vert Q) \neq KL(Q \Vert P)$
  • The order matters. Comparing “Training to Live” gives a different score than “Live to Training,” which can be confusing in automated alerts.

Jensen-Shannon (JS) Divergence

JS Divergence is the “smoothed, symmetric cousin” of KL Divergence. It fixes the stability issues of KL by comparing both distributions to their average mixture ($M$).

It is a two-step calculation:

Step 1: Create the Average Distribution (M) First, you define an average distribution, $M$, which is the midpoint between $P$ and $Q$:

\[M = \frac{1}{2}(P + Q)\]
  • For each event $i$, this is just $M(i) = \frac{P(i) + Q(i)}{2}$.

Step 2: Calculate the JS Divergence The JS Divergence is the average of the KL Divergences from $P$ to $M$ and from $Q$ to $M$:

\[JSD(P || Q) = \frac{1}{2}KL(P || M) + \frac{1}{2}KL(Q || M)\]

This seemingly small change gives it three superpowers:

  1. Symmetry: $JSD(P \Vert Q) = JSD(Q \Vert P)$. It is a true distance metric.
  2. Bounded Score: The result is always between 0 and ~0.693 (ln 2). This makes setting alert thresholds much easier.
  3. Stability: It handles zero-probability events gracefully (more on this below).

2. Real-World Use Case: The Fraud Detection System

Let’s visualize a standard fraud detection pipeline to see where these metrics fit in.
Imagine a PayPal-like transaction system processing millions of payments.

  • The “Golden Profile” ($P$): Before deploying Fraud_Model_v1, you calculated the statistical distribution of key features (like transaction_amount) in the training set.
  • The Live Monitor ($Q$): Every hour, a Flink or Spark job calculates the distribution of the exact same features from the last hour of live traffic.

The monitor runs a simple test: IF Distance(P, Q) > Threshold THEN Alert.
If the distance spikes, it means the world has changed, perhaps a new fraud ring is attacking and the model needs retraining.

Example 1: Detecting Data Drift (P(X) changes)

Data drift is a change in the input features. We monitor each feature independently.

  • Feature: transaction_amount (Continuous)
    1. Step 1: Create P_train Profile.
      • We can’t use raw numbers. So, we bin (discretize) the feature into a histogram.
      • Training Data Profile (P):
        • $0 - $50: 40%
        • $50 - $100: 30%
        • $100 - $500: 20%
        • $500+: 10%
    2. Step 2: Create P_live Profile.
      • We run the same binning on the last 24 hours of live traffic.
      • Live Data Profile (Q):
        • $0 - $50: 20% (A big drop!)
        • $50 - $100: 25%
        • $100 - $500: 30%
        • $500+: 25% (A big increase!)
    3. Step 3: Calculate KL(P || Q)
      • The KL formula will sum the divergences for each bin. It will see a huge “surprise” in the $0-$50 and $500+ bins.
      • The resulting drift_score will be high (e.g., 0.28).
    4. Step 4: Act.
      • 0.28 > 0.1. The alarm bells ring.
      • Why? Fraudsters have stopped making small test transactions and are now trying to steal large amounts. The inputs have drifted, and the model must be retrained on this “new normal.”
  • Feature: merchant_category (Categorical) This is even easier. The bins are just the categories.
    • P_train: {"groceries": 0.5, "electronics": 0.2, "gambling": 0.05, ...}
    • P_live: {"groceries": 0.4, "electronics": 0.2, "gambling": 0.15, ...}
    • Action: The KL score for this feature will be high because the “gambling” probability has tripled. This is a clear data drift signal.

Example 2: Detecting Concept Drift (P(y|X) changes)

Concept drift is harder, as the relationship changes. We can’t measure P(y|X) directly. So, we use a proxy.

The two best proxies are the distribution of the target variable (P(y)) or the distribution of the model’s scores (P(score)).

Method: Target Distribution (P(y))

  1. Step 1: Create P_train Profile.
    • This requires your labeled training data.
    • Training Data Profile (P):
      • y=0 (Not Fraud): 99%
      • y=1 (Fraud): 1% (This is our baseline assumption)
  2. Step 2: Create P_live Profile.
    • This requires a feedback loop. We must look at the (delayed) ground truth labels for the last 24 hours of live traffic.
    • Live Data Profile (Q):
      • y=0 (Not Fraud): 97%
      • y=1 (Fraud): 3%
  3. Step 3: Calculate KL(P || Q)
    • The KL formula will see that the “fraud” class is 3x more likely than it was during training. The “surprise” is high.
  4. Step 4: Act.
    • Why? This is a massive concept drift. The base rate of fraud in the real world has tripled.
    • Our model was trained to be very cautious about flagging things (because fraud was 1-in-100). Now that fraud is 3-in-100, its old assumptions are wrong, and it is almost certainly “under-predicting” fraud. It must be retrained on this new data to “re-calibrate” its understanding of how common fraud is.

3. The “Corner Case” That Breaks KL Divergence

This is where the theory hits reality. In fraud detection, data is messy. New trends appear instantly, and old trends vanish.
Let’s look at a categorical feature: merchant_category.

The Scenario: “The Crypto Attack”

Suppose your model was trained in 2020. Back then, nobody bought Crypto on your platform.

  • Training Data ($P$): {“groceries”: 0.5, “travel”: 0.5, “crypto”: 0.0}

In 2025, a new fraud attack targets Crypto transactions. Suddenly, your live traffic looks like this:

  • Live Data ($Q$): {“groceries”: 0.5, “travel”: 0.0, “crypto”: 0.5}

Notice two things happened:

  1. A new category appeared (crypto went from 0.0 to 0.5).
  2. An old category disappeared (travel went from 0.5 to 0.0).

How KL Divergence Fails

If we try to calculate the drift score using KL, we hit a mathematical wall.

  • Case A: P(i) > 0 but Q(i) = 0
    • We try to calculate the drift for “travel”.
    • $P(\text{travel}) = 0.5$, but $Q(\text{travel}) = 0.0$.
    • Formula term: $\ln(0.5 / 0.0) = \ln(\infty) = \infty$.
    • Result: Infinite Drift. Your dashboard crashes or shows NaN.
  • Case B: P(i) = 0 but Q(i) > 0
    • We try to calculate the drift for “crypto”.
    • $P(\text{crypto}) = 0.0$, but $Q(\text{crypto}) = 0.5$.
    • The entire term $0 * \ln(0 / 0.1)$ is treated as 0. This bin does not contribute to the drift score.
    • Result: Drift not captured properly.

In first case, the metric “blows up.” It tells you something is wrong, but an “Infinity” score is useless. It doesn’t tell you the magnitude of the drift compared to yesterday. Is today’s infinity “worse” than yesterday’s infinity?

The Fix (Smoothing): To prevent the “infinite” score from Q(i) = 0, we use smoothing. In practice, libraries like scipy add a tiny epsilon (e.g., 1e-10) to all probabilities in both P and Q to ensure no bin is ever truly zero.

How JS Divergence Saves the Day

Now let’s run the same “Crypto Attack” scenario through JS Divergence.

  1. Calculate Average ($M$):
    • $M(\text{travel}) = (0.5 + 0.0) / 2 = 0.25$
    • $M(\text{crypto}) = (0.0 + 0.5) / 2 = 0.25$
    • Notice: The zeros are gone!
  2. Calculate Score:
    • Because we are comparing against the average $M$, we never divide by zero.
    • The formula resolves cleanly to a score of 0.346.

The Verdict:
Instead of a broken “Infinity” error, JS Divergence gives you a precise, finite number: 0.346.
Since the max possible score is ~0.69, this tells you the drift is significant (about 50% of max), but measurable. You can log this, track it over time, and set a clean threshold (e.g., Alert if > 0.1).

Summary

MetricHandling of New/Missing DataScore RangeVerdict for Production
KL DivergenceFails (Infinite Score). Extremely brittle if a bin probability drops to zero.$0 \to \infty$Use only if distributions are guaranteed to overlap perfectly.
JS DivergenceRobust. Handles new or missing categories smoothly via averaging.$0 \to 0.69$The Production Standard. Safe for automated monitoring.

If you are building a fraud detection system where the adversarial landscape changes daily, switch to Jensen-Shannon Divergence. It turns a “crashing dashboard” problem into a reliable “early warning” signal.

Enjoyed this article? Never miss out on future posts - follow me.
This post is licensed under CC BY 4.0 by the author.