Post

Understanding Bias-Variance Tradeoff

Understanding Bias-Variance Tradeoff

1. Explain the concept of ‘bias variance tradeoff’ with a practical example.

The Bias-Variance Tradeoff is one of the most fundamental concepts in machine learning. It’s the central challenge we face when building any predictive model.

Let’s break it down conceptually and then walk through a practical example.

The Core Idea: An Archery Analogy

Imagine you are an archer trying to hit the bullseye of a target. You take several shots.

  1. Bias: Think of bias as a systematic error or a flaw in your aim.

    • High Bias: You consistently miss the bullseye in the same direction. For example, all your shots land in the top-left corner. Your aim is consistently off.
    • Low Bias: Your shots are centered around the bullseye. On average, you are hitting the target correctly.
  2. Variance: Think of variance as how scattered your shots are. It measures the consistency of your aim.

    • High Variance: Your shots are all over the place - some high, some low, some left, some right. Your aim is very inconsistent and unpredictable.
    • Low Variance: All your shots are tightly clustered together. Your aim is very consistent.

The Four Scenarios

This gives us four possible outcomes:

  • Low Bias, Low Variance (Ideal): You are accurate and consistent. All your shots land on the bullseye.
  • High Bias, Low Variance (Underfitting): You are consistent, but consistently wrong. All your shots are clustered together but miss the bullseye.
  • Low Bias, High Variance (Overfitting): You are accurate on average, but very inconsistent. Your shots are scattered all around the bullseye.
  • High Bias, High Variance (Worst Case): You are inaccurate and inconsistent. Your shots are scattered and miss the bullseye.

The Tradeoff in Machine Learning

In machine learning, the “bullseye” is the true underlying pattern in the data that we want our model to learn.

  • Bias is the error from a model being too simple. It makes overly simplistic assumptions about the data, failing to capture its true complexity. This is called underfitting. An underfit model has high bias.

  • Variance is the error from a model being too complex. It learns not only the underlying pattern but also the random noise specific to the training data. This model is too sensitive to the data it was trained on. This is called overfitting. An overfit model has high variance.

The Tradeoff: As you decrease a model’s bias, you almost always increase its variance, and vice-versa. You can’t have both be zero. The goal is to find the “sweet spot” that minimizes the Total Error.

Total Error ≈ Bias² + Variance + Irreducible Error

  • Irreducible Error: Natural randomness in the data that no model can ever eliminate.
  • Our job is to find the model complexity that minimizes the sum of Bias² and Variance.

A Practical Example: Predicting House Prices

Let’s say we want to build a model to predict the price of a house based on its features (size, number of bedrooms, location, age, etc.).

The High Bias (Underfitting) Model

  • What we do: We create a very simple model. We decide to use only one feature: the size of the house (in square feet). Our model is a simple linear regression: Price = θ₀ + θ₁ * Size
  • What the model learns: It learns a simple rule: “The bigger the house, the more expensive it is.”
  • The Result (High Bias): This model is too simple. It completely ignores critical information like location (a small apartment in South Mumbai is more expensive than a large house in a rural area) or the age of the house. It will be systematically wrong for many houses. It fails to capture the true, complex relationship in the data. This is underfitting. Its variance will be low because if we train it on different sets of housing data, it will likely learn the same simple, upward-sloping line every time.

The High Variance (Overfitting) Model

  • What we do: We get carried away and create an extremely complex model. We use a high-degree polynomial regression that considers every feature and all possible interactions between them. Price = θ₀ + θ₁*Size + θ₂*Size² + ... + θ₁₀*Size¹⁰ + θ₁₁*Bedrooms + ...
  • What the model learns: This model is incredibly flexible. It will learn the main trends, but it will also start learning the noise in our specific training data. It might learn that a 3-bedroom house of exactly 1,753 sq ft is unusually expensive because one such house in our training data happened to be an outlier. It will perfectly memorize the training data.
  • The Result (High Variance): While this model will have near-zero error on the data it was trained on, it will fail dramatically when shown a new, unseen house. Its predictions will be wild and unreliable because it has learned the random noise, not the general underlying pattern. It doesn’t generalize well. This is overfitting.

The “Goldilocks” (Good Balance) Model

  • What we do: We aim for the sweet spot. We use a model that is complex enough to capture the important relationships but not so complex that it memorizes noise. Model: A multiple linear regression with carefully selected features (Size, Bedrooms, Location_Score, Age) and perhaps some regularization (like Ridge or Lasso) to control the complexity.
  • What the model learns: It learns that price generally increases with size and number of bedrooms, decreases with age, and is heavily dependent on location.
  • The Result (Good Balance): This model makes some small errors on the training data (it has a little bias), but it successfully captures the main trends. When shown new data, it makes sensible and reliable predictions. It generalizes well because it has managed the tradeoff between being too simple and too complex.

2. Let’s define the interplay between bias and variance with mathematical rigor.

Of course. This is a classic and very elegant proof in statistical learning theory. The goal is to show how the Mean Squared Error (MSE) of a model can be decomposed into three fundamental components: Bias, Variance, and an irreducible error term.

Let’s walk through the derivation of the expansion of $E[(y - \hat{f}(x))^2]$ step-by-step.

1. Setup and Assumptions

First, we need to define our terms and the underlying assumptions:

  1. The True Relationship: We assume the true value $y$ is generated by a true, but unknown, function $f(x)$ plus some random noise $\epsilon$. \(y = f(x) + \epsilon\)

  2. The Model: Our machine learning model, trained on a specific set of data, produces a prediction $\hat{f}(x)$.

  3. Assumptions about the Noise ($\epsilon$):

    • The noise is random and has a mean of zero: $E[\epsilon] = 0$.
    • The noise is independent of our data and our model’s prediction.
    • The variance of the noise is $\sigma^2$: $Var(\epsilon) = E[\epsilon^2] - (E[\epsilon])^2 = E[\epsilon^2] - 0^2 = \sigma^2$.

Our goal is to decompose the Expected Mean Squared Error for a given point $x$. The expectation $E[\cdot]$ is taken over many different training sets, which is why our model’s prediction $\hat{f}(x)$ is considered a random variable.

\[MSE = E[(y - \hat{f}(x))^2]\]

2. The Derivation

Here is the step-by-step expansion.

Step 1: Substitute the true relationship for $y$.

We start by replacing $y$ with its definition, $f(x) + \epsilon$.

\[MSE = E[(f(x) + \epsilon - \hat{f}(x))^2]\]

Step 2: Rearrange terms and expand the square.

Let’s group the model-related terms together. We’ll treat (f(x) - f̂(x)) as one part (A) and ε as the other part (B) and expand the square $(A+B)^2 = A^2 + 2AB + B^2$.

\[MSE = E[((f(x) - \hat{f}(x)) + \epsilon)^2]\] \[MSE = E[(f(x) - \hat{f}(x))^2 + 2\epsilon(f(x) - \hat{f}(x)) + \epsilon^2]\]

Step 3: Apply the Linearity of Expectation.

The expectation of a sum is the sum of the expectations.

\[MSE = E[(f(x) - \hat{f}(x))^2] + E[2\epsilon(f(x) - \hat{f}(x))] + E[\epsilon^2]\]

Step 4: Simplify each term.

  • The last term, $E[\epsilon^2]$: From our assumptions, we know the variance of the noise is $\sigma^2 = E[\epsilon^2]$. This term is the Irreducible Error. It’s the noise inherent in the data that no model can ever eliminate.

  • The middle term, $E[2\epsilon(f(x) - \hat{f}(x))]$: We can pull out the constant 2. Because we assumed $\epsilon$ is independent of our data and model, we can separate the expectations: \(2E[\epsilon(f(x) - \hat{f}(x))] = 2E[\epsilon] \cdot E[f(x) - \hat{f}(x)]\) Since we assumed the mean of the noise is zero ($E[\epsilon] = 0$), this entire middle term becomes zero. \(2 \cdot (0) \cdot E[f(x) - \hat{f}(x)] = 0\)

  • The first term, $E[(f(x) - \hat{f}(x))^2]$: This is the Expected Squared Error between the true function and our model’s prediction. We need to expand this part further.

Our equation now looks like this:

\[MSE = E[(f(x) - \hat{f}(x))^2] + \sigma^2\]

Step 5: Decompose the remaining error term.

This is the clever part of the proof. We add and subtract the same quantity, $E[\hat{f}(x)]$, inside the square. $E[\hat{f}(x)]$ represents the average prediction our model would make at point $x$ if we were to train it on many different datasets.

\[E[(f(x) - \hat{f}(x))^2] = E[(f(x) - E[\hat{f}(x)] + E[\hat{f}(x)] - \hat{f}(x))^2]\]

Now, group the terms again. Let $A = (f(x) - E[\hat{f}(x)])$ and $B = (E[\hat{f}(x)] - \hat{f}(x))$ and expand $(A+B)^2$:

\[= E[ \underbrace{(f(x) - E[\hat{f}(x)])^2}_{A^2} + \underbrace{2(f(x) - E[\hat{f}(x)])(E[\hat{f}(x)] - \hat{f}(x))}_{2AB} + \underbrace{(E[\hat{f}(x)] - \hat{f}(x))^2}_{B^2} ]\]

Apply linearity of expectation one more time:

\[= E[(f(x) - E[\hat{f}(x)])^2] + E[2(f(x) - E[\hat{f}(x)])(E[\hat{f}(x)] - \hat{f}(x))] + E[(E[\hat{f}(x)] - \hat{f}(x))^2]\]

Let’s analyze these three new terms:

  • First term: $E[(f(x) - E[\hat{f}(x)])^2]$ The quantities $f(x)$ and $E[\hat{f}(x)]$ are constants with respect to the expectation, so the expectation has no effect. The term is $(f(x) - E[\hat{f}(x)])^2$. This is the squared difference between the true function’s value and the average prediction of our model. This is the definition of Squared Bias. \(\text{Bias}^2 = (f(x) - E[\hat{f}(x)])^2\)

  • Third term: $E[(E[\hat{f}(x)] - \hat{f}(x))^2]$ This is the expected squared difference between the average model prediction and a specific model’s prediction. This is the definition of Variance. It measures how much our model’s predictions would scatter if we trained it on different datasets. \(\text{Variance} = E[(\hat{f}(x) - E[\hat{f}(x)])^2]\)

  • Middle term: $E[2(f(x) - E[\hat{f}(x)])(E[\hat{f}(x)] - \hat{f}(x))]$

    The part $(f(x) - E[\hat{f}(x)])$ is a constant, so we can pull it out of the expectation:

    \[2(f(x) - E[\hat{f}(x)]) \cdot E[E[\hat{f}(x)] - \hat{f}(x)]\]

    The expectation of a difference is the difference of expectations:

    \[E[E[\hat{f}(x)] - \hat{f}(x)] = E[E[\hat{f}(x)]] - E[\hat{f}(x)]\]

    The expectation of a constant ($E[\hat{f}(x)]$ is a constant) is just the constant itself, so $E[E[\hat{f}(x)]] = E[\hat{f}(x)]$.

    \[= E[\hat{f}(x)] - E[\hat{f}(x)] = 0\]

    Therefore, the entire middle term is zero.

Step 6: Final Assembly

Putting it all back together, the decomposition of $E[(f(x) - \hat{f}(x))^2]$ is:

\[E[(f(x) - \hat{f}(x))^2] = \text{Bias}^2 + \text{Variance}\]

Now, substitute this into our equation from the end of Step 4:

\[MSE = \underbrace{\text{Bias}^2 + \text{Variance}}_{\text{from Step 5}} + \underbrace{\sigma^2}_{\text{from Step 4}}\]

This gives us the final, famous decomposition:

\[E[(y - \hat{f}(x))^2] = \text{Bias}(\hat{f}(x))^2 + \text{Variance}(\hat{f}(x)) + \text{Irreducible Error}\]
Enjoyed this article? Never miss out on future posts - follow me.
This post is licensed under CC BY 4.0 by the author.