Venturing into the Isolation Forest
1. Explain the working principles behind Isolation Forest?
Isolation Forest detects anomalies by exploiting a simple idea:
Anomalies are few and different, so they can be isolated (separated) from the rest of the data with fewer random splits than normal points.
How It Works
Build Many Random Trees
- At each node, randomly choose:
- a feature
- a random split value between the feature’s min and max
- Continue splitting until:
- the point is isolated in its own leaf, or
- tree reaches a height limit
This produces an isolation tree.
Isolation Path Length
For each data point:
- Track the path length = number of splits needed to isolate it.
Intuition:
- Outliers → isolated quickly → short paths
- Normal points → need more splits → longer paths
Ensemble Averaging
- Build many such random trees (a forest).
- Average the path lengths across trees.
Scoring
- Convert average path length into an anomaly score.
- Shorter average path → higher anomaly score → more likely an outlier.
Why It’s Effective
- No need for distance or density assumptions.
- Handles high-dimensional data well.
- Efficient: works with subsamples, linear time complexity.
- Naturally isolates outliers without modeling normal behavior explicitly.
2. Explain the anomaly score formulation for Isolation Forest.
The core idea is simple: Anomalies are easier to “isolate” than normal points. The score formula in IF is just a way to normalize this “ease of isolation” into a consistent score between 0 and 1.
The formulation has three parts:
- The Path Length ($h(x)$): How easy is it to isolate a point in one tree?
- The Normalization ($c(n)$): What is a “normal” path length for a dataset of this size?
- The Final Score ($s$): How does our point’s path length compare to the normal one?
Part 1: The Path Length ($h(x)$)
This is the raw “score” from a single Isolation Tree.
An Isolation Tree is built by randomly splitting the data until every point is isolated in its own leaf node.
$h(x)$ is defined as the path length (the number of edges) from the root of the tree to the leaf node containing the point $x$.
- Anomalies: Are “few and different.” They are easy to isolate and will have a very short path length.
- Normal Points: Are in dense clusters. They require many splits to isolate and will have a very long path length.
Since the algorithm is random, we build a “forest” of (e.g.) 100 trees and take the average path length for point $x$. This average is called $E(h(x))$.
Part 2: The Normalization Factor ($c(n)$)
The “average path length” $E(h(x))$ is not useful on its own. A path length of 8 is “short” in a dataset of 1,000,000 but “long” in a dataset of 20.
We need to normalize it by the average path length for a “typical” point, which the authors model as the average path length of an unsuccessful search in a Binary Search Tree (BST).
This normalization factor is given by the formula:
\[c(n) = 2H(n-1) - \frac{2(n-1)}{n}\]Where:
- $n$ is the number of samples in the training data.
- $H(i)$ is the Harmonic Number, which can be approximated as $\ln(i) + 0.5772…$ (Euler’s constant).
You don’t need to memorize this. Just know that $c(n)$ is the “expected” or “average” path length for a dataset of size $n$.
Part 3: The Final Anomaly Score ($s$)
The final anomaly score $s$ for a point $x$ is a formula that combines its specific path length $E(h(x))$ with the dataset’s average path length $c(n)$.
The score is defined as:
\[s(x, n) = 2^{-\frac{E(h(x))}{c(n)}}\]This formula is chosen specifically to map the path lengths to a score between 0 and 1.
How to Interpret the Score
This is the most important part. Let’s analyze the formula $s = 2^{-\text{exponent}}$.
- If $x$ is a clear ANOMALY:
- Its path length $E(h(x))$ will be very short: $E(h(x)) \rightarrow 0$.
- The exponent $\frac{E(h(x))}{c(n)}$ will be close to 0.
- The score $s$ will be $2^{-0} = 1$.
- Score is close to 1
- If $x$ is a clear NORMAL point:
- Its path length $E(h(x))$ will be very long, close to the average for the dataset: $E(h(x)) \approx c(n)$.
- The exponent $\frac{E(h(x))}{c(n)}$ will be close to 1.
- The score $s$ will be $2^{-1} = 0.5$.
- Score is close to 0.5
- If $x$ is “deeply” normal (very, very average):
- Its path length $E(h(x))$ might be even longer than the average: $E(h(x)) > c(n)$.
- The exponent $\frac{E(h(x))}{c(n)}$ will be greater than 1.
- The score $s$ will be $2^{-(\text{number} > 1)}$, so it will be even smaller than 0.5.
- Score is close to 0
Summary of Scores:
- $s \approx 1$: Highly likely to be an anomaly.
- $s \approx 0.5$: A typical normal point.
- $s < 0.5$: A very “normal” inlier.
3. Explain the tree/forest building process.
Here is the step-by-step algorithm for building an Isolation Forest.
The core philosophy is: “Randomly slice the data until every point is isolated.”
Unlike a normal Decision Tree, which tries to find the best split to group similar items, an Isolation Tree (iTree) tries to find any split to separate items.
The High-Level Setup
Before we build a single tree, we prepare the forest.
- Subsampling:
- We do not use the entire dataset to build a tree.
- We take a random subsample of size $\psi$ (usually 256).
- Why? Anomalies are rare. In a small sample, they stand out more clearly (masking effect is reduced), and it makes training extremely fast.
- Forest Initialization:
- We decide to build $T$ trees (e.g., 100).
- We repeat the tree-building process below 100 times, each with a different random subsample.
The Tree Building Process (Recursive)
We start with a Root Node containing all 256 points in our subsample. We then apply the following recursive function:
Function BuildTree(data, current_height):
Step 1: Check Stopping Conditions (Base Cases) If any of these are true, we stop splitting and create a Leaf Node.
- Isolation: The
datacontains only 1 point. (We successfully isolated it). - Duplicate Data: The
datacontains multiple points, but they are all identical (values are the same). We can’t split them. - Height Limit:
current_heightreached themax_depthlimit (usually set to $\approx \log_2(\psi)$, e.g., 8 for 256 samples). We force a stop to prevent deep, computational waste.
Step 2: Choose a Split (The Random Slice) If we didn’t stop, we must split the data.
- Pick a Feature: Randomly select one feature $q$ from the available $d$ features.
- Pick a Split Value:
- Find the min and max values of feature $q$ in the current
data. - Randomly select a split value $p$ uniformly between min and max: \(p \sim \text{Uniform}(min, max)\)
- Find the min and max values of feature $q$ in the current
Step 3: Partition the Data We divide the data into two buckets based on the random split $p$:
- Left Set ($S_L$): All points where $feature_q < p$.
- Right Set ($S_R$): All points where $feature_q \ge p$.
Step 4: Recurse (Create Children) We create two child nodes and repeat the process:
- Left Child: Call
BuildTree(S_L, current_height + 1) - Right Child: Call
BuildTree(S_R, current_height + 1)
Step 5: Return Node We return the current internal node, storing the split rule: “Feature $q$ < $p$?”
The Result
After this process finishes, you have a fully grown Isolation Tree.
- Anomalies: Likely ended up in leaf nodes very close to the root (short path length) because their values were extreme (close to min/max), so a random split was highly likely to “chop” them off early.
- Normal Points: Likely ended up deep in the tree (long path length) because they were clustered in the middle, requiring many precise cuts to separate them from their neighbors.
4. Explain why we take sub-samples of the data while building the tree.
This is the most counter-intuitive part of the Isolation Forest algorithm, and understanding it is the “Aha!” moment for this technique.
To answer your specific scenario directly: We do NOT build trees on the whole dataset.
If you have 10,000 points and you set the subsample size to 256:
- Tree 1 picks 256 random points and ignores the other
9,744. - Tree 2 picks a new set of 256 random points and ignores the rest.
- Tree 100 picks 256 random points and ignores the rest.
It is highly possible that some of your 10,000 points are never used in training any tree.
And this is exactly what we want. Here is why.
Reason 1: Solving “Swamping” and “Masking”
The goal of an Isolation Tree is to isolate anomalies. Too much data actually makes this harder, not easier.
- The Problem of Swamping: Imagine an anomaly is surrounded by a few normal points. If you use the entire dataset (10,000 points), that anomaly is likely surrounded by thousands of normal points. To isolate the anomaly, the tree has to make many, many cuts to peel away all those thousands of normal neighbors. The anomaly looks “normal” (long path length) simply because it’s crowded.
- The Solution (Subsampling): If we only take 256 points, we pick the anomaly and maybe only 2 or 3 of its normal neighbors. Now, it is very easy to isolate the anomaly with just one or two cuts. The anomaly stands out clearly.
- The Problem of Masking: Imagine you have a cluster of 50 anomalies. If you use the whole dataset, they form a dense cluster. The algorithm might think, “Hey, this is a dense area, these must be normal!” They “mask” each other’s anomalous nature.
- The Solution (Subsampling): In a sample of 256, we might statistically only pick 1 or 2 of those anomalies. Without their gang of 50 friends, they look lonely and isolated. They are easily detected.
Reason 2: Generalization over Memorization
You are asking: “If we don’t use all the data, how does the model learn?”
Isolation Forest is not trying to memorize the data (like a database). It is trying to learn the shape (topology) of the data.
- Analogy: To understand the shape of a forest, you don’t need to map every single tree. You just need to take a few random photos (samples) from a helicopter.
- If you take 100 different random samples of 256 points, you get 100 different “sketches” of what “Normal” looks like.
- Tree 1 sketches the dense center.
- Tree 2 sketches the dense center.
- Tree 100 sketches the dense center.
- When you average these 100 sketches, you have a perfect map of where the “normal” (dense) regions are and where the “empty” (anomaly) regions are. You didn’t need to see every single data point to learn where the dense region is.
Summary for N=10,000
- Training: You build 100 trees. Each tree sees a random unique subset of 256 points. The vast majority of data is ignored by any single tree.
- Inference (The “Whole Dataset” part):
- When you want to find anomalies, you pass ALL 10,000 points (one by one) through the forest.
- Even if Point A was never used to train Tree 1, Tree 1 can still score it.
- Tree 1 says: “Based on the random 256 points I saw, Point A lands in an empty area. It looks like an anomaly.”
- Tree 2 says: “Based on the 256 points I saw, Point A also looks like an anomaly.”
- Result: Point A is flagged as an anomaly, even if it was never part of the training samples.