Post

Let us Support Vector Machines

Let us Support Vector Machines

1. Derive the equation of a line (or a hyper-plane in higher dimension).

This is a fundamental concept in linear algebra and geometry. To understand why a hyperplane is defined by $\mathbf{w} \cdot \mathbf{x} + b = 0$, we need to look at the geometric relationship between a vector and a plane.

1. The Geometric Definition

A line in 2D (or a plane in 3D, or hyperplane in $n$D) can be uniquely defined by two things:

  1. A fixed point that lies on the plane, let’s call it $\mathbf{x}_0$.
  2. A vector that is perpendicular (normal) to the plane, let’s call it $\mathbf{w}$.

2. The Derivation

Let $\mathbf{x}$ be any arbitrary point $[x_1, x_2, \dots, x_n]$ lying anywhere on this hyperplane.

If you draw a vector from the fixed point $\mathbf{x}_0$ to this arbitrary point $\mathbf{x}$, you get the vector $\mathbf{x} - \mathbf{x}_0$.

Since both $\mathbf{x}$ and $\mathbf{x}_0$ lie on the plane, the vector connecting them ($\mathbf{x} - \mathbf{x}_0$) must lie flat on the plane.

By definition, the normal vector $\mathbf{w}$ is perpendicular to every vector that lies on the plane. Therefore, $\mathbf{w}$ must be perpendicular to $(\mathbf{x} - \mathbf{x}_0)$.

The Orthogonality Condition: Two vectors are perpendicular if and only if their dot product is zero.

\[\mathbf{w} \cdot (\mathbf{x} - \mathbf{x}_0) = 0\]

3. Expanding the Equation

Let’s distribute the dot product:

\[\mathbf{w} \cdot \mathbf{x} - \mathbf{w} \cdot \mathbf{x}_0 = 0\]

Now, look at the second term, $\mathbf{w} \cdot \mathbf{x}_0$.

  • $\mathbf{w}$ is a fixed constant vector (the normal).
  • $\mathbf{x}_0$ is a fixed constant point on the plane.
  • Therefore, their dot product is just a scalar constant.

Let’s define this scalar constant as $-b$.

\[b = -(\mathbf{w} \cdot \mathbf{x}_0)\]

Substituting this back into our equation:

\[\mathbf{w} \cdot \mathbf{x} + b = 0\]

Summary of Terms

  • $\mathbf{w}$ (Weight Vector): Defines the orientation of the hyperplane. It points perpendicular to the surface.
  • $b$ (Bias): Defines the position of the hyperplane relative to the origin.
    • If $b=0$, the hyperplane passes directly through the origin (because if $\mathbf{x}=\mathbf{0}$, then $\mathbf{w} \cdot \mathbf{0} + 0 = 0$).
    • Changing $b$ shifts the plane parallel to itself without changing its orientation.

This is why in SVM and linear algebra, this compact notation represents any flat boundary in any dimension.


2. Using the above equation, derive the perpendicular distance of an arbitrary point $x_i$ to the hyperplane.

This is a classic vector geometry derivation. To find the distance, we project the vector connecting the plane to the point onto the plane’s normal vector.

The Setup

  1. The Hyperplane: Defined by $\mathbf{w} \cdot \mathbf{x} + b = 0$.
  2. The Point: An arbitrary point $\mathbf{x}_i$ that is somewhere in space (not necessarily on the plane).
  3. The Projection: Let $\mathbf{x}_p$ be the point on the hyperplane that is closest to $\mathbf{x}_i$. This means the line connecting $\mathbf{x}_i$ and $\mathbf{x}_p$ is perpendicular to the plane.
  4. The Distance ($d$): This is the length of the vector between $\mathbf{x}_i$ and $\mathbf{x}_p$.

The Derivation

Step 1: Define the Relationship Vector Since the line connecting $\mathbf{x}_i$ and $\mathbf{x}_p$ is perpendicular to the plane, it must be parallel to the normal vector $\mathbf{w}$.

We can express $\mathbf{x}_i$ as starting from $\mathbf{x}_p$ and moving a distance $d$ in the direction of the unit normal vector $\frac{\mathbf{w}}{|\mathbf{w}|}$.

\[\mathbf{x}_i = \mathbf{x}_p + d \frac{\mathbf{w}}{\|\mathbf{w}\|}\]

(Note: $d$ here is technically a signed distance. It could be positive or negative depending on which side of the plane the point is. We will take the absolute value at the end).

Step 2: Isolate the Point on the Plane ($\mathbf{x}_p$) Let’s rearrange the equation to solve for $\mathbf{x}_p$:

\[\mathbf{x}_p = \mathbf{x}_i - d \frac{\mathbf{w}}{\|\mathbf{w}\|}\]

Step 3: Use the Hyperplane Constraint This is the crucial step. Since $\mathbf{x}_p$ lies on the hyperplane, it must satisfy the hyperplane equation:

\[\mathbf{w} \cdot \mathbf{x}_p + b = 0\]

Now, substitute the expression for $\mathbf{x}_p$ from Step 2 into this equation:

\[\mathbf{w} \cdot \left( \mathbf{x}_i - d \frac{\mathbf{w}}{\|\mathbf{w}\|} \right) + b = 0\]

Step 4: Distribute and Simplify Expand the dot product:

\[\mathbf{w} \cdot \mathbf{x}_i - d \frac{\mathbf{w} \cdot \mathbf{w}}{\|\mathbf{w}\|} + b = 0\]

Recall that the dot product of a vector with itself is the square of its norm ($\mathbf{w} \cdot \mathbf{w} = |\mathbf{w}|^2$):

\[\mathbf{w} \cdot \mathbf{x}_i + b - d \frac{\|\mathbf{w}\|^2}{\|\mathbf{w}\|} = 0\]

Cancel out one $|\mathbf{w}|$ term in the fraction:

\[\mathbf{w} \cdot \mathbf{x}_i + b - d \|\mathbf{w}\| = 0\]

Step 5: Solve for Distance ($d$) Move the term with $d$ to the other side:

\[d \|\mathbf{w}\| = \mathbf{w} \cdot \mathbf{x}_i + b\] \[d = \frac{\mathbf{w} \cdot \mathbf{x}_i + b}{\|\mathbf{w}\|}\]

Step 6: Absolute Distance Since geometric distance is always non-negative, and $\mathbf{w} \cdot \mathbf{x}_i + b$ could be negative (if the point is on the negative side of the plane), we take the absolute value of the numerator.

\[\text{Distance} = \frac{|\mathbf{w} \cdot \mathbf{x}_i + b|}{\|\mathbf{w}\|}\]

Connection to SVM

This formula explains why maximizing the margin works the way it does. For a Support Vector, we enforce that $|\mathbf{w} \cdot \mathbf{x}_{sv} + b| = 1$. Substituting this into the distance formula above gives us the geometric margin:

\[\text{Margin} = \frac{1}{\|\mathbf{w}\|}\]

This proves mathematically why minimizing $|\mathbf{w}|$ is equivalent to maximizing the physical distance (margin) between the hyperplane and the support vectors.


3. Let’s now look at the primal formulation behind linear SVM algorithm.

This is the culmination of our previous steps. We will now assemble the pieces - the hyperplane definition, the distance formula, and the concept of scaling, to build the complete Primal Formulation of the Support Vector Machine.

1. The Problem Statement

Goal: Given a set of linearly separable data points belonging to two classes ($y \in {+1, -1}$), we want to find the optimal linear decision boundary (hyperplane) that separates them.

Definition of “Optimal”: The optimal hyperplane is the one that maximizes the geometric margin. The margin is the distance between the decision boundary and the closest data points from either class. By maximizing this “street” width, we maximize the model’s robustness and ability to generalize to new, unseen data.

2. Deriving the Margin Boundaries (The “Street”)

Let our decision boundary be the hyperplane: \(H_0: \mathbf{w} \cdot \mathbf{x} + b = 0\)

We assume the data is linearly separable. This means we can find a hyperplane where all positive points ($y=+1$) are on one side and all negative points ($y=-1$) are on the other.

The Canonical Scale (The “Scaling Trick”): We can scale $\mathbf{w}$ and $b$ by any constant without changing the position of the hyperplane $H_0$. To make the math unique and solvable, we fix the scale. (NOTE: We discuss on this in much more detail in the next question, so let’s just assume it to be true for now)

We enforce that for the data points closest to the hyperplane (the Support Vectors), the absolute value of the decision function is exactly 1.

This defines two additional hyperplanes that form the edges of our “street”:

  1. Positive Hyperplane ($H_1$): The boundary for the positive support vectors. \(\mathbf{w} \cdot \mathbf{x} + b = 1\)
  2. Negative Hyperplane ($H_2$): The boundary for the negative support vectors. \(\mathbf{w} \cdot \mathbf{x} + b = -1\)

3. Deriving the Margin Width

Now we calculate the physical width of the street created by $H_1$ and $H_2$.

We use the distance formula we derived earlier: $d = \frac{\lvert \mathbf{w} \cdot \mathbf{x} + b \rvert}{\lVert \mathbf{w} \rVert}$.

  • Distance from $H_0$ to $H_1$: Take a point $\mathbf{x}^+$ on the positive hyperplane. We know $\mathbf{w} \cdot \mathbf{x}^+ + b = 1$. \(d_+ = \frac{|1|}{\|\mathbf{w}\|} = \frac{1}{\|\mathbf{w}\|}\)

  • Distance from $H_0$ to $H_2$: Take a point $\mathbf{x}^-$ on the negative hyperplane. We know $\mathbf{w} \cdot \mathbf{x}^- + b = -1$. \(d_- = \frac{|-1|}{\|\mathbf{w}\|} = \frac{1}{\|\mathbf{w}\|}\)

  • Total Margin ($M$): The total width is the sum of these two distances: \(M = d_+ + d_- = \frac{1}{\|\mathbf{w}\|} + \frac{1}{\|\mathbf{w}\|} = \frac{2}{\|\mathbf{w}\|}\)

Conclusion: To maximize the margin $M$, we must maximize $\frac{2}{|\mathbf{w}|}$.

4. Deriving the Unified Constraint

We need to ensure that no data points are inside the street. They must all be on the correct side of the margin boundaries.

  • For Positive Samples ($y_i = +1$): They must be on or above the positive hyperplane $H_1$. \(\mathbf{w} \cdot \mathbf{x}_i + b \ge 1\)

  • For Negative Samples ($y_i = -1$): They must be on or below the negative hyperplane $H_2$. \(\mathbf{w} \cdot \mathbf{x}_i + b \le -1\)

Combining them: We can cleverly combine these two inequalities into one. Look at the negative case: $\mathbf{w} \cdot \mathbf{x}_i + b \le -1$. If we multiply both sides by $-1$ (and flip the inequality sign), we get: $-(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1$.

Since $y_i = -1$ for these points, this is the same as: $y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1$. For the positive points, $y_i = +1$, so $1(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1$ holds too.

Thus, the Unified Single Constraint is: \(y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1 \quad \forall i\)

5. The Final Primal Formulation

Now we put it all together into an optimization problem.

The Objective: We want to Maximize $\frac{2}{|\mathbf{w}|}$. Maximizing $\frac{2}{|\mathbf{w}|}$ is mathematically equivalent to minimizing $|\mathbf{w}|$. To make the derivatives easier (avoiding square roots), we square the norm. Minimizing $|\mathbf{w}|$ is equivalent to minimizing $|\mathbf{w}|^2$. By convention, we add a factor of $\frac{1}{2}$.

So, Maximize Margin $\equiv$ Minimize $\frac{1}{2}|\mathbf{w}|^2$.

The Formal Primal Problem:

\[\begin{aligned} & \textbf{Minimize:} & & \frac{1}{2}\|\mathbf{w}\|^2 \\ & \textbf{Subject to:} & & y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1 \quad \text{for } i = 1, \dots, n \end{aligned}\]

This is a Convex Quadratic Programming problem. Because it is convex, we are guaranteed that if a solution exists, it is the global minimum, the absolute best separating hyperplane.


4. Why are the two margin boundaries defined by $\mathbf{w}.\mathbf{x} + b = \pm 1$ and not by $\mathbf{w}.\mathbf{x} + b = \pm c$ for some arbirtary c?

That is an excellent and very insightful question. It gets to the heart of how the SVM optimization problem is elegantly formulated.

The short answer is: It’s a clever mathematical simplification. We can set the margin boundaries to +1 and -1 without any loss of generality because of the way the hyperplane equation can be scaled.

Let’s break this down into two key concepts: the Functional Margin and the Geometric Margin.

1. The Problem of a “Floating” Definition

First, let’s consider the general equation of our separating hyperplane:

\[\mathbf{w} \cdot \mathbf{x} + b = 0\]

For any point $\mathbf{x}_i$, the value of $\mathbf{w} \cdot \mathbf{x}_i + b$ tells us how far that point is from the hyperplane, at least in a relative sense.

However, notice that if we scale w and b by some constant c, we get the exact same hyperplane:

\[c(\mathbf{w} \cdot \mathbf{x} + b) = 0 \implies (c\mathbf{w}) \cdot \mathbf{x} + (cb) = 0\]

Let’s call $\mathbf{w}^* = c\mathbf{w}$ and $b^* = cb$. The new equation $\mathbf{w}^* \cdot \mathbf{x} + b^* = 0$ defines the exact same line.

This creates a problem. We could have a point close to the margin, but by just scaling up w and b, we can make the value of w . x + b arbitrarily large. This “floating” value is called the functional margin.

  • Functional Margin: \(\hat{\gamma} = y_i(\mathbf{w} \cdot \mathbf{x}_i + b)\) (Here, \(y_i\) is the class label, +1 or -1).

The functional margin tells us if a point is correctly classified (if it’s positive), but its magnitude can be changed just by scaling, which isn’t useful for finding a unique, optimal solution.

2. The Solution: Normalization and the Geometric Margin

What we actually care about is the real, physical distance from a point to the hyperplane. This is called the geometric margin. The formula for the shortest distance from a point $\mathbf{x}_i$ to the hyperplane $\mathbf{w} \cdot \mathbf{x} + b = 0$ is:

  • Geometric Margin: \(\gamma = \frac{y_i(\mathbf{w} \cdot \mathbf{x}_i + b)}{\|\mathbf{w}\|}\)

Notice that the geometric margin is just the functional margin normalized by the magnitude of w. Now, if we scale w and b by c, the geometric margin remains unchanged: \(\gamma' = \frac{y_i((c\mathbf{w}) \cdot \mathbf{x}_i + (cb))}{\|c\mathbf{w}\|} = \frac{c \cdot y_i(\mathbf{w} \cdot \mathbf{x}_i + b)}{|c| \cdot \|\mathbf{w}\|} = \frac{y_i(\mathbf{w} \cdot \mathbf{x}_i + b)}{\|\mathbf{w}\|} = \gamma\) This is exactly what we want! A stable, meaningful measure of distance.

3. The Elegance of Setting the Margin to 1

The goal of SVM is to maximize this geometric margin for the points closest to the hyperplane (the support vectors). Let’s call the geometric margin for a support vector \(\gamma_{sv}\).

So, our objective is: Maximize \(\gamma_{sv}\)

\[\gamma_{sv} = \frac{y_{sv}(\mathbf{w} \cdot \mathbf{x}_{sv} + b)}{\|\mathbf{w}\|}\]

Here comes the clever trick. Since we can scale w and b by any constant c without changing the geometric margin, we can choose a specific scaling that makes our math easier.

Let’s choose to scale w and b such that for the support vectors (the points closest to the hyperplane), the functional margin is exactly 1.

\[y_{sv}(\mathbf{w} \cdot \mathbf{x}_{sv} + b) = 1\]

By enforcing this scaling, we are essentially “pinning down” the functional margin to a fixed value. This hyperplane is now in what is called a canonical form.

What does this do to our optimization problem?

  1. The constraint that all points must be on or outside the margin becomes: \(y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1\) This is because the support vectors have a functional margin of 1, and all other points must be further away.

  2. The objective of maximizing the geometric margin (\(\gamma_{sv}\)) becomes much simpler. Since we defined \(y_{sv}(\mathbf{w} \cdot \mathbf{x}_{sv} + b) = 1\), the equation for the margin simplifies to: \(\text{Maximize } \gamma_{sv} = \frac{1}{\|\mathbf{w}\|}\)

Maximizing \(\frac{1}{\|\mathbf{w}\|}\)is the same as minimizing\(\|\mathbf{w}\|\), which is, in turn, the same as minimizing \(\frac{1}{2}\|\mathbf{w}\|^2\) (done for mathematical convenience).

This is the punchline: By fixing the functional margin of the support vectors to 1, we transform a complex maximization problem into a much cleaner minimization problem with a simple constraint.

So, the hyperplanes \(\mathbf{w} \cdot \mathbf{x} + b = 1\) and \(\mathbf{w} \cdot \mathbf{x} + b = -1\) are not arbitrary; they are the result of a deliberate and elegant mathematical choice to fix the scale of the problem, which simplifies the entire optimization process without losing any generality.


5. How do we formulate the dual problem from the primal formulation? How does our objective change with this new formulation?

Part 1: The Dual Formulation

Solving the primal problem directly can be complex. We can re-frame it into a more manageable form called the Dual Problem using the technique of Lagrange Multipliers.

Form the Lagrangian: We introduce non-negative Lagrange multipliers $\alpha_i \ge 0$ (one for each constraint) and form the Lagrangian function, $\mathcal{L}$:

\[\mathcal{L}(\mathbf{w}, b, \mathbf{\alpha}) = \underbrace{\frac{1}{2}\|\mathbf{w}\|^2}_{\text{Original Objective}} + \underbrace{\sum_{i=1}^{n} \alpha_i [1 - y_i(\mathbf{w} \cdot \mathbf{x}_i + b)]}_{\text{Constraints with Multipliers }\alpha_i \ge 0}\]

Let’s expand the terms inside the sum to make differentiation easier:

\[\mathcal{L}(\mathbf{w}, b, \alpha) = \frac{1}{2}(\mathbf{w} \cdot \mathbf{w}) - \sum_{i=1}^{n} \alpha_i y_i (\mathbf{w} \cdot \mathbf{x}_i) - \sum_{i=1}^{n} \alpha_i y_i b + \sum_{i=1}^{n} \alpha_i\]

To find the Dual Function $W(\alpha)$, we must minimize $\mathcal{L}$ with respect to the primal variables $\mathbf{w}$ and $b$. We do this by calculating gradients and setting them to zero.

Step 1: Derivative with respect to w

We take the partial derivative of $\mathcal{L}$ with respect to the vector $\mathbf{w}$:

\[\frac{\partial \mathcal{L}}{\partial \mathbf{w}} = \frac{\partial}{\partial \mathbf{w}}\left(\frac{1}{2}\mathbf{w} \cdot \mathbf{w}\right) - \frac{\partial}{\partial \mathbf{w}}\left(\sum_{i=1}^{n} \alpha_i y_i (\mathbf{w} \cdot \mathbf{x}_i)\right) - 0 + 0\] \[\nabla_\mathbf{w} \mathcal{L} = \mathbf{w} - \sum_{i=1}^{n} \alpha_i y_i \mathbf{x}_i\]

Setting this to zero gives us our first crucial relationship:

\[\mathbf{w} = \sum_{i=1}^{n} \alpha_i y_i \mathbf{x}_i \quad \text{--- (Equation 1)}\]

Step 2: Derivative with respect to b

Now, we take the derivative with respect to the bias $b$:

\[\frac{\partial \mathcal{L}}{\partial b} = 0 - 0 - \frac{\partial}{\partial b}\left(b \sum_{i=1}^{n} \alpha_i y_i\right) + 0\] \[\frac{\partial \mathcal{L}}{\partial b} = - \sum_{i=1}^{n} \alpha_i y_i\]

Setting this to zero gives us our second constraint:

\[\sum_{i=1}^{n} \alpha_i y_i = 0 \quad \text{--- (Equation 2)}\]

Step 3: Substitute Back to Find W(α)

Now, we substitute (Equation 1) and (Equation 2) back into the original Lagrangian to eliminate $\mathbf{w}$ and $b$.

\[\mathcal{L} = \frac{1}{2}\underbrace{(\mathbf{w} \cdot \mathbf{w})}_{\text{Sub Eq. 1}} - \sum_{i=1}^{n} \alpha_i y_i \mathbf{x}_i \cdot \underbrace{\mathbf{w}}_{\text{Sub Eq. 1}} - b \underbrace{\sum_{i=1}^{n} \alpha_i y_i}_{\text{Eq. 2 says this is 0}} + \sum_{i=1}^{n} \alpha_i\]

A. Expand the first term: \(\frac{1}{2}(\mathbf{w} \cdot \mathbf{w}) = \frac{1}{2} \left( \sum_{i=1}^n \alpha_i y_i \mathbf{x}_i \right) \cdot \left( \sum_{j=1}^n \alpha_j y_j \mathbf{x}_j \right) = \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j (\mathbf{x}_i \cdot \mathbf{x}_j)\)

B. Expand the second term: \(- \sum_{i=1}^n \alpha_i y_i \mathbf{x}_i \cdot \mathbf{w} = - \sum_{i=1}^n \alpha_i y_i \mathbf{x}_i \cdot \left( \sum_{j=1}^n \alpha_j y_j \mathbf{x}_j \right) = - \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j (\mathbf{x}_i \cdot \mathbf{x}_j)\)

C. Combine Everything: Notice that Term B is exactly $-2$ times Term A. \(\mathcal{L} = \left[ \frac{1}{2} \sum \sum \dots \right] - \left[ \sum \sum \dots \right] - 0 + \sum_{i=1}^n \alpha_i\)

Simplifying $(\frac{1}{2} X - X = -\frac{1}{2} X)$ gives the final Dual Objective Function:

\[W(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j (\mathbf{x}_i \cdot \mathbf{x}_j)\]

Bridge: The Two-Step Process to Find the Saddle Point

Think of solving the Lagrangian as a strategic game between two opposing players.

  • Player 1 (Primal Player): Controls w and b. Their goal is to minimize L.
  • Player 2 (Dual Player): Controls α. Their goal is to maximize L.

A stable solution, or equilibrium, is found at a saddle point: a point that is a minimum in the primal direction (w, b) and a maximum in the dual direction (α). We find this point in two steps:

Step 1: Minimize with Respect to w and b (This defines W(α))

First, we let the dual player fix a set of multipliers, α. Now, the primal player gets to make their move. They take this fixed α and find the values of w and b that make L as small as possible.

This step is exactly the one you mentioned: we minimize L(w, b, α) over w and b.

When we do this (by taking partial derivatives and setting them to zero), we get expressions for w and b that depend only on α. The result of this minimization is a new function, which we call the Lagrange dual function, W(α).

\[W(\alpha) = \inf_{\mathbf{w},b} \mathcal{L}(\mathbf{w}, b, \mathbf{\alpha})\]

The inf (infimum, or minimum) here explicitly shows that W(α) is defined as the result of the minimization of L with respect to the primal variables.

Step 2: Maximize the Result with Respect to α

Now it’s the dual player’s turn. They know that for any α they choose, the primal player will minimize L to get the value W(α). The dual player’s goal is to choose an α that makes this resulting value as large as possible.

Therefore, the second step is to maximize W(α).

Why maximize? Because the dual function W(α) provides a lower bound on the true optimal value of our original problem. To get the best, tightest possible lower bound, we must find the highest point of this floor, which means maximizing it.

An Analogy: The King in the Mountain Pass

Imagine a king who wants to find the highest possible campingspot, but he is constrained to only travel along the bottom of mountain valleys. The mountain range represents the function L.

  • The east-west direction represents the primal variables (w, b).
  • The north-south direction represents the dual variables (α).
  1. Minimizing L: For any fixed north-south path (α), the king walks east-west to find the absolute lowest point in that valley. He is minimizing his altitude for that path. This gives him one point, W(α).

  2. Maximizing W(α): The king repeats this for every possible north-south path. He now has a collection of the lowest points from all the valleys. To find the best possible camping spot under his constraint, he chooses the highest among all these low points. He is maximizing the function W(α).

The final spot is a saddle point: it’s the lowest point in the east-west direction, but the highest point when viewed along the north-south collection of valley floors.

The Guarantee: Strong Duality

For SVMs (and other convex optimization problems), a property called strong duality holds. This guarantees that the minimum of the primal problem is exactly equal to the maximum of the dual problem.

\[\min (\text{Primal Problem}) = \max (\text{Dual Problem})\] \[\min_{\mathbf{w},b} \left( \max_{\alpha \ge 0} \mathcal{L}(\mathbf{w}, b, \mathbf{\alpha}) \right) = \max_{\alpha \ge 0} \left( \min_{\mathbf{w},b} \mathcal{L}(\mathbf{w}, b, \mathbf{\alpha}) \right)\]

This is why this two-step process works and why solving the “max of the min” (the dual problem) gives us the answer to our original “min of the max” (the primal problem).

Part 2: The Logic Chain (Primal to Dual)

Here is how we logically move from the original minimization problem to the maximization of the dual we just derived.

1. The Original Primal Problem

Our goal is to find the minimum weights while satisfying constraints:

\[\min_{\mathbf{w}, b} \frac{1}{2}\|\mathbf{w}\|^2 \quad \text{subject to } y_i(\mathbf{w}\cdot\mathbf{x}_i+b) \ge 1\]

2. $\min(\max(\text{Lagrangian}))$

We convert the constrained problem into an unconstrained “Min-Max” problem using the Lagrangian.

\[P^* = \min_{\mathbf{w}, b} \left( \max_{\alpha \ge 0} \mathcal{L}(\mathbf{w}, b, \alpha) \right)\]
  • Logic: The inner maximization acts as a “validity check.”
    • If constraints are violated, $\max \mathcal{L} = \infty$.
    • If constraints are satisfied, $\max \mathcal{L} = \frac{1}{2}|\mathbf{w}|^2$ (since the penalty term becomes 0).
    • The outer minimization will naturally avoid the $\infty$ cases and choose the smallest valid weights. Thus, this is equivalent to Step 1.

3. Strong Duality: ($\min(\max(\text{Lagrangian})) \equiv \max(\min(\text{Lagrangian}))$)

This is the critical bridge. Because the SVM optimization problem is convex (a quadratic objective with linear constraints) and satisfies Slater’s condition, Strong Duality holds.

Strong Duality asserts that the order of Min and Max can be swapped without changing the result.

\[\underbrace{\min_{\mathbf{w}, b} \max_{\alpha \ge 0} \mathcal{L}(\mathbf{w}, b, \alpha)}_{\text{Primal Problem } (P^*)} = \underbrace{\max_{\alpha \ge 0} \min_{\mathbf{w}, b} \mathcal{L}(\mathbf{w}, b, \alpha)}_{\text{Dual Problem } (D^*)}\]
  • Without Strong Duality (Weak Duality), the Dual is only a lower bound ($D^* \le P^*$).
  • With Strong Duality, they are equal ($D^* = P^*$).

4. $\max(\min(\text{Lagrangian})) \Rightarrow \max(\text{Dual})$

Now we solve the right side of the equation from Step 3.

  • Inner Minimization: We perform $\min_{\mathbf{w}, b} \mathcal{L}$. This is exactly what we did in Part 1 of this answer (taking derivatives w.r.t w and b). The result of this calculation is the function $W(\alpha)$.
  • Outer Maximization: We are left with the final step:
\[\text{Maximize } W(\alpha) \quad \text{subject to } \alpha \ge 0 \text{ and } \sum \alpha_i y_i = 0\]

This is why we end up maximizing the dual formulation to solve the original minimization problem.


6. Help me with a comparative study between the primal and the dual formulation. Which one is easier to solve in general and the more preferred approach?

That’s an excellent question that gets to the practical reasons behind the mathematical formulation of Support Vector Machines.

It’s not that solving the primal problem is always inherently “harder” in a theoretical sense, but the dual problem is often computationally more efficient and flexible, especially in the scenarios where SVMs truly shine.

Here’s a breakdown of the key reasons why the dual problem is often preferred over the primal:

1. The “Tall vs. Wide Data” Dilemma (Features vs. Data Points)

Let’s say you have a dataset with:

  • n = number of data points (samples)
  • d = number of features (dimensions)

Primal Problem:

The primary variable you are solving for in the primal problem is the weight vector w, which has a size equal to the number of features, d.

  • When d is very large (e.g., thousands of features in text or image analysis), solving for a massive w vector becomes computationally expensive. The complexity can be roughly in the order of O(d³ + n*d²).

Dual Problem:

The primary variables you are solving for in the dual problem are the Lagrange multipliers α, one for each data point. So you are solving for n variables.

  • When n (the number of samples) is smaller than d (the number of features), solving for n alphas is much more efficient than solving for d weights. The complexity is roughly in the order of O(n³ + d*n²).

Conclusion: For “wide” datasets where you have more features than samples (d > n), which is common in fields like bioinformatics and text classification, the dual problem is computationally cheaper to solve.

2. The Power of the Kernel Trick

This is arguably the most important reason for using the dual formulation.

Primal Problem:

The primal problem is expressed in terms of the weight vector w. To classify a new point x, you calculate w ⋅ x + b. This formulation requires you to know the coordinates of your data in the feature space. If you want to use a non-linear model, you would have to first explicitly map your data into a higher-dimensional space using a transformation function φ(x) and then solve for w in that new, potentially infinite-dimensional space. This is often impractical or outright impossible.

Dual Problem:

The dual problem’s objective function and the final decision function depend on the data only through dot products of the feature vectors ($x_i \cdot x_j$).

Decision Function: $f(\mathbf{x}) = \text{sign} \left( \sum_{i=1}^{n} \alpha_i y_i (\mathbf{x}_i \cdot \mathbf{x}) + b \right)$

This is a crucial property. It means we can use the kernel trick:

  1. Choose a kernel function K($x_i, x_j$) that calculates the dot product in a higher-dimensional space, without ever having to transform the data. For example, the Radial Basis Function (RBF) kernel implicitly maps data to an infinite-dimensional space.
  2. Simply replace the dot product ($x_i \cdot x_j$) with the kernel function K($x_i, x_j$) in the dual formulation.

Conclusion: The dual problem provides an elegant and efficient way to train non-linear SVMs on complex datasets where a linear boundary would fail. The primal problem does not naturally accommodate the kernel trick.

3. Simpler Constraints

From a numerical optimization perspective, the nature of the constraints matters.

  • Primal Problem Constraints: You have n inequality constraints of the form $y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1$. These are linear inequalities, which can be complex to handle for some optimization algorithms.

  • Dual Problem Constraints: You have simpler constraints:

    • n “box” constraints: $0 \le \alpha_i \le C$ (in the soft-margin case).
    • One equality constraint: $\sum \alpha_i y_i = 0$.

For many quadratic programming (QP) solvers, this set of constraints is easier to work with. The structure of the dual problem led to the development of highly efficient algorithms like Sequential Minimal Optimization (SMO), which breaks the large QP problem into a series of the smallest possible QP problems that can be solved analytically.

Summary Table

AspectPrimal ProblemDual ProblemAdvantage
Variables to SolveWeight vector w (size d)Lagrange multipliers α (size n)Dual, if #features > #samples
Kernel TrickNot directly applicableNaturally incorporates kernels by replacing dot productsDual, enables non-linear models
Constraintsn linear inequality constraintsn box constraints + 1 equality constraintDual, structure is often easier for QP solvers
Ideal Use CaseLinear SVMs where d is much smaller than n.Non-linear SVMs, or when d > n. 

In modern machine learning libraries, if you are explicitly using a linear kernel with a large number of samples, the implementation might solve the primal problem (e.g., using methods like Stochastic Gradient Descent). However, for the powerful, non-linear capabilities that made SVMs famous, the dual problem is the key.


7. How to we determine which points will act as support vectors? Also, why only the support-vectors have non-zero $\alpha$?

The answer to both questions lies in a set of conditions from optimization theory called the Karush-Kuhn-Tucker (KKT) conditions. Let’s break them down.

1. How do we determine which points will act as support vectors?

You don’t determine this in advance. The optimization algorithm discovers which points are the support vectors as a result of finding the maximum margin hyperplane.

A point $\mathbf{x}_i$ is defined as a support vector if it lies exactly on one of the margin boundaries. In other words, its “functional margin” is exactly 1.

Mathematically, a point $\mathbf{x}_i$ is a support vector if it satisfies:

\[y_i(\mathbf{w} \cdot \mathbf{x}_i + b) = 1\]

Let’s look at the possibilities for any given data point once the optimal hyperplane ($\mathbf{w}, b$) has been found:

  • Case 1: The point is on the margin. $y_i(\mathbf{w} \cdot \mathbf{x}_i + b) = 1$. These are the support vectors. They are the most critical points because they are the closest to the decision boundary and define its position.

  • Case 2: The point is correctly classified and is “off the street” (beyond the margin). $y_i(\mathbf{w} \cdot \mathbf{x}_i + b) > 1$. These points are correctly classified with room to spare. They are not support vectors. If you were to remove one of these points, the optimal hyperplane would not change.

  • Case 3 (in soft-margin SVM): The point is a margin violator. $y_i(\mathbf{w} \cdot \mathbf{x}_i + b) < 1$. This includes points that are on the wrong side of the hyperplane or are inside the margin. These are also considered support vectors because they influence the final position of the hyperplane.

So, the identity of the support vectors is a direct outcome of the optimization process that finds the w and b that maximize the margin.

2. Why do only the support vectors have non-zero α?

This is the most elegant part of the SVM formulation, and it comes directly from one of the KKT conditions known as complementary slackness.

Recall our primal problem’s constraints: \(y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1 \quad \text{or, written differently,} \quad y_i(\mathbf{w} \cdot \mathbf{x}_i + b) - 1 \ge 0\)

When we created the Lagrangian dual, we introduced a multiplier $\alpha_i$ for each one of these constraints. The complementary slackness condition for SVM states that for the optimal solution, the following must be true for every single data point:

\[\alpha_i \left( y_i(\mathbf{w} \cdot \mathbf{x}_i + b) - 1 \right) = 0\]

This simple equation is a powerful “either-or” statement. For this product to be zero, at least one of its factors must be zero. Let’s analyze the two possibilities for any data point i:

Possibility A: The point is NOT a support vector.

  • If a point is not a support vector, it lies correctly classified and beyond the margin.
  • As we saw above, this means its functional margin is greater than 1: \(y_i(\mathbf{w} \cdot \mathbf{x}_i + b) > 1\)
  • Therefore, the term in the parenthesis of the KKT condition is not zero: \(\left( y_i(\mathbf{w} \cdot \mathbf{x}_i + b) - 1 \right) > 0\)
  • For the entire KKT equation to equal zero, the other factor must be zero.
  • Conclusion: For any non-support vector, its multiplier $\alpha_i$ must be 0.

Possibility B: The point IS a support vector.

  • If a point is a support vector, it lies exactly on the margin.
  • This means its functional margin is exactly 1: \(y_i(\mathbf{w} \cdot \mathbf{x}_i + b) = 1\)
  • Therefore, the term in the parenthesis of the KKT condition is zero: \(\left( y_i(\mathbf{w} \cdot \mathbf{x}_i + b) - 1 \right) = 0\)
  • Now, for the entire KKT equation to be zero, $\alpha_i$ can be anything (as long as it’s not negative). Since this point is essential for defining the hyperplane, the optimization process will assign it a non-zero value.
  • Conclusion: For any support vector, its multiplier $\alpha_i$ will be > 0.

Summary

Position of Point iValue of $(y_i(\mathbf{w} \cdot \mathbf{x}_i + b) - 1)$Value of $\alpha_i$Is it a Support Vector?
Beyond the margin$> 0$ (Strictly Positive)$= 0$No
On the margin$= 0$$> 0$Yes

This is why solving the dual problem is so powerful. By finding which $\alpha_i$ values are non-zero, we simultaneously identify the support vectors and find the weights needed to construct our final decision boundary, $\mathbf{w} = \sum \alpha_i y_i \mathbf{x}_i$. The entire model is built only from the points that truly matter: the support vectors.


The relationship between Hinge Loss and Support Vector Machines (SVMs) is fundamental; in fact, Hinge Loss is the specific loss function that the soft-margin SVM algorithm is designed to minimize.

To see this connection clearly, let’s look at the goals of each and how their mathematical formulations merge into one.

1. What is Hinge Loss?

Hinge Loss is a loss function used for “maximum margin” classification. It’s designed to penalize data points not just for being misclassified, but also for being too close to the decision boundary (i.e., inside the margin).

For a data point with a true label $y \in {-1, +1}$ and a raw model output (or score) $f(\mathbf{x})$, the Hinge Loss is defined as:

\[L(y, f(\mathbf{x})) = \max(0, 1 - y \cdot f(\mathbf{x}))\]

Let’s interpret this formula:

  • If $y \cdot f(\mathbf{x}) \ge 1$: This means the point is classified correctly and is on or outside the correct margin boundary. The term $1 - y \cdot f(\mathbf{x})$ is zero or negative, so the max function makes the loss 0. The model is not penalized.
  • If $y \cdot f(\mathbf{x}) < 1$: This means the point is either on the wrong side of the hyperplane or it is correctly classified but lies inside the margin. The term $1 - y \cdot f(\mathbf{x})$ is positive, and this value becomes the loss. The penalty increases linearly the further the point is from the correct margin.

In essence, Hinge Loss says: “I don’t care about points that are correctly classified with confidence. I only care about penalizing points that violate the margin.”

2. The Soft-Margin SVM Formulation

In the real world, data is rarely perfectly separable. The soft-margin SVM was introduced to handle this by allowing some points to be misclassified or to lie inside the margin. It does this by introducing a slack variable $\xi_i$ for each data point $\text{x}_i$.

The objective of the soft-margin SVM is to solve the following optimization problem:

Minimize: \(\frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \xi_i\)

Subject to the constraints:

  1. $y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1 - \xi_i$
  2. $\xi_i \ge 0$

Here:

  • Minimizing $\frac{1}{2}\lVert\mathbf{w}\rVert^2$ is equivalent to maximizing the margin.
  • $\xi_i$ is the slack variable that measures how much point $i$ violates the margin. If $\xi_i=0$, the point is correctly classified outside the margin. If $\xi_i > 0$, it’s a margin violator.
  • Minimizing $\sum \xi_i$ means minimizing the total amount of margin violation across all points.
  • C is a hyperparameter that controls the trade-off between maximizing the margin and minimizing the violations.

3. The Connection: Unveiling the Hinge Loss

Now, let’s connect the two concepts. Look at the first constraint of the SVM:

\[y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1 - \xi_i\]

We can rearrange this to express $\xi_i$:

\[\xi_i \ge 1 - y_i(\mathbf{w} \cdot \mathbf{x}_i + b)\]

We also know that $\xi_i$ must be greater than or equal to 0. Since the SVM’s objective function includes minimizing the sum of all $\xi_i$, the optimization will naturally push each $\xi_i$ to be the smallest possible value it can be while satisfying both conditions.

Therefore, for each data point, the optimal value of the slack variable $\xi_i$ will be:

\[\xi_i = \max(0, 1 - y_i(\mathbf{w} \cdot \mathbf{x}_i + b))\]

This is exactly the definition of the Hinge Loss where the model’s raw score is $f(\mathbf{x}_i) = \mathbf{w} \cdot \mathbf{x}_i + b$.

Re-framing the SVM Objective

Now that we see that the slack variable $\xi_i$ is just the Hinge Loss for point $i$, we can rewrite the SVM’s objective function by substituting it in:

Minimize: \(\underbrace{\frac{1}{2}\|\mathbf{w}\|^2}_{\text{Regularization Term}} + C \sum_{i=1}^{n} \underbrace{\max(0, 1 - y_i(\mathbf{w} \cdot \mathbf{x}_i + b))}_{\text{Hinge Loss Term}}\)

This new perspective makes the relationship crystal clear:

  • The SVM objective is a combination of two things:
    1. Hinge Loss: The sum of the Hinge Loss over all training points. This measures how much the model fits the training data (empirical risk).
    2. Regularization: The term $\frac{1}{2}\lVert\mathbf{w}\rVert^2$ (also called an L2-regularizer) penalizes model complexity and is responsible for maximizing the margin (structural risk).

In conclusion, Hinge Loss is not just related to SVM; it is the loss function that defines the classification error that the soft-margin SVM aims to minimize. This perspective shows that SVMs are part of a broader family of machine learning models that work by minimizing the sum of a loss function and a regularization term.


9. How does the Kernel trick fit in here?

The Kernel Trick is a powerful and elegant concept that gives SVMs much of their versatility. Let’s break it down, first as a general idea, and then apply it directly to SVM.

Part 1: The Kernel Trick as a Generic Concept

At its core, the Kernel Trick is a mathematical shortcut that allows us to work in a very high-dimensional feature space without ever having to actually compute the coordinates of our data in that space.

The Problem: Non-Linear Data

Imagine your data is not separable by a simple straight line. A classic example is data arranged in concentric circles, where one class is the inner circle (o) and the other is the outer ring (+).

Text-Based Diagram:

1
2
3
4
5
6
7
8
9
      + + + + + + +
    + +           + +
   +   o o o o o   +
  +   o         o   +
 +    o    o    o    +
  +   o         o   +
   +   o o o o o   +
    + +           + +
      + + + + + + +

Clearly, you cannot draw a single straight line (a linear separator) to separate the o’s from the +’s. Linear algorithms like the basic SVM would fail.

The “Hard Way” Solution: Explicit Feature Mapping

The standard way to solve this is to create new features from the existing ones, hoping that in this new, higher-dimensional feature space, the data becomes linearly separable. This is called feature mapping.

Let’s say a data point is defined by two features, $\mathbf{x} = [x_1, x_2]$. We could define a mapping function, $\phi$, that transforms our 2D data into 3D. A good one for this problem would be to add a third feature that represents the squared distance from the origin:

\[\phi(\mathbf{x}) = \phi([x_1, x_2]) = [x_1, x_2, x_1^2 + x_2^2]\]

If you were to apply this transformation, the concentric circles in 2D would be “lifted” into 3D space. The inner circle (o) points would have a low value for the new third feature, while the outer ring (+) points would have a high value. In this new 3D space, the data could be separated by a simple flat plane (a linear separator).

The Problems with the “Hard Way”:

  1. Computational Cost: Imagine your mapping function $\phi$ transforms your data into a million-dimensional space (or even an infinite-dimensional space!). Calculating the new coordinates for every single data point would be incredibly slow or literally impossible.
  2. Finding the Right Map: How do you even know which mapping function $\phi$ will work? Designing good features is a difficult task in itself.

The “Magic” of the Kernel Trick

Here’s the key insight: What if an algorithm doesn’t need the actual coordinates of the points in the high-dimensional space, but only the dot products between them?

Many machine learning algorithms (including SVM) have this property. Their mathematics only require the value of $\phi(\mathbf{x}_i) \cdot \phi(\mathbf{x}_j)$.

A Kernel is a function $K(\mathbf{x}_i, \mathbf{x}_j)$ that gives you the result of this high-dimensional dot product directly from the original low-dimensional vectors, completely bypassing the need to compute the transformation $\phi$.

\[K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i) \cdot \phi(\mathbf{x}_j)\]

The Kernel Trick is the act of replacing the expensive, two-step process:

  1. Compute $\mathbf{z}_i = \phi(\mathbf{x}_i)$ and $\mathbf{z}_j = \phi(\mathbf{x}_j)$ for all points.
  2. Compute the dot product $\mathbf{z}_i \cdot \mathbf{z}_j$.

with a single, cheap step:

  1. Compute $K(\mathbf{x}_i, \mathbf{x}_j)$.

This allows us to get all the benefits of working in a high-dimensional space (finding non-linear patterns) with the computational cost of working in the original low-dimensional space.

Part 2: Applying the Kernel Trick to SVM

Now, let’s see why SVM is the perfect candidate for this trick.

The Opportunity in the Dual Formulation

Recall the dual formulation for the hard-margin SVM. The goal is to find the alphas ($\alpha_i$) that maximize:

\[W(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j (\mathbf{x}_i \cdot \mathbf{x}_j)\]

Look closely at this equation. The data points $\mathbf{x}_i$ and $\mathbf{x}_j$ are only ever used inside a dot product. They never appear on their own. This is the crucial property that allows us to use the Kernel Trick.

Performing the Trick on SVM

To handle non-linear data, we follow these steps:

  1. Choose a Kernel: Instead of trying to find a complicated mapping function $\phi$, we simply choose a kernel function $K$. Two popular examples are:
    • Polynomial Kernel: $K(\mathbf{x}_i, \mathbf{x}_j) = (\gamma \mathbf{x}_i \cdot \mathbf{x}_j + r)^d$. This computes the dot product in a feature space of polynomial combinations of the original features.
    • Radial Basis Function (RBF) Kernel: $K(\mathbf{x}_i, \mathbf{x}_j) = \exp(-\gamma \lVert\mathbf{x}_i - \mathbf{x}_j\rVert^2)$. This is the most popular choice. It corresponds to a feature mapping into an infinite-dimensional space, which would be impossible to compute the “hard way.”
  2. Substitute the Dot Product: We replace every instance of $(\mathbf{x}_i \cdot \mathbf{x}_j)$ in the dual formulation with our chosen kernel function $K(\mathbf{x}_i, \mathbf{x}_j)$. The problem we solve becomes:

    \[W(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j)\]
  3. Kernelize the Decision Function: We also need to adapt how we classify a new, unseen point $\mathbf{z}$. The original decision function is $\text{sign}(\mathbf{w} \cdot \mathbf{z} + b)$. By substituting $\mathbf{w} = \sum \alpha_i y_i \mathbf{x}_i$, we get:

    \[\text{sign}\left(\left(\sum_{i=1}^{n} \alpha_i y_i \mathbf{x}_i\right) \cdot \mathbf{z} + b\right) = \text{sign}\left(\sum_{i=1}^{n} \alpha_i y_i (\mathbf{x}_i \cdot \mathbf{z}) + b\right)\]

    Again, we see a dot product. We replace it with the kernel function:

    \[\text{Decision Function} = \text{sign}\left(\sum_{i \in \text{SV}} \alpha_i y_i K(\mathbf{x}_i, \mathbf{z}) + b\right)\]

    (Note: The sum is only over the support vectors, since their alphas are the only ones that are non-zero).

In summary for SVM: The kernel trick allows us to solve for a non-linear decision boundary (like a curve or a circle) using the same dual optimization machinery as for a linear SVM, just by swapping the simple dot product with a more powerful kernel function. It gives us the ability to find complex patterns without ever explicitly defining the complex feature space. This applies equally to both hard-margin and soft-margin SVMs, as both of their dual formulations rely on the dot product.


10. How do Polynomial and RBF kernel help us calculate dot product in higher dimensions?

Let’s break down exactly how the two most popular kernels, Polynomial and RBF, act as a shortcut to calculate dot products in higher dimensions.

We’ll use a side-by-side comparison for each:

  • The Kernel Way (The Shortcut): A simple, direct calculation using the kernel function.
  • The Hard Way (The Long Calculation): Explicitly defining the high-dimensional space and computing the dot product there.

The core idea to remember is that the Kernel Trick works because $K(\mathbf{x}, \mathbf{z}) = \phi(\mathbf{x}) \cdot \phi(\mathbf{z})$, where $\phi(\mathbf{x})$ is the feature map to the higher dimension.

1. The Polynomial Kernel

The Polynomial kernel creates new features that are combinations and powers of the original features. Its formula is:

\[K(\mathbf{x}, \mathbf{z}) = (\mathbf{x} \cdot \mathbf{z} + r)^d\]

Here, d is the degree of the polynomial and r is an offset constant.

Example: A Simple 2D Case

Let’s use a very simple Polynomial kernel with degree $d=2$ and offset $r=0$. The kernel is: $K(\mathbf{x}, \mathbf{z}) = (\mathbf{x} \cdot \mathbf{z})^2$.

Let’s take two simple 2D vectors:

  • $\mathbf{x} = [x_1, x_2]$
  • $\mathbf{z} = [z_1, z_2]$

The Kernel Way (The Shortcut):

We just plug our vectors into the kernel formula. This calculation happens in 2D.

\(K(\mathbf{x}, \mathbf{z}) = ([x_1, x_2] \cdot [z_1, z_2])^2\)\(= (x_1z_1 + x_2z_2)^2\)\(= (x_1z_1)^2 + 2(x_1z_1)(x_2z_2) + (x_2z_2)^2\) \(= \mathbf{x_1^2z_1^2 + 2x_1z_1x_2z_2 + x_2^2z_2^2}\)

This was very fast and easy.

The Hard Way (The Long Calculation):

To do this the long way, we first need to figure out which higher-dimensional space our kernel corresponds to. Our kernel with $d=2$ corresponds to a feature mapping $\phi$ that creates all second-degree polynomial features. For 2D data, this mapping is to 3D:

\(\phi(\mathbf{x}) = [x_1^2, x_2^2, \sqrt{2}x_1x_2]\) (The $\sqrt{2}$ is a mathematical detail needed to make the dot product match perfectly).

Now, let’s perform the two expensive steps:

  1. Map the vectors to the 3D space:
    • $\phi(\mathbf{x}) = [x_1^2, x_2^2, \sqrt{2}x_1x_2]$
    • $\phi(\mathbf{z}) = [z_1^2, z_2^2, \sqrt{2}z_1z_2]$
  2. Compute the dot product in the 3D space: \(\phi(\mathbf{x}) \cdot \phi(\mathbf{z}) = (x_1^2)(z_1^2) + (x_2^2)(z_2^2) + (\sqrt{2}x_1x_2)(\sqrt{2}z_1z_2)\) \(= x_1^2z_1^2 + x_2^2z_2^2 + 2x_1x_2z_1z_2\)

The Result:

The results are identical! The Polynomial Kernel gave us the answer to a 3D dot product using only a simple 2D calculation. Now imagine using a degree of $d=3$ or $d=4$. The dimensionality of the feature space $\phi(\mathbf{x})$ would explode, making the “Hard Way” incredibly complex, while the “Kernel Way” remains a simple calculation.

2. The RBF (Radial Basis Function) Kernel

The RBF kernel is the most popular choice for SVMs. It is a general-purpose kernel that works well when you have no prior knowledge about the data. Its formula is:

\[K(\mathbf{x}, \mathbf{z}) = \exp(-\gamma \|\mathbf{x} - \mathbf{z}\|^2)\]

Here, $\gamma$ is a hyperparameter that controls the “width” of the kernel.

Example: A Simple 2D Case

Let’s take the same two 2D vectors:

  • $\mathbf{x} = [x_1, x_2]$
  • $\mathbf{z} = [z_1, z_2]$

The Kernel Way (The Shortcut):

Again, we just plug the vectors into the formula. It’s a simple calculation based on the squared Euclidean distance between the two points.

\[K(\mathbf{x}, \mathbf{z}) = \exp(-\gamma ( (x_1-z_1)^2 + (x_2-z_2)^2 ))\]

This is a single, straightforward calculation.

The Hard Way (The Long Calculation):

This is where the RBF kernel reveals its true power. The feature space that the RBF kernel corresponds to is infinite-dimensional.

This means we cannot perform the “Hard Way” calculation. It is mathematically impossible to write down the feature map $\phi(\mathbf{x})$ as a finite vector.

So how do we know it’s a dot product in an infinite-dimensional space?

We can get an intuition by looking at the Taylor series expansion of the exponential function, $e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \dots$.

Let’s expand the RBF kernel formula:

\[K(\mathbf{x}, \mathbf{z}) = \exp(-\gamma (\lVert\mathbf{x} - \mathbf{z}\rVert^2))\] \[K(\mathbf{x}, \mathbf{z}) = \exp(-\gamma ((\mathbf{x} - \mathbf{z}) \cdot (\mathbf{x} - \mathbf{z})))\] \[K(\mathbf{x}, \mathbf{z}) = \exp(-\gamma (\|\mathbf{x}\|^2 - 2\mathbf{x}\cdot\mathbf{z} + \|\mathbf{z}\|^2))\] \[K(\mathbf{x}, \mathbf{z}) = \underbrace{\exp(-\gamma \|\mathbf{x}\|^2)\exp(-\gamma \|\mathbf{z}\|^2)}_{\text{Scaling factors}} \cdot \underbrace{\exp(2\gamma \mathbf{x}\cdot\mathbf{z})}_{\text{The interesting part}}\]

Now, let’s look at the Taylor expansion of that interesting part: \(\exp(2\gamma \mathbf{x}\cdot\mathbf{z}) = 1 + (2\gamma \mathbf{x}\cdot\mathbf{z}) + \frac{(2\gamma \mathbf{x}\cdot\mathbf{z})^2}{2!} + \frac{(2\gamma \mathbf{x}\cdot\mathbf{z})^3}{3!} + \dots\)

Notice that this is an infinite sum of polynomial dot products of every possible degree! ($(\mathbf{x}\cdot\mathbf{z})$, $(\mathbf{x}\cdot\mathbf{z})^2$, $(\mathbf{x}\cdot\mathbf{z})^3$, etc.).

This shows that the RBF kernel is implicitly doing the work of a combination of an infinite number of polynomial kernels. It is calculating a dot product in a feature space that contains features of all possible polynomial degrees.

The Result:

The RBF kernel allows us to leverage the immense power of an infinite-dimensional feature space to find extremely complex non-linear boundaries. The “Hard Way” isn’t just hard, it’s impossible. The “Kernel Way” makes it a simple, finite calculation. This is the ultimate expression of the power of the Kernel Trick.


11. Can SVM be used for outlier detection?

Of course. Outlier detection can be performed using One-Class SVM. It’s a powerful algorithm used for unsupervised outlier and novelty detection.

The core idea is simple and elegant: Instead of finding a hyperplane to separate two classes of data, a One-Class SVM learns a boundary that encloses a single class of data. Think of it as building a tight “fence” or “bubble” around the normal data points. Any point that falls outside this fence is then considered an outlier.

The Core Concept: From Separating Classes to Enclosing Data

Let’s quickly contrast it with a regular two-class SVM to understand the shift in thinking:

  • Regular SVM (Two Classes): Finds the widest possible street (margin) that separates Class A from Class B. The hyperplane is in the middle of the two classes.

  • One-Class SVM: Assumes all the training data belongs to one “normal” class. It then tries to find a hyperplane that separates these data points from the origin in a high-dimensional space, effectively creating a boundary around the data.

The goal is no longer to push two classes apart, but to define the territory of a single, “normal” class.

How It Works: The Mechanism

The algorithm is clever and relies heavily on the kernel trick.

Step 1: Map to a High-Dimensional Space (Kernel Trick)

First, the data is mapped from its original space into a very high-dimensional feature space using a kernel function. The RBF (Radial Basis Function) kernel is almost always used for this task. This mapping is crucial because it “unfolds” the data, making it possible to enclose even complex, non-spherical distributions with a simple hyperplane in the new space.

Step 2: Separate the Data from the Origin

In this new high-dimensional space, the algorithm’s objective is to find a hyperplane that separates the mapped data points from the origin (0, 0, ...) with the maximum possible margin.

This might seem strange, but by separating the data from the origin, the hyperplane effectively creates a boundary (like a sphere or a more complex shape) around the dense region of data points.

Step 3: Control the Boundary with the ν (Nu) Hyperparameter

You don’t need to define the margin width manually. Instead, you control the behavior of the algorithm with a single, powerful hyperparameter called ν (nu).

The ν parameter has a very intuitive meaning:

  • It’s an upper bound on the fraction of training points that can be considered outliers (i.e., allowed to be outside the fence).
  • It’s also a lower bound on the fraction of training points that will become support vectors (the points that actually define the fence).

For example, if you set ν = 0.05, you are telling the algorithm: “I assume that no more than 5% of my training data are outliers. Please build a boundary that encloses at least 95% of the points.” This gives you direct control over how tightly the model fits your data.

Using One-Class SVM for Outlier Detection

Here is the practical workflow for finding anomalies in new data.

Phase 1: Training

You train the One-Class SVM on a dataset that contains only (or mostly) normal, inlier data. For example:

  • A dataset of thousands of successful, non-fraudulent credit card transactions.
  • Sensor readings from a machine while it’s operating correctly.

During this phase, the SVM learns the shape of “normalcy” by building a boundary around these points.

Phase 2: Prediction

Now, you take a new, unseen data point and ask the model to classify it. The model calculates the point’s position relative to the learned boundary.

  • If the model returns +1, it means the point falls inside the learned boundary. It is classified as an inlier (normal).
  • If the model returns -1, it means the point falls outside the boundary, on the same side as the origin. It is flagged as an outlier or anomaly.

Example Application: You train a One-Class SVM on a month’s worth of normal factory machine vibrations. The model learns the boundary of these normal vibrations. The next day, a new vibration reading comes in.

  • You feed this new reading to the trained model.
  • The model returns -1.
  • This signals an anomaly, alerting an operator that the machine’s behavior has deviated from the norm and may require maintenance.

12. In one class SVM (with kernel trick), what if the in-liers are around the origin? How does the model separate the origin from the boundaries of the inliers?

In the original 2D space, if the inliers surround the origin, you cannot separate them from an origin they already contain. The solution is that the separation does not happen in the original space; it happens in the high-dimensional feature space created by the kernel.

The Paradox in the Original Space

Let’s imagine your normal data (+) forms a circular cloud centered at the origin (0,0). An outlier (o) could be further out.

1
2
3
4
5
6
7
8
9
10
             ^
             |
             |
             |
             |
             + o
           + + + 
         + + +-+-+-------->
           + + + 
             +

In this 2D view, the One-Class SVM’s goal of “separating the data from the origin” seems impossible. The origin is in the middle of the data.

The Solution: The Magic of the Kernel Mapping

The kernel trick transforms the data so that this paradox is resolved. Let’s see how the commonly used RBF kernel $K(\mathbf{x}, \mathbf{z}) = \exp(-\gamma \lVert\mathbf{x} - \mathbf{z}\rVert^2)$ achieves this.

The kernel mapping, $\phi(\mathbf{x})$, takes your 2D points and projects them into an infinite-dimensional feature space. The crucial insight comes from looking at the distance of these new points from the origin in the new feature space.

The squared distance of a mapped point $\phi(\mathbf{x})$ from the feature space’s origin is $\lVert\phi(\mathbf{x})\rVert^2$. A key property of kernels is that this squared distance is equal to the kernel of the point with itself, $K(\mathbf{x}, \mathbf{x})$.

Let’s calculate this for the RBF kernel:

\[\|\phi(\mathbf{x})\|^2 = K(\mathbf{x}, \mathbf{x}) = \exp(-\gamma \|\mathbf{x} - \mathbf{x}\|^2)\] \[\|\phi(\mathbf{x})\|^2 = \exp(-\gamma \cdot 0) = \exp(0) = 1\]

The “Aha!” Moment

This is a stunning result. It means that every single data point, regardless of its position in the original 2D space, is mapped to a new point that is exactly 1 unit away from the origin in the infinite-dimensional feature space.

The data that was a solid ball in 2D, including the origin, becomes the surface of a unit hypersphere in the feature space. The dense cluster of inliers is now a dense patch on the surface of this sphere.

Separating from the Origin in the Feature Space

Now, the problem is easy again!

  1. Data is Relocated: All your data points now live on a hollow shell (the hypersphere’s surface).
  2. Origin is Isolated: The origin of the feature space is at the center of this shell, completely empty and clearly separated from all the data points on the surface.
  3. SVM Finds a Boundary: The One-Class SVM can now do its job perfectly. It finds a hyperplane in the feature space that creates a small “bubble” around the origin, separating it from the data points on the hypersphere.

When this hyperplane boundary is mapped back down to your original 2D space, it forms the tight, closed boundary (like a circle) around your inliers that you expect.

An Analogy: The Inflating Beach Ball

Imagine your 2D data is drawn on a completely deflated beach ball, with the inliers clustered around the center point (the future origin).

  1. The Kernel Trick is the process of inflating the ball.
  2. As you inflate it, your 2D drawing stretches and becomes the 3D surface of the ball.
  3. The Result: The center of the inflated ball (the 3D origin) is now empty. All of your data points are now on the surface, a set distance away from this new origin. The paradox is solved. The One-Class SVM can now easily find a boundary that separates the empty center from the data on the surface.

In summary, the kernel trick fundamentally rearranges the data’s geometry, ensuring that even if your inliers contain the origin in the original space, they are cleanly separated from the origin in the feature space, allowing the algorithm to work as intended.


13. In the above example of inliers surrounding the origin, we never assume anything about $\text{x}$ while performing the kernel trick. Does that mean, irrespective of the positions of these points, they will always form a hollow hypersphere of radius 1?

Yes, that is exactly right. This observation is at the core of understanding the RBF kernel’s power.

Irrespective of the positions of the data points in the original space, whether they are centered at the origin, far from it, or scattered everywhere, the RBF kernel maps every single point to the surface of a unit hypersphere in the high-dimensional feature space.

The Property Belongs to the Kernel, Not the Data

This seemingly magical property is a direct mathematical consequence of the RBF kernel’s definition, not the data’s layout. Let’s revisit the short derivation:

  1. The squared distance of a mapped point $\phi(\mathbf{x})$ from the origin in the feature space is $|\phi(\mathbf{x})|^2$.
  2. This is equal to the kernel of the point with itself: $K(\mathbf{x}, \mathbf{x})$.
  3. For the RBF kernel, $K(\mathbf{x}, \mathbf{x}) = \exp(-\gamma \lVert\mathbf{x} - \mathbf{x}\rVert^2)$.
  4. Since $\lVert\mathbf{x} - \mathbf{x}\rVert^2$ is always zero for any vector $\mathbf{x}$, the expression simplifies to $\exp(0) = 1$.

This proves that $\lVert\phi(\mathbf{x})\rVert^2 = 1$ for every point. The specific coordinates of $\mathbf{x}$ cancel themselves out, making the result universal for all points.

So, How Does It Differentiate Between Points?

This leads to the crucial follow-up question: If all points are just dumped onto the same sphere, how does the algorithm preserve the original data’s structure? How can it tell an inlier from an outlier?

The key is that while all points have the same distance from the new origin, their positions relative to each other on the hypersphere’s surface are meaningfully preserved.

The structure isn’t lost; it’s just “unfolded” onto this curved surface. The way the algorithm measures the new distance between points is by using the kernel function itself:

  • Points close in original space: If $\mathbf{x}_i$ and $\mathbf{x}_j$ are close, the term $\lVert\mathbf{x}_i - \mathbf{x}_j\rVert^2$ is small, making $K(\mathbf{x}_i, \mathbf{x}_j) = \exp(-\text{small value})$ close to 1. This means they are mapped to points that are very close to each other on the hypersphere.
  • Points far apart in original space: If $\mathbf{x}_i$ and $\mathbf{x}_j$ are far apart, the term $\lVert\mathbf{x}_i - \mathbf{x}_j\rVert^2$ is large, making $K(\mathbf{x}_i, \mathbf{x}_j) = \exp(-\text{large value})$ close to 0. This means they are mapped to points that are far from each other (nearly orthogonal) on the hypersphere.

The One-Class SVM then works on this hypersphere, finding the densest cluster of points and drawing a boundary (a hyperplane that intersects the sphere) around them. Outliers are simply points on the sphere that are far away from this dense cluster.

An Analogy: The World Map

Think of it like this:

  • Original Space: Your 2D data is like a flat map of the world. Cities are at different (x, y) coordinates.
  • Kernel Mapping: The RBF kernel acts like a projection that wraps this flat map onto a 3D globe.
  • The Unit Hypersphere Property: Every city, whether it’s Bengaluru or New York, now lies on the surface of the globe. All cities are now approximately the same distance from the center of the Earth.
  • Preserving Relative Positions: However, the structure is perfectly preserved. Bengaluru and Chennai are still close to each other on the globe’s surface. Bengaluru and New York are still far apart.
  • Finding the Boundary: The One-Class SVM’s job is to find a boundary (like a circle drawn on the globe’s surface) that encloses the dense cluster of “normal” cities, correctly identifying a remote, isolated city as an outlier.


Resources

Enjoyed this article? Never miss out on future posts - follow me.
This post is licensed under CC BY 4.0 by the author.