Understanding PCA once and for all

This is a detailed, math-heavy outline for explaining PCA (Principal Component Analysis). It ensures all the statistical and geometric principles are covered, leading directly to the eigenvalue decomposition.

Principal Component Analysis (PCA): Unlocking the Geometric Core

PCA is fundamentally a geometric optimization problem: finding the directions (vectors) that capture the maximum amount of variance in a dataset. This article derives the core equations that explain why we rely on eigenvectors and eigenvalues to solve dimensionality reduction.

1. The Geometric Foundation: Vector Projection

The first step in PCA is projecting data onto a new axis. We start by deriving the equation for projecting an arbitrary vector $\mathbf{x}$ onto an arbitrary vector $\mathbf{v}$.

The projection, $\mathbf{p}$, of vector $\mathbf{x}$ onto $\mathbf{v}$ is defined as the component of $\mathbf{x}$ that lies along the direction of $\mathbf{v}$.

Derivation of the Projection Vector $\mathbf{p}$

  1. Find the Scalar Projection: The length of the projection is found using the dot product and the length of the vector $\mathbf{v}$. The dot product $\mathbf{x} \cdot \mathbf{v}$ gives us the component of $\mathbf{x}$ that is parallel to $\mathbf{v}$, scaled by the length of $\mathbf{v}$.
    • The scalar projection (length) of $\mathbf{x}$ onto $\mathbf{v}$ is: \(\text{length}(\mathbf{p}) = \frac{\mathbf{x} \cdot \mathbf{v}}{\|\mathbf{v}\|}\)
  2. Find the Direction: The direction of $\mathbf{p}$ is the unit vector along $\mathbf{v}$, denoted $\mathbf{u}$. \(\mathbf{u} = \frac{\mathbf{v}}{\|\mathbf{v}\|}\)

  3. Form the Projection Vector: The projection vector $\mathbf{p}$ is simply the scalar length multiplied by the unit direction vector $\mathbf{u}$.
\[\mathbf{p} = \left(\frac{\mathbf{x} \cdot \mathbf{v}}{\|\mathbf{v}\|}\right) \mathbf{u}\]

Substituting $\mathbf{u} = \frac{\mathbf{v}}{\|\mathbf{v}\|}$, the projection vector $\mathbf{p}$ can be written as:

\[\mathbf{p} = \frac{\mathbf{x} \cdot \mathbf{v}}{\|\mathbf{v}\|^2} \mathbf{v}\]
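The projection formula can be verified numerically. A minimal NumPy sketch (the vectors `x` and `v` below are arbitrary illustrative examples):

```python
import numpy as np

# Project x onto v using p = (x . v / ||v||^2) * v
x = np.array([3.0, 4.0])
v = np.array([2.0, 0.0])

p = (x @ v) / (v @ v) * v   # v @ v equals ||v||^2

# By construction, the residual x - p is orthogonal to v
residual = x - p
```

A quick sanity check: with `v` along the first axis, the projection keeps only the first coordinate of `x`, and the residual is perpendicular to `v`.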

2. The Statistical Foundation: Data Centering and Covariance

PCA is a procedure that maximizes variance, which is intimately related to the covariance matrix.

Defining Covariance and Centering

The (sample) covariance between two random variables, $X$ and $Y$, is defined as the expected value of the product of their deviations from their respective means; with $m$ samples, it is estimated as:

\[\text{Cov}(X, Y) = E[(X - E(X))(Y - E(Y))] = \frac{\sum_{i=1}^{m} (X_i - \bar{X})(Y_i - \bar{Y})}{m - 1}\]

A natural extension of covariance to an $n$-dimensional dataset of size $m$ is a covariance matrix of dimension $n \times n$:

\[\operatorname{Cov}(\mathbf{X}) = \begin{bmatrix} \operatorname{Cov}(X_1, X_1) & \operatorname{Cov}(X_1, X_2) & \cdots & \operatorname{Cov}(X_1, X_n) \\ \operatorname{Cov}(X_2, X_1) & \operatorname{Cov}(X_2, X_2) & \cdots & \operatorname{Cov}(X_2, X_n) \\ \vdots & \vdots & \ddots & \vdots \\ \operatorname{Cov}(X_n, X_1) & \operatorname{Cov}(X_n, X_2) & \cdots & \operatorname{Cov}(X_n, X_n) \end{bmatrix}\]
  • Diagonal entries: variances
  • Off-diagonal entries: covariances between each pair of dimensions

Before calculating the covariance matrix, data is typically centered. Centering the data $\mathbf{X}$ involves subtracting the mean from every data point: $\mathbf{X}_{\text{centered}} = \mathbf{X} - E(\mathbf{X})$. In practice, this means subtracting each column's (feature's) mean from that column.

If the data is perfectly centered, the mean of the new dataset is zero: $E(\mathbf{X}_{\text{centered}}) = \mathbf{0}$.

Effect of Centering on the Covariance Matrix

Let $\mathbf{X}$ be the original $m \times n$ feature matrix (where $m$ is the number of samples and $n$ is the number of features). Let $\mathbf{X}_c$ be the centered matrix.

The Covariance Matrix of the dataset $\mathbf{X}$ is a square $n \times n$ matrix where the element at $(i, j)$ is the covariance between feature $i$ and feature $j$.

When the data is centered ($\mathbf{X}_c$), the covariance matrix $C$ is mathematically related to the matrix product $\mathbf{X}_c^T \mathbf{X}_c$:

\[\text{Cov}(\mathbf{X}) = \frac{1}{m-1} \mathbf{X}_c^T \mathbf{X}_c\]

This means that for the centered data, the covariance matrix $\text{Cov}(\mathbf{X})$ is simply the scaled version of the product $\mathbf{X}_c^T \mathbf{X}_c$.

If we are only concerned with finding the directions that maximize the variance (as in PCA), the scaling factor $\frac{1}{m-1}$ does not affect the optimal direction. Therefore, for the purpose of optimization, maximizing variance is equivalent to maximizing the quadratic form defined by the centered data’s covariance matrix.
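The relationship between the centered data matrix and the covariance matrix is easy to confirm with NumPy. A small sketch on synthetic data (the random matrix and its shape are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))        # m = 100 samples, n = 3 features

Xc = X - X.mean(axis=0)              # center: subtract each column's mean
C = Xc.T @ Xc / (X.shape[0] - 1)     # Cov(X) = X_c^T X_c / (m - 1)
```

The result matches NumPy's built-in `np.cov` (with `rowvar=False` so that columns are treated as variables), and the centered matrix has column means of zero.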

3. The Optimization Problem: Maximizing Projected Variance

Let $\mathbf{X}$ be the centered training data matrix ($m$ rows/samples, $n$ columns/features). Let $\mathbf{w}$ be the principal component vector (our projection axis), which is an $n \times 1$ vector.

Data Projection

The projection of all data points onto the vector $\mathbf{w}$ can be efficiently calculated via matrix multiplication. Assuming $\mathbf{w}$ is a unit vector (a constraint we impose below), the scalar projection of the $i$-th row $\mathbf{x}_i$ onto $\mathbf{w}$ reduces to the dot product $\mathbf{x}_i \cdot \mathbf{w}$, since the $\|\mathbf{v}\|$ terms in the projection formula equal 1.

When applied to the entire dataset $\mathbf{X}$, the product $\mathbf{z} = \mathbf{X}\mathbf{w}$ results in a column vector $\mathbf{z}$ where each element $z_i$ is the projection of the data point $\mathbf{x}_i$ onto the axis defined by $\mathbf{w}$.

\[\mathbf{z} = \mathbf{X}\mathbf{w} \quad (\text{an } m \times 1 \text{ vector of projections})\]

Maximizing Variance

PCA seeks to find the vector $\mathbf{w}$ that maximizes the variance of this projected data, $\text{Var}(\mathbf{z})$. Since $\mathbf{X}$ is centered, $E(\mathbf{z}) = \mathbf{0}$. The variance of $\mathbf{z}$ is proportional to its squared magnitude:

\[\text{Var}(\mathbf{z}) \propto \|\mathbf{z}\|^2\]

Since the vector $\mathbf{w}$ defines a direction and not a magnitude, we constrain $\mathbf{w}$ to be a unit vector.

The PCA objective is thus defined as:

\[\text{Maximize } \|\mathbf{X}\mathbf{w}\|^2 \quad \text{subject to } \|\mathbf{w}\|^2 = 1\]
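A quick numerical illustration of the objective, on synthetic centered data. The direction `w` here is an arbitrary unit vector, not yet the optimum; the point is only that $\|\mathbf{z}\|^2$ and the sample variance of $\mathbf{z}$ agree up to the $\frac{1}{m-1}$ factor:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
X = X - X.mean(axis=0)               # centered data

w = np.array([1.0, 1.0])
w = w / np.linalg.norm(w)            # constrain w to be a unit vector

z = X @ w                            # m x 1 vector of projections
# For centered data, Var(z) = ||z||^2 / (m - 1)
```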

4. Re-Formulating the Objective and Connecting to Eigenvectors

The expression to maximize, $\lVert\mathbf{X}\mathbf{w}\rVert^2$, can be reformulated using properties of vectors and matrices.

  1. Reformulating the Norm: The squared Euclidean norm of a vector $\mathbf{v}$ is the dot product of the vector with itself: $\|\mathbf{v}\|^2 = \mathbf{v}^T \mathbf{v}$.

    \[\|\mathbf{X}\mathbf{w}\|^2 = (\mathbf{X}\mathbf{w})^T (\mathbf{X}\mathbf{w})\]
  2. Using Transpose Properties: $(AB)^T = B^T A^T$.

    \[(\mathbf{X}\mathbf{w})^T (\mathbf{X}\mathbf{w}) = \mathbf{w}^T \mathbf{X}^T \mathbf{X}\mathbf{w}\]
  3. The Final Quadratic Form: Defining $\mathbf{C} = \mathbf{X}^T \mathbf{X}$ (which is proportional to the covariance matrix $\text{Cov}(\mathbf{X})$), the optimization problem becomes a quadratic form:

\[\text{Maximize } \mathbf{w}^T (\mathbf{X}^T \mathbf{X}) \mathbf{w} \quad \text{subject to } \mathbf{w}^T \mathbf{w} = 1\]

The Connection to Eigenvectors

The problem of maximizing the quadratic form $\mathbf{w}^T \mathbf{C} \mathbf{w}$ subject to $\lVert\mathbf{w}\rVert^2 = 1$ is a standard problem solved by Eigen Decomposition.

The maximum value of the quadratic form occurs when $\mathbf{w}$ is the eigenvector corresponding to the largest eigenvalue ($\lambda_{\max}$) of the matrix $\mathbf{C}$ (where $\mathbf{C} = \mathbf{X}^T \mathbf{X}$ or the covariance matrix). The maximum variance captured is equal to $\lambda_{\max}$.

5. Solving via Lagrangian Multipliers

We can formally solve the maximization problem using the Lagrangian Multiplier method, which explicitly demonstrates the link between the optimal direction $\mathbf{w}$ and the eigenvectors.

Formulating the Lagrangian

We combine the objective function and the constraint into the Lagrangian function $L$:

\[L(\mathbf{w}, \lambda) = \mathbf{w}^T \mathbf{C} \mathbf{w} - \lambda (\mathbf{w}^T \mathbf{w} - 1)\]

Where $\mathbf{C} = \mathbf{X}^T \mathbf{X}$ and $\lambda$ is the Lagrange multiplier.

Calculating the Gradient

To find the maximum, we take the gradient of $L$ with respect to $\mathbf{w}$ and set it to the zero vector ($\mathbf{0}$).

  1. Gradient of the First Term ($\mathbf{w}^T \mathbf{C} \mathbf{w}$): The gradient of a quadratic form is $2\mathbf{C}\mathbf{w}$. (Since $\mathbf{C}$ is symmetric, $\mathbf{C} = \mathbf{C}^T$).

  2. Gradient of the Second Term ($\lambda (\mathbf{w}^T \mathbf{w} - 1)$): The gradient is $2\lambda \mathbf{w}$.

Setting the gradient to zero:

\[\nabla_{\mathbf{w}} L(\mathbf{w}, \lambda) = 2\mathbf{C}\mathbf{w} - 2\lambda\mathbf{w} = \mathbf{0}\]

Dividing by 2 and factoring out $\mathbf{w}$ yields:

\[\mathbf{C}\mathbf{w} = \lambda\mathbf{w}\]

This final expression is the fundamental definition of the Eigenvalue Problem.
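This conclusion can be checked numerically with `np.linalg.eigh` (NumPy's eigensolver for symmetric matrices, which returns eigenvalues in ascending order). The synthetic data below is an illustrative assumption, stretched along one axis so a clear dominant direction exists:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 3)) @ np.diag([3.0, 1.0, 0.3])
X = X - X.mean(axis=0)

C = X.T @ X                          # proportional to the covariance matrix
eigvals, eigvecs = np.linalg.eigh(C) # ascending eigenvalues, symmetric C
w = eigvecs[:, -1]                   # eigenvector of the largest eigenvalue
lam = eigvals[-1]

# w satisfies C w = lambda w, and w^T C w attains the maximum value lambda_max
```

The assertions confirm both the eigenvalue equation and that the quadratic form evaluated at `w` equals $\lambda_{\max}$.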

6. The Practical Solution: Connecting to SVD

In practical machine learning libraries, PCA is almost always solved using the Singular Value Decomposition (SVD) rather than explicit Eigen Decomposition. SVD is numerically more stable and more efficient for large, sparse matrices.

Derivation via SVD

  1. Start with SVD: The SVD of the centered data matrix $\mathbf{X}$ is: $\mathbf{U} \mathbf{\Sigma} \mathbf{V}^T$
    • Where:
      • $\mathbf{U}$ contains the left singular vectors.
      • $\mathbf{\Sigma}$ is a rectangular diagonal matrix containing the singular values ($\sigma_i$).
      • $\mathbf{V}$ contains the right singular vectors.
  2. Calculate $\mathbf{X}^T\mathbf{X}$:
\[\mathbf{X}^T\mathbf{X} = (\mathbf{U} \mathbf{\Sigma} \mathbf{V}^T)^T (\mathbf{U} \mathbf{\Sigma} \mathbf{V}^T)\]
  3. Apply the Transpose Rule: $(\mathbf{A} \mathbf{B} \mathbf{C})^T = \mathbf{C}^T \mathbf{B}^T \mathbf{A}^T$.
\[\mathbf{X}^T\mathbf{X} = (\mathbf{V} \mathbf{\Sigma}^T \mathbf{U}^T) (\mathbf{U} \mathbf{\Sigma} \mathbf{V}^T)\]
  4. Simplify using Orthogonality: $\mathbf{U}^T \mathbf{U} = \mathbf{I}$ (the identity matrix).
\[\mathbf{X}^T\mathbf{X} = \mathbf{V} \mathbf{\Sigma}^T \mathbf{\Sigma} \mathbf{V}^T\]
  5. Final Result: Since $\mathbf{\Sigma}$ is diagonal, $\mathbf{\Sigma}^T \mathbf{\Sigma}$ is a diagonal matrix containing the squared singular values: $\mathbf{\Sigma}^T \mathbf{\Sigma} = \mathbf{\Sigma}^2$.
\[\mathbf{X}^T\mathbf{X} = \mathbf{V} \mathbf{\Sigma}^2 \mathbf{V}^T\]

Observations

  • Comparing this SVD result to the definition of Eigen Decomposition ($\mathbf{C} = \mathbf{P} \mathbf{\Lambda} \mathbf{P}^T$) proves the following essential connections:
    • The Eigenvectors of $\mathbf{X}^T\mathbf{X}$ (the principal components $\mathbf{w}$) are the columns of the Right Singular Vector Matrix ($\mathbf{V}$).
    • The Eigenvalues ($\lambda_i$) are the squared Singular Values ($\sigma_i^2$).
    • Therefore, to perform PCA, modern libraries simply compute the SVD of the centered data matrix $\mathbf{X}$, and the principal components are read directly from the $\mathbf{V}$ matrix.
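The SVD–eigendecomposition correspondence is straightforward to verify in NumPy. A sketch on synthetic centered data (shapes and seed are illustrative assumptions; eigenvectors match the singular vectors only up to sign):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 4))
X = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(X, full_matrices=False)  # s: descending singular values
eigvals, eigvecs = np.linalg.eigh(X.T @ X)        # ascending eigenvalues

# Eigenvalues of X^T X are the squared singular values,
# and the top eigenvector matches the first row of V^T up to sign.
```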

Conclusion

The solution proves that the vector $\mathbf{w}$ that maximizes the projected variance must be an eigenvector of the covariance-related matrix $\mathbf{C}$, and the maximized variance itself is given by the corresponding eigenvalue $\lambda$.

To find the direction that captures the absolute maximum variance (the first principal component), we must choose the eigenvector corresponding to the largest eigenvalue ($\lambda_{\max}$).

The subsequent principal components are the eigenvectors corresponding to the next largest eigenvalues, as they capture the maximum remaining variance orthogonal to the previously found components.


PCA in Practice: Projection and Reconstruction

When applying Principal Component Analysis (PCA) to reduce data from $n$ dimensions to $k$ dimensions (where $k \ll n$), we perform a specific sequence of matrix operations. This guide explains the exact linear algebra behind the Projection (Encoding) and Reconstruction (Decoding) phases using the Singular Value Decomposition (SVD) components.

1. The Setup: SVD Decomposition

Let $\mathbf{X}$ be our centered data matrix of shape $(m \times n)$, where $m$ is the number of samples and $n$ is the number of features. We perform SVD on $\mathbf{X}$:

\[\mathbf{X} = \mathbf{U} \mathbf{\Sigma} \mathbf{V}^T\]
  • $\mathbf{U}$: $(m \times m)$ Orthogonal matrix (Left Singular Vectors).
  • $\mathbf{\Sigma}$: $(m \times n)$ Diagonal matrix of singular values.
  • $\mathbf{V}^T$: $(n \times n)$ Orthogonal matrix (Right Singular Vectors). The rows of $\mathbf{V}^T$ (or columns of $\mathbf{V}$) are the Principal Components (Eigenvectors of $\mathbf{X}^T\mathbf{X}$).

2. Projection: Reducing Dimensions ($n \to k$)

To reduce dimensions, we want to project our original data $\mathbf{X}$ onto the top $k$ principal components.

Step 1: Select the Components

We take the first $k$ columns of the matrix $\mathbf{V}$. Let’s call this truncated matrix $\mathbf{V}_k$.

  • Shape of $\mathbf{V}_k$: $(n \times k)$

Step 2: The Projection Formula

The projected data, let’s call it $\mathbf{Z}$ (or $X_{new}$), represents our data in the new $k$-dimensional subspace.

\[\mathbf{Z} = \mathbf{X} \cdot \mathbf{V}_k\]

Matrix Multiplication Analysis

  • $\mathbf{X}$: $(m \times n)$
  • $\mathbf{V}_k$: $(n \times k)$
  • Result $\mathbf{Z}$: $(m \times k)$

Interpretation: Each row in $\mathbf{Z}$ is a sample represented by $k$ new features (the principal components). We have successfully compressed the data from $n$ features to $k$.
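The projection step above maps directly onto NumPy operations. A minimal sketch, assuming synthetic centered data with $n = 5$ features and a target dimension of $k = 2$:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 5))
X = X - X.mean(axis=0)               # centered data, shape (m x n)

_, _, Vt = np.linalg.svd(X, full_matrices=False)

k = 2
Vk = Vt[:k].T                        # first k principal components, shape (n x k)
Z = X @ Vk                           # projected data, shape (m x k)
```

Note that `np.linalg.svd` returns $\mathbf{V}^T$, so the first $k$ *rows* of `Vt` are transposed to obtain the $(n \times k)$ matrix $\mathbf{V}_k$.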

3. Lossy Reconstruction: The Reverse Process ($k \to n$)

We can attempt to go back to the original dimension space. However, since we discarded information (the bottom $n-k$ components), we cannot get the exact original data back. We get an approximation.

The Reconstruction Formula

To reconstruct the data ($\mathbf{X}_{approx}$), we project our lower-dimensional representation $\mathbf{Z}$ back into the original feature space using the transpose of the components matrix.

\[\mathbf{X}_{approx} = \mathbf{Z} \cdot \mathbf{V}_k^T\]

Matrix Multiplication Analysis

  • $\mathbf{Z}$: $(m \times k)$
  • $\mathbf{V}_k^T$: $(k \times n)$ (The transpose of our selected components)
  • Result $\mathbf{X}_{approx}$: $(m \times n)$

Interpretation: We are back to $m$ samples and $n$ features. However, $\mathbf{X}_{approx} \neq \mathbf{X}$.

The reconstruction $\mathbf{X}_{approx}$ is the “shadow” of the original data on the $k$-dimensional hyperplane defined by the principal components.

The difference, $\lVert\mathbf{X} - \mathbf{X}_{approx}\rVert^2$, is the projection error (or information loss).
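The full encode/decode round trip, including the projection error, can be sketched as follows (the data is synthetic, deliberately stretched so the first two components dominate; by the Frobenius-norm property of SVD, the error equals the sum of the discarded squared singular values):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 5)) @ np.diag([5.0, 3.0, 1.0, 0.2, 0.1])
X = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
Vk = Vt[:k].T                               # (n x k)

Z = X @ Vk                                  # encode: (m x k)
X_approx = Z @ Vk.T                         # decode: (m x n), lossy
error = np.linalg.norm(X - X_approx) ** 2   # projection error ||X - X_approx||^2
```

The reconstruction has the original shape but is only an approximation; the information lost is exactly the variance carried by the discarded components.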

This post is licensed under CC BY 4.0 by the author.