Understanding PCA once and for all
This is a detailed, math-heavy walkthrough of PCA (Principal Component Analysis). It covers the statistical and geometric principles that lead directly to the eigenvalue decomposition.
Principal Component Analysis (PCA): Unlocking the Geometric Core
PCA is fundamentally a geometric optimization problem: finding the directions (vectors) that capture the maximum amount of variance in a dataset. This article derives the core equations that explain why we rely on eigenvectors and eigenvalues to solve dimensionality reduction.
1. The Geometric Foundation: Vector Projection
The first step in PCA is projecting data onto a new axis. We start by deriving the equation for projecting an arbitrary vector $\mathbf{x}$ onto an arbitrary vector $\mathbf{v}$.
The projection, $\mathbf{p}$, of vector $\mathbf{x}$ onto $\mathbf{v}$ is defined as the component of $\mathbf{x}$ that lies along the direction of $\mathbf{v}$.
Derivation of the Projection Vector $\mathbf{p}$
- Find the Scalar Projection: The length of the projection is found using the dot product and the length of the vector $\mathbf{v}$. The dot product $\mathbf{x} \cdot \mathbf{v}$ gives us the component of $\mathbf{x}$ that is parallel to $\mathbf{v}$, scaled by the length of $\mathbf{v}$.
- The scalar projection (length) of $\mathbf{x}$ onto $\mathbf{v}$ is: \(\text{length}(\mathbf{p}) = \frac{\mathbf{x} \cdot \mathbf{v}}{\|\mathbf{v}\|}\)
- Find the Direction: The direction of $\mathbf{p}$ is the unit vector along $\mathbf{v}$, denoted $\mathbf{u}$. \(\mathbf{u} = \frac{\mathbf{v}}{\|\mathbf{v}\|}\)
- Form the Projection Vector: The projection vector $\mathbf{p}$ is simply the scalar length multiplied by the unit direction vector $\mathbf{u}$.
Substituting $\mathbf{u} = \frac{\mathbf{v}}{\|\mathbf{v}\|}$, the projection vector $\mathbf{p}$ can be written as:
\[\mathbf{p} = \frac{\mathbf{x} \cdot \mathbf{v}}{\|\mathbf{v}\|^2} \mathbf{v}\]
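As a quick illustration, here is a minimal numpy sketch of this formula, using two made-up vectors:

```python
import numpy as np

# Project x onto v using p = (x . v / ||v||^2) * v, with made-up vectors.
x = np.array([3.0, 1.0])
v = np.array([2.0, 2.0])

p = (x @ v) / (v @ v) * v          # v @ v equals ||v||^2

print(p)                            # [2. 2.]
print(np.dot(x - p, v))             # ~0.0: the residual is orthogonal to v
```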
2. The Statistical Foundation: Data Centering and Covariance
PCA is a procedure that maximizes variance, which is intimately related to the covariance matrix.
Defining Covariance and Centering
The (sample) covariance between two random variables, $X$ and $Y$, is defined as the expected value of the product of their deviations from their respective means:
\[\text{Cov}(X, Y) = E[(X - E(X))(Y - E(Y))] = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{m - 1}\]
The natural extension of covariance to an $n$-dimensional dataset of size $m$ is a covariance matrix of dimension $n \times n$:
\[\operatorname{Cov}(\mathbf{X}) = \begin{bmatrix} \operatorname{Cov}(X_1, X_1) & \operatorname{Cov}(X_1, X_2) & \cdots & \operatorname{Cov}(X_1, X_n) \\ \operatorname{Cov}(X_2, X_1) & \operatorname{Cov}(X_2, X_2) & \cdots & \operatorname{Cov}(X_2, X_n) \\ \vdots & \vdots & \ddots & \vdots \\ \operatorname{Cov}(X_n, X_1) & \operatorname{Cov}(X_n, X_2) & \cdots & \operatorname{Cov}(X_n, X_n) \end{bmatrix}\]
- Diagonal entries: variances
- Off-diagonal entries: covariances between each pair of dimensions
Before calculating the covariance matrix, the data is typically centered. Centering the dataset $\mathbf{X}$ means subtracting the mean from every data point: $\mathbf{X}_{\text{centered}} = \mathbf{X} - E(\mathbf{X})$. Effectively, we subtract each column's (feature's) mean from that column.
If the data is perfectly centered, the mean of the new dataset is zero: $E(\mathbf{X}_{\text{centered}}) = \mathbf{0}$.
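A minimal numpy sketch of centering, using made-up data whose features have non-zero means:

```python
import numpy as np

# Subtract each column's mean from that column and verify the result is centered.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))   # toy data, means far from 0

X_centered = X - X.mean(axis=0)     # column-wise mean subtraction

print(X_centered.mean(axis=0))      # ~[0, 0, 0] up to floating-point error
```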
Effect of Centering on the Covariance Matrix
Let $\mathbf{X}$ be the original $m \times n$ feature matrix (where $m$ is the number of samples and $n$ is the number of features). Let $\mathbf{X}_c$ be the centered matrix.
The Covariance Matrix of the dataset $\mathbf{X}$ is a square $n \times n$ matrix where the element at $(i, j)$ is the covariance between feature $i$ and feature $j$.
When the data is centered ($\mathbf{X}_c$), the covariance matrix $C$ is mathematically related to the matrix product $\mathbf{X}_c^T \mathbf{X}_c$:
\[\text{Cov}(\mathbf{X}) = \frac{1}{m-1} \mathbf{X}_c^T \mathbf{X}_c\]
This means that for centered data, the covariance matrix $\text{Cov}(\mathbf{X})$ is simply a scaled version of the product $\mathbf{X}_c^T \mathbf{X}_c$.
If we are only concerned with finding the directions that maximize the variance (as in PCA), the scaling factor $\frac{1}{m-1}$ does not affect the optimal direction. Therefore, for the purpose of optimization, maximizing variance is equivalent to maximizing the quadratic form defined by the centered data’s covariance matrix.
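A minimal numpy sketch (made-up data) that builds the covariance matrix from the centered product and checks it against `np.cov`:

```python
import numpy as np

# Cov(X) = X_c^T X_c / (m - 1) for centered data X_c.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))       # m = 200 samples, n = 4 features
X_c = X - X.mean(axis=0)

C = X_c.T @ X_c / (X.shape[0] - 1)  # n x n covariance matrix

# np.cov treats rows as variables by default, hence rowvar=False.
print(np.allclose(C, np.cov(X, rowvar=False)))   # True
```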
3. The Optimization Problem: Maximizing Projected Variance
Let $\mathbf{X}$ be the centered training data matrix ($m$ rows/samples, $n$ columns/features). Let $\mathbf{w}$ be the principal component vector (our projection axis), which is an $n \times 1$ vector.
Data Projection
The projection of all data points onto the vector $\mathbf{w}$ can be efficiently calculated via matrix multiplication. For a unit-length $\mathbf{w}$ (the constraint we impose below), the scalar projection of the $i$-th row $\mathbf{x}_i$ onto $\mathbf{w}$ is exactly $\mathbf{x}_i \cdot \mathbf{w}$.
When applied to the entire dataset $\mathbf{X}$, the product $\mathbf{z} = \mathbf{X}\mathbf{w}$ results in a column vector $\mathbf{z}$ where each element $z_i$ is the projection of the data point $\mathbf{x}_i$ onto the axis defined by $\mathbf{w}$.
\[\mathbf{z} = \mathbf{X}\mathbf{w} \quad (\text{an } m \times 1 \text{ vector of projections})\]
Maximizing Variance
PCA seeks to find the vector $\mathbf{w}$ that maximizes the variance of this projected data, $\text{Var}(\mathbf{z})$. Since $\mathbf{X}$ is centered, $E(\mathbf{z}) = \mathbf{0}$. The variance of $\mathbf{z}$ is proportional to its squared magnitude:
\[\text{Var}(\mathbf{z}) \propto \|\mathbf{z}\|^2\]
Since the vector $\mathbf{w}$ defines a direction and not a magnitude, we constrain $\mathbf{w}$ to be a unit vector.
The PCA objective is thus defined as:
\[\text{Maximize } \|\mathbf{X}\mathbf{w}\|^2 \quad \text{subject to } \|\mathbf{w}\|^2 = 1\]
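To make the objective concrete, here is a small numpy sketch (made-up toy data and a random direction) that projects centered data onto a unit vector and measures the variance of the projections:

```python
import numpy as np

# Project centered data onto a unit vector w and measure the variance of z = Xw.
rng = np.random.default_rng(2)
X = rng.normal(size=(500, 3))
X = X - X.mean(axis=0)              # center the data

w = rng.normal(size=3)
w = w / np.linalg.norm(w)           # unit-length direction

z = X @ w                           # (m,) vector of scalar projections
print(z.var(ddof=1))                # sample variance of the projected data
print((z @ z) / (len(z) - 1))       # same value: ||z||^2 / (m - 1), since mean(z) ~ 0
```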
4. Re-Formulating the Objective and Connecting to Eigenvectors
The expression to maximize, $\lVert\mathbf{X}\mathbf{w}\rVert^2$, can be reformulated using properties of vectors and matrices.
Reformulating the Norm: The squared Euclidean norm of a vector $\mathbf{v}$ is the dot product of the vector with itself: $\|\mathbf{v}\|^2 = \mathbf{v}^T \mathbf{v}$.
\[\|\mathbf{X}\mathbf{w}\|^2 = (\mathbf{X}\mathbf{w})^T (\mathbf{X}\mathbf{w})\]
Using Transpose Properties: $(AB)^T = B^T A^T$.
\[(\mathbf{X}\mathbf{w})^T (\mathbf{X}\mathbf{w}) = \mathbf{w}^T \mathbf{X}^T \mathbf{X}\mathbf{w}\]
The Final Quadratic Form: Letting $\mathbf{C} = \mathbf{X}^T \mathbf{X}$ (which is proportional to the covariance matrix $\text{Cov}(\mathbf{X})$), the optimization problem becomes a quadratic form:
\[\text{Maximize } \mathbf{w}^T \mathbf{C} \mathbf{w} \quad \text{subject to } \|\mathbf{w}\|^2 = 1\]
The Connection to Eigenvectors
The problem of maximizing the quadratic form $\mathbf{w}^T \mathbf{C} \mathbf{w}$ subject to $\lVert\mathbf{w}\rVert^2 = 1$ is a standard problem solved by Eigen Decomposition.
The maximum value of the quadratic form occurs when $\mathbf{w}$ is the eigenvector corresponding to the largest eigenvalue ($\lambda_{\max}$) of the matrix $\mathbf{C}$ (where $\mathbf{C} = \mathbf{X}^T \mathbf{X}$ or the covariance matrix). The maximum variance captured is equal to $\lambda_{\max}$.
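As a sanity check, the following numpy sketch (with made-up, anisotropic toy data) computes the eigendecomposition of $\mathbf{C}$ and verifies that the top eigenvector's projected variance equals $\lambda_{\max}$ and beats a random unit direction:

```python
import numpy as np

# The eigenvector with the largest eigenvalue gives the maximum-variance direction.
rng = np.random.default_rng(3)
X = rng.normal(size=(500, 3)) @ np.diag([3.0, 1.0, 0.3])   # toy data, unequal spread
X = X - X.mean(axis=0)

C = X.T @ X / (X.shape[0] - 1)
eigvals, eigvecs = np.linalg.eigh(C)    # eigh: C is symmetric; eigenvalues ascending

w_max = eigvecs[:, -1]                  # eigenvector of the largest eigenvalue
print(eigvals[-1])                      # lambda_max
print((X @ w_max).var(ddof=1))          # projected variance equals lambda_max

# Any other unit vector captures less (or equal) variance:
w_rand = rng.normal(size=3)
w_rand /= np.linalg.norm(w_rand)
print((X @ w_rand).var(ddof=1) <= eigvals[-1])   # True
```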
5. Solving via Lagrangian Multipliers
We can formally solve the maximization problem using the Lagrangian Multiplier method, which explicitly demonstrates the link between the optimal direction $\mathbf{w}$ and the eigenvectors.
Formulating the Lagrangian
We combine the objective function and the constraint into the Lagrangian function $L$:
\[L(\mathbf{w}, \lambda) = \mathbf{w}^T \mathbf{C} \mathbf{w} - \lambda (\mathbf{w}^T \mathbf{w} - 1)\]
Where $\mathbf{C} = \mathbf{X}^T \mathbf{X}$ and $\lambda$ is the Lagrange multiplier.
Calculating the Gradient
To find the maximum, we take the gradient of $L$ with respect to $\mathbf{w}$ and set it to the zero vector ($\mathbf{0}$).
Gradient of the First Term ($\mathbf{w}^T \mathbf{C} \mathbf{w}$): The gradient of a quadratic form is $2\mathbf{C}\mathbf{w}$. (Since $\mathbf{C}$ is symmetric, $\mathbf{C} = \mathbf{C}^T$).
Gradient of the Second Term ($\lambda (\mathbf{w}^T \mathbf{w} - 1)$): The gradient is $2\lambda \mathbf{w}$.
Setting the gradient to zero:
\[\nabla_{\mathbf{w}} L(\mathbf{w}, \lambda) = 2\mathbf{C}\mathbf{w} - 2\lambda\mathbf{w} = \mathbf{0}\]
Dividing by 2 and rearranging yields:
\[\mathbf{C}\mathbf{w} = \lambda\mathbf{w}\]
This final expression is the fundamental definition of the Eigenvalue Problem.
6. The Practical Solution: Connecting to SVD
In practical machine learning libraries, PCA is almost always solved using the Singular Value Decomposition (SVD) rather than explicit Eigen Decomposition. SVD is numerically more stable and more efficient for large, sparse matrices.
Derivation via SVD
- Start with SVD: The SVD of the centered data matrix $\mathbf{X}$ is: $\mathbf{U} \mathbf{\Sigma} \mathbf{V}^T$
- Where:
- $\mathbf{U}$ contains the left singular vectors.
- $\mathbf{\Sigma}$ is a rectangular diagonal matrix containing the singular values ($\sigma_i$).
- $\mathbf{V}$ contains the right singular vectors.
- Calculate $\mathbf{X}^T\mathbf{X}$: $\mathbf{X}^T\mathbf{X} = (\mathbf{U} \mathbf{\Sigma} \mathbf{V}^T)^T (\mathbf{U} \mathbf{\Sigma} \mathbf{V}^T)$
- Apply the Transpose Rule $(\mathbf{A} \mathbf{B} \mathbf{C})^T = \mathbf{C}^T \mathbf{B}^T \mathbf{A}^T$: $\mathbf{X}^T\mathbf{X} = \mathbf{V} \mathbf{\Sigma}^T \mathbf{U}^T \mathbf{U} \mathbf{\Sigma} \mathbf{V}^T$
- Simplify using Orthogonality $\mathbf{U}^T \mathbf{U} = \mathbf{I}$ (the Identity Matrix): $\mathbf{X}^T\mathbf{X} = \mathbf{V} \mathbf{\Sigma}^T \mathbf{\Sigma} \mathbf{V}^T$
- Final Result: Since $\mathbf{\Sigma}$ is diagonal, $\mathbf{\Sigma}^T \mathbf{\Sigma}$ is a diagonal matrix containing the squared singular values ($\mathbf{\Sigma}^T \mathbf{\Sigma} = \mathbf{\Sigma}^2$), so $\mathbf{X}^T\mathbf{X} = \mathbf{V} \mathbf{\Sigma}^2 \mathbf{V}^T$.
Observations
- Comparing this SVD result to the definition of Eigen Decomposition ($\mathbf{C} = \mathbf{P} \mathbf{\Lambda} \mathbf{P}^T$) proves the following essential connections:
- The Eigenvectors of $\mathbf{X}^T\mathbf{X}$ (the principal components $\mathbf{w}$) are the columns of the Right Singular Vector Matrix ($\mathbf{V}$).
- The Eigenvalues ($\lambda_i$) are the squared Singular Values ($\sigma_i^2$).
- Therefore, to perform PCA, modern libraries simply compute the SVD of the centered data matrix $\mathbf{X}$, and the principal components are read directly from the $\mathbf{V}$ matrix (see the numerical check below).
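The following numpy sketch (on a made-up centered matrix) illustrates these connections by comparing `np.linalg.eigh` on $\mathbf{X}^T\mathbf{X}$ with `np.linalg.svd` on $\mathbf{X}$:

```python
import numpy as np

# The right singular vectors of X match the eigenvectors of X^T X,
# and the squared singular values match its eigenvalues.
rng = np.random.default_rng(4)
X = rng.normal(size=(300, 4))
X = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(X, full_matrices=False)   # s is in descending order
eigvals, eigvecs = np.linalg.eigh(X.T @ X)         # eigenvalues in ascending order

# Largest eigenvalue equals the largest squared singular value.
print(np.allclose(eigvals[-1], s[0] ** 2))          # True

# Same direction (eigenvectors are only defined up to sign).
print(np.allclose(np.abs(eigvecs[:, -1]), np.abs(Vt[0])))   # True
```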
Conclusion
The solution proves that the vector $\mathbf{w}$ that maximizes the projected variance must be an eigenvector of the covariance-related matrix $\mathbf{C}$, and the maximized variance itself is given by the corresponding eigenvalue $\lambda$.
To find the direction that captures the absolute maximum variance (the first principal component), we must choose the eigenvector corresponding to the largest eigenvalue ($\lambda_{\max}$).
The subsequent principal components are the eigenvectors corresponding to the next largest eigenvalues, as they capture the maximum remaining variance orthogonal to the previously found components.
PCA in Practice: Projection and Reconstruction
When applying Principal Component Analysis (PCA) to reduce data from $n$ dimensions to $k$ dimensions (where $k \ll n$), we perform a specific sequence of matrix operations. This guide explains the exact linear algebra behind the Projection (Encoding) and Reconstruction (Decoding) phases using the Singular Value Decomposition (SVD) components.
1. The Setup: SVD Decomposition
Let $\mathbf{X}$ be our centered data matrix of shape $(m \times n)$, where $m$ is the number of samples and $n$ is the number of features. We perform SVD on $\mathbf{X}$:
\[\mathbf{X} = \mathbf{U} \mathbf{\Sigma} \mathbf{V}^T\]
- $\mathbf{U}$: $(m \times m)$ Orthogonal matrix (Left Singular Vectors).
- $\mathbf{\Sigma}$: $(m \times n)$ Diagonal matrix of singular values.
- $\mathbf{V}^T$: $(n \times n)$ Orthogonal matrix (Right Singular Vectors). The rows of $\mathbf{V}^T$ (or columns of $\mathbf{V}$) are the Principal Components (Eigenvectors of $\mathbf{X}^T\mathbf{X}$).
2. Projection: Reducing Dimensions ($n \to k$)
To reduce dimensions, we want to project our original data $\mathbf{X}$ onto the top $k$ principal components.
Step 1: Select the Components
We take the first $k$ columns of the matrix $\mathbf{V}$. Let’s call this truncated matrix $\mathbf{V}_k$.
- Shape of $\mathbf{V}_k$: $(n \times k)$
Step 2: The Projection Formula
The projected data, let’s call it $\mathbf{Z}$ (or $X_{new}$), represents our data in the new $k$-dimensional subspace.
\[\mathbf{Z} = \mathbf{X} \cdot \mathbf{V}_k\]
Matrix Multiplication Analysis
- $\mathbf{X}$: $(m \times n)$
- $\mathbf{V}_k$: $(n \times k)$
- Result $\mathbf{Z}$: $(m \times k)$
Interpretation: Each row in $\mathbf{Z}$ is a sample represented by $k$ new features (the principal components). We have successfully compressed the data from $n$ features to $k$.
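A minimal numpy sketch of the projection step, using a made-up centered matrix and $k = 2$:

```python
import numpy as np

# Keep the top-k right singular vectors and project the centered data onto them.
rng = np.random.default_rng(5)
m, n, k = 200, 5, 2
X = rng.normal(size=(m, n))
X = X - X.mean(axis=0)                  # center first

U, s, Vt = np.linalg.svd(X, full_matrices=False)
V_k = Vt[:k].T                          # (n x k): top-k principal components

Z = X @ V_k                             # (m x k) projected data
print(Z.shape)                          # (200, 2)
```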
3. Lossy Reconstruction: The Reverse Process ($k \to n$)
We can attempt to go back to the original dimension space. However, since we discarded information (the bottom $n-k$ components), we cannot get the exact original data back. We get an approximation.
The Reconstruction Formula
To reconstruct the data ($\mathbf{X}_{approx}$), we project our lower-dimensional representation $\mathbf{Z}$ back into the original feature space using the transpose of the components matrix.
\[\mathbf{X}_{approx} = \mathbf{Z} \cdot \mathbf{V}_k^T\]
Matrix Multiplication Analysis
- $\mathbf{Z}$: $(m \times k)$
- $\mathbf{V}_k^T$: $(k \times n)$ (The transpose of our selected components)
- Result $\mathbf{X}_{approx}$: $(m \times n)$
Interpretation: We are back to $m$ samples and $n$ features. However, $\mathbf{X}_{approx} \neq \mathbf{X}$.
The reconstruction $\mathbf{X}_{approx}$ is the “shadow” of the original data on the $k$-dimensional hyperplane defined by the principal components.
The difference, $\lVert\mathbf{X} - \mathbf{X}_{approx}\rVert^2$, is the projection error (or information loss).
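A minimal numpy sketch of the full project-then-reconstruct round trip (same made-up setup as above), including the projection error:

```python
import numpy as np

# Project to k dimensions, reconstruct back to n, and measure the information loss.
rng = np.random.default_rng(6)
m, n, k = 200, 5, 2
X = rng.normal(size=(m, n))
X = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
V_k = Vt[:k].T                          # (n x k)

Z = X @ V_k                             # project:     (m x k)
X_approx = Z @ V_k.T                    # reconstruct: (m x n)

error = np.linalg.norm(X - X_approx) ** 2   # squared projection error
print(X_approx.shape, error)

# The loss equals the sum of the squared singular values that were dropped.
print(np.isclose(error, np.sum(s[k:] ** 2)))    # True
```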