
Dimensionality Reduction & PCA

Compress many features into a few by projecting onto the directions of greatest variance.

Real datasets often have dozens or thousands of features, many of them redundant or correlated. Dimensionality reduction squeezes them into a handful of new features while keeping as much information as possible — making data easier to visualize, faster to model, and less prone to the curse of dimensionality.

The workhorse is Principal Component Analysis (PCA). It finds new axes, the principal components, pointing along the directions where the data varies most. Below, a tilted 2D cloud gets its $PC1$ and $PC2$ axes; toggle the projection to collapse every point onto $PC1$.

[Interactive figure: a tilted 2D point cloud with its PC1 (max variance) and PC2 axes, and a "show projection onto PC1" toggle.]

variance explained: PC1 = 96.8% · PC2 = 3.2% — keeping PC1 alone retains most of the spread in one dimension.

Variance is information

PCA’s core assumption is that directions of high variance carry the signal and low-variance directions are mostly noise. $PC1$ is the single line along which the points are most spread out. $PC2$ is the next most-spread direction that is orthogonal (perpendicular) to $PC1$, and so on. Each component is a linear combination of the original features, so it points in a new direction through the feature space.
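To make "the line along which the points are most spread out" concrete, the sketch below sweeps candidate directions through a 2D cloud, measures the variance of the data projected onto each one, and compares the best direction with the top eigenvector of the covariance matrix. The data, the random seed, and the 0.1-degree grid are assumptions chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical correlated 2D cloud (the mixing matrix is made up).
X = rng.normal(size=(400, 2)) @ np.array([[2.0, 1.2], [0.0, 0.5]])
Xc = X - X.mean(axis=0)  # center the data

# Candidate unit directions from 0 to 180 degrees in 0.1-degree steps.
angles = np.linspace(0.0, np.pi, 1801)
directions = np.column_stack([np.cos(angles), np.sin(angles)])

# Variance of the data projected onto each candidate direction.
proj_var = (Xc @ directions.T).var(axis=0, ddof=1)
best = directions[np.argmax(proj_var)]

# eigh returns eigenvalues in ascending order, so the last column is PC1.
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
pc1, pc2 = eigvecs[:, 1], eigvecs[:, 0]

print("best swept direction:", best)
print("PC1 (sign is arbitrary):", pc1)
print("PC1 . PC2 =", pc1 @ pc2)
```

The swept direction should agree with PC1 up to sign and grid resolution, and the dot product of PC1 and PC2 should be zero to floating-point precision.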

How it works

Center the data by subtracting the mean of each feature.
Compute the covariance matrix, which records how features vary together.
Take its eigenvectors and eigenvalues. The eigenvectors are the principal-component directions; each eigenvalue $\lambda_k$ is the variance captured along that direction.
Sort by eigenvalue, largest first, and keep the top few. Project the data onto them: drop a perpendicular from each point onto the chosen axis (the dashed lines in the visualizer).
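The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the function name `pca_eig`, the synthetic tilted cloud, and the 30-degree rotation are all assumptions made for the example.

```python
import numpy as np

def pca_eig(X, n_components):
    """PCA via eigendecomposition of the covariance matrix (illustrative sketch)."""
    X = np.asarray(X, dtype=float)
    # Step 1: center each feature (column) at zero.
    Xc = X - X.mean(axis=0)
    # Step 2: covariance matrix, treating columns as variables.
    cov = np.cov(Xc, rowvar=False)
    # Step 3: eigh is meant for symmetric matrices; eigenvalues come back ascending.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Step 4: sort descending, keep the top directions, and project.
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    components = eigvecs[:, :n_components]  # one column per kept component
    scores = Xc @ components                # coordinates along the kept axes
    return scores, components, eigvals

rng = np.random.default_rng(0)
# Hypothetical tilted 2D cloud: long axis (std 3), short axis (std 0.5), rotated 30 degrees.
theta = np.deg2rad(30)
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
X = (rng.normal(size=(500, 2)) * [3.0, 0.5]) @ R.T

scores, components, eigvals = pca_eig(X, n_components=1)
print("PC1 direction (sign is arbitrary):", components[:, 0])
print("eigenvalues:", eigvals)
```

The variance of `scores` along PC1 equals the largest eigenvalue, which is exactly the statement in step 3. In practice a library routine based on the SVD is usually preferred for numerical stability, but the eigendecomposition form mirrors the steps most directly.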

Variance explained

Because eigenvalues are variances, the fraction kept by component $k$ is

$\text{explained}_k = \frac{\lambda_k}{\sum_j \lambda_j}$

The plot reports these percentages. If $PC1$ explains, say, 95% of the variance, you can throw away $PC2$ and represent each 2D point with one number, losing almost nothing. With high-dimensional data you keep just enough components to reach a target like 95% cumulative variance — often a huge reduction.
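As a small worked example of the formula and the cumulative-variance rule, the sketch below turns a list of eigenvalues into explained-variance fractions and finds the smallest number of components that reaches a target. The helper name `components_for_target` and the eigenvalues are made up for illustration.

```python
import numpy as np

def components_for_target(eigvals, target=0.95):
    """Return (k, ratios): the smallest k whose cumulative explained variance
    reaches `target`, and the per-component explained fractions."""
    eigvals = np.sort(np.asarray(eigvals, dtype=float))[::-1]  # largest first
    ratios = eigvals / eigvals.sum()        # lambda_k / sum_j lambda_j
    cumulative = np.cumsum(ratios)
    # Index of the first cumulative value >= target, converted to a count.
    k = int(np.searchsorted(cumulative, target)) + 1
    # Guard against floating-point round-off when target is 1.0.
    return min(k, len(eigvals)), ratios

eigvals = [4.0, 3.0, 2.0, 0.6, 0.4]  # made-up eigenvalues
k, ratios = components_for_target(eigvals, target=0.95)
print(ratios)  # fractions 0.4, 0.3, 0.2, 0.06, 0.04
print(k)       # 4 components reach at least 95% cumulative variance
```

Here the first three components reach only 90%, so a fourth is needed to pass the 95% target.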

Caveats

PCA is linear: it can only rotate and project, so it misses curved structure. Nonlinear methods such as t-SNE and UMAP (used mainly for visualization) handle that.
It is unsupervised — it maximizes variance, not class separation, so the most spread-out direction is not always the most predictive one.
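Scale matters: standardize features first (zero mean, unit variance), or the feature with the numerically largest variance, often simply the one measured in the smallest units, dominates the leading components.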
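The scale caveat is easy to see numerically. The sketch below builds two features that carry the same underlying signal, one of them expressed in units 1000 times larger, and compares the first principal component before and after standardization. The data, the helper name `first_component`, and the factor of 1000 are assumptions for illustration.

```python
import numpy as np

def first_component(X):
    """Top eigenvector of the covariance matrix of X (illustrative helper)."""
    Xc = X - X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    return eigvecs[:, -1]  # eigh sorts ascending, so the last column is PC1

rng = np.random.default_rng(0)
base = rng.normal(size=1000)                         # shared underlying signal
a = base + 0.3 * rng.normal(size=1000)               # feature on a unit scale
b = 1000.0 * (base + 0.3 * rng.normal(size=1000))    # same kind of signal, 1000x larger numbers
X = np.column_stack([a, b])

raw_pc1 = first_component(X)

# Standardize: zero mean and unit variance for every feature.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
std_pc1 = first_component(Z)

print("PC1 on raw data:         ", raw_pc1)  # almost entirely feature b
print("PC1 on standardized data:", std_pc1)  # both features weighted equally
```

On the raw data PC1 points almost exactly along the large-valued feature; after standardization the two features get equal weight, which better reflects that they carry the same amount of signal.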

Takeaways

Dimensionality reduction compresses many correlated features into a few informative ones.
PCA projects onto orthogonal directions of maximum variance, found via the eigenvectors of the covariance matrix.
Eigenvalues give the variance explained, telling you how many components to keep; PCA is linear and unsupervised.