cs.thefarshad

Dimensionality Reduction & PCA

Compress many features into a few by projecting onto the directions of greatest variance.

Real datasets often have dozens or thousands of features, many of them redundant or correlated. Dimensionality reduction squeezes them into a handful of new features while keeping as much information as possible — making data easier to visualize, faster to model, and less prone to the curse of dimensionality.

The workhorse is Principal Component Analysis (PCA). It finds new axes, the principal components, pointing along the directions where the data varies most. Below, a tilted 2D cloud gets its PC1 and PC2 axes; toggle the projection to collapse every point onto PC1.

[Interactive figure: a tilted 2D point cloud with its PC1 (max variance) and PC2 axes. Variance explained: PC1 = 96.8%, PC2 = 3.2%; keeping PC1 alone retains most of the spread in one dimension.]

Variance is information

PCA’s core assumption is that directions of high variance carry the signal and low-variance directions are mostly noise. PC1 is the single line along which the points are most spread out. PC2 is the next most-spread direction that is orthogonal (perpendicular) to PC1, and so on. Each component is a linear combination of the original features, pointing a new way through the space.
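One way to write this down, as a sketch under the assumption that the data has already been centered: let $\Sigma$ be the covariance matrix of the features and $w$ a unit-length direction (both symbols are introduced here for illustration). The variance of the data projected onto $w$ is $w^\top \Sigma w$, so the first two components can be described as:

```latex
w_1 = \arg\max_{\|w\| = 1} \; w^\top \Sigma \, w
\qquad
w_2 = \arg\max_{\substack{\|w\| = 1 \\ w \perp w_1}} \; w^\top \Sigma \, w
```

Here $w_1$ plays the role of PC1 and $w_2$ the role of PC2; later components repeat the same maximization while staying orthogonal to all earlier ones.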

How it works

  1. Center the data by subtracting the mean of each feature.
  2. Compute the covariance matrix, which records how features vary together.
  3. Take its eigenvectors and eigenvalues. The eigenvectors are the principal-component directions; each eigenvalue $\lambda_k$ is the variance captured along that direction.
  4. Sort by eigenvalue and keep the top few. Project the data onto them — drop a perpendicular from each point onto the chosen axis (the dashed lines in the visualizer).
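The four steps above can be sketched in NumPy. This is a minimal illustration rather than a reference implementation: the function name `pca`, the synthetic tilted cloud, and its 30-degree tilt are all assumptions made up for the example, and the sketch assumes at least two features.

```python
import numpy as np

def pca(X, n_components):
    """Minimal PCA via eigendecomposition of the covariance matrix."""
    X = np.asarray(X, dtype=float)
    # Step 1: center each feature (column) at zero.
    Xc = X - X.mean(axis=0)
    # Step 2: covariance matrix, shape (n_features, n_features).
    cov = np.cov(Xc, rowvar=False)
    # Step 3: eigh handles symmetric matrices and returns
    # eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Step 4: sort descending, keep the top directions, project.
    order = np.argsort(eigvals)[::-1]
    eigvals = eigvals[order]
    components = eigvecs[:, order][:, :n_components]
    scores = Xc @ components
    return scores, components, eigvals

# Hypothetical data: an elongated 2D cloud rotated by 30 degrees.
rng = np.random.default_rng(0)
base = rng.normal(size=(500, 2)) * [3.0, 0.5]
theta = np.deg2rad(30)
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
X = base @ R.T

scores, components, eigvals = pca(X, n_components=1)
print("PC1 direction:", components[:, 0])
print("variance explained:", eigvals / eigvals.sum())
```

The sign of an eigenvector is arbitrary, so the PC1 direction may come out flipped; the projection is equally valid either way.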

Variance explained

Because eigenvalues are variances, the fraction kept by component $k$ is

$$\text{explained}_k = \frac{\lambda_k}{\sum_j \lambda_j}$$

The plot reports these percentages. If PC1 explains, say, 95% of the variance, you can throw away PC2 and represent each 2D point with one number, losing almost nothing. With high-dimensional data you keep just enough components to reach a target like 95% cumulative variance, which is often a huge reduction.
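Choosing the number of components for a cumulative-variance target can be sketched as follows. The helper name `components_for_target` and the five eigenvalues are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import numpy as np

def components_for_target(eigvals, target=0.95):
    """Smallest number of components whose cumulative explained
    variance reaches `target` (a fraction between 0 and 1)."""
    eigvals = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    ratios = eigvals / eigvals.sum()      # explained_k for each component
    cumulative = np.cumsum(ratios)
    # Index of the first cumulative value >= target, converted to a count.
    k = int(np.searchsorted(cumulative, target)) + 1
    # Guard against rounding that leaves the final sum just below target.
    k = min(k, len(eigvals))
    return k, ratios, cumulative

# Made-up eigenvalues for a 5-feature dataset.
eigvals = [4.0, 2.5, 1.0, 0.3, 0.2]
k, ratios, cumulative = components_for_target(eigvals, target=0.95)
print("explained:", np.round(ratios, 4))
print("cumulative:", np.round(cumulative, 4))
print("components to keep:", k)
```

With these values the first three components fall short of 95% and the fourth pushes the total past it, so four of the five components are kept.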

Caveats

  • PCA is linear: it can only rotate and project, so it misses curved structure. Nonlinear methods such as t-SNE and UMAP (used mainly for visualization) handle that.
  • It is unsupervised — it maximizes variance, not class separation, so the most spread-out direction is not always the most predictive one.
  • Scale matters: when features are measured in different units, standardize them first, or the feature with the largest numeric variance dominates the leading components.
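The scaling caveat is easy to see on synthetic data. In the sketch below, the two features, their names (`length_mm`, `weight_kg`), and the helper `first_component` are all invented for illustration: both features follow the same underlying signal, but one is recorded in units a thousand times larger.

```python
import numpy as np

def first_component(X):
    """Unit vector for PC1: the top eigenvector of the covariance matrix."""
    Xc = X - X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    return eigvecs[:, np.argmax(eigvals)]

# Hypothetical data: two correlated features on very different scales.
rng = np.random.default_rng(0)
n = 1000
shared = rng.normal(size=n)
length_mm = 1000.0 * (shared + 0.5 * rng.normal(size=n))
weight_kg = shared + 0.5 * rng.normal(size=n)
X = np.column_stack([length_mm, weight_kg])

# Without scaling, PC1 points almost entirely along the large-unit feature.
raw_pc1 = first_component(X)

# Standardize: zero mean and unit variance per feature, then repeat.
X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
std_pc1 = first_component(X_std)

print("PC1 loadings, raw units:   ", np.abs(raw_pc1))
print("PC1 loadings, standardized:", np.abs(std_pc1))
```

On the raw data the loading on `length_mm` is close to 1 and the other is close to 0. After standardizing, the two features contribute equally, which matches how the data was generated.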

Further reading: scikit-learn — PCA.

Takeaways

  • Dimensionality reduction compresses many correlated features into a few informative ones.
  • PCA projects onto orthogonal directions of maximum variance, found via the eigenvectors of the covariance matrix.
  • Eigenvalues give the variance explained, telling you how many components to keep; PCA is linear and unsupervised.