Principal Component Analysis

From Emergent Wiki

Principal Component Analysis (PCA) is a linear dimensionality reduction technique that transforms a set of possibly correlated variables into a set of linearly uncorrelated variables called principal components. The first component points along the direction of maximum variance in the data; each subsequent component captures the maximum remaining variance under the constraint that it is orthogonal to all preceding components. PCA is not merely a preprocessing step for machine learning. It is a method for discovering the coordinate system in which a dataset's structure is most economically expressed: a kind of automatic cartography for high-dimensional spaces.
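
The variance-maximization idea can be written compactly. The notation below is an assumption introduced for illustration only: x_i are the n observations, x-bar is their mean, Sigma is the sample covariance matrix, and w_k is the unit vector that defines the k-th principal component.

```latex
\[
\Sigma = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^{\top},
\qquad
w_1 = \arg\max_{\|w\|_2 = 1} \; w^{\top} \Sigma \, w,
\]
\[
w_k = \arg\max_{\substack{\|w\|_2 = 1 \\ w \,\perp\, w_1, \dots, w_{k-1}}} \; w^{\top} \Sigma \, w,
\qquad k = 2, 3, \dots
\]
```

Here w^T Sigma w is the sample variance of the data projected onto the direction w, so each line asks for the unit direction of greatest variance among those still permitted by the orthogonality constraint.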

Mathematically, PCA amounts to an eigenvalue decomposition of the data covariance matrix or, equivalently, a singular value decomposition of the mean-centered data matrix. The eigenvectors define the new coordinate axes; the eigenvalues quantify the variance along each axis. By discarding components with small eigenvalues, PCA achieves dimensionality reduction with the smallest squared (L2) reconstruction error attainable by any linear projection onto the same number of dimensions. The choice of how many components to retain, often guided by the 'elbow' in the scree plot or a variance-retention threshold, is where statistical method meets human judgment.
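
The equivalence between the two decompositions, and the link between discarded eigenvalues and reconstruction error, can be checked numerically. The following is a minimal sketch in Python with NumPy, not a reference implementation. The function names (pca_eig, pca_svd, components_for_variance), the synthetic dataset, and the 95% threshold are all assumptions chosen for illustration.

```python
import numpy as np


def pca_eig(X):
    """PCA via eigendecomposition of the sample covariance matrix."""
    Xc = X - X.mean(axis=0)                  # center each variable
    cov = Xc.T @ Xc / (X.shape[0] - 1)       # sample covariance
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]        # re-sort to descending variance
    return eigvals[order], eigvecs[:, order]


def pca_svd(X):
    """PCA via singular value decomposition of the centered data matrix."""
    Xc = X - X.mean(axis=0)
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    # Squared singular values, rescaled, are the covariance eigenvalues.
    return s**2 / (X.shape[0] - 1), Vt.T


def components_for_variance(eigvals, threshold=0.95):
    """Smallest number of components whose cumulative variance share
    reaches `threshold` (an illustrative selection rule)."""
    ratio = np.cumsum(eigvals) / np.sum(eigvals)
    k = int(np.searchsorted(ratio, threshold)) + 1
    return min(k, len(eigvals))


# Synthetic data (an assumption): two latent factors spread over five
# observed variables through a fixed mixing matrix, plus small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 2))
mixing = np.array([[2.0, 2.0, 0.0, 0.0, 0.0],
                   [0.0, 0.0, 1.0, 1.0, 0.0]])
X = latent @ mixing + 0.05 * rng.normal(size=(500, 5))

eig_vals, eig_vecs = pca_eig(X)
svd_vals, svd_vecs = pca_svd(X)

k = components_for_variance(eig_vals, threshold=0.95)

# Project onto the first k axes, then map back to the original space.
Xc = X - X.mean(axis=0)
W = eig_vecs[:, :k]
X_hat = Xc @ W @ W.T
recon_error = np.sum((Xc - X_hat) ** 2) / (X.shape[0] - 1)
discarded = eig_vals[k:].sum()

print("eigenvalues (eig):", eig_vals)
print("eigenvalues (svd):", svd_vals)
print("components kept:", k)
print("reconstruction error:", recon_error, "discarded variance:", discarded)
```

With the n - 1 normalization used here, the squared reconstruction error equals the sum of the discarded eigenvalues, which is why dropping the smallest ones loses the least. In practice the SVD route is often preferred because it avoids forming the covariance matrix explicitly.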

PCA has been criticized because each component is generally a linear combination of all the original variables and is therefore difficult to interpret. Extensions address this and related limitations: sparse PCA constrains components to involve only a few original variables, and independent component analysis seeks statistically independent rather than merely uncorrelated components. In the age of deep learning, PCA's role has shifted: it is now often used for visualization, noise reduction, and baseline comparison rather than as a primary representation-learning method. Nevertheless, the principle of finding the coordinate system that makes the data's structure most explicit remains central to how intelligent systems, biological or artificial, compress information about their environments.