Intrinsic Dimension

The intrinsic dimension of a dataset is the minimum number of variables needed to represent the data without significant loss of information — the true degrees of freedom of the underlying generating process, as opposed to the ambient dimension of the space in which the data is embedded. A dataset may live in a thousand-dimensional space while having an intrinsic dimension of only three or four, meaning that the relevant structure is concentrated on a low-dimensional manifold.

Intrinsic dimension is not merely a preprocessing statistic. It is a diagnostic: if the intrinsic dimension is much lower than the ambient dimension, then dimensionality reduction is not just helpful but theoretically justified, because a faithful low-dimensional representation exists. Methods like the Levina-Bickel maximum-likelihood estimator and the TwoNN estimator infer intrinsic dimension from the scaling of nearest-neighbor distances: on a d-dimensional manifold, the number of neighbors within radius r of a point grows roughly as r^d. Such estimates can reveal that the apparent complexity of a dataset is an artifact of its representation rather than its nature.
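As a rough illustration of how nearest-neighbor distances can reveal dimension, the sketch below implements a maximum-likelihood variant of the TwoNN idea: under a locally uniform density, the ratio of each point's second to first nearest-neighbor distance follows a Pareto law whose shape parameter is the intrinsic dimension. The function names, the synthetic two-parameter sheet embedded in five coordinates, and the brute-force neighbor search are all assumptions made for this example; it is not a reference implementation of any published estimator.

```python
import math
import random


def two_nn_dimension(points):
    """Estimate intrinsic dimension from ratios of 2nd to 1st nearest-neighbor distances."""
    log_ratios = []
    for i, p in enumerate(points):
        # Brute-force distances from p to every other point (O(n^2) overall).
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        r1, r2 = dists[0], dists[1]
        if r1 > 0:  # skip exact duplicates, which make the ratio undefined
            log_ratios.append(math.log(r2 / r1))
    # If r2/r1 is Pareto-distributed with shape d, the maximum-likelihood
    # estimate of d is (number of ratios) / (sum of log ratios).
    return len(log_ratios) / sum(log_ratios)


def sample_embedded_sheet(n, seed=0):
    """Hypothetical test data: two free parameters (u, v) mapped into 5 coordinates."""
    rng = random.Random(seed)
    pts = []
    for _ in range(n):
        u, v = rng.random(), rng.random()
        pts.append((u, v, math.sin(u), math.cos(v), u * v))
    return pts


points = sample_embedded_sheet(400)
estimate = two_nn_dimension(points)
print(f"ambient dimension: 5, estimated intrinsic dimension: {estimate:.2f}")
```

The estimate should land near 2 rather than 5, since only two parameters generate the data. On real data the result is sensitive to noise, sample size, and density variation, so it is better read as a diagnostic than as an exact count.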

The concept generalizes beyond data analysis to dynamical systems, where the dimension of an attractor bounds the number of variables needed to embed it faithfully (embedding theorems in the tradition of Takens guarantee that slightly more than twice the attractor's dimension is generically enough), and to neural networks, where the low intrinsic dimension of the data manifold is one proposed explanation for why overparameterized models generalize despite having vastly more parameters than training examples.
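To make the dynamical-systems case concrete, the sketch below reconstructs a state space from a single observed variable by delay-coordinate embedding, one common way to build such an embedding (the text above does not name a specific method). The Hénon map, its standard parameters a = 1.4 and b = 0.3, the embedding dimension of 3, and the helper names are all choices made for this illustration. The Hénon attractor's fractal dimension is commonly reported as roughly 1.26, so three delay coordinates exceed twice that value.

```python
def henon_series(n, a=1.4, b=0.3, discard=100):
    """Return n successive x-values of the Henon map after dropping a transient."""
    x, y = 0.0, 0.0
    series = []
    for step in range(n + discard):
        x, y = 1.0 - a * x * x + y, b * x
        if step >= discard:
            series.append(x)
    return series


def delay_embed(series, dim, lag=1):
    """Build vectors (s[i], s[i+lag], ..., s[i+(dim-1)*lag]) from a scalar series."""
    n_vectors = len(series) - (dim - 1) * lag
    return [tuple(series[i + k * lag] for k in range(dim)) for i in range(n_vectors)]


series = henon_series(1000)
vectors = delay_embed(series, dim=3, lag=1)
print(f"{len(series)} scalar observations -> {len(vectors)} points in 3 dimensions")
```

Each reconstructed point uses only the observed x-values; the hidden y-coordinate is never accessed, which is the sense in which a low attractor dimension limits how many coordinates a faithful reconstruction needs.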

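In practice, intrinsic dimension is not a property of the data alone; it also depends on how the data is observed. The dimension of a smooth manifold is unchanged by any smooth, invertible change of coordinates, but an estimate from finite samples depends on the scale of observation, the noise level, and the metric used to measure distances. A noisy curve, for example, looks one-dimensional at coarse scales and fills the ambient space at scales below the noise level. Change the scale or the metric, and the estimated dimension may change, which means the 'true' dimension reported for a dataset is less a single number than a negotiation.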