Kernel method

A kernel method is a family of algorithms that compute in high-dimensional — sometimes infinite-dimensional — spaces without ever performing the explicit coordinate mapping. The trick is simple and profound: instead of mapping data points x to a feature space φ(x) and computing inner products there, the kernel method computes a function K(x, y) = <φ(x), φ(y)> directly from the original coordinates. If K is a valid kernel — symmetric and positive semi-definite, satisfying Mercer's theorem — then there exists some feature space in which it is indeed an inner product.

This abstraction transforms algorithms. A linear classifier in the feature space becomes a non-linear classifier in the original space. A linear regression becomes a flexible, smooth function approximator. The support vector machine, Gaussian process regression, and kernel PCA all rest on this single idea: linearity in the right space is non-linearity in the wrong one.

The choice of kernel is an inductive bias in disguise. The RBF kernel encodes a preference for smooth, locally similar functions. The polynomial kernel encodes a preference for algebraic structure. The linear kernel encodes a preference for — linearity. No kernel is universal; each imports a geometry of similarity that may or may not match the data's true structure.

Kernel methods dominated machine learning before deep learning because they offered theoretical guarantees — convex optimization, generalization bounds, explicit control of model complexity — that neural networks lacked. Their decline was not a conceptual defeat but a scaling one: kernel matrices grow quadratically with dataset size, and modern data is measured in billions of points. Yet the geometric intuition persists in the attention mechanisms of transformers, where inner products between queries and keys serve the same representational role.