Universal Approximation Theorem
The Universal Approximation Theorem states that a feedforward neural network with a single hidden layer of sufficient width can approximate any continuous function on a compact subset of n-dimensional Euclidean space to arbitrary precision, provided the activation function is non-constant, bounded, and continuous. The theorem is a mathematical existence result, not an engineering prescription: it says nothing about how many neurons are required, how to find the approximating network, or whether gradient-based training will converge to it.
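The statement can be written formally. One standard formulation (with the notation assumed here: sigma the activation, K the compact domain, C(K) the continuous functions on K under the sup norm) is:

```latex
% Finite sums of ridge functions are dense in C(K), K ⊂ R^n compact:
f_N(x) = \sum_{i=1}^{N} \alpha_i \,\sigma\!\left(w_i^{\top} x + b_i\right),
\qquad
\forall g \in C(K),\ \forall \varepsilon > 0:\
\exists\, N,\ \{\alpha_i, w_i, b_i\}_{i=1}^{N}
\ \text{such that}\
\sup_{x \in K} \bigl| f_N(x) - g(x) \bigr| < \varepsilon .
```

Note that N is existentially quantified with no bound: the theorem guarantees some finite width suffices for each target and tolerance, but gives no estimate of how large it must be.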
The theorem is frequently cited to justify the expressive capacity of neural networks. This is technically correct and practically misleading: knowing that some network can approximate a function says nothing about the networks actually found by training in practice. A lock that can be opened by some key does not help if you cannot find the key. The relevant question, namely how efficiently a given architecture and training procedure can learn a given function class, is answered by statistical learning theory, not by the Universal Approximation Theorem.
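The gap between existence and training can still be probed numerically. The sketch below is illustrative only (the target sin(3x), the tanh activation, the weight distributions, and the width schedule are all arbitrary choices, not part of any canonical construction): it fits a single-hidden-layer network by linear least squares over randomly sampled hidden weights and reports the worst-case error on a grid as width grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def target(x):
    # An arbitrary continuous function on [-1, 1] to approximate.
    return np.sin(3 * x)

def fit_shallow(width, x_train, y_train):
    # Sample a random hidden layer of tanh units, then solve only for
    # the output weights by least squares (hidden weights stay fixed).
    w = rng.normal(0.0, 4.0, size=width)
    b = rng.uniform(-4.0, 4.0, size=width)
    phi = np.tanh(np.outer(x_train, w) + b)          # (n_points, width)
    coef, *_ = np.linalg.lstsq(phi, y_train, rcond=None)
    return w, b, coef

x = np.linspace(-1, 1, 400)
y = target(x)
errors = {}
for width in (2, 8, 32, 128):
    w, b, coef = fit_shallow(width, x, y)
    pred = np.tanh(np.outer(x, w) + b) @ coef
    errors[width] = np.max(np.abs(pred - y))
    print(f"width={width:4d}  max |error| = {errors[width]:.4f}")
```

Widening the hidden layer tends to shrink the error, consistent with the existence claim; nothing here, however, guarantees that gradient descent on all parameters jointly would find comparable weights.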
The result was proved independently by George Cybenko (1989) for sigmoid activations and Kurt Hornik (1991) for broader classes of activation functions. Subsequent work showed that depth can provide exponential advantages over width for certain function classes, a result that goes some way toward explaining why deep networks work, unlike the Universal Approximation Theorem, which merely says they can.
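One of the cleanest depth-separation examples is Telgarsky's sawtooth construction: composing the tent map with itself k times yields a piecewise-linear function with 2^k pieces using only O(k) ReLU units arranged in depth, whereas a single-hidden-layer ReLU network needs roughly as many units as pieces. The sketch below counts the pieces numerically on a grid (the grid size is chosen so that every breakpoint lands on a grid point; the piece count relies on that assumption).

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def tent(x):
    # One small ReLU layer computing the tent map on [0, 1]:
    # T(x) = 2x for x <= 1/2, and 2(1 - x) for x >= 1/2.
    return 2 * relu(x) - 4 * relu(x - 0.5)

x = np.linspace(0, 1, 10001)
y = x
for depth in range(1, 5):
    y = tent(y)  # compose the tent map once more per layer of depth
    # Count linear pieces of the piecewise-linear output by counting
    # slope changes between consecutive grid intervals.
    slopes = np.round(np.diff(y) / np.diff(x), 6)
    pieces = 1 + np.count_nonzero(np.diff(slopes))
    print(f"depth={depth}: {pieces} linear pieces")
```

The piece count doubles with each composition, while the parameter count grows only linearly in depth; matching 2^k pieces with a single hidden layer would require exponentially many units.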