Transformer Architecture

The transformer architecture is a deep learning model design introduced in the 2017 paper "Attention Is All You Need" (Vaswani et al.). It replaced recurrent and convolutional structures with a mechanism called self-attention, which allows every position in a sequence to attend directly to every other position. The resulting representations capture long-range dependencies without the sequential computation bottleneck of recurrent networks. The transformer is the substrate on which nearly all modern large language models are built, and it has come to dominate across modalities, including text, images, audio, and protein sequences. That breadth suggests it is not merely a useful architecture but something closer to a general-purpose approximator of sequence-to-sequence functions. Whether this generality reflects a deep structural fit between the attention mechanism and the structure of natural intelligence, or is simply the consequence of having the most compute thrown at it, remains genuinely open. The scaling law literature suggests the answer may be both, inseparably.

Architecture and Computation

A transformer consists of stacked layers, each containing two main sub-layers: a multi-head self-attention mechanism and a position-wise feed-forward network. A residual connection and layer normalization around each sub-layer stabilize training across deep stacks. The critical departure from previous architectures is the elimination of recurrence: every position in the sequence interacts with every other position in parallel, through the attention mechanism. Because attention by itself is insensitive to token order, positional information is added to the input representations.
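The layer structure described above can be sketched in a few lines of plain Python. This is a minimal illustration under simplifying assumptions, not a reference implementation: it uses a single attention head with identity projections instead of learned multi-head projections, omits biases and the learned gain and offset of layer normalization, and applies normalization after each residual connection. The names `transformer_layer`, `add_and_norm`, `W1`, and `W2` are hypothetical, and the weights are hand-picked toy values.

```python
import math

def layer_norm(v, eps=1e-5):
    """Normalize one vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(v) / len(v)
    var = sum((a - mean) ** 2 for a in v) / len(v)
    return [(a - mean) / math.sqrt(var + eps) for a in v]

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def self_attention(xs):
    """Single-head self-attention with identity projections (a simplification)."""
    scale = 1.0 / math.sqrt(len(xs[0]))
    out = []
    for q in xs:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in xs]
        m = max(scores)                      # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        out.append([sum(e / total * v[c] for e, v in zip(exps, xs))
                    for c in range(len(q))])
    return out

def feed_forward(v, W1, W2):
    """Position-wise feed-forward network: linear, ReLU, linear."""
    hidden = [max(0.0, h) for h in matvec(W1, v)]
    return matvec(W2, hidden)

def add_and_norm(x, sublayer_out):
    """Residual connection followed by layer normalization."""
    return layer_norm([a + b for a, b in zip(x, sublayer_out)])

def transformer_layer(xs, W1, W2):
    """One layer: self-attention sub-layer, then feed-forward sub-layer."""
    attended = self_attention(xs)
    xs = [add_and_norm(x, a) for x, a in zip(xs, attended)]
    return [add_and_norm(x, feed_forward(x, W1, W2)) for x in xs]

# Toy input: 3 positions, model width 2, feed-forward width 4.
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W1 = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
W2 = [[1.0, 0.0, -1.0, 0.0], [0.0, 1.0, 0.0, -1.0]]

ys = xs
for _ in range(3):                           # stack three layers
    ys = transformer_layer(ys, W1, W2)
```

Every layer maps a sequence of vectors to a sequence of the same shape, which is what allows layers to be stacked to arbitrary depth.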

The attention operation computes, for each position, a weighted sum of value vectors from all positions in the sequence (including the position itself). The weights come from a softmax over the scaled dot products between that position's query vector and every position's key vector. This is not merely a technical trick. It is a shift from sequential compression, in which recurrent networks force the entire history through a fixed-size hidden state, to distributed preservation: the full sequence remains accessible at every layer, and the model itself decides what to compress and what to retain.
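As a concrete illustration of the operation described above, the following is a minimal sketch of scaled dot-product attention in plain Python. It assumes queries, keys, and values are supplied directly as lists of vectors; in a real transformer they are learned linear projections of the layer input, which this sketch leaves out. The function names and the toy input are illustrative.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention.

    Returns (outputs, weights), where weights[i][j] is how strongly
    position i attends to position j.
    """
    scale = 1.0 / math.sqrt(len(keys[0]))    # 1 / sqrt(d_k)
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query to every key, including its own position.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        w = softmax(scores)
        # Weighted sum of the value vectors.
        out = [sum(w_j * v[c] for w_j, v in zip(w, values))
               for c in range(len(values[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights

# Toy sequence of 3 positions; the same vectors serve as queries, keys, and values.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(x, x, x)
```

Each row of `weights` sums to 1. Here the first position places more weight on the positions whose keys align with its query (positions 0 and 2) than on position 1, whose key is orthogonal to it.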

The computational cost of full attention is O(n²) in the sequence length n, because a score is computed for every pair of positions. This quadratic bottleneck is a primary constraint on context window size. Various alternatives attempt to reduce the cost: sparse and linear attention approximate the all-pairs computation, while state-space models replace it with a fixed-size recurrent state. Each generally sacrifices some of the all-pairs expressiveness that makes the transformer powerful.
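The quadratic growth is easy to see with a little arithmetic. The sketch below counts the query-key scores computed by full attention in one layer and estimates the memory needed if the whole score matrix were stored at once. The function names, the head count, and the 2 bytes per score are illustrative assumptions, not figures from any particular model.

```python
def attention_score_entries(seq_len, num_heads=1):
    """Number of query-key scores full attention computes in one layer."""
    return num_heads * seq_len * seq_len

def score_memory_bytes(seq_len, num_heads=1, bytes_per_score=2):
    """Memory for the score matrix, assuming it is stored in full."""
    return attention_score_entries(seq_len, num_heads) * bytes_per_score

# Doubling the sequence length quadruples the number of scores.
for n in (1_024, 2_048, 4_096):
    print(n, attention_score_entries(n), score_memory_bytes(n, num_heads=8))
```

Under these assumptions, going from 1,024 to 4,096 positions multiplies the score count by 16, which is why context length rather than model width tends to dominate attention cost for long inputs.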