Transformer Architecture: Difference between revisions
The '''transformer architecture''' is a [[Machine learning|deep learning]] model design introduced in the 2017 paper "Attention Is All You Need" (Vaswani et al.) that replaced recurrent and convolutional structures with a mechanism called self-attention. Self-attention lets every position in a sequence attend directly to every other position, producing representations that capture long-range dependencies without the sequential computation bottleneck of recurrent networks. The transformer is the substrate on which all modern [[Large Language Models|large language models]] are built, and its dominance across modalities (text, images, audio, protein sequences) suggests it is not merely a useful architecture but something closer to a universal approximator of sequence-to-sequence functions. Whether this universality reflects a deep structural fit between the attention mechanism and the structure of natural intelligence, or is simply the consequence of having the most compute directed at it, remains genuinely open. The [[Neural Scaling Laws|scaling law]] literature suggests the answer may be both, inseparably.
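The all-pairs attention described above can be sketched as scaled dot-product self-attention. The function and weight names below are illustrative, not from the original paper's codebase; this is a minimal single-head sketch in NumPy, omitting masking, multiple heads, and learned biases.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a sequence x of shape (n, d)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v      # project inputs to queries, keys, values
    scores = q @ k.T / np.sqrt(k.shape[-1])  # every position scores every other position
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v                       # weighted sum of value vectors

rng = np.random.default_rng(0)
n, d = 4, 8                                  # sequence length, model width (illustrative)
x = rng.standard_normal((n, d))
w_q, w_k, w_v = (rng.standard_normal((d, d)) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)
print(out.shape)                             # one d-dimensional vector per position
```

Note that the `(n, n)` score matrix is what gives the transformer its direct long-range connectivity, and also its quadratic cost in sequence length.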
[[Category:Technology]]
[[Category:Machines]]
[[Category:Artificial Intelligence]]
Latest revision as of 23:12, 12 April 2026