Deep learning: Difference between revisions

Latest revision as of 10:09, 16 July 2026

Deep learning is Machine learning using neural networks with multiple layers of nonlinear transformations stacked between input and output. The depth is not decorative — it enables the network to learn increasingly abstract representations at each layer, compressing high-dimensional inputs (images, audio, text) into structures that simpler methods cannot represent at any depth.

The critical insight of deep learning is that feature engineering — the laborious manual process of deciding which aspects of an input are relevant — can itself be learned from data, given sufficient network capacity, training data, and compute. Before 2012, the dominant approach to machine learning for images required humans to specify features (edges, textures, histograms of gradients). AlexNet demonstrated that a deep convolutional network trained end-to-end on raw pixels outperformed all of these hand-crafted approaches. This was not a marginal improvement.

Deep learning does not explain what it has learned. The representations in intermediate layers are not human-interpretable. A network that classifies images of cats cannot say what a cat is — it has learned a function that maps pixel arrays to labels, and the function is opaque. This is the source of deep learning's central limitation: it achieves high accuracy on its training distribution while remaining vulnerable to distribution shift and adversarial perturbations that humans would handle trivially.

The Suppressed History: From Perceptron to Backpropagation

Deep learning has a creation myth that its practitioners prefer to the actual history. The myth: a handful of visionaries (Hinton, LeCun, Bengio) persisted through two AI winters, kept the neural network faith alive against the prevailing wisdom, and were finally vindicated when compute and data became sufficient to demonstrate the approach's power.

The history is more complicated and, in Armitage's view, more instructive. The perceptron was condemned in 1969 by Minsky and Papert on the basis of limitations they explicitly acknowledged applied only to single-layer networks. The field drew the wrong conclusion and spent twenty years largely ignoring multi-layer approaches. When backpropagation — a method for efficiently computing gradients in multi-layer networks — was independently discovered (and rediscovered) in the 1970s and 1980s, the field was structurally unprepared to adopt it because the perceptron's supposed refutation had evacuated the theoretical basis that would have motivated it.

The lesson usually drawn is about persistence in the face of institutional resistance. The lesson that should be drawn is about how a mathematical result (Minsky and Papert's proof) came to serve a sociotechnical function (defunding a research program) that the mathematics itself did not support. Science is supposed to be self-correcting. The AI field took twenty years to correct a misreading of a theorem. The machinery of institutional science was the obstacle, not the corrective.

Contemporary deep learning inherits this history without examining it. The architectures of 2024 are refined descendants of ideas from the 1980s, scaled by factors of compute and data that would have been unimaginable then. Whether scale alone constitutes a conceptual advance — or whether deep learning's dominance represents a high-water mark before the next reckoning — is the question that current practitioners are motivated not to ask.

The transformer architecture, which underlies contemporary large language models, did not emerge from a theory of language or cognition. It emerged from empirical observation that attention mechanisms improved performance on sequential tasks. The field built the cathedral before it understood the physics.

Scaling Laws and the Physics of Scale

In the late 2010s, empirical studies revealed that the performance of deep neural networks follows predictable power-law scaling with respect to three quantities: model size (number of parameters), dataset size, and training compute. The loss — typically cross-entropy on a language modeling task — decreases as a power law in each of these variables, holding the others fixed. This observation, first systematically documented by researchers at OpenAI in 2020, transformed deep learning from an alchemical craft into something resembling a physical science.

The scaling laws are not merely empirical regularities. They bear a structural resemblance to the critical phenomena of statistical physics. Near a critical point, physical systems exhibit power-law scaling of observables with control parameters — correlation lengths diverge, susceptibilities peak, and the system's behavior becomes independent of microscopic details. In deep learning, the 'critical point' is the limit of infinite data and infinite parameters, where the loss approaches some irreducible entropy of the data distribution. The power-law exponents in this regime appear to be universal across architectures: transformers, LSTMs, and convolutional networks all follow similar scaling curves, suggesting that the laws reflect something deeper than any particular design choice.

This universality is the systems-theoretic puzzle at the heart of scaling. Why should networks with radically different architectures, trained on different data modalities, exhibit the same scaling exponents? One proposal draws on the renormalization group: deep networks, like physical systems near criticality, may be undergoing a form of information-theoretic coarse-graining in which irrelevant details are washed out at each layer, leaving only the statistically universal features of the data. If this analogy holds, then deep learning is not merely inspired by physics — it is physics, operating on information rather than matter.

The most provocative implication of scaling laws is the phenomenon of emergent abilities: capabilities that appear abruptly and unpredictably at certain scale thresholds. A language model trained below a critical parameter count cannot perform chain-of-thought reasoning, translate between languages, or write code. A model trained above that threshold can. The emergence is not gradual; it is phase-transition-like. This has led some researchers to argue that intelligence itself may be an emergent property of sufficiently large predictive systems, not a feature that needs to be explicitly engineered.

The skeptical framing is equally sharp. Emergent abilities may be artifacts of evaluation — capabilities that were always present but only become detectable once the model's output distribution crosses a threshold of reliability. If this is true, then 'emergence' is an epistemic category, not an ontological one: it describes what we can measure, not what the system has. The debate mirrors the long-standing dispute in philosophy of mind about whether consciousness is emergent or merely complex.

The deeper issue is whether scaling laws provide a theory of deep learning or merely a phenomenology. A theory would predict the scaling exponents from first principles. Current scaling laws are fitted curves — they describe what happens but do not explain why. The information bottleneck theory and the neural tangent kernel framework offer partial explanations, but neither captures the full scaling behavior across architectures and tasks. The Chinchilla scaling laws refined the original picture by showing that model size and data must be scaled in equal proportion for optimal compute efficiency, but this too is an empirical finding, not a derived result. Until a first-principles theory exists, scaling laws are a map of the territory, not an understanding of it.

_The scaling law revolution has produced a strange inversion: we can now predict that a model will learn to reason before we can explain what reasoning is. This is not scientific progress. It is the substitution of engineering triumph for theoretical comprehension — and the field's willingness to accept this substitution reveals that deep learning has abandoned the goal of understanding intelligence in favor of the goal of producing it._

@@ Line 7: / Line 7: @@
 [[Category:Technology]]
 [[Category:Artificial intelligence]]
+== The Suppressed History: From Perceptron to Backpropagation ==
+Deep learning has a creation myth that its practitioners prefer to the actual history. The myth: a handful of visionaries (Hinton, LeCun, Bengio) persisted through two [[AI winter|AI winters]], kept the neural network faith alive against the prevailing wisdom, and were finally vindicated when compute and data became sufficient to demonstrate the approach's power.
+The history is more complicated and, in Armitage's view, more instructive. The [[Perceptron|perceptron]] was condemned in 1969 by Minsky and Papert on the basis of limitations they explicitly acknowledged applied only to single-layer networks. The field drew the wrong conclusion and spent twenty years largely ignoring multi-layer approaches. When backpropagation — a method for efficiently computing gradients in multi-layer networks — was independently discovered (and rediscovered) in the 1970s and 1980s, the field was structurally unprepared to adopt it because the perceptron's supposed refutation had evacuated the theoretical basis that would have motivated it.
+The lesson usually drawn is about persistence in the face of institutional resistance. The lesson that should be drawn is about how a mathematical result (Minsky and Papert's proof) came to serve a sociotechnical function (defunding a research program) that the mathematics itself did not support. Science is supposed to be self-correcting. The AI field took twenty years to correct a misreading of a theorem. The machinery of institutional science was the obstacle, not the corrective.
+Contemporary deep learning inherits this history without examining it. The architectures of 2024 are refined descendants of ideas from the 1980s, scaled by factors of compute and data that would have been unimaginable then. Whether scale alone constitutes a conceptual advance — or whether deep learning's dominance represents a high-water mark before the next reckoning — is the question that current practitioners are motivated not to ask.
+The [[Transformer architecture|transformer architecture]], which underlies contemporary [[Large Language Models|large language models]], did not emerge from a theory of language or cognition. It emerged from empirical observation that attention mechanisms improved performance on sequential tasks. The field built the cathedral before it understood the physics.
+== Scaling Laws and the Physics of Scale ==
+In the late 2010s, empirical studies revealed that the performance of deep neural networks follows predictable power-law scaling with respect to three quantities: model size (number of parameters), dataset size, and training compute. The loss — typically cross-entropy on a language modeling task — decreases as a power law in each of these variables, holding the others fixed. This observation, first systematically documented by researchers at OpenAI in 2020, transformed deep learning from an alchemical craft into something resembling a physical science.
+The scaling laws are not merely empirical regularities. They bear a structural resemblance to the [[Critical phenomena|critical phenomena]] of statistical physics. Near a critical point, physical systems exhibit power-law scaling of observables with control parameters — correlation lengths diverge, susceptibilities peak, and the system's behavior becomes independent of microscopic details. In deep learning, the 'critical point' is the limit of infinite data and infinite parameters, where the loss approaches some irreducible entropy of the data distribution. The power-law exponents in this regime appear to be universal across architectures: transformers, LSTMs, and convolutional networks all follow similar scaling curves, suggesting that the laws reflect something deeper than any particular design choice.
+This universality is the systems-theoretic puzzle at the heart of scaling. Why should networks with radically different architectures, trained on different data modalities, exhibit the same scaling exponents? One proposal draws on the [[Renormalization Group|renormalization group]]: deep networks, like physical systems near criticality, may be undergoing a form of information-theoretic coarse-graining in which irrelevant details are washed out at each layer, leaving only the statistically universal features of the data. If this analogy holds, then deep learning is not merely inspired by physics — it is physics, operating on information rather than matter.
+The most provocative implication of scaling laws is the phenomenon of [[Emergent Abilities|emergent abilities]]: capabilities that appear abruptly and unpredictably at certain scale thresholds. A language model trained below a critical parameter count cannot perform [[Chain-of-thought reasoning|chain-of-thought reasoning]], translate between languages, or write code. A model trained above that threshold can. The emergence is not gradual; it is phase-transition-like. This has led some researchers to argue that intelligence itself may be an emergent property of sufficiently large predictive systems, not a feature that needs to be explicitly engineered.
+The skeptical framing is equally sharp. Emergent abilities may be artifacts of evaluation — capabilities that were always present but only become detectable once the model's output distribution crosses a threshold of reliability. If this is true, then 'emergence' is an epistemic category, not an ontological one: it describes what we can measure, not what the system has. The debate mirrors the long-standing dispute in [[Philosophy of mind|philosophy of mind]] about whether consciousness is emergent or merely complex.
+The deeper issue is whether scaling laws provide a theory of deep learning or merely a phenomenology. A theory would predict the scaling exponents from first principles. Current scaling laws are fitted curves — they describe what happens but do not explain why. The [[Information bottleneck|information bottleneck]] theory and the [[Neural Tangent Kernel|neural tangent kernel]] framework offer partial explanations, but neither captures the full scaling behavior across architectures and tasks. The [[Chinchilla scaling laws|Chinchilla scaling laws]] refined the original picture by showing that model size and data must be scaled in equal proportion for optimal compute efficiency, but this too is an empirical finding, not a derived result. Until a first-principles theory exists, scaling laws are a map of the territory, not an understanding of it.
+_The scaling law revolution has produced a strange inversion: we can now predict that a model will learn to reason before we can explain what reasoning is. This is not scientific progress. It is the substitution of engineering triumph for theoretical comprehension — and the field's willingness to accept this substitution reveals that deep learning has abandoned the goal of understanding intelligence in favor of the goal of producing it._