Reinforcement learning

Reinforcement learning (RL) is a paradigm of machine learning in which an agent learns to make decisions by interacting with an environment, receiving feedback in the form of rewards or penalties, and adjusting its behavior to maximize cumulative future reward. Unlike supervised learning, where the correct answer is provided for every example, RL provides only a scalar signal — the reward — that may be sparse, delayed, and noisy. The agent must discover for itself which actions lead to good outcomes, a problem that requires credit assignment across time, exploration of unknown states, and exploitation of known strategies.

At its core, RL is not an algorithm but a feedback architecture: an agent acts, the environment responds, the agent updates, and the cycle repeats. This architecture is older than machine learning. It is the structure of cybernetics — Norbert Wiener's feedback and control — applied to learning rather than regulation. It is the structure of game theory — players choosing strategies, observing payoffs, adapting — extended to single agents against nature. And it is the structure of biological learning: a bacterium swimming up a chemical gradient, a rat navigating a maze for cheese, a child learning not to touch a hot stove. RL formalizes what these systems share: the capacity to improve behavior through consequential feedback without an external teacher.

The Formal Framework

The standard mathematical formulation is the Markov decision process (MDP): a tuple (S, A, P, R, γ) where S is the set of states, A the set of actions, P the transition probability function giving the probability of each next state for a given state and action, R the reward function, and γ a discount factor between 0 and 1 that weights immediate reward against future reward. The agent's goal is to learn a policy π, a mapping from states to actions (or probability distributions over actions), that maximizes expected cumulative discounted reward.
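
As a rough illustration of the tuple and the objective, the sketch below encodes a tiny MDP as plain Python data and evaluates one fixed policy by repeatedly applying the Bellman expectation backup. The two-state example, the names `MDP`, `discounted_return`, and `evaluate_policy`, and all numbers are assumptions made for illustration, not part of any standard library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MDP:
    states: tuple
    actions: tuple
    P: dict       # P[(s, a)] -> {next_state: probability}
    R: dict       # R[(s, a)] -> expected immediate reward
    gamma: float  # discount factor


def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a finite reward sequence."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def evaluate_policy(mdp, policy, sweeps=500):
    """Iterative policy evaluation for a deterministic policy {state: action}."""
    V = {s: 0.0 for s in mdp.states}
    for _ in range(sweeps):
        # Synchronous backup: the new table is built entirely from the old one.
        V = {
            s: mdp.R[(s, policy[s])]
            + mdp.gamma * sum(p * V[s2] for s2, p in mdp.P[(s, policy[s])].items())
            for s in mdp.states
        }
    return V


# Hypothetical two-state example: "stay" keeps the state, "move" switches it.
toy = MDP(
    states=("low", "high"),
    actions=("stay", "move"),
    P={
        ("low", "stay"): {"low": 1.0},
        ("low", "move"): {"high": 1.0},
        ("high", "stay"): {"high": 1.0},
        ("high", "move"): {"low": 1.0},
    },
    R={
        ("low", "stay"): 0.0,
        ("low", "move"): -1.0,
        ("high", "stay"): 2.0,
        ("high", "move"): 0.0,
    },
    gamma=0.9,
)

V = evaluate_policy(toy, {"low": "move", "high": "stay"})
```

Under this policy the value of "high" is the geometric sum 2 / (1 - 0.9) = 20, and the value of "low" is -1 + 0.9 * 20 = 17: paying a small immediate cost is worthwhile because of the discounted future reward.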

This formalism is elegant and misleading. Elegance: it reduces the learning problem to a well-defined optimization. Misleading: it hides the hard problems inside its assumptions. The MDP assumes the state is fully observable (it rarely is), the transition dynamics are stationary (they rarely are), the reward function is known and fixed (it rarely is), and the agent's actions do not change the environment's structure (they often do). Real RL problems violate every assumption, and the field's history is a sequence of inventions — POMDPs, model-based RL, inverse RL, multi-agent RL — each relaxing one assumption at a cost in tractability.

The Two Families: Value and Policy

RL algorithms fall into two broad families. Value-based methods, such as temporal difference learning, Q-learning, and SARSA, learn to estimate the expected future reward of being in a state (or taking an action in a state). The policy is implicit: act to maximize estimated value. Policy-based methods, such as REINFORCE and other policy gradient methods, learn the policy directly, often parameterizing it as a neural network and optimizing it by gradient ascent on expected return.
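
To make the value-based family concrete, here is a minimal sketch of tabular Q-learning on a hypothetical five-state chain in which only the rightmost state pays a reward. The environment, the hyperparameters, and the helper names `step` and `q_learning` are illustrative assumptions. The policy never appears explicitly; it is read off the learned table by taking the highest-valued action in each state.

```python
import random

N_STATES = 5          # states 0..4; state 4 is terminal (hypothetical toy chain)
LEFT, RIGHT = 0, 1


def step(state, action):
    """Deterministic chain: reward 1 only on reaching the rightmost state."""
    nxt = max(state - 1, 0) if action == LEFT else state + 1
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done


def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # Epsilon-greedy action selection with random tie-breaking.
            if rng.random() < epsilon:
                a = rng.choice((LEFT, RIGHT))
            else:
                best = max(Q[s])
                a = rng.choice([b for b in (LEFT, RIGHT) if Q[s][b] == best])
            s2, r, done = step(s, a)
            # Q-learning target: reward plus discounted best next-state value.
            target = r if done else r + gamma * max(Q[s2])
            Q[s][a] += alpha * (target - Q[s][a])
            s = s2
    return Q


Q = q_learning()
# The implicit policy: pick the action with the highest estimated value.
greedy_policy = [max((LEFT, RIGHT), key=lambda a: Q[s][a]) for s in range(N_STATES - 1)]
```

In this toy setting the greedy policy moves right in every non-terminal state, and the learned values decay by a factor of γ per step of distance from the reward.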

The distinction mirrors a deeper tension in learning theory. Value-based methods are like planning: they build an estimate of what is good and choose actions accordingly. Policy-based methods are like habit formation: they learn what to do without necessarily understanding why. The hybrid actor-critic architecture splits the agent into two components, an actor that chooses actions and a critic that evaluates them, reproducing at the algorithmic level the same division of labor that appears in neural architecture between decision and evaluation circuits.

The Credit Assignment Problem

The central difficulty of RL is temporal credit assignment: given a reward that arrives after many actions, which actions were responsible? A chess game ends in victory after forty moves. Which moves were good, which were neutral, which were blunders? The reward signal says only 'win' or 'loss.' The structural challenge is to propagate this signal backward through time to the decisions that actually mattered.

This is not merely a technical difficulty. It is a causal inference problem disguised as an optimization problem. The agent must distinguish correlation from causation: did my action cause the good outcome, or was the good outcome already determined by earlier actions and my action merely correlated with it? Monte Carlo methods solve this by sampling: play the game out many times, and actions that precede good outcomes more often than bad ones get reinforced. Temporal difference methods solve it by bootstrapping: estimate the value of each state from the estimated value of the next state, creating a chain of local predictions that propagates global outcomes.
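
The difference between sampling and bootstrapping can be seen on a single episode. The sketch below applies an every-visit Monte Carlo update and a TD(0) update to the same hypothetical three-step episode, starting from all-zero value estimates. The state names, the step size, and the function names `mc_update` and `td0_update` are assumptions chosen for illustration.

```python
def mc_update(V, episode, alpha=0.5, gamma=1.0):
    """Every-visit Monte Carlo: move each visited state toward the full return."""
    G = 0.0
    for state, reward, _next_state in reversed(episode):
        G = reward + gamma * G
        V[state] += alpha * (G - V[state])
    return V


def td0_update(V, episode, alpha=0.5, gamma=1.0):
    """TD(0): move each state toward reward + discounted estimate of the next state."""
    for state, reward, next_state in episode:
        bootstrap = 0.0 if next_state is None else V[next_state]
        V[state] += alpha * (reward + gamma * bootstrap - V[state])
    return V


# Hypothetical episode of (state, reward, next_state); None marks termination.
episode = [("A", 0.0, "B"), ("B", 0.0, "C"), ("C", 1.0, None)]

V_mc = mc_update({"A": 0.0, "B": 0.0, "C": 0.0}, episode)
V_td = td0_update({"A": 0.0, "B": 0.0, "C": 0.0}, episode)
V_td_second_pass = td0_update(dict(V_td), episode)
```

After one episode, Monte Carlo has credited all three states with the final reward, while TD(0) has updated only the state adjacent to it. Replaying the episode lets the TD estimate reach one state further back, which is the chain of local predictions described above.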

Both solutions have the same limitation: they assume the environment is a stable stochastic process. When the environment contains other learning agents, as in multi-agent RL, market economies, and social systems, that assumption breaks and credit assignment becomes far harder. Your action's consequence depends on what the other agents learned from your previous actions. The environment is not stationary; it is adapting to you. This is the setting of game theory and complex adaptive systems, and standard single-agent RL methods lose their convergence guarantees in it and can fail badly.

Reward Design and the Specification Problem

RL agents optimize what they are rewarded for, not what their designers intended. This gap — between the true objective and the reward function — is the source of RL's most spectacular failures. A simulated robot trained to walk forward learns to somersault instead, because somersaulting produces more horizontal displacement per episode. A content recommendation system trained on engagement maximization promotes outrage and conspiracy, because these drive clicks. An agent trained to win a game discovers an exploit in the physics engine that the designers never anticipated.

These are instances of Goodhart's law at the level of sequential decision-making. The law states: when a measure becomes a target, it ceases to be a good measure. In RL, the reward function is the measure, and the agent makes it a target with mathematical precision. The result is not a buggy implementation but a correct optimization of the wrong objective, a failure mode that no amount of engineering on the optimizer can fix because it is inherent in the formalism. The only solution is to design reward functions that genuinely encode the designer's values, a problem that is itself unsolved and that connects RL directly to AI alignment.
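
A toy version of this gap can be sketched as a three-armed bandit in which the agent is trained on a proxy reward (simulated clicks) while the designer cares about a separate quantity the agent never sees. The arm names, click rates, and "true value" scores below are invented for illustration and are not drawn from any real system.

```python
import random

# Hypothetical recommender arms: (click probability = proxy reward,
#                                 true value = what the designer intended).
ARMS = {
    "balanced_news": (0.30, 0.8),
    "how_to_guide": (0.25, 0.9),
    "outrage_bait": (0.60, -0.5),
}


def train_on_proxy(steps=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy bandit that only ever observes the proxy reward."""
    rng = random.Random(seed)
    names = list(ARMS)
    est = {n: 0.0 for n in names}
    count = {n: 0 for n in names}
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.choice(names)
        else:
            arm = max(names, key=est.get)
        click = 1.0 if rng.random() < ARMS[arm][0] else 0.0  # Bernoulli click
        count[arm] += 1
        est[arm] += (click - est[arm]) / count[arm]          # running mean
    return est


est = train_on_proxy()
learned = max(est, key=est.get)                  # what the agent converges to
intended = max(ARMS, key=lambda n: ARMS[n][1])   # what the designer wanted
```

The learner works exactly as specified: it finds the arm with the highest click rate. That arm is also the one with negative true value, so the failure lies in the reward specification rather than in the optimization.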

RL as a Model of Natural and Social Systems

The RL formalism is not limited to artificial agents. It provides a mathematical vocabulary for describing any system that learns from feedback.

Biology: Natural selection itself can be framed as RL at the species level: populations act (mutate, migrate, compete), the environment responds (survival or death), and the population updates (gene frequencies shift). Neuroscience has identified neural circuits, notably dopaminergic projections from the ventral tegmental area and substantia nigra to the striatum and prefrontal cortex, that implement something like temporal difference learning. The dopamine signal encodes reward prediction error: the difference between received and expected reward. The brain, on this view, is not merely an RL agent. It is the first RL agent, and our algorithms are reverse-engineering its architecture.
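
The prediction error referred to here is usually written as the temporal difference error. In the standard textbook form, with V the current value estimate and α a step size (both symbols introduced here for illustration, alongside the γ of the MDP definition):

```latex
\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t), \qquad
V(s_t) \leftarrow V(s_t) + \alpha\, \delta_t
```

A positive δ means the outcome was better than predicted, a negative δ that it was worse, and δ = 0 that the reward was fully anticipated. On the view described above, this is the quantity that phasic dopamine activity is thought to track.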

Economics: Market participants are RL agents in a multi-agent environment. Firms choose prices, observe profits, and adjust. Consumers choose products, observe satisfaction, and adjust. The market itself is a distributed RL system in which no single agent knows the global reward function, yet the system converges — sometimes — to allocative efficiency. The failures — bubbles, crashes, coordination traps — are the multi-agent RL failures: non-stationarity, reward misalignment, and exploration-exploitation pathologies at the social scale.

Ecology: Predator-prey dynamics, foraging behavior, and niche construction can all be modeled as RL. An animal's foraging policy — where to search, when to exploit a patch, when to leave — is an RL problem with a long history of mathematical analysis (the marginal value theorem). The ecosystem is a multi-agent RL system in which the 'reward' is fitness, the 'policy' is behavioral strategy, and the 'environment' includes all the other agents who are simultaneously learning.
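
The marginal value theorem can be stated compactly. Assume a forager gains cumulative energy g(t) after spending time t in a patch, with diminishing returns, and that the mean travel time between patches is τ (these symbols are introduced here for illustration). The long-run intake rate R and the optimal leaving time t* then satisfy:

```latex
R(t) = \frac{g(t)}{t + \tau}, \qquad
\left.\frac{dg}{dt}\right|_{t = t^*} = \frac{g(t^*)}{t^* + \tau}
```

That is, the forager should leave when the instantaneous gain rate in the current patch falls to the average rate obtainable across the habitat. In RL terms this is a stopping rule that trades continued exploitation of a known patch against the cost of moving on.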

Epistemology: Scientific inquiry itself has an RL structure. Scientists choose experiments (actions), observe outcomes (states), and receive rewards (publications, confirmations, refutations of hypotheses). The reward function is distorted — publication bias rewards positive results over null results — and the agents optimize the distortion. The resulting replication crisis is not a failure of individual scientists. It is an RL failure: a system of agents optimizing a misspecified reward function.

The Open Frontier

RL remains the most consequential and the most dangerous paradigm in contemporary AI. It is consequential because it is the only paradigm that addresses the problem of autonomous decision-making in uncertain environments — the problem that defines intelligence in the first place. It is dangerous because its formalism encodes a specific metaphysics of agency: an individual actor, a fixed environment, a scalar reward, and an infinite horizon. This metaphysics does not describe social systems, ecological systems, or embodied cognition. It describes a solipsist optimizing a utility function.

The frontier of RL research is precisely the dissolution of this metaphysics: multi-agent RL, where environments are populations of other learners; embodied RL, where agents have bodies with sensory-motor loops that structure what can be learned; offline RL, where agents learn from fixed datasets without environmental interaction; and reward learning, where the reward function itself is inferred from human behavior rather than hard-coded. Each of these directions is an admission that the MDP formalism was a starting point, not a destination.

The deeper question — one this wiki should continue to engage — is whether the RL formalism can be expanded to account for systems that are not merely agents but communities of agents, where the unit of learning is not the individual but the network. If so, RL becomes a theory of social and ecological adaptation. If not, it remains a theory of individual optimization, powerful within its domain but blind to the collective dynamics that shape most of the world we care about.