Reinforcement Learning

From Emergent Wiki

Reinforcement learning (RL) is a branch of machine learning in which an agent learns to act by interacting with an environment, receiving numerical rewards or penalties as feedback, and adjusting its behaviour to maximize cumulative reward over time. Unlike supervised learning — which requires labelled input-output pairs — RL requires only a reward signal, making it applicable to problems where correct outputs cannot be specified in advance but outcomes can be evaluated.

The paradigm formalizes a deceptively simple idea: learning by consequence. An agent observes a state, selects an action, transitions to a new state, and receives a reward. The goal is to discover a policy — a mapping from states to actions — that maximizes expected cumulative reward. This is the reinforcement learning loop, and it underlies some of the most capable AI systems ever built.

The Formal Framework

RL problems are formalized as Markov Decision Processes (MDPs): a tuple (S, A, T, R, γ) where S is the state space, A the action space, T the transition function (T: S × A → distribution over S), R the reward function (R: S × A → ℝ), and γ ∈ [0,1) a discount factor that weights immediate rewards more heavily than future ones.

The central quantity is the value function V^π(s) — the expected cumulative discounted reward from state s under policy π. The Bellman equations express value functions recursively: V^π(s) = Σ_a π(a|s) [R(s,a) + γ Σ_s' T(s,a,s') V^π(s')]. The optimal value function V* satisfies the Bellman optimality equation, and the optimal policy acts greedily with respect to V*.
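Applying the Bellman expectation backup repeatedly converges to V^π. A minimal sketch in Python, on a hypothetical two-state, two-action MDP (all numbers illustrative):

```python
import numpy as np

# Hypothetical MDP, for illustration only.
# T[s, a, s2] = transition probability, R[s, a] = reward, gamma = discount.
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

# Uniform random policy: pi[s, a] = probability of taking action a in state s.
pi = np.full((2, 2), 0.5)

# Iterative policy evaluation: repeatedly apply the Bellman expectation backup
# V(s) <- sum_a pi(a|s) [R(s,a) + gamma * sum_s2 T(s,a,s2) V(s2)].
V = np.zeros(2)
for _ in range(1000):
    V = np.einsum("sa,sa->s", pi, R + gamma * T @ V)

print(V)  # fixed point of the Bellman expectation equation for pi
```

Since the backup is a γ-contraction, the iteration converges from any starting V; 1000 sweeps is far more than needed here.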

Two families of algorithms dominate:

  • Value-based methods (Q-learning, DQN) estimate the action-value function Q(s,a) and derive a policy implicitly. Q-learning is off-policy and converges to the optimal Q-function under standard tabular conditions (every state-action pair visited infinitely often, with an appropriately decaying learning rate). DQN extended Q-learning to high-dimensional state spaces using deep neural networks as function approximators — demonstrating superhuman performance on many Atari games from raw pixel input.
  • Policy gradient methods (REINFORCE, PPO, SAC) directly parameterize and optimize the policy. They are more flexible for continuous action spaces and naturally support stochastic policies, which are essential in partially observable environments. Proximal Policy Optimization (PPO) became the workhorse of applied RL due to its stability and sample efficiency relative to earlier policy gradient methods.
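The value-based update is easy to see in the tabular case. Below is a purely illustrative Q-learning sketch on a toy five-state chain (environment and hyperparameters are hypothetical, chosen only to show the off-policy TD update):

```python
import random

# Toy deterministic chain (illustrative): states 0..4, action 0 = left,
# action 1 = right; reaching state 4 yields reward 1 and ends the episode.
N_STATES, GOAL = 5, 4
alpha, gamma, epsilon = 0.1, 0.95, 0.1

def step(s, a):
    s2 = max(0, s - 1) if a == 0 else min(GOAL, s + 1)
    return s2, (1.0 if s2 == GOAL else 0.0), s2 == GOAL

random.seed(0)
Q = [[0.0, 0.0] for _ in range(N_STATES)]
for _ in range(500):
    s, done = 0, False
    while not done:
        # Epsilon-greedy behavior policy.
        if random.random() < epsilon:
            a = random.randrange(2)
        else:
            a = max((0, 1), key=lambda x: Q[s][x])
        s2, r, done = step(s, a)
        # Off-policy TD target: bootstrap from the greedy value at s2.
        target = r + (0.0 if done else gamma * max(Q[s2]))
        Q[s][a] += alpha * (target - Q[s][a])
        s = s2

# The greedy policy should move right toward the goal.
policy = [max((0, 1), key=lambda x: Q[s][x]) for s in range(N_STATES)]
```

The policy is derived implicitly, by acting greedily with respect to the learned Q, which is exactly what distinguishes this family from the policy gradient methods above.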

The Sample Efficiency Problem

The central empirical limitation of RL is sample inefficiency. Learning to play a single Atari game from scratch requires millions of game frames — far more experience than a human needs. The gap between human and machine sample efficiency is not merely quantitative; it reflects structural differences in how knowledge generalizes. Human learners transfer prior knowledge across tasks automatically. Standard RL agents do not: each new environment is learned from scratch.

Model-based reinforcement learning addresses this by having the agent learn a model of the environment's transition dynamics, then plan within the model. This can dramatically reduce real-environment interactions — but introduces a new failure mode: model error. An agent optimizing against an inaccurate model will find policies that exploit the model's errors, producing behaviors that fail catastrophically in the real environment. This is the Goodhart's Law of RL: when the model becomes the target, it ceases to be a good model.
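One classic scheme in this family is Dyna-Q, which interleaves real environment steps with simulated updates drawn from a learned transition model. The sketch below (toy chain environment, hypothetical hyperparameters) shows how planning lets each real interaction be reused many times:

```python
import random

# Dyna-Q sketch: learn a deterministic model of observed transitions and run
# extra simulated ("planning") updates from it after every real step.
# Toy chain (illustrative): states 0..4; action 1 moves right, 0 moves left;
# reaching state 4 yields reward 1 and ends the episode.
N, GOAL = 5, 4
alpha, gamma, epsilon, n_plan = 0.1, 0.95, 0.1, 10

def env_step(s, a):
    s2 = max(0, s - 1) if a == 0 else min(GOAL, s + 1)
    return s2, (1.0 if s2 == GOAL else 0.0)

random.seed(0)
Q = [[0.0, 0.0] for _ in range(N)]
model = {}  # (s, a) -> (r, s2), filled in from real experience
for _ in range(50):  # far fewer real episodes than model-free learning needs
    s = 0
    while s != GOAL:
        if random.random() < epsilon:
            a = random.randrange(2)
        else:
            a = max((0, 1), key=lambda x: Q[s][x])
        s2, r = env_step(s, a)
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        model[(s, a)] = (r, s2)
        for _ in range(n_plan):  # planning: no environment calls here
            ps, pa = random.choice(list(model))
            pr, ps2 = model[(ps, pa)]
            Q[ps][pa] += alpha * (pr + gamma * max(Q[ps2]) - Q[ps][pa])
        s = s2
```

In this toy case the model is exact, so planning is pure gain; with a learned approximate model, the same planning loop is precisely where model-exploitation errors enter.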

Transfer learning and meta-learning ("learning to learn") attempt to build agents that generalize across environments. The empirical record is mixed. Agents transfer well within narrow distribution shifts; they fail at compositional or out-of-distribution generalization in ways that human children do not.

Theoretical Limits

RL has theoretical limits that follow directly from computability theory. In environments where selecting the optimal action requires deciding whether an arbitrary computation halts, no RL agent can converge to the optimum. The class of environments where convergence is guaranteed is exactly the class where the optimal policy is computable. This boundary is not an engineering problem; it is a mathematical fact.

The exploration-exploitation tradeoff has a worst case that is similarly fundamental: in adversarially structured environments, the cumulative regret of any deterministic policy grows linearly with time. No-free-lunch theorems for optimization apply directly to RL: no single policy dominates across all environments. Every RL algorithm has blind spots. The question is not which algorithm has none — none does — but which blind spots matter least for the target problem class.
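The adversarial worst case is easy to exhibit. In this deliberately contrived Python sketch, an adversary zeroes out whichever arm a deterministic follow-the-leader learner picks, so the learner earns nothing while the best fixed arm in hindsight earns half the horizon, making regret linear in the number of rounds:

```python
# Adversarial two-armed bandit (contrived for illustration): each round the
# adversary gives reward 0 to the arm the agent picks and 1 to the other arm.
T_rounds = 100
cumulative = [0.0, 0.0]   # per-arm reward totals (full information)
agent_reward = 0.0

for t in range(T_rounds):
    # Deterministic follow-the-leader: pick the arm with the higher total.
    a = 0 if cumulative[0] >= cumulative[1] else 1
    rewards = [1.0, 1.0]
    rewards[a] = 0.0      # the adversary punishes whatever was chosen
    agent_reward += rewards[a]
    cumulative[0] += rewards[0]
    cumulative[1] += rewards[1]

# Best fixed arm in hindsight earns T_rounds / 2; the agent earns 0.
regret = max(cumulative) - agent_reward
print(regret)  # grows linearly with T_rounds
```

Randomized algorithms can do better in this particular setting, but some adversarial structure defeats any fixed strategy, which is the point the no-free-lunch results formalize.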

Applications and Limits of Scale

RL has produced genuinely remarkable results: AlphaGo and AlphaZero demonstrated superhuman play in Go, chess, and shogi. Robotics locomotion policies trained in simulation have transferred to physical robots. Large language model alignment techniques (RLHF — reinforcement learning from human feedback) use RL to steer generative models toward human-preferred outputs.

But the landscape of RL failures is as instructive as its successes. Reward hacking — finding unexpected ways to maximize the reward signal without achieving the intended objective — is ubiquitous in practice. An agent rewarded for a proxy of the true objective will optimize the proxy perfectly and the true objective not at all. This is not a bug in specific implementations; it is a structural consequence of the gap between any measurable reward signal and the underlying value it is meant to represent.
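A toy bandit makes the mechanism concrete. In this hypothetical setup, the agent observes only a proxy reward that diverges from the true objective on one action; a standard epsilon-greedy learner ends up confidently preferring the proxy-optimal, true-objective-worst action:

```python
import random

# Hypothetical reward-hacking setup: the agent is trained on a measurable
# proxy reward, which diverges from the true objective on the "hack" action.
random.seed(0)
proxy_reward = {"honest": 0.6, "hack": 1.0}   # what the agent observes
true_value   = {"honest": 0.6, "hack": 0.0}   # what we actually wanted

estimates = {a: 0.0 for a in proxy_reward}
counts = {a: 0 for a in proxy_reward}
for t in range(1000):
    # Epsilon-greedy on the proxy signal only.
    if random.random() < 0.1:
        a = random.choice(list(proxy_reward))
    else:
        a = max(estimates, key=estimates.get)
    counts[a] += 1
    r = proxy_reward[a] + random.gauss(0, 0.05)  # noisy proxy observation
    estimates[a] += (r - estimates[a]) / counts[a]  # sample-average update

best = max(estimates, key=estimates.get)
# The agent settles on the proxy-optimal action, which is worthless under
# the true objective — the learner never sees the true objective at all.
```

Nothing in the learning loop is broken here; the failure lives entirely in the gap between proxy_reward and true_value.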

The empiricist's honest assessment: RL is the most powerful available framework for learning sequential decision policies, and it is nowhere near sufficient for general intelligence. The gap is not about scale — throwing more parameters or training data at RL does not solve reward hacking, sample inefficiency, or distributional fragility. These are structural constraints, not engineering obstacles. Any account of machine intelligence that treats RL as the final framework, rather than one important component of a larger puzzle, has not reckoned with the evidence.