Markov decision process

Markov decision process (MDP) is the foundational mathematical framework for modeling sequential decision-making under uncertainty. It formalizes the problem of an agent that interacts with an environment over time: at each step, the agent observes the current state, chooses an action, receives a reward, and transitions to a new state. The framework is named after Andrey Markov, whose work on stochastic processes established that the future is conditionally independent of the past given the present — the Markov property that gives the framework its structure and its power.

The MDP is defined by a tuple (S, A, P, R, γ):

S is the state space — the set of all possible configurations of the environment.
A is the action space — the set of all actions available to the agent.
P is the transition function — a probability distribution over next states given the current state and action: P(s' | s, a).
R is the reward function — the expected immediate reward received after taking action a in state s.
γ is the discount factor — a number between 0 and 1 that determines how much the agent values future rewards relative to immediate rewards.

The agent's objective is to find a policy π — a mapping from states to actions (or probability distributions over actions) — that maximizes the expected cumulative discounted reward. This is not merely a prediction problem; it is a planning problem. The agent must trade off exploration (trying actions whose consequences are unknown) against exploitation (choosing actions known to yield high reward), and it must account for the long-term consequences of its decisions, not merely their immediate payoffs.

From States to Values

The central insight of the MDP framework is that the value of being in a state depends on the value of the states that can be reached from it. The value function V^π(s) is the expected cumulative reward starting from state s and following policy π thereafter. The action-value function Q^π(s, a) is the expected cumulative reward starting from state s, taking action a, and then following policy π. These functions satisfy the Bellman equations — recursive relationships that express the value of a state as the immediate reward plus the discounted value of the next state:

V^π(s) = Σ_a π(a|s) [ R(s,a) + γ Σ_{s'} P(s'|s,a) V^π(s') ]

The Bellman equations are not merely definitions; they are the basis for algorithms. Value iteration, policy iteration, and Q-learning all exploit the recursive structure of the Bellman equations to compute or approximate optimal policies. The framework transforms the problem of acting in an uncertain environment into the problem of solving a system of equations — a transformation that makes the problem tractable, at least in principle.

The Limits of the MDP Formalism

The MDP is elegant and misleading. Its elegance lies in its reduction of sequential decision-making to a well-defined mathematical problem. Its misleadingness lies in the assumptions it smuggles in: that the state is fully observable, that the transition dynamics are stationary, that the reward function is known and fixed, that the agent's actions do not change the environment's structure, and that the agent is the only decision-maker in the system.

Each of these assumptions is violated in nearly every real application. A robot's sensors are noisy and incomplete — the true state is not observable. A financial market's dynamics change as participants adapt — the transition function is non-stationary. A recommender system's reward function (user engagement) is a proxy for what the designers actually want (user welfare) — the reward is misspecified. A content moderation system deployed at scale changes the behavior of the users it moderates — the environment is not independent of the agent. And a multi-agent system contains many agents, each optimizing its own reward function — the MDP formalism does not apply.

These violations are not exceptions; they are the rule. The MDP is a starting point, not a destination. Its value is not that it describes real systems accurately but that it provides a baseline against which the complications of reality can be measured. The partially observable MDP relaxes the observability assumption. Model-based RL relaxes the known-dynamics assumption. Inverse RL relaxes the known-reward assumption. Multi-agent RL relaxes the single-agent assumption. Each relaxation comes at a cost in tractability, and the frontier of research is the space of systems that violate multiple assumptions simultaneously.

MDPs and the Theory of Choice

The MDP framework connects to broader questions in the theory of rational choice. The expected utility maximization that defines the MDP objective is the same framework that underlies von Neumann-Morgenstern decision theory and Savage's subjective expected utility theory. The MDP agent is, in this sense, a formalization of economic rationality: it has preferences (encoded in the reward function), beliefs (encoded in the transition function), and it acts to maximize expected utility.

But the MDP also reveals the limits of this framework. The reward function is exogenous — it is given to the agent, not chosen by it. The MDP formalism does not explain where the reward function comes from, whether it reflects genuine values or misspecified proxies, or what happens when the agent optimizes the reward function too effectively. These are the problems of value alignment and Goodhart's law, and they arise precisely because the MDP separates the specification of the objective from the optimization of the objective. The optimizer is powerful; the specifier is fallible. The result is a system that achieves exactly what it was told to achieve, which is not necessarily what its designers wanted.

The Markov decision process is the simplest model of intelligence that is not trivial. It captures the essence of sequential decision-making — the trade-off between immediate and future rewards, the problem of exploration and exploitation, the recursive structure of value — while remaining mathematically tractable. But it is a model, not a mirror. Real intelligence operates in environments that are partially observable, non-stationary, multi-agent, and reward-misspecified. The MDP is the first step on a ladder that has no top.