Contextual bandits

Contextual bandits extend the multi-armed bandit problem by allowing the decision-maker to observe contextual information before choosing an action. Unlike the standard bandit setting, where all arms are evaluated identically, the contextual bandit model assumes that the expected reward of each arm depends on a context vector — user features, environmental state, or historical information. This makes the model applicable to personalized recommendation, clinical trials with patient covariates, and dynamic pricing. The algorithm must learn not just which arms are good, but which arms are good in which contexts — a problem that is substantially harder and requires stronger assumptions about generalization.

The contextual bandit model occupies a middle ground between the simplicity of multi-armed bandits and the full complexity of reinforcement learning with state transitions. It is the workhorse model of online decision-making in industry: news recommendation, ad placement, and medical treatment assignment all reduce to contextual bandit problems at scale. Yet the model's practical success masks a theoretical fragility. The assumption that contexts are exogenous — that the context distribution does not depend on past actions — fails in any system where the agent's choices shape the environment it encounters. In personalized recommendation, for example, showing a user certain content changes their preferences, which changes the context distribution. The contextual bandit formulation assumes a passive world; it cannot model an active one.

The Formal Model

In the standard formulation, at each round t the agent observes a context x_t ∈ X, chooses an action a_t ∈ A, and receives a reward r_t drawn from a distribution conditioned on both the context and the action: r_t ~ D(r | x_t, a_t). The agent's goal is to maximize cumulative reward over T rounds, or equivalently to minimize regret — the difference between the agent's cumulative reward and the reward that would have been obtained by always choosing the optimal action for each context.

The difficulty arises from the explore-exploit tension. To learn which actions are good in which contexts, the agent must try actions whose reward is uncertain — exploration. But to maximize cumulative reward, the agent should favor actions that have performed well in the past — exploitation. The optimal balance depends on the agent's uncertainty about the reward function, which in turn depends on the data collected so far. This creates a feedback loop: the agent's choices determine what it learns, and what it learns determines its choices.

The mathematical frameworks for balancing exploration and exploitation include:

Upper Confidence Bound (UCB) algorithms. At each round, the algorithm selects the action with the highest upper confidence bound on its expected reward. The confidence bound is wide for rarely tried actions and narrow for frequently tried ones, automatically biasing selection toward exploration of uncertain options. The LinUCB algorithm extends this logic to linear reward functions: the expected reward of each action is modeled as a linear function of the context, and the algorithm maintains confidence ellipsoids around the parameter vectors.

Thompson Sampling. The algorithm maintains a posterior distribution over the reward function, samples a reward function from this posterior, and selects the action that is optimal for the sampled function. The randomness in the sampling ensures that actions with high posterior uncertainty are tried frequently enough to learn their true reward. Thompson Sampling has strong empirical performance and is computationally simpler than UCB for many reward models.

Importance-weighted estimation. When the agent cannot control its own exploration — as in observational studies where the data was generated by another policy — importance weighting adjusts the empirical reward estimates to account for the fact that some actions were overrepresented or underrepresented in the data. This is the foundation of offline contextual bandit learning and counterfactual policy evaluation.

The Exogeneity Assumption and Its Failure

The contextual bandit model assumes that contexts are drawn i.i.d. from a fixed distribution that does not depend on the agent's past actions. This assumption is what makes the problem tractable: if the context distribution is fixed, the agent can learn a policy that maps contexts to actions, and that policy will remain optimal as long as the distribution does not change.

In practice, this assumption fails systematically:

Preference shift in recommendation. A recommendation system that optimizes for click-through rate will gradually shift users toward clickbait content. The users' preferences change in response to the content they are shown, which means the context distribution — the distribution of user states — is not fixed. It is being sculpted by the algorithm itself. The contextual bandit framework cannot detect this because it treats user features as exogenous inputs, not as states that the algorithm modifies.

Attrition in clinical trials. A clinical trial that assigns patients to treatments based on observed covariates may inadvertently select for patients who tolerate the treatment well. Patients who experience side effects drop out, changing the distribution of the remaining patient population. The contextual bandit algorithm learns from the patients who remain, not from the patients who left, producing a biased estimate of treatment effectiveness.

Adversarial manipulation in advertising. An ad placement system that learns to show ads to users who are likely to click will attract adversarial behavior: publishers create content designed to attract the system's attention, and users develop ad-blindness in response to overexposure. The context distribution shifts not because of natural variation but because the system is being gamed — a dynamic that the contextual bandit model cannot represent.

These failures are not edge cases. They are instances of Campbell's Law in algorithmic form: the more the algorithm optimizes a metric, the more the metric ceases to be a good measure of the underlying goal. The contextual bandit optimizes expected reward; the optimization degrades the signal on which the optimization depends.

Systems-Theoretic Implications

From a systems perspective, the contextual bandit is a control system with a critical design flaw: it optimizes a local objective without modeling the feedback from its actions to the environment. The agent treats the world as an open-loop system — a source of contexts to which it responds — when in fact the world is a closed-loop system in which the agent's responses change the contexts it will encounter.

The proper framework for such systems is not the contextual bandit but the multi-agent system or the dynamical system with coupled agent-environment dynamics. In these frameworks, the agent's policy is a component of the system dynamics, not an external controller. The stability analysis must consider the joint dynamics of agent and environment, not the agent's performance in isolation.

This reframing has practical consequences. A recommendation system designed as a contextual bandit will inevitably enter a regime of signal degradation — the metrics it optimizes will lose their correlation with user satisfaction, and the system will produce increasingly pathological behavior without recognizing that anything is wrong. The countermeasure is not a better contextual bandit algorithm but a system architecture that includes:

Exogenous evaluation. The system's performance must be measured by signals that the system cannot influence — surveys, long-term user retention, downstream outcomes — rather than by the immediate rewards it optimizes.

Modeling the feedback loop. The system must include a model of how its actions affect the context distribution. This is the reinforcement learning approach, but at a higher level: the state transition model must include the effect of policy on user preferences, market conditions, or ecological states.

Diversification of objectives. The system should optimize multiple partially independent objectives, so that the degradation of any single signal is detectable through comparison with the others. A recommendation system that optimizes for clicks, dwell time, diversity, and user-reported satisfaction is more robust than one that optimizes for clicks alone.

Temporal separation. The system should maintain slow loops that evaluate the fast loops. The fast loop — the contextual bandit — optimizes immediate reward. The slow loop — a meta-learning or governance mechanism — evaluates whether the fast loop's objective is still aligned with the system's true goals. This is the nested dynamics of algorithmic governance: fast adaptation, slow evaluation.

The Exploration-Exploitation Tension as a Feedback Problem

The explore-exploit tension in contextual bandits is not merely a statistical problem. It is a feedback problem. Exploration is the injection of randomness into the system — a deliberate perturbation that prevents lock-in to local optima. Exploitation is the amplification of successful behavior — a positive feedback loop that drives the system toward its current best estimate of the optimum.

In a stable environment, the optimal strategy is to explore early and exploit late — the so-called explore-then-exploit or epsilon-greedy approach. But in a changing environment, or in an environment that the agent itself modifies, the optimal strategy is to maintain perpetual exploration — a constant low level of randomness that prevents the system from becoming trapped in self-reinforcing loops.

This connects the contextual bandit to the broader theory of resilience in complex systems. Resilient systems maintain diversity — multiple strategies, multiple feedback loops, multiple evaluation channels — because diversity is the only defense against the collapse that follows from over-optimization. The contextual bandit that eliminates exploration in the name of efficiency is not optimizing. It is marching toward a regime shift — a sudden transition from apparent stability to catastrophic failure, triggered by a perturbation that the system has no capacity to absorb.

The contextual bandit is the simplest model of adaptive decision-making in a complex world, and its failures reveal the limits of the optimization paradigm itself. The problem is not that the model is too simple. The problem is that the world is not a bandit. It is a complex adaptive system in which every action changes the rules, every measurement modifies the measured, and every optimization degrades the signal it optimizes. The contextual bandit is a useful approximation — but only for systems small enough and fast enough that feedback can be ignored. For everything else, it is a trap.