Contextual bandits

Contextual bandits extend the multi-armed bandit problem by letting the decision-maker observe side information, the context, before choosing an action. In the standard bandit setting, each arm has a single reward distribution that is the same in every round. The contextual bandit model instead assumes that the expected reward of each arm depends on a context vector encoding user features, environmental state, or historical information. This makes the model applicable to personalized recommendation, clinical trials with patient covariates, and dynamic pricing. The algorithm must learn not just which arms are good, but which arms are good in which contexts. That problem is substantially harder: an exact context may rarely recur, so learning requires stronger assumptions about how rewards generalize across contexts, for example that expected reward is linear in the context vector.
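
The observe-act-learn loop described above can be sketched in Python. This is a minimal illustration under stated assumptions, not a reference implementation: it assumes expected reward is linear in the context, explores with epsilon-greedy, and fits one weight vector per arm by stochastic gradient steps. The names EpsilonGreedyLinear, simulate, and true_theta, along with all parameter values, are hypothetical.

```python
import random


class EpsilonGreedyLinear:
    """Epsilon-greedy contextual bandit with one linear reward model per arm."""

    def __init__(self, n_arms, dim, epsilon=0.1, lr=0.1, seed=0):
        self.epsilon = epsilon  # probability of picking a uniformly random arm
        self.lr = lr            # step size for the per-arm gradient update
        self.rng = random.Random(seed)
        self.weights = [[0.0] * dim for _ in range(n_arms)]

    def predict(self, arm, x):
        # Estimated expected reward of `arm` in context `x`.
        return sum(w * xi for w, xi in zip(self.weights[arm], x))

    def select(self, x, explore=True):
        if explore and self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.weights))
        return max(range(len(self.weights)), key=lambda a: self.predict(a, x))

    def update(self, arm, x, reward):
        # Bandit feedback: only the chosen arm's reward is observed.
        error = reward - self.predict(arm, x)
        self.weights[arm] = [w + self.lr * error * xi
                             for w, xi in zip(self.weights[arm], x)]


def simulate(n_rounds=3000, seed=1):
    env = random.Random(seed)
    # Assumed ground truth: arm 0 pays x[0] on average, arm 1 pays -x[0].
    true_theta = [[1.0, 0.0], [-1.0, 0.0]]
    agent = EpsilonGreedyLinear(n_arms=2, dim=2)
    total_regret = 0.0
    for _ in range(n_rounds):
        # Context is drawn independently of all past actions.
        x = [env.uniform(-1, 1), env.uniform(-1, 1)]
        arm = agent.select(x)
        means = [sum(t * xi for t, xi in zip(theta, x)) for theta in true_theta]
        reward = means[arm] + env.gauss(0.0, 0.1)
        agent.update(arm, x, reward)
        total_regret += max(means) - means[arm]
    return agent, total_regret / n_rounds
```

Calling simulate() returns the trained agent and its average per-round regret. Under these assumptions, greedy selection (explore=False) should favor arm 0 when x[0] is positive and arm 1 when it is negative, which is the "which arms are good in which contexts" behavior a context-free bandit cannot express.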

The contextual bandit model occupies a middle ground between the simplicity of multi-armed bandits and the full complexity of reinforcement learning with state transitions: an action affects the immediate reward but not the contexts that arrive later. It is the workhorse model of online decision-making in industry, where news recommendation, ad placement, and medical treatment assignment are routinely formulated as contextual bandit problems. Yet the model's practical success masks a theoretical fragility. The assumption that contexts are exogenous, meaning that the context distribution does not depend on past actions, fails in any system where the agent's choices shape the environment it encounters. In personalized recommendation, for example, showing a user certain content changes their preferences, which changes the context distribution. The contextual bandit formulation assumes a passive world; it cannot model an active one.
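
The exogeneity assumption can be made concrete with a toy simulation. The sketch below is illustrative only: the latent preference variable, the drift rule, and the function names are hypothetical modeling choices, not a description of any real recommender. With drift set to zero the contexts are exogenous and every policy sees the same context distribution; with positive drift, each action pulls the user's preference toward the content shown, so the contexts depend on the policy.

```python
import random


def mean_context(policy, drift, n_rounds=500, seed=0):
    """Average observed context under `policy`; drift=0 gives exogenous contexts."""
    rng = random.Random(seed)
    preference = 0.0  # hypothetical latent user preference in [-1, 1]
    total = 0.0
    for _ in range(n_rounds):
        context = preference + rng.gauss(0.0, 0.1)  # noisy view of the preference
        total += context
        arm = policy(context)
        # Assumed feedback loop: showing arm 0 pulls the preference toward +1,
        # showing arm 1 pulls it toward -1. With drift=0 nothing changes.
        target = 1.0 if arm == 0 else -1.0
        preference += drift * (target - preference)
    return total / n_rounds


def always_arm_0(context):
    return 0


def always_arm_1(context):
    return 1


# Same policies, two environments: one passive, one shaped by the actions.
exogenous = [mean_context(p, drift=0.0) for p in (always_arm_0, always_arm_1)]
endogenous = [mean_context(p, drift=0.05) for p in (always_arm_0, always_arm_1)]
```

In the exogenous case the two policies observe identical average contexts; in the endogenous case the averages move toward opposite ends of the preference range. A contextual bandit learner treats both environments the same way, which is exactly the limitation described above.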