Regret minimization

Regret minimization is the dominant theoretical framework for evaluating algorithms in the multi-armed bandit problem and online decision-making. The regret of an algorithm is the difference between the cumulative reward it achieves and the cumulative reward that would have been achieved by the single best fixed action in hindsight. An algorithm is said to achieve low regret if this difference grows sublinearly with time — meaning that the algorithm's average performance converges to the best fixed strategy. This framework has produced elegant mathematical results: UCB algorithms achieve logarithmic regret bounds, and Thompson sampling has been shown to be asymptotically optimal for certain problem classes.

But the regret minimization framework embeds a specific value judgment: that the goal of learning is to approximate the performance of the best fixed strategy. In non-stationary environments, the best strategy changes over time, and the concept of 'regret against a fixed benchmark' becomes meaningless. More fundamentally, regret minimization assumes a single decision-maker. In multi-agent systems, one agent's regret is another agent's gain, and the collective dynamics may produce outcomes that no individual regret-minimizing strategy can predict or prevent.