Thompson sampling

Thompson sampling, also known as probability matching, is a heuristic algorithm for solving the multi-armed bandit problem that maintains a posterior distribution over each arm's reward probability and samples from these distributions to select actions. Unlike frequentist approaches such as upper confidence bound (UCB) algorithms, which optimize for worst-case regret bounds, Thompson sampling is Bayesian: it acts according to the probability that each arm is optimal given the evidence observed so far. This makes it naturally adaptive to complex prior structures and amenable to incorporation of domain knowledge. In practice, Thompson sampling often outperforms theoretically optimal algorithms because its probabilistic exploration is better calibrated to the true uncertainty structure of the environment.

Thompson sampling's elegance conceals a deeper limitation: it assumes a well-specified prior and a stationary likelihood. In non-stationary environments or when the model class is misspecified, its Bayesian coherence becomes a liability — it updates confidently in the wrong direction. The algorithm is not wrong; the assumptions behind it are.