Exploration-Exploitation Dilemma
The exploration-exploitation dilemma is the fundamental tension, in reinforcement learning and multi-armed bandit problems, between exploiting actions known to be good (maximizing reward given current knowledge) and exploring uncertain actions that may yield higher reward in the long run. A purely exploitative agent converges on the first locally good policy it finds and misses globally better options; a purely exploratory agent never commits to what it has learned.

The right balance depends on the time horizon and the structure of the reward process: in finite-horizon problems with stationary rewards, exploration should decay over time, while in non-stationary environments some exploration must continue indefinitely, because previously gathered knowledge goes stale. For stochastic bandits, UCB algorithms achieve order-optimal regret in the frequentist sense (logarithmic in the horizon, matching the Lai-Robbins lower bound asymptotically), and Thompson sampling attains comparable guarantees from a Bayesian perspective; neither solves every instance exactly optimally. In full reinforcement learning the problem is harder still: computing a Bayes-optimal exploration policy is computationally intractable in general, and in adversarial or rapidly changing environments the strong regret guarantees of the stationary setting may not be achievable.
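The exploration bonus at the heart of UCB can be made concrete with a short sketch. The following is a minimal UCB1 implementation for Bernoulli bandit arms (the arm probabilities 0.3 and 0.6 and the function name `ucb1` are illustrative choices, not from the text above): each arm's score is its empirical mean plus a bonus that shrinks as the arm is pulled more often, so uncertain arms get tried early and the best arm dominates later.

```python
import math
import random


def ucb1(reward_fns, horizon, seed=0):
    """Run UCB1 on a multi-armed bandit.

    reward_fns: list of zero-argument callables returning rewards in [0, 1].
    horizon: total number of pulls.
    Returns per-arm pull counts and empirical mean rewards.
    """
    random.seed(seed)
    k = len(reward_fns)
    counts = [0] * k      # times each arm has been pulled
    means = [0.0] * k     # empirical mean reward per arm
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1   # initialization: pull each arm once
        else:
            # exploitation term (empirical mean) plus an exploration
            # bonus that grows with time but shrinks with pull count
            arm = max(
                range(k),
                key=lambda a: means[a] + math.sqrt(2 * math.log(t) / counts[a]),
            )
        r = reward_fns[arm]()
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]  # incremental mean
    return counts, means


# Hypothetical example: two Bernoulli arms with success rates 0.3 and 0.6.
arms = [
    lambda: float(random.random() < 0.3),
    lambda: float(random.random() < 0.6),
]
counts, means = ucb1(arms, horizon=5000)
```

After a few thousand pulls the better arm accumulates the large majority of pulls, while the worse arm is still sampled logarithmically often rather than abandoned outright; that residual sampling is exactly the hedge against early bad luck.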