Q-learning

Q-learning is a model-free reinforcement learning algorithm that learns, for each state-action pair, the expected cumulative discounted reward of taking that action in that state and behaving optimally thereafter. Introduced by Chris Watkins in 1989, it is an off-policy temporal-difference method: it learns about the optimal (greedy) policy while the agent may gather experience with a different, exploratory policy. The algorithm maintains a table (or function approximator) of Q-values and updates them toward a target derived from the Bellman optimality equation, bootstrapping from its own current estimates. In the tabular setting Q-learning provably converges to the optimal Q-values, provided every state-action pair is visited infinitely often and the learning rate decays appropriately. Combined with neural network function approximation it is notoriously unstable, a limitation that DQN partially addressed through experience replay and target networks. The algorithm's simplicity conceals a deeper tension: by learning to maximize expected reward, Q-learning assumes that the reward function is a faithful proxy for the true objective. That assumption fails precisely when the reward function is misaligned with the designer's intent, producing reward hacking and other pathologies.
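
To make the update concrete, the standard tabular rule can be written as Q(s, a) <- Q(s, a) + alpha * [r + gamma * max_a' Q(s', a') - Q(s, a)], where alpha is the learning rate and gamma the discount factor. The following is a minimal sketch of that rule, not a reference implementation. The five-state chain environment, the `step` helper, and all hyperparameter values are assumptions chosen purely for illustration; none of them comes from the article.

```python
import random

N_STATES = 5      # hypothetical chain: states 0..4, state 4 is terminal
ACTIONS = (0, 1)  # 0 = move left, 1 = move right


def step(state, action):
    """Toy deterministic environment (an illustrative assumption)."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0  # reward only on reaching the last state
    return next_state, reward, done


def q_learning(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning with an epsilon-greedy behavior policy."""
    rng = random.Random(seed)
    q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Behavior policy: explore with probability epsilon,
            # otherwise act greedily, breaking ties at random.
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                best = max(q[state])
                action = rng.choice([a for a in ACTIONS if q[state][a] == best])
            next_state, reward, done = step(state, action)
            # Off-policy target: bootstrap from the greedy value of the
            # next state, regardless of which action is actually taken next.
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


q_table = q_learning()
# Greedy policy read off the learned table for the non-terminal states.
greedy_policy = [max(ACTIONS, key=lambda a: q_table[s][a])
                 for s in range(N_STATES - 1)]
```

On this toy chain the greedy policy read from `q_table` should choose "right" in every non-terminal state. The `max` in the target is what makes the method off-policy: the agent behaves epsilon-greedily, but the values it learns are those of the greedy policy.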