Temporal Difference Learning

Temporal difference (TD) learning is a reinforcement learning method that learns predictions of future reward by bootstrapping: it updates each prediction toward later predictions rather than waiting for the actual outcome. It is also the standard computational account of the reward prediction error signal that dopaminergic neurons appear to encode.

Unlike Monte Carlo methods, which must wait for an episode to finish before updating value estimates, TD updates its predictions after every step. This makes it both more efficient and more psychologically plausible: animals and humans learn from immediate feedback, not just from final outcomes. The core idea is simple but profound: treat the mismatch between consecutive predictions, after accounting for any reward received in between, as the prediction error, and update the earlier prediction to reduce that mismatch.
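
The step-by-step update described above can be made concrete with a minimal sketch of tabular TD(0), the simplest TD method. Everything specific here is an assumption chosen for illustration and does not come from the text: the five-state random walk environment, the function names td0_update and run_random_walk, the step size alpha, the discount factor gamma, and the initial value of 0.5. The TD error is computed as the reward plus the discounted next prediction, minus the current prediction.

```python
import random


def td0_update(values, state, reward, next_state, alpha, gamma):
    """Apply one tabular TD(0) update in place and return the TD error."""
    # TD error: reward plus discounted next prediction, minus current prediction.
    td_error = reward + gamma * values[next_state] - values[state]
    # Move the earlier prediction a fraction alpha toward the bootstrapped target.
    values[state] += alpha * td_error
    return td_error


def run_random_walk(num_episodes=5000, alpha=0.02, gamma=1.0, seed=0):
    """Estimate state values on a hypothetical five-state random walk."""
    rng = random.Random(seed)
    n_states = 5
    left_terminal, right_terminal = 0, n_states + 1
    # Terminal values stay at 0; non-terminal states start at an arbitrary 0.5.
    values = [0.0] + [0.5] * n_states + [0.0]
    for _ in range(num_episodes):
        state = 3  # every episode starts in the middle state
        while state not in (left_terminal, right_terminal):
            next_state = state + rng.choice((-1, 1))
            # Reward 1 only for stepping off the right end, 0 otherwise.
            reward = 1.0 if next_state == right_terminal else 0.0
            # The update happens immediately, before the episode ends.
            td0_update(values, state, reward, next_state, alpha, gamma)
            state = next_state
    return values[1:-1]


if __name__ == "__main__":
    estimates = run_random_walk()
    print([round(v, 2) for v in estimates])
```

In this toy setting, with gamma equal to 1, the true value of each state is the probability of eventually exiting on the right, that is 1/6, 2/6, 3/6, 4/6 and 5/6 from left to right. The estimates should settle near those numbers, with some residual noise because the step size is held constant. A Monte Carlo version of the same experiment would have to store each episode and update only after it terminates.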

TD learning is not merely an algorithm. It is also a theory of how expectation itself is constructed and revised, one that treats learning as the continuous refinement of a simulation of the future.