SARSA

SARSA (State-Action-Reward-State-Action) is an on-policy reinforcement learning algorithm that updates its action-value estimates using the action the agent's current policy actually takes next, including exploratory moves. Unlike Q-learning, which learns about the greedy (optimal) policy while following a different, exploratory one, SARSA learns the value of the policy it is actually following. This makes it more conservative: SARSA will not learn to take risks that pay off only if every future action is optimal, because its estimates account for the possibility of future exploratory mistakes. Introduced by Rummery and Niranjan in 1994, SARSA is a temporal difference method that bootstraps from its own predictions. In tabular settings it provably converges to the optimal action values under standard conditions (every state-action pair is visited infinitely often, the policy becomes greedy in the limit, and the step sizes decay appropriately). While it is still exploring, it often achieves better online performance than Q-learning in environments where exploratory actions carry severe penalties. That property has direct implications for safety-critical systems, where assuming optimal future behavior is a luxury the agent cannot afford.
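
The on-policy/off-policy distinction described above can be made concrete with a minimal tabular sketch in Python. It is an illustration, not a reference implementation: the function names (sarsa_update, q_learning_target), the dictionary-based Q-table, the state and action labels, and the values of alpha and gamma are all assumptions chosen for this example.

```python
from collections import defaultdict


def sarsa_update(q, s, a, r, s_next, a_next, alpha, gamma, done=False):
    """Apply one tabular SARSA update and return the new Q(s, a).

    Q(s, a) <- Q(s, a) + alpha * [r + gamma * Q(s', a') - Q(s, a)]

    a_next is the action the behaviour policy actually selected in s_next,
    so an exploratory (possibly bad) choice feeds straight into the target.
    """
    target = r if done else r + gamma * q[(s_next, a_next)]
    q[(s, a)] += alpha * (target - q[(s, a)])
    return q[(s, a)]


def q_learning_target(q, r, s_next, actions, gamma, done=False):
    """Q-learning target, shown for contrast: it bootstraps from the best
    next action, whatever the agent actually does next."""
    if done:
        return r
    return r + gamma * max(q[(s_next, b)] for b in actions)


# Hypothetical situation: in state "s1" one action is safe, one is costly.
q = defaultdict(float)
q[("s1", "safe")] = 1.0
q[("s1", "risky")] = -10.0

# Suppose exploration picked "risky" in s1. SARSA's target reflects that
# choice, while the Q-learning target assumes the greedy action "safe".
off_policy_target = q_learning_target(q, 0.0, "s1", ["safe", "risky"], gamma=0.9)
new_value = sarsa_update(q, "s0", "go", 0.0, "s1", "risky", alpha=0.5, gamma=0.9)
print(off_policy_target, new_value)
```

In this made-up case the SARSA update lowers the value of moving toward "s1" because the exploratory slip was actually taken, whereas the Q-learning target stays positive. This is the mechanism behind the more conservative behaviour the paragraph describes.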