SARSA (State-Action-Reward-State-Action)
SARSA is an on-policy RL algorithm that updates Q-values based on the action actually taken – unlike Q-Learning's off-policy maximum.
SARSA learns Q-values on-policy: it accounts for the agent's actual exploration behavior and therefore tends to learn safer policies than off-policy Q-Learning.
Explanation
Update rule: Q(s,a) ← Q(s,a) + α[r + γQ(s',a') - Q(s,a)], where a' is the action actually selected in the next state s' (not the greedy maximum). The name comes from the quintuple (S, A, R, S', A') used in each update.
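A minimal sketch of tabular SARSA in Python, assuming an ε-greedy behavior policy and a generic environment with reset()/step() methods that return (next_state, reward, done); the hyperparameter values and the env interface are illustrative assumptions, not prescribed by the text.

```python
import random
from collections import defaultdict

# Illustrative hyperparameters (assumed values, not prescribed by the text)
ALPHA, GAMMA, EPSILON = 0.1, 0.99, 0.1

def epsilon_greedy(Q, state, actions, epsilon=EPSILON):
    """Behavior policy: explore with probability epsilon, otherwise act greedily."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def sarsa_episode(env, Q, actions, alpha=ALPHA, gamma=GAMMA):
    """One episode of tabular SARSA:
    Q(s,a) <- Q(s,a) + alpha * [r + gamma * Q(s',a') - Q(s,a)],
    where a' is the action actually chosen in s' (not the maximum)."""
    state = env.reset()
    action = epsilon_greedy(Q, state, actions)
    done = False
    while not done:
        next_state, reward, done = env.step(action)           # assumed env interface
        next_action = epsilon_greedy(Q, next_state, actions)  # a' from the same policy
        target = reward if done else reward + gamma * Q[(next_state, next_action)]
        Q[(state, action)] += alpha * (target - Q[(state, action)])
        state, action = next_state, next_action
    return Q

# Usage sketch: Q = defaultdict(float); sarsa_episode(my_env, Q, actions=[0, 1, 2, 3])
```

Note that both the current action a and the next action a' come from the same ε-greedy policy; this is what makes the algorithm on-policy.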
Marketing Relevance
In risky environments SARSA tends to learn safer policies than Q-Learning because its updates account for the agent's actual behavior, including exploratory actions; in the classic cliff-walking example it learns a route that keeps a safe distance from the cliff edge.
Common Pitfalls
With a fixed exploration rate, SARSA converges to the Q-values of the policy it follows, not of the optimal policy. Can be overly conservative. The choice of exploration policy directly shapes the learned Q-values.
Origin & History
Rummery & Niranjan (1994) introduced the algorithm under the name "Modified Connectionist Q-Learning"; Sutton (1996) coined the name SARSA. Today it is used primarily as teaching material and as a baseline.
Comparisons & Differences
SARSA (State-Action-Reward-State-Action) vs. Q-Learning
Q-Learning bootstraps from max_a' Q(s',a'), the greedy target, regardless of how the agent actually behaves (off-policy, more optimistic); SARSA bootstraps from Q(s',a') for the action actually taken (on-policy, more conservative under exploration).
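To make the difference concrete, a sketch of the two bootstrap targets side by side, assuming the same tabular Q dictionary and action list as the SARSA sketch above; function names and the default gamma are illustrative.

```python
def q_learning_target(Q, reward, next_state, actions, gamma=0.99):
    # Off-policy: bootstrap from the greedy (maximum-value) action in s'
    return reward + gamma * max(Q[(next_state, a)] for a in actions)

def sarsa_target(Q, reward, next_state, next_action, gamma=0.99):
    # On-policy: bootstrap from the action a' the behavior policy actually selected in s'
    return reward + gamma * Q[(next_state, next_action)]
```

The only difference is the term used for the next state: the maximum over all actions versus the value of the action the policy actually chose.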