
    SARSA (State-Action-Reward-State-Action)

    Also known as:
    On-Policy TD Control
    State-Action-Reward-State-Action
    Updated: 2/10/2026

    SARSA is an on-policy reinforcement learning algorithm that updates Q-values using the action actually taken in the next state, unlike Q-Learning, which uses the off-policy maximum.

    Quick Summary

    SARSA learns Q-values on-policy: its updates account for the agent's actual exploration behavior, which tends to make it safer than off-policy Q-Learning.

    Explanation

    Update rule: Q(s,a) ← Q(s,a) + α[r + γQ(s',a') − Q(s,a)], where a' is the action actually chosen in the next state s' (not the maximizing action). The name comes from the quintuple (S, A, R, S', A') consumed by each update.
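The update rule above can be sketched as a single function; the table layout, state/action names, and hyperparameter values here are illustrative assumptions, not part of the original.

```python
from collections import defaultdict

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """One SARSA update: Q(s,a) <- Q(s,a) + alpha * [r + gamma*Q(s',a') - Q(s,a)].

    Note: a_next is the action actually selected in s' by the behavior
    policy (on-policy), not the greedy maximum over actions.
    """
    td_target = r + gamma * Q[(s_next, a_next)]
    td_error = td_target - Q[(s, a)]
    Q[(s, a)] += alpha * td_error
    return Q[(s, a)]

# Usage: Q-table as a dict mapping (state, action) -> value, default 0.
Q = defaultdict(float)
sarsa_update(Q, "s0", "a0", r=1.0, s_next="s1", a_next="a1")
```

In a full training loop, a' would come from an exploration policy such as ε-greedy, and the same (s', a') pair is then reused as the (s, a) of the next update.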

    Marketing Relevance

    SARSA is safer than Q-Learning in risky environments because it accounts for actual behavior (including exploration).

    Common Pitfalls

    Converges to the values of the policy it actually follows, not the optimal policy (unless exploration is gradually reduced). Can be overly conservative. The exploration policy directly influences the learned Q-values.

    Origin & History

    Rummery & Niranjan (1994) introduced the algorithm under the name "Modified Connectionist Q-Learning"; Sutton (1996) gave it the name SARSA. Today it is used primarily as teaching material and as a baseline.

    Comparisons & Differences

    SARSA (State-Action-Reward-State-Action) vs. Q-Learning

    Q-Learning uses max Q(s',a') (off-policy, more optimistic); SARSA uses Q(s',a') of the actual action (on-policy, more conservative/safer).
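The difference is only in the bootstrap target, which a short sketch makes concrete (the Q-table layout and names are illustrative assumptions):

```python
def sarsa_target(Q, r, s_next, a_next, gamma=0.99):
    # On-policy: bootstrap from the action actually taken in s'.
    return r + gamma * Q[(s_next, a_next)]

def q_learning_target(Q, r, s_next, actions, gamma=0.99):
    # Off-policy: bootstrap from the greedy maximum over actions in s'.
    return r + gamma * max(Q[(s_next, a)] for a in actions)

# If exploration picks "left" (value 1.0) while "right" (value 2.0)
# is greedy, SARSA's target is lower than Q-Learning's:
Q = {("s1", "left"): 1.0, ("s1", "right"): 2.0}
```

When the exploratory action is risky (e.g. near a cliff), SARSA's lower target propagates that risk into Q(s,a), which is why it learns more conservative policies.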

