
    Actor-Critic

    Also known as:
    Actor-Critic Methods
    A2C
    A3C
    Advantage Actor-Critic
    Updated: 2/10/2026

    A reinforcement learning architecture with two components: an actor (the policy) selects actions, and a critic (a value function) evaluates them. It combines the strengths of policy gradient and value-based methods.

    Quick Summary

    Actor-Critic combines policy optimization (the actor) with value estimation (the critic). It is more stable than pure policy gradient methods and forms the basis of PPO and modern RLHF.

    Explanation

    The actor learns the policy; the critic estimates the advantage, i.e., how much better a given action is than the average action in that state. Weighting policy updates by this learned estimate, rather than by raw sampled returns, significantly reduces the variance of pure policy gradient methods. A minimal update step is sketched below.
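
    A minimal sketch of one such update step, assuming PyTorch and a discrete action space; the network sizes, loss weighting, and function names are illustrative, not taken from the source:

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Shared trunk with two heads: the actor outputs action logits
    (the policy); the critic outputs a scalar state value V(s)."""
    def __init__(self, obs_dim: int, n_actions: int):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh())
        self.actor = nn.Linear(64, n_actions)
        self.critic = nn.Linear(64, 1)

    def forward(self, obs):
        h = self.trunk(obs)
        return self.actor(h), self.critic(h).squeeze(-1)

def update(model, optimizer, obs, action, reward, next_obs, done, gamma=0.99):
    logits, value = model(obs)
    with torch.no_grad():
        _, next_value = model(next_obs)
        # One-step TD target: bootstrap from the critic's estimate of V(s').
        target = reward + gamma * next_value * (1.0 - done)
    # Advantage: how much better the taken action was than the critic expected.
    advantage = target - value.detach()
    log_prob = torch.distributions.Categorical(logits=logits).log_prob(action)
    actor_loss = -(log_prob * advantage).mean()   # policy gradient, advantage-weighted
    critic_loss = (target - value).pow(2).mean()  # regress V(s) toward the TD target
    optimizer.zero_grad()
    (actor_loss + 0.5 * critic_loss).backward()
    optimizer.step()
```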

    Marketing Relevance

    Actor-Critic is the basis of PPO and thus, indirectly, of RLHF; understanding it helps explain how the alignment stage of modern LLM training works.

    Common Pitfalls

    Common failure modes include instability when the actor and critic learn at different rates, bias from an inaccurately estimated critic, and high sensitivity to hyperparameters. One common mitigation for the first is sketched below.
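
    A sketch of the separate-optimizer mitigation, assuming PyTorch; the layer sizes and learning rates are illustrative stand-ins:

```python
import torch
import torch.nn as nn

# Illustrative stand-ins for separate actor and critic networks.
actor = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))
critic = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 1))

# Separate optimizers let each component's learning rate be tuned
# independently; a slower actor is a common starting point, since the
# policy gradient is only as reliable as the critic's value estimates.
actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
```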

    Origin & History

    Konda & Tsitsiklis (1999) formalized Actor-Critic. A3C (Mnih et al., 2016) made it scalable through asynchronous parallel workers. PPO (Schulman et al., 2017) is the most popular actor-critic variant, and SAC (Haarnoja et al., 2018) adapted the approach for continuous control.

    Comparisons & Differences

    Actor-Critic vs. Pure Policy Gradient

    Pure policy gradient (e.g., REINFORCE) has high variance because it weights updates by full Monte Carlo returns; Actor-Critic reduces this variance by using the critic as a learned baseline, as shown below.
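
    In standard policy-gradient notation (an added illustration, not from the source), with policy parameters theta and critic parameters phi:

```latex
% REINFORCE weights each log-probability gradient by the full
% Monte Carlo return G_t, which makes the estimator high-variance:
\nabla_\theta J(\theta) = \mathbb{E}\big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\big]

% Actor-Critic replaces G_t with an advantage estimated from the
% critic's value function V_\phi, a learned, lower-variance baseline:
A(s_t, a_t) \approx r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)
\nabla_\theta J(\theta) \approx \mathbb{E}\big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A(s_t, a_t)\big]
```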

    Actor-Critic vs. Q-Learning (DQN)

    DQN learns only a value function and picks actions by maximizing over it; Actor-Critic learns an explicit policy, which makes it better suited to continuous action spaces, where that maximization is intractable. A sketch follows below.
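
    To make the continuous-action point concrete, here is a sketch of a Gaussian policy head (assuming PyTorch; names and sizes are illustrative). Sampling an action is a single forward pass, whereas a DQN-style argmax over Q-values has no tractable equivalent when actions are real-valued:

```python
import torch
import torch.nn as nn

class GaussianActor(nn.Module):
    """Policy for a continuous action space: outputs the mean and
    log-std of a Gaussian over actions, so actions are sampled
    directly rather than found by maximizing over Q-values."""
    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        self.mean = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim)
        )
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs):
        return torch.distributions.Normal(self.mean(obs), self.log_std.exp())

# Acting is one forward pass plus one sample; no argmax over actions needed.
actor = GaussianActor(obs_dim=8, act_dim=2)
dist = actor(torch.randn(1, 8))
action = dist.sample()                    # shape (1, 2), real-valued
log_prob = dist.log_prob(action).sum(-1)  # used in the policy gradient
```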

