Actor-Critic
A reinforcement learning architecture with two components: an actor (the policy) selects actions and a critic (a value function) evaluates them, combining the strengths of policy gradient and value-based methods.
Actor-Critic combines policy optimization (the actor) with value estimation (the critic). It is more stable than pure policy gradient methods and forms the basis of PPO and modern RLHF.
Explanation
The actor learns the policy while the critic estimates the advantage, i.e. how much better a given action is than the average action in that state; in practice the advantage is often approximated by the TD error r + γV(s') − V(s). Using this learned baseline significantly reduces the variance of pure policy gradient methods.
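A minimal sketch of one Actor-Critic update step, assuming PyTorch and a small discrete-action task; the names, network sizes, and learning rates are illustrative choices, not taken from the source:

```python
import torch
import torch.nn as nn

# Hypothetical setup: 4-dimensional state, 2 discrete actions (CartPole-like).
actor = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))   # outputs action logits
critic = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 1))  # outputs state value V(s)

actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def update(state, action, reward, next_state, done, gamma=0.99):
    state = torch.as_tensor(state, dtype=torch.float32)
    next_state = torch.as_tensor(next_state, dtype=torch.float32)

    # Critic: TD target r + gamma * V(s'); the TD error serves as the advantage estimate.
    value = critic(state).squeeze(-1)
    with torch.no_grad():
        next_value = critic(next_state).squeeze(-1) * (1.0 - float(done))
        td_target = reward + gamma * next_value
    advantage = (td_target - value).detach()

    critic_loss = (td_target - value).pow(2)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: policy gradient step weighted by the advantage (the critic acts as baseline).
    log_prob = torch.log_softmax(actor(state), dim=-1)[action]
    actor_loss = -log_prob * advantage
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```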
Marketing Relevance
Actor-Critic is the basis of PPO and thus, indirectly, of RLHF; understanding it helps explain how LLMs are fine-tuned with human feedback.
Common Pitfalls
Training becomes unstable when the actor and critic learn at mismatched rates. An inaccurately estimated critic biases the policy gradient. The method is also sensitive to hyperparameters such as the learning rates and the discount factor; common mitigations are sketched below.
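A small sketch of typical mitigations, assuming PyTorch; the learning rates, clip value, and names are illustrative assumptions rather than prescriptions: decouple the two learning rates and clip gradient norms so a noisy critic cannot destabilize the actor.

```python
import torch
import torch.nn as nn

actor = nn.Linear(4, 2)
critic = nn.Linear(4, 1)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)    # slower actor
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)  # faster critic keeps value estimates current

def clipped_step(loss, model, opt, max_norm=0.5):
    # Clip the gradient norm before each optimizer step to damp unstable updates.
    opt.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    opt.step()
```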
Origin & History
Konda & Tsitsiklis (1999) formalized Actor-Critic. A3C (Mnih et al., 2016) made it scale to deep networks with parallel workers. PPO (Schulman et al., 2017) is the most popular actor-critic variant. SAC (Haarnoja et al., 2018) extended it to continuous control.
Comparisons & Differences
Actor-Critic vs. Pure Policy Gradient
Pure policy gradient (e.g. REINFORCE) weights log-probabilities with full Monte Carlo returns, which yields high-variance gradient estimates; Actor-Critic replaces the return with a critic-based advantage, trading a little bias for much lower variance (see the estimators sketched below).
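As an illustration in standard textbook notation (not drawn from the source), the two gradient estimators differ only in the weighting term:

$$
\nabla_\theta J(\theta) = \mathbb{E}\Big[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\Big],
\qquad G_t = \sum_{k \ge t} \gamma^{k-t} r_k \quad \text{(REINFORCE, Monte Carlo return)}
$$

$$
\nabla_\theta J(\theta) \approx \mathbb{E}\Big[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\big(r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)\big)\Big] \quad \text{(Actor-Critic, TD-error advantage)}
$$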
Actor-Critic vs. Q-Learning (DQN)
DQN learns only an action-value function and selects actions via an argmax over Q-values, which does not scale to continuous action spaces; Actor-Critic learns an explicit policy and can output continuous actions directly (a sketch follows below).
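A hedged sketch of why the explicit policy matters for continuous actions; the class name and dimensions are illustrative assumptions. The actor parameterizes a Gaussian over real-valued actions and samples from it, whereas DQN would need an argmax over an infinite action set.

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    # Illustrative continuous-action actor: outputs mean and (learned) std of a Gaussian.
    def __init__(self, state_dim=3, action_dim=1):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh())
        self.mean = nn.Linear(64, action_dim)
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, state):
        h = self.net(state)
        dist = torch.distributions.Normal(self.mean(h), self.log_std.exp())
        action = dist.sample()                      # a real-valued action, no argmax over Q-values
        return action, dist.log_prob(action).sum(-1)

state = torch.randn(3)
action, log_prob = GaussianPolicy()(state)          # e.g. tensor([0.37]) and its log-probability
```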