GRPO (Group Relative Policy Optimization)
GRPO is an RL alignment method that optimizes LLMs without a learned critic (value network): for each question, a group of responses is sampled and scored relative to each other. It is the technique behind DeepSeek-R1's reasoning breakthrough.
Explanation
For each question, the model samples a group of responses. Each response's reward is normalized against the group's mean and standard deviation (hence "Group Relative"), and the policy is optimized directly on these relative advantages – simpler than PPO, since no critic/value network is needed.
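A minimal sketch of the group-relative advantage step, assuming PyTorch; the helper name grpo_advantages and the toy rewards are illustrative, not from the DeepSeekMath paper:

    import torch

    def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
        # Group-relative advantage: normalize each response's reward by the
        # mean and std of its own group, so no learned critic is needed
        # as a baseline.
        return (rewards - rewards.mean()) / (rewards.std() + eps)

    # Example: 4 responses sampled for one question, scored 0/1 by a verifier.
    rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])
    advantages = grpo_advantages(rewards)  # tensor([ 0.87, -0.87, -0.87,  0.87])

Each response's token log-probabilities are then reinforced in proportion to its advantage: responses better than their group's average are pushed up, worse ones are pushed down.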
Marketing Relevance
GRPO powered DeepSeek-R1 and showed (via the DeepSeek-R1-Zero experiment) that reasoning abilities can emerge through pure RL, without SFT.
Common Pitfalls
Needs good verifier/reward signals. High compute for group sampling. Can lead to mode collapse without diversity constraints.
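To make the first pitfall concrete, here is a sketch of the kind of rule-based verifier reward used for math-style tasks; the function name and the matching rule are illustrative assumptions, not DeepSeek's actual verifier:

    import re

    def verifier_reward(response: str, reference_answer: str) -> float:
        # Illustrative rule-based verifier: reward 1.0 if the last number in
        # the response matches the reference answer, else 0.0. A noisy or
        # gameable check here directly corrupts the group-relative advantages.
        numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
        return 1.0 if numbers and numbers[-1] == reference_answer else 0.0

    print(verifier_reward("The area is therefore 42.", "42"))  # 1.0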
Origin & History
DeepSeek introduced GRPO in the DeepSeekMath paper (2024). It became widely known through DeepSeek-R1 (January 2025), where GRPO enabled reasoning training without SFT data (DeepSeek-R1-Zero).
Comparisons & Differences
GRPO (Group Relative Policy Optimization) vs. PPO
PPO needs a learned value network (critic) to estimate advantages; GRPO replaces it with group-based reward normalization. A reward signal (a reward model or a rule-based verifier) is still required in both cases.
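A sketch of the update step GRPO shares with PPO, again assuming PyTorch: it keeps PPO's clipped probability ratio but feeds in group-normalized advantages instead of critic estimates. GRPO's KL penalty against a reference policy is omitted here for brevity, and the function name is illustrative:

    import torch

    def grpo_policy_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
                         advantages: torch.Tensor, clip_eps: float = 0.2) -> torch.Tensor:
        # logp_new / logp_old: per-response log-probs under the current and
        # the sampling policy; advantages: group-normalized, no critic involved.
        ratio = torch.exp(logp_new - logp_old)
        unclipped = ratio * advantages
        clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
        return -torch.min(unclipped, clipped).mean()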
GRPO (Group Relative Policy Optimization) vs. DPO
DPO is trained offline on prepared preference pairs; GRPO generates its comparisons on-the-fly from group sampling during training.