
    GRPO (Group Relative Policy Optimization)

    Also known as:
    GRPO
    Group Relative PO
    DeepSeek GRPO
    Updated: 2/10/2026

    GRPO is an RL alignment method that works without a separate value network (critic): instead of a learned baseline, groups of responses are scored relative to each other.

    Quick Summary

    GRPO optimizes LLMs without a separate value network (critic) by comparing responses within a sampled group. It is the technique behind DeepSeek-R1's reasoning breakthrough.

    Explanation

    For each question, the model generates a group of responses. Each response's reward is normalized against the group's mean and standard deviation (hence "Group Relative"), and the policy is optimized directly. This is simpler than PPO: no critic/value network is needed.
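
    A minimal sketch in PyTorch of the two core pieces: group-relative advantage estimation and the PPO-style clipped surrogate loss. Tensor shapes, function names, and the clip threshold are illustrative assumptions; the DeepSeekMath paper states the objective per token and adds a KL term (see Common Pitfalls).

    import torch

    def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
        # rewards: (batch, group_size), one scalar reward per sampled response.
        # Normalizing within each group makes the group mean the baseline,
        # replacing the critic's value estimate.
        mean = rewards.mean(dim=-1, keepdim=True)
        std = rewards.std(dim=-1, keepdim=True)
        return (rewards - mean) / (std + 1e-8)

    def grpo_loss(logp_new, logp_old, advantages, clip_eps=0.2):
        # logp_new / logp_old: summed log-probabilities of each response under
        # the current policy and the policy that sampled the group,
        # shape (batch, group_size), matching `advantages`.
        ratio = torch.exp(logp_new - logp_old)
        unclipped = ratio * advantages
        clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
        # Clipped surrogate objective; negated because optimizers minimize.
        return -torch.min(unclipped, clipped).mean()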

    Marketing Relevance

    GRPO enabled DeepSeek-R1 and showed, via the DeepSeek-R1-Zero variant, that reasoning abilities can emerge through pure RL, without supervised fine-tuning (SFT).

    Common Pitfalls

    GRPO needs reliable verifier or reward signals. Sampling a whole group of responses per prompt is compute-intensive. Without diversity constraints such as a KL penalty against a reference model, training can collapse onto a narrow set of outputs (mode collapse); one mitigation is sketched below.
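
    A per-token KL penalty toward a frozen reference policy keeps the policy from drifting too far; the DeepSeekMath paper adds an estimator of this form directly to the GRPO loss. A sketch under the same assumptions as above (the weight beta is an illustrative hyperparameter):

    import torch

    def kl_penalty(logp_new: torch.Tensor, logp_ref: torch.Tensor) -> torch.Tensor:
        # Unbiased, non-negative per-token estimator of KL(pi_theta || pi_ref):
        # exp(log_ratio) - log_ratio - 1, where log_ratio = logp_ref - logp_new.
        log_ratio = logp_ref - logp_new
        return torch.exp(log_ratio) - log_ratio - 1.0

    # Combined objective (sketch):
    # loss = grpo_loss(logp_new, logp_old, adv) + beta * kl_penalty(logp_new, logp_ref).mean()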

    Origin & History

    DeepSeek introduced GRPO in the DeepSeekMath paper (2024). It became widely known through DeepSeek-R1 (January 2025), where GRPO enabled reasoning without SFT data in the R1-Zero variant.

    Comparisons & Differences

    GRPO (Group Relative Policy Optimization) vs. PPO

    PPO needs a separate value network (critic) to estimate advantages; GRPO replaces it with group-based normalization (illustrated below). Both still need a reward signal, whether from a reward model or rule-based verifiers.
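
    A tiny numeric illustration of the difference, with hypothetical rewards for four sampled answers: the baseline is the group mean rather than a learned value estimate.

    import torch

    rewards = torch.tensor([1.0, 0.0, 1.0, 1.0])  # e.g. three correct answers, one wrong
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # Correct answers receive a positive advantage, the wrong one a negative
    # advantage; no value network is involved.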

    GRPO (Group Relative Policy Optimization) vs. DPO

    DPO needs a prepared dataset of preference pairs; GRPO generates its comparisons on-the-fly by sampling groups during training.

