Group Relative Policy Optimization (GRPO) is a reinforcement learning method introduced in the DeepSeekMath paper [4]. It builds upon the Proximal Policy Optimization (PPO) framework, aiming to improve mathematical reasoning capabilities while reducing memory consumption [4]. GRPO eliminates the need for a value function model and instead uses group scores to estimate the baseline, simplifying the training process and resource requirements [1].
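To make the group-score baseline concrete, here is a minimal sketch of how advantages can be computed from rewards alone, with no value model. The function name `group_relative_advantages` and the tensor layout (one row of sampled completions per prompt) are assumptions for illustration, not code from the paper.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Estimate advantages from group scores alone, with no value model.

    `rewards` has shape (num_prompts, group_size): one scalar reward per
    sampled completion. Each completion's advantage is its reward
    normalized by the mean and standard deviation of its own group.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 sampled completions each.
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.5, 0.2, 0.9, 0.1]])
print(group_relative_advantages(rewards))
```

The group mean plays the role the learned value baseline plays in PPO: completions that score above their group's average get a positive advantage, those below get a negative one.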
Compared with PPO, GRPO drops the separate value (critic) model, reducing memory use and computational complexity [3]. In place of learned value estimates, the baseline is computed from the group scores of multiple completions sampled for the same prompt, which simplifies the training process and its resource requirements. Additionally, GRPO integrates the KL divergence term directly into the loss function rather than folding it into the reward, stabilizing training and improving performance.
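The sketch below illustrates what "KL in the loss" can look like in practice: a clipped PPO-style surrogate plus a per-token KL penalty against a frozen reference policy. The function name `grpo_loss`, the tensor shapes, and the hyperparameter values are assumptions for illustration, not the paper's exact implementation.

```python
import torch

def grpo_loss(logp, logp_old, logp_ref, advantages, clip_eps=0.2, kl_coef=0.04):
    """Clipped surrogate objective with a KL penalty added directly to the loss.

    logp, logp_old, logp_ref: per-token log-probs of the sampled tokens under
    the current, old (sampling), and frozen reference policies, shape (batch, seq_len).
    advantages: one group-relative advantage per sequence, shape (batch,).
    """
    adv = advantages.unsqueeze(-1)  # broadcast the sequence-level advantage over tokens
    ratio = torch.exp(logp - logp_old)
    surrogate = torch.min(ratio * adv,
                          torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv)
    # Per-token KL estimate against the reference policy, penalized in the loss
    # itself rather than subtracted from the reward.
    kl = torch.exp(logp_ref - logp) - (logp_ref - logp) - 1.0
    return -(surrogate - kl_coef * kl).mean()
```

Because the KL term is a differentiable part of the loss, the regularization toward the reference policy acts on every update step instead of being mixed into the (noisier) reward signal.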
GRPO estimates the baseline from group scores rather than from a learned value function, simplifying the training process and resource requirements [4]. This differs from traditional PPO, which relies on a value function model to compute advantages and update the policy. By forgoing the value function model and normalizing rewards within each group of sampled completions, GRPO reduces complexity and memory footprint, making training more efficient and scalable.
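For contrast, here is a rough sketch of the advantage computation PPO typically uses, Generalized Advantage Estimation. The single-trajectory shapes and default coefficients are assumptions for illustration; the point is that it cannot run without the per-token `values` produced by a trained critic, which is exactly the model GRPO removes.

```python
import torch

def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """PPO-style Generalized Advantage Estimation for one trajectory.

    `values` are per-step estimates from a separate value (critic) model,
    the component GRPO dispenses with. rewards, values: shape (seq_len,).
    """
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        next_value = values[t + 1] if t + 1 < T else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        last = delta + gamma * lam * last
        advantages[t] = last
    return advantages
```

Maintaining and updating that critic roughly doubles the number of large model copies held in memory during training, which is the cost GRPO's group-score baseline avoids.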