A Deep Dive into the Group Relative Policy Optimization (GRPO) Method: Enhancing Mathematical Reasoning in Open Language Models
What is Group Relative Policy Optimization (GRPO)?

Group Relative Policy Optimization (GRPO) is a reinforcement learning method introduced in the DeepSeekMath paper [4]. It builds on the Proximal Policy Optimization (PPO) framework and aims to strengthen mathematical reasoning while reducing memory consumption [4]. GRPO removes the need for a separate value-function (critic) model: for each question it samples a group of outputs and uses the group's scores to estimate the baseline, which simplifies the training process and its resource requirements [1].
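For orientation, the GRPO objective can be written roughly as follows. This is a sketch following the DeepSeekMath paper's notation, where q is a question, o_1, ..., o_G are the G outputs sampled from the old policy, epsilon is the clipping range, beta weights the KL penalty toward a frozen reference policy, and Â_{i,t} is the group-relative advantage of token t in output o_i; consult the paper for the exact formulation.

$$
\mathcal{J}_{\mathrm{GRPO}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\left(\min\!\Big(r_{i,t}(\theta)\,\hat{A}_{i,t},\ \operatorname{clip}\big(r_{i,t}(\theta),\,1-\varepsilon,\,1+\varepsilon\big)\,\hat{A}_{i,t}\Big)-\beta\,\mathbb{D}_{\mathrm{KL}}\big[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big]\right)\right],
\qquad
r_{i,t}(\theta)=\frac{\pi_\theta(o_{i,t}\mid q,\,o_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(o_{i,t}\mid q,\,o_{i,<t})}.
$$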
How does GRPO improve upon the Proximal Policy Optimization framework?

GRPO improves upon the Proximal Policy Optimization (PPO) framework by eliminating the value function model, which reduces memory use and computational complexity [3]. Instead, it estimates the baseline from the scores of a group of sampled outputs, simplifying training and lowering resource requirements. Additionally, GRPO integrates the KL divergence term directly into the loss function rather than folding it into the reward, which helps stabilize training and improve performance.
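As a minimal sketch of what "KL in the loss" looks like in practice (this is not the paper's reference implementation; the function name grpo_token_loss and the default values of clip_eps and beta are illustrative assumptions), the per-token loss can be assembled from log-probabilities alone, with no critic involved:

```python
import torch

def grpo_token_loss(logp_new, logp_old, logp_ref, advantages,
                    clip_eps=0.2, beta=0.04):
    """Clipped PPO-style surrogate plus a per-token KL penalty toward a frozen
    reference policy, added directly to the loss rather than to the reward.

    All inputs are 1-D tensors over response tokens: log-probabilities of the
    sampled tokens under the current, old (sampling-time), and reference
    policies, plus the group-relative advantage assigned to each token.
    """
    ratio = torch.exp(logp_new - logp_old)
    surrogate = torch.min(ratio * advantages,
                          torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages)
    # Unbiased, non-negative KL estimator: pi_ref/pi_new - log(pi_ref/pi_new) - 1
    log_ratio_ref = logp_ref - logp_new
    kl = torch.exp(log_ratio_ref) - log_ratio_ref - 1
    # Maximize the surrogate while penalizing divergence from the reference policy.
    return -(surrogate - beta * kl).mean()
```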
How does GRPO eliminate the need for a value function model?

GRPO eliminates the need for a value function model by using group scores to estimate the baseline, simplifying the training process and resource requirements [4]. This differs from traditional PPO, which relies on a learned value function to compute advantages and update the policy. By forgoing the value function model and instead baselining each output against the other outputs sampled for the same question, GRPO reduces complexity and memory footprint, making training more efficient and scalable.
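A minimal sketch of this baseline substitution, assuming one scalar reward per sampled completion (the function name and the 0/1 reward values below are illustrative):

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Baseline each completion's reward against its own group:
    A_i = (r_i - mean(r_group)) / std(r_group).
    The group mean plays the role that the learned critic plays in PPO."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Hypothetical example: four sampled solutions to one problem,
# scored 1.0 if the final answer is correct and 0.0 otherwise.
rewards = torch.tensor([0.0, 1.0, 1.0, 0.0])
print(group_relative_advantages(rewards))  # correct solutions get positive advantage
```

Because the baseline comes from the rewards themselves, no second network has to be trained or kept in memory alongside the policy.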