🧠 Understanding R1-Zero-Like Training: A Critical Perspective
Authors: Zichen Liu, Changyu Chen, Wenjun Li*, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, Min Lin
Paper: arXiv:2503.20783
Code: GitHub Repository
❓ What’s the Problem?
R1-Zero-like training — directly applying Reinforcement Learning (RL) to LLMs without supervised fine-tuning (SFT) — has shown promising improvements in reasoning capabilities.
However, two major issues remain:
- Base Model Bias: Are base models already biased to perform well before RL due to their pretraining?
- Optimization Bias: Does Group Relative Policy Optimization (GRPO) cause LLMs to generate unnecessarily long outputs?
🧠 Motivation
- Understand how pretraining characteristics affect RL training outcomes.
- Evaluate GRPO’s potential flaws in the optimization process.
- Propose a minimal, unbiased RL recipe for better reasoning performance.
💡 Proposed Solution: Dr. GRPO (Done Right)
A revised version of GRPO that:
- Removes the per-response length normalization (the `1 / |o_i|` factor in the loss).
- Removes the reward standard-deviation normalization (the `std(...)` term in the advantage computation).
- Uses a simple relative (mean-centered) reward as the policy advantage.
🧮 Key Equations
Original GRPO Advantage
$$ \hat{A}_{i,t} = \frac{R(q, o_i) - \mu}{\sigma} $$
where $\mu$ and $\sigma$ are the mean and standard deviation of the rewards $\{ R(q, o_j) \}_{j=1}^{G}$ over the group of $G$ responses sampled for the same question $q$.
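A minimal sketch of this group-normalized advantage in NumPy (the array name `rewards`, holding one scalar reward per sampled response, is an assumption for illustration):

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Original GRPO advantage: center by the group mean and divide by the
    group standard deviation (the normalization Dr. GRPO later removes)."""
    mu = rewards.mean()
    sigma = rewards.std()
    return (rewards - mu) / (sigma + eps)

# Example: rewards for G = 4 responses to the same question (1 = correct, 0 = wrong).
print(grpo_advantages(np.array([1.0, 0.0, 0.0, 1.0])))  # approx. [ 1., -1., -1.,  1.]
```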
PPO-style GRPO Update
$$ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \min \left[ \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_\text{old}}(o_{i,t} \mid q, o_{i,<t})} \hat{A}_{i,t},\ \text{clip}\left( \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_\text{old}}(o_{i,t} \mid q, o_{i,<t})},\ 1 - \epsilon,\ 1 + \epsilon \right) \hat{A}_{i,t} \right] $$
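A sketch of this objective for a single group in PyTorch (tensor names such as `logprobs`, `old_logprobs`, `advantages`, and `mask` are assumptions; `mask` marks the valid tokens of each padded response):

```python
import torch

def grpo_objective(logprobs, old_logprobs, advantages, mask, clip_eps=0.2):
    """Clipped PPO-style surrogate: token terms are averaged within each
    response (the 1/|o_i| factor), then averaged over the G responses.

    logprobs, old_logprobs, mask: [G, T] token-level tensors (mask is 0/1).
    advantages: [G], one scalar advantage per response, shared by its tokens.
    Returns a loss to minimize (the negated objective).
    """
    ratio = torch.exp(logprobs - old_logprobs)                      # [G, T]
    adv = advantages.unsqueeze(1)                                   # [G, 1]
    surrogate = torch.min(ratio * adv,
                          torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv)
    per_response = (surrogate * mask).sum(dim=1) / mask.sum(dim=1)  # 1/|o_i| term
    return -per_response.mean()                                     # 1/G average
```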
Dr. GRPO Advantage (Unbiased)
$$ \hat{A}_{i,t} = R(q, o_i) - \mu $$
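Under the same assumed tensor names as above, a sketch of the two Dr. GRPO changes: advantages are only mean-centered, and token terms are summed without the per-response 1/|o_i| factor (a constant normalizer could be applied for scale without reintroducing the length bias):

```python
import torch

def dr_grpo_objective(logprobs, old_logprobs, rewards, mask, clip_eps=0.2):
    """Dr. GRPO sketch: (1) no division by std(rewards) in the advantage,
    (2) no per-response length normalization in the aggregation."""
    advantages = rewards - rewards.mean()                           # mean-centered only
    ratio = torch.exp(logprobs - old_logprobs)
    adv = advantages.unsqueeze(1)
    surrogate = torch.min(ratio * adv,
                          torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv)
    per_response = (surrogate * mask).sum(dim=1)                    # no 1/|o_i|
    return -per_response.mean()                                     # 1/G average
```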
🔬 Experimental Design
Templates Analyzed
- R1 Template (wraps reasoning and the final answer in think/answer tags)
- Qwen-Math Template (asks for the final answer in \boxed{})
- No Template (the raw question only)
Approximate prompt strings for these settings are sketched after the model list below.
Used across:
- Qwen2.5-Math-1.5B
- Qwen2.5-Math-7B
- DeepSeek-Math-7B
- DeepSeek-V3-Base
- LLaMA-3.1-8B
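For concreteness, a rough sketch of the three prompt settings (the strings below are paraphrased approximations of the DeepSeek-R1 and Qwen2.5-Math conventions, not the exact templates from the paper or repository; `question` is a placeholder):

```python
question = "What is 1 + 1?"  # placeholder problem statement

# R1-style template (paraphrased): the model is asked to wrap its reasoning and
# final answer in <think> ... </think> and <answer> ... </answer> tags.
r1_prompt = (
    "A conversation between User and Assistant. The Assistant first thinks about "
    "the reasoning process and then provides the answer, enclosed in "
    "<think> </think> and <answer> </answer> tags respectively.\n"
    f"User: {question}\nAssistant: <think>"
)

# Qwen-Math-style template (paraphrased): step-by-step reasoning with the final
# answer placed in \boxed{}.
qwen_math_prompt = (
    f"{question}\n"
    "Please reason step by step, and put your final answer within \\boxed{}."
)

# No template: the raw question is fed to the base model as-is.
no_template_prompt = question
```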
Metrics
- Pass@8: the fraction of questions for which at least one of 8 sampled completions is correct (a simple computation is sketched after this list)
- Answer Formatting Score: whether the model returns a direct answer rather than merely continuing the question as free text (judged by GPT-4o-mini)
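A minimal pass@8 computation under an assumed input layout (`correct` holds, per question, the correctness booleans of its sampled completions):

```python
def pass_at_k(correct: list[list[bool]], k: int = 8) -> float:
    """Fraction of questions for which at least one of the first k sampled
    completions is correct."""
    hits = sum(any(samples[:k]) for samples in correct)
    return hits / len(correct)

# Example: 3 questions, 8 sampled completions each.
correct = [
    [False] * 8,                   # never solved
    [False, True] + [False] * 6,   # solved once
    [True] * 8,                    # always solved
]
print(pass_at_k(correct, k=8))     # 0.666...
```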
📊 Key Results
Base Model Insights
- All base models show math reasoning abilities even before RL; DeepSeek-V3-Base, for example, already exhibits the “Aha” moment.
- Qwen2.5 performs worse with prompt templates than without, suggesting it may have been pretrained on concatenated question-answer text.
Optimization Bias in GRPO
- Standard GRPO causes longer outputs, especially for wrong answers.
- Dr. GRPO prevents bloat, leading to better token efficiency.
Final Performance
Using Dr. GRPO with Qwen2.5-Math-7B, MATH (Level 3–5) questions, and the Qwen-Math template:
- 43.3% pass@8 on AIME 2024, a new state of the art, reached with modest compute on A100 GPUs.
📚 Summary of Takeaways
| Section | Finding |
|---|---|
| 2.1 | Templates help elicit answering behavior; base models already have strong math skills. |
| 2.2 | Qwen2.5 performs better without templates → possible pretraining on answer-first formats. |
| 2.3 | Most base models (Qwen, DeepSeek) already show “Aha” moments before RL. |
| 3.1 | GRPO’s optimization bias is confirmed. |
| 3.2 | Dr. GRPO removes the bias, yielding cleaner gradient signals. |
| 3.3 | Template-model mismatch hurts performance pre-RL. |
| 3.4 | LLaMA-3.2-3B benefits significantly from math pretraining. |
🔥 Wrap-Up
- Dr. GRPO is a simple, elegant fix for GRPO’s optimization bias.
- Base-model choice and pretraining format matter more than previously assumed.
- Minimalist RL tuning with the right optimizer and template can reach new state-of-the-art performance quickly and cheaply.
📖 Read the full paper: https://arxiv.org/pdf/2503.20783
💻 Code: https://github.com/sail-sg/understand-r1-zero