🧠 Understanding R1-Zero-Like Training: A Critical Perspective
Authors: Zichen Liu, Changyu Chen, Wenjun Li*, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, Min Lin
Paper: arXiv:2503.20783
Code: GitHub Repository
❓ What’s the Problem?
R1-Zero-like training — directly applying Reinforcement Learning (RL) to LLMs without supervised fine-tuning (SFT) — has shown promising improvements in reasoning capabilities.
However, two major issues remain:
- Base Model Bias: Are base models already biased to perform well before RL due to their pretraining?
- Optimization Bias: Does Group Relative Policy Optimization (GRPO) cause LLMs to generate unnecessarily long outputs?
🧠 Motivation
- Understand how pretraining characteristics affect RL training outcomes.
- Evaluate GRPO’s potential flaws in the optimization process.
- Propose a minimal, unbiased RL recipe for better reasoning performance.
💡 Proposed Solution: Dr. GRPO (Done Right)
A revised version of GRPO that:
- Removes the per-response length normalization (the `1 / |o_i|` factor in the loss).
- Removes the reward standard-deviation normalization (the `std(...)` term in the advantage computation).
- Uses a simple relative (mean-centered) reward as the policy advantage.
🧮 Key Equations
Original GRPO Advantage
$$ \hat{A}_{i,t} = \frac{R(q, o_i) - \mu}{\sigma} $$
where $\mu$ and $\sigma$ are the mean and standard deviation of the rewards $\{ R(q, o_j) \}_{j=1}^{G}$ over the group of $G$ responses sampled for the same question $q$.
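A minimal sketch of this group-normalized advantage in NumPy (the array name `rewards`, holding one scalar reward per sampled response, is an assumption for illustration):

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Original GRPO advantage: center by the group mean and divide by the
    group standard deviation (the normalization Dr. GRPO later removes)."""
    mu = rewards.mean()
    sigma = rewards.std()
    return (rewards - mu) / (sigma + eps)

# Example: rewards for G = 4 responses to the same question (1 = correct, 0 = wrong).
print(grpo_advantages(np.array([1.0, 0.0, 0.0, 1.0])))  # approx. [ 1., -1., -1.,  1.]
```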
PPO-style GRPO Update
$$ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \min \left[ \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_\text{old}}(o_{i,t} \mid q, o_{i,<t})} \hat{A}_{i,t},\ \text{clip}\left( \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_\text{old}}(o_{i,t} \mid q, o_{i,<t})},\ 1 - \epsilon,\ 1 + \epsilon \right) \hat{A}_{i,t} \right] $$
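A sketch of this objective for a single group in PyTorch (tensor names such as `logprobs`, `old_logprobs`, `advantages`, and `mask` are assumptions; `mask` marks the valid tokens of each padded response):

```python
import torch

def grpo_objective(logprobs, old_logprobs, advantages, mask, clip_eps=0.2):
    """Clipped PPO-style surrogate: token terms are averaged within each
    response (the 1/|o_i| factor), then averaged over the G responses.

    logprobs, old_logprobs, mask: [G, T] token-level tensors (mask is 0/1).
    advantages: [G], one scalar advantage per response, shared by its tokens.
    Returns a loss to minimize (the negated objective).
    """
    ratio = torch.exp(logprobs - old_logprobs)                      # [G, T]
    adv = advantages.unsqueeze(1)                                   # [G, 1]
    surrogate = torch.min(ratio * adv,
                          torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv)
    per_response = (surrogate * mask).sum(dim=1) / mask.sum(dim=1)  # 1/|o_i| term
    return -per_response.mean()                                     # 1/G average
```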
Dr. GRPO Advantage (Unbiased)
$$ \hat{A}_{i,t} = R(q, o_i) - \mu $$
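Under the same assumed tensor names as above, a sketch of the two Dr. GRPO changes: advantages are only mean-centered, and token terms are summed without the per-response 1/|o_i| factor (a constant normalizer could be applied for scale without reintroducing the length bias):

```python
import torch

def dr_grpo_objective(logprobs, old_logprobs, rewards, mask, clip_eps=0.2):
    """Dr. GRPO sketch: (1) no division by std(rewards) in the advantage,
    (2) no per-response length normalization in the aggregation."""
    advantages = rewards - rewards.mean()                           # mean-centered only
    ratio = torch.exp(logprobs - old_logprobs)
    adv = advantages.unsqueeze(1)
    surrogate = torch.min(ratio * adv,
                          torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv)
    per_response = (surrogate * mask).sum(dim=1)                    # no 1/|o_i|
    return -per_response.mean()                                     # 1/G average
```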
🔬 Experimental Design
Templates Analyzed
- R1 Template (wraps reasoning and the final answer in think/answer tags)
- Qwen-Math Template (asks for the final answer in \boxed{})
- No Template (the raw question only)
Approximate prompt strings for these settings are sketched after the model list below.
Used across:
- Qwen2.5-Math-1.5B
- Qwen2.5-Math-7B
- DeepSeek-Math-7B
- DeepSeek-V3-Base
- LLaMA-3.1-8B
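For concreteness, a rough sketch of the three prompt settings (the strings below are paraphrased approximations of the DeepSeek-R1 and Qwen2.5-Math conventions, not the exact templates from the paper or repository; `question` is a placeholder):

```python
question = "What is 1 + 1?"  # placeholder problem statement

# R1-style template (paraphrased): the model is asked to wrap its reasoning and
# final answer in <think> ... </think> and <answer> ... </answer> tags.
r1_prompt = (
    "A conversation between User and Assistant. The Assistant first thinks about "
    "the reasoning process and then provides the answer, enclosed in "
    "<think> </think> and <answer> </answer> tags respectively.\n"
    f"User: {question}\nAssistant: <think>"
)

# Qwen-Math-style template (paraphrased): step-by-step reasoning with the final
# answer placed in \boxed{}.
qwen_math_prompt = (
    f"{question}\n"
    "Please reason step by step, and put your final answer within \\boxed{}."
)

# No template: the raw question is fed to the base model as-is.
no_template_prompt = question
```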
Metrics
- Pass@8: the fraction of questions for which at least one of 8 sampled completions is correct (a simple computation is sketched after this list)
- Answer Formatting Score: whether the model returns a direct answer rather than merely continuing the question as free text (judged by GPT-4o-mini)
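A minimal pass@8 computation under an assumed input layout (`correct` holds, per question, the correctness booleans of its sampled completions):

```python
def pass_at_k(correct: list[list[bool]], k: int = 8) -> float:
    """Fraction of questions for which at least one of the first k sampled
    completions is correct."""
    hits = sum(any(samples[:k]) for samples in correct)
    return hits / len(correct)

# Example: 3 questions, 8 sampled completions each.
correct = [
    [False] * 8,                   # never solved
    [False, True] + [False] * 6,   # solved once
    [True] * 8,                    # always solved
]
print(pass_at_k(correct, k=8))     # 0.666...
```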
📊 Key Results
Base Model Insights
- All base models show math reasoning abilities even before RL; DeepSeek-V3-Base, for example, already exhibits the “Aha” moment.
- Qwen2.5 performs worse with prompt templates than without, suggesting it may have been pretrained on concatenated question-answer text.
Optimization Bias in GRPO
- Standard GRPO causes longer outputs, especially for wrong answers.
- Dr. GRPO prevents bloat, leading to better token efficiency.
Final Performance
Using Dr. GRPO with Qwen2.5-Math-7B, MATH (Level 3–5) questions, and the Qwen-Math template:
- 43.3% pass@8 on AIME 2024, a new state of the art, reached with modest compute on A100 GPUs.
📚 Summary of Takeaways
| Section | Finding |
|---|---|
| 2.1 | Templates help elicit answering behavior; base models already have strong math skills. |
| 2.2 | Qwen2.5 performs better without templates → possible pretraining on answer-first formats. |
| 2.3 | Most base models (Qwen, DeepSeek) already show “Aha” moments before RL. |
| 3.1 | GRPO’s optimization bias is confirmed. |
| 3.2 | Dr. GRPO removes the bias, yielding cleaner gradient signals. |
| 3.3 | Template-model mismatch hurts performance pre-RL. |
| 3.4 | LLaMA-3.2-3B benefits significantly from math pretraining. |
🔥 Wrap-Up
- Dr. GRPO is a simple, elegant fix for GRPO’s optimization bias.
- Base-model choice and pretraining format matter more than previously assumed.
- Minimalist RL tuning with the right optimizer and template can reach new state-of-the-art performance quickly and cheaply.
📖 Read the full paper: https://arxiv.org/pdf/2503.20783
💻 Code: https://github.com/sail-sg/understand-r1-zero