What I Think About DeepSeek-Prover-V2 (and Why It’s a Complex Recipe)¶

DeepSeek-Prover-V2 is one of the most ambitious formal theorem-proving models out there right now. It’s impressive, posting staggering, state-of-the-art pass rates on benchmarks like miniF2F. On paper, it feels like a big leap forward in blending natural language reasoning with formal proof generation.

But as I dug into the paper, one thing became clear:
This is not a simple recipe. It’s a seriously complex training pipeline.

The system combines everything:

  • Subgoal decomposition,
  • Expert iteration,
  • Curriculum learning,
  • Chain-of-thought fine-tuning,
  • Reinforcement learning (GRPO),
  • And even distillation into smaller models.

It’s like a layered cake of machine learning techniques all stacked on top of each other. Each part works together to train a model that can not only generate formal Lean proofs but also “think out loud” with detailed reasoning steps.

One thing I noticed, though, is that the paper doesn’t really address the question of data leakage upfront. For example, they mention augmenting the dataset with problems from the validation split of the MiniF2F benchmark, which could raise eyebrows if you’re skimming the paper. However, after a closer read (see Page 6), it’s clear they are careful to only use the validation split (not the test set) to guide training—which is fair game. Still, it's a critical detail that deserved more attention, especially in a field where leakage can totally skew benchmark results.

In this post, I’m unpacking the key methods and results, sharing what stood out to me about this model, and explaining why I think it’s a great step forward, despite being one of the most complex recipes I’ve seen in LLM theorem proving (or neural theorem proving, whichever term you prefer).

Method: How DeepSeek-Prover-V2 Works¶

Recursive Proof Search via Subgoal Decomposition¶

The core of DeepSeek-Prover-V2 is inspired by how human mathematicians think: break down a complex theorem into smaller, manageable steps. These smaller steps—called subgoals—serve as stepping stones toward the full proof.

To do this, the system uses DeepSeek-V3 as a general-purpose tool to decompose a theorem into subgoals. Here’s the basic flow:

  • First, DeepSeek-V3 analyzes the theorem in natural language and then breaks it down into smaller proof steps.
  • Each step is translated into a Lean formal statement but only as a sketch—leaving the details out (using sorry placeholders).
  • The result is a structured "chain of thought" that outlines the strategy: a sequence of subgoals ready to be tackled.

This is very human-like: break the hard thing down, write out the pieces, and then solve them step by step.
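
To make that concrete, here’s a toy Lean sketch of my own (not taken from the paper) showing what a decomposed proof skeleton looks like: each subgoal becomes a `have` step whose proof is temporarily left as `sorry`.

```lean
-- Toy example (mine, not from the paper) of a decomposed proof sketch:
-- each subgoal is stated with `have`, and `sorry` stands in for the
-- not-yet-found proof of that step.
theorem toy_example (a b : Nat) (h : a = b) : a + b = 2 * b := by
  have h1 : a + b = b + b := by sorry   -- subgoal 1: substitute a with b
  have h2 : b + b = 2 * b := by sorry   -- subgoal 2: arithmetic identity
  exact h1.trans h2
```

The prover’s later job is simply to replace each `sorry` with an actual proof.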


Recursive Resolution of Subgoals¶

Once the subgoals are generated, the 7B prover model steps in. It:

  • Picks up each subgoal and tries to solve it fully.
  • After solving one subgoal, its result is used as a premise for the next subgoal.

This recursive strategy builds up the full proof layer by layer.

To keep things efficient, the system uses a smaller 7B model to handle these subgoals (instead of the giant 671B model), reducing compute cost while staying effective.
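
In pseudocode terms, the loop looks roughly like this. It’s a sketch with hypothetical helper names: `prove_with_7b` and `verify_with_lean` stand in for the 7B prover call and the Lean checker; neither is from the paper’s actual code.

```python
# Minimal sketch of the recursive subgoal loop: each subgoal is handed to the
# small prover, and once it verifies in Lean, its statement is added to the
# premises available to the next subgoal.

def resolve_subgoals(subgoals, prove_with_7b, verify_with_lean):
    premises = []   # statements of subgoals proved so far
    proofs = []     # verified Lean proofs, one per subgoal
    for goal in subgoals:
        proof = prove_with_7b(goal, premises)   # prover sees earlier results
        if not verify_with_lean(goal, proof, premises):
            return None                         # one failed subgoal sinks this decomposition
        premises.append(goal)
        proofs.append(proof)
    return proofs   # can be stitched back into a full proof
```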


Curriculum Learning with Subgoals¶

One major challenge in training these kinds of models is that most proof attempts fail. This makes it hard for the model to learn because it gets very few "success signals."

To tackle that, the team:

  • Uses subgoals to expand the dataset with easier, more tractable problems.
  • Creates two types of subgoal theorems:
    • One with preceding subgoals as premises,
    • And one without.

This way, they guide the model step by step through a curriculum that gradually increases in difficulty—starting from easy subgoals and moving up to full theorems.
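
Continuing the toy example from above, the two variants of a subgoal theorem might look like this (again my own illustration, not the paper’s actual data):

```lean
-- Variant A: a later step stated on its own.
theorem step_hard (a b : Nat) (h : a = b) : a + b = 2 * b := by
  sorry

-- Variant B: the same goal, but with the preceding subgoal supplied as a
-- premise (h1), which makes it an easier curriculum item.
theorem step_easy (a b : Nat) (h : a = b) (h1 : a + b = b + b) :
    a + b = 2 * b := by
  sorry
```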


Unifying Reasoning and Formal Proofs¶

The pipeline cleverly combines two types of models:

  • DeepSeek-V3: handles natural language reasoning and subgoal decomposition.
  • 7B prover model: completes the formal Lean proofs.

Here’s a neat trick: sometimes the 7B model can’t solve a full problem directly, but it successfully solves all the subgoals. In that case, the team stitches together the subgoal proofs to form a complete proof and pairs it with the original reasoning chain. This becomes cold-start training data—synthetic but high-quality—and helps train the full DeepSeek-Prover-V2 model.
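
Here’s a minimal sketch of how such a cold-start record could be assembled; the function and field names are my own guesses, not the paper’s actual schema.

```python
# Sketch: pair DeepSeek-V3's natural-language reasoning with the full Lean
# proof obtained by stitching verified subgoal proofs back into the sketch.

def make_cold_start_example(theorem_stmt, reasoning_chain, sketch, subgoal_proofs):
    full_proof = sketch
    for proof in subgoal_proofs:
        full_proof = full_proof.replace("sorry", proof, 1)  # fill placeholders in order
    return {
        "statement": theorem_stmt,   # formal Lean statement
        "cot": reasoning_chain,      # informal chain-of-thought from DeepSeek-V3
        "proof": full_proof,         # stitched proof, re-checked by Lean afterwards
    }
```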


Two-Stage Training Pipeline¶

DeepSeek-Prover-V2 is trained in two main stages:

1️⃣ Stage 1: non-CoT mode

  • The model learns to generate formal Lean proofs only—no reasoning steps.
  • This is fast and efficient, great for collecting lots of data quickly.

2️⃣ Stage 2: CoT (Chain-of-Thought) mode

  • The model learns to think out loud: first generating reasoning, then writing the formal proof.
  • This stage uses the cold-start data plus reinforcement learning (GRPO) to sharpen the model’s performance.
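
To make the two modes concrete, here’s a rough sketch of what the prompts could look like; the wording is mine, not the paper’s actual template.

```python
# Rough sketch of the two generation modes. Non-CoT asks for the proof
# directly; CoT asks the model to reason first, then emit the Lean proof.

NON_COT_PROMPT = (
    "Complete the following Lean 4 theorem. "
    "Output only the proof code.\n\n{statement}"
)

COT_PROMPT = (
    "Complete the following Lean 4 theorem. "
    "First write a detailed proof plan in natural language, "
    "then write the full Lean 4 proof.\n\n{statement}"
)

def build_prompt(statement: str, cot: bool) -> str:
    template = COT_PROMPT if cot else NON_COT_PROMPT
    return template.format(statement=statement)
```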


Training Details¶

  • The 671B model is fine-tuned with a 16,384-token context window at a learning rate of 5e-6.
  • The training data includes:
    • non-CoT data (just formal Lean proofs),
    • and CoT data (reasoning + formal proof).

The 7B model is trained similarly but is also distilled from the 671B model and further polished with RL—resulting in a fast, lightweight prover that can handle long proofs (up to 32,768 tokens).
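
For reference, here are the handful of hyperparameters mentioned above, gathered into a config-style sketch (the key names are my own labels, not the paper’s):

```python
# Training settings as stated in the post above, nothing more.

FINETUNE_671B = {
    "context_window": 16_384,   # tokens
    "learning_rate": 5e-6,
    "data": ["non_cot_proofs", "cot_reasoning_plus_proofs"],
}

PROVER_7B = {
    "context_window": 32_768,   # extended so long proofs fit
    "distilled_from": "DeepSeek-Prover-V2-671B",
    "post_training": "GRPO",    # reinforcement-learning polish
}
```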


Expert Iteration & Curriculum Learning: How the Model Gets Smarter¶

One of the most fascinating parts of the DeepSeek-Prover-V2 pipeline is how it keeps improving itself over time. This is thanks to a loop called expert iteration.

Here’s how it works:

  • The model starts off with its base training (human + synthetic proofs).
  • Then it tries to solve harder problems, especially the ones it failed to crack in previous rounds.
  • When it finally succeeds at a tough problem, the new proof gets checked and verified by Lean.
  • That successful proof is then added back into the training set to improve the next version of the model.

This is pure self-improvement: the model learns not just from human data but from **its own breakthroughs**.
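
Here’s a rough sketch of one such round. The names are hypothetical stand-ins (`lean_verify`, `model.generate`, `model.finetune` are not the paper’s code), and the attempt budget is just an example.

```python
# Minimal sketch of one expert-iteration round: sample many attempts on
# still-unsolved problems, keep only proofs that Lean verifies, and fold
# them back into the supervised training set.

def expert_iteration_round(model, unsolved, train_set, lean_verify, attempts=32):
    still_unsolved = []
    for problem in unsolved:
        proofs = [model.generate(problem) for _ in range(attempts)]
        verified = [p for p in proofs if lean_verify(problem, p)]
        if verified:
            train_set.append((problem, verified[0]))  # new success becomes training data
        else:
            still_unsolved.append(problem)
    model.finetune(train_set)   # the next round starts from a stronger model
    return still_unsolved
```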

And this is where curriculum learning sneaks in:

They don’t just throw hard problems at the model all at once. Instead, they structure the problem set from easier to harder, starting with subgoals (which are easier) and moving up to full theorems. This mirrors how humans learn math: build up step-by-step confidence before tackling the big leagues.

That’s why **subgoal decomposition isn’t just a way to make proofs easier to solve—it’s also a powerful tool to structure the learning journey.**


Reinforcement Learning (GRPO): Making the Model Even Sharper¶

Once the model has learned from static data (proofs and reasoning), it goes through a final polishing phase: reinforcement learning.

They use a method called Group Relative Policy Optimization (GRPO).

Here’s the cool part:

  • In standard RL (like PPO), you need a critic model to estimate how good each proof attempt is.
  • **GRPO skips the critic entirely.**
  • Instead, for each theorem:
    • The model generates a batch of proofs (like 32 attempts),
    • Each proof gets a simple score: 1 if it works, 0 if it fails.
    • The model optimizes its policy by comparing proofs within that batch—favoring the better ones.

This relative ranking is surprisingly powerful because proof tasks are binary by nature (the proof either works or it doesn’t). GRPO keeps the training simple and efficient—no complex reward shaping needed.
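
To make the “relative” part concrete, here’s a minimal sketch (mine, not the paper’s implementation) of the group-normalized advantage that replaces the critic: each sampled proof’s binary Lean-verification reward is standardized against its own group.

```python
# Group-relative advantages: reward of each attempt minus the group mean,
# divided by the group standard deviation. Verified proofs get a positive
# advantage, failures a negative one; no learned value function is needed.

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Example: 32 attempts at one theorem, 5 of which type-check in Lean.
rewards = [1.0] * 5 + [0.0] * 27
advantages = group_relative_advantages(rewards)
```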


Distillation: Teaching the Small 7B Model¶

Here’s where things get super practical. The massive 671B model is a beast, but it’s not exactly something you can run casually.

So, they distill its knowledge into the 7B model:

  • First, they extend the 7B model’s context window (from 4K → 32K tokens) so it can handle long proofs too.
  • Then they fine-tune the 7B model on rollout data from the 671B model.
  • Finally, they apply the same RL polishing (GRPO) to sharpen it.

The result?
A smaller, faster model that can still do real formal proving—hugely valuable for cost-efficient applications.
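
For the middle step, here’s a small sketch of what collecting distillation data could look like; `teacher.generate` is a hypothetical stand-in for sampling rollouts from the 671B model, not an actual API from the paper.

```python
# Sketch of the distillation data step: the 7B student is fine-tuned with
# plain supervised learning on rollouts sampled from the 671B teacher.

def build_distillation_set(teacher, statements, samples_per_statement=4):
    dataset = []
    for stmt in statements:
        for _ in range(samples_per_statement):
            rollout = teacher.generate(stmt)   # CoT reasoning + Lean proof
            dataset.append({"prompt": stmt, "completion": rollout})
    return dataset
```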


The 7B Model’s Surprise Move¶

Now here’s the part that caught my eye:
Despite being much smaller, the 7B model (in non-CoT mode) actually solved 13 problems that the 671B model couldn’t.

Why? After looking at the outputs, the researchers found the 7B model relied on clever Lean tactics:

  • Cardinal.toNat
  • Cardinal.natCast_inj

These are tricks for handling problems about counting elements in sets (finite cardinalities). The 7B model naturally fell into using these tactics, while the big 671B model didn’t.
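
For a flavor of what these lemmas do, here’s a tiny self-contained Lean example of my own (not one of the model’s actual proofs), assuming the current Mathlib names: `Cardinal.natCast_inj` says two naturals are equal exactly when their casts into `Cardinal` are, and `Cardinal.toNat` maps a cardinal back down to a natural number.

```lean
import Mathlib

-- Toy use of Cardinal.natCast_inj (not from the paper's outputs):
-- equality of natural-number casts in Cardinal gives equality in ℕ.
example (m n : ℕ) (h : (m : Cardinal) = (n : Cardinal)) : m = n :=
  Cardinal.natCast_inj.mp h
```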

So even though the 671B is more powerful overall, the 7B model’s **niche expertise filled in the gaps**—raising the total solved problems on PutnamBench from 49 to 62.

Benchmark Results: DeepSeek-Prover-V2-671B Performance¶

| Problem Category | miniF2F-valid (curriculum + Pass@8192) | miniF2F-test (Pass@8192) |
|---|---|---|
| IMO | 10/20 = 50.0% | 10/20 = 50.0% |
| Olympiad | 10(+2)/15 = 80.0% | 14/15 = 93.3% |
| MATH | 39/45 = 86.7% | 35/45 = 77.8% |
| AIME | 69/70 = 98.6% | 70/70 = 100.0% |
| AMC | 58/60 = 96.7% | 58/60 = 96.7% |
| Algebra | 18/18 = 100.0% | 15/18 = 83.3% |
| Number Theory | 8/8 = 100.0% | 7/8 = 87.5% |
| Induction | 8/8 = 100.0% | 8/8 = 100.0% |
| Overall Pass Rate | 220(+2)/244 = 91.0% | 217/244 = 88.9% |

Table: Problems solved by DeepSeek-Prover-V2-671B on the miniF2F benchmark. Results on miniF2F-valid are collected throughout the curriculum learning process, and the prover is further evaluated with Pass@8192 on the remaining test problems.

Other results: https://arxiv.org/pdf/2504.21801