June 2025
Training large language models (LLMs) to reason is one of the most exciting and difficult frontiers in AI. Traditional methods rely on external rewards: human preference models (RLHF) or task-specific correctness checks (RLVR). But what if models could learn directly from their own internal confidence, without any supervision? That’s the promise of INTUITOR, a new method built on Reinforcement Learning from Internal Feedback (RLIF).
In this blog, we’ll explore the key ideas behind INTUITOR, how it works, why self-certainty is a powerful intrinsic signal, and what this tells us about the future of autonomous reasoning in language models.
LLMs fine-tuned with RL typically rely on one of two external reward sources: human preference models (RLHF) or verifiable, task-specific correctness checks (RLVR).
But both approaches are expensive and domain-limited. RLHF requires large-scale human annotations and is prone to reward hacking. RLVR needs gold answers or executable test cases, which aren’t available for many open-ended or creative tasks.
RLIF is a new paradigm where the reward doesn’t come from an external oracle, but from the model itself. Specifically, it uses intrinsic signals like prediction confidence, entropy, or semantic coherence.
INTUITOR is the first major implementation of RLIF using self-certainty as the only reward signal.
Self-certainty measures how confident a model is in its next-token predictions. Formally, it is defined as the average KL divergence between a uniform distribution over the vocabulary and the model's predicted distributions at each token:
Self-certainty(o|q) = (1/|o|) ∑ₜ KL(U || pₜ(·|q, o<t))
This metric rewards outputs where the model is highly certain about each step and is robust to biases that affect other measures like entropy or perplexity.
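To make the definition concrete, here is a minimal PyTorch sketch (an illustration, not the paper's code) that computes self-certainty from the per-token logits of one sampled completion; the function name and tensor shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def self_certainty(logits: torch.Tensor) -> torch.Tensor:
    """Average KL(U || p_t) over the tokens of one completion.

    logits: [seq_len, vocab_size] next-token logits at each generated position.
    Returns a scalar; larger values mean the model was more confident per step.
    """
    log_probs = F.log_softmax(logits, dim=-1)                   # log p_t(j | q, o_<t)
    vocab_size = logits.size(-1)
    log_uniform = -torch.log(torch.tensor(float(vocab_size)))   # log(1/|V|)
    # KL(U || p_t) = (1/|V|) * sum_j [ log(1/|V|) - log p_t(j) ]
    kl_per_token = (log_uniform - log_probs).mean(dim=-1)
    return kl_per_token.mean()                                  # average over tokens
```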
INTUITOR uses Group Relative Policy Optimization (GRPO) to turn self-certainty into a usable advantage signal. For each query, it samples a group of completions, scores them with self-certainty, and normalizes these scores to produce advantages.
J_GRPO(θ) = E[ (1/G) ∑ᵢ (1/|oᵢ|) ∑ₜ min(cᵢₜ * Âᵢₜ, clip(cᵢₜ, 1-ε, 1+ε) * Âᵢₜ) ] - β KL(πθ || πref)
cᵢₜ(θ) = πθ(oᵢₜ|q, oᵢ,<t) / π_old(oᵢₜ|q, oᵢ,<t)
where Âᵢₜ is the advantage and ε, β are hyperparameters. External rewards are replaced with normalized self-certainty scores:
Âᵢₜ = (uᵢ - mean(u₁,...,uG)) / std(u₁,...,uG), where uᵢ = Self-certainty(oᵢ | q)
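The sketch below shows, under simplifying assumptions, how the group of self-certainty scores could be normalized into advantages and fed into the clipped surrogate above. It assumes completions padded to a common length, omits attention masking and the β·KL(πθ || πref) penalty for brevity, and the function names are hypothetical.

```python
import torch

def group_advantages(u: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Normalize self-certainty scores u_1..u_G of one query's completions."""
    return (u - u.mean()) / (u.std() + eps)        # Â_i, shared by all tokens of o_i

def grpo_clipped_loss(
    logp_new: torch.Tensor,      # [G, T] token log-probs under the current policy
    logp_old: torch.Tensor,      # [G, T] token log-probs under the sampling policy
    advantages: torch.Tensor,    # [G] one advantage per completion
    eps_clip: float = 0.2,
) -> torch.Tensor:
    ratio = torch.exp(logp_new - logp_old)         # c_it
    adv = advantages.unsqueeze(-1)                 # broadcast over tokens
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - eps_clip, 1.0 + eps_clip) * adv
    # Negate because optimizers minimize; the KL penalty term is omitted here.
    return -torch.minimum(unclipped, clipped).mean()
```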
Equivalently, expanding the KL term over the vocabulary V gives:
Self-certainty(o|q) = −(1/(|o|·|V|)) ∑ᵢ ∑ⱼ log(|V| · pθ(j | q, o<i))
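A quick numerical sanity check that the two forms agree, using toy sizes and random logits (a sketch, not from the paper):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
T, V = 5, 8                                   # toy sequence length and vocab size
p = F.softmax(torch.randn(T, V), dim=-1)      # stand-in next-token distributions

# Form 1: (1/|o|) * sum_t KL(U || p_t)
kl_form = (torch.log(torch.tensor(1.0 / V)) - torch.log(p)).mean(dim=-1).mean()

# Form 2: -(1 / (|o| * |V|)) * sum_t sum_j log(|V| * p_t(j))
alt_form = -torch.log(V * p).sum() / (T * V)

assert torch.allclose(kl_form, alt_form)      # the two expressions agree
```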
Models trained with INTUITOR begin to develop pre-answer reasoning. For instance, on code generation tasks, they start by explaining their logic in natural language before outputting code — even when not prompted to. This behavior grows over time and helps the model build more understandable and robust answers. On LiveCodeBench, for example, outputs transition from raw code to natural language reasoning followed by JSON or structured responses.
When rewards are computed from a frozen model (offline self-certainty), models may exploit the reward signal by inserting already-solved answers or inflating length. However, INTUITOR avoids this by computing rewards from the evolving policy (online self-certainty), ensuring training stability and mitigating over-optimization risks.
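As a sketch of that distinction, the snippet below assumes the self_certainty helper defined earlier and a hypothetical logits_fn callable that wraps whichever model scores the completions; the only difference between the two regimes is which model logits_fn wraps.

```python
from typing import Callable, List
import torch

def rlif_rewards(
    completions: List[torch.Tensor],                     # token ids of each sampled completion
    logits_fn: Callable[[torch.Tensor], torch.Tensor],   # completion -> [T, V] logits
) -> torch.Tensor:
    """Score each completion with self-certainty under the model behind logits_fn."""
    # self_certainty: the helper defined earlier in this post.
    return torch.stack([self_certainty(logits_fn(c)) for c in completions])

# Offline: logits_fn wraps a frozen snapshot of the base model (gameable over time).
# Online (INTUITOR): logits_fn wraps the current policy, so the reward evolves with it.
```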
INTUITOR matches or outperforms GRPO on in-domain math benchmarks (GSM8K, MATH) as well as out-of-domain tasks such as LiveCodeBench and CRUXEval.
It also shows higher confidence and better separation between correct and incorrect answers, as confirmed by Mann–Whitney U tests. Additionally, ablation studies show that tuning the KL penalty is key to maintaining performance, especially on out-of-domain tasks like code generation and reasoning benchmarks.
INTUITOR enables faster early-stage learning by providing token-level continuous feedback. As early as training step 10, INTUITOR-trained models outperform GRPO across GSM8K and MATH for both Qwen2.5-1.5B and Qwen2.5-3B. This demonstrates its advantage in learnability without sacrificing final performance.
Training with INTUITOR leads to generalization from MATH to out-of-domain tasks like LiveCodeBench. Performance on LiveCodeBench continues to improve even after MATH accuracy plateaus, showing the transfer benefits of structured reasoning learned via self-certainty.
Self-certainty helps models generate more self-explanatory traces over time. On CRUXEval and LiveCodeBench, INTUITOR models first reduce invalid code, then add reasoning and structure. This behavior emerges naturally, even without explicit reward supervision for it.
INTUITOR is a step toward a future where LLMs learn to understand and refine their own thoughts. By leveraging internal feedback instead of external labels, it unlocks scalable reasoning in under-supervised domains.
That said, self-certainty is not a silver bullet. It can lead to overconfidence, saturation, or bias toward short, high-certainty answers. It’s best used alongside semantic filters or hybrid reward strategies.
INTUITOR shows us that models don’t just need to answer — they need to believe in their answers too.