June 2025
Training large language models (LLMs) to reason is one of the most exciting and difficult frontiers in AI. Traditional methods rely on external rewards: human preference models (RLHF) or task-specific correctness checks (RLVR). But what if models could learn directly from their own internal confidence, without any supervision? That’s the promise of INTUITOR, a new method built on Reinforcement Learning from Internal Feedback (RLIF).
In this blog, we’ll explore the key ideas behind INTUITOR, how it works, why self-certainty is a powerful intrinsic signal, and what this tells us about the future of autonomous reasoning in language models.
LLMs fine-tuned with RL typically rely on one of two external reward sources: human preference models (RLHF) or verifiable, task-specific correctness checks (RLVR).
But both approaches are expensive and domain-limited. RLHF requires large-scale human annotations and is prone to reward hacking. RLVR needs gold answers or executable test cases, which aren’t available for many open-ended or creative tasks.
RLIF is a new paradigm where the reward doesn’t come from an external oracle, but from the model itself. Specifically, it uses intrinsic signals like prediction confidence, entropy, or semantic coherence.
INTUITOR is the first major implementation of RLIF using self-certainty as the only reward signal.
Self-certainty measures how confident a model is in its next-token predictions. Formally, it is defined as the average KL divergence between a uniform distribution over the vocabulary and the model's predicted distributions at each token:
Self-certainty(o|q) = (1/|o|) ∑ₜ KL(U || pₜ(·|q, o<t))
This metric rewards outputs where the model is highly certain about each step and is robust to biases that affect other measures like entropy or perplexity.
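To make the definition concrete, here is a minimal PyTorch sketch (an illustration, not the paper's code) that computes self-certainty from the per-token logits of one sampled completion; the function name and tensor shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def self_certainty(logits: torch.Tensor) -> torch.Tensor:
    """Average KL(U || p_t) over the tokens of one completion.

    logits: [seq_len, vocab_size] next-token logits at each generated position.
    Returns a scalar; larger values mean the model was more confident per step.
    """
    log_probs = F.log_softmax(logits, dim=-1)                   # log p_t(j | q, o_<t)
    vocab_size = logits.size(-1)
    log_uniform = -torch.log(torch.tensor(float(vocab_size)))   # log(1/|V|)
    # KL(U || p_t) = (1/|V|) * sum_j [ log(1/|V|) - log p_t(j) ]
    kl_per_token = (log_uniform - log_probs).mean(dim=-1)
    return kl_per_token.mean()                                  # average over tokens
```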
INTUITOR uses Group Relative Policy Optimization (GRPO) to turn self-certainty into a usable advantage signal. For each query, it samples a group of completions, scores them with self-certainty, and normalizes these scores to produce advantages.
J_GRPO(θ) = E[ (1/G) ∑ᵢ (1/|oᵢ|) ∑ₜ min(cᵢₜ * Âᵢₜ, clip(cᵢₜ, 1-ε, 1+ε) * Âᵢₜ) ] - β KL(πθ || πref)
cᵢₜ(θ) = πθ(oᵢₜ|q, oᵢ,<t) / π_old(oᵢₜ|q, oᵢ,<t)
where Âᵢₜ is the advantage and ε, β are hyperparameters. External rewards are replaced with normalized self-certainty scores:
Âᵢₜ = (uᵢ - mean(u₁,...,uG)) / std(u₁,...,uG), where uᵢ = Self-certainty(oᵢ | q)
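The sketch below shows, under simplifying assumptions, how the group of self-certainty scores could be normalized into advantages and fed into the clipped surrogate above. It assumes completions padded to a common length, omits attention masking and the β·KL(πθ || πref) penalty for brevity, and the function names are hypothetical.

```python
import torch

def group_advantages(u: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Normalize self-certainty scores u_1..u_G of one query's completions."""
    return (u - u.mean()) / (u.std() + eps)        # Â_i, shared by all tokens of o_i

def grpo_clipped_loss(
    logp_new: torch.Tensor,      # [G, T] token log-probs under the current policy
    logp_old: torch.Tensor,      # [G, T] token log-probs under the sampling policy
    advantages: torch.Tensor,    # [G] one advantage per completion
    eps_clip: float = 0.2,
) -> torch.Tensor:
    ratio = torch.exp(logp_new - logp_old)         # c_it
    adv = advantages.unsqueeze(-1)                 # broadcast over tokens
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - eps_clip, 1.0 + eps_clip) * adv
    # Negate because optimizers minimize; the KL penalty term is omitted here.
    return -torch.minimum(unclipped, clipped).mean()
```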
Equivalently, expanding the KL term over the vocabulary V gives:
Self-certainty(o|q) = −(1/(|o|·|V|)) ∑ᵢ ∑ⱼ log(|V| · pθ(j | q, o<i))
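A quick numerical sanity check that the two forms agree, using toy sizes and random logits (a sketch, not from the paper):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
T, V = 5, 8                                   # toy sequence length and vocab size
p = F.softmax(torch.randn(T, V), dim=-1)      # stand-in next-token distributions

# Form 1: (1/|o|) * sum_t KL(U || p_t)
kl_form = (torch.log(torch.tensor(1.0 / V)) - torch.log(p)).mean(dim=-1).mean()

# Form 2: -(1 / (|o| * |V|)) * sum_t sum_j log(|V| * p_t(j))
alt_form = -torch.log(V * p).sum() / (T * V)

assert torch.allclose(kl_form, alt_form)      # the two expressions agree
```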
Models trained with INTUITOR begin to develop pre-answer reasoning. For instance, on code generation tasks, they start by explaining their logic in natural language before outputting code — even when not prompted to. This behavior grows over time and helps the model build more understandable and robust answers. On LiveCodeBench, for example, outputs transition from raw code to natural language reasoning followed by JSON or structured responses.
When rewards are computed from a frozen model (offline self-certainty), models may exploit the reward signal by inserting already-solved answers or inflating length. However, INTUITOR avoids this by computing rewards from the evolving policy (online self-certainty), ensuring training stability and mitigating over-optimization risks.
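As a sketch of that distinction, the snippet below assumes the self_certainty helper defined earlier and a hypothetical logits_fn callable that wraps whichever model scores the completions; the only difference between the two regimes is which model logits_fn wraps.

```python
from typing import Callable, List
import torch

def rlif_rewards(
    completions: List[torch.Tensor],                     # token ids of each sampled completion
    logits_fn: Callable[[torch.Tensor], torch.Tensor],   # completion -> [T, V] logits
) -> torch.Tensor:
    """Score each completion with self-certainty under the model behind logits_fn."""
    # self_certainty: the helper defined earlier in this post.
    return torch.stack([self_certainty(logits_fn(c)) for c in completions])

# Offline: logits_fn wraps a frozen snapshot of the base model (gameable over time).
# Online (INTUITOR): logits_fn wraps the current policy, so the reward evolves with it.
```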
INTUITOR matches or outperforms GRPO on in-domain math benchmarks (GSM8K, MATH) as well as out-of-domain tasks such as LiveCodeBench and CRUXEval.
It also shows higher confidence and better separation between correct and incorrect answers, as confirmed by Mann–Whitney U tests. Additionally, ablation studies show that tuning the KL penalty is key to maintaining performance, especially on out-of-domain tasks like code generation and reasoning benchmarks.
INTUITOR enables faster early-stage learning by providing token-level continuous feedback. As early as training step 10, INTUITOR-trained models outperform GRPO across GSM8K and MATH for both Qwen2.5-1.5B and Qwen2.5-3B. This demonstrates its advantage in learnability without sacrificing final performance.
Training with INTUITOR leads to generalization from MATH to out-of-domain tasks like LiveCodeBench. Performance on LiveCodeBench continues to improve even after MATH accuracy plateaus, showing the transfer benefits of structured reasoning learned via self-certainty.
Self-certainty helps models generate more self-explanatory traces over time. On CRUXEval and LiveCodeBench, INTUITOR models first reduce invalid code, then add reasoning and structure. This behavior emerges naturally, even without explicit reward supervision for it.
INTUITOR is a step toward a future where LLMs learn to understand and refine their own thoughts. By leveraging internal feedback instead of external labels, it unlocks scalable reasoning in under-supervised domains.
That said, self-certainty is not a silver bullet. It can lead to overconfidence, saturation, or bias toward short, high-certainty answers. It’s best used alongside semantic filters or hybrid reward strategies.
INTUITOR shows us that models don’t just need to answer — they need to believe in their answers too.