Authors: Shuofei Qiao et al. (2024)
Goal: build an LLM-based agent that recognizes which kind of thinking a situation calls for, reasons with self-awareness, and improves over time through memory, reflection, and learning from its mistakes.
| Mode | Definition | Implementation |
|---|---|---|
| Fast Thinking (FT) | Intuitive, immediate response | Direct LLM generation |
| Slow Thinking (ST) | Deliberative, multi-step reasoning | Chain-of-Thought, self-consistency |
| Knowledgeable Thinking (KT) | Reasoning supported by memory | Retrieve + Reason |
A learned controller $\pi_{\text{meta}}$ decides which cognitive strategy to use given the current situation $h_t$:

$\pi_{\text{meta}}(h_t) \rightarrow \{\text{FT}, \text{ST}, \text{KT}\}$
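To make the routing concrete, here is a minimal Python sketch of the controller dispatching to the three modes. Everything in it is an illustrative assumption, not code from the paper: `generate` stands in for any LLM completion call, `pi_meta` for the learned policy, and `memory.retrieve` matches the `EpisodicMemory` sketch further below.

```python
# Minimal sketch: pi_meta routes a situation h_t to FT, ST, or KT.
# All names are illustrative assumptions, not identifiers from the paper.

def generate(prompt: str) -> str:
    """Placeholder for an LLM completion call (e.g. an API client)."""
    raise NotImplementedError

def fast_thinking(h_t: str) -> str:
    # FT: intuitive single-pass answer, no intermediate reasoning.
    return generate(h_t)

def slow_thinking(h_t: str, n_samples: int = 5) -> str:
    # ST: Chain-of-Thought plus self-consistency -- sample several
    # reasoning traces and keep the most common final line as the answer.
    prompt = h_t + "\nLet's think step by step."
    finals = [(generate(prompt).strip().splitlines() or [""])[-1]
              for _ in range(n_samples)]
    return max(set(finals), key=finals.count)

def knowledgeable_thinking(h_t: str, memory) -> str:
    # KT: retrieve relevant episodic knowledge, then reason over it
    # (see the EpisodicMemory sketch below).
    facts = memory.retrieve(h_t, k=3)
    prompt = "Relevant knowledge:\n" + "\n".join(facts) + "\n\n" + h_t
    return generate(prompt)

def pi_meta(h_t: str) -> str:
    """Learned controller: maps h_t to one of {"FT", "ST", "KT"}.
    A stand-in here -- in the paper a fine-tuned policy makes this choice."""
    raise NotImplementedError

def act(h_t: str, memory) -> str:
    mode = pi_meta(h_t)
    if mode == "FT":
        return fast_thinking(h_t)
    if mode == "ST":
        return slow_thinking(h_t)
    return knowledgeable_thinking(h_t, memory)
```

Routing per step keeps the expensive ST and KT paths off the hot path whenever a fast intuitive answer suffices.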
$\mathcal{L}_{\text{SFT}} = -\mathbb{E}_{(h_t,\, y) \sim \mathcal{D}_{\text{self}}}\left[\log \pi_\theta(y \mid h_t)\right]$

$\mathcal{L}_{\text{DPO}} = -\mathbb{E}_{(h_t,\, y,\, y') \sim \mathcal{D}_{\text{pair}}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y \mid h_t)}{\pi_{\text{ref}}(y \mid h_t)} - \beta \log \frac{\pi_\theta(y' \mid h_t)}{\pi_{\text{ref}}(y' \mid h_t)}\right)\right]$

where $y$ is the preferred and $y'$ the dispreferred response in each pair.

$\mathcal{L}_{\text{NLL}} = -\mathbb{E}_{(h_t,\, y) \sim \mathcal{D}_{\text{pair}}}\left[\frac{\log \pi_\theta(y \mid h_t)}{|y|}\right]$

$\mathcal{L}_{\text{RPO}} = \mathcal{L}_{\text{DPO}} + \alpha\, \mathcal{L}_{\text{NLL}}$
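The combined objective is simple to implement once per-response log-probabilities are available. Below is a minimal PyTorch sketch, assuming summed token log-probs for the policy and a frozen reference model have already been gathered per example; the `beta` and `alpha` defaults are illustrative, not values from the paper.

```python
import torch
import torch.nn.functional as F

def rpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps,
             chosen_lengths, beta=0.1, alpha=1.0):
    """RPO = DPO + alpha * length-normalized NLL on the chosen response.

    Each *_logps tensor holds the summed token log-probs of a full
    response, shape (B,). chosen_lengths holds |y| per example.
    """
    # DPO term: log-sigmoid of the scaled policy/reference log-ratio margin.
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    dpo = -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

    # NLL term: ordinary SFT loss on the preferred response, per-token.
    nll = -(policy_chosen_logps / chosen_lengths).mean()

    return dpo + alpha * nll
```

The NLL term anchors the policy to the preferred responses so the DPO margin cannot be improved by degrading both responses at once.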
KT mode runs a vector search over episodic memory and integrates the retrieved content into the prompt.
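A toy sketch of such an episodic store, assuming an `embed` callable that maps text to a fixed-size NumPy vector; the class, method names, and prompt template are illustrative, not the paper's implementation.

```python
import numpy as np

class EpisodicMemory:
    """Toy vector store for KT-mode retrieval over past episodes."""

    def __init__(self, embed):
        self.embed = embed          # callable: str -> np.ndarray
        self.texts, self.vecs = [], []

    def add(self, text: str) -> None:
        v = self.embed(text)
        self.vecs.append(v / np.linalg.norm(v))   # store unit vectors
        self.texts.append(text)

    def retrieve(self, query: str, k: int = 3) -> list[str]:
        q = self.embed(query)
        q = q / np.linalg.norm(q)
        sims = np.stack(self.vecs) @ q            # cosine similarity
        top = np.argsort(sims)[::-1][:k]
        return [self.texts[i] for i in top]

def build_kt_prompt(situation: str, memory: EpisodicMemory) -> str:
    # Integrate retrieved episodes into the prompt ahead of the situation.
    facts = memory.retrieve(situation)
    return "Relevant past experience:\n" + "\n".join(facts) + "\n\n" + situation
```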
| Model | ALFWorld Success | ScienceWorld Accuracy | Reflection |
|---|---|---|---|
| Base LLM | 54.1% | 63.4% | No |
| CoT + Reflection | 58.9% | 67.0% | Moderate |
| KnowSelf | 64.6% | 71.9% | Strong |
| Ablation (no KT) | −10% (Δ) | −8% (Δ) | Weak |