Authors: Shuofei Qiao et al. (2024)
Goal: build an LLM-based agent that recognizes which kind of thinking a situation calls for, reasons with self-awareness, and improves over time through memory, reflection, and learning from its mistakes.
| Mode | Definition | Implementation |
|---|---|---|
| Fast Thinking (FT) | Intuitive, immediate response | Direct LLM generation |
| Slow Thinking (ST) | Deliberative, multi-step reasoning | Chain-of-Thought, self-consistency |
| Knowledgeable Thinking (KT) | Reasoning supported by memory | Retrieve + Reason |
A learned controller $\pi_{\text{meta}}$ decides which cognitive strategy to use given the current situation $h_t$:

$\pi_{\text{meta}}(h_t) \rightarrow \{\text{FT}, \text{ST}, \text{KT}\}$
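To make the routing concrete, here is a minimal Python sketch of the controller dispatching to the three modes. Everything in it is an illustrative assumption, not code from the paper: `generate` stands in for any LLM completion call, `pi_meta` for the learned policy, and `memory.retrieve` matches the `EpisodicMemory` sketch further below.

```python
# Minimal sketch: pi_meta routes a situation h_t to FT, ST, or KT.
# All names are illustrative assumptions, not identifiers from the paper.

def generate(prompt: str) -> str:
    """Placeholder for an LLM completion call (e.g. an API client)."""
    raise NotImplementedError

def fast_thinking(h_t: str) -> str:
    # FT: intuitive single-pass answer, no intermediate reasoning.
    return generate(h_t)

def slow_thinking(h_t: str, n_samples: int = 5) -> str:
    # ST: Chain-of-Thought plus self-consistency -- sample several
    # reasoning traces and keep the most common final line as the answer.
    prompt = h_t + "\nLet's think step by step."
    finals = [(generate(prompt).strip().splitlines() or [""])[-1]
              for _ in range(n_samples)]
    return max(set(finals), key=finals.count)

def knowledgeable_thinking(h_t: str, memory) -> str:
    # KT: retrieve relevant episodic knowledge, then reason over it
    # (see the EpisodicMemory sketch below).
    facts = memory.retrieve(h_t, k=3)
    prompt = "Relevant knowledge:\n" + "\n".join(facts) + "\n\n" + h_t
    return generate(prompt)

def pi_meta(h_t: str) -> str:
    """Learned controller: maps h_t to one of {"FT", "ST", "KT"}.
    A stand-in here -- in the paper a fine-tuned policy makes this choice."""
    raise NotImplementedError

def act(h_t: str, memory) -> str:
    mode = pi_meta(h_t)
    if mode == "FT":
        return fast_thinking(h_t)
    if mode == "ST":
        return slow_thinking(h_t)
    return knowledgeable_thinking(h_t, memory)
```

Routing per step keeps the expensive ST and KT paths off the hot path whenever a fast intuitive answer suffices.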
$\mathcal{L}_{\text{SFT}} = -\mathbb{E}_{(h_t,\, y) \sim \mathcal{D}_{\text{self}}}\left[\log \pi_\theta(y \mid h_t)\right]$

$\mathcal{L}_{\text{DPO}} = -\mathbb{E}_{(h_t,\, y,\, y') \sim \mathcal{D}_{\text{pair}}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y \mid h_t)}{\pi_{\text{ref}}(y \mid h_t)} - \beta \log \frac{\pi_\theta(y' \mid h_t)}{\pi_{\text{ref}}(y' \mid h_t)}\right)\right]$

where $y$ is the preferred and $y'$ the dispreferred response in each pair.

$\mathcal{L}_{\text{NLL}} = -\mathbb{E}_{(h_t,\, y) \sim \mathcal{D}_{\text{pair}}}\left[\frac{\log \pi_\theta(y \mid h_t)}{|y|}\right]$

$\mathcal{L}_{\text{RPO}} = \mathcal{L}_{\text{DPO}} + \alpha\, \mathcal{L}_{\text{NLL}}$
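The combined objective is simple to implement once per-response log-probabilities are available. Below is a minimal PyTorch sketch, assuming summed token log-probs for the policy and a frozen reference model have already been gathered per example; the `beta` and `alpha` defaults are illustrative, not values from the paper.

```python
import torch
import torch.nn.functional as F

def rpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps,
             chosen_lengths, beta=0.1, alpha=1.0):
    """RPO = DPO + alpha * length-normalized NLL on the chosen response.

    Each *_logps tensor holds the summed token log-probs of a full
    response, shape (B,). chosen_lengths holds |y| per example.
    """
    # DPO term: log-sigmoid of the scaled policy/reference log-ratio margin.
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    dpo = -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

    # NLL term: ordinary SFT loss on the preferred response, per-token.
    nll = -(policy_chosen_logps / chosen_lengths).mean()

    return dpo + alpha * nll
```

The NLL term anchors the policy to the preferred responses so the DPO margin cannot be improved by degrading both responses at once.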
KT mode runs a vector search over episodic memory and integrates the retrieved content into the prompt.
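A toy sketch of such an episodic store, assuming an `embed` callable that maps text to a fixed-size NumPy vector; the class, method names, and prompt template are illustrative, not the paper's implementation.

```python
import numpy as np

class EpisodicMemory:
    """Toy vector store for KT-mode retrieval over past episodes."""

    def __init__(self, embed):
        self.embed = embed          # callable: str -> np.ndarray
        self.texts, self.vecs = [], []

    def add(self, text: str) -> None:
        v = self.embed(text)
        self.vecs.append(v / np.linalg.norm(v))   # store unit vectors
        self.texts.append(text)

    def retrieve(self, query: str, k: int = 3) -> list[str]:
        q = self.embed(query)
        q = q / np.linalg.norm(q)
        sims = np.stack(self.vecs) @ q            # cosine similarity
        top = np.argsort(sims)[::-1][:k]
        return [self.texts[i] for i in top]

def build_kt_prompt(situation: str, memory: EpisodicMemory) -> str:
    # Integrate retrieved episodes into the prompt ahead of the situation.
    facts = memory.retrieve(situation)
    return "Relevant past experience:\n" + "\n".join(facts) + "\n\n" + situation
```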
| Model | ALFWorld Success | ScienceWorld Accuracy | Reflection |
|---|---|---|---|
| Base LLM | 54.1% | 63.4% | No |
| CoT + Reflection | 58.9% | 67.0% | Moderate |
| KnowSelf | 64.6% | 71.9% | Strong |
| Ablation (no KT) | −10% (Δ) | −8% (Δ) | Weak |