Faithful Chain-of-Thought Reasoning (arXiv:2301.13379)¶

Motivation¶

  • Standard CoT prompting improves reasoning but lacks faithfulness: the generated chain may not reflect how the model actually arrives at its final answer.
  • Faithful CoT guarantees causal alignment between reasoning and answer by having a deterministic solver execute the symbolic part of the chain.
  • As a result, explanations are not merely plausible: they faithfully describe the actual computation that produced the answer.

Framework Overview¶

Two-Stage Pipeline¶

  1. Translation Stage:

    • The LM translates the input query $Q$ into a reasoning chain $C$ that interleaves:

      • $C_{\text{NL}}$: natural language (subquestions, rationales, dependency tags)
      • $C_{\text{SL}}$: symbolic language (Python, Datalog, or PDDL)
  2. Problem Solving Stage:

    • $C_{\text{SL}}$ is executed by an external deterministic solver (e.g., a Python interpreter, a Datalog engine, or a PDDL planner)
    • The final answer $A$ is thus guaranteed to follow from the reasoning chain; a minimal sketch of both stages follows below
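
To make the two stages concrete, here is a minimal sketch in the paper's interleaved style: NL subquestions and dependency notes as comments, Python as the SL, and the Python interpreter as the deterministic solver. The toy problem and the exec-based harness are illustrative assumptions, not the paper's exact prompt format.

```python
# Translation stage output: a reasoning chain interleaving NL comments
# (subquestions, rationales, dependencies) with executable Python.
reasoning_chain = '''
# Q: Alice has 3 boxes with 4 apples each. She gives away 5 apples.
#    How many apples does she have left?

# 1. How many apples does Alice start with? (independent; support: "3 boxes with 4 apples each")
apples_start = 3 * 4
# 2. How many apples remain? (depends on 1; support: "gives away 5 apples")
answer = apples_start - 5
'''

# Problem-solving stage: the deterministic solver (here, the Python
# interpreter) executes the chain, so the final answer A provably
# follows from the generated program rather than from free-form text.
namespace = {}
exec(reasoning_chain, namespace)
print(namespace["answer"])  # 7
```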

Taxonomy of CoT Prompting¶

| Type | Description |
| --- | --- |
| All-at-once | LM generates the reasoning chain and answer as a single block (e.g., standard CoT) |
| Ensemble-based | Generates multiple $(C, A)$ pairs and selects the answer by majority voting |
| Modularized | Breaks the query into subquestions and solves them step by step (e.g., Least-to-Most (LtM) prompting) |

Faithful CoT = modularized + executable CoT with interleaved NL and SL.


Evaluation Tasks and Datasets¶

Math Word Problems (MWP)¶

  • Datasets: GSM8K, SVAMP, MultiArith, ASDiv, AQuA
  • SL: Python
  • Answer: Numeric or string-valued expressions

Multi-hop QA¶

  • Datasets: StrategyQA, Date Understanding, Sports Understanding
  • SL: Datalog (StrategyQA) and Python (Date and Sports Understanding); a Python sketch follows below
  • Answer: Boolean or short strings
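
For the multi-hop tasks where Python is the SL (Date and Sports Understanding), a generated chain might look like the sketch below; the specific question is illustrative, not taken from the paper's prompts.

```python
# A Faithful CoT-style chain for a Date Understanding question,
# with NL subquestions as comments and Python as the SL.
from datetime import date, timedelta

# Q: Yesterday was April 30, 2021. What is today's date in MM/DD/YYYY?
# 1. What date was yesterday? (independent; support: "April 30, 2021")
yesterday = date(2021, 4, 30)
# 2. What is today's date? (depends on 1; today = yesterday + 1 day)
answer = (yesterday + timedelta(days=1)).strftime("%m/%d/%Y")
print(answer)  # 05/01/2021
```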

Planning¶

  • Dataset: SayCan
  • SL: PDDL, executed by a deterministic planner (a same-idea Python sketch follows below)
  • Answer: Action sequence plan
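
The paper hands the generated PDDL to an off-the-shelf planner; to keep one language across these notes, the sketch below captures the same idea in Python: a declaratively specified action model and goal are given to a small deterministic breadth-first planner, which returns the action-sequence plan. The household domain is an invented illustration, not the paper's SayCan setup.

```python
from collections import deque

# Action model in STRIPS style: (name, preconditions, add effects, delete effects).
# The domain (a robot fetching an apple) is illustrative.
ACTIONS = [
    ("goto_counter", {"at_start"},   {"at_counter"}, {"at_start"}),
    ("pick_apple",   {"at_counter"}, {"has_apple"},  set()),
    ("goto_user",    {"at_counter"}, {"at_user"},    {"at_counter"}),
    ("give_apple",   {"at_user", "has_apple"}, {"user_has_apple"}, {"has_apple"}),
]

def plan(init, goal):
    """Deterministic breadth-first search; returns a shortest action sequence."""
    queue, seen = deque([(frozenset(init), [])]), {frozenset(init)}
    while queue:
        state, steps = queue.popleft()
        if goal <= state:
            return steps
        for name, pre, add, delete in ACTIONS:
            if pre <= state:
                nxt = (state - delete) | add
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, steps + [name]))

print(plan({"at_start"}, {"user_has_apple"}))
# ['goto_counter', 'pick_apple', 'goto_user', 'give_apple']
```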

Relational Inference¶

  • Dataset: CLUTRR
  • SL: Python with transitivity rules over kinship relations (sketch below)
  • Answer: Family relation string
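
The CLUTRR chains derive the answer by composing pairwise kinship relations step by step. The sketch below illustrates that transitivity logic with a toy composition table; the table and query are illustrative, not the paper's rule set.

```python
# COMPOSE[(r1, r2)]: if B is A's r1 and C is B's r2, then C is A's COMPOSE[(r1, r2)].
# Only the entries this example needs are included.
COMPOSE = {
    ("father", "mother"): "grandmother",
    ("mother", "mother"): "grandmother",
    ("father", "father"): "grandfather",
    ("mother", "father"): "grandfather",
}

def infer(chain):
    """Fold a chain of relations left to right into a single relation."""
    relation = chain[0]
    for step in chain[1:]:
        relation = COMPOSE[(relation, step)]
    return relation

# Q: "Ann is the mother of Bob's father. How is Ann related to Bob?"
# Step 1: Bob -> father ("father"); Step 2: father -> mother, so
# compose("father", "mother") = "grandmother".
print(infer(["father", "mother"]))  # grandmother
```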

Performance Results (Code-davinci-002)¶

| Dataset | CoT (greedy) | Faithful CoT (greedy) | Faithful CoT (self-consistency) |
| --- | --- | --- | --- |
| GSM8K | 63.3 | 72.3 (+9.0) | 65.9 (+2.6) |
| SVAMP | 78.0 | 80.0 (+2.0) | 80.3 (+2.3) |
| MultiArith | 86.8 | 88.8 (+2.0) | 88.8 (+2.0) |
| ASDiv | 80.0 | 82.5 (+2.5) | 80.2 (+0.2) |
| AQuA | 42.1 | 47.2 (+5.1) | 50.2 (+8.1) |
| StrategyQA | 76.5 | 76.3 (−0.2) | 76.7 (+0.2) |
| Date Understanding | 40.6 | 47.2 (+6.6) | 50.9 (+10.3) |
| Sports Understanding | 98.8 | 98.8 (±0.0) | 99.0 (+0.2) |
| SayCan | 80.0 | 89.3 (+9.3) | 91.3 (+11.3) |
| CLUTRR | 42.0 | 63.3 (+21.3) | 71.9 (+29.9) |

Accuracy (%); parenthesized deltas are relative to the CoT (greedy) column.

Ablation Study¶

| Prompt variation | GSM8K | Date | SayCan | CLUTRR | Observation |
| --- | --- | --- | --- | --- | --- |
| Full | 72.3 | 47.2 | 89.3 | 63.3 | Best performance overall |
| No rationale | ~72 | ~47 | ~88 | ↓ | CLUTRR most affected |
| No NL but nudge | ↓ | ~47 | ~88 | ↑31.3 | The nudge line alone recovers much of CLUTRR |
| No NL | ↓ | ↓ | ~88 | ↓↓ | Major drop on CLUTRR |
| No solver | −50.8 | −22.9 | +2.9 | −19.4 | Solver is critical except on SayCan |

(~ ≈ roughly unchanged from Full; ↑/↓ = increase/decrease, with numbers giving the size of the change in accuracy points.)

Robustness to Exemplars¶

  • Five prompts with randomized exemplars tested
  • Accuracy range: −1.5 to +1.2 points around the original prompt
  • Standard deviation: 1.3–2.9%
  • Conclusion: the method is robust to the choice of exemplars

Human Evaluation (Plausibility of Reasoning Chains)¶

| Dataset | Fully correct reasoning chains (%) |
| --- | --- |
| MWP | 90+ |
| SayCan | 90+ |
| CLUTRR | 88.0 |
| Date | 87.9 |
| StrategyQA | 66.7 |

StrategyQA scores lower because of:

  • Its binary answer format (a flawed chain can still hit the correct answer by chance)
  • Codex's weaker fluency in Datalog than in Python
  • Occasional mistaken human judgments

Contributions¶

  1. Propose a faithful, interpretable CoT reasoning framework using NL+SL and deterministic solvers.
  2. Demonstrate consistent empirical improvements across 10 reasoning datasets.
  3. Show generalizability across domains and symbolic languages.
  4. Conduct first large-scale human evaluation on CoT reasoning chain plausibility.
  5. Show that faithfulness can improve interpretability without sacrificing performance.