Faithful Chain-of-Thought Reasoning (arXiv:2301.13379)¶
Motivation¶
- Standard CoT prompting improves reasoning but lacks faithfulness: the generated chain may not truly explain the final answer.
- Faithful CoT enforces a causal link between reasoning and answer by deterministically executing the symbolic part of the generated chain.
- This ensures explanations are not only plausible but faithful to the actual computation.
Framework Overview¶
Two-Stage Pipeline¶
Translation Stage:
The LM translates the input question $Q$ into a reasoning chain $C$ consisting of:
- $C_{\text{NL}}$: natural language (subquestions, rationales, dependency tags)
- $C_{\text{SL}}$: symbolic language (Python, Datalog, or PDDL)
Problem Solving Stage:
- $C_{\text{SL}}$ is executed by an external deterministic solver (e.g., a Python interpreter, Datalog engine, or PDDL planner)
- The final answer $A$ is therefore guaranteed to follow from the reasoning chain (a minimal sketch follows below)
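As a rough illustration of how the two stages fit together when Python is the SL, here is a minimal sketch with the Python interpreter as the deterministic solver. The `call_lm` stub, the exemplar handling, and the convention that the chain binds its result to a variable named `answer` are assumptions for illustration, not the paper's exact interface.

```python
def call_lm(prompt: str) -> str:
    """Hypothetical LM call (e.g., Codex); returns a reasoning chain C in which
    NL subquestions/rationales are comments and SL steps are Python statements."""
    raise NotImplementedError("plug in an actual LM client here")


def faithful_cot(question: str, exemplars: str) -> object:
    # Stage 1 (Translation): the LM maps question Q to a reasoning chain C = (C_NL, C_SL).
    chain = call_lm(exemplars + "\n# Q: " + question + "\n# A (in Python):\n")

    # Stage 2 (Problem Solving): a deterministic solver (here, the Python interpreter)
    # executes C_SL, so the final answer A follows causally from the chain.
    namespace: dict = {}
    exec(chain, namespace)
    return namespace["answer"]  # assumed convention: the chain ends with `answer = ...`
```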
Taxonomy of CoT Prompting¶
| Type | Description |
|---|---|
| All-at-once | LM generates the reasoning chain and answer as a single block (e.g., standard CoT) |
| Ensemble-based | Generates multiple (C, A) pairs and selects the answer by majority voting (e.g., self-consistency) |
| Modularized | Breaks the query into subquestions and solves them step by step (e.g., Least-to-Most prompting) |

Faithful CoT = modularized + executable CoT with interleaved NL and SL
Evaluation Tasks and Datasets¶
Math Word Problems (MWP)¶
- Datasets: GSM8K, SVAMP, MultiArith, ASDiv, AQuA
- SL: Python
- Answer: numeric or string-valued expressions (see the example chain sketched below)
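To make the MWP setting concrete, here is a toy example of what a generated reasoning chain might look like and how executing it yields the answer; the comment and dependency-tag format is approximated here, not copied from the paper's prompts.

```python
# Toy reasoning chain C for an MWP: NL subquestions/rationales as comments,
# SL steps as Python assignments (format approximates the paper's prompt design).
chain = '''
# Q: Leah had 32 chocolates and her sister had 42. They ate 35. How many are left?
# 1. How many chocolates did they have in total? (independent, support: "32 ... 42")
total = 32 + 42
# 2. How many chocolates are left? (depends on 1, support: "ate 35")
answer = total - 35
'''

namespace: dict = {}
exec(chain, namespace)       # the Python interpreter is the deterministic solver
print(namespace["answer"])   # -> 39
```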
Multi-hop QA¶
- Datasets: StrategyQA, Date Understanding, Sports Understanding
- SL: Datalog, Python
- Answer: Boolean or short strings (a rough Python stand-in for the Datalog case is sketched below)
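For intuition only, here is a rough Python stand-in for how a Datalog-style chain derives a Boolean answer from stated facts plus a rule; the paper actually emits Datalog and runs it with a Datalog engine, and the question, predicate names, and fact values below are illustrative assumptions.

```python
# Python stand-in for a Datalog-style chain (the paper runs real Datalog);
# the question, predicates, and fact values are illustrative only.
# Q: "Would a pear sink in water?"

facts = {
    ("density_g_per_ml", "pear"): 0.6,    # fact the LM would assert in the chain
    ("density_g_per_ml", "water"): 1.0,
}

def sinks_in_water(obj: str) -> bool:
    # Rule: X sinks in water iff density(X) > density(water).
    return facts[("density_g_per_ml", obj)] > facts[("density_g_per_ml", "water")]

answer = sinks_in_water("pear")
print(answer)  # -> False (a pear floats, so it does not sink)
```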
Planning¶
- Dataset: SayCan
- SL: PDDL
- Answer: Action sequence plan
Relational Inference¶
- Dataset: CLUTRR
- SL: Python with relation-composition (transitivity) logic (see the sketch below)
- Answer: Family relation string
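The sketch below illustrates the relation-composition (transitivity) step for CLUTRR-style questions: pairwise family relations along a chain are folded through a composition table. The table entries and helper names are illustrative; the paper's prompts supply the actual composition logic.

```python
# Illustrative relation composition for CLUTRR-style inference.
# relation_composition[(r1, r2)] = r3 means: if B is A's r1 and C is B's r2,
# then C is A's r3. Entries here are a small illustrative subset.
relation_composition = {
    ("father", "father"): "grandfather",
    ("sister", "father"): "father",       # A's sister's father is A's father
    ("mother", "son"): "brother",         # assumed: A's mother's son is A's brother
}

def infer_relation(chain: list[str]) -> str:
    # Fold the chain of pairwise relations left-to-right through the table.
    relation = chain[0]
    for next_relation in chain[1:]:
        relation = relation_composition[(relation, next_relation)]
    return relation

print(infer_relation(["father", "father"]))   # -> "grandfather"
print(infer_relation(["sister", "father"]))   # -> "father"
```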
Performance Results (Code-davinci-002)¶
| Dataset | CoT (Greedy) | Faithful CoT (Greedy) | Faithful CoT (Self-Consistent) |
|---|---|---|---|
| GSM8K | 63.3 | 72.3 (+9.0) | 65.9 |
| SVAMP | 78.0 | 80.0 (+2.0) | 80.3 |
| MultiArith | 86.8 | 88.8 (+2.0) | 88.8 |
| ASDiv | 80.0 | 82.5 (+2.5) | 80.2 |
| AQuA | 42.1 | 47.2 (+12.1) | 50.2 (+18.1) |
| StrategyQA | 76.5 | 76.3 | 76.7 |
| Date Understanding | 40.6 | 47.2 (+6.5) | 50.9 (+10.8) |
| Sports Understanding | 98.8 | 98.8 | 99.0 |
| SayCan | 80.0 | 89.3 (+9.3) | 91.3 (+11.3) |
| CLUTRR | 42.0 | 63.3 (+21.3) | 71.9 (+41.3) |
Ablation Study¶
| Prompt Variation | GSM8K | Date | SayCan | CLUTRR | Observation |
|---|---|---|---|---|---|
| Full | 72.3 | 47.2 | 89.3 | 63.3 | Best performance overall |
| No rationale | ~72 | ~47 | ~88 | ↓ | CLUTRR most affected |
| No NL but nudge | ↓ | ~47 | ~88 | ↑31.3 | Nudge line alone helps |
| No NL | ↓ | ↓ | ~88 | ↓↓ | Major drop in CLUTRR |
| No solver | ↓50.8 | ↓22.9 | ↑2.9 | ↓19.4 | Solver critical except in SayCan |
Robustness to Exemplars¶
- 5 prompts with randomized exemplars tested
- Accuracy range: −1.5 to +1.2 points relative to the original prompt
- Standard deviation: 1.3–2.9%
- Conclusion: the method is robust to exemplar choice
Human Evaluation (Plausibility of Reasoning Chains)¶
| Dataset | Fully Correct Reasoning Chain (%) |
|---|---|
| MWP | 90%+ |
| SayCan | 90%+ |
| CLUTRR | 88.0% |
| Date | 87.9% |
| StrategyQA | 66.7% |
StrategyQA suffers due to:
- Binary answer format, so chance-level guessing can yield correct answers despite flawed chains
- Codex's weaker fluency in Datalog than in Python
- Occasional annotator judgment errors
Contributions¶
- Propose a faithful, interpretable CoT reasoning framework using NL+SL and deterministic solvers.
- Demonstrate consistent empirical improvements across 10 reasoning datasets.
- Show generalizability across domains and symbolic languages.
- Conduct first large-scale human evaluation on CoT reasoning chain plausibility.
- Show that faithfulness improves interpretability without sacrificing, and often while improving, performance.