Faithful Chain-of-Thought Reasoning (arXiv:2301.13379)¶
Motivation¶
- Standard CoT prompting improves reasoning but lacks faithfulness: the generated chain may not truly explain the final answer.
- Faithful CoT enforces causal alignment between reasoning and answer through deterministic execution of symbolic logic.
- This ensures explanations are not only plausible but faithful to the actual computation.
 
Framework Overview¶
Two-Stage Pipeline¶
Translation Stage:
The LM translates the input question $Q$ into a reasoning chain $C$ that interleaves:
- $C_{NL}$: natural language (subquestions, rationales, dependency tags)
- $C_{SL}$: symbolic language (Python, Datalog, PDDL)
 
Problem Solving Stage:
- $C_{SL}$ is executed by an external deterministic solver (e.g., Python interpreter, Datalog engine)
- Final answer $A$ is guaranteed to follow from the reasoning chain (see the sketch after this list)
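A minimal sketch of the two stages on a toy math word problem, assuming Python as the symbolic language; the question, variable names, and chain below are illustrative, not taken from the paper:

```python
# Translation stage output: the LM generates C as interleaved
# C_NL (comments: subquestions + dependency tags) and C_SL (Python).
reasoning_chain = '''
# Q: Alice has 3 boxes of 12 pencils and gives away 7. How many remain?

# 1. How many pencils does Alice start with? (independent)
n_start = 3 * 12
# 2. How many pencils remain after giving 7 away? (depends on 1)
answer = n_start - 7
'''

# Problem Solving stage: a deterministic external solver (here, the
# Python interpreter) executes C_SL, so the answer provably follows
# from the chain rather than from free-form generation.
namespace: dict = {}
exec(reasoning_chain, namespace)
print(namespace["answer"])  # 29
```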
 
Taxonomy of CoT Prompting¶
| Type | Description | 
|---|---|
| All-at-once | LM generates reasoning and answer as a single block (e.g., Standard CoT) | 
| Ensemble-based | Generates multiple (C, A) pairs, selects answer via majority voting | 
| Modularized | Breaks the query into subquestions and solves them step by step (e.g., Least-to-Most prompting) | 
Faithful CoT = Modularized + Executable CoT with interleaved NL and SL
Evaluation Tasks and Datasets¶
Math Word Problems (MWP)¶
- Datasets: GSM8K, SVAMP, MultiArith, ASDiv, AQuA
- SL: Python
- Answer: Numeric or string-valued expressions
 
Multi-hop QA¶
- Datasets: StrategyQA, Date Understanding, Sports Understanding
- SL: Datalog, Python (see the Python sketch below)
- Answer: Boolean or short strings
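For a Date Understanding style question, the chain can be plain Python over `datetime`; this toy chain (question and variable names invented for illustration, not from the paper's prompts) mirrors the format:

```python
from datetime import date, timedelta

# Hypothetical chain for: "Yesterday was April 30, 2021.
# What is the date 10 days ago (MM/DD/YYYY)?"

# 1. What is today's date? (follows from "yesterday was April 30, 2021")
today = date(2021, 4, 30) + timedelta(days=1)
# 2. What was the date 10 days before today? (depends on 1)
answer = today - timedelta(days=10)
print(answer.strftime("%m/%d/%Y"))  # 04/21/2021
```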
 
Planning¶
- Dataset: SayCan
- SL: PDDL
- Answer: Action sequence plan
 
Relational Inference¶
- Dataset: CLUTRR
- SL: Python + transitivity logic (sketched below)
- Answer: Family relation string
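A minimal sketch of that transitivity logic; the composition table, names, and example story are invented for illustration and are not the paper's actual rules:

```python
# Hypothetical composition table: COMPOSE[(r1, r2)] is A's relation to C
# when A is the r1 of B and B is the r2 of C.
COMPOSE = {
    ("daughter", "daughter"): "granddaughter",
    ("daughter", "son"): "granddaughter",
    ("son", "brother"): "nephew",
}

# Chain for: "Ann is Beth's daughter. Beth is Carol's daughter.
# How is Ann related to Carol?"
# 1. Ann's relation to Beth -> daughter
# 2. Beth's relation to Carol -> daughter
# 3. Compose steps 1 and 2 via transitivity.
answer = COMPOSE[("daughter", "daughter")]
print(answer)  # granddaughter
```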
 
Performance Results (code-davinci-002)¶
| Dataset | CoT (greedy) | Faithful CoT (greedy) | Faithful CoT (self-consistency) | 
|---|---|---|---|
| GSM8K | 63.3 | 72.3 (+9.0) | 65.9 | 
| SVAMP | 78.0 | 80.0 (+2.0) | 80.3 | 
| MultiArith | 86.8 | 88.8 (+2.0) | 88.8 | 
| ASDiv | 80.0 | 82.5 (+2.5) | 80.2 | 
| AQuA | 42.1 | 47.2 (+5.1) | 50.2 (+8.1) | 
| StrategyQA | 76.5 | 76.3 | 76.7 | 
| Date Understanding | 40.6 | 47.2 (+6.6) | 50.9 (+10.3) | 
| Sports Understanding | 98.8 | 98.8 | 99.0 | 
| SayCan | 80.0 | 89.3 (+9.3) | 91.3 (+11.3) | 
| CLUTRR | 42.0 | 63.3 (+21.3) | 71.9 (+29.9) | 
Deltas in parentheses are relative to the greedy CoT column.
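The self-consistency column samples multiple chains, executes each, and majority-votes over the executed answers. A minimal sketch, with a toy `sample_chain` standing in for LM sampling (everything here is illustrative, not the paper's implementation):

```python
import random
from collections import Counter

# Stand-in for temperature > 0 LM sampling: each call would normally
# return a freshly sampled NL+Python chain. One faulty chain included.
TOY_CHAINS = [
    "answer = 3 * 12 - 7",
    "answer = 3 * 12 - 7",
    "answer = 3 * 12 + 7",  # a buggy chain the vote should outweigh
]

def sample_chain(question: str) -> str:
    return random.choice(TOY_CHAINS)

def self_consistent_answer(question: str, k: int = 9):
    answers = []
    for _ in range(k):
        namespace: dict = {}
        try:
            exec(sample_chain(question), namespace)  # deterministic execution
            answers.append(namespace["answer"])
        except Exception:
            continue  # discard chains that fail to execute
    return Counter(answers).most_common(1)[0][0]  # majority vote

print(self_consistent_answer("How many pencils remain?"))  # usually 29
```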
Ablation Study¶
| Prompt Variation | GSM8K | Date | SayCan | CLUTRR | Observation | 
|---|---|---|---|---|---|
| Full | 72.3 | 47.2 | 89.3 | 63.3 | Best performance overall | 
| No rationale | ~72 | ~47 | ~88 | ↓ | CLUTRR most affected | 
| No NL but nudge | ↓ | ~47 | ~88 | ↑31.3 | Nudge line alone helps | 
| No NL | ↓ | ↓ | ~88 | ↓↓ | Major drop in CLUTRR | 
| No solver | ↓50.8 | ↓22.9 | ↑2.9 | ↓19.4 | Solver critical except on SayCan | 
↑/↓ mark accuracy changes in points relative to the full prompt; ~ marks roughly unchanged.
Robustness to Exemplars¶
- 5 randomized prompts tested
- Accuracy range: −1.5 to +1.2 around original
- Standard deviation: 1.3–2.9%
- Conclusion: method is robust to exemplar choice
 
Human Evaluation (Plausibility of Reasoning Chains)¶
| Dataset | Fully Correct Reasoning Chain (%) | 
|---|---|
| MWP | 90%+ | 
| SayCan | 90%+ | 
| CLUTRR | 88.0% | 
| Date | 87.9% | 
| StrategyQA | 66.7% | 
StrategyQA suffers due to:
- Binary yes/no format, so incorrect chains can still reach the correct answer by chance
- Codex's weak fluency in Datalog
- Occasional mistaken human judgments
 
Contributions¶
- Propose a faithful, interpretable CoT reasoning framework using NL+SL and deterministic solvers.
- Demonstrate consistent empirical improvements across 10 reasoning datasets.
- Show generalizability across domains and symbolic languages.
- Conduct the first large-scale human evaluation of CoT reasoning-chain plausibility.
- Show that faithfulness improves interpretability without sacrificing performance.