Faithful Chain-of-Thought Reasoning (arXiv:2301.13379)¶
Motivation¶
- Standard CoT prompting improves reasoning but lacks faithfulness: the generated chain may not truly explain the final answer.
- Faithful CoT enforces a causal link between reasoning and answer by deterministically executing the symbolic part of the generated chain.
- This ensures explanations are not only plausible but faithful to the actual computation.
Framework Overview¶
Two-Stage Pipeline¶
Translation Stage:
The LM translates the input question $Q$ into a reasoning chain $C$ consisting of:
- $C_{\text{NL}}$: natural language (subquestions, rationales, dependency tags)
- $C_{\text{SL}}$: symbolic language (Python, Datalog, or PDDL)
Problem Solving Stage:
- $C_{\text{SL}}$ is executed by an external deterministic solver (e.g., a Python interpreter, Datalog engine, or PDDL planner)
- The final answer $A$ is therefore guaranteed to follow from the reasoning chain (a minimal sketch follows below)
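As a rough illustration of how the two stages fit together when Python is the SL, here is a minimal sketch with the Python interpreter as the deterministic solver. The `call_lm` stub, the exemplar handling, and the convention that the chain binds its result to a variable named `answer` are assumptions for illustration, not the paper's exact interface.

```python
def call_lm(prompt: str) -> str:
    """Hypothetical LM call (e.g., Codex); returns a reasoning chain C in which
    NL subquestions/rationales are comments and SL steps are Python statements."""
    raise NotImplementedError("plug in an actual LM client here")


def faithful_cot(question: str, exemplars: str) -> object:
    # Stage 1 (Translation): the LM maps question Q to a reasoning chain C = (C_NL, C_SL).
    chain = call_lm(exemplars + "\n# Q: " + question + "\n# A (in Python):\n")

    # Stage 2 (Problem Solving): a deterministic solver (here, the Python interpreter)
    # executes C_SL, so the final answer A follows causally from the chain.
    namespace: dict = {}
    exec(chain, namespace)
    return namespace["answer"]  # assumed convention: the chain ends with `answer = ...`
```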
Taxonomy of CoT Prompting¶
| Type | Description |
|---|---|
| All-at-once | LM generates the reasoning chain and answer as a single block (e.g., standard CoT) |
| Ensemble-based | Generates multiple (C, A) pairs and selects the answer by majority voting (e.g., self-consistency) |
| Modularized | Breaks the query into subquestions and solves them step by step (e.g., Least-to-Most prompting) |

Faithful CoT = modularized + executable CoT with interleaved NL and SL
Evaluation Tasks and Datasets¶
Math Word Problems (MWP)¶
- Datasets: GSM8K, SVAMP, MultiArith, ASDiv, AQuA
- SL: Python
- Answer: numeric or string-valued expressions (see the example chain sketched below)
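To make the MWP setting concrete, here is a toy example of what a generated reasoning chain might look like and how executing it yields the answer; the comment and dependency-tag format is approximated here, not copied from the paper's prompts.

```python
# Toy reasoning chain C for an MWP: NL subquestions/rationales as comments,
# SL steps as Python assignments (format approximates the paper's prompt design).
chain = '''
# Q: Leah had 32 chocolates and her sister had 42. They ate 35. How many are left?
# 1. How many chocolates did they have in total? (independent, support: "32 ... 42")
total = 32 + 42
# 2. How many chocolates are left? (depends on 1, support: "ate 35")
answer = total - 35
'''

namespace: dict = {}
exec(chain, namespace)       # the Python interpreter is the deterministic solver
print(namespace["answer"])   # -> 39
```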
Multi-hop QA¶
- Datasets: StrategyQA, Date Understanding, Sports Understanding
- SL: Datalog, Python
- Answer: Boolean or short strings (a rough Python stand-in for the Datalog case is sketched below)
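For intuition only, here is a rough Python stand-in for how a Datalog-style chain derives a Boolean answer from stated facts plus a rule; the paper actually emits Datalog and runs it with a Datalog engine, and the question, predicate names, and fact values below are illustrative assumptions.

```python
# Python stand-in for a Datalog-style chain (the paper runs real Datalog);
# the question, predicates, and fact values are illustrative only.
# Q: "Would a pear sink in water?"

facts = {
    ("density_g_per_ml", "pear"): 0.6,    # fact the LM would assert in the chain
    ("density_g_per_ml", "water"): 1.0,
}

def sinks_in_water(obj: str) -> bool:
    # Rule: X sinks in water iff density(X) > density(water).
    return facts[("density_g_per_ml", obj)] > facts[("density_g_per_ml", "water")]

answer = sinks_in_water("pear")
print(answer)  # -> False (a pear floats, so it does not sink)
```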
Planning¶
- Dataset: SayCan
- SL: PDDL
- Answer: Action sequence plan
Relational Inference¶
- Dataset: CLUTRR
- SL: Python with relation-composition (transitivity) logic (see the sketch below)
- Answer: Family relation string
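The sketch below illustrates the relation-composition (transitivity) step for CLUTRR-style questions: pairwise family relations along a chain are folded through a composition table. The table entries and helper names are illustrative; the paper's prompts supply the actual composition logic.

```python
# Illustrative relation composition for CLUTRR-style inference.
# relation_composition[(r1, r2)] = r3 means: if B is A's r1 and C is B's r2,
# then C is A's r3. Entries here are a small illustrative subset.
relation_composition = {
    ("father", "father"): "grandfather",
    ("sister", "father"): "father",       # A's sister's father is A's father
    ("mother", "son"): "brother",         # assumed: A's mother's son is A's brother
}

def infer_relation(chain: list[str]) -> str:
    # Fold the chain of pairwise relations left-to-right through the table.
    relation = chain[0]
    for next_relation in chain[1:]:
        relation = relation_composition[(relation, next_relation)]
    return relation

print(infer_relation(["father", "father"]))   # -> "grandfather"
print(infer_relation(["sister", "father"]))   # -> "father"
```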
Performance Results (Code-davinci-002)¶
| Dataset | CoT (Greedy) | Faithful CoT (Greedy) | Faithful CoT (Self-Consistent) |
|---|---|---|---|
| GSM8K | 63.3 | 72.3 (+9.0) | 65.9 |
| SVAMP | 78.0 | 80.0 (+2.0) | 80.3 |
| MultiArith | 86.8 | 88.8 (+2.0) | 88.8 |
| ASDiv | 80.0 | 82.5 (+2.5) | 80.2 |
| AQuA | 42.1 | 47.2 (+12.1) | 50.2 (+18.1) |
| StrategyQA | 76.5 | 76.3 | 76.7 |
| Date Understanding | 40.6 | 47.2 (+6.5) | 50.9 (+10.8) |
| Sports Understanding | 98.8 | 98.8 | 99.0 |
| SayCan | 80.0 | 89.3 (+9.3) | 91.3 (+11.3) |
| CLUTRR | 42.0 | 63.3 (+21.3) | 71.9 (+41.3) |
Ablation Study¶
| Prompt Variation | GSM8K | Date | SayCan | CLUTRR | Observation |
|---|---|---|---|---|---|
| Full | 72.3 | 47.2 | 89.3 | 63.3 | Best performance overall |
| No rationale | ~72 | ~47 | ~88 | ↓ | CLUTRR most affected |
| No NL but nudge | ↓ | ~47 | ~88 | ↑31.3 | Nudge line alone helps |
| No NL | ↓ | ↓ | ~88 | ↓↓ | Major drop in CLUTRR |
| No solver | ↓50.8 | ↓22.9 | ↑2.9 | ↓19.4 | Solver critical except in SayCan |
Robustness to Exemplars¶
- 5 prompts with randomized exemplars tested
- Accuracy range: −1.5 to +1.2 points relative to the original prompt
- Standard deviation: 1.3–2.9%
- Conclusion: the method is robust to exemplar choice
Human Evaluation (Plausibility of Reasoning Chains)¶
| Dataset | Fully Correct Reasoning Chain (%) |
|---|---|
| MWP | 90%+ |
| SayCan | 90%+ |
| CLUTRR | 88.0% |
| Date | 87.9% |
| StrategyQA | 66.7% |
StrategyQA suffers due to:
- Binary answer format, so chance-level guessing can yield correct answers despite flawed chains
- Codex's weaker fluency in Datalog than in Python
- Occasional annotator judgment errors
Contributions¶
- Propose a faithful, interpretable CoT reasoning framework using NL+SL and deterministic solvers.
- Demonstrate consistent empirical improvements across 10 reasoning datasets.
- Show generalizability across domains and symbolic languages.
- Conduct first large-scale human evaluation on CoT reasoning chain plausibility.
- Show that faithfulness improves interpretability without sacrificing, and often while improving, performance.