Why Do Multi-Agent LLM Systems Fail? Lessons from MAST and How They Show Up in Finance
Multi-agent systems built on large language models (LLMs) are evolving fast. They are showing up in productivity tools, autonomous simulations, and increasingly, in financial applications from portfolio construction to trade execution.
But anyone who has built or evaluated these systems knows the truth: they fail—and not just occasionally, but in ways that are subtle, unpredictable, and frustrating.
Recently, I came across a paper that finally gave structure to this chaos: “Why Do Multi-Agent LLM Systems Fail?” The authors introduce MAST, the Multi-Agent System Failure Taxonomy—a detailed framework that breaks down why multi-agent systems go wrong, not just how.
What Is MAST?
MAST is the first empirically grounded taxonomy designed to classify failure modes in multi-agent LLM systems. It is based on an analysis of seven frameworks across more than 200 tasks, annotated by experts with high agreement (Cohen’s κ = 0.88). To scale the analysis, the authors developed an LLM-as-a-Judge evaluator that detects these failures automatically.
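To make that concrete, here is a minimal sketch of how such a judge can be wired up. The prompt wording, the `judge_trace` helper, and the `call_llm` wrapper are my own placeholders, not the authors' implementation.

```python
import json

# The three top-level MAST categories, used as labels for the judge.
MAST_CATEGORIES = [
    "Specification Issues",
    "Inter-Agent Misalignment",
    "Task Verification Failures",
]

JUDGE_PROMPT = """You are auditing a trace from a multi-agent LLM system.
For each failure you find, label it with one of these categories: {categories}.
Return a JSON list of objects with "category" and "evidence" fields.

Trace:
{trace}
"""

def judge_trace(trace: str, call_llm) -> list[dict]:
    """Ask an LLM to annotate a conversation trace with MAST failure categories."""
    prompt = JUDGE_PROMPT.format(categories=", ".join(MAST_CATEGORIES), trace=trace)
    raw = call_llm(prompt)   # plug in any chat-completion wrapper you already use
    return json.loads(raw)   # expects the model to return valid JSON
```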
MAST groups these failures into three categories, comprising 14 specific failure modes:
- Specification Issues
- Inter-Agent Misalignment
- Task Verification Failures
MAST Failure Categories With Financial Agent Examples
1. Specification Issues (System Design)
- Disobey Task Specification (10.98%)
Example: A risk control agent is supposed to maintain portfolio volatility below 5 percent but instead it just checks total returns. Cause: The task specification was too ambiguous.
- Disobey Role Specification (0.50%)
Example: A research agent designed to output signals begins executing trades. Cause: Role boundaries were not clearly enforced.
- Step Repetition (17.14%)
Example: The agent re-runs the same backtest multiple times, thinking each is novel. Cause: It forgets previous steps due to stateless design.
- Loss of Conversation History (3.33%)
Example: The execution agent forgets a critical warning from the risk monitor issued two turns ago. Cause: Context window limitations or no memory persistence.
- Unaware of Termination Conditions (0.50%)
Example: An optimization loop keeps tuning even after the target Sharpe ratio or budget has been reached. Cause: Termination logic was never built in (a minimal loop with explicit stop criteria is sketched after this list).
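Here is a minimal sketch of what built-in termination logic can look like for a tuning loop. The threshold values and the `tune_once` callable are illustrative placeholders, not anything prescribed by MAST.

```python
# Explicit termination logic for a strategy-tuning loop (illustrative thresholds).
TARGET_SHARPE = 1.5      # stop once the target risk-adjusted return is reached
MAX_ITERATIONS = 50      # hard budget so the loop cannot run forever

def tune_strategy(tune_once):
    """`tune_once` is any callable returning (params, sharpe) for one trial."""
    best = None
    for i in range(MAX_ITERATIONS):
        params, sharpe = tune_once(i)
        if best is None or sharpe > best[1]:
            best = (params, sharpe)
        if sharpe >= TARGET_SHARPE:   # the termination condition the agent must check
            break
    return best
```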
2. Inter-Agent Misalignment (Agent Coordination)
- Conversation Reset (9.82%)
Example: An agent drops the shared portfolio state during a thread reset. Cause: No persistent context tracking between agents.
- Fail to Ask for Clarification (2.33%)
Example: An alpha model sees conflicting signals but makes assumptions instead of checking with peers. Cause: Lacks clarification behavior or fallbacks.
- Task Derailment (11.65%)
Example: A research agent tasked with ranking assets instead starts summarizing financial news. Cause: Goal drift due to open-ended prompting.
- Information Withholding (7.15%)
Example: A macro agent learns about CPI shocks but does not inform the execution agent. Cause: No enforced communication channels or triggers.
- Ignored Other Agent’s Input (1.66%)
Example: The rebalancer ignores the risk alert from a volatility watchdog. Cause: Agents do not cross-validate each other’s outputs.
- Reasoning-Action Mismatch (13.98%)
Example: An agent says TSLA is overpriced, then proceeds to recommend buying it. Cause: Reasoning and action generation are out of sync (see the consistency-check sketch after this list).
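A cheap guard against this failure mode is to check that the stated rationale and the proposed action point in the same direction before anything executes. The keyword lists below are illustrative placeholders, not a production signal parser.

```python
# Naive reasoning-action consistency guard (keyword lists are illustrative only).
BEARISH_CUES = ("overpriced", "overvalued", "sell", "downside")
BULLISH_CUES = ("underpriced", "undervalued", "buy", "upside")

def is_consistent(rationale: str, action: str) -> bool:
    """Reject a BUY backed by bearish reasoning, and a SELL backed by bullish reasoning."""
    text = rationale.lower()
    if action.upper() == "BUY" and any(cue in text for cue in BEARISH_CUES):
        return False
    if action.upper() == "SELL" and any(cue in text for cue in BULLISH_CUES):
        return False
    return True

# The TSLA case above would be blocked before reaching execution.
assert not is_consistent("TSLA looks overpriced at current levels", "BUY")
```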
3. Task Verification Failures (Quality Control)
- Premature Termination (0.17%)
Example: The entire agent pipeline halts after data cleaning—no signals, no evaluation. Cause: False signal of task completion.
- No or Incomplete Verification (6.82%)
Example: The strategy gets deployed without checking backtest results. Cause: No human-in-the-loop review or final checklist (a minimal checklist gate is sketched after this list).
- Incorrect Verification (7.82%)
Example: An LLM agent marks a failed trade plan as successful due to a wrong PnL calculation. Cause: Incorrect metric interpretation.
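One lightweight way to close this gap is a hard checklist the strategy must pass before deployment. The metric names and thresholds below are placeholders for whatever your backtest actually reports.

```python
# Pre-deployment verification gate (metric names and thresholds are placeholders).
def verify_backtest(results: dict) -> list[str]:
    """Return a list of failed checks; deploy only if the list is empty."""
    failures = []
    if results.get("sharpe", 0.0) < 1.0:
        failures.append("Sharpe ratio below 1.0")
    if results.get("max_drawdown", 1.0) > 0.20:
        failures.append("Max drawdown above 20%")
    if results.get("total_pnl", 0.0) <= 0.0:
        failures.append("Backtest PnL is not positive")
    return failures

issues = verify_backtest({"sharpe": 0.8, "max_drawdown": 0.25, "total_pnl": -1200})
if issues:
    print("Blocked deployment:", issues)   # escalate to a human reviewer instead
```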
Why MAST Matters in Financial Multi-Agent Systems
In finance, failure is not just an inconvenience—it is a risk. Poor coordination between a signal generator and execution agent can lead to real losses. The lack of verification can mean faulty strategies go live. MAST gives us:
- Systematic debugging tools
- Clear failure taxonomies to refine design
- Agent-level evaluation mechanisms
- Shared language for cross-functional teams
Going Forward: What I Am Applying
Inspired by MAST, here is what I am starting to do in my own financial multi-agent workflows:
- Add role validation at every step (a minimal sketch follows this list)
- Use summarization agents to retain cross-agent memory
- Build peer review agents to validate each output
- Adopt the LLM-as-a-Judge method for automated evaluation
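For the first item, role validation can be as simple as checking every agent's output type against a schema for its declared role before it is passed downstream. The roles and allowed output types below are assumptions about my own pipeline, not part of MAST.

```python
# Role validation sketch: each role may only emit certain output types (assumed roles).
ALLOWED_OUTPUTS = {
    "research": {"signal", "ranking"},     # research agents emit signals, never orders
    "risk": {"risk_report", "alert"},
    "execution": {"order"},
}

def validate_role_output(role: str, output_type: str) -> None:
    """Raise before a mis-routed output (e.g. a research agent placing orders) propagates."""
    allowed = ALLOWED_OUTPUTS.get(role, set())
    if output_type not in allowed:
        raise ValueError(f"Role '{role}' is not allowed to emit '{output_type}'")

validate_role_output("research", "signal")   # passes
# validate_role_output("research", "order")  # would raise, catching a role violation
```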
Final Thoughts
MAST does not just explain why LLM agents fail—it gives us tools to build smarter. If you are designing LLM agents for high-stakes applications like trading, portfolio management, or risk analysis, this taxonomy is a must-know. It helped me turn chaos into structure, and that is the first step toward reliable, scalable AI systems.