The Token Tax of Multi-Agent Systems (And What Fixing It Is Worth)

When a multi-agent system gets expensive, the reflex is to reach for a bigger model or a bigger context window. But in a fleet, the tokens that dominate the bill aren't spent on reasoning. They're spent on repetition.

The bill is repetition, not reasoning

A single agent that re-reads its whole history every turn is already wasteful. A fleet of agents is the same mistake, multiplied. Four habits drive the waste:

  • Re-sent history: each agent carries its own transcript to "remember," and pays for the early turns again at every later one.
  • Full context, passed around: hand-offs forward the entire conversation instead of the relevant slice.
  • The same fact, N ways: shared specs, conventions and decisions are copied into every agent's window instead of stored once.
  • Re-derivation across agents: one agent solves a problem; a sibling hits the same wall later and solves it again.

The shape of the bill is agents × history × redundancy: three multipliers, none of which is capability. A bigger window doesn't fix this: cost still rises with every token sent, latency climbs, and models tend to lose the thread in the middle of long contexts.
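
To make the multiplication concrete, here is a minimal sketch of how re-sent history compounds. The function names, the fixed tokens-per-turn figure and the redundancy factor are illustrative assumptions, not measurements from any real system.

```python
def resend_history_tokens(turns: int, tokens_per_turn: int) -> int:
    """Input tokens for one agent that re-reads its whole transcript every turn.

    Turn t reads t turns' worth of tokens, so the total grows quadratically.
    """
    return sum(t * tokens_per_turn for t in range(1, turns + 1))


def incremental_tokens(turns: int, tokens_per_turn: int) -> int:
    """Input tokens if each turn's content is only ever read once."""
    return turns * tokens_per_turn


def fleet_tokens(agents: int, turns: int, tokens_per_turn: int,
                 redundancy: float = 1.0) -> int:
    """agents x history x redundancy, with redundancy as an assumed
    multiplier for facts copied into several agents' windows."""
    return int(agents * redundancy * resend_history_tokens(turns, tokens_per_turn))


# Hypothetical workload: 20 turns of 500 tokens each.
print(resend_history_tokens(20, 500))   # 105000
print(incremental_tokens(20, 500))      # 10000
print(fleet_tokens(8, 20, 500, 1.5))    # 1260000
```

Under these assumed numbers, a single agent already reads about ten times more than it needs to, and the fleet multiplies that figure again without adding any reasoning.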

What fixing it is worth

So we measured it. On MA-MemBench, our benchmark for the communication cost of a fleet, we put SuperLazy head-to-head against a naive fleet (every agent re-ships everything) and against mem0, a popular memory layer, all running the same model under identical conditions. The result: you don't trade accuracy for efficiency. You get both.

  • 95% accuracy, the highest of the three.
  • ~10× fewer tokens for the lead agent to read than with mem0 (about 900 per query, versus 8,600 for mem0 and 24,000 for a naive fleet).
  • ~8× lower cost per query than mem0.
  • Zero ingest overhead, whereas mem0 burned roughly 1.5 million tokens before a single question was asked.

And the advantage widens with scale. As the fleet grows from 8 to 64 agents, the naive approach climbs to $2.08 per query while SuperLazy stays flat near $0.012: a gap of more than 100×, exactly when your system is scaling.
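
The ratios quoted in this section can be checked with simple arithmetic. The sketch below uses only figures stated in this post; the variable names are ours, for illustration.

```python
# Tokens the lead agent reads per query, as reported above.
tokens_per_query = {"SuperLazy": 900, "mem0": 8_600, "naive fleet": 24_000}

# Cost per query at 64 agents, as reported above (USD).
cost_at_64_agents = {"SuperLazy": 0.012, "naive fleet": 2.08}


def times_more(baseline: float, candidate: float) -> float:
    """How many times larger the baseline is than the candidate."""
    return baseline / candidate


mem0_token_ratio = times_more(tokens_per_query["mem0"], tokens_per_query["SuperLazy"])
naive_token_ratio = times_more(tokens_per_query["naive fleet"], tokens_per_query["SuperLazy"])
naive_cost_ratio = times_more(cost_at_64_agents["naive fleet"], cost_at_64_agents["SuperLazy"])

print(round(mem0_token_ratio, 1))   # 9.6, i.e. the "~10x" figure
print(round(naive_token_ratio, 1))  # 26.7
print(round(naive_cost_ratio))      # 173, i.e. "over 100x"

# The "~8x lower cost than mem0" figure cannot be re-derived here,
# because mem0's per-query cost is not quoted in this post.
```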

Validated beyond our own benchmark

The same thesis holds on the field's open long-memory benchmark, LongMemEval, where the answer is buried inside roughly 100,000 tokens of history. SuperLazy surfaces the handful of moments that hold the answer, at 95.3% Recall@15 and 92.5% Recall@10, while sending about 18× less context than feeding the full history.
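
For readers unfamiliar with the metric, here is a minimal sketch of one common definition of Recall@k: the fraction of relevant items that appear among the top k retrieved results. This is an assumption for illustration; the exact variant used in the LongMemEval evaluation may differ, and the session IDs below are hypothetical.

```python
def recall_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of relevant items found in the first k ranked results."""
    if not relevant_ids:
        raise ValueError("relevant_ids must not be empty")
    top_k = set(ranked_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)


# Hypothetical retrieval result: two sessions hold the answer.
ranked = ["s7", "s2", "s9", "s4", "s1"]
relevant = {"s2", "s1"}

print(recall_at_k(ranked, relevant, 3))  # 0.5
print(recall_at_k(ranked, relevant, 5))  # 1.0
```

A larger k can only keep recall the same or raise it, which is consistent with the Recall@15 figure above being higher than the Recall@10 figure.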

The takeaway is simple: the cheapest token is the one you never re-send.

Read more on the SuperLazy Research page, or see why we built MA-MemBench.

Frequently asked questions

Why do multi-agent systems get so expensive?
Mostly repetition, not reasoning: agents re-send history, copy the same facts into every window, and re-derive what siblings already figured out. Cost scales with agents, history and redundancy together.

Does cutting tokens hurt accuracy?
No. On MA-MemBench, SuperLazy was both the cheapest and the most accurate at 95%, because it surfaces the exact current fact instead of flooding the prompt.